AmineDiro

ML engineer. Telecom - ENSAE. All things computer science

AmineDiro / perfetto_torch_profile_comm.sql

Created May 26, 2026 10:06

overlap between compute/comm

AmineDiro / repro_sonic_moe_backward_sentinel.py

Created April 30, 2026 07:49

	"""Minimal self-contained repro scaffold for the sonic-moe sentinel/backward bug.

	Dependencies: torch, kernels, nvidia-cutlass-dsl. NO transformers, NO distributed.

	Background
	----------
	The kernel documents `expert_ids >= E` as a supported sentinel value used by EP
	to mark non-local routing slots:

	functional/triton_kernels/__init__.py:174-177
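The sentinel convention described above can be illustrated with a small pure-Python sketch. It is not the sonic-moe kernel code: `split_local_slots` and its arguments are hypothetical names, and the only assumption taken from the docstring is that any `expert_ids` value `>= E` marks a non-local routing slot.

```python
def split_local_slots(expert_ids, num_experts):
    """Partition routing slots into local slots and sentinel slots.

    Assumption (from the docstring above): an id >= num_experts is a sentinel
    marking a slot routed to an expert that does not live on this rank.
    """
    local = [i for i, e in enumerate(expert_ids) if 0 <= e < num_experts]
    sentinel = [i for i, e in enumerate(expert_ids) if e >= num_experts]
    return local, sentinel


E = 4  # hypothetical number of local experts
expert_ids = [0, 3, 4, 1, 7]  # slots 2 and 4 carry sentinel ids
local, sentinel = split_local_slots(expert_ids, E)
```

A backward pass that respects this convention would be expected to leave the sentinel slots without a gradient contribution, which is the behavior a repro like this one would probe.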

AmineDiro / accelerate_pr4022_test_gist.py

Created April 28, 2026 17:23

	"""Test for PR #4022: per-layer compile + accelerate FSDP2 — slow path FIXED.

	Same script as my slow-path repro (https://gist.github.com/AmineDiro/2457fbee70662d584a116cc3ca80dd07);
	the only change is adding the `dynamo_config` block to the accelerate yaml — that's
	the trigger for `compile_regions_fsdp2` introduced by this PR.

	Setup: Qwen3-30B-A3B (MoE, 128 experts, 48 layers) · 2x8 H100 SXM 80GB ·
	FSDP2 DP=16 · seq_len=16384 · SFTTrainer + grad ckpt + bf16 + packing.
	"""
	# accelerate_config.yaml (the only diff vs. the slow-path repro is dynamo_config):

AmineDiro / test_resume_ep_fsdp.py

Created April 28, 2026 08:17

test ep+fsdp save + resume

	"""EP+FSDP2 save/load correctness test.

	Verifies that:
	full4 losses [L0, L1, L2, L3] == save2 [L0, L1] ++ load2 [L2, L3]

	Run as three separate srun/torchrun jobs sharing a checkpoint dir:

	--phase full4 : train 4 steps from scratch
	--phase save2 : train 2 steps, save state
	--phase load2 : load state, train 2 more steps
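The equality stated in the docstring can be checked with a few lines of Python once each phase has recorded its per-step losses. This is a minimal sketch, not the gist's own checker: `losses_match` is a hypothetical helper, and the optional tolerances are an assumption for runs where bitwise equality is not expected.

```python
import math


def losses_match(full, saved, resumed, rel_tol=0.0, abs_tol=0.0):
    """Return True if `full` equals `saved` followed by `resumed`, step by step.

    With both tolerances left at 0.0 this is an exact comparison, matching the
    `==` in the docstring; non-zero tolerances are an optional relaxation.
    """
    combined = list(saved) + list(resumed)
    if len(combined) != len(full):
        return False
    return all(
        math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)
        for a, b in zip(full, combined)
    )


# Illustrative numbers only, not results from the gist.
full4 = [2.0, 1.5, 1.2, 1.0]
save2 = [2.0, 1.5]
load2 = [1.2, 1.0]
ok = losses_match(full4, save2, load2)
```

Comparing the full list rather than only the final loss pinpoints the first step at which a resumed run diverges.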

AmineDiro / slow_accelerate_fsdp2.py

Last active April 24, 2026 08:35

accelerate fsdp2

	"""Per-layer compile + accelerate FSDP2 = 10% MFU (slow path)."""

	# accelerate_config.yaml:
	# distributed_type: FSDP
	# fsdp_config:
	#   fsdp_version: 2
	#   fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
	#   fsdp_cpu_ram_efficient_loading: true
	#   fsdp_offload_params: false
	# num_machines: 2

AmineDiro / fast_raw_fully_shard.py

Last active April 24, 2026 08:34

FSDP2 Per-layer compile

	"""Per-layer compile + raw fully_shard = 32% MFU (fast path)."""

	# Run with:
	# torchrun --nproc_per_node=8 --nnodes=2 --node_rank=$RANK \
	# --master_addr=$MASTER --master_port=29500 script.py
	#
	# Result: ~3,031 ms/step, 32.1% MFU

	import os, time, torch
	torch.backends.cuda.matmul.allow_tf32 = True
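For readers comparing the MFU figures quoted in these gists, the usual definition is achieved FLOP/s divided by the aggregate peak FLOP/s of the hardware. The sketch below assumes that definition; how the gists count FLOPs per step is not shown in the previews, so `flops_per_step` and `peak_flops_per_gpu` are caller-supplied assumptions and the numbers used here are purely illustrative.

```python
def mfu(flops_per_step, step_time_s, num_gpus, peak_flops_per_gpu):
    """Model FLOPs utilization: achieved FLOP/s over aggregate peak FLOP/s.

    All four inputs are supplied by the caller; nothing here is measured.
    """
    achieved_flops_per_s = flops_per_step / step_time_s
    return achieved_flops_per_s / (num_gpus * peak_flops_per_gpu)


# Toy numbers chosen for easy arithmetic, not taken from the gists.
example = mfu(
    flops_per_step=1e15,
    step_time_s=1.0,
    num_gpus=10,
    peak_flops_per_gpu=1e15,
)
```

Because the step time sits in the denominator of the achieved rate, two runs with the same model and batch can be compared by step time alone.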

AmineDiro / test_qwen3_30B.py

Created April 14, 2026 14:47

test_qwen3 EP

	"""
	Test TP + EP + CP + FSDP2 for Qwen3 MoE.

	Validates that the base_model_ep_plan correctly shards:
	- Attention weights via TP (colwise/rowwise)
	- Expert weights via EP (grouped_gemm)
	- Router via EP (ep_router)
	- Context parallelism via torch CP (sequence splitting + ring attention)
	- FSDP2 for data parallel weight sharding

AmineDiro / benchmark_grpo.py

Created March 23, 2026 21:56

benchmark grpo liger

	"""
	Benchmark: _ChunkedLogProbFunction vs LigerFusedLinearGRPOLoss

	Compares the two chunked GRPO loss approaches without a model:
	- _ChunkedLogProbFunction: chunks along vocabulary (V)
	- LigerFusedLinearGRPOLoss: chunks along batch (B), fused fwd+bwd

	Preset model configs:
	--preset qwen3-4b → H=2560, V=151936 (Qwen3-4B-Thinking, 262k context)
	--preset qwen2.5-7b → H=3584, V=152064
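Chunking along the vocabulary can be sketched for a single position with a streaming log-sum-exp, so that the full softmax over V is never materialized at once. This is an illustration of the general idea in pure Python, not the code of `_ChunkedLogProbFunction`; `chunked_token_logprob` is a hypothetical name and the input is assumed to be a non-empty list of logits.

```python
import math


def chunked_token_logprob(logits, target, chunk_size):
    """Log-probability of `target` under softmax(logits), reading the
    vocabulary in chunks and keeping only a running max and a running sum."""
    running_max = float("-inf")
    running_sum = 0.0
    for start in range(0, len(logits), chunk_size):
        chunk = logits[start:start + chunk_size]
        new_max = max(running_max, max(chunk))
        # Rescale the sum accumulated so far to the new reference maximum.
        running_sum = running_sum * math.exp(running_max - new_max)
        running_sum += sum(math.exp(x - new_max) for x in chunk)
        running_max = new_max
    log_normalizer = running_max + math.log(running_sum)
    return logits[target] - log_normalizer


logits = [1.0, 2.0, 3.0, 0.5, -1.0]
lp = chunked_token_logprob(logits, target=2, chunk_size=2)
```

The result does not depend on `chunk_size`; only peak memory does, which is the trade-off a benchmark of this kind measures against chunking along the batch dimension.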

AmineDiro / vllm_weight_sync_test.py

Created March 19, 2026 18:42

sync weights ok

	"""

	HOW TO RUN
	-----------
	Step 1 – Start a vLLM server with data-parallel support (replace N with the number of DP shards,
	and adjust --tensor-parallel-size / --gpu-memory-utilization as needed):

	CUDA_VISIBLE_DEVICES=2,3,4,5 VLLM_SERVER_DEV_MODE=1 vllm serve Qwen/Qwen3-4B \
	--data-parallel-size 4 \
	--tensor-parallel-size 1 \

AmineDiro / async_grpo_eos.py

Last active March 29, 2026 07:07

minimal test for reward

	"""
	Minimal sanity-check for AsyncGRPOTrainer: the "Immediate EOS" test.

	The model is rewarded with R(y) = -len(completion_tokens). The optimal policy
	is to emit <EOS> immediately (reward = -1). Within a handful of steps the
	average completion length should drop and reward_mean should climb toward -1.
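The reward described above can be written as a one-line function. This is a minimal sketch under the assumption that a completion is given as a list of token ids with the EOS token included; `length_penalty_reward` and the example token ids are hypothetical, and the signature the trainer actually expects is not shown in this preview.

```python
def length_penalty_reward(completion_token_ids):
    """R(y) = -len(completion_tokens): shorter completions score higher."""
    return -float(len(completion_token_ids))


EOS = 0  # hypothetical EOS token id, for illustration only
immediate_eos = length_penalty_reward([EOS])              # best case
longer = length_penalty_reward([11, 12, 13, EOS])         # a longer completion
```

Because the maximum is reached only by a completion consisting of the EOS token alone, any upward trend in `reward_mean` toward -1 is direct evidence that gradients are flowing in the right direction.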

	Start the vLLM server:

	CUDA_VISIBLE_DEVICES=0 VLLM_SERVER_DEV_MODE=1 vllm serve Qwen/Qwen3-0.6B \