AmineDiro
@AmineDiro
AmineDiro / perfetto_torch_profile_comm.sql
Created May 26, 2026 10:06
overlap between compute/comm
DROP VIEW IF EXISTS gpu;
CREATE VIEW gpu AS
SELECT
slice.id AS id,
TRIM(thread.name) AS stream,
slice.name AS kernel,
slice.category AS cat,
slice.depth AS depth,
slice.ts AS ts,
slice.dur AS dur,
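The view above exposes a start timestamp (`ts`) and a duration (`dur`) for each GPU slice. As a rough illustration of what "overlap between compute/comm" can mean once those rows are exported, here is a minimal Python sketch that measures how long compute and communication slices are active at the same time. The function names, and the assumption that slices have already been split into a compute list and a communication list, are hypothetical and not taken from the gist.

```python
def merge_intervals(intervals):
    """Merge overlapping (start, end) intervals into a sorted, disjoint list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def to_intervals(slices):
    """Convert (ts, dur) rows into (start, end) intervals."""
    return [(ts, ts + dur) for ts, dur in slices]


def overlap_time(compute_slices, comm_slices):
    """Total time during which a compute slice and a comm slice are both active."""
    a = merge_intervals(to_intervals(compute_slices))
    b = merge_intervals(to_intervals(comm_slices))
    i = j = total = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            total += hi - lo
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


# Hypothetical (ts, dur) rows, in the same time unit as the trace.
compute = [(0, 10), (20, 10)]
comm = [(5, 20)]
hidden = overlap_time(compute, comm)
comm_total = sum(dur for _, dur in comm)
print(hidden / comm_total)  # fraction of communication time hidden behind compute
```

Merging each side first keeps concurrent kernels on different streams from being counted twice.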
@AmineDiro
AmineDiro / repro_sonic_moe_backward_sentinel.py
Created April 30, 2026 07:49
repro_sonic_moe_backward_sentinel.py
"""Minimal self-contained repro scaffold for the sonic-moe sentinel/backward bug.
Dependencies: torch, kernels, nvidia-cutlass-dsl. NO transformers, NO distributed.
Background
----------
The kernel documents `expert_ids >= E` as a supported sentinel value used by EP
to mark non-local routing slots:
functional/triton_kernels/__init__.py:174-177
@AmineDiro
AmineDiro / accelerate_pr4022_test_gist.py
Created April 28, 2026 17:23
accelerate_pr4022_test_gist.py
"""Test for PR #4022: per-layer compile + accelerate FSDP2 — slow path FIXED.
Same script as my slow-path repro (https://gist.github.com/AmineDiro/2457fbee70662d584a116cc3ca80dd07);
the only change is adding the `dynamo_config` block to the accelerate yaml — that's
the trigger for `compile_regions_fsdp2` introduced by this PR.
Setup: Qwen3-30B-A3B (MoE, 128 experts, 48 layers) · 2x8 H100 SXM 80GB ·
FSDP2 DP=16 · seq_len=16384 · SFTTrainer + grad ckpt + bf16 + packing.
"""
# accelerate_config.yaml (the only diff vs. the slow-path repro is dynamo_config):
@AmineDiro
AmineDiro / test_resume_ep_fsdp.py
Created April 28, 2026 08:17
test ep+fsdp save + resume
"""EP+FSDP2 save/load correctness test.
Verifies that:
full4 losses [L0, L1, L2, L3] == save2 [L0, L1] ++ load2 [L2, L3]
Run as three separate srun/torchrun jobs sharing a checkpoint dir:
--phase full4 : train 4 steps from scratch
--phase save2 : train 2 steps, save state
--phase load2 : load state, train 2 more steps
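The equality stated in the docstring above can be checked with a few lines once each phase has written out its per-step losses. This is only a sketch: the helper name and the optional tolerance arguments are assumptions, and the gist itself may compare losses differently.

```python
import math


def losses_match(full4, save2, load2, rel_tol=0.0, abs_tol=0.0):
    """Return True if full4 equals save2 followed by load2, step by step.

    With both tolerances left at 0.0 the comparison is exact equality.
    """
    resumed = list(save2) + list(load2)
    if len(resumed) != len(full4):
        return False
    return all(
        math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)
        for a, b in zip(full4, resumed)
    )


# Hypothetical loss values, for illustration only.
full4 = [2.50, 2.31, 2.17, 2.05]
save2 = [2.50, 2.31]
load2 = [2.17, 2.05]
print(losses_match(full4, save2, load2))
```

Exact equality is the strictest check. A small `abs_tol` may be needed if the resumed run is not bitwise deterministic.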
@AmineDiro
AmineDiro / slow_accelerate_fsdp2.py
Last active April 24, 2026 08:35
accelerate fsdp2
"""Per-layer compile + accelerate FSDP2 = 10% MFU (slow path)."""
# accelerate_config.yaml:
# distributed_type: FSDP
# fsdp_config:
# fsdp_version: 2
# fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
# fsdp_cpu_ram_efficient_loading: true
# fsdp_offload_params: false
# num_machines: 2
@AmineDiro
AmineDiro / fast_raw_fully_shard.py
Last active April 24, 2026 08:34
FSDP2 Per-layer compile
"""Per-layer compile + raw fully_shard = 32% MFU (fast path)."""
# Run with:
# torchrun --nproc_per_node=8 --nnodes=2 --node_rank=$RANK \
# --master_addr=$MASTER --master_port=29500 script.py
#
# Result: ~3,031 ms/step, 32.1% MFU
import os, time, torch
torch.backends.cuda.matmul.allow_tf32 = True
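For readers unfamiliar with the MFU figure quoted above: model FLOPs utilization is the achieved model FLOP/s divided by the aggregate peak FLOP/s of the hardware. The sketch below is not the gist's own accounting. The function names are hypothetical, and the peak value is an assumed dense bf16 figure for an H100 SXM that should be replaced with whatever the measurement actually used.

```python
def mfu(flops_per_step, step_time_s, num_gpus, peak_flops_per_gpu):
    """Model FLOPs utilization: achieved FLOP/s over aggregate peak FLOP/s."""
    achieved = flops_per_step / step_time_s
    return achieved / (num_gpus * peak_flops_per_gpu)


def implied_flops_per_step(mfu_value, step_time_s, num_gpus, peak_flops_per_gpu):
    """Invert mfu(): model FLOPs per step implied by a reported MFU."""
    return mfu_value * step_time_s * num_gpus * peak_flops_per_gpu


# Assumption: dense bf16 peak per GPU, in FLOP/s.
ASSUMED_PEAK = 989e12

# Numbers from the docstring above: ~3,031 ms/step, 32.1% MFU, 2 nodes x 8 GPUs.
flops = implied_flops_per_step(0.321, 3.031, 16, ASSUMED_PEAK)
print(f"{flops:.3e} model FLOPs per step (under the assumed peak)")
```

Because step time sits in the denominator, the fast and slow paths can be compared on MFU alone only if both use the same FLOPs-per-step estimate and the same peak.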
@AmineDiro
AmineDiro / test_qwen3_30B.py
Created April 14, 2026 14:47
test_qwen3 EP
"""
Test TP + EP + CP + FSDP2 for Qwen3 MoE.
Validates that the base_model_ep_plan correctly shards:
- Attention weights via TP (colwise/rowwise)
- Expert weights via EP (grouped_gemm)
- Router via EP (ep_router)
- Context parallelism via torch CP (sequence splitting + ring attention)
- FSDP2 for data parallel weight sharding
@AmineDiro
AmineDiro / benchmark_grpo.py
Created March 23, 2026 21:56
benchmark grpo liger
"""
Benchmark: _ChunkedLogProbFunction vs LigerFusedLinearGRPOLoss
Compares the two chunked GRPO loss approaches without a model:
- _ChunkedLogProbFunction: chunks along vocabulary (V)
- LigerFusedLinearGRPOLoss: chunks along batch (B), fused fwd+bwd
Preset model configs:
--preset qwen3-4b     H=2560, V=151936 (Qwen3-4B-Thinking, 262k context)
--preset qwen2.5-7b   H=3584, V=152064
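To make "chunks along vocabulary (V)" concrete: a token's log-probability can be computed without holding a full softmax over V in memory, by streaming a running log-sum-exp across vocabulary chunks. The following is a generic pure-Python sketch of that idea, not the implementation of `_ChunkedLogProbFunction`. The function name is hypothetical, and the logits are assumed to be given as a list, whereas a real kernel would produce each chunk from the hidden state and a slice of the output projection.

```python
import math


def chunked_log_prob(logits, target, chunk_size):
    """log softmax(logits)[target], visiting the vocabulary one chunk at a time."""
    running_max = -math.inf
    running_sum = 0.0  # sum of exp(logit - running_max) over chunks seen so far
    for start in range(0, len(logits), chunk_size):
        chunk = logits[start:start + chunk_size]
        new_max = max(running_max, max(chunk))
        # Rescale the old sum to the new maximum before adding this chunk.
        running_sum = running_sum * math.exp(running_max - new_max)
        running_sum += sum(math.exp(x - new_max) for x in chunk)
        running_max = new_max
    log_z = running_max + math.log(running_sum)
    return logits[target] - log_z


logits = [0.5, -1.0, 3.0, 2.0, 0.0, -2.5, 1.5]
print(chunked_log_prob(logits, target=2, chunk_size=3))
```

Only one chunk of exponentials is live at a time, so peak memory scales with the chunk size rather than with V.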
"""
HOW TO RUN
-----------
Step 1 – Start a vLLM server with data-parallel support (replace N with the number of DP shards,
and adjust --tensor-parallel-size / --gpu-memory-utilization as needed):
CUDA_VISIBLE_DEVICES=2,3,4,5 VLLM_SERVER_DEV_MODE=1 vllm serve Qwen/Qwen3-4B \
--data-parallel-size 4 \
--tensor-parallel-size 1 \
@AmineDiro
AmineDiro / async_grpo_eos.py
Last active March 29, 2026 07:07
minimal test for reward
"""
Minimal sanity-check for AsyncGRPOTrainer: the "Immediate EOS" test.
The model is rewarded with R(y) = -len(completion_tokens). The optimal policy
is to emit <EOS> immediately (reward = -1). Within a handful of steps the
average completion length should drop and reward_mean should climb toward -1.
Start the vLLM server:
CUDA_VISIBLE_DEVICES=0 VLLM_SERVER_DEV_MODE=1 vllm serve Qwen/Qwen3-0.6B \
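The reward described in the docstring above, R(y) = -len(completion_tokens), is simple enough to sketch directly. The function name and its argument (a list of token-id lists, one per completion) are assumptions for illustration. The signature the trainer actually expects for reward functions is not shown in this preview.

```python
def negative_length_reward(completion_ids):
    """Return R(y) = -len(y) for each completion, given its token ids."""
    return [-float(len(ids)) for ids in completion_ids]


# Hypothetical token ids: the second completion is a single <EOS> token.
batch = [
    [101, 2009, 2003, 102],
    [102],
]
rewards = negative_length_reward(batch)
print(rewards)                      # one reward per completion
print(sum(rewards) / len(rewards))  # reward_mean for this batch
```

A completion that consists only of `<EOS>` has length 1, which gives the maximum reward of -1 mentioned in the docstring.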