DROP VIEW IF EXISTS gpu;
CREATE VIEW gpu AS
SELECT
  slice.id AS id,
  TRIM(thread.name) AS stream,
  slice.name AS kernel,
  slice.category AS cat,
  slice.depth AS depth,
  slice.ts AS ts,
  slice.dur AS dur,
"""Minimal self-contained repro scaffold for the sonic-moe sentinel/backward bug.
Dependencies: torch, kernels, nvidia-cutlass-dsl. NO transformers, NO distributed.

Background
----------
The kernel documents `expert_ids >= E` as a supported sentinel value used by EP
to mark non-local routing slots:
    functional/triton_kernels/__init__.py:174-177
"""Test for PR #4022: per-layer compile + accelerate FSDP2, slow path FIXED.
Same script as my slow-path repro (https://gist.github.com/AmineDiro/2457fbee70662d584a116cc3ca80dd07);
the only change is adding the `dynamo_config` block to the accelerate yaml, which is
the trigger for `compile_regions_fsdp2` introduced by this PR.
Setup: Qwen3-30B-A3B (MoE, 128 experts, 48 layers) · 2x8 H100 SXM 80GB ·
FSDP2 DP=16 · seq_len=16384 · SFTTrainer + grad ckpt + bf16 + packing.
"""
# accelerate_config.yaml (the only diff vs. the slow-path repro is dynamo_config):
"""EP+FSDP2 save/load correctness test.
Verifies that:
    full4 losses [L0, L1, L2, L3] == save2 [L0, L1] ++ load2 [L2, L3]
Run as three separate srun/torchrun jobs sharing a checkpoint dir:
    --phase full4 : train 4 steps from scratch
    --phase save2 : train 2 steps, save state
    --phase load2 : load state, train 2 more steps
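The equality described above can be checked with a small helper once the per-step losses of the three phases have been collected. This is a minimal sketch, not the test's actual code: the function name `losses_match` and the way losses are passed in as plain lists are assumptions made for illustration.

```python
import math

def losses_match(full4, save2, load2, rel_tol=0.0, abs_tol=0.0):
    """Return True if full4 equals save2 followed by load2, step by step.

    With both tolerances at 0.0 this is an exact comparison; pass a small
    tolerance if bitwise reproducibility is not expected (assumption: the
    caller decides which is appropriate for the run).
    """
    resumed = list(save2) + list(load2)
    if len(resumed) != len(full4):
        return False
    return all(
        math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)
        for a, b in zip(full4, resumed)
    )

# Hypothetical loss values, for illustration only.
ok = losses_match([2.0, 1.5, 1.2, 1.0], [2.0, 1.5], [1.2, 1.0])
```

Defaulting to an exact comparison makes any drift after a checkpoint reload visible immediately; a tolerance can be introduced explicitly if nondeterministic kernels are in use.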
"""Per-layer compile + accelerate FSDP2 = 10% MFU (slow path)."""
# accelerate_config.yaml:
# distributed_type: FSDP
# fsdp_config:
#   fsdp_version: 2
#   fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
#   fsdp_cpu_ram_efficient_loading: true
#   fsdp_offload_params: false
# num_machines: 2
"""Per-layer compile + raw fully_shard = 32% MFU (fast path)."""
# Run with:
#   torchrun --nproc_per_node=8 --nnodes=2 --node_rank=$RANK \
#     --master_addr=$MASTER --master_port=29500 script.py
#
# Result: ~3,031 ms/step, 32.1% MFU
import os, time, torch
torch.backends.cuda.matmul.allow_tf32 = True
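For readers unfamiliar with the MFU figure quoted in the header, the following sketch shows the usual definition: achieved model FLOP/s divided by the aggregate hardware peak. It is not taken from the script; the function name `mfu` and all of its parameters are assumptions, and the FLOPs per step must be supplied by the caller since the preview does not show how the script counts them.

```python
def mfu(flops_per_step, step_time_s, num_gpus, peak_flops_per_gpu):
    """Model FLOPs utilization as a fraction in [0, 1].

    flops_per_step     -- model FLOPs for one optimizer step across all GPUs
                          (assumption: computed elsewhere by the caller)
    step_time_s        -- measured wall-clock time of one step, in seconds
    num_gpus           -- number of GPUs taking part in the step
    peak_flops_per_gpu -- vendor peak FLOP/s for the dtype in use
    """
    achieved_flops_per_s = flops_per_step / step_time_s
    return achieved_flops_per_s / (num_gpus * peak_flops_per_gpu)

# Hypothetical numbers, for illustration only: 1e15 FLOPs in 1 s on two
# devices that each peak at 1e15 FLOP/s gives 50% utilization.
example = mfu(1e15, 1.0, 2, 1e15)
```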
"""
Test TP + EP + CP + FSDP2 for Qwen3 MoE.
Validates that the base_model_ep_plan correctly shards:
- Attention weights via TP (colwise/rowwise)
- Expert weights via EP (grouped_gemm)
- Router via EP (ep_router)
- Context parallelism via torch CP (sequence splitting + ring attention)
- FSDP2 for data parallel weight sharding
"""
Benchmark: _ChunkedLogProbFunction vs LigerFusedLinearGRPOLoss
Compares the two chunked GRPO loss approaches without a model:
- _ChunkedLogProbFunction: chunks along vocabulary (V)
- LigerFusedLinearGRPOLoss: chunks along batch (B), fused fwd+bwd
Preset model configs:
    --preset qwen3-4b → H=2560, V=151936 (Qwen3-4B-Thinking, 262k context)
    --preset qwen2.5-7b → H=3584, V=152064
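To make "chunks along vocabulary (V)" concrete, the sketch below computes one token's log-probability by scanning the logits in vocabulary chunks with a running log-sum-exp, so the full softmax over V is never materialized at once. It is a pure-Python illustration under stated assumptions and not the implementation of `_ChunkedLogProbFunction`; the name `chunked_token_logprob` and its parameters are hypothetical.

```python
import math

def chunked_token_logprob(logits, target, chunk_size):
    """log softmax(logits)[target], scanning the vocabulary in chunks.

    Keeps only a running maximum and a running sum of exponentials, which
    is the standard streaming log-sum-exp update.
    """
    running_max = -math.inf
    running_sum = 0.0
    for start in range(0, len(logits), chunk_size):
        chunk = logits[start:start + chunk_size]
        new_max = max(running_max, max(chunk))
        # Rescale the sum accumulated so far to the new maximum, then add
        # this chunk's contribution.
        running_sum = running_sum * math.exp(running_max - new_max)
        running_sum += sum(math.exp(x - new_max) for x in chunk)
        running_max = new_max
    log_normalizer = running_max + math.log(running_sum)
    return logits[target] - log_normalizer
```

The result does not depend on `chunk_size` up to floating-point rounding; the chunk size only trades peak memory for the number of passes.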
"""
HOW TO RUN
-----------
Step 1 – Start a vLLM server with data-parallel support (replace N with the number of DP shards,
and adjust --tensor-parallel-size / --gpu-memory-utilization as needed):
    CUDA_VISIBLE_DEVICES=2,3,4,5 VLLM_SERVER_DEV_MODE=1 vllm serve Qwen/Qwen3-4B \
        --data-parallel-size 4 \
        --tensor-parallel-size 1 \
"""
Minimal sanity-check for AsyncGRPOTrainer: the "Immediate EOS" test.
The model is rewarded with R(y) = -len(completion_tokens). The optimal policy
is to emit <EOS> immediately (reward = -1). Within a handful of steps the
average completion length should drop and reward_mean should climb toward -1.
Start the vLLM server:
    CUDA_VISIBLE_DEVICES=0 VLLM_SERVER_DEV_MODE=1 vllm serve Qwen/Qwen3-0.6B \
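The reward R(y) = -len(completion_tokens) described in the docstring above can be sketched as a plain function over token-id lists. This is an illustration only: the names `length_penalty_reward` and `reward_mean` are hypothetical, and the signature is not the one AsyncGRPOTrainer expects from a reward function. The sketch assumes the EOS token is counted as part of the completion, which is what makes the best achievable reward -1.

```python
def length_penalty_reward(completion_token_ids):
    """R(y) = -len(completion_tokens); an EOS-only completion scores -1.0."""
    return -float(len(completion_token_ids))

def reward_mean(batch_of_completions):
    """Average reward over a batch of completions (lists of token ids)."""
    rewards = [length_penalty_reward(c) for c in batch_of_completions]
    return sum(rewards) / len(rewards)

# Hypothetical token ids, for illustration only: 2 stands in for <EOS>.
early = reward_mean([[11, 12, 13, 2], [14, 2]])
late = reward_mean([[2], [2]])
```

As completions shrink toward a single EOS token, the batch mean rises toward -1, which is the trend the sanity check looks for.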