zeryx zeryx
@zeryx
zeryx / analyze.py
Last active June 10, 2026 19:32
JAX compute_on2: place a Transformer Engine matmul+add on a chosen CUDA stream, verified with nsys cuda_gpu_trace
"""Parse nsys cuda_gpu_trace CSV; report which CUDA stream each kernel class ran on."""
import csv, sys
from collections import defaultdict
path = sys.argv[1]
def classify(name):
    n = name.lower()
    if any(k in n for k in ("nccl", "allgather", "reducescatter", "sendrecv")):
        return "NCCL-comm"
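
The preview above is cut off, so here is a minimal sketch of how the per-stream report could be put together. The `Strm` and `Name` column names, the `GEMM` and `other` classes, the `streams_by_class` helper, and the sample rows are all assumptions made for illustration; they are not taken from the gist. Check the header of your own `cuda_gpu_trace` CSV before relying on them.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows in the general shape of an nsys cuda_gpu_trace CSV export.
# Column names "Strm" and "Name" are assumptions; verify against your file.
SAMPLE = """Start (ns),Duration (ns),Strm,Name
1000,50,7,ncclDevKernel_AllGather_RING_LL
1100,200,13,cutlass_gemm_kernel
1400,20,13,elementwise_add_kernel
"""

def classify(name):
    """Map a kernel name to a coarse class (only NCCL-comm comes from the gist)."""
    n = name.lower()
    if any(k in n for k in ("nccl", "allgather", "reducescatter", "sendrecv")):
        return "NCCL-comm"
    if "gemm" in n:  # assumed class, not in the visible preview
        return "GEMM"
    return "other"   # assumed fallback class

def streams_by_class(rows, stream_col="Strm", name_col="Name"):
    """Collect the set of stream IDs seen for each kernel class."""
    seen = defaultdict(set)
    for row in rows:
        seen[classify(row[name_col])].add(row[stream_col])
    return {cls: sorted(ids) for cls, ids in seen.items()}

result = streams_by_class(csv.DictReader(io.StringIO(SAMPLE)))
print(result)
```

If the matmul and add land on the stream you asked for, their classes should list that stream ID and the NCCL class should list a different one.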
@zeryx
zeryx / compute_on_repro.py
Last active June 9, 2026 20:54
Reproducer: jax.experimental.compute_on gpu_stream annotation silent no-op in JAX 0.9.1
"""
Reproducer: does compute_on("gpu_stream:N") annotate a GEMM in JAX 0.9.1?
Hypothesis under test: a bare elementwise add (x + y) gets fused/elided by XLA,
leaving no standalone op to carry _xla_stream_annotation -> annotation disappears.
A GEMM is a real cuBLAS/cutlass kernel that won't be fused away, so it should
retain the annotation if the mechanism works at all.
We check BOTH:
- lowered (pre-compilation StableHLO) .lower().as_text()
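
As a rough sketch of the check described above, the comparison amounts to searching both text dumps for the attribute name. The `annotation_report` helper and the two stand-in strings below are hypothetical and hand-written for illustration; they are not real JAX output. In a real run the texts would come from `jax.jit(f).lower(*args).as_text()` and `jax.jit(f).lower(*args).compile().as_text()`.

```python
ANNOTATION = "_xla_stream_annotation"

def annotation_report(lowered_text, compiled_text):
    """Report whether the stream annotation appears at each stage."""
    return {
        "lowered": ANNOTATION in lowered_text,
        "compiled": ANNOTATION in compiled_text,
    }

# Hand-written stand-ins (NOT real compiler output), shaped only to show
# the "present before compilation, gone after" pattern a silent no-op has.
lowered = 'stablehlo.dot_general ... {_xla_stream_annotation = "1"}'
compiled = "custom-call(...), custom_call_target=<gemm>"

report = annotation_report(lowered, compiled)
print(report)
```

A result of `lowered: True, compiled: False` would be consistent with the annotation being dropped during compilation; `False` at both stages would point at the annotation never being emitted.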
@zeryx
zeryx / README.md
Last active June 3, 2026 19:29
JAX↔XLA bring-your-own-comm NCCL symmetric all-reduce — out-of-tree FFI custom call, no XLA recompile (verified jaxlib 0.10.2/NCCL 2.29.7 on 2x Blackwell)

BYO-comm symmetric all-reduce — XLA custom call, no XLA recompile

A minimal recipe for writing your own CUDA communication kernel as an out-of-tree XLA FFI custom call, giving it symmetric NCCL buffers, and reaching peers with the NCCL device API (ncclGetLsaPointer), all built against a released jaxlib, with no XLA source checkout and no recompile.

This is the simplified sibling of the in-tree manual recipe. The in-tree one had to be built inside /opt/xla because it used XLA's internal FFI collective

@zeryx
zeryx / README.md
Last active June 2, 2026 21:03
Recipe: automatic NCCL symmetric buffers in XLA/JAX (pure Python, no rebuild) — xla_gpu_experimental_enable_nccl_symmetric_buffers

Recipe: automatic NCCL symmetric buffers in XLA/JAX (pure Python)

Turn on XLA's automatic NCCL symmetric-buffer registration for its built-in collectives (psum, all-reduce, all-gather, …) — no custom C++, no rebuild, runs on stock jaxlib. XLA window-registers the collective buffers for you via ncclCommWindowRegister(..., NCCL_WIN_COLL_SYMMETRIC).

Verified on 2× NVIDIA RTX PRO 6000 Blackwell (sm_120) with the jax-toolbox image ghcr.io/nvidia/jax:jax-2026-06-02 (jax/jaxlib 0.10.2.dev20260602, NCCL 2.28.8).
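
A minimal sketch of turning the flag on from Python, assuming it is passed through the `XLA_FLAGS` environment variable like other XLA debug options and that it accepts `=true`. The `with_flag` helper is a hypothetical convenience, not part of the recipe. The variable has to be set before `jax` is first imported.

```python
import os

# Flag name is from the recipe title; the "=true" spelling is an assumption.
FLAG = "--xla_gpu_experimental_enable_nccl_symmetric_buffers=true"

def with_flag(existing, flag):
    """Append flag to a space-separated XLA_FLAGS string if not already present."""
    parts = existing.split()
    if flag not in parts:
        parts.append(flag)
    return " ".join(parts)

# Preserve any flags the caller already set, then add ours.
os.environ["XLA_FLAGS"] = with_flag(os.environ.get("XLA_FLAGS", ""), FLAG)

# Import jax only after this point so the backend picks the flag up:
# import jax
print(os.environ["XLA_FLAGS"])
```

Appending rather than overwriting keeps any flags already exported in the shell.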

@zeryx
zeryx / README.md
Last active June 3, 2026 14:07
Recipe: symmetric (NCCL-window-registered) buffers with XLA + JAX (ncclCommWindowRegister + operands/results_memory_spaces, openxla/xla#39742)

Recipe: NCCL symmetric buffers with XLA + JAX (manual registration)

Set up a buffer that NCCL has registered as symmetric memory (ncclCommWindowRegister(..., NCCL_WIN_COLL_SYMMETRIC)), reach it from a custom GPU kernel through the NCCL device API (ncclGetLsaPointer), and drive it from XLA/JAX via a custom call — including the JAX↔XLA bridge from openxla/xla#39742 (operands_memory_spaces / results_memory_spaces frontend attributes).

This recipe is verified end-to-end on 2× NVIDIA RTX PRO 6000 Blackwell

@zeryx
zeryx / test_nvfp4_sm12x.sh
Created May 5, 2026 15:32
NVFP4 SM12x (Blackwell) validation script for vLLM — tests CUTLASS FP4 kernels, MoE backend selection, and LoRA mixin fix
#!/usr/bin/env bash
# NVFP4 SM12x Validation Script
# Tests CUTLASS FP4 kernels and MoE backend selection on Blackwell (SM120/SM121)
# Usage: docker run --gpus all --privileged --rm -v /path/to/this:/test ghcr.io/zeryx/vllm:nvfp4-sm120-f59929f59 bash /test/test_nvfp4_sm12x.sh
# Or run directly in a vLLM environment built with TORCH_CUDA_ARCH_LIST="12.0"
set -euo pipefail
RED='\033[0;31m'
GREEN='\033[0;32m'
@zeryx
zeryx / nano_completions_check.py
Created April 18, 2026 15:49
nano prefix cache: completions vs chat comparison
"""Test prefix caching with /v1/completions (raw text, no chat template).
Sends 3 identical long prompts, scrapes /metrics pre/post each, reports
per-request prefix_cache counter delta. If hits start incrementing on
req 2+, then cache works for raw text and the chat-template is the
variance source. If hits stay 0 across all 3, the bug is deeper.
"""
import json, urllib.request, time
PORT = 8000
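
The per-request delta described in the docstring can be sketched as below. The metric name `vllm:prefix_cache_hits_total`, the `read_counter` helper, and the sample scrape text are assumptions made for illustration; the real script would fetch the text from `/metrics` and should use whatever counter names the server actually exposes. The parser assumes each sample line ends with the value and carries no trailing timestamp.

```python
def read_counter(metrics_text, metric_name):
    """Sum all samples of one counter in Prometheus text exposition format."""
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        if name_and_labels.split("{", 1)[0] == metric_name:
            total += float(value)
    return total

# Hypothetical scrapes taken before and after one request.
BEFORE = """# TYPE vllm:prefix_cache_hits_total counter
vllm:prefix_cache_hits_total{model_name="m"} 0.0
vllm:prefix_cache_queries_total{model_name="m"} 1024.0
"""
AFTER = """# TYPE vllm:prefix_cache_hits_total counter
vllm:prefix_cache_hits_total{model_name="m"} 1008.0
vllm:prefix_cache_queries_total{model_name="m"} 2048.0
"""

HITS = "vllm:prefix_cache_hits_total"  # assumed metric name
delta = read_counter(AFTER, HITS) - read_counter(BEFORE, HITS)
print(f"prefix cache hit delta: {delta}")
```

Matching on the exact metric name, rather than a prefix, keeps the hits counter from being mixed with the queries counter.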
@zeryx
zeryx / nano_cache_check.py
Created April 18, 2026 15:39
nano prefix cache per-request diagnostic
"""Send N identical requests with sys=1000, scrape /metrics before/after
each request, and report per-request cache hit delta."""
import json, urllib.request, time
PORT = 8000
MODEL = "/models/nano-30b-nvfp4"
BASE = f"http://localhost:{PORT}"
# ~1000-token shared system prompt
SYS = ("You are a helpful assistant. Please provide detailed technical answers "
@zeryx
zeryx / nc.py
Created April 17, 2026 22:33
nano prefix cache diagnostic
import json, urllib.request
URL = "http://localhost:8000/v1/chat/completions"
MODEL = "/models/nano-30b-nvfp4"
def trial(label, sys_repeat, n=3):
    sys_text = ("You are a helpful assistant. Please provide detailed technical answers. " * sys_repeat).strip()
    payload = {
        "model": MODEL,
        "messages": [
@zeryx
zeryx / vllm_bug_report.md
Last active April 14, 2026 16:18
vLLM 0.19.0 Bug: Mamba prefix caching + MTP speculative decoding crashes on startup (NemotronH)

Bug: Mamba prefix caching + MTP speculative decoding crashes on startup

Your current environment

  • vLLM version: 0.19.0 (official Docker image vllm/vllm-openai:v0.19.0)
  • GPU: NVIDIA B200 (178 GB VRAM), tested TP=1 through TP=8
  • Model: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 (NemotronH hybrid Mamba2-Transformer MoE)
  • Python: 3.12

Model/config