@zeryx
zeryx / test_nvfp4_sm12x.sh
Created May 5, 2026 15:32
NVFP4 SM12x (Blackwell) validation script for vLLM — tests CUTLASS FP4 kernels, MoE backend selection, and LoRA mixin fix
#!/usr/bin/env bash
# NVFP4 SM12x Validation Script
# Tests CUTLASS FP4 kernels and MoE backend selection on Blackwell (SM120/SM121)
# Usage: docker run --gpus all --privileged --rm -v /path/to/this:/test ghcr.io/zeryx/vllm:nvfp4-sm120-f59929f59 bash /test/test_nvfp4_sm12x.sh
# Or run directly in a vLLM environment built with TORCH_CUDA_ARCH_LIST="12.0"
set -euo pipefail
RED='\033[0;31m'
GREEN='\033[0;32m'
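The script above assumes a vLLM build whose `TORCH_CUDA_ARCH_LIST` covers Blackwell SM12x. As a standalone illustration (not part of the gist), a small helper that checks whether an arch-list string covers the SM120/SM121 family might look like:

```python
# Hypothetical helper: decide whether a TORCH_CUDA_ARCH_LIST-style string
# (e.g. "9.0;12.0+PTX") covers the Blackwell SM12x family (12.0 / 12.1).
def covers_sm12x(arch_list: str) -> bool:
    for entry in arch_list.split(";"):
        arch = entry.strip().removesuffix("+PTX")  # "+PTX" suffix is optional
        try:
            major = int(arch.split(".")[0])
        except ValueError:
            continue  # skip malformed entries
        if major == 12:
            return True
    return False
```

For example, `covers_sm12x("12.0")` is true while `covers_sm12x("8.0;9.0")` is false, matching the `TORCH_CUDA_ARCH_LIST="12.0"` build mentioned in the usage comment.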
@zeryx
zeryx / nano_completions_check.py
Created April 18, 2026 15:49
nano prefix cache: completions vs chat comparison
"""Test prefix caching with /v1/completions (raw text, no chat template).
Sends 3 identical long prompts, scrapes /metrics pre/post each, reports
per-request prefix_cache counter delta. If hits start incrementing on
req 2+, the cache works for raw text and the chat template is the
source of the variance. If hits stay at 0 across all 3, the bug is deeper.
"""
import json, urllib.request, time
PORT = 8000
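The pre/post `/metrics` scrape that both diagnostics rely on can be sketched as a counter-delta helper. The metric name below is an assumption for illustration; real vLLM metric names vary by version, so inspect your server's `/metrics` output first.

```python
# Sketch of the per-request /metrics delta check described above.
# Labeled samples of the same counter are summed together.
# "vllm:prefix_cache_hits" is an assumed metric name, not taken
# from the gist; substitute whatever your vLLM version exposes.
def scrape_counter(metrics_text: str, name: str) -> float:
    """Sum all samples of a Prometheus-style counter, ignoring labels."""
    total = 0.0
    for line in metrics_text.splitlines():
        if line.startswith(name) and not line.startswith("#"):
            total += float(line.rsplit(" ", 1)[1])
    return total

def cache_hit_delta(before: str, after: str,
                    name: str = "vllm:prefix_cache_hits") -> float:
    """Delta of a counter between two /metrics scrapes."""
    return scrape_counter(after, name) - scrape_counter(before, name)
```

A delta of zero after request 2+ on identical prompts is the "bug is deeper" signal the docstring describes.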
@zeryx
zeryx / nano_cache_check.py
Created April 18, 2026 15:39
nano prefix cache per-request diagnostic
"""Send N identical requests with sys=1000, scrape /metrics before/after
each request, and report per-request cache hit delta."""
import json, urllib.request, time
PORT = 8000
MODEL = "/models/nano-30b-nvfp4"
BASE = f"http://localhost:{PORT}"
# ~1000-token shared system prompt
SYS = ("You are a helpful assistant. Please provide detailed technical answers "
@zeryx
zeryx / nc.py
Created April 17, 2026 22:33
nano prefix cache diagnostic
import json, urllib.request
URL = "http://localhost:8000/v1/chat/completions"
MODEL = "/models/nano-30b-nvfp4"
def trial(label, sys_repeat, n=3):
    sys_text = ("You are a helpful assistant. Please provide detailed technical answers. " * sys_repeat).strip()
    payload = {
        "model": MODEL,
        "messages": [
@zeryx
zeryx / vllm_bug_report.md
Last active April 14, 2026 16:18
vLLM 0.19.0 Bug: Mamba prefix caching + MTP speculative decoding crashes on startup (NemotronH)

Bug: Mamba prefix caching + MTP speculative decoding crashes on startup

Your current environment

  • vLLM version: 0.19.0 (official Docker image vllm/vllm-openai:v0.19.0)
  • GPU: NVIDIA B200 (178 GB VRAM), tested TP=1 through TP=8
  • Model: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 (NemotronH hybrid Mamba2-Transformer MoE)
  • Python: 3.12

Model/config

@zeryx
zeryx / algorithm.py
Last active December 13, 2021 16:53
Scikit learn Algorithmia demo using the Model Manifest system to tie model data and code together immutably
from Algorithmia import ADK
import joblib
## This function uses the model manifest `state` or `modelData` class to get model files defined in the model manifest automatically.
## No client work required; just make sure the name in `get_model` matches the name in your model manifest.
def load(state):
    state['model'] = joblib.load(state.get_model("model"))
    state['vectorizer'] = joblib.load(state.get_model("vectorizer"))
    return state
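The `load` above populates the state that each request then consumes. A matching `apply` step might look like the sketch below; the assumption that the manifest pairs a text vectorizer with a classifier is mine, not the gist's.

```python
# Sketch of the apply step consuming the state built by load().
# Assumption: "vectorizer" transforms raw text into features and
# "model" is a scikit-learn-style classifier with .predict().
def apply(input, state):
    features = state["vectorizer"].transform([input])
    return list(state["model"].predict(features))
```

With the Algorithmia ADK, `load` and `apply` are wired together as `ADK(apply, load).init()`, so `load` runs once at container start and `apply` on every request.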
@zeryx
zeryx / model_reloading.py
Last active September 24, 2021 17:31
An algorithm that attempts to reload its model file (if it has been updated) every 5 minutes
import Algorithmia
from time import time
import pickle
from src.data import data
client = Algorithmia.client()
DATA_MODEL_DIR = "data://.my/example"
MODEL_NAME = "example.pkl"
TIME_0 = 0
LAST_MODIFIED = ""
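The `TIME_0` / `LAST_MODIFIED` globals suggest a two-step gate: rate-limit the check to every 5 minutes, and reload only when the remote file's modification stamp changed. A self-contained sketch of that gate, with the Algorithmia data-API calls injected as callables so it stays testable:

```python
# Sketch of the reload gate implied by TIME_0 / LAST_MODIFIED above.
# get_last_modified and load_model stand in for the real data-API
# calls (e.g. fetching example.pkl from DATA_MODEL_DIR).
from time import time

REFRESH_INTERVAL_S = 300  # re-check at most every 5 minutes

def maybe_reload(cache, get_last_modified, load_model, now=None):
    now = time() if now is None else now
    if now - cache.get("checked_at", float("-inf")) < REFRESH_INTERVAL_S:
        return cache  # checked recently; keep the current model
    cache["checked_at"] = now
    stamp = get_last_modified()
    if stamp != cache.get("last_modified"):  # remote file changed
        cache["model"] = load_model()
        cache["last_modified"] = stamp
    return cache
```

Calling `maybe_reload` at the top of every request keeps the model fresh without hitting the data API on each call.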
@zeryx
zeryx / algorithm_with_lock.py
Created September 24, 2021 16:30
This algorithm synchronizes access to a resource file by ensuring a lock file doesn't already exist.
from Algorithmia import ADK
import Algorithmia
from time import sleep, time
state_file_path = "data://.my/locking/resource.json"
lock_file_path = "data://.my/locking/lock"
client = Algorithmia.client()
class AlgorithmiaLock(object):
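The preview cuts off at the class definition, but the pattern it names is a lock-file context manager: spin until the lock file is absent, create it, and delete it on exit. A sketch with the storage backend injected (a dict here; the real version would use `client.file(lock_file_path).exists()` / `.put()` / `.delete()`):

```python
# Sketch of the lock-file pattern, with storage injected so it runs
# without the Algorithmia data API. Note the check-then-create step
# is not atomic, so this is advisory locking, not a hard guarantee;
# the same caveat applies to the data-API version.
from time import sleep, time

class FileLock:
    def __init__(self, store, path, timeout_s=30, poll_s=0.01):
        self.store, self.path = store, path
        self.timeout_s, self.poll_s = timeout_s, poll_s

    def __enter__(self):
        deadline = time() + self.timeout_s
        while self.path in self.store:       # lock file already exists
            if time() > deadline:
                raise TimeoutError(f"could not acquire {self.path}")
            sleep(self.poll_s)
        self.store[self.path] = "locked"     # create the lock file
        return self

    def __exit__(self, *exc):
        self.store.pop(self.path, None)      # always release, even on error
```

Usage: `with FileLock(store, lock_file_path): ...` guards the read-modify-write of `resource.json`.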
@zeryx
zeryx / algorithm_process_pandas_df.py
Created March 19, 2021 16:45
Algorithm API calls with large pandas DataFrame objects: an example of using the Data API
import Algorithmia
import pandas as pd
client = Algorithmia.client()
def apply(input):
    input_dataframe = pd.DataFrame.from_dict(client.file(input).getJson())
    ...
    ...
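The `apply` above rebuilds a DataFrame from JSON fetched via the Data API, so both sides must agree on a dict orientation. A round-trip sketch (the `"list"` orientation is an assumption; match whatever the uploader used):

```python
# Round-trip sketch: serialize a DataFrame to a JSON-able dict for
# upload via the Data API, then rebuild it on the algorithm side.
import pandas as pd

def to_payload(df: pd.DataFrame) -> dict:
    # column -> list-of-values, which survives JSON cleanly
    return df.to_dict(orient="list")

def from_payload(payload: dict) -> pd.DataFrame:
    # default orient="columns" matches the shape produced above
    return pd.DataFrame.from_dict(payload)
```

Passing a Data API path in `input` instead of the frame itself is what keeps large frames out of the request payload size limits.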
@zeryx
zeryx / generative_model_finetuning.py
Created December 9, 2020 16:54
finetuning a GPT-2 model to handle a character list
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers import AdamW
from random import choice
from torch.nn import functional as F
import torch
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").to('cuda')
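The preview stops at model loading; the "character list" part of the fine-tune needs each character turned into a training string first. A hypothetical helper for that step (the `Character:`/`Description:` prompt format is my assumption, not taken from the gist):

```python
# Hypothetical helper: turn a character list into a training string
# for causal-LM fine-tuning. The prompt format is an assumption.
from random import choice

def make_example(characters: dict) -> str:
    name = choice(list(characters))  # sample one character per step
    return f"Character: {name}\nDescription: {characters[name]}"
```

Each string would then be tokenized and passed to the model with `labels=input_ids` for the standard causal-LM loss; that training loop needs `torch`/`transformers` and a GPU, so it is omitted here.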