@armand1m
Last active February 22, 2026 22:03
qwen3-coder-next - vllm 0.15.1 - transformers 5 - optimized for dgx spark
#!/bin/bash
docker run -d \
--name vllm \
--restart unless-stopped \
--gpus all \
--ipc host \
--shm-size 64gb \
--memory 110g \
--memory-swap 120g \
--pids-limit 4096 \
-p 0.0.0.0:18080:8000 \
-e HF_TOKEN="${HF_TOKEN:-}" \
-e VLLM_LOGGING_LEVEL="INFO" \
-e NVIDIA_TF32_OVERRIDE="1" \
-e TORCH_ALLOW_TF32_CUBLAS_OVERRIDE="1" \
-e VLLM_TORCH_COMPILE="1" \
-e VLLM_FLOAT32_MATMUL_PRECISION="high" \
-e VLLM_LOG_STATS_INTERVAL="10" \
-e VLLM_ATTENTION_BACKEND="FLASHINFER" \
-e VLLM_FLASHINFER_FORCE_TENSOR_CORES="1" \
-e VLLM_FLASHINFER_MOE_BACKEND="throughput" \
-e CUDA_VISIBLE_DEVICES="0" \
-e PYTHONHASHSEED="0" \
-e VLLM_USE_V2_MODEL_RUNNER="0" \
-e VLLM_ENABLE_PREFIX_CACHING="1" \
-e TORCH_CUDA_ARCH_LIST="12.1f" \
-v "$HOME/huggingface:/root/.cache/huggingface" \
scitrera/dgx-spark-vllm:0.15.1-t5 \
vllm serve Qwen/Qwen3-Coder-Next-FP8 \
--served-model-name qwen3-coder-next \
--load-format fastsafetensors \
--attention-backend flashinfer \
--port 8000 \
--max-model-len 262144 \
--block-size 128 \
--max-num-seqs 16 \
--max-num-batched-tokens 131072 \
--gpu-memory-utilization 0.80 \
--kv-cache-dtype auto \
--enable-prefix-caching \
--trust-remote-code \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--disable-uvicorn-access-log \
--kv-cache-metrics \
--cudagraph-metrics \
--enable-mfu-metrics \
-cc.max_cudagraph_capture_size 512 \
--tensor-parallel-size 1
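Once the container is running, the server can be smoke-tested from the host on port 18080 (per the `-p` mapping above). A minimal sketch, assuming `curl` is installed; `/v1/models` and `/v1/chat/completions` are vLLM's standard OpenAI-compatible routes:

```shell
#!/bin/bash
# Poll until the OpenAI-compatible API answers, then send a test request.
until curl -sf http://localhost:18080/v1/models > /dev/null; do
  sleep 5
done

# The model name matches --served-model-name in the serve command above.
curl -s http://localhost:18080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen3-coder-next",
        "messages": [{"role": "user", "content": "Write hello world in bash."}],
        "max_tokens": 64
      }'
```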
@capitangiaco

I stopped the container when its memory usage hit 114 GB; I will retry with -e VLLM_TORCH_COMPILE="0".
The next steps are lowering --max-num-batched-tokens to 32K and --gpu-memory-utilization to 0.75.
I'm starting to think that with 128 GB of unified memory, the context length to use should be 128K.
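The 128K intuition can be sanity-checked with the standard KV-cache sizing formula (2 tensors, key and value, per layer). A back-of-the-envelope sketch; the model dimensions below are illustrative placeholders, NOT the real Qwen3-Coder-Next config, whose hybrid attention scheme gives a different footprint:

```shell
#!/bin/bash
# Back-of-the-envelope KV-cache sizing for one sequence at a given context length.
layers=48        # hypothetical transformer layer count
kv_heads=8       # hypothetical KV heads (GQA)
head_dim=128     # hypothetical head dimension
dtype_bytes=2    # 2 for fp16/bf16 KV cache, 1 for fp8
ctx_len=131072   # 128K context
num_seqs=1

# Factor of 2 accounts for the separate key and value tensors.
kv_bytes=$((2 * layers * kv_heads * head_dim * dtype_bytes * ctx_len * num_seqs))
kv_gib=$((kv_bytes / 1073741824))

echo "KV cache for ${ctx_len} tokens: ${kv_gib} GiB"
```

With these placeholder dimensions, a single 128K-token sequence already costs 24 GiB of KV cache, which is why --gpu-memory-utilization and --max-num-seqs interact so directly with the usable context length.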
