@armand1m
Last active February 22, 2026 22:03
qwen3-coder-next - vllm 0.15.1 - transformers 5 - optimized for dgx spark
#!/bin/bash
# Serve Qwen/Qwen3-Coder-Next-FP8 with vLLM on a DGX Spark (128 GB unified memory).
# Container memory is capped below the unified pool so the host keeps headroom; the
# server is exposed on host port 18080 as an OpenAI-compatible API named "qwen3-coder-next".
docker run -d \
  --name vllm \
  --restart unless-stopped \
  --gpus all \
  --ipc host \
  --shm-size 64gb \
  --memory 110g \
  --memory-swap 120g \
  --pids-limit 4096 \
  -p 0.0.0.0:18080:8000 \
  -e HF_TOKEN="${HF_TOKEN:-}" \
  -e VLLM_LOGGING_LEVEL="INFO" \
  -e NVIDIA_TF32_OVERRIDE="1" \
  -e TORCH_ALLOW_TF32_CUBLAS_OVERRIDE="1" \
  -e VLLM_TORCH_COMPILE="1" \
  -e VLLM_FLOAT32_MATMUL_PRECISION="high" \
  -e VLLM_LOG_STATS_INTERVAL="10" \
  -e VLLM_ATTENTION_BACKEND="FLASHINFER" \
  -e VLLM_FLASHINFER_FORCE_TENSOR_CORES="1" \
  -e VLLM_FLASHINFER_MOE_BACKEND="throughput" \
  -e CUDA_VISIBLE_DEVICES="0" \
  -e PYTHONHASHSEED="0" \
  -e VLLM_USE_V2_MODEL_RUNNER="0" \
  -e VLLM_ENABLE_PREFIX_CACHING="1" \
  -e TORCH_CUDA_ARCH_LIST="12.1f" \
  -v "$HOME/huggingface:/root/.cache/huggingface" \
  scitrera/dgx-spark-vllm:0.15.1-t5 \
  vllm serve Qwen/Qwen3-Coder-Next-FP8 \
  --served-model-name qwen3-coder-next \
  --load-format fastsafetensors \
  --attention-backend flashinfer \
  --port 8000 \
  --max-model-len 262144 \
  --block-size 128 \
  --max-num-seqs 16 \
  --max-num-batched-tokens 131072 \
  --gpu-memory-utilization 0.80 \
  --kv-cache-dtype auto \
  --enable-prefix-caching \
  --trust-remote-code \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --disable-uvicorn-access-log \
  --kv-cache-metrics \
  --cudagraph-metrics \
  --enable-mfu-metrics \
  -cc.max_cudagraph_capture_size 512 \
  --tensor-parallel-size 1
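
Once the container reports ready, a quick smoke test against the OpenAI-compatible endpoint (assuming the 18080 host port and served model name configured above):

curl -s http://localhost:18080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-coder-next",
    "messages": [{"role": "user", "content": "Say hello in bash."}],
    "max_tokens": 32
  }'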
@capitangiaco

--gpu-memory-utilization 0.90 is too high; my Spark went OOM after one hour of coding.
With 0.80, after one day of coding I am at 170,000 tokens, with RAM at 117 GB / 120 GB and 3.47 GB of swap used, but it is still working.
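
A simple way to watch for this during a long session (a generic sketch, not part of the original setup; assumes the container name vllm from the script):

while true; do
  # container memory usage (used / limit) and host swap; rising swap precedes the OOM
  docker stats vllm --no-stream --format "mem: {{.MemUsage}}"
  free -h | awk '/^Swap/ {print "swap used:", $3}'
  sleep 30
done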

@armand1m (Author)

@capitangiaco indeed, 0.80 is safer. I reduced it as well.

@armand1m (Author)

Also, it's most likely better to use the 0.16.0-t5 image at this stage.
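
Switching tags means recreating the container; the rest of the script is unchanged (a sketch, assuming the same scitrera/dgx-spark-vllm image name):

docker rm -f vllm
docker pull scitrera/dgx-spark-vllm:0.16.0-t5
# then rerun the script above with the image line changed to the 0.16.0-t5 tag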

@capitangiaco

> also, better to use 0.16.0-t5 at this stage most likely

I will try it.
I had to use --max-num-batched-tokens 65536; with 131072 the system begins to swap at about 130-140K tokens.

@capitangiaco

I stopped the container at 114 GB; I will retry with -e VLLM_TORCH_COMPILE="0".
The next steps are --max-num-batched-tokens 32768 and --gpu-memory-utilization 0.75.
I'm starting to think that with 128 GB, the context length to use should be 128K.
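
For reference, that combination would change only these serve flags relative to the script above (untested; 128K = 131072 and 32K = 32768):

vllm serve Qwen/Qwen3-Coder-Next-FP8 \
  --max-model-len 131072 \
  --max-num-batched-tokens 32768 \
  --gpu-memory-utilization 0.75
# ...plus the remaining flags from the original script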
