LLM inference memory usage = model weights + KV cache (grows with max tokens) + activations. (Optimizer state only matters during training, not serving.)
With your 8B model (~16–18 GB of FP16 weights), requesting 10,500 tokens will likely push total usage past your 19.5 GB of VRAM, especially since --gpu-memory-utilization=0.99 leaves almost no headroom for other allocations.
Calculation example:
- Your model: 8B (~16 GB FP16 weights)
- Max tokens: 10,500
- KV cache ≈ 10,500 tokens × ~0.0007 GB/token ≈ 7.35 GB of additional VRAM (see the sketch below for where the per-token figure comes from)
- Total VRAM needed ≈ 16 GB + 7.35 GB ≈ 23.35 GB → exceeds the 19.5 GB available, hence the OOM
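
If you want to sanity-check the per-token constant against a specific model, the KV-cache cost can be derived from the architecture. Here is a minimal Python sketch; the layer count, KV-head count, and head dimension are assumptions for a generic 8B model, so substitute the values from your model's config.json:

```python
# Minimal sketch: estimate KV-cache VRAM from architecture parameters.
# All defaults below are assumptions for a generic 8B model, not your
# exact config -- read num_hidden_layers, num_key_value_heads, and
# head_dim from the model's config.json.

def kv_cache_gb(max_tokens: int,
                num_layers: int = 32,      # assumed layer count for an 8B model
                num_kv_heads: int = 32,    # 32 = full MHA; GQA models use fewer
                head_dim: int = 128,       # hidden_size / num_attention_heads
                bytes_per_elem: int = 2):  # FP16
    # Per token, every layer stores one K and one V vector (hence the 2x)
    per_token_bytes = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return max_tokens * per_token_bytes / 1024**3

weights_gb = 16.0                # ~8B params x 2 bytes (FP16)
kv_gb = kv_cache_gb(10_500)      # ~5.1 GB under these assumptions
print(f"KV cache: {kv_gb:.2f} GB, total: {weights_gb + kv_gb:.2f} GB")
```

Under these assumptions the per-token cost works out to ~0.5 MB, giving ~5.1 GB of KV cache and ~21.1 GB total, still over your 19.5 GB budget; the 0.0007 GB/token figure used above is a more conservative rule of thumb that also leaves headroom for activations and runtime overhead.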