TL;DR. GLM-5.2 (glm_moe_dsa, i.e. DeepSeek Sparse Attention) does not run on Ampere (A100, sm_80) with stock vLLM, because the sparse-MLA attention backend (FLASHMLA_SPARSE) and the lightning indexer's fp8_mqa_logits (DeepGEMM) are Hopper/Blackwell-only. vLLM PR #38476 (issue #38006) adds a Triton sparse-MLA backend (TRITON_MLA_SPARSE) and a bf16 Triton indexer fallback, both of which run on Ampere. Cherry-picking the PR onto current main is a Python-only change, so no CUDA recompile is needed. Result: GLM-5.2 AWQ-INT4 serves on 8× A100 at ~56 tok/s single-stream and ~625 tok/s aggregate decode (32-way concurrency), with coherent output.
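To check which side of that split a given machine falls on, a minimal sketch is below. It assumes "Hopper/Blackwell-only" can be read as compute capability major version 9 or higher, and that the Triton path applies to major version 8 (this post only confirms sm_80). `pick_sparse_mla_backend` is a hypothetical helper written for illustration; it is not vLLM's own backend selection logic, and only the two backend names come from the PR discussion.

```python
from typing import Optional, Tuple

# Assumption: Hopper (sm_90) and Blackwell map to compute capability major >= 9.
HOPPER_MAJOR = 9
# Assumption: the Triton fallback targets Ampere-class parts (major == 8).
# Only sm_80 (A100) is confirmed in this post.
AMPERE_MAJOR = 8


def pick_sparse_mla_backend(capability: Tuple[int, int]) -> Optional[str]:
    """Hypothetical helper: map (major, minor) to the backend name discussed above.

    Returns None for GPUs older than Ampere, which this post does not cover.
    """
    major, _minor = capability
    if major >= HOPPER_MAJOR:
        return "FLASHMLA_SPARSE"
    if major == AMPERE_MAJOR:
        return "TRITON_MLA_SPARSE"
    return None


if __name__ == "__main__":
    try:
        import torch  # optional: only needed to query the local GPU
    except ImportError:
        torch = None

    if torch is not None and torch.cuda.is_available():
        cap = torch.cuda.get_device_capability(0)
    else:
        cap = (8, 0)  # illustrative fallback: A100 (sm_80)

    print(f"sm_{cap[0]}{cap[1]} -> {pick_sparse_mla_backend(cap)}")
```

On an A100 this reports `TRITON_MLA_SPARSE`, meaning the backend from PR #38476 is the one that has to be present for the model to load.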
This is an independent 8× A100 confirmation of PR #38476 (the author validated on 32× A100), plus a no-recompile install note. Credit to @haosdent for the PR.
- 8× A100 80 GB (sm_80). ~410 GiB of VRAM is used at TP=8, so all 8 GPUs are needed.