DeepSeek-V3’s engineers optimized GPU performance at the low-level by tailoring kernels and memory access patterns to NVIDIA’s hardware. A key strategy was warp specialization: they partitioned a subset of GPU threads (warps) specifically for communication tasks, allowing compute to overlap with data transfers (DeepSeek-V3 Technical Report). In practice, only ~20 of the GPU’s Streaming Multiprocessors (SMs) were reserved to handle all cross-node communications – enough to saturate both InfiniBand (IB) and NVLink bandwidth – while the remaining SMs focused purely on computation (DeepSeek-V3 Technical Report) ([DeepSeek-V3 Technical Report](https://arx