f0ster / deepseek-v3-tech-dive.md
Created February 17, 2025 20:03
Smashing the Tariffs for Fun and Profit: How DeepSeek v3 Outsmarted the AI Ban

1. CUDA and PTX Optimizations

DeepSeek-V3’s engineers optimized GPU performance at a low level by tailoring kernels and memory access patterns to NVIDIA’s hardware. A key strategy was warp specialization: they dedicated a subset of GPU warps (groups of 32 threads) to communication tasks so that computation could overlap with data transfers (DeepSeek-V3 Technical Report). In practice, only about 20 of the GPU’s Streaming Multiprocessors (SMs) were reserved to handle all cross-node communication, which was enough to saturate both InfiniBand (IB) and NVLink bandwidth, while the remaining SMs focused purely on computation ([DeepSeek-V3 Technical Report](https://arx