{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"id": "1dfc798d", | |
"metadata": {}, | |
"source": [ | |
"# 💥 Smashing the Tariffs for Fun and Profit: How DeepSeek v3 Outsmarted the AI Ban 🧠🚀\n", | |
"## A Master Class in Efficient CUDA/PTX Optimization and MoE Scaling\n", | |
"\n", | |
"## 1. Introduction \n", | |
"DeepSeek-V3 is a 671 billion-parameter Mixture-of-Experts (MoE) language model, with only ~37 billion parameters activated per token ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=We%20present%20DeepSeek,Comprehensive%20evaluations)) ([DeepSeek-V3 Explained: Optimizing Efficiency and Scale](https://adasci.org/deepseek-v3-explained-optimizing-efficiency-and-scale/#:~:text=DeepSeek,world%20applications)). This sparsely-activated design enables *massive model capacity* without linearly scaling computation cost. Achieving this at scale required a holistic co-design of model architecture, training algorithms, and low-level GPU optimizations. Key innovations include Multi-Head Latent Attention (MLA) for memory-efficient attention, an *auxiliary-loss-free* MoE load balancing mechanism, FP8 mixed-precision training, sophisticated memory management, overlapping pipeline parallelism via the **DualPipe** algorithm, and custom CUDA/PTX-level optimizations ([The DeepSeek Series: A Technical Overview](https://martinfowler.com/articles/deepseek-papers.html#:~:text=DualPipe%2C%20PTX,extreme%20scale%20with%20stable%20training)) ([The DeepSeek Series: A Technical Overview](https://martinfowler.com/articles/deepseek-papers.html#:~:text=outperforming%20other%20open,extreme%20scale%20with%20stable%20training)). Together, these strategies allowed DeepSeek-V3 to be trained on 14.8 trillion tokens in only 2.788 million H800 GPU hours (~$5.576M) ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=We%20present%20DeepSeek,Comprehensive%20evaluations)) ([DeepSeek-V3 Explained: Optimizing Efficiency and Scale](https://adasci.org/deepseek-v3-explained-optimizing-efficiency-and-scale/#:~:text=DeepSeek,576)), with *no instabilities or loss spikes* throughout training ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=performance%20comparable%20to%20leading%20closed,V3)). \n", | |
"\n", | |
"**Topics Covered in this Guide:** \n", | |
"- **Architecture-Level Optimizations:** The *Multi-Head Latent Attention* mechanism and the *DeepSeek-MoE* architecture (finer experts + novel load balancing) that underpin efficiency. \n", | |
"- **Precision and Arithmetic Optimizations:** Adoption of *FP8 mixed precision* with fine-grained quantization and high-precision accumulation to halve memory and accelerate GEMMs. \n", | |
"- **Parallelism and Communication:** The *DualPipe pipeline parallelism* algorithm and communication strategies (overlapping computation with all-to-all exchanges, topology-aware routing, and PTX-level tweaks) for near-linear scaling. \n", | |
"- **Memory Management:** Techniques like selective recomputation, CPU offloading, and weight sharing that minimize the memory footprint with minimal overhead. \n", | |
"- **Economic Scaling Impact:** Analysis of how these optimizations translate to reduced training time, cost, and improved hardware utilization, enabling unprecedented scale on limited resources. \n", | |
"\n", | |
"Each section provides technical rigor with calculus derivations (where applicable), real PTX code snippets illustrating critical optimizations, and high-level impact analysis. This master class offers a definitive guide to DeepSeek-V3’s efficiency strategies, demonstrating how to co-design algorithms and hardware-aware optimizations for training giant models efficiently ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=Through%20the%20co,to%20further%20scale%20up%20the)) ([The DeepSeek Series: A Technical Overview](https://martinfowler.com/articles/deepseek-papers.html#:~:text=compression%2C%20mixture,parameter)). \n", | |
"\n", | |
"---\n", | |
"\n", | |
"## 2. Architecture-Level Innovations for Efficiency \n", | |
"\n", | |
"### 2.1 Multi-Head Latent Attention (MLA) – Memory-Efficient Attention \n", | |
"**Overview:** *Multi-Head Latent Attention (MLA)* is a modified attention mechanism that compresses key and value vectors into a lower-dimensional **latent space**, significantly reducing memory and compute overhead ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS1#:~:text=The%20core%20of%20MLA%20is,Value%20%28KV%29%20cache%20during%20inference)) ([DeepSeek-V3 Explained: Optimizing Efficiency and Scale](https://adasci.org/deepseek-v3-explained-optimizing-efficiency-and-scale/#:~:text=MLA%20enhances%20inference%20efficiency%20by,value%20storage%20requirements%20during%20inference)). By caching only small latent representations instead of full key/value matrices, MLA slashes the cost of attention, especially for long sequences during inference. Importantly, MLA maintains accuracy comparable to standard multi-head attention by reconstructing full keys/values on the fly from the latent vectors ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS1#:~:text=decoupled%20key%20that%20carries%20Rotary,80)) ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS1#:~:text=concatenation,Head%20Attention%20%28MHA%29%C2%A0%28Vaswani%20et%C2%A0al.%2C%202017)). This section breaks down the math behind MLA and its practical impact.\n", | |
"\n", | |
"#### 2.1.1 Low-Rank Compression of Keys and Values \n", | |
"**Method:** For each attention layer, let $d_{\\text{model}}$ be the model embedding dimension, $h$ the number of heads, and $d_k = d_{\\text{model}}/h$ the per-head dimension. Instead of directly computing and storing full keys $K \\in \\mathbb{R}^{d_k}$ and values $V \\in \\mathbb{R}^{d_k}$ for each token, MLA introduces a learned **down-projection** matrix $W^{D}_{KV}$ that produces a *compressed latent vector* $c_t^{KV}$ of much lower dimension $r \\ll d_k$. Formally, for token $t$ with input $h_t$ to the attention layer: \n", | |
"\n", | |
"- **Compression:** $c_t^{KV} = W^{D}_{KV} \\, h_t$, where $c_t^{KV} \\in \\mathbb{R}^r$ is the latent representation capturing the essential information for both keys and values ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS1#:~:text=The%20core%20of%20MLA%20is,Value%20%28KV%29%20cache%20during%20inference)) ([The DeepSeek Series: A Technical Overview](https://martinfowler.com/articles/deepseek-papers.html#:~:text=Where%20%24c_%7Bt%7D,In%20simpler%20terms)). Typically $r$ is chosen such that $r \\cdot h$ (total latent per head) is significantly smaller than $d_{\\text{model}}$, yielding a low-rank approximation. This is analogous to factorizing the $d_k$-dimensional key/value vectors via an intermediate of rank $r$ ([The DeepSeek Series: A Technical Overview](https://martinfowler.com/articles/deepseek-papers.html#:~:text=1,fly%20from%20the%20latent%20vector)).\n", | |
"\n", | |
"- **Reconstruction of Keys/Values:** Two up-projection matrices $W^{U}_{K}$ and $W^{U}_{V}$ expand the latent back to the original size: $\\mathbf{k}_{t} = W^{U}_{K}\\,c_t^{KV}$ and $\\mathbf{v}_{t} = W^{U}_{V}\\,c_t^{KV}$ ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS1#:~:text=where%20is%20the%20compressed%20latent,Head%20Attention)) ([The DeepSeek Series: A Technical Overview](https://martinfowler.com/articles/deepseek-papers.html#:~:text=%24%24%20%5Cquad%20%5Cmathbf%7Bc%7D_%7Bt%7D,KV%7D%2C%20%5Cquad)). Additionally, a separate matrix $W^{\\text{RoPE}}_{K}$ is applied to $c_t^{KV}$ to produce a *RoPE*-embedded **decoupled key** $\\tilde{\\mathbf{k}}_t = \\text{RoPE}(W^{\\text{RoPE}}_{K}\\,c_t^{KV})$ which injects rotary positional encodings (RoPE) for the key ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS1#:~:text=compression%20dimension%3B%20denotes%20the%20down,80)). The final key used in attention is the concatenation $K_t = [\\,\\tilde{\\mathbf{k}}_t \\,\\Vert\\, \\mathbf{k}_t\\,]$, combining positional information with content, while the value uses its up-projection $\\mathbf{v}_t$ directly.\n", | |
"\n", | |
"**Cache Efficiency:** Crucially, **only the latent vector $c_t^{KV}$ (and its RoPE-transformed counterpart)** are stored for each past token during autoregressive generation ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS1#:~:text=decoupled%20key%20that%20carries%20Rotary,80)) ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS1#:~:text=concatenation,Head%20Attention%20%28MHA%29%C2%A0%28Vaswani%20et%C2%A0al.%2C%202017)). This means the memory cost per past token is $r$ (plus a small overhead for positional info) instead of $2d_k$ for full key+value. For example, if $d_k=128$ and $r=32$, caching MLA vectors yields a 4x reduction in memory per token. This dramatically reduces the *KV cache size* during inference, which is often the bottleneck for long-context LLMs ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS1#:~:text=decoupled%20key%20that%20carries%20Rotary,80)) ([DeepSeek-V3 Explained: Optimizing Efficiency and Scale](https://adasci.org/deepseek-v3-explained-optimizing-efficiency-and-scale/#:~:text=MLA%20enhances%20inference%20efficiency%20by,value%20storage%20requirements%20during%20inference)). The model can reconstruct high-dimensional keys/values on-demand with minimal compute, since it’s just a matrix multiplication with $W^U_K$ or $W^U_V$ for each needed token ([The DeepSeek Series: A Technical Overview](https://martinfowler.com/articles/deepseek-papers.html#:~:text=turn%20one%20matrix%20of%20dim,fly%20from%20the%20latent%20vector)).\n", | |
"\n", | |
"**Derivation (Low-Rank Approximation):** The MLA approach can be understood as learning a low-rank factorization of the key/value projection matrices. In standard attention, one would have $K_t = W^K h_t$ and $V_t = W^V h_t$ with $W^K, W^V \\in \\mathbb{R}^{d_k \\times d_{\\text{model}}}$. MLA factorizes each of these into two smaller matrices: e.g. $W^K = W^U_K \\, W^D_{KV}$ (and similarly $W^V = W^U_V \\, W^D_{KV}$) ([The DeepSeek Series: A Technical Overview](https://martinfowler.com/articles/deepseek-papers.html#:~:text=1,fly%20from%20the%20latent%20vector)). If $W^D_{KV}$ is $r\\times d_{\\text{model}}$ and $W^U_K$ is $d_k \\times r$, then $W^K$ is effectively rank-$\\le r$. In practice $r$ is chosen such that this factorization captures the important variance of the queries/keys. By **Eckart-Young theorem**, an optimal low-rank $r$ approximation of a matrix (in least-squares sense) retains the largest $r$ singular values; here, $W^D_{KV}$ and $W^U_K$ are learned directly via training, effectively letting the model discover a good low-dimensional subspace for keys/values. The end result is that $K_t \\approx W^U_K W^D_{KV} h_t$ and $V_t \\approx W^U_V W^D_{KV} h_t$ with minimal loss in attention fidelity, as confirmed by nearly unchanged performance vs. full attention ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS1#:~:text=decoupled%20key%20that%20carries%20Rotary,80)).\n", | |
"\n", | |
"**Impact:** This compression yields **significant memory savings** and **inference speedups**. With MLA, only the “blue-boxed” latent vectors (i.e. $c^{KV}$ for keys and values) are stored during generation, reducing memory by a factor of $\\frac{d_k}{r}$ ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS1#:~:text=decoupled%20key%20that%20carries%20Rotary,80)). For example, if a traditional model requires 100 GB for KV cache at a given context length, MLA might reduce this to ~25 GB for the same context, enabling longer contexts or smaller memory GPUs to handle generation ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS1#:~:text=decoupled%20key%20that%20carries%20Rotary,80)). The authors report that MLA achieves *performance comparable to standard Multi-Head Attention* despite this compression ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS1#:~:text=concatenation,Head%20Attention%20%28MHA%29%C2%A0%28Vaswani%20et%C2%A0al.%2C%202017)) ([DeepSeek-V3 Explained: Optimizing Efficiency and Scale](https://adasci.org/deepseek-v3-explained-optimizing-efficiency-and-scale/#:~:text=MLA%20enhances%20inference%20efficiency%20by,value%20storage%20requirements%20during%20inference)). Thus, MLA improves **inference efficiency** (memory footprint and cache bandwidth) *without degrading model quality*. This makes serving DeepSeek-V3 practical even at long contexts that would otherwise be prohibitive in memory cost. In summary, MLA exemplifies *algorithmic efficiency*: trading a bit of extra compute (on-the-fly projection of keys/values) for big memory savings and throughput gains.\n", | |
"\n", | |
"#### 2.1.2 Low-Rank Query Projection and Training Memory \n", | |
"While MLA’s key/value compression benefits inference, it also introduces optimizations for *training*. In transformer training, the *query activations* at each attention layer contribute significantly to memory usage because they must be stored for backpropagation. DeepSeek-V3 applies a similar low-rank compression to queries to reduce this overhead ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS1#:~:text=For%20the%20attention%20queries%2C%20we,the%20activation%20memory%20during%20training)). \n", | |
"\n", | |
"- **Query Compression:** Let $Q_t = W^Q h_t$ be the full query (dimension $d_k$). MLA uses a learned down-projection $W^D_Q$ to produce a latent $c_t^Q = W^D_Q h_t$ of size $r_Q$ per head, and an up-project $W^U_Q$ to recover the query: $\\tilde{Q}_t = W^U_Q\\,c_t^Q$ ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS1#:~:text=)). In practice $r_Q$ can be chosen similar to $r$ for KV (or potentially different if queries can be compressed more/less). They also apply a positional embedding matrix for queries (analogous to keys) to produce a RoPE-embedded query if needed ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS1#:~:text=where%20is%20the%20compressed%20latent,decoupled%20queries%20that%20carry%20RoPE)).\n", | |
"\n", | |
"- **Memory Savings:** During training, instead of storing the full query activations $Q_t$ for use in backprop, the model can store the *compressed queries* $c_t^Q$ (and perhaps the small positional component) ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS1#:~:text=For%20the%20attention%20queries%2C%20we,the%20activation%20memory%20during%20training)). Since $c_t^Q \\in \\mathbb{R}^{r_Q}$ with $r_Q \\ll d_k$, this reduces the stored activation size per token substantially. When computing gradients, the full $Q_t$ can be reconstructed on-the-fly from $c_t^Q$ as needed.\n", | |
"\n", | |
"- **Backprop Derivation:** The gradient flow through this compression can be derived from the chain rule. If the attention output gradient is $\\frac{\\partial L}{\\partial Q_t}$, and $Q_t = W^U_Q c_t^Q$, then $\\frac{\\partial L}{\\partial c_t^Q} = (W^U_Q)^T \\frac{\\partial L}{\\partial Q_t}$. Meanwhile, the gradient w.rt the original input $h_t$ accumulates from both query and also key/value paths. In essence, the low-rank projection adds an extra linear layer in the backward graph but doesn't change complexity order. The overhead of reconstructing queries in backward (a matrix multiply by $(W^U_Q)^T$) is minimal compared to the gains of storing much smaller tensors. The overall memory saved can be estimated: if $d_k=128$ and $r_Q=32$, for each token and head you store 32 elements instead of 128, a 4x reduction. Multiplied across all heads and all tokens in a sequence, this is a large drop in activation memory. This is particularly beneficial for long sequences or large batch sizes, where activation memory can be a limiting factor.\n", | |
"\n", | |
"**Impact:** By compressing queries, **activation checkpointing requirements are relaxed**, and memory usage is further optimized. This contributes to DeepSeek-V3’s ability to train at batch sizes and sequence lengths that might otherwise exhaust GPU memory ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS1#:~:text=For%20the%20attention%20queries%2C%20we,the%20activation%20memory%20during%20training)) ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S3.SS3.SSS1#:~:text=quantization%20accuracy,weights%2C%20we%20group%20and%20scale)). It’s an example of trading a bit of extra computation (recomputing full Q from $c^Q$ during backprop) to save memory, similar in spirit to traditional *activation recomputation* tricks. The result is a more memory-efficient training process, which, together with other memory management techniques (see Section 3.3), allows fitting this 671B model in available hardware without sacrificing throughput.\n", | |
"\n", | |
"#### 2.1.3 Attention Output and Overall Formula \n", | |
"After applying MLA’s projections, the attention mechanism proceeds with the compressed queries, full keys, and full values to compute outputs. The final attention output for token $t$ in a head is: \n", | |
"\n", | |
"$$\n", | |
"\\text{AttnOutput}_t = W^O \\Big(\\text{Attention}( \\tilde{Q}_t, K_{1:\\ell}, V_{1:\\ell} )\\Big),\n", | |
"$$\n", | |
"\n", | |
"where $\\tilde{Q}_t$ is the reconstructed query for token $t$, and $K_{1:\\ell}, V_{1:\\ell}$ are the keys/values for all tokens $1$ to $\\ell$ in the sequence (with past tokens possibly reconstructed from their latent caches). $W^O$ is the output projection matrix of the attention head (mapping from $d_k$ back to $d_{\\text{model}}$) ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS1#:~:text=Ultimately%2C%20the%20attention%20queries%20,the%20final%20attention%20output)). In practice, attention is computed as usual: $\\text{Attention}(Q,K,V) = \\text{softmax}\\big(QK^T/\\sqrt{d_k}\\big)V$. MLA does not change this formula, it just ensures $Q,K,V$ are obtained efficiently.\n", | |
"\n", | |
"**Validation:** The DeepSeek team reported that MLA yields *nearly identical* attention outputs as standard multi-head attention. They found that using $r$ around half or a third of $d_k$ retained model quality while drastically cutting memory ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS1#:~:text=decoupled%20key%20that%20carries%20Rotary,80)) ([DeepSeek-V3 Explained: Optimizing Efficiency and Scale](https://adasci.org/deepseek-v3-explained-optimizing-efficiency-and-scale/#:~:text=MLA%20enhances%20inference%20efficiency%20by,value%20storage%20requirements%20during%20inference)). By caching only latent vectors and recomputing keys/values on demand, they **minimized KV cache overhead during generation** ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS1#:~:text=decoupled%20key%20that%20carries%20Rotary,80)). During training, compressing queries **reduced activation memory** without adding instability. The success of MLA in DeepSeek-V2 and V3 demonstrates that substantial compression is possible in attention layers *without hurting performance*, reinforcing research that attention has redundancy that low-rank methods can exploit.\n", | |
"\n", | |
"**High-Level Impact:** MLA primarily targets **inference efficiency**, enabling long-context usage and fast generation by cutting memory bandwidth demands (less data to fetch for past tokens) and memory footprint ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS1#:~:text=decoupled%20key%20that%20carries%20Rotary,80)). This means faster response times and ability to deploy the model on hardware with limited memory per GPU. Secondarily, it contributes to training efficiency by lowering memory per token, which can translate to either larger batch sizes (improving GPU utilization) or the ability to train with longer sequences to improve model capability (DeepSeek-V3 was trained with some extended context lengths ([DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1#:~:text=match%20at%20L52%203,2%20Evaluation%20Results))). Overall, MLA is a crucial innovation that turns the attention mechanism from a scaling pain point into an efficient component aligned with the needs of ultra-large models.\n", | |
"\n", | |
"### 2.2 DeepSeek-MoE Architecture – Sparsely Activated Feed-Forward Networks \n", | |
"DeepSeek-V3’s feed-forward network (FFN) layers use a **Mixture-of-Experts (MoE)** architecture called *DeepSeekMoE* ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS1#:~:text=For%20Feed,the%20FFN%20output%20as%20follows)). Instead of a single FFN per layer as in a standard Transformer, MoE uses multiple expert FFNs and dynamically selects a subset to activate for each token. This allows the model to have an enormous number of parameters (many experts) while each token only incurs compute for a few experts, providing **computational sparsity**. DeepSeekMoE introduces two major enhancements over traditional MoE (like GShard ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS1#:~:text=For%20Feed,the%20FFN%20output%20as%20follows))): (1) *finer-grained experts with shared experts*: dividing model capacity into many smaller experts and designating some as globally shared, and (2) an **auxiliary-loss-free load balancing** strategy that maintains balanced expert utilization without the usual auxiliary loss penalty ([DeepSeek-V3 Explained: Optimizing Efficiency and Scale](https://adasci.org/deepseek-v3-explained-optimizing-efficiency-and-scale/#:~:text=MLA%20minimizes%20key,during%20inference)). We explore these in depth:\n", | |
"\n", | |
"#### 2.2.1 Finer-Grained Experts and Shared Expert Pool \n", | |
"**Finer Experts:** Instead of a few very large experts, DeepSeek-V3 employs a *larger number of smaller experts*. Let’s denote $E_{\\text{routed}}$ as the number of **routed experts** (experts that are chosen via the gating mechanism) and $E_{\\text{shared}}$ as the number of **shared experts** ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS1#:~:text=For%20Feed,the%20FFN%20output%20as%20follows)) ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS1#:~:text=where%20and%20denote%20the%20numbers,all%20selected%20affinity%20scores%20to)). In DeepSeekMoE, $E_{\\text{routed}}$ is high (many experts), but each expert’s capacity (hidden dimension) is relatively small, keeping the per-expert parameter count manageable. This fine granularity has two benefits: (1) it increases the diversity of experts (each can learn a niche, improving model capacity), and (2) it makes load balancing easier, because distributing tokens over many small experts tends to even out more naturally than a few large ones (Law of Large Numbers effect – more experts means random fluctuations in load average out).\n", | |
"\n", | |
"**Shared Experts:** A subset of experts are designated as *shared* ($E_{\\text{shared}}$) and are present across all expert groups ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS1#:~:text=For%20Feed,the%20FFN%20output%20as%20follows)) ([The DeepSeek Series: A Technical Overview](https://martinfowler.com/articles/deepseek-papers.html#:~:text=Next%2C%20they%20adopt%20a%20Mixture,forward%20blocks)). These shared experts handle “universal patterns” that almost any token might benefit from, such as very common linguistic patterns or basic knowledge. In implementation, a *shared expert* can be thought of as replicated across all GPUs or all MoE partitions so that any token can use it without incurring cross-device communication. The remaining experts are *routed* experts which are partitioned and each only accessible on specific devices, chosen by the gating for specialized processing.\n", | |
"\n", | |
"**Hybrid Expert Selection:** When computing the FFN output for a token $t$, DeepSeekMoE combines outputs from both shared and routed experts ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS1#:~:text=where%20and%20denote%20the%20numbers,all%20selected%20affinity%20scores%20to)). If $S$ is the set of shared experts and $R$ is the set of routed experts, with $|S| = E_{\\text{shared}}, |R| = E_{\\text{routed}}$, then:\n", | |
"\n", | |
"- The token’s feed-forward output is: \n", | |
" $$FFN(t) = \\sum_{e \\in S} g_{t,e}\\, \\text{Expert}_e(h_t) \\;+\\; \\sum_{e \\in R_t^{(K)}} g_{t,e}\\, \\text{Expert}_e(h_t),$$ \n", | |
" where $h_t$ is the input to the FFN layer for token $t$, $\\text{Expert}_e(\\cdot)$ denotes the $e$-th expert’s computation (e.g., an MLP on $h_t$), $g_{t,e}$ is the gating weight for expert $e$ on token $t$, and $R_t^{(K)}$ denotes the set of top-$K$ routed experts selected for token $t$ by the gating network ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS1#:~:text=and%20denote%20the%20,to%20produce%20the%20gating%20values)). The first sum runs over all shared experts (these might be always activated or also possibly go through gating but since they are “universal,” one can imagine the gating assigns some weight to them too). The second sum runs over the selected routed experts out of the many available.\n", | |
"\n", | |
"- Typically, $K$ (the number of experts activated per token) is a small number like 2 (DeepSeek-V3 likely uses $K=2$ as many MoE setups do). So each token gets output from $E_{\\text{shared}} + K$ experts in total, but $E_{\\text{shared}}$ outputs are available without routing overhead (since they are present on the same device), whereas the $K$ routed ones may involve communication if those experts are on other devices.\n", | |
"\n", | |
"**Why Shared Experts?** Shared experts help with two things: **communication reduction** and **learning general features**. Communication is reduced because if some portion of a token’s model capacity comes from local shared experts, the fraction of forward pass that requires sending data to another GPU for an expert is lower. In extreme case, if a token’s needs can be largely addressed by shared experts, it might not need as many routed experts (thus saving an all-to-all exchange). From a learning perspective, having shared experts allows the model to devote some experts to tasks like “common sense reasoning” or “basic syntax” that are useful globally, while routed experts can specialize (e.g., one expert might specialize in code, another in medical text, etc.). DeepSeek-V3 found that such a division (introduced in V2) improved training efficiency and stability ([The DeepSeek Series: A Technical Overview](https://martinfowler.com/articles/deepseek-papers.html#:~:text=Next%2C%20they%20adopt%20a%20Mixture,forward%20blocks)).\n", | |
"\n", | |
"**Example:** Suppose DeepSeek-V3 has 32 experts per MoE layer per device group, out of which 4 are shared and 28 are routed (just a hypothetical ratio). Every token’s FFN layer passes through the 4 shared experts (which are identical on all devices) plus the top-$K$ of the 28 routed experts chosen specifically for that token. If a token’s gating scores indicate it’s a very generic token, perhaps the gating will mostly utilize the shared experts (which are likely local), whereas if it’s a niche token, it will also pull in one or two specialist experts possibly from another node. This flexibility yields both **better utilization** (shared experts are always used by all tokens, so they stay busy and justify their replication cost) and **reduced average communication** (because not *every* token uses off-node experts for both of its expert slots, some get satisfied by shared ones).\n", | |
"\n", | |
"**Impact:** Finer-grained experts mean the model can scale to *hundreds of experts per layer* without individual experts becoming too large to train or too rarely used. It increases total parameters (and thus model capacity) almost linearly while keeping the per-token computation roughly constant (since $K$ is fixed). The introduction of shared experts adds a modest overhead (since they effectively are always computed for each token, adding a fixed extra cost) but this is offset by their high utility and the reduction in needed routed expert calls. Overall, the DeepSeekMoE architecture with shared experts achieves *better sample efficiency* and *lower communication overhead* than a naive MoE design ([The DeepSeek Series: A Technical Overview](https://martinfowler.com/articles/deepseek-papers.html#:~:text=used%29)) ([The DeepSeek Series: A Technical Overview](https://martinfowler.com/articles/deepseek-papers.html#:~:text=They%20further%20limit%20cross,basic%20process%20is%20as%20follows)). This architecture was validated in DeepSeek-V2 and carried into V3, proving crucial for scaling to 671B parameters economically ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=DeepSeek,demonstrating%20their%20capability%20to)) ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=Latent%20Attention%C2%A0%28MLA%29%C2%A0%28DeepSeek,For%20other%20minor)).\n", | |
"\n", | |
"#### 2.2.2 Auxiliary-Loss-Free Load Balancing via Dynamic Bias \n", | |
"A classic challenge with MoE models is **load balancing**: ensuring that tokens are evenly distributed among experts. Without balance, some experts might get too many tokens (causing others to be underutilized, and possibly causing *routing collapse* where the model learns to use only a few experts) ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=For%20MoE%20models%2C%20an%20unbalanced,and%20add%20it%20to%20the)). Traditional solutions add an *auxiliary loss* term to the training objective that penalizes unequal expert usage (as in GShard or BASE Layers) ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=For%20MoE%20models%2C%20an%20unbalanced,and%20add%20it%20to%20the)) ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=expert%20parallelism,K%20routing)). However, tuning this loss is tricky: too strong hurts model quality, too weak fails to balance ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=loss%C2%A0%28Fedus%20et%C2%A0al,K%20routing)). DeepSeek-V3 **pioneers an auxiliary-loss-free strategy** that achieves balance by adjusting the gating *behavior* directly rather than adding a loss penalty ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=performance%C2%A0%28Wang%20et%C2%A0al,K%20routing)) ([DeepSeek-V3 Explained: Optimizing Efficiency and Scale](https://adasci.org/deepseek-v3-explained-optimizing-efficiency-and-scale/#:~:text=)). The core idea is to introduce a **bias term** for each expert that dynamically adjusts gating scores to encourage underused experts and discourage overused ones ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=load%20balance%20and%20model%20performance%2C,K%20routing)).\n", | |
"\n", | |
"**Bias in Gating:** Let $a_{t,e}$ be the affinity score (pre-gating activation) of expert $e$ for token $t$ as computed by the gating network (e.g., a feed-forward that scores how well each expert suits the token). In DeepSeekMoE, they use a sigmoid to compute affinities and then pick top-$K$ ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS1#:~:text=scores%20calculated%20for%20the%20,to%20produce%20the%20gating%20values)). Now, introduce a learnable (but here *dynamically updated*, not fixed learned) bias $b_e$ for each expert $e$. The *biased affinity* is: \n", | |
"\n", | |
"$$ a'_{t,e} = a_{t,e} + b_e. \\tag{Bias-adjusted score} $$\n", | |
"\n", | |
"The top-$K$ selection is done based on $a'_{t,e}$ instead of $a_{t,e}$ ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=load%20balance%20and%20model%20performance%2C,K%20routing)) ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=specific%2C%20we%20introduce%20a%20bias,K%20routing)). However, **important:** when computing the final gating weights $g_{t,e}$ that scale the expert outputs, DeepSeek-V3 uses the original score $a_{t,e}$ (normalized among selected experts) so that the bias does *not* directly alter the contribution magnitudes ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=Note%20that%20the%20bias%20term,will%20decrease%20the%20bias%20term)) ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=Note%20that%20the%20bias%20term,balance%20through%20pure%20auxiliary%20losses)). The bias only influences which experts get selected.\n", | |
"\n", | |
"**Dynamic Update Rule:** The biases $b_e$ are *not static parameters* trained by gradient descent in the usual way. Instead, they are updated after each training step using a simple heuristic: \n", | |
"- If expert $e$ was *overloaded* (i.e., it received more tokens this step than an even share), **decrease** $b_e$ by a small amount $\\Delta$. \n", | |
"- If expert $e$ was *underused* (fewer tokens than average), **increase** $b_e$ by $\\Delta$ ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=be%20multiplied%20with%20the%20FFN,balance%20through%20pure%20auxiliary%20losses)). \n", | |
"\n", | |
"Here $\\Delta$ is a hyperparameter called the *bias update speed*. In the implementation, they monitored the expert load counts each step; suppose in a batch of $N$ tokens (summed over all GPUs) and $E_{\\text{routed}}$ experts, the ideal fair load per expert is $N/E_{\\text{routed}}$. If expert $e$ got $n_e$ tokens this step:\n", | |
"- If $n_e > N/E_{\\text{routed}}$ (overloaded), set $b_e := b_e - \\eta$,\n", | |
"- If $n_e < N/E_{\\text{routed}}$ (underloaded), set $b_e := b_e + \\eta$,\n", | |
"with $\\eta$ a small constant (e.g. 0.001) ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=be%20multiplied%20with%20the%20FFN,balance%20through%20pure%20auxiliary%20losses)) ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=match%20at%20L1142%20to%20at,3)).\n", | |
"\n", | |
"This effectively performs a *feedback control*: experts that were over-used get a negative bias (making them slightly less likely to be chosen next time), and under-used experts get a positive boost (making them more likely to be selected) ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=be%20multiplied%20with%20the%20FFN,balance%20through%20pure%20auxiliary%20losses)). Over many steps, this nudges the system toward equal usage.\n", | |
"\n", | |
"**Interpretation as Gradient Descent:** Although not explicitly framed as such, this bias update can be seen as (a) optimizing a balancing objective or (b) performing an online algorithm:\n", | |
"- Imagine a loss $L_{\\text{bal}} = \\sum_e (n_e - \\frac{N}{E_{\\text{routed}}})$ measuring imbalance. The update $b_e := b_e - \\eta\\,\\text{sgn}(n_e - N/E)$ is like a gradient step on a piecewise-linear surrogate of that loss with respect to $b_e$ (noting that increasing $b_e$ increases $n_e$ in expectation). It’s a heuristic rather than exact gradient because $n_e$ doesn’t have a simple derivative w.rt $b_e$, but the sign is used as an indicator.\n", | |
"- It’s akin to a multi-armed bandit or reinforcement signal where each expert’s “reward” is whether it was under-target (so we encourage it) or over-target (so we discourage it). This ensures a self-correcting gating mechanism.\n", | |
"\n", | |
"**Auxiliary Loss Removal:** Thanks to this bias strategy, DeepSeek-V3 was able to *remove the usual MoE auxiliary loss terms entirely* ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=additionally%20introduce%20an%20auxiliary,the%20details%20of%20MLA%20and)) ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=,Free%20Balancing%20Strategy)). In ablations, they found that the *aux-loss-free strategy achieved better model quality than using auxiliary loss* for balance ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=In%20Table%C2%A05%2C%20we%20show%20the,other%20architectures%20the%20same%2C%20we)) ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=Pile,shot%2010.9%2011.1%2037.2%2039.6)). This is likely because auxiliary loss, no matter how tuned, imposes a penalty on the main loss landscape, potentially interfering with the model’s optimization for its primary task. The bias method decouples balancing from the main loss – it’s an algorithmic fix rather than a training objective change, so it *avoids any trade-off with model performance* ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=Table%205%3A%20%20Ablation%20results,most%20of%20the%20evaluation%20benchmarks)). Indeed, Table 5 of the report shows consistently equal or higher accuracy on various benchmarks when using aux-loss-free bias vs. traditional aux-loss gating ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=Pile,shot%2010.9%2011.1%2037.2%2039.6)) ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=Table%205%3A%20%20Ablation%20results,most%20of%20the%20evaluation%20benchmarks)).\n", | |
"\n", | |
"**Ensuring Convergence:** The hyperparameter $\\eta$ (bias update speed) was set carefully. The team used $\\eta=0.001$ for the first 14.3T tokens of training and then set $\\eta=0$ afterward (freezing biases) ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=match%20at%20L1142%20to%20at,3)). This suggests that early in training, they allowed biases to evolve and find a good equilibrium, and once the model had learned to utilize experts evenly, they stopped updating biases to lock in the balance. This prevents oscillations late in training and ensures the gating is stable during the critical final phase of pretraining. We can reason that with a small $\\eta$, the bias updates form a slow adaptation loop that converges when each expert’s overload/underload condition balances out, i.e., $n_e \\approx N/E$ for all $e$. If the system converges perfectly, the bias updates become zero because no expert is flagged as consistently over- or under-loaded.\n", | |
"\n", | |
"**Example Calculation:** Suppose we have 10 experts and in a certain step of 100 tokens, an ideal load per expert is 10 tokens. If expert 3 got $n_3=18$ tokens (very popular) and expert 7 got $n_7=2$ tokens (underused), we would do $b_3 := b_3 - 0.001$ and $b_7 := b_7 + 0.001$. Next forward pass, those changes in bias make expert 3’s score slightly lower relative to others, and expert 7’s score slightly higher, so we expect more tokens will go to expert 7 and fewer to expert 3. Over many steps, these biases accumulate (bounded by how long the update is applied) to a point where it’s hard for any one expert to deviate too far from the pack. This method is **simple, fast, and distributed** (each GPU can compute load counts and adjust biases for experts it hosts without heavy synchronization, just using global counts).\n", | |
"\n", | |
"**Complementary Sequence-Wise Balance:** In addition to the global bias scheme, DeepSeek-V3 employs a *very small weight* **sequence-wise balance loss** as a safety net ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=%23%23%20Complementary%20Sequence)) ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=where%20the%20balance%20factor%20is,each%20sequence%20to%20be%20balanced)). This auxiliary loss (ironically, they still have a tiny one) acts on a per-sequence basis: it encourages that within a single sequence of tokens, the distribution of those tokens across experts is balanced. Formally, they define a loss $L_{\\text{seq-bal}}$ that penalizes if in one sequence (one training sample) too many tokens went to the same expert ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=Although%20DeepSeek,wise%20balance%20loss)). The indicator function in eq. (17)-(20) likely counts tokens per expert per sequence and encourages those counts to be uniform ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=)). The coefficient for this loss is extremely small (they mention setting the balance factor to $10^{-4}$) ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=match%20at%20L1142%20to%20at,3)), just enough to avoid degenerate cases where, say, within one long sentence the gating picks one expert for all tokens (which could happen even if overall global usage is balanced). Essentially, this prevents *intra-sequence collapse* without affecting global dynamics much.\n", | |
"\n", | |
"**Node-Limited Routing:** Another part of load balancing (and also related to communication) is **node-limited routing** ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=%23%23%20Node)). This means a token is restricted to be routed to experts on at most $M$ distinct *nodes*. In DeepSeek-V3, they set $M=4$ (each token’s experts can reside on no more than 4 different GPU nodes) ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=match%20at%20L1142%20to%20at,3)). The way this is enforced: first, group experts by the node (machine) they live on. The gating algorithm computes for each node the highest affinity score among that node’s experts for the token. Then it picks the top $M$ nodes by these scores, and only considers experts on those nodes for final top-$K$ selection ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=Like%20the%20device,communication%20overlap)). This limits cross-node communication because a token won’t be sent to an unbounded number of machines – even if the best experts are scattered widely, the token will only travel to at most $M$ nodes. The paper notes this constraint still allows *nearly full computation-communication overlap* in their MoE training framework ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=Like%20the%20device,communication%20overlap)), which indicates that $M$ was large enough (and the cluster well-connected enough) to not bottleneck training. Essentially, node-limited routing is a form of locality bias: prefer experts that are on fewer nodes, which is a trade-off between picking the absolute best expert vs. a slightly worse expert that is local or on one of a limited set of nodes.\n", | |
"\n", | |
"**No Token Dropping:** Because the load balancing was so effective, DeepSeek-V3 was able to **avoid token dropping** entirely ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=%23%23%20No%20Token)). In some MoE implementations, when too many tokens route to one expert (exceeding a capacity), the lowest-score tokens among them are dropped (not processed by that expert) to limit load, at the cost of model quality for those tokens (they get only one expert instead of two, etc.). DeepSeek-V3 reports that throughout training they did *not drop any tokens* due to overload ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=%23%23%20No%20Token)). Every token always got its $K$ experts. This is a direct testament to the effectiveness of the bias-based balancing: by preventing overloads, there was no need for the brute-force measure of dropping. They also took care at inference to deploy strategies that maintain balance (perhaps similar bias or ensuring each request is balanced across experts), so even during inference no tokens are dropped ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=%23%23%20No%20Token)). Not dropping tokens means *maximal utilization of model capacity* – every token benefits from the full MoE power without fallback – and avoids complications in training dynamics that dropping can cause.\n", | |
"\n", | |
"**Impact:** The auxiliary-loss-free load balancing strategy is one of DeepSeek-V3’s crown jewels. It **improves model performance** (since it removed a detrimental loss term) while **keeping experts uniformly utilized** (preventing collapse and ensuring training efficiency) ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=Table%205%3A%20%20Ablation%20results,most%20of%20the%20evaluation%20benchmarks)). Balanced usage also maximizes hardware efficiency: if experts were imbalanced, some GPUs would be idle (hosting an expert that gets few tokens) while others overload, which is inefficient. By keeping loads even, all experts/GPUs get roughly equal work, avoiding bottlenecks. Moreover, bias-based routing is *lighter-weight* than computing complex loss gradients for balancing – it’s an $O(E)$ operation per step to update biases, which is negligible compared to the rest of training. This method could be seen as an *algorithmic innovation that sidesteps a machine learning training problem (balance loss)*, thereby **achieving both faster training and better final accuracy** ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=Table%205%3A%20%20Ablation%20results,most%20of%20the%20evaluation%20benchmarks)). Future MoE systems can adopt similar biasing strategies to simplify training large experts networks.\n", | |
"\n", | |
"### 2.3 Other Notable Architectural Features (MTP Objective) \n", | |
"*DeepSeek-V3 also introduced a Multi-Token Prediction (MTP) training objective to improve data efficiency ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=auxiliary,have%20observed%20to%20enhance%20the)) ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=2.2%20Multi)). While not a hardware optimization per se, MTP “densifies” training signals by having the model predict multiple future tokens at once in certain training steps, potentially improving convergence speed. The MTP module (a small auxiliary model branch) is used only during training and discarded at inference, so it adds negligible runtime cost ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=MoE%20model%20comprising%20228,most%20of%20the%20evaluation%20benchmarks)) ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=1,most%20of%20the%20evaluation%20benchmarks)). Ablations showed MTP gave a boost to evaluation metrics across scales ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=specific%2C%20we%20validate%20the%20MTP,most%20of%20the%20evaluation%20benchmarks)) ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=the%20inference%20costs%20of%20the,most%20of%20the%20evaluation%20benchmarks)). We mention MTP for completeness: it exemplifies how even training objectives were adjusted to squeeze more out of each sample, contributing to DeepSeek-V3’s overall efficiency (better performance per token consumed). However, the primary focus of this guide remains on the *systems and hardware-oriented optimizations* which we turn to next.*\n", | |
"\n", | |
"---\n", | |
"\n", | |
"## 3. Training Framework and System Optimizations \n", | |
"\n", | |
"DeepSeek-V3’s success owes much to *HPC-grade training optimizations* beneath the model architecture. Running a 671B-parameter MoE model on a cluster of GPUs requires overcoming bandwidth limits, latency, and memory constraints. The team employed a **co-designed training framework** ([The DeepSeek Series: A Technical Overview](https://martinfowler.com/articles/deepseek-papers.html#:~:text=Co,PTX%20Optimizations)) ([The DeepSeek Series: A Technical Overview](https://martinfowler.com/articles/deepseek-papers.html#:~:text=DualPipe%2C%20PTX,extreme%20scale%20with%20stable%20training)), meaning they tailored the parallelization and communication algorithms closely to the model’s needs and the hardware’s capabilities. Key components include the **DualPipe** pipeline parallel schedule to overlap compute and communication, optimized **all-to-all communication** for expert exchange, **memory saving techniques** (like selective recomputation and CPU offload), and a cutting-edge **FP8 mixed-precision** training scheme. In this section, each optimization is dissected, showing how it works and its effect on scalability.\n", | |
"\n", | |
"### 3.1 DualPipe Pipeline Parallelism – Overlapping Computation & Communication \n", | |
"**The Challenge:** Training a model of DeepSeek-V3’s scale inevitably requires splitting the model across multiple GPUs. In addition to MoE (which distributes experts), they use **pipeline parallelism** to partition different layers across devices ([DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1#:~:text=,Communication%20Overlap)). Traditional pipeline parallelism (as in GPipe or PipeDream) can suffer from *pipeline bubbles*, where some stages wait idle for others, underutilizing GPUs. Also, when combining pipeline with MoE, there are substantial **all-to-all communications** between layers (every micro-batch might need to exchange expert outputs). If done naively, communication can stall the pipeline. The *DualPipe* algorithm was created to address these issues by reordering and overlapping computations and communications so that the GPUs are busy nearly 100% of the time ([DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1#:~:text=meticulous%20engineering%20optimizations,and%20backward%20processes%2C%20thereby%20addressing)) ([DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1#:~:text=has%20fewer%20pipeline%20bubbles,and%20backward%20processes%2C%20thereby%20addressing)).\n", | |
"\n", | |
"#### 3.1.1 Pipeline Parallelism Basics and Bottlenecks \n", | |
"In a pipeline parallel setup, the model’s layers are divided among $P$ pipeline stages (each stage could be 1 or more contiguous layers on a GPU or group of GPUs). The input minibatch is split into $N$ micro-batches. In the naive schedule, micro-batch 1 is processed by stage 1, then stage 2, …, stage $P$ (forward propagation), then backward from stage $P$ back to 1. To keep GPUs busy, one uses a **1F1B** (one-forward-one-back) schedule or variations: while stage 2 is doing forward on micro-batch 2, stage 1 can start backward on micro-batch 1, etc. However, even with 1F1B, there are idle times at the beginning and end of the pipeline (the first stage is idle after sending its last forward until backprop comes, etc.). Generally, for $P$ stages and $N$ micro-batches, the total time $T_{\\text{pipeline}} \\approx (N + P - 1) \\cdot T_{\\text{stage}}$, where $T_{\\text{stage}}$ is time a single stage takes for one micro-batch ([DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1#:~:text=burden%2C%20DualPipe%20still%20exhibits%20efficiency,reduces%20the%20pipeline%20bubbles%20while)). The extra $(P-1)$ factor is the bubble overhead. For large $P$, this overhead can be significant if $N$ is not much larger than $P$. Moreover, if there are communication operations between stages (e.g., shuffling expert outputs), these can further serialize operations.\n", | |
"\n", | |
"**Pipeline with MoE:** In DeepSeek-V3, each micro-batch after an attention sub-layer must send tokens to experts across nodes (the MoE all-to-all dispatch). This typically happens in the middle of the layer. So the sequence for a micro-batch might be: do attention on stage i, then perform an all-to-all exchange of expert inputs to appropriate GPUs, then do expert FFNs, then another all-to-all to gather outputs, then continue. If done straightforwardly, these all-to-all operations would pause the pipeline stage’s computation while communication happens, and potentially also interfere with other stages (network contention). The challenge is to hide this communication time under useful work.\n", | |
"\n", | |
"#### 3.1.2 DualPipe Algorithm – Bidirectional Overlap \n", | |
"**Key Idea:** DualPipe introduces *bidirectional scheduling* and fine-grained chunking of each micro-batch’s work to overlap forward and backward passes and communications ([DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1#:~:text=The%20key%20idea%20of%20DualPipe,attention%20and%20MLP%20are%20further)) ([DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1#:~:text=execution,of%20communications%20can%20be%20fully)). The idea is to feed micro-batches into the pipeline from **both ends** of the pipeline simultaneously. That is, while the first micro-batches are injected at stage 1 going forward, simultaneously the last micro-batches are injected at stage $P$ going backward. This creates two meeting flows of computation. By the time they meet, most of the pipeline is filled in both directions, drastically reducing bubbles ([DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1#:~:text=,enclosed%20by%20a%20shared%20black)). This concept is similar to “pipe dream” flush/flush or interleaved schedule, but DualPipe extends it with comm overlap.\n", | |
"\n", | |
"**Chunking and Overlap:** Each micro-batch’s work is divided into **four segments**: (1) Attention computation, (2) MoE All-to-All dispatch (communication) + maybe small overhead, (3) MLP (expert computation), (4) All-to-All combine (communication) ([DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1#:~:text=The%20key%20idea%20of%20DualPipe,attention%20and%20MLP%20are%20further)). In the **backward pass**, segments (1) and (3) themselves are split into subparts (backward for attention, backward for MLP) allowing even more interleaving ([DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1#:~:text=The%20key%20idea%20of%20DualPipe,attention%20and%20MLP%20are%20further)). DualPipe schedules these such that while one micro-batch is doing segment (2) (comm), another micro-batch (either in forward on another stage or backward on a different stage) can do segment (1) or (3) (compute). For example, Stage i could be computing attention for micro-batch k at the same time as the All-to-All for micro-batch k-1 is happening, effectively **hiding communication under compute** ([DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1#:~:text=memory%20usage,scales%20up%2C%20as%20long%20as)) ([DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1#:~:text=parallelism%20algorithm%20called%20DualPipe%2C%20which,also%20reduces%20the%20pipeline%20bubbles)).\n", | |
"\n", | |
"**Bidirectional Fill:** Figure 5 in the paper (described in text) shows micro-batches fed from both pipeline ends ([DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1#:~:text=execution,of%20communications%20can%20be%20fully)). Concretely, suppose $P=8$ pipeline stages and $N=20$ micro-batches ([DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1#:~:text=,enclosed%20by%20a%20shared%20black)). DualPipe would start micro-batch 1 on stage 1 (forward) and micro-batch 20 on stage 8 (backward) at time 0. Then micro-batch 2 on stage 1, micro-batch 19 on stage 8, etc., with appropriate delays so that by the time the flows meet, all stages have work. This yields far fewer idle gaps. It essentially halves the pipeline startup and teardown overhead compared to one-direction scheduling. The result is that the pipeline bubble count is greatly reduced ([DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1#:~:text=burden%2C%20DualPipe%20still%20exhibits%20efficiency,reduces%20the%20pipeline%20bubbles%20while)). The authors compare: Standard 1F1B has $P-1$ bubbles; another method “Zigzag (ZB) 1F1B” had some improvement; DualPipe has even less ([DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1#:~:text=burden%2C%20DualPipe%20still%20exhibits%20efficiency,reduces%20the%20pipeline%20bubbles%20while)). In Table 2, DualPipe shows significantly fewer bubbles than prior methods like 1F1B or Chimera ([DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1#:~:text=burden%2C%20DualPipe%20still%20exhibits%20efficiency,reduces%20the%20pipeline%20bubbles%20while)) ([DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1#:~:text=only%20increasing%20the%20peak%20activation,DualPipe%20only%20requires%20that%20the)).\n", | |
"\n", | |
"**Throughput Formula:** Although not explicitly given, we can derive that DualPipe’s throughput is close to ideal. If implemented perfectly, the pipeline might achieve a throughput $T_{\\text{DualPipe}} \\approx (N + \\frac{P}{2} - 1)\\,T_{\\text{stage}}$ for large $N$ (since effectively two micro-batches enter per cycle once both ends are active). This is roughly half the overhead for large $P$. The paper notes that even for small micro-batch counts, DualPipe keeps high efficiency ([DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1#:~:text=batches%20to%20be%20divisible%20by,batches)) (the bubbles do not grow with more micro-batches beyond a point).\n", | |
"\n", | |
"**Overlapping Communication:** DualPipe specifically overlaps the **MoE communication** with computation. Because each micro-batch’s forward pass on a stage consists of [Attention -> DispatchComm -> MLP -> CombineComm], DualPipe can arrange, for instance, that while *micro-batch i* is in the DispatchComm phase (which involves network transfer of token embeddings to experts), the same stage can already start computing the Attention for *micro-batch i+1*. Similarly, during *CombineComm* of micro-batch i, the stage might start on the next micro-batch’s MLP or send backward signals for a previous micro-batch. This way, **communication latency is hidden behind compute** as much as possible ([DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1#:~:text=memory%20usage,scales%20up%2C%20as%20long%20as)) ([DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1#:~:text=The%20key%20idea%20of%20DualPipe,attention%20and%20MLP%20are%20further)). The result is that communication rarely causes a stall: as long as there is compute to do (and DualPipe ensures there is, by staggering tasks appropriately), the expensive all-to-all transfers happen in parallel.\n", | |
"\n", | |
"In effect, DualPipe turns what could be sequential segments: [Compute, Comm, Compute, Comm] into a *pipelined set of its own*: it treats those four segments almost like micro-stages that can overlap between successive micro-batches. A simplified timeline for two micro-batches A and B on one stage might look like: \n", | |
"\n", | |
"- Time 0: Compute Attention (A). \n", | |
"- Time 1: Start DispatchComm (A) **and** simultaneously Compute Attention (B). \n", | |
"- Time 2: Start MLP (A) while B is maybe finishing attention, etc., and maybe DispatchComm (B) overlaps with CombineComm (A)... \n", | |
"\n", | |
"By the time A is in CombineComm, B is in MLP, and perhaps C has started attention, etc., achieving full overlap.\n", | |
"\n", | |
"**Memory Trade-off:** One cost of DualPipe is that each pipeline stage needs to temporarily hold data for micro-batches coming from both directions. Also, the model weights might be effectively replicated (each pipeline stage might need two copies of the layer weights to apply them concurrently to two different micro-batches in forward/backward) ([DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1#:~:text=only%20increasing%20the%20peak%20activation,DualPipe%20only%20requires%20that%20the)). The paper mentions DualPipe requires keeping *two copies of the model parameters* to enable simultaneous forward and backward on the same stage ([DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1#:~:text=match%20at%20L673%20only%20increasing,Compared)). This doubles the memory for model weights on each stage. However, since DeepSeek-V3 uses large expert parallelism, the relative overhead is not too bad (they say it “does not significantly increase memory consumption” because they had a large EP size) ([DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1#:~:text=match%20at%20L673%20only%20increasing,Compared)). The *peak activation memory* did increase, by some factor (number missing in text, presumably 2x or similar) ([DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1#:~:text=burden%2C%20DualPipe%20still%20exhibits%20efficiency,reduces%20the%20pipeline%20bubbles%20while)), because a stage now holds activations for two micro-batches at once. But they argue this is manageable, and importantly, unlike some other pipeline schemes, the activation memory *does not increase further as micro-batch count grows* ([DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1#:~:text=match%20at%20L678%20batches%20to,batches)). Once you’re doing two at a time, you don’t need more. So DualPipe trades a bit more memory for a lot more throughput.\n", | |
"\n", | |
"**Comparison:** Another known strategy “Chimera” (Li & Hoefler 2021) also overlaps two pipelines but had strict requirements (like micro-batch count must be divisible by pipeline length) ([DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1#:~:text=the%20memory%20consumption%20since%20we,DualPipe%20only%20requires%20that%20the)). DualPipe is more flexible (no such requirement) and achieves overlap with just two copies of weights, not more. And whereas Chimera might have more overhead or complexity, DualPipe was tailored to MoE’s specific pattern, focusing on overlapping the all-to-all comm.\n", | |
"\n", | |
"**Impact:** In practice, DualPipe yielded a **substantial speedup** for training DeepSeek-V3. It allowed nearly *linear scaling* as the model size (and thus number of pipeline stages) increased ([DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1#:~:text=memory%20usage,scales%20up%2C%20as%20long%20as)) ([DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1#:~:text=most%20of%20the%20communication%20during,scales%20up%2C%20as%20long%20as)). The communication between pipeline stages (like gradients, activations passing forward/back) and MoE comm were largely hidden, meaning adding more devices (and hence splitting model into more stages) did not hurt throughput as much as it normally would. This is crucial for scaling to 2048 GPUs efficiently ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=Through%20the%20co,to%20further%20scale%20up%20the)). The overlap ensures that even if e.g. InfiniBand latency is high, it’s masked by compute on GPUs, achieving high utilization. By drastically reducing pipeline bubbles, they improved the *time-to-train per batch*. The authors note that with these optimizations, each trillion tokens of training took only 180K GPU-hours on their cluster (3.7 days on 2048 H800s) ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=algorithms%2C%20frameworks%2C%20and%20hardware,Combined%20with)), which indicates extremely high hardware utilization. DualPipe is a prime example of **algorithmically increasing parallel efficiency** — it’s like turning the pipeline into an assembly line with no slowdowns. For anyone training large models, DualPipe demonstrates how careful scheduling can overcome what would otherwise be inherent latency costs.\n", | |
"\n", | |
"#### 3.1.3 Derivation: Pipeline Throughput with DualPipe \n", | |
"To quantify DualPipe’s benefit, consider a simple model: each stage takes 1 time unit to process a micro-batch’s forward or backward segment. With 1F1B (one forward, one backward interleaved) and $P$ stages, $N$ micro-batches, the total schedule length in time units is $T_{1F1B} = N + (P - 1)$ (assuming $N$ is large enough to fill pipeline) ([DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1#:~:text=burden%2C%20DualPipe%20still%20exhibits%20efficiency,reduces%20the%20pipeline%20bubbles%20while)). Now for DualPipe, effectively two micro-batches can be processed in different directions simultaneously once the pipeline is filled from both ends. The schedule length might approximate $T_{\\text{DualPipe}} \\approx N + \\alpha (P - 1)$ for some $\\alpha < 1$. If $\\alpha \\approx 0.5$, and $P$ is large, the savings is almost 2x on the pipeline overhead term. For example, if $P = 8, N = 20$: \n", | |
"- 1F1B would take $20 + 7 = 27$ units (with 7 units of bubble overhead). \n", | |
"- DualPipe might take $20 + 0.5*7 \\approx 23.5$ units (roughly cutting bubbles in half). \n", | |
"\n", | |
"This is consistent with what they report: significantly reduced bubbles ([DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1#:~:text=burden%2C%20DualPipe%20still%20exhibits%20efficiency,reduces%20the%20pipeline%20bubbles%20while)). For very large $N \\gg P$, the difference in percentage is small, but for moderate $N$ (in large scale training, $N$ might be limited by batch size considerations), it’s meaningful. DualPipe especially shines when pipeline depth is high (many stages) and micro-batch count per mini-batch is not extremely high. DeepSeek-V3 likely had pipeline parallel degree that multiplied with data parallel etc., so optimizing this was critical.\n", | |
"\n", | |
"#### 3.1.4 Stability and Scalability \n", | |
"One noteworthy outcome: DeepSeek-V3’s training was *remarkably stable*, with no loss spikes or need to rollback checkpoints ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=performance%20comparable%20to%20leading%20closed,V3)). DualPipe’s overlapping of forward/backward could have risked complicating dependency ordering and possibly stability, but evidently it was handled cleanly (e.g., by ensuring correct ordering of weight updates and gradients). They managed to run thousands of GPUs with near full utilization without divergence. This speaks to the maturity of their pipeline engine. Scalability-wise, DualPipe plus other optimizations allowed them to effectively hide the slower inter-node communication as the model scaled. The paper asserts that as the model scales up, *as long as there are enough micro-batches relative to pipeline stages, the overhead remains hidden* ([DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1#:~:text=memory%20usage,scales%20up%2C%20as%20long%20as)). In other words, DualPipe makes training time scale almost linearly with more layers (which add more compute) rather than being dominated by pipeline stall costs. This property was essential to train such a huge model in ~2 months.\n", | |
"\n", | |
"In summary, **DualPipe pipeline parallelism** was a breakthrough that maximized GPU usage by carefully choreographing tasks. It’s a masterclass in how to **schedule parallel work** to mitigate latency: overlap communications with computations, interleave forward/backward, and exploit every ounce of parallelism. For practitioners, it illustrates that sometimes one must go beyond stock parallel training algorithms and develop custom schedules aligned to the model’s structure (like splitting attention/MLP phases) to reach peak performance.\n", | |
"\n", | |
"### 3.2 Communication Strategies and CUDA/PTX-Level Optimizations \n", | |
"DeepSeek-V3’s distributed training spans **multiple nodes**, and MoE in particular demands heavy communication: each forward pass involves sending token representations to different experts across GPUs and then gathering results back. Without careful optimization, this **all-to-all communication** can become a bottleneck, especially since inter-node bandwidth (InfiniBand) is much lower than intra-node (NVLink) ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=gating%20algorithm%20and%20the%20network,To%20effectively%20leverage%20the%20different)). The team tackled this with a combination of **hierarchical communication algorithms, low-level CUDA optimizations, and topology-aware routing**. They also leveraged *overlap* (as described in DualPipe) so communication latency is hidden by compute. In this section, we detail the cross-node all-to-all procedure and the PTX-level tweaks that reduced its cost, as well as how they balanced network usage to avoid saturating links.\n", | |
"\n", | |
"#### 3.2.1 Cross-Node All-to-All for MoE Expert Dispatch \n", | |
"**All-to-All in MoE:** During the MoE feed-forward layer, each token (on whatever GPU it resides after attention) needs to be dispatched to the GPUs holding its selected experts. Suppose after attention, GPU A has a batch of token representations, and token X on A needs expert #5 which lives on GPU B. Meanwhile, token Y on GPU B might need expert #8 on GPU A, etc. To compute all experts, a global exchange happens: each GPU sends each other GPU the token vectors that those others need. This is a classic **all-to-all communication** pattern: every participant sends data to every other participant. In DeepSeek-V3, because the model runs on many GPUs across nodes, this all-to-all spans both intra-node (between GPUs on the same server via NVLink) and inter-node (via InfiniBand) links ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=gating%20algorithm%20and%20the%20network,To%20effectively%20leverage%20the%20different)). \n", | |
"\n", | |
"**Efficient All-to-All Implementation:** They optimized this by splitting it into two phases, aligning with hardware capabilities:\n", | |
"- **Intra-node**: Use high-bandwidth NVLink for shuffles among GPUs on the same node.\n", | |
"- **Inter-node**: Use InfiniBand for shuffles across nodes, but minimize usage by possibly aggregating or restricting as per node-limited routing (Section 2.2.2).\n", | |
"\n", | |
"The network topology of their cluster is fully interconnected with InfiniBand between nodes, and NVLink within a node ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=gating%20algorithm%20and%20the%20network,To%20effectively%20leverage%20the%20different)). NVLink is ~160 GB/s vs IB ~50 GB/s ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=gating%20algorithm%20and%20the%20network,To%20effectively%20leverage%20the%20different)). They likely exploited this by first gathering tokens destined to the same node, then sending one bigger message node-to-node, then distributing within the node.\n", | |
"\n", | |
"One approach is **hierarchical all-to-all**:\n", | |
"1. **Local shuffle**: within each node, group tokens by target node. For example, on each node, collect all token vectors that need to go to node 2, all that need to go to node 3, etc. This uses fast NVLink to rearrange data among GPUs of the node.\n", | |
"2. **Inter-node exchange**: perform all-to-all between nodes (each node sends the grouped data to the destination node in one or few messages). This goes over InfiniBand.\n", | |
"3. **Local distribution**: once a node receives tokens from others that are for experts on its GPUs, it then distributes those tokens to the specific GPU/expert locally (again via NVLink).\n", | |
"\n", | |
"This hierarchical method can significantly reduce contention and make use of the full bisection bandwidth of the network. It also aligns with the **node-limited routing** constraint: since each token goes to at most 4 nodes, many tokens might remain local or go to only a subset, reducing total volume.\n", | |
"\n", | |
"**Auto-tuning chunk size:** The implementation details include *auto-tuning the communication chunk size* ([DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1#:~:text=employ%20customized%20PTX%20,the%20interference%20to%20other%20SMs)). This means they likely did not send one giant message for all data, but broke the all-to-all into chunks. The optimal chunk size balances between overhead (too many small messages) and resource usage (one huge transfer might monopolize the NIC or saturate L2 cache). By tuning chunk size, they reduced L2 cache pressure and SM interference ([DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1#:~:text=employ%20customized%20PTX%20,the%20interference%20to%20other%20SMs)). For instance, sending data in, say, 1MB chunks might allow overlapping communication on the NIC with computation, and avoid evicting working data from L2 by streaming out gradually.\n", | |
"\n", | |
"#### 3.2.2 PTX-Level Optimizations for Communication \n", | |
"**Custom PTX Instructions:** The team went as low-level as writing **custom PTX (Parallel Thread Execution) assembly** for certain communication operations ([DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1#:~:text=employ%20customized%20PTX%20,the%20interference%20to%20other%20SMs)). This is done to exert fine control over memory access patterns and bypass certain default behaviors of the compiler or runtime that could be suboptimal. One specific aim was to *reduce usage of the L2 cache* and avoid interference with other SMs ([DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1#:~:text=employ%20customized%20PTX%20,the%20interference%20to%20other%20SMs)). In GPU architecture, global memory operations go through L2 by default (and sometimes L1). During all-to-all, a large volume of data is moved that likely doesn’t need to stay in cache (it’s mostly streaming through to the network or to another GPU). If these transfers consume L2 cache, they could evict neural network activations or weights that are actively being used by compute, thereby slowing down computation. \n", | |
"\n", | |
"To prevent that, they likely used PTX memory access flags such as **.CG (cache at global level only)** or **.CS (cache streaming)** to bypass or minimize caching. For example, in PTX one can do: \n", | |
"```ptx\n", | |
"ld.global.cg.u32 %r1, [%rd_src]; // load with .cg (cache global/in L2 only, bypass L1)\n", | |
"st.global.wt.u32 [%rd_dest], %r1; // store with .wt (write-through, minimizing cache)\n", | |
"``` \n", | |
"By marking loads as `.cg` or even `.cs` (streaming) and stores as write-through or non-temporal, the data is less likely to pollute caches ([[PDF] GPU Concurrency: Weak Behaviours and Programming Assumptions](https://users.soe.ucsc.edu/~tsorensen/files/asplos2015.pdf#:~:text=,36%2C%20p)) ([[PDF] Adaptive and Transparent Cache Bypassing for GPUs - cfaed](https://cfaed.tu-dresden.de/files/user/akumar/publications/SC-2015-camera-ready-a17-li.pdf#:~:text=,cg%3B%20for%20L2%20bypassed)). *Pseudo-PTX example:* \n", | |
"```\n", | |
"// Pseudo-code for a communication kernel copying data without cache pollution\n", | |
"mov.u64 %rd_src, src_ptr;\n", | |
"mov.u64 %rd_dest, dest_ptr;\n", | |
"LDG.E.128 [%rd_src], %v4_data; // 128-byte load, perhaps using .E (evict) flag if available\n", | |
"STG.E.128 [%rd_dest], %v4_data; // 128-byte store, evicting from L2 after use\n", | |
"``` \n", | |
"This would load 128 bytes from global memory and mark them for eviction from cache after use, then store them out. By doing this in a loop, the kernel can stream large buffers of data efficiently.\n", | |
"\n", | |
"**Warp Specialization:** The mention of *warp specialization* ([The DeepSeek Series: A Technical Overview](https://martinfowler.com/articles/deepseek-papers.html#:~:text=routing%2C%20FP8%20training%2C%20DualPipe,the%20limitations%20imposed%20by%20limited)) suggests that they may have dedicated certain GPU warps or threads purely to moving data (communication) while others handle compute. For instance, one warp on an SM could be executing the PTX copy instructions to send/receive MoE data, while other warps on the same SM continue computing attention/MLP. Using CUDA streams or issuing asynchronous memcpy operations could enable this overlap. They likely had to fine-tune this so that communication didn’t starve compute or vice versa. With PTX, one can leverage asynchronous transaction barriers (`cp.async` in modern GPUs) to prefetch global memory to shared memory and then move it out, overlapping latency.\n", | |
"\n", | |
"**Overlap via PTX:** Modern NVIDIA GPUs (Ampere and beyond) allow `cp.async` instructions to load data to shared mem in the background while computation proceeds. It’s plausible they used such techniques for overlapping the packing/unpacking of all-to-all messages. Essentially, *HPC-level scheduling at the thread level* was used to ensure communication could be hidden. The Martin Fowler summary explicitly says: *“HPC-level scheduling (PTX instructions, warp specialization) hides communication overhead”* ([The DeepSeek Series: A Technical Overview](https://martinfowler.com/articles/deepseek-papers.html#:~:text=match%20at%20L597%20routing%2C%20FP8,the%20limitations%20imposed%20by%20limited)), confirming that these low-level tweaks were key to mask IB latency.\n", | |
"\n", | |
"**Efficiency Gains:** By customizing the communication, they *reduced interference with compute*. The PTX optimizations “significantly reduce use of the L2 cache and interference to other SMs” ([DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1#:~:text=employ%20customized%20PTX%20,the%20interference%20to%20other%20SMs)). This implies that compute kernels running concurrently on other SMs (or even different warps of the same SM) don’t experience slowdowns due to cache thrashing or memory controller saturation. Auto-tuning chunk sizes means the communication kernel likely yields periodically to let others use memory, or fits within certain time slices.\n", | |
"\n", | |
"In simpler terms, they engineered the data transfer to behave like a well-behaved DMA: streaming large blocks point-to-point without disturbing the rest of the system. This is reminiscent of CPU techniques where you use non-temporal loads/stores for bulk copies to avoid polluting caches.\n", | |
"\n", | |
"**Real-World PTX Snippet:** As a concrete example, consider sending an array of float values from GPU A to GPU B (assuming peer memory access is enabled). A PTX kernel might do:\n", | |
"```ptx\n", | |
"// Each thread block handles a chunk of the array\n", | |
"LDG.STS.S128 [%rd_shared, 0], [%rd_src + offset]; // load 16 floats (64B) from global to shared, with streaming\n", | |
"BAR.SYNC 0; // sync threads (if needed)\n", | |
"MOV.DATA.S128 [%rd_dest_peer + offset], [%rd_shared, 0]; // store 16 floats directly to peer's memory\n", | |
"```\n", | |
"This uses a two-step: global to shared (with .STS maybe meaning streaming to shared, avoiding L2), then shared to peer global (which might bypass L2). The effect is a direct copy from one GPU’s memory to another’s with minimal cache usage. (The exact PTX will differ, but conceptually this is what their custom code would do.)\n", | |
"\n", | |
"#### 3.2.3 Topology-Aware Communication and Balance \n", | |
"We already touched on **node-limited routing** in Section 2.2.2: by restricting each token to at most 4 nodes, they ensure that not all 2048 GPUs are exchanging with all others for every token, which reduces traffic. Moreover, they designed the gating to consider network load: the gating tends to choose an expert on the same node if it’s nearly as good as one on a remote node, thanks to the bias and node-selection heuristic ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=%23%23%20Node)). This effectively uses the **different bandwidth tiers** smartly: NVLink (intra-node) is 3.2x faster than IB ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=gating%20algorithm%20and%20the%20network,To%20effectively%20leverage%20the%20different)), so the system prefers to route within that when possible. If a token can get both of its experts on the same node, it saves a lot of IB bandwidth.\n", | |
"\n", | |
"Additionally, by overlapping communication (DualPipe) and compressing data (using FP8 for activations possibly during send, see low-precision comm in Section 3.4.4), the effective communication load is reduced or hidden.\n", | |
"\n", | |
"**All-to-All with Many GPUs:** Typically, an all-to-all across $N$ GPUs has $N(N-1)$ connections – not all active at once, but logically each must send to each. On 2048 GPUs, that’s daunting. But with node groupings and limiting to 4 nodes per token, the effective active connections per batch are far fewer. Possibly they partition experts into node groups so that most tokens only travel within a subset. The *device-level balance* and *communication balance losses* mentioned in DeepSeek-V2 ([The DeepSeek Series: A Technical Overview](https://martinfowler.com/articles/deepseek-papers.html#:~:text=%2A%20Expert,token%20routing%20to%20each%20device)) (which may be superseded by the new method) indicate they paid attention to balancing network load as well – e.g., ensure each node sends roughly equal amount and receives equal amount (so no single node becomes a hotspot). This might be achieved through the gating bias as well or the earlier auxiliary losses.\n", | |
"\n", | |
"**Results:** The paper states they *overcame the communication bottleneck in cross-node MoE training, achieving near-full computation-communication overlap* ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=Through%20the%20co,to%20further%20scale%20up%20the)). That means the all-to-all did not slow down training – a remarkable feat given the data sizes. They cite “co-design of algorithms, frameworks, and hardware” for this ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=Through%20the%20co,to%20further%20scale%20up%20the)), meaning the interplay of gating (algorithm), DualPipe (framework) and using NVLink/IB appropriately (hardware) was necessary. Also, any inefficiency in all-to-all would magnify because MoE is in *every transformer block* typically. Their solution scaled gracefully: training efficiency remained high even as they scaled to more nodes or as MoE traffic increased with more experts. \n", | |
"\n", | |
"In summary, **communication strategies** in DeepSeek-V3 combined **hierarchical all-to-all**, **concurrent communication with compute** (overlap), **cache-bypassing data transfer** (custom PTX), and **routing adjustments** to minimize cross-node traffic. This comprehensive approach ensured that *network did not become the Achilles heel* of the training run. Many large model projects struggle with network scaling, but DeepSeek-V3 essentially treated communication as another pipeline stage to be optimized, rather than an afterthought.\n", | |
"\n", | |
"#### 3.2.4 Impact on Scaling and Hardware Utilization \n", | |
"By eliminating communication stalls, they could make effective use of a large cluster of 2048 H800 GPUs. It’s worth noting that H800 (a datacenter GPU like A100 or H100 with presumably 80GB memory and high-speed links) is slightly slower in interconnect (the 50 GB/s IB mentioned) than some internal Google/Microsoft networks, but they managed to saturate it usefully. Achieving **linear scaling** means doubling the number of GPUs almost doubled training speed (minus minor overheads). This is evidenced by the fact that a 671B model was trained in under 8 weeks, which would be impossible without near-linear scaling given the work required.\n", | |
"\n", | |
"One can consider if each GPU was, say, 90% utilized and if there were any slack due to comm. The techniques above likely pushed utilization to ~100%. The overlap of comm and compute implies that at any given moment, either a GPU is computing or sending data (or both), but not sitting idle waiting for data.\n", | |
"\n", | |
"The careful handling of L2 cache also helps *other parts of the model*. For example, gradient all-reduce (for data parallel parts) can also be overlapped and tuned similarly, though not explicitly discussed, but presumably they used similar tricks for any needed collective communications (like model param updates across data parallel groups, though MoE reduces that since many params are local to experts).\n", | |
"\n", | |
"**PTX-level and HPC rigor** often yields maybe a few percent improvement here and there, but those add up. Removing cache thrash might give, say, a 5-10% speed boost on kernels that overlap with comm. Over thousands of GPUs, that’s significant.\n", | |
"\n", | |
"Finally, by presenting their suggestions to hardware designers (they had Section 3.5 in the report about hardware), they likely advocated for future GPUs to have even better support for such patterns (for example, hardware-supported microscaling in FP8, better network injection bandwidth, etc.) ([DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1#:~:text=3,Training%201.%204.1%20Data%20Construction)). This implies that while they managed with current hardware, they also identified hardware features that could further ease such communication overlap.\n", | |
"\n", | |
"In conclusion, the communication optimizations ensured that **distributed MoE training ran at maximal efficiency**, turning what could have been a scaling bottleneck into a non-issue. This is a prime lesson in large-scale training: algorithmic sparseness (MoE) shifts the bottleneck to communication, and only by innovating at the low-level (using assembly, tuning, and algorithms like DualPipe) could they reap MoE’s promised efficiency gains at 671B scale.\n", | |
"\n", | |
"### 3.3 Memory Management and Efficient Memory Usage \n", | |
"Training a model of this size requires extreme memory efficiency. DeepSeek-V3 uses **8-bit precision**, but even with that, the activations, optimizer states, and model parameters push GPU memory limits. To address this, the team implemented a suite of memory-saving techniques: from *recomputing certain values on the fly* instead of storing them, to *offloading large-but-seldom-used data to CPU*, to *sharing weights* between parts of the model. These optimizations allowed the model to fit in memory and even enabled larger batch sizes or sequence lengths for better throughput.\n", | |
"\n", | |
"#### 3.3.1 Selective Activation Recomputation (Backward Checkpointing) \n", | |
"One technique is to **not store some intermediate results** during forward pass and recompute them during backward pass as needed. This is known as *activation checkpointing*. DeepSeek-V3 specifically mentions recomputation of *RMSNorm and MLA up-projection* outputs in backward ([DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1#:~:text=3,Token%20Prediction)). \n", | |
"\n", | |
"- **RMSNorm recomputation:** RMSNorm (Root Mean Square Layer Normalization) is used in the model. Typically, to backprop through RMSNorm, one might store the normalized output or other intermediate (like the inverse sqrt of variance) from forward. Instead, they recompute the RMSNorm during backward from the original input, which they still have. The formula for RMSNorm: $y = \\frac{x}{\\sqrt{\\frac{1}{d}\\sum_i x_i^2 + \\epsilon}}$. The backward pass needs the normalization term. They can recalc $\\sqrt{\\frac{1}{d}\\sum_i x_i^2 + \\epsilon}$ on the fly from $x$ (which is the input to RMSNorm, likely still in memory or easily recomputed from earlier activations) rather than storing it. This saves storing the normalized $y$ or the norm. The cost is an extra sum of squares and sqrt in backward – negligible relative to overall compute. The gradient derivation for RMSNorm is: \n", | |
" $$ \\frac{\\partial L}{\\partial x} = \\frac{1}{\\sqrt{\\mu+\\epsilon}} \\left( \\frac{\\partial L}{\\partial y} - \\frac{x \\cdot (x^T \\frac{\\partial L}{\\partial y})}{d(\\mu+\\epsilon)} \\right), $$\n", | |
" where $\\mu = \\frac{1}{d}\\sum_i x_i^2$. To compute this, one needs $\\mu$ (or $\\mu+\\epsilon$ and $x^T \\frac{\\partial L}{\\partial y}$). They can recompute $\\mu$ easily by summing $x_i^2$ in backward. So no need to store $\\mu$ from forward. This saves memory at the cost of a tiny bit of extra compute (a vector dot product and a couple of scalars per norm).\n", | |
"\n", | |
"- **MLA Up-Projection recomputation:** The “MLA up-projection” likely refers to the step of reconstructing full keys/values from the latent. Perhaps in forward they compute $K = W^U_K c^{KV}$ and $V = W^U_V c^{KV}$, then use them. Storing $K$ and $V$ for backprop would be heavy (they are large). Instead, in backward they can recompute $K$ and $V$ from the stored latent $c^{KV}$ (which is smaller) and the weight matrices (which they have). This is a matrix multiplication repeated, but worth it to avoid storing $K,V$. So effectively, during backward: they multiply $c^{KV}$ by $W^U_K$ again to get $K$ (then compute gradients etc). Since $c^{KV}$ is much smaller dimension than $K$, storing it (latent) in forward and recomputing $K$ is memory-efficient. This aligns with earlier discussion that only latent needs caching.\n", | |
"\n", | |
"By recomputing these pieces, they trade a bit of extra FLOPs for reduced memory footprint. This is a common trade-off, and given they had plenty of compute (GPUs) but limited memory, it’s a smart choice.\n", | |
"\n", | |
"The report explicitly lists **“Recomputation of RMSNorm and MLA Up-Projection”** as a sub-item under memory saving ([DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1#:~:text=3,Token%20Prediction)), confirming they did exactly this. \n", | |
"\n", | |
"**Impact:** This selective recomputation keeps **activation memory in check**, allowing either larger batch sizes or deeper models per GPU. Without it, certain layers might double buffer big tensors. By recomputing, memory usage per layer is lowered. For example, not storing $K,V$ could save tens of MB per layer per GPU (since $K,V$ of dimension $d_{\\text{model}}$ for all tokens in a micro-batch). Summed over many layers, that’s a large saving. This was likely crucial to fit the model and a reasonable batch on 80GB H800 memory along with optimizer states.\n", | |
"\n", | |
"#### 3.3.2 Offloading to CPU (EMA and Others) \n", | |
"The report mentions *“Exponential Moving Average in CPU”* ([DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1#:~:text=3,Token%20Prediction)). Often, during training, especially for research, one maintains an Exponential Moving Average of model weights (for stabilization or evaluation). This EMA basically doubles memory if kept on GPU. They moved this **EMA to CPU** memory, updating it asynchronously. That means after each step, the GPU sends weight diffs to the CPU to update the EMA. This offloads a large chunk of memory (671B parameters in FP16 would be enormous; even in FP8, storing an extra copy is huge). On CPU, memory is more plentiful (though slower). Since EMA is not needed for compute during training, it’s fine to keep it on CPU and maybe bring it to GPU only for occasional eval or at very end. This saved *GPU memory* significantly ([DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1#:~:text=3,Token%20Prediction)).\n", | |
"\n", | |
"Similarly, any other long-term state that isn’t needed constantly could be offloaded. Perhaps things like certain optimizer statistics if they weren’t used every step (though usually they are). In DeepSpeed, techniques like ZeRO-Offload push gradients or momentum to CPU. DeepSeek-V3 might not have needed that if FP8 shrunk things enough, but EMA they explicitly did.\n", | |
"\n", | |
"**Impact:** Offloading EMA freed GPU memory proportional to model size. For 671B params, even storing them in 16-bit would be ~1.3 TB (not feasible on GPU), but they might have only tracked EMA for finetuning or not at all. Possibly they did maintain it given mention. By offloading, they ensure that memory is used for active stuff only.\n", | |
"\n", | |
"#### 3.3.3 Weight Sharing (Embedding/Output and MTP) \n", | |
"They mention *“Shared Embedding and Output Head for Multi-Token Prediction”* ([DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1#:~:text=3,Token%20Prediction)). Transformers often tie input embedding and output projection matrices (for vocab) to save parameters. DeepSeek-V3 likely does this tying for the main LM head to reduce parameters a bit (common practice). But specifically here, for the **MTP module**, which predicts multiple tokens, they ensure it shares the embedding and output layers with the main model. So the extra parameters introduced by MTP are minimal (just a small module, not a separate full vocabulary projection). This avoids duplicating large vocab embeddings.\n", | |
"\n", | |
"Memory-wise, embedding tables can be large (depending on vocab size). Tying them (embedding = softmax weight) saves memory and usually does not hurt performance. So they apply that. It’s a straightforward memory cut.\n", | |
"\n", | |
"#### 3.3.4 ZeRO and Optimizer Partitioning \n", | |
"While not explicitly mentioned, one cannot ignore that *ZeRO-style sharding* might have been used. DeepSeek-V3 has a huge number of parameters, but many of them (experts) are sharded by nature (each GPU holds different experts). For any remaining large components (like the transformer layers that are not MoE, e.g., attention layers weights), they likely used **data parallel with optimizer state partitioning** so that each GPU only holds a shard of the optimizer states (as in ZeRO Stage 1/2) ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=Through%20the%20co,to%20further%20scale%20up%20the)). This way, memory for Adam moments is distributed. Or, given they did FP8, they might have used a simpler optimizer like Lion or momentum-SGD? However, likely Adam or AdamW with some form of partition. \n", | |
"\n", | |
"The “Extremely Memory Saving with Minimal Overhead” section might detail such approach ([DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1#:~:text=1.%203.2.1%20DualPipe%20and%20Computation,Memory%20Saving%20with%20Minimal%20Overhead)), since it’s listed under 3.2.3. Possibly they allowed larger expert parallel (EP) size, meaning fewer replicas of base model across nodes, which inherently is like ZeRO: each GPU has only a fraction of each layer’s weights (expert parallel does that for MoE; they might also apply model parallel for dense layers).\n", | |
"\n", | |
"It’s plausible that the combination of MoE (which naturally shards FFN params) and pipeline parallel (which shards layers) means that no single GPU has all model weights – in fact each GPU sees only a slice (like 1/8th from pipeline and 1/32th from MoE, etc.). Thus, *the model is distributed by design*, reducing memory per GPU. This is how they fit 671B on 2048 GPUs (which yields ~0.33B per GPU on average, though distribution is uneven).\n", | |
"\n", | |
"**Optimizer State Precision:** Another memory saver: they likely kept optimizer states in lower precision. If training in FP8, gradients might accumulate in BF16 or FP32. But momentum and variance (if using Adam) could be stored in e.g. FP16 or BF16 without issue. They mention *“Low-Precision Optimizer States”* ([DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1#:~:text=3.%203.3.3%20Low,Precision%20Communication)) in the FP8 section, implying they do store those in e.g. FP16 instead of FP32. That would cut memory for optimizer by 2x. Combined with partitioning, the optimizer memory overhead per GPU is quite low.\n", | |
"\n", | |
"#### 3.3.5 Impact on Memory and Training \n", | |
"All these strategies together allowed them to run with comfortable memory headroom. The report likely has a table of memory usage across pipeline methods ([DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1#:~:text=burden%2C%20DualPipe%20still%20exhibits%20efficiency,reduces%20the%20pipeline%20bubbles%20while)), showing DualPipe’s activation overhead and how they managed it. They claim minimal overhead for their memory savings approach ([DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1#:~:text=only%20increasing%20the%20peak%20activation,DualPipe%20only%20requires%20that%20the)). The bottom line is that memory was not the limiter in training DeepSeek-V3; they found ways to operate within the 80GB per GPU. \n", | |
"\n", | |
"**Why memory efficiency matters economically:** If they couldn’t fit the model efficiently, they might need more GPUs (to split model further) or lower batch sizes (hurting throughput). By saving memory, they could use 2048 GPUs effectively rather than needing, say, 3000 GPUs or running with tiny batches. This directly affects cost.\n", | |
"\n", | |
"From a *master class* perspective, the lesson is: *Large models require equal attention to memory as to compute.* You often trade compute (recompute steps) to save memory, which is worthwhile if you have spare FLOPs but limited memory, as is often the case on GPUs.\n", | |
"\n", | |
"### 3.4 FP8 Mixed-Precision Training – Faster Compute, Smaller Memory Footprint \n", | |
"One of DeepSeek-V3’s headline innovations is that it was trained with **8-bit floating point precision (FP8)** for most of its computations ([DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1#:~:text=In%20order%20to%20achieve%20efficient,2017%3B%20Peng%20et%C2%A0al)) ([DeepSeek-V3 Explained: Optimizing Efficiency and Scale](https://adasci.org/deepseek-v3-explained-optimizing-efficiency-and-scale/#:~:text=DeepSeek,576)). Mixed precision training has been around (FP16/BF16 for activations and weights, FP32 for some accumulations), but FP8 is cutting-edge. By using FP8, they *halved* the memory required for activations and model weights relative to FP16, and doubled the math throughput on tensor core units (since GPUs can execute more FP8 ops per cycle than FP16) ([The DeepSeek Series: A Technical Overview](https://martinfowler.com/articles/deepseek-papers.html#:~:text=FP8%20Mixed%20Precision)). However, FP8 has a much narrower dynamic range and precision, so special care was needed to maintain training stability and accuracy. DeepSeek-V3 introduced a **fine-grained quantization strategy** and increased accumulation precision to make FP8 viable at scale ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S3.SS3.SSS1#:~:text=introduce%20a%20fine,GEMM)) ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S3.SS3.SSS1#:~:text=Based%20on%20our%20mixed%20precision,focusing%20on%20both%20the%20quantization)). We break down how FP8 was used and the techniques to mitigate its downsides.\n", | |
"\n", | |
"#### 3.4.1 FP8 Mixed Precision Framework Overview \n", | |
"**FP8 formats:** NVIDIA’s H100 GPU (which the H800 is a variant of) supports two FP8 formats: E4M3 (4 exponent bits, 3 mantissa bits) and E5M2 (5 exponent bits, 2 mantissa bits). E4M3 has 16 finite values per power-of-two range (as 3 bits mantissa -> 8 values, half are negative), and exponent range of 2^4 = 16 (exponents -6 to +9 bias maybe), whereas E5M2 has wider dynamic range (32 exponent values) but only 4 values per range. In practice, E4M3 has precision ~0.5% (because 2^3=8 steps per doubling ~12% relative precision? Actually, mantissa=3 bits gives ~0.8% of precision at normalized range) and E5M2 can represent very large or small numbers but with coarse steps (only 2 bits mantissa ~25% precision in worst case).\n", | |
"\n", | |
"DeepSeek-V3 likely used FP8 for forward activations and weights, but kept some parts in higher precision:\n", | |
"- The model weights and activations were stored in FP8 during compute.\n", | |
"- The gradient accumulation and certain sensitive computations (like the residual summation, layer norm) might use higher precision (BF16 or FP16) to avoid error accumulation ([DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1#:~:text=In%20order%20to%20achieve%20efficient,2017%3B%20Peng%20et%C2%A0al)).\n", | |
"- The optimizer likely ran in FP16/BF16 for updates, then cast weights to FP8.\n", | |
"\n", | |
"They mention a *mixed precision framework* ([DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1#:~:text=In%20order%20to%20achieve%20efficient,2017%3B%20Peng%20et%C2%A0al)), indicating they carefully decided which tensors are FP8 and which are not. Typically:\n", | |
" - Weights: FP8\n", | |
" - Activations: FP8\n", | |
" - Gradients: maybe FP8 in communication, but when computing weight updates, they might upcast to BF16\n", | |
" - Accumulated partial sums in matmul: done in higher precision (like FP16 or FP32)\n", | |
"\n", | |
"Yes, they specifically mention *“Higher FP8 GEMM accumulation precision in Tensor Cores”* ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S3.SS3.SSS1#:~:text=match%20at%20L727%20quantization%20method,precision%20accumulation)). Likely H100 allows accumulating FP8*FP8 products into an FP32 accumulator by default. They might ensure that is used (i.e., the WMMA instructions accumulate in 32-bit even if inputs are 8-bit). If not by default, they might break the GEMM such that partial sums are done in pieces to avoid overflow (which they do, as described later).\n", | |
"\n", | |
"The FP8 mixed precision approach was validated for the first time on such a large model ([DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1#:~:text=Low,2017%3B%20Peng%20et%C2%A0al)). They showed it works, which is a big deal. The benefits:\n", | |
"- **Memory**: FP8 uses half the bytes of FP16 for activations and gradients. Also, if optimizer states can be kept in FP8 (maybe not for moments, but maybe for some things), memory reduces. They likely used BF16 for some states, FP8 for data. They mention low-precision storage of optimizer states ([DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1#:~:text=3.%203.3.3%20Low,Precision%20Communication)) which could be even FP8 for momentum (though that might be too aggressive; more likely FP16).\n", | |
"- **Speed**: H100 can achieve up to ~2x more TFLOPS on FP8 than FP16. If bound by compute, this can nearly double speed. They did see accelerated training thanks to FP8 ([DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1#:~:text=FP8%20mixed%20precision%20training%20framework,we%20design%20the%20DualPipe%20algorithm)).\n", | |
"\n", | |
"However, training in FP8 can be unstable if not managed:\n", | |
"- Overflows/underflows: FP8’s limited range can saturate (if values exceed max representable, they become inf). This can break training.\n", | |
"- Rounding error: 3-bit mantissa is rough, noise might be high.\n", | |
"- Gradient noise: could interfere with convergence.\n", | |
"\n", | |
"Hence, they implemented *quantization strategies*.\n", | |
"\n", | |
"#### 3.4.2 Fine-Grained Quantization (Tile-wise scaling) \n", | |
"**Problem:** In any layer (matrix multiply), some elements can be outliers (very large magnitude) which determine the scaling for quantization, causing the rest to suffer in precision. With FP8, if you choose a single scale for a whole matrix (as in standard operations), an outlier can force the scale such that most values get only 1-2 bits of mantissa effectively.\n", | |
"\n", | |
"**Solution:** They introduced *tile-wise grouping for scaling* ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S3.SS3.SSS1#:~:text=quantization%20accuracy,weights%2C%20we%20group%20and%20scale)). Specifically:\n", | |
"- For **activations**: they group elements in a tensor on a 1x128 tile basis – meaning each vector of 128 channels for a given token has its own scale factor ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S3.SS3.SSS1#:~:text=quantization%20accuracy,weights%2C%20we%20group%20and%20scale)). So if a particular neuron spikes for one token, it doesn’t wreck the scaling for others.\n", | |
"- For **weights**: they group by 128 channels as well (likely 128 out of the output features) ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S3.SS3.SSS1#:~:text=quantization%20accuracy,weights%2C%20we%20group%20and%20scale)).\n", | |
"\n", | |
"Thus, instead of one scale per entire tensor, they have many scales (like per row or per tile of 128). This is akin to “group quantization” or “block-wise quantization” at a fine granularity. It is more flexible and can capture local variations in magnitude.\n", | |
"\n", | |
"They note in Appendix B.2 that using block-wise for activations (like larger block) caused instability, so they opted for tile-wise (smaller group) for activations ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S3.SS3.SSS1#:~:text=match%20at%20L753%20channels%29,the%20same%20way%20as%20weights)). For weights, group of 128 was fine.\n", | |
"\n", | |
"**Microscaling:** They point out this approach aligns with the concept of “microscaling” formats ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S3.SS3.SSS1#:~:text=match%20at%20L764%20Notably%2C%20our,NVIDIA)), and that next-gen GPUs are planning to support such finer granularity quantization natively ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S3.SS3.SSS1#:~:text=Notably%2C%20our%20fine,NVIDIA)). This means they are on the cutting edge, anticipating hardware support.\n", | |
"\n", | |
"**Implementation:** Having per-group scaling factors along the inner dimension of GEMM means in the matrix multiply $C = A \\times B$, where say A are activations (tokens x channels) and B are weights (channels x outputs), they partition the “channels” dimension into groups of 128. Each group has its own scale for A and for B. To compute the result correctly, they must *dequantize* each group’s FP8 values to FP16/FP32 by multiplying by its scale before accumulation.\n", | |
"\n", | |
"They mention *“This functionality is not [supported by off-the-shelf libraries]”* ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S3.SS3.SSS1#:~:text=One%20key%20modification%20in%20our,This%20functionality%20is%20not)), so they had to implement it. They likely wrote a custom kernel that:\n", | |
"- For each tile of 128 in the K dimension (inner dimension),\n", | |
"- multiplies the corresponding chunk of A and B by their scale (making them FP16),\n", | |
"- then does a matrix multiply (maybe using tensor cores on FP16 or directly FP8 with intermediate scaling, but likely easier: treat them as scaled FP16),\n", | |
"- accumulate partial results in FP32,\n", | |
"- then move to next tile.\n", | |
"\n", | |
"Because each tile of 128 can be done independently and summed (since matrix multiply sums over K dimension which is segmented now), this yields the correct full result.\n", | |
"\n", | |
"This approach increases computation slightly (you do scaling multiplies for each element, which is minor, and you might not fully utilize tensor cores if chunk size isn't multiple of their tile, but they likely chose 128 because it's a nice multiple for WMMA operations e.g., 16 or 32).\n", | |
"\n", | |
"**Dequantization Overhead:** They say the overhead is largely mitigated by their increased precision accumulation ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S3.SS3.SSS1#:~:text=introduce%20a%20fine,GEMM)). Possibly because by accumulating partials in FP32, they could incorporate scaling factors as part of the accumulation. For example, if you accumulate after every 128 elements, you can incorporate their scale difference by scaling the partial sums appropriately. However, they likely just did straightforward: multiply each tile’s values by scale (makes them FP16) then multiply-add. The overhead of that multiply is small on tensor cores which can fuse multiply-add.\n", | |
"\n", | |
"**Activations vs Weights quantization:** Usually, weights can be static-scaled (maybe one scale per channel computed from calibration). But they likely used dynamic scaling per batch as well, especially for activations which vary each iteration.\n", | |
"\n", | |
"**Benefits of fine-grained quant:** This reduces quantization error by adapting scale to smaller groups. Outliers now only affect their group’s scale. They found that grouping activations in larger blocks like weights did caused instability ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S3.SS3.SSS1#:~:text=channels%29,the%20same%20way%20as%20weights)), likely because activation distributions vary more widely token to token, so needed smaller grouping (per token per 128 channels) to handle that.\n", | |
"\n", | |
"#### 3.4.3 Enhanced Precision Accumulation (Mantissa vs Exponent) \n", | |
"They mention *“Increasing accumulation precision”* and *“Mantissa over Exponents”* in the contents ([DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1#:~:text=match%20at%20L31%203,Precision%20Optimizer%20States)). This implies two things:\n", | |
"- They ensure the matrix multiplication accumulations are done in higher precision (likely FP32) periodically.\n", | |
"- Possibly they prefer to allocate bit budget to mantissa precision rather than exponent where possible, i.e., maybe choose E4M3 vs E5M2 in certain layers depending on distribution.\n", | |
"\n", | |
"**Mixed Format Use:** It could be they used E5M2 format for weights (wider range for weight magnitudes) and E4M3 for activations (more precision since activations typically can be normalized). Or vice versa. Or maybe they tried both and found one better.\n", | |
"\n", | |
"The phrase \"Mantissa over Exponents\" might mean they bias toward format with more mantissa bits (E4M3 has one more mantissa bit than E5M2), so if dynamic range allowed, they used E4M3 to get extra precision in values.\n", | |
"\n", | |
"**Interval accumulation in CUDA cores:** The text snippet says *“promoting to CUDA Cores at an interval of elements MMA for the high-precision accumulation”* ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S3.SS3.SSS1#:~:text=match%20at%20L727%20quantization%20method,precision%20accumulation)), although it's a bit garbled. Possibly: they accumulate partial sums in FP32 on standard CUDA cores after every certain number of MMA operations. Maybe they break a big GEMM into chunks such that after processing, say, 128 elements of K dimension (one tile), they add the partial sum out of tensor core to a FP32 accumulator in regular cores. This prevents overflow in accumulation because you don’t sum too many low-precision products in one go on the tensor core.\n", | |
"\n", | |
"For example, if the hardware default was FP8 * FP8 accumulate in FP16 (just guessing), they might promote it to FP32 after a few operations. But I think H100 can accumulate FP8 in FP32 natively, so not sure. It's possible they manually did it to be safe.\n", | |
"\n", | |
"Alternatively, maybe they accumulate in FP16 inside a tile, then after finishing tile, convert to FP32 and accumulate across tiles in FP32 on general cores. This is plausible if they used existing WMMA instructions: maybe current WMMA for FP8 might accumulate in FP16 (if hardware not fully supporting FP32 accumulate?), so by slicing the problem and doing cross-tile accumulation in FP32, they increase result precision.\n", | |
"\n", | |
"In any case, they took care to ensure numeric stability:\n", | |
"- Fine-grained scaling handles range and distribution.\n", | |
"- FP32 accumulation ensures sum of many small FP8 values doesn’t degrade too much or overflow.\n", | |
"\n", | |
"#### 3.4.4 Low-Precision Optimizer States and Communication \n", | |
"Under FP8 training, they also compress other aspects:\n", | |
"- **Optimizer States:** They store momentum, variance, etc. in lower precision. Likely at least BF16 if not even 8-bit for some. There are techniques for 8-bit optimizers (e.g., 8-bit Adam by Dettmers et al.), which quantize momentum and variance to 8-bit with a per-param scaling. Possibly they used something like that given they mention it and cite relevant works ([DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1#:~:text=In%20order%20to%20achieve%20efficient,2017%3B%20Peng%20et%C2%A0al)) (Kalamkar 2019, etc., maybe references on low precision training).\n", | |
" \n", | |
" But since they were already pushing state of art, maybe not 8-bit optimizer, more likely FP16 or BF16 for optimizer with some partitioning. But the text *“Low-Precision Optimizer States”* suggests at least 8-bit or 16-bit.\n", | |
"\n", | |
"- **Low-Precision Activation storage:** When doing activation checkpointing or if storing activations for backward, they might store them in FP8 instead of FP16. If they can recompute some things, maybe they can quantize those they store. Possibly they stash the activations (like output of each sub-layer) in FP8 in the checkpoint, then for backward, use them (maybe upcast to BF16 for actual gradient calc). If error from that quantization is acceptable, that could give further memory savings. They do mention *“Low-Precision Activation”* in content ([DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1#:~:text=3.%203.3.3%20Low,Precision%20Communication)).\n", | |
"\n", | |
"- **Low-Precision Communication:** They likely communicated gradients or parameters in FP8 as well. Since they had FP8 data, sending those directly instead of converting to FP16 for all-reduce would halve bandwidth. If using NCCL, not sure it supports FP8 directly yet, but they could compress gradients to FP8 manually, send, then decompress. Given the huge scale, even all-reduce of some parts (embedding grads, etc.) could benefit. They list *“Low-Precision Communication”* ([DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1#:~:text=3.%203.3.3%20Low,Precision%20Communication)) as well, hinting at gradient compression.\n", | |
"\n", | |
"So effectively, they tried to use FP8 wherever feasible: not just forward, but also in backward communications and storage, to maximize the benefit.\n", | |
"\n", | |
"#### 3.4.5 Numerical Stability and Convergence \n", | |
"In Appendix B.1, they compare FP8 vs BF16 training ([DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1#:~:text=8,Based%20and%20Aux)). It appears DeepSeek-V3 trained successfully in FP8 with no loss spikes and final quality on par with baseline. This is groundbreaking: previous large models often stuck to FP16 or BF16 due to stability. They credit their strategies for achieving this ([DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1#:~:text=FP8%20mixed%20precision%20training%20framework,we%20design%20the%20DualPipe%20algorithm)).\n", | |
"\n", | |
"They cite recent works (Peng et al 2023b; Dettmers 2022; etc.) as inspiration ([DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1#:~:text=match%20at%20L765%20Inspired%20by,great%20promise%2C%20it%20is%20often)), but they are first to do it on “extremely large-scale” ([DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1#:~:text=FP8%20mixed%20precision%20training%20framework,we%20design%20the%20DualPipe%20algorithm)). The fine-grained quantization likely addressed instability that others saw (Appendix B.2 mentions instability when grouping activations incorrectly, which they fixed) ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S3.SS3.SSS1#:~:text=channels%29,the%20same%20way%20as%20weights)).\n", | |
"\n", | |
"**Loss Convergence:** If they did an ablation, likely FP8 vs BF16 gave similar convergence. They might have had to adjust hyperparameters slightly (learning rate, etc) to account for quantization noise (some small smoothing perhaps). But given they didn’t mention needing to revert to higher precision at any point, it truly worked throughout training.\n", | |
"\n", | |
"**Edge cases:** Some operations might still be in higher precision: e.g., the final layer normalization before output, or the loss computation, might be in FP16/32 to ensure accurate gradient. But those are minor.\n", | |
"\n", | |
"One risk with low precision is *random seed reproducibility* might reduce (due to non-deterministic rounding differences), but not a big deal for them presumably.\n", | |
"\n", | |
"#### 3.4.6 Performance Gains from FP8 \n", | |
"Using FP8 gave two direct gains:\n", | |
"- **Speed:** On NVIDIA H100, FP8 Tensor Core matrix multiply can reach up to 1340 TFLOPs (for example) versus ~670 TFLOPs for FP16 (these numbers vary but it's roughly 2x). By using FP8 for most heavy ops (like attention projections and MLP GEMMs), the model trains faster. If previously an attention layer took X ms in FP16, in FP8 it could be ~0.5X (plus small overhead for scaling operations). They did note accelerated training ([DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1#:~:text=FP8%20mixed%20precision%20training%20framework,we%20design%20the%20DualPipe%20algorithm)). So maybe they got e.g. 1.3-1.5x overall speedup from compute side. When combined with communication optimization, overall step time lowered significantly.\n", | |
"\n", | |
"- **Memory:** Halving precision halves memory traffic and usage. This means *less pressure on memory bandwidth*, which can improve speed too if memory-bound. For large models, often *memory bandwidth is a limiter for training speed* (especially for big matrix multiplies, reading weights and writing activations). FP8 halves the bytes, effectively doubling memory bandwidth utilization (in terms of how many FP8 values can be moved per second, double vs FP16). So layers became less memory-bound. The result is more balanced use of the GPU’s capabilities.\n", | |
"\n", | |
"- **Batch size:** With less memory per activation, they possibly could increase micro-batch size or sequence length, which again helps keep GPUs busy and improve throughput.\n", | |
"\n", | |
"In short, FP8 training gave a *multiplicative boost* on top of MoE’s algorithmic gain. The cost was extra complexity in implementing scaling. But evidently worth it: they trained a huge model with FP8 and got SOTA open-source results ([DeepSeek-V3 Explained: Optimizing Efficiency and Scale](https://adasci.org/deepseek-v3-explained-optimizing-efficiency-and-scale/#:~:text=DeepSeek,world%20applications)).\n", | |
"\n", | |
"**Ablation Reference:** They likely measured that FP8 model reached similar validation perplexity or accuracy as a hypothetical BF16 model (if they trained a smaller variant in BF16 as baseline). The success here paves the way for others to use FP8 in future large models, cutting costs.\n", | |
"\n", | |
"To put it in perspective: if training with BF16 would have taken 2x more memory and time, they might not have been able to afford or fit this model on the cluster they had. FP8 was instrumental in **cost reduction**.\n", | |
"\n", | |
"---\n", | |
"\n", | |
"## 4. Efficiency Impact and Economic Scaling Analysis \n", | |
"\n", | |
"Having described all key optimizations, we now analyze how they come together to make DeepSeek-V3 *economically feasible* and what the broader implications are. The combination of MoE sparsity, advanced parallelism, and low precision yields a model that delivers high performance at a fraction of the cost of a dense model of similar capability.\n", | |
"\n", | |
"### 4.1 Cost Efficiency Achieved \n", | |
"DeepSeek-V3’s full training (14.8T tokens, plus fine-tuning stages) consumed **2.788 million GPU hours on H800** (80GB) GPUs ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=in%20DeepSeek,Throughout%20the)) ([DeepSeek-V3 Explained: Optimizing Efficiency and Scale](https://adasci.org/deepseek-v3-explained-optimizing-efficiency-and-scale/#:~:text=DeepSeek,576)). In more common terms, that’s equivalent to about 320 GPU-years (if one GPU ran 24/7). If rented on cloud at ~$2 per hour per GPU, that's ~$5.6M cost ([DeepSeek-V3 Explained: Optimizing Efficiency and Scale](https://adasci.org/deepseek-v3-explained-optimizing-efficiency-and-scale/#:~:text=DeepSeek,576)). This is a large sum, but *remarkably low given the model’s scale*. By comparison, a dense 175B model (like GPT-3) reportedly took 3640 GPU-years on V100 (which if translated to H100 hours might be on similar order). DeepSeek-V3 is ~4x the activated parameters of GPT-3, yet was trained in a couple of months.\n", | |
"\n", | |
"**Breakdown of Savings:** \n", | |
"- **Mixture-of-Experts:** The MoE architecture is the biggest algorithmic cost-saver. If DeepSeek-V3 were dense with 671B parameters active per token, training would require roughly 18x more compute per token (since only 37B are active in MoE) ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=We%20present%20DeepSeek,Comprehensive%20evaluations)). Roughly, an equivalent dense model might need on the order of $2.788M \\times 18 \\approx 50$ million GPU hours to train the same number of tokens – clearly infeasible. MoE allowed them to spend compute proportional to a 37B model for each token, yet achieve the quality benefits of a 671B model (in terms of parameter count for capacity) ([DeepSeek-V3 Explained: Optimizing Efficiency and Scale](https://adasci.org/deepseek-v3-explained-optimizing-efficiency-and-scale/#:~:text=DeepSeek,world%20applications)). *This is an enormous efficiency gain.* In essence, MoE gave nearly an order-of-magnitude reduction in required FLOPs for a given model capacity. The only overhead vs a dense 37B model is the additional communication and the gating overhead, which they minimized with DualPipe and other methods. So MoE’s potential was fully realized: large capacity without proportional cost.\n", | |
"\n", | |
"- **FP8 Precision:** Using FP8 further likely halved the compute and memory costs compared to using FP16. If we assume they’d otherwise use BF16/FP16, the FP8 framework gave a ~2x speedup. So the 2.788M hours might have been ~5.5M hours if not for FP8. Or put differently, they effectively saved a few million dollars by adopting FP8 early. Moreover, FP8 reduced memory usage so they could train with fewer nodes or higher batch sizes, indirectly improving efficiency.\n", | |
"\n", | |
"- **DualPipe & Overlap:** Without DualPipe, pipeline parallelism could have introduced maybe 10-20% overhead (bubbles) and without overlap, communication could add another 10-20%. DualPipe and comm overlap recaptured that time. Achieving near 100% utilization of 2048 GPUs means almost no cycles wasted. If each GPU is, say, 90% utilized vs 100%, that alone is a 11% time difference (which would be hundreds of thousands of GPU hours). So the scheduling optimizations easily saved on the order of 300k+ GPU hours (worth hundreds of thousands of dollars) and made training faster (which also reduces chances of something going wrong over time).\n", | |
"\n", | |
"- **Memory optimizations:** By keeping memory usage low, they enabled using 2048 of 80GB GPUs. If memory was a problem, they might have needed a model parallel approach with more GPUs or larger memory GPUs (like using 40GB A100s might have required 2x count, etc.). They sidestepped needing more hardware. Also, stable training without interruptions (no loss spikes requiring restart) saved time – some large trainings lose days recovering from instabilities.\n", | |
"\n", | |
"- **Training time:** 2.788M GPU hours on 2048 GPUs equals ~56.7 days, roughly 8 weeks. This relatively short time-to-train is critical because prolonged runs incur more risk of hardware failures, spot instance interruptions, etc. Finishing in 2 months is itself a cost saving vs if it dragged to 4-6 months (with more babysitting and possibly more rental cost if spot prices change).\n", | |
"\n", | |
"In summary, DeepSeek-V3 delivered **top-tier model performance** at perhaps an order of magnitude less training cost than a comparable dense model. Each component – MoE, FP8, overlapping – contributed multiplicatively. For example, MoE gave ~18x efficiency, FP8 ~2x, overlapping let's say ~1.1-1.2x, combined ~40x or more effective efficiency gain. Even if these numbers are rough, it’s safe to say without these innovations, a model reaching DeepSeek-V3’s capability might cost tens of millions of dollars, versus the single-digit millions it actually cost.\n", | |
"\n", | |
"### 4.2 Performance vs Scale: Reaching 671B Parameters Affordably \n", | |
"The optimizations not only saved cost, but enabled *scaling the model to a size that would otherwise be unattainable*. For dense models, the scaling law often shows diminishing returns and skyrocketing costs. DeepSeek-V3’s approach allowed scaling parameters *mostly in cheap ways* (by adding more experts which cost only more memory and some network, not proportional flops). Thus:\n", | |
"- They achieved SOTA-level performance (comparable to models like GPT-4 in some areas as claimed ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=Learning%20stages%20to%20fully%20harness,model%20checkpoints%20are%20available%20at)) ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=performance%20comparable%20to%20leading%20closed,model%20checkpoints%20are%20available%20at))) while being open-source.\n", | |
"- The **throughput per GPU** was kept high, meaning adding more GPUs gave linear speed-up rather than hitting network limits. This is evidenced by their claim of overcoming bottlenecks ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=Through%20the%20co,to%20further%20scale%20up%20the)).\n", | |
"- The training stability meant no wasted resources on failed runs or heavy hyperparameter tuning beyond initial phases.\n", | |
"\n", | |
"**Economic Scaling Law:** If we define a metric like “petaFLOPs per day achieved per GPU” or something, DeepSeek-V3 likely set records. The combination of high sparsity and high hardware utilization means each GPU delivered near its peak useful work. In essence, they maximized the *economic utility* of each GPU hour.\n", | |
"\n", | |
"**Ablation insights:** The technical report likely included ablations showing what happens if any piece is removed:\n", | |
"- Without aux-loss-free gating, maybe quality would drop or training might need more tokens to converge (thus more cost).\n", | |
"- Without FP8, would need more GPUs or longer run.\n", | |
"- Without DualPipe, scaling to many nodes might plateau or require more microbatches (which can increase memory or reduce convergence if batch is too small per device).\n", | |
"So each part was important for hitting the efficiency target.\n", | |
"\n", | |
"### 4.3 Implications for Future Large-Scale Training \n", | |
"DeepSeek-V3’s efficiency strategies serve as a template for future projects:\n", | |
"- **Sparsity is powerful:** Mixture-of-Experts is validated as a viable path to go beyond dense model limits. The key is managing the complexity (load balancing and comm) – which they showed is solvable. Public frameworks (like DeepSpeed-MoE, tutel, etc.) will incorporate these ideas, meaning others can train MoE models cheaply.\n", | |
"- **Hardware co-design:** The project demonstrates that understanding hardware (GPUs, interconnects) and co-designing algorithms (like DualPipe) is essential at the frontier of scale. Out-of-the-box solutions might leave significant performance on the table. Groups aiming at trillion+ parameter models must invest in such co-design.\n", | |
"- **Low precision frontier:** They also pushed the envelope on numerical precision. This paves the way for FP8 (and maybe even lower like 4-bit in future) training, significantly lowering cost without loss in quality, provided appropriate techniques (fine-grained quant, etc.) are used. It’s likely future hardware (the next NVIDIA architectures) will natively support group-wise quantization because of evidence like this ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S3.SS3.SSS1#:~:text=match%20at%20L764%20Notably%2C%20our,NVIDIA)).\n", | |
"- **Energy efficiency:** By reducing flop usage and using lower precision, DeepSeek-V3 is more energy-efficient than an equivalent dense FP16 model. This has environmental benefits as well for training such large models.\n", | |
"\n", | |
"### 4.4 Performance Achieved vs Baselines \n", | |
"Finally, in terms of results: DeepSeek-V3 outperforms other open-source models and is competitive with the best closed models ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=Learning%20stages%20to%20fully%20harness,model%20checkpoints%20are%20available%20at)), validating that efficiency did not come at the cost of effectiveness. For instance, it excels in coding, math, and reasoning tasks, likely due to the large parameter count and MoE specialization ([DeepSeek-V3 Explained: Optimizing Efficiency and Scale](https://adasci.org/deepseek-v3-explained-optimizing-efficiency-and-scale/#:~:text=Experts%20,world%20applications)) ([DeepSeek-V3 Explained: Optimizing Efficiency and Scale](https://adasci.org/deepseek-v3-explained-optimizing-efficiency-and-scale/#:~:text=DeepSeek,world%20applications)). The fact that it required *no irrecoverable loss spikes or training resets* ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=Learning%20stages%20to%20fully%20harness,model%20checkpoints%20are%20available%20at)) is a testament to the robustness of their training recipe, which itself is an efficiency win (no downtime or wasted work due to instabilities).\n", | |
"\n", | |
"The model’s release will allow researchers to further examine these techniques in practice, accelerating adoption. Economically, it lowers the barrier for others (with enough GPU access) to train comparable models.\n", | |
"\n", | |
"### 4.5 Conclusions and Lessons \n", | |
"DeepSeek-V3’s development illustrates a few overarching lessons for efficient large-scale ML:\n", | |
"- **Think Sparse, Not Just Dense:** Leverage the insight that not all parameters are needed for all data. Activate subsets intelligently (MoE), and invest in making that work (balance and route optimally).\n", | |
"- **Every Byte and FLOP Counts:** At this scale, one must optimize everything – precision, memory usage, parallel execution, communication. Small inefficiencies multiply massively. By addressing them, you convert what would be waste into useful work.\n", | |
"- **Holistic Optimization:** There’s no single silver bullet; it’s the *combined effect* of multiple optimizations that yields a breakthrough. DeepSeek-V3 combined algorithmic innovation (MoE, MTP), numerical innovation (FP8), and systems innovation (DualPipe, custom kernels). Dropping any of these might have made the training impractical or the model less competitive.\n", | |
"- **Hardware-Software Co-evolution:** They gave feedback to hardware designers (Section 3.5 of report) on what would help further: e.g., better network hardware, support for certain data types ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=1,23)). This dialogue between model researchers and hardware teams will shape future AI accelerators that can handle even larger models more efficiently.\n", | |
"\n", | |
"In conclusion, DeepSeek-V3 stands as a **definitive example of efficient large-scale training** – achieving unprecedented scale and performance on a reasonable budget through careful optimization at every level. The strategies detailed in this master class provide a road map for anyone aiming to push the boundaries of AI model size *without pushing the boundaries of available resources*. Each section, from CUDA/PTX tricks to MoE architecture, contributed critically to making DeepSeek-V3 possible. By studying and applying these techniques, the next generation of models (perhaps multi-trillion parameter ones) can be trained in an economically feasible and technically sustainable manner ([[2412.19437] DeepSeek-V3 Technical Report](https://ar5iv.org/html/2412.19437v1#S2.SS1.SSS2#:~:text=Through%20the%20co,to%20further%20scale%20up%20the)) ([The DeepSeek Series: A Technical Overview](https://martinfowler.com/articles/deepseek-papers.html#:~:text=compression%2C%20mixture,parameter)).\n", | |
"\n" | |
] | |
} | |
], | |
"metadata": {}, | |
"nbformat": 4, | |
"nbformat_minor": 5 | |
} |