DeepSeek-v3

Trying to understand the technical breakthroughs in DeepSeek-V3, particularly its pre-training efficiency.

Perplexity thread highlights:

The pre-training efficiency of DeepSeek-V3 is attributed to several key innovations:

  1. FP8 Mixed Precision Framework: reduces GPU memory usage and accelerates computation during training[5] (see the FP8 sketch after this list).
  2. DualPipe Algorithm: optimizes pipeline parallelism by overlapping computation and communication, minimizing idle time and scaling efficiently across nodes[5] (overlap sketch below).
  3. Multi-Token Prediction (MTP): densifies the training signal by having each position predict more than one future token, improving data efficiency and model performance[3] (MTP sketch below).
  4. Efficient Mixture-of-Experts (MoE) Architecture: only a small subset of expert parameters is activated per token, reducing computational overhead while maintaining performance[2][3] (MoE sketch below).
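
A minimal sketch of the FP8 idea in PyTorch (assumes a recent build that ships the `torch.float8_e4m3fn` dtype): weights are stored in one-byte FP8 with a per-tensor scale and dequantized around the matmul. The per-tensor scaling and bf16 emulation here are simplifications for illustration, not DeepSeek's actual framework or kernels.

```python
import torch

FP8_E4M3_MAX = 448.0  # largest finite value representable in float8 E4M3

def quantize_fp8(x: torch.Tensor):
    """Per-tensor scaling: map the tensor's max |value| onto the FP8 range,
    then store it as 1-byte float8 plus a single float scale."""
    scale = max(float(x.abs().max()), 1e-12) / FP8_E4M3_MAX
    x_fp8 = (x / scale).to(torch.float8_e4m3fn)
    return x_fp8, scale

def fp8_linear(x: torch.Tensor, w_fp8: torch.Tensor, w_scale: float) -> torch.Tensor:
    """Dequantize the weight to bf16 for the matmul; hardware FP8 GEMMs instead
    multiply in FP8 and accumulate in higher precision, which this only emulates."""
    w = w_fp8.to(torch.bfloat16) * w_scale
    return x.to(torch.bfloat16) @ w.t()

# Toy usage: weights live in FP8 (half the memory of bf16), activations are cast on the fly.
w = torch.randn(512, 1024)
w_fp8, w_scale = quantize_fp8(w)
x = torch.randn(4, 1024)
print(fp8_linear(x, w_fp8, w_scale).shape)  # torch.Size([4, 512])
```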
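DualPipe itself is a bidirectional pipeline schedule; the toy below only illustrates the underlying overlap principle it relies on: keep "communication" in flight on another thread while "compute" runs, so the two costs stop adding up. The sleep-based tasks are placeholders, not real kernels or collectives.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def compute(chunk: int) -> None:
    """Stand-in for attention/MLP compute on one micro-batch chunk."""
    time.sleep(0.05)

def communicate(chunk: int) -> None:
    """Stand-in for the cross-node all-to-all dispatch/combine of a chunk."""
    time.sleep(0.05)

start = time.time()
with ThreadPoolExecutor(max_workers=1) as comm_thread:
    for chunk in range(8):
        # launch this chunk's communication asynchronously ...
        pending = comm_thread.submit(communicate, chunk)
        # ... and keep the device busy with compute while it is in flight
        compute(chunk)
        pending.result()
print(f"overlapped: {time.time() - start:.2f}s  (fully serial would be ~0.80s)")
```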
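A toy version of the MTP training signal: a small extra block and head predict the token two steps ahead, so each position contributes two loss terms instead of one. The GRU trunk and the 0.3 loss weight are placeholders I chose for brevity, not the paper's architecture or hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMTPModel(nn.Module):
    """Toy decoder with one extra multi-token-prediction head:
    the main head predicts token t+1, the MTP head predicts token t+2."""

    def __init__(self, vocab=1000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.trunk = nn.GRU(dim, dim, batch_first=True)      # stand-in for the transformer trunk
        self.head_next = nn.Linear(dim, vocab)               # predicts t+1
        self.mtp_block = nn.GRU(dim, dim, batch_first=True)  # small extra block for MTP
        self.head_next2 = nn.Linear(dim, vocab)              # predicts t+2

    def forward(self, tokens):
        h, _ = self.trunk(self.embed(tokens))
        h2, _ = self.mtp_block(h)
        return self.head_next(h), self.head_next2(h2)

model = TinyMTPModel()
tokens = torch.randint(0, 1000, (2, 16))
logits1, logits2 = model(tokens)

# Each position now gets two training signals instead of one.
loss_next  = F.cross_entropy(logits1[:, :-1].reshape(-1, 1000), tokens[:, 1:].reshape(-1))
loss_next2 = F.cross_entropy(logits2[:, :-2].reshape(-1, 1000), tokens[:, 2:].reshape(-1))
loss = loss_next + 0.3 * loss_next2   # 0.3 is an arbitrary MTP loss weight
print(loss.item())
```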
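And a generic top-k-gated MoE layer showing why only a fraction of the parameters run per token. DeepSeek's router differs in its details (e.g. shared experts and its load-balancing scheme), so treat this purely as an illustration of sparse activation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Toy MoE layer: a router scores all experts, but only the top-k experts
    actually run for each token, so per-token compute stays small even when
    the total parameter count is large."""

    def __init__(self, dim=64, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(dim, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        )

    def forward(self, x):                      # x: (tokens, dim)
        scores = self.router(x)                # (tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # normalize over the chosen experts only
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e       # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

moe = TopKMoE()
tokens = torch.randn(10, 64)
print(moe(tokens).shape)   # (10, 64); only 2 of the 8 experts ran per token
```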