The DeepSeek Series: A Technical Overview
The appearance of DeepSeek's large language models has caused a lot of discussion and
angst since their latest versions appeared at the beginning of 2025. But much of the value of
DeepSeek's work comes from the papers they have published over the last year. This article
provides an overview of these papers, highlighting three main arcs in this research: a focus
on improving cost and memory efficiency, the use of HPC Co-Design to train large models on
limited hardware, and the development of emergent reasoning from large-scale
reinforcement learning.
06 February 2025
Shayan Mohanty
Shayan Mohanty is the Head of AI Research at Thoughtworks, where his group focuses on
foundational research to bridge the gap between AI development and production. Previously,
he was CEO and Co-Founder of Watchful, a startup that built software to automate the
process of data labeling for AI. Shayan has spent a decade leading data engineering teams at
various companies, including Facebook, where he led the stream processing team
responsible for processing 100% of the ads metrics data for all FB products. He is also a
Guest Scientist at Los Alamos National Laboratory and has given talks on topics ranging
from Automata Theory to Machine Teaching.
Contents
This article provides a cohesive overview of four technical reports from DeepSeek:
1. DeepSeek-LLM (Jan '24): an early investigation of scaling laws and data-model
tradeoffs.
2. DeepSeek-V2 (Jun '24): introducing Multi-Head Latent Attention (MLA) and
DeepSeekMoE to improve memory and training efficiency.
3. DeepSeek-V3 (Dec '24): scaling sparse MoE networks to 671B parameters, with FP8
mixed precision training and intricate HPC co-design.
4. DeepSeek-R1 (Jan '25): building upon the efficiency foundations of the previous papers
and using large-scale reinforcement learning to incentivize emergent chain-of-thought
capabilities, including a “zero-SFT” variant.
For additional context on DeepSeek itself and the market backdrop that has caused claims
made by the DeepSeek team to be taken out of context and spread widely, please take a
look at my colleague Prasanna Pendse's post: Demystifying Deepseek. For the purposes of
this article, we'll be focusing analysis and commentary on the technical work itself, its merits,
and what it may signal for the future.
Much of this article assumes significant knowledge of the terminology and concepts of
building LLMs, more so than is typical for articles on this site. In future weeks we hope to expand this article with explanations of these concepts, making it easier to follow for those not familiar with this world. We shall post any such updates on this site's
usual channels.
All four papers revolve around a singular challenge: building ever-larger language models
with minimal cost, memory overhead, and training instability. In each iteration, the authors
refine both architecture and infrastructure - a strategy often referred to as HPC co-design.
Cost and Memory Efficiency: Methods like Multi-Head Latent Attention (MLA)
compression, mixture-of-experts (MoE), and FP8-based optimizations all aim to make
massive-scale training and inference feasible.
Sparsity + HPC Co-Design: From V2 to V3, we see mixture-of-experts architecture
evolve alongside specialized HPC scheduling—allowing 671B-parameter models to be
trained on H800 clusters without blowing up the budget.
Emergent Reasoning: In R1, large-scale Reinforcement Learning (RL) unlocks
advanced chain-of-thought capabilities, culminating in “R1-Zero” and its purely RL-
driven approach to reasoning tasks.
The authors set out to answer an important question: Given a fixed compute budget for pre-
training, how do we choose the scale of the model and how much training data to use? Prior
studies (e.g. Chinchilla vs. GPT-3) differed on the ratio between these two factors.
DeepSeek-LLM addresses this by measuring scale in a different way. Where earlier work measured scale in terms of how many parameters were in the model, DeepSeek-LLM instead measures scale as non-embedding FLOPs/token1. They then found they could predict computation with:
C = M × D
where C is the compute budget, M is the non-embedding FLOPs per token, and D is the number of training tokens. This more granular representation helps them predict how a 7B or 67B model might train on 2T tokens of bilingual data.
Training Instability
A central concern they grapple with is training instability (sudden irrecoverable divergences
in the training process), which can often manifest in large-scale language models—
especially those with mixture-of-experts or very long contexts.
By carefully tuning learning rates, batch sizes, and other hyperparameters2, DeepSeek-LLM
demonstrates that stable large-scale training is achievable, but it requires meticulous design
of the architecture of the transformer model together with the infrastructure of the High
Performance Computing (HPC) data center used to train it. This interwoven design of both
architecture and infrastructure is called HPC Co-Design.
A point the authors make is about how data quality shifts the optimal ratio—i.e., higher-
quality data can justify a bigger model for the same number of tokens. You can intuit this by
imagining two scenarios:
Low-quality data: much of it is redundant or noisy, so extra parameters quickly hit diminishing returns.
High-quality data: each token is more "information-rich," so the model can "afford" to use more parameters without hitting diminishing returns prematurely.
In other words, when data is denser in useful information, scaling the model further pays off
because each parameter can learn from richer signals.
Key Takeaways
Hyperparameter Scaling: They propose simple power-law fits to pick batch size and learning rate as compute C grows (see the sketch after this list).
Bilingual Data: They train two base sizes (7B, 67B) on 2T tokens covering
English/Chinese, then do Supervised Fine Tuning (SFT) and a simpler preference-
based alignment called Direct Preference Optimization (DPO).
Results: The resulting DeepSeek-LLM 67B "Outperforms LLaMA-2 70B" on math/coding
tasks, illustrating how HPC co-designed approaches can keep training stable while
efficiently pushing scale.
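As a toy illustration of the hyperparameter-scaling point above, the sketch below simply evaluates placeholder power laws; the exponents and coefficients are assumptions for illustration, not the values fitted in the paper.

```python
# Toy power-law schedules in the spirit of "pick batch size and learning rate
# from compute C". Exponents and coefficients are placeholders, not fitted values.
def batch_size(C: float, a: float = 0.33, k: float = 0.3) -> float:
    return k * C ** a          # batch size grows with compute

def learning_rate(C: float, b: float = -0.12, k: float = 0.6) -> float:
    return k * C ** b          # learning rate shrinks with compute

for C in (1e20, 1e21, 1e22):
    print(f"C={C:.0e}: batch~{batch_size(C):.3e}, lr~{learning_rate(C):.3e}")
```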
The seeds planted here - scaling laws and infrastructure for extremely large training - will
reappear in subsequent works.
Building on V2, DeepSeek-V3 further extends sparse models to 671B parameters (37B
activated), training on 14.8T tokens in under 2.8M H800 GPU hours. The authors credit
extensive HPC co-design:
1. Refined MLA
2. Refined DeepSeekMoE
3. Co-Designed Training & Inference Frameworks
Refined MLA
Multi-Head Latent Attention was introduced in V2 to reduce KV cache overhead, and V3 refines it further. Together, these MLA refinements ensure that while DeepSeek-V3 can attend across very long sequences, the memory overhead remains manageable.
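As a rough sketch of the underlying idea (not DeepSeek's actual implementation), the snippet below caches only a compressed latent per token and reconstructs keys and values on demand. The dimensions and projection matrices are arbitrary, and details such as per-head projections and decoupled rotary embeddings are omitted.

```python
import numpy as np

# Minimal sketch of latent KV compression: cache a small latent per token instead
# of full per-head keys and values, and rebuild K/V when attention runs.
d_model, d_latent, n_heads, d_head = 1024, 128, 8, 64

rng = np.random.default_rng(0)
W_down = rng.standard_normal((d_model, d_latent)) * 0.02            # compress hidden state
W_up_k = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02   # reconstruct keys
W_up_v = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02   # reconstruct values

latent_cache = []  # what actually gets stored per token

def append_token(h):
    """Cache only the compressed latent for this token's hidden state h."""
    latent_cache.append(h @ W_down)                # shape: (d_latent,)

def materialize_kv():
    """Rebuild full keys/values from the cached latents when attention runs."""
    c = np.stack(latent_cache)                     # (seq_len, d_latent)
    k = (c @ W_up_k).reshape(len(c), n_heads, d_head)
    v = (c @ W_up_v).reshape(len(c), n_heads, d_head)
    return k, v

for _ in range(16):                                # simulate a 16-token sequence
    append_token(rng.standard_normal(d_model))

k, v = materialize_kv()
full = 2 * 16 * n_heads * d_head                   # floats a standard KV cache would store
print(f"cached floats for 16 tokens: {16 * d_latent} vs {full} for a standard KV cache")
```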
On the MoE side, DeepSeek-V3 drops the auxiliary-loss approach from V2. Instead of an explicit penalty term, each expert acquires a dynamic bias $b_i$. If an expert is overloaded at a step, $b_i$ decreases; if underloaded, $b_i$ increases. The gating decision then adds $b_i$ to the token's affinity:

$$s'_{i,t} = s_{i,t} + b_i$$
Key Improvements:
Higher Stability: By removing auxiliary losses, they avoid potential interference with
the main training objective, focusing purely on the intrinsic gating signals plus bias
adjustments.
Hence, the final feed-forward module is a combination of a small set of shared experts plus
up to 8 specialized experts chosen adaptively.
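A minimal sketch of this bias-adjusted routing idea is shown below. The expert count, top-k value, and bias update speed are illustrative assumptions, and real DeepSeek-V3 routing additionally includes shared experts and node-limited dispatch.

```python
import numpy as np

# Sketch of auxiliary-loss-free load balancing via a per-expert bias.
n_experts, top_k, gamma = 16, 8, 0.001     # gamma: bias update speed (assumed value)
bias = np.zeros(n_experts)

def route(affinities):
    """affinities: (n_tokens, n_experts) token-to-expert scores s_{i,t}."""
    biased = affinities + bias                          # s'_{i,t} = s_{i,t} + b_i
    topk = np.argsort(-biased, axis=1)[:, :top_k]       # selection uses biased scores
    rows = np.arange(affinities.shape[0])[:, None]
    gates = affinities[rows, topk]                      # gate weights use original affinities
    gates = gates / gates.sum(axis=1, keepdims=True)
    return topk, gates

def update_bias(topk):
    """After each step: lower the bias of overloaded experts, raise underloaded ones."""
    global bias
    load = np.bincount(topk.ravel(), minlength=n_experts).astype(float)
    target = topk.size / n_experts                      # perfectly balanced load
    bias -= gamma * np.sign(load - target)

affin = np.random.default_rng(0).random((1024, n_experts))
chosen, gates = route(affin)
update_bias(chosen)
```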
Scaling an MoE model to 671B demanded HPC-level solutions for training and inference.
The authors emphasize:
FP8 Mixed-Precision Training
They adopt an FP8 data format for General Matrix Multiplications (GEMMs), halving memory. The risk is a reduced numeric range, so they offset it with fine-grained (block- and tile-wise) scaling, higher-precision accumulation, and keeping numerically sensitive components in higher-precision formats.
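The snippet below sketches the block-wise scaling idea in plain numpy. Since numpy has no FP8 type, float16 stands in for the low-precision format, and the block size is an arbitrary choice rather than DeepSeek's.

```python
import numpy as np

# Toy sketch of block-wise scaling before casting to a low-precision format.
def quantize_blockwise(x, block=128, dtype=np.float16):
    """Scale each (block x block) tile into a narrow range, then cast down."""
    out = np.empty_like(x, dtype=dtype)
    scales = {}
    for i in range(0, x.shape[0], block):
        for j in range(0, x.shape[1], block):
            tile = x[i:i+block, j:j+block]
            s = np.abs(tile).max() + 1e-12            # one scale per tile
            out[i:i+block, j:j+block] = (tile / s).astype(dtype)
            scales[(i, j)] = s
    return out, scales

def dequantize_blockwise(q, scales, block=128):
    x = q.astype(np.float32)
    for (i, j), s in scales.items():
        x[i:i+block, j:j+block] *= s
    return x

a = np.random.default_rng(0).standard_normal((256, 256)).astype(np.float32) * 50
q, s = quantize_blockwise(a)
print(f"max abs reconstruction error: {np.abs(dequantize_blockwise(q, s) - a).max():.4f}")
```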
DualPipe Parallelism
They propose DualPipe to overlap forward/backward computation with the MoE all-to-all
dispatch. It rearranges pipeline stages to ensure that network communication (particularly
across InfiniBand) is hidden behind local matrix multiplications.
They tune warp-level instructions in PTX (a level lower than CUDA), auto-tuning the
chunk size for all-to-all dispatch.
Dynamically partition Streaming Multiprocessors (SMs) into communication vs. compute tasks so that token dispatch never stalls local GEMMs.
As a result, training costs were cut to 2.8M H800 GPU hours per run - low for a 14.8T token
corpus.
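The toy sketch below illustrates only the general principle of hiding communication latency behind local computation (it is not DualPipe): a simulated all-to-all runs on a separate thread while local matrix multiplications proceed, so the wall-clock cost approaches the larger of the two rather than their sum.

```python
import threading, time
import numpy as np

def fake_all_to_all(latency_s=0.5):
    """Stand-in for an expert-parallel all-to-all dispatch over the network."""
    time.sleep(latency_s)

def local_gemms(n=20, size=1024):
    """Stand-in for local expert computation (a batch of GEMMs)."""
    a = np.random.rand(size, size).astype(np.float32)
    b = np.random.rand(size, size).astype(np.float32)
    for _ in range(n):
        a @ b

# Serial: communication, then compute
t0 = time.time()
fake_all_to_all()
local_gemms()
serial = time.time() - t0

# Overlapped: communication runs while compute proceeds
t0 = time.time()
comm = threading.Thread(target=fake_all_to_all)
comm.start()
local_gemms()
comm.join()
overlapped = time.time() - t0

print(f"serial: {serial:.2f}s, overlapped: {overlapped:.2f}s")
```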
Outcomes
The resulting DeepSeek-V3 excels at code, math, and some multilingual tasks,
outperforming other open-source LLMs of similar scale. Deep HPC co-design (FP8,
DualPipe, PTX-level optimization) plus refined MLA/MoE implementation achieve extreme
scale with stable training.
All prior DeepSeek releases used SFT (plus occasional RL). By contrast, DeepSeek-R1-Zero tries an extreme: no supervised warmup, just RL from the base model. They adopt Group Relative Policy Optimization (GRPO), which samples a group of outputs per prompt, scores them, and computes advantages relative to the group rather than relying on a separately trained critic (value model).
The reward function for the R1 models is rule-based - a simple weighted sum of two components:
Accuracy Reward - if the task has an objective correct answer (e.g. a math problem,
coding task, etc.), correctness is verified using mathematical equation solvers for step-
by-step proof checking, and code execution & test cases for code correctness
verification
Format Reward - the model is rewarded for following a structured reasoning process
using explicit reasoning markers <think></think> and <answer></answer>
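A minimal sketch of such a rule-based reward might look like the following; the tag-matching regex, the exact-match accuracy check, and the weights are assumptions standing in for the solvers and test harnesses described above.

```python
import re

# Sketch of a rule-based reward: weighted sum of an accuracy check and a format check.
THINK_ANSWER = re.compile(r"<think>.*?</think>\s*<answer>(.*?)</answer>", re.DOTALL)

def format_reward(completion: str) -> float:
    """1.0 if the completion follows the <think>...</think><answer>...</answer> layout."""
    return 1.0 if THINK_ANSWER.search(completion) else 0.0

def accuracy_reward(completion: str, reference: str) -> float:
    """1.0 if the extracted answer matches the reference (stand-in for solvers/test runs)."""
    m = THINK_ANSWER.search(completion)
    if not m:
        return 0.0
    return 1.0 if m.group(1).strip() == reference.strip() else 0.0

def reward(completion: str, reference: str, w_acc=1.0, w_fmt=0.5) -> float:
    return w_acc * accuracy_reward(completion, reference) + w_fmt * format_reward(completion)

print(reward("<think>2+2=4</think><answer>4</answer>", "4"))   # 1.5
```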
For each prompt, GRPO samples a group of $G$ outputs and normalizes their rewards into advantages:

$$A_i = \frac{r_i - \mathrm{mean}(\{r_1, r_2, \ldots, r_G\})}{\mathrm{std}(\{r_1, r_2, \ldots, r_G\})}$$

where $r_i$ is the reward calculated for the given output. The model's policy is updated to favor responses with higher rewards, while a clipping function constrains each update so that the new policy remains close to the old.
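In code, the group-relative advantage and a PPO-style clipped update term might look like the sketch below; the clipping threshold, ratios, and group rewards are illustrative values, and the KL regularization toward a reference policy that GRPO also uses is omitted.

```python
import numpy as np

def group_advantages(rewards):
    """A_i = (r_i - mean) / std, computed within one group of sampled outputs."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

def clipped_objective(ratio, advantage, epsilon=0.2):
    """PPO-style clipping: keeps the new policy close to the old one."""
    return np.minimum(ratio * advantage,
                      np.clip(ratio, 1 - epsilon, 1 + epsilon) * advantage)

rewards = [1.5, 0.0, 1.0, 0.5]                 # rewards for G=4 sampled outputs
adv = group_advantages(rewards)                # A_i for each output
print(adv)
print(clipped_objective(np.array([1.3, 0.7, 1.0, 1.1]), adv))
```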
In so many words: the authors created a testing/verification harness around the model, which they exercised using reinforcement learning, gently guiding the model with simple Accuracy and Format rewards. In doing so, emergent reasoning behaviors were observed: the model learned to produce longer chains of thought, to re-check its own intermediate steps, and to revisit its reasoning when it spotted a mistake.
R1-Zero is probably the most interesting outcome of the R1 paper for researchers because it
learned complex chain-of-thought patterns from raw reward signals alone. However, the
model exhibited notable issues:
Readability Problems: Because it never saw any human-curated language style, its
outputs were sometimes jumbled or mixed multiple languages.
Instability in Non-Reasoning Tasks: Lacking SFT data for general conversation, R1-
Zero would produce valid solutions for math or code but be awkward on simpler Q&A or
safety prompts.
Limited Domain: Rule-based rewards worked well for verifiable tasks (math/coding),
but handling creative/writing tasks demanded broader coverage.
Hence, the authors concluded that while “pure RL” yields strong reasoning in verifiable tasks,
the model’s overall user-friendliness was lacking. This led them to DeepSeek-R1: an
alignment pipeline combining small cold-start data, RL, rejection sampling, and more RL, to
“fill in the gaps” from R1-Zero’s deficits.
Stage 1: Cold-Start SFT
First, they gather a small set of curated, readable chain-of-thought examples and run a short SFT pass on the base model. This ensures the model acquires a consistent, human-friendly output format before reinforcement learning begins.
In essence, the authors realized you can avoid the “brittleness” of a zero-SFT approach by
giving the model a seed of user-friendly behaviors.
Stage 2: Reasoning-Oriented RL
Next, as in R1-Zero, they apply large-scale RL for tasks like math and code. The difference is
that now the model starts from a “cold-start SFT” checkpoint—so it retains decent language
style while still learning verifiable tasks from a rule-based or tool-based reward. This RL
stage fosters the same emergent chain-of-thought expansions but without the random
“language mixing” or bizarre structure.
Once that RL converges, they generate multiple completions per prompt from the RL
checkpoint. Using a combination of automatic verifiers and some human checks, they pick
the best outputs (“rejection sampling”) and build a new SFT dataset. They also incorporate
standard writing/factual/safety data from DeepSeek-V3 to keep the model balanced in non-
verifiable tasks. Finally, they re-fine-tune the base model on this curated set.
This step addresses the “spotty coverage” problem even further: The best RL answers
become training targets, so the model improves at chain-of-thought and clarity.
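A simplified sketch of this rejection-sampling loop is below; generate and verify are hypothetical hooks standing in for the RL checkpoint's sampler and the automatic verifiers plus human checks described above.

```python
from typing import Callable, List, Tuple

def build_sft_dataset(prompts: List[str],
                      generate: Callable[[str, int], List[str]],
                      verify: Callable[[str, str], float],
                      samples_per_prompt: int = 16) -> List[Tuple[str, str]]:
    """Sample several completions per prompt, keep only verified ones,
    and collect the best-scoring completion as a new SFT example."""
    dataset = []
    for prompt in prompts:
        completions = generate(prompt, samples_per_prompt)
        scored = [(verify(prompt, c), c) for c in completions]
        accepted = [(s, c) for s, c in scored if s > 0]    # reject failed completions
        if accepted:
            best = max(accepted)[1]                        # keep the best-scoring one
            dataset.append((prompt, best))
    return dataset
```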
Lastly, they do another RL pass on diverse prompts - not just math/code but general helpfulness, safety, and role-playing tasks. Rewards may come from a combination of rule-based checks and large "preference" models (trained from user preference pairs). The final result is a model that combines the strong chain-of-thought reasoning developed through RL with the general helpfulness and safety of a conventionally aligned model.
Despite covering different angles - scaling laws, MoE, HPC scheduling, and large-scale RL - DeepSeek's work consistently follows the arcs laid out earlier: cost and memory efficiency, sparsity with HPC co-design, and emergent reasoning.
Taken as a whole, the DeepSeek series highlights how architecture, algorithms, frameworks,
and hardware must be co-designed to handle LLM training at trillion-token scales. Looking to
the future, it indicates that toolchain builders may want to find ways to capture some of these
HPC optimizations as part of the model compilation path or training apparatus, and AI
research teams may want to work closely with HPC expertise even in the early days of
architecture ideation.
Footnotes
1: Non-embedding FLOPs are the floating point operations used for pre-training certain layers of the transformer (the non-embedding ones). The authors found only some layers contributed to the scaling formula.
2: A model consists of billions of internal variables, which are called its parameters. These parameters gain their values (weights) during training. Before training, developers set a number of different variables that control the training process itself; these are called hyperparameters.