
martinfowler.com/articles/deepseek-papers.html

The DeepSeek Series: A Technical Overview

The appearance of DeepSeek Large-Language Models has caused a lot of discussion and
angst since their latest versions appeared at the beginning of 2025. But much of the value of
DeepSeek's work comes from the papers they have published over the last year. This article
provides an overview of these papers, highlighting three main arcs in this research: a focus
on improving cost and memory efficiency, the use of HPC Co-Design to train large models on
limited hardware, and the development of emergent reasoning from large-scale
reinforcement learning.

06 February 2025

Shayan Mohanty

Shayan Mohanty is the Head of AI Research at Thoughtworks, where his group focuses on
foundational research to bridge the gap between AI development and production. Previously,
he was CEO and Co-Founder of Watchful, a startup that built software to automate the
process of data labeling for AI. Shayan has a decade of experience leading data engineering teams at
various companies including Facebook, where he led the stream processing team
responsible for processing 100% of the ads metrics data for all FB products. He is also a
Guest Scientist at Los Alamos National Laboratory and has given talks on topics ranging
from Automata Theory to Machine Teaching.


This article provides a cohesive overview of four technical reports from DeepSeek:

1. DeepSeek-LLM (Jan '24): an early investigation of scaling laws and data-model
tradeoffs.
2. DeepSeek-V2 (Jun '24): introducing Multi-Head Latent Attention (MLA) and
DeepSeekMoE to improve memory and training efficiency.
3. DeepSeek-V3 (Dec '24): scaling sparse MoE networks to 671B parameters, with FP8
mixed precision training and intricate HPC co-design.
4. DeepSeek-R1 (Jan '25): building upon the efficiency foundations of the previous papers
and using large-scale reinforcement learning to incentivize emergent chain-of-thought
capabilities, including a “zero-SFT” variant.

For additional context on DeepSeek itself and the market backdrop that has caused claims
made by the DeepSeek team to be taken out of context and spread widely, please take a
look at my colleague Prasanna Pendse's post: Demystifying Deepseek. For the purposes of
this article, we'll be focusing analysis and commentary on the technical work itself, its merits,
and what it may signal for the future.

Much of this article assumes significant knowledge of the terminology and concepts of
building LLMs, more so than is typical for articles on this site. In future weeks we hope to
expand this article to provide explanations of these concepts to make this article easier to
follow for those not familiar with this world. We shall post any such updates on this site's
usual channels.

All four papers revolve around a singular challenge: building ever-larger language models
with minimal cost, memory overhead, and training instability. In each iteration, the authors
refine both architecture and infrastructure - a strategy often referred to as HPC co-design.

Key arcs in this series include:

Cost and Memory Efficiency: Methods like Multi-Head Latent Attention (MLA)
compression, mixture-of-experts (MoE), and FP8-based optimizations all aim to make
massive-scale training and inference feasible.
Sparsity + HPC Co-Design: From V2 to V3, we see mixture-of-experts architecture
evolve alongside specialized HPC scheduling—allowing 671B-parameter models to be
trained on H800 clusters without blowing up the budget.
Emergent Reasoning: In R1, large-scale Reinforcement Learning (RL) unlocks
advanced chain-of-thought capabilities, culminating in “R1-Zero” and its purely RL-
driven approach to reasoning tasks.

DeepSeek-LLM: Laying the Foundation


Motivation & Overview

The authors set out to answer an important question: Given a fixed compute budget for pre-
training, how do we choose the scale of the model and how much training data to use? Prior
studies (e.g. Chinchilla vs. GPT-3) differed on the ratio between these two factors.
DeepSeek-LLM addresses that by measuring scale in a different way. Earlier work measured
scale in terms of how many parameters were in the model; DeepSeek-LLM instead
measured scale as non-embedding FLOPs/token¹. They then found they could predict
computation with:


C=M×D

where C is the compute budget, M is non-embedding FLOPs/token, and D is data size.

This more granular representation helps them predict how a 7B or 67B model might train on
2T tokens of bilingual data.
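
To make the relationship concrete, here is a small illustrative sketch of using C = M × D to compare how far a fixed compute budget goes for two candidate model scales. The FLOPs/token figures and the budget below are hypothetical placeholders, not numbers from the paper.

```python
# Illustrative sketch (not DeepSeek's code): using C = M * D to compare
# how many tokens a fixed compute budget allows for two candidate model scales.
# The FLOPs/token values and the budget are hypothetical placeholders.

def tokens_for_budget(compute_budget_flops: float, flops_per_token: float) -> float:
    """Given C = M * D, solve for D (tokens) at a fixed compute budget C."""
    return compute_budget_flops / flops_per_token

# Hypothetical non-embedding FLOPs/token (M) for two model scales.
candidates = {
    "7B-class":  4.0e10,   # placeholder M for a smaller model
    "67B-class": 3.9e11,   # placeholder M for a larger model
}

budget = 1.0e24  # placeholder compute budget C, in FLOPs

for name, m in candidates.items():
    d = tokens_for_budget(budget, m)
    print(f"{name}: ~{d:.2e} tokens trainable under the budget")
```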

Training Instability

A central concern they grapple with is training instability (sudden irrecoverable divergences
in the training process), which can often manifest in large-scale language models—
especially those with mixture-of-experts or very long contexts.

By carefully tuning learning rates, batch sizes, and other hyperparameters², DeepSeek-LLM
demonstrates that stable large-scale training is achievable, but it requires meticulous design
of the architecture of the transformer model together with the infrastructure of the High
Performance Computing (HPC) data center used to train it. This interwoven design of both
architecture and infrastructure is called HPC Co-Design.

Data Quality & Model Scale

A point the authors make is about how data quality shifts the optimal ratio—i.e., higher-
quality data can justify a bigger model for the same number of tokens. You can intuit this by
imagining two scenarios:

Scenario A: You have a 100-billion-token corpus full of duplicates, spammy text, or incomplete sentences. The model might not glean much new knowledge because the data is partly redundant or low-value.
Scenario B: You have a carefully curated 100-billion-token corpus with broad coverage of code, math, multi-lingual dialogues, factual text, etc. Each token is more “information-rich,” so the model can “afford” to use more parameters without hitting diminishing returns prematurely.

In other words, when data is denser in useful information, scaling the model further pays off
because each parameter can learn from richer signals.

Key Takeaways

Hyperparameter Scaling: They propose simple power-law fits to pick batch size and
learning rate as compute C grows (see the sketch after this list).
Bilingual Data: They train two base sizes (7B, 67B) on 2T tokens covering
English/Chinese, then do Supervised Fine Tuning (SFT) and a simpler preference-
based alignment called Direct Preference Optimization (DPO).
Results: The resulting DeepSeek-LLM 67B “outperforms LLaMA-2 70B” on math/coding
tasks, illustrating how HPC co-designed approaches can keep training stable while
efficiently pushing scale.
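
As a rough illustration of the “Hyperparameter Scaling” takeaway, the sketch below fits a power law of the form lr ≈ a·C^b to a handful of (compute, best learning rate) pairs via a log-log linear fit. The data points and resulting coefficients here are invented for illustration; the actual fitted formulas are in the DeepSeek-LLM report.

```python
# Sketch of the power-law idea behind the "Hyperparameter Scaling" takeaway:
# fit lr_opt ~ a * C^b from a few (compute, best learning rate) observations.
# The data points and the resulting coefficients are made up for illustration.
import numpy as np

# Hypothetical (compute budget C, best-found learning rate) pairs from small runs.
compute = np.array([1e17, 1e18, 1e19, 1e20])
best_lr = np.array([6.0e-4, 4.5e-4, 3.4e-4, 2.5e-4])

# A power law lr = a * C^b is linear in log-space: log lr = log a + b * log C.
b, log_a = np.polyfit(np.log(compute), np.log(best_lr), 1)
a = np.exp(log_a)

target_compute = 3e21  # extrapolate to a larger training run
predicted_lr = a * target_compute ** b
print(f"fitted exponent b = {b:.3f}, predicted lr at C = 3e21: {predicted_lr:.2e}")
```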

The seeds planted here - scaling laws and infrastructure for extremely large training - will
reappear in subsequent works.

DeepSeek-V3: HPC Co-Design


Scaling MoE to 671B While Preserving Efficiency

Building on V2, DeepSeek-V3 further extends sparse models to 671B parameters (37B
activated), training on 14.8T tokens in under 2.8M H800 GPU hours. The authors credit
extensive HPC co-design:

Lastly, we emphasize again the economical training costs of DeepSeek-V3,
summarized in Table 1, achieved through our optimized co-design of algorithms,
frameworks, and hardware.

-- DeepSeek-V3 Tech. Report, p.5

The major novelties are:

1. Refined MLA
2. Refined DeepSeekMoE
3. Co-Designed Training & Inference Frameworks

Refined MLA

Multi-Head Latent Attention was introduced in V2 to reduce KV cache overhead. In V3, it is
further refined with several new features:

Dynamic Low-Rank Projection: Instead of a static compression dimension, MLA adjusts how strongly it compresses Key/Value vectors depending on sequence length. For shorter sequences, less compression preserves fidelity; for extremely long sequences (32K–128K tokens), deeper compression manages memory growth.
Adaptive Query Compression: Where V2 used a fixed d_c dimension, V3 employs
an adaptive scaling of the query up/down at different layer depths. Early layers use
higher-dimensional queries for expressiveness; deeper layers more aggressively
compress to save activation memory.
Improved RoPE Handling: V2 only partially decoupled keys, but V3 extends the
concept for more stable 128K context. They track a “decoupled shared key” that
reduces numerical drift in extremely long generations.
Joint KV Storage: V2 stored compressed keys and values separately. V3 merges them
into a shared compressed representation to further reduce memory traffic during multi-
node inference.
Layer-Wise Adaptive Cache: Instead of caching all past tokens for all layers, V3
prunes older KV entries at deeper layers. This helps keep memory usage in check
when dealing with 128K context windows.

Together, these MLA refinements ensure that while DeepSeek-V3 can attend across very
long sequences, the memory overhead remains manageable.
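
A minimal sketch of the low-rank KV-compression idea behind these refinements, with a compression rank that varies by sequence length in the spirit of the dynamic low-rank projection above. The dimensions, thresholds, and rank schedule are illustrative assumptions, not the values used in DeepSeek-V3, and the paired up-projections inside attention are omitted.

```python
# Sketch of low-rank KV compression with a sequence-length-dependent rank.
# Dimensions, thresholds, and the rank schedule are illustrative assumptions.
import torch
import torch.nn as nn

class LatentKVCompressor(nn.Module):
    def __init__(self, d_model: int = 1024, ranks: tuple = (512, 256, 128)):
        super().__init__()
        self.ranks = ranks
        # One down-projection per candidate rank; the matching up-projections
        # would live inside the attention computation and are omitted here.
        self.down = nn.ModuleList([nn.Linear(d_model, r, bias=False) for r in ranks])

    def pick_rank(self, seq_len: int) -> int:
        # Hypothetical schedule: compress more aggressively as context grows.
        if seq_len <= 4096:
            return 0      # mild compression preserves fidelity
        if seq_len <= 32768:
            return 1
        return 2          # deep compression for very long contexts

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: [batch, seq_len, d_model] -> compressed latent [batch, seq_len, rank]
        idx = self.pick_rank(hidden.shape[1])
        return self.down[idx](hidden)

latent = LatentKVCompressor()(torch.randn(1, 8192, 1024))
print(latent.shape)  # torch.Size([1, 8192, 256]) under the schedule above
```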

Refined DeepSeekMoE: Auxiliary-Loss-Free, Higher Capacity

On the MoE side, DeepSeek-V3 drops the auxiliary-loss approach from V2. Instead of an
explicit penalty term, each expert acquires a dynamic bias b_i. If an expert is overloaded at a
step, b_i decreases; if underloaded, b_i increases. The gating decision then adds b_i to the
token's affinity:

s′_{i,t} = s_{i,t} + b_i

Key Improvements:

No Token Dropping: V2 occasionally dropped tokens if certain experts got overloaded, but the new bias-based method keeps everything.
More Activated Experts: They raise the number of routed experts from 6 to 8 per
token, improving representational power.

Higher Stability: By removing auxiliary losses, they avoid potential interference with
the main training objective, focusing purely on the intrinsic gating signals plus bias
adjustments.

Hence, the final feed-forward module is a combination of a small set of shared experts plus
up to 8 specialized experts chosen adaptively.
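
Here is a small sketch of the bias-adjusted routing idea: the bias b_i is added to each expert's affinity only for the routing decision, and is nudged down for overloaded experts and up for underloaded ones. The number of experts, the bias step size, and the shapes are illustrative assumptions, not DeepSeek's actual values.

```python
# Sketch of auxiliary-loss-free gating: route with s'_{i,t} = s_{i,t} + b_i and
# adjust b_i from observed expert load. Step size and shapes are illustrative.
import torch

num_experts, top_k, bias_step = 64, 8, 1e-3
bias = torch.zeros(num_experts)

def route(affinity: torch.Tensor) -> torch.Tensor:
    """affinity: [tokens, num_experts] gating scores s_{i,t}. Returns chosen expert ids."""
    global bias
    adjusted = affinity + bias                      # s'_{i,t} = s_{i,t} + b_i
    chosen = adjusted.topk(top_k, dim=-1).indices   # 8 routed experts per token

    # Measure load and nudge the bias: overloaded experts get a lower bias, and vice versa.
    load = torch.bincount(chosen.flatten(), minlength=num_experts).float()
    target = chosen.numel() / num_experts
    bias = bias - bias_step * torch.sign(load - target)
    return chosen

experts = route(torch.randn(512, num_experts))
print(experts.shape)  # torch.Size([512, 8])
```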

Co-Designed Frameworks: FP8, DualPipe, and PTX Optimizations

Scaling an MoE model to 671B demanded HPC-level solutions for training and inference.
The authors emphasize:

Through the co-design of algorithms, frameworks, and hardware, we overcome
the communication bottleneck in cross-node MoE training, achieving near-full
computation-communication overlap.

-- DeepSeek-V3 Tech. Report, p.5

FP8 Mixed Precision

They adopt an FP8 data format for General Matrix Multiplications (GEMMs), halving memory.
The risk is a reduced numeric range, so they offset it with two measures (sketched in code after the list):

Block-wise scaling (e.g., 1x128 or 128x128 tiles).
Periodic “promotion” to FP32 after short accumulation intervals to avoid overflow/underflow.
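
The sketch below mimics both mitigations in spirit only: a per-tile scale sized so a 128x128 block fills the narrow FP8 range, and partial sums promoted into an FP32 accumulator at short intervals. FP16 stands in for FP8 here, since real FP8 GEMMs live in fused CUDA/PTX kernels; the interval and tensor sizes are illustrative assumptions.

```python
# Conceptual sketch only: per-tile scaling for a narrow-range format, plus
# periodic promotion of low-precision partial sums into an FP32 accumulator.
# FP16 is used as a stand-in for FP8; real kernels are fused CUDA/PTX code.
import torch

FP8_E4M3_MAX = 448.0  # largest representable magnitude in FP8 E4M3

def tile_scale(tile: torch.Tensor) -> float:
    """Scale that maps the tile's largest magnitude onto the FP8 E4M3 range."""
    return FP8_E4M3_MAX / tile.abs().max().clamp(min=1e-12).item()

def dot_with_periodic_promotion(a: torch.Tensor, b: torch.Tensor, interval: int = 128) -> torch.Tensor:
    """Dot product where low-precision partial sums are flushed into an FP32
    accumulator every `interval` elements (the 'promotion' described above)."""
    acc32 = torch.zeros((), dtype=torch.float32)
    for start in range(0, a.numel(), interval):
        chunk = a[start:start + interval].half() * b[start:start + interval].half()
        acc32 += chunk.sum().float()  # promote the short accumulation to FP32
    return acc32

a, b = torch.randn(4096), torch.randn(4096)
print(tile_scale(a.view(64, 64)), dot_with_periodic_promotion(a, b).item())
```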

DualPipe Parallelism

They propose DualPipe to overlap forward/backward computation with the MoE all-to-all
dispatch. It rearranges pipeline stages to ensure that network communication (particularly
across InfiniBand) is hidden behind local matrix multiplications.

PTX-Level & Warp Specialization

To fully exploit InfiniBand (IB) and NVLink:

They tune warp-level instructions in PTX (a level lower than CUDA), auto-tuning the
chunk size for all-to-all dispatch.
They dynamically partition Streaming Multiprocessors (SMs) into communication vs. compute tasks
so that token dispatch never stalls local GEMMs.

As a result, training costs were cut to 2.8M H800 GPU hours per run - low for a 14.8T token
corpus.

Outcomes

The resulting DeepSeek-V3 excels at code, math, and some multilingual tasks,
outperforming other open-source LLMs of similar scale. Deep HPC co-design (FP8,
DualPipe, PTX-level optimization) plus refined MLA/MoE implementation achieve extreme
scale with stable training.

DeepSeek-R1: Reinforcement Learning for Deeper Reasoning

It's worth noting that both DeepSeek-R1 and DeepSeek-R1-Zero are architecturally identical
to DeepSeek-V3 (but use the pretrained-only base version). The only difference between these
models is how post-training is handled.

Emergent Reasoning Behaviors Through RL-Only

All prior DeepSeek releases used SFT (plus occasional RL). By contrast, DeepSeek-R1-Zero
tries an extreme: no supervised warmup, just RL from the base model. They adopt Group
Relative Policy Optimization (GRPO), which:

1. Samples a group of old-policy outputs o_1, ..., o_G
2. Scores each with a reward (in this case, rule-based)
3. Normalizes the advantage A_i by group mean/stdev
4. Optimizes a clipped PPO-like objective

The reward function for the R1 models is rule-based - a simple weighted sum of two
components (a minimal sketch follows the list):

Accuracy Reward - if the task has an objective correct answer (e.g. a math problem,
coding task, etc.), correctness is verified using mathematical equation solvers for step-
by-step proof checking, and code execution & test cases for code correctness
verification
Format Reward - the model is rewarded for following a structured reasoning process
using explicit reasoning markers <think></think> and <answer></answer>
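
A minimal sketch of a rule-based reward of this kind: a weighted sum of an accuracy check and a format check on the <think>/<answer> markers. The weights, regexes, and toy verifier are illustrative assumptions, not DeepSeek's actual reward code; a real accuracy check would run equation solvers or execute code against test cases.

```python
# Minimal sketch of a rule-based reward: accuracy + format, with a toy verifier.
# Weights, regexes, and the verifier are illustrative assumptions.
import re

def format_reward(output: str) -> float:
    """1.0 if the output wraps its reasoning and answer in the expected tags."""
    has_think = re.search(r"<think>.*?</think>", output, re.DOTALL) is not None
    has_answer = re.search(r"<answer>.*?</answer>", output, re.DOTALL) is not None
    return 1.0 if (has_think and has_answer) else 0.0

def accuracy_reward(output: str, reference: str) -> float:
    """Toy verifier: exact match on the extracted answer. Real verifiers would
    use equation solvers or code execution against test cases."""
    m = re.search(r"<answer>(.*?)</answer>", output, re.DOTALL)
    return 1.0 if m and m.group(1).strip() == reference.strip() else 0.0

def reward(output: str, reference: str, w_acc: float = 0.9, w_fmt: float = 0.1) -> float:
    return w_acc * accuracy_reward(output, reference) + w_fmt * format_reward(output)

sample = "<think>2 + 2 is 4</think><answer>4</answer>"
print(reward(sample, "4"))  # 1.0 under these illustrative weights
```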

The relative advantage A_i for a given output is calculated as:

A_i = (r_i - mean({r_1, r_2, ..., r_G})) / std({r_1, r_2, ..., r_G})

where r_i is the reward calculated for the given output. The model's policy is updated to
favor responses with higher rewards while constraining changes using a clipping function
which ensures that the new policy remains close to the old.
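
The sketch below ties these pieces together: group-normalized advantages feeding a PPO-style clipped surrogate objective. The group size, clip range, and toy tensors are illustrative assumptions, and GRPO's additional KL-regularization term toward a reference policy is omitted for brevity.

```python
# Sketch of the GRPO-style update: group-relative advantages plus a clipped
# PPO-like objective. Group size, clip range, and tensors are illustrative.
import torch

def grpo_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
              rewards: torch.Tensor, clip_eps: float = 0.2) -> torch.Tensor:
    """logp_new/logp_old: per-output log-probabilities under the new/old policy.
    rewards: rule-based rewards r_1..r_G for one group of G sampled outputs."""
    # Group-relative advantage: normalize by the group's mean and std.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # PPO-style clipped surrogate: keep the new policy close to the old one.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    return -torch.min(unclipped, clipped).mean()

# One group of G = 8 sampled outputs for the same prompt.
rewards = torch.tensor([1.0, 0.0, 1.0, 0.1, 0.0, 1.0, 0.0, 0.1])
logp_old = torch.randn(8)
logp_new = logp_old + 0.05 * torch.randn(8)
print(grpo_loss(logp_new, logp_old, rewards))
```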

In other words: the authors created a testing/verification harness around the model, which
they exercised using reinforcement learning, and gently guided the model using simple
Accuracy and Format rewards. In doing so, emergent reasoning behaviors were observed:

Self-verification - the model double-checks its own answers
Extended chain-of-thought - the model learns to explain its reasoning more thoroughly
Exploratory reasoning - the model tries different approaches before converging on an answer
Reflection - the model starts questioning its own solutions and adjusting reasoning paths dynamically

R1-Zero is probably the most interesting outcome of the R1 paper for researchers because it
learned complex chain-of-thought patterns from raw reward signals alone. However, the
model exhibited notable issues:

Readability Problems: Because it never saw any human-curated language style, its
outputs were sometimes jumbled or mixed multiple languages.
Instability in Non-Reasoning Tasks: Lacking SFT data for general conversation, R1-
Zero would produce valid solutions for math or code but be awkward on simpler Q&A or
safety prompts.
Limited Domain: Rule-based rewards worked well for verifiable tasks (math/coding),
but handling creative/writing tasks demanded broader coverage.

Hence, the authors concluded that while “pure RL” yields strong reasoning in verifiable tasks,
the model’s overall user-friendliness was lacking. This led them to DeepSeek-R1: an
alignment pipeline combining small cold-start data, RL, rejection sampling, and more RL, to
“fill in the gaps” from R1-Zero’s deficits.

Refined Reasoning Through SFT + RL

DeepSeek-R1 addresses R1-Zero's limitations by injecting a small amount of supervised
data before RL and weaving in additional alignment steps.

Stage 1: “Cold-Start” SFT

They gather a small set (~thousands) of curated, “human-friendly” chain-of-thought examples
covering common sense Q&A, basic math, standard instruction tasks, etc. Then, they do a
short SFT pass on the base model. This ensures the model acquires:

Better readability: Polished language style and formatting.
Non-reasoning coverage: Some conversation, factual QA, or creative tasks not easily rewarded purely by rule-based checks.

In essence, the authors realized you can avoid the “brittleness” of a zero-SFT approach by
giving the model a seed of user-friendly behaviors.

Stage 2: Reasoning-Oriented RL

Next, as in R1-Zero, they apply large-scale RL for tasks like math and code. The difference is
that now the model starts from a “cold-start SFT” checkpoint—so it retains decent language
style while still learning verifiable tasks from a rule-based or tool-based reward. This RL
stage fosters the same emergent chain-of-thought expansions but without the random
“language mixing” or bizarre structure.

Stage 3: Rejection Sampling + Additional SFT

Once that RL converges, they generate multiple completions per prompt from the RL
checkpoint. Using a combination of automatic verifiers and some human checks, they pick
the best outputs (“rejection sampling”) and build a new SFT dataset. They also incorporate
standard writing/factual/safety data from DeepSeek-V3 to keep the model balanced in non-
verifiable tasks. Finally, they re-fine-tune the base model on this curated set.

This step addresses the “spotty coverage” problem even further: The best RL answers
become training targets, so the model improves at chain-of-thought and clarity.
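
A small sketch of this rejection-sampling loop under stated assumptions: `generate` and `verify` are hypothetical stand-ins for the RL checkpoint and the automatic verifiers, and a real pipeline would also include human checks, deduplication, and the additional V3-sourced data mentioned above.

```python
# Sketch of rejection sampling: sample several completions per prompt, keep the
# best verified one, and turn it into a new supervised fine-tuning pair.
# `generate` and `verify` are hypothetical stand-ins, not real APIs.
from typing import Callable, List, Tuple

def rejection_sample(prompts: List[str],
                     generate: Callable[[str, int], List[str]],
                     verify: Callable[[str, str], float],
                     samples_per_prompt: int = 16) -> List[Tuple[str, str]]:
    sft_pairs = []
    for prompt in prompts:
        completions = generate(prompt, samples_per_prompt)
        scored = [(verify(prompt, c), c) for c in completions]
        best_score, best = max(scored, key=lambda sc: sc[0])
        if best_score > 0:                      # keep only verified outputs
            sft_pairs.append((prompt, best))    # becomes a supervised target
    return sft_pairs

# Toy stand-ins so the sketch runs end to end.
demo_pairs = rejection_sample(
    prompts=["What is 2 + 2?"],
    generate=lambda p, n: ["<think>2 + 2 = 4</think><answer>4</answer>"] * n,
    verify=lambda p, c: 1.0 if "<answer>4</answer>" in c else 0.0,
)
print(demo_pairs)
```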

Stage 4: RL for “All Scenarios”

Lastly, they do another RL pass on diverse prompts—not just math/code but general
helpfulness, safety, or role-playing tasks. Rewards may come from a combination of rule-
based checks and large “preference” models (trained from user preference pairs). The final
result is a model that:

Retains strong chain-of-thought for verifiable tasks,
Aligns to broad user requests in everyday usage,
Maintains safer, more controlled outputs.

Connecting the Arcs: Efficiency & Emergence

Despite covering different angles - scaling laws, MoE, HPC scheduling, and large-scale RL -
DeepSeek's work consistently follows these arcs:

1. Cost and Memory Efficiency
They systematically design methods (MLA, MoE gating, device-limited routing, FP8 training, DualPipe) to maximize hardware utilization even in constrained environments.
HPC-level scheduling (PTX instructions, warp specialization) hides communication overhead and overcomes the limitations imposed by limited interconnect speeds on H800s.
2. Sparsity + HPC Co-Design
From V2 to V3, we see an evolving mixture-of-experts approach, culminating in a
671B-parameter model feasible on H800 clusters.
The authors repeatedly stress that HPC co-design is the only path to cheaply train
multi-hundred-billion-parameter LLMs.
3. Emergent Reasoning
R1 pushes beyond standard supervised training, letting RL signals shape deep
chain-of-thought. The synergy between pre-trained scale and targeted post-
training yields advanced reasoning patterns like reflection or multi-step
verification.

Taken as a whole, the DeepSeek series highlights how architecture, algorithms, frameworks,
and hardware must be co-designed to handle LLM training at trillion-token scales. Looking to
the future, it indicates that toolchain builders may want to find ways to capture some of these
HPC optimizations as part of the model compilation path or training apparatus, and AI
research teams may want to work closely with HPC expertise even in the early days of
architecture ideation.

Footnotes
1: Non-embedding FLOPs are the number of FLOPs (Floating Point Operations) used for
pre-training certain layers of the transformer (non-embedding). The authors found
only some layers contributed to the scaling formula.

2: A model consists of billions of internal variables, which are called its parameters. These
parameters gain their values (weights) during training. Before training, developers will set a
number of different variables that control the training process itself; these are called
hyperparameters.
