
martinfowler.com/articles/deepseek-papers.html

The DeepSeek Series: A Technical Overview

The appearance of DeepSeek Large-Language Models has caused a lot of discussion and
angst since their latest versions appeared at the beginning of 2025. But much of the value of
DeepSeek's work comes from the papers they have published over the last year. This article
provides an overview of these papers, highlighting three main arcs in this research: a focus
on improving cost and memory efficiency, the use of HPC Co-Design to train large models on
limited hardware, and the development of emergent reasoning from large-scale
reinforcement learning.

06 February 2025

Shayan Mohanty

Shayan Mohanty is the Head of AI Research at Thoughtworks, where his group focuses on
foundational research to bridge the gap between AI development and production. Previously,
he was CEO and Co-Founder of Watchful, a startup that built software to automate the
process of data labeling for AI. Shayan has a decade of experience leading data engineering teams at
various companies including Facebook, where he led the stream processing team
responsible for processing 100% of the ads metrics data for all FB products. He is also a
Guest Scientist at Los Alamos National Laboratory and has given talks on topics ranging
from Automata Theory to Machine Teaching.


This article provides a cohesive overview of four technical reports from DeepSeek:

1. DeepSeek-LLM (Jan '24): an early investigation of scaling laws and data-model
tradeoffs.
2. DeepSeek-V2 (Jun '24): introducing Multi-Head Latent Attention (MLA) and
DeepSeekMoE to improve memory and training efficiency.
3. DeepSeek-V3 (Dec '24): scaling sparse MoE networks to 671B parameters, with FP8
mixed precision training and intricate HPC co-design.
4. DeepSeek-R1 (Jan '25): building upon the efficiency foundations of the previous papers
and using large-scale reinforcement learning to incentivize emergent chain-of-thought
capabilities, including a “zero-SFT” variant.

For additional context on DeepSeek itself and the market backdrop that has caused claims
made by the DeepSeek team to be taken out of context and spread widely, please take a
look at my colleague Prasanna Pendse's post: Demystifying Deepseek. For the purposes of
this article, we'll be focusing analysis and commentary on the technical work itself, its merits,
and what it may signal for the future.

Much of this article assumes significant knowledge of the terminology and concepts of
building LLMs, more so than is typical for articles on this site. In future weeks we hope to
expand this article to provide explanations of these concepts to make this article easier to
follow for those not familiar with this world. We shall post any such updates on this site's
usual channels.

All four papers revolve around a singular challenge: building ever-larger language models
with minimal cost, memory overhead, and training instability. In each iteration, the authors
refine both architecture and infrastructure - a strategy often referred to as HPC co-design.

Key arcs in this series include:

Cost and Memory Efficiency: Methods like Multi-Head Latent Attention (MLA)
compression, mixture-of-experts (MoE), and FP8-based optimizations all aim to make
massive-scale training and inference feasible.
Sparsity + HPC Co-Design: From V2 to V3, we see mixture-of-experts architecture
evolve alongside specialized HPC scheduling—allowing 671B-parameter models to be
trained on H800 clusters without blowing up the budget.
Emergent Reasoning: In R1, large-scale Reinforcement Learning (RL) unlocks
advanced chain-of-thought capabilities, culminating in “R1-Zero” and its purely RL-
driven approach to reasoning tasks.

DeepSeek-LLM: Laying the Foundation


Motivation & Overview

The authors set out to answer an important question: Given a fixed compute budget for pre-
training, how do we choose the scale of the model and how much training data to use? Prior
studies (e.g. Chinchilla vs. GPT-3) differed on the ratio between these two factors.
DeepSeek-LLM addresses that by measuring scale in a different way. Earlier work measured
scale in terms of how many parameters were in the model; DeepSeek-LLM instead
measured scale as non-embedding FLOPs/token¹. They then found they could predict
computation with:


C=M×D

where C is the compute budget, M is non-embedding FLOPs/token, and D is data size.

This more granular representation helps them predict how a 7B or 67B model might train on
2T tokens of bilingual data.
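
To make the relationship concrete, here is a small illustrative sketch of using C = M × D to compare how far a fixed compute budget goes for two candidate model scales. The FLOPs/token figures and the budget below are hypothetical placeholders, not numbers from the paper.

```python
# Illustrative sketch (not DeepSeek's code): using C = M * D to compare
# how many tokens a fixed compute budget allows for two candidate model scales.
# The FLOPs/token values and the budget are hypothetical placeholders.

def tokens_for_budget(compute_budget_flops: float, flops_per_token: float) -> float:
    """Given C = M * D, solve for D (tokens) at a fixed compute budget C."""
    return compute_budget_flops / flops_per_token

# Hypothetical non-embedding FLOPs/token (M) for two model scales.
candidates = {
    "7B-class":  4.0e10,   # placeholder M for a smaller model
    "67B-class": 3.9e11,   # placeholder M for a larger model
}

budget = 1.0e24  # placeholder compute budget C, in FLOPs

for name, m in candidates.items():
    d = tokens_for_budget(budget, m)
    print(f"{name}: ~{d:.2e} tokens trainable under the budget")
```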

Training Instability

A central concern they grapple with is training instability (sudden irrecoverable divergences
in the training process), which can often manifest in large-scale language models—
especially those with mixture-of-experts or very long contexts.

By carefully tuning learning rates, batch sizes, and other hyperparameters², DeepSeek-LLM
demonstrates that stable large-scale training is achievable, but it requires meticulous design
of the architecture of the transformer model together with the infrastructure of the High
Performance Computing (HPC) data center used to train it. This interwoven design of both
architecture and infrastructure is called HPC Co-Design.

Data Quality & Model Scale

A point the authors make is about how data quality shifts the optimal ratio—i.e., higher-
quality data can justify a bigger model for the same number of tokens. You can intuit this by
imagining two scenarios:

Scenario A: You have a 100-billion-token corpus full of duplicates, spammy text, or incomplete sentences. The model might not glean much new knowledge because the data is partly redundant or low-value.
Scenario B: You have a carefully curated 100-billion-token corpus with broad coverage of code, math, multi-lingual dialogues, factual text, etc. Each token is more “information-rich,” so the model can “afford” to use more parameters without hitting diminishing returns prematurely.

In other words, when data is denser in useful information, scaling the model further pays off
because each parameter can learn from richer signals.

Key Takeaways

Hyperparameter Scaling: They propose simple power-law fits to pick batch size and
learning rate as compute C grows (see the sketch after this list).
Bilingual Data: They train two base sizes (7B, 67B) on 2T tokens covering
English/Chinese, then do Supervised Fine Tuning (SFT) and a simpler preference-
based alignment called Direct Preference Optimization (DPO).
Results: The resulting DeepSeek-LLM 67B “outperforms LLaMA-2 70B” on math/coding
tasks, illustrating how HPC co-designed approaches can keep training stable while
efficiently pushing scale.
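
As a rough illustration of the “Hyperparameter Scaling” takeaway, the sketch below fits a power law of the form lr ≈ a·C^b to a handful of (compute, best learning rate) pairs via a log-log linear fit. The data points and resulting coefficients here are invented for illustration; the actual fitted formulas are in the DeepSeek-LLM report.

```python
# Sketch of the power-law idea behind the "Hyperparameter Scaling" takeaway:
# fit lr_opt ~ a * C^b from a few (compute, best learning rate) observations.
# The data points and the resulting coefficients are made up for illustration.
import numpy as np

# Hypothetical (compute budget C, best-found learning rate) pairs from small runs.
compute = np.array([1e17, 1e18, 1e19, 1e20])
best_lr = np.array([6.0e-4, 4.5e-4, 3.4e-4, 2.5e-4])

# A power law lr = a * C^b is linear in log-space: log lr = log a + b * log C.
b, log_a = np.polyfit(np.log(compute), np.log(best_lr), 1)
a = np.exp(log_a)

target_compute = 3e21  # extrapolate to a larger training run
predicted_lr = a * target_compute ** b
print(f"fitted exponent b = {b:.3f}, predicted lr at C = 3e21: {predicted_lr:.2e}")
```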

The seeds planted here - scaling laws and infrastructure for extremely large training - will
reappear in subsequent works.

DeepSeek-V3: HPC Co-Design


Scaling MoE to 671B While Preserving Efficiency

Building on V2, DeepSeek-V3 further extends sparse models to 671B parameters (37B
activated), training on 14.8T tokens in under 2.8M H800 GPU hours. The authors credit
extensive HPC co-design:

Lastly, we emphasize again the economical training costs of DeepSeek-V3,
summarized in Table 1, achieved through our optimized co-design of algorithms,
frameworks, and hardware.

-- DeepSeek-V3 Tech. Report, p.5

The major novelties are:

1. Refined MLA
2. Refined DeepSeekMoE
3. Co-Designed Training & Inference Frameworks

Refined MLA

Multi-Head Latent Attention was introduced in V2 to reduce KV cache overhead. In V3, it is
further refined with several new features:

Dynamic Low-Rank Projection: Instead of a static compression dimension, MLA adjusts how strongly it compresses Key/Value vectors depending on sequence length. For shorter sequences, less compression preserves fidelity; for extremely long sequences (32K–128K tokens), deeper compression manages memory growth.
Adaptive Query Compression: Where V2 used a fixed d_c dimension, V3 employs
an adaptive scaling of the query up/down at different layer depths. Early layers use
higher-dimensional queries for expressiveness; deeper layers more aggressively
compress to save activation memory.
Improved RoPE Handling: V2 only partially decoupled keys, but V3 extends the
concept for more stable 128K context. They track a “decoupled shared key” that
reduces numerical drift in extremely long generations.
Joint KV Storage: V2 stored compressed keys and values separately. V3 merges them
into a shared compressed representation to further reduce memory traffic during multi-
node inference.
Layer-Wise Adaptive Cache: Instead of caching all past tokens for all layers, V3
prunes older KV entries at deeper layers. This helps keep memory usage in check
when dealing with 128K context windows.

Together, these MLA refinements ensure that while DeepSeek-V3 can attend across very
long sequences, the memory overhead remains manageable.
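
A minimal sketch of the low-rank KV-compression idea behind these refinements, with a compression rank that varies by sequence length in the spirit of the dynamic low-rank projection above. The dimensions, thresholds, and rank schedule are illustrative assumptions, not the values used in DeepSeek-V3, and the paired up-projections inside attention are omitted.

```python
# Sketch of low-rank KV compression with a sequence-length-dependent rank.
# Dimensions, thresholds, and the rank schedule are illustrative assumptions.
import torch
import torch.nn as nn

class LatentKVCompressor(nn.Module):
    def __init__(self, d_model: int = 1024, ranks: tuple = (512, 256, 128)):
        super().__init__()
        self.ranks = ranks
        # One down-projection per candidate rank; the matching up-projections
        # would live inside the attention computation and are omitted here.
        self.down = nn.ModuleList([nn.Linear(d_model, r, bias=False) for r in ranks])

    def pick_rank(self, seq_len: int) -> int:
        # Hypothetical schedule: compress more aggressively as context grows.
        if seq_len <= 4096:
            return 0      # mild compression preserves fidelity
        if seq_len <= 32768:
            return 1
        return 2          # deep compression for very long contexts

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: [batch, seq_len, d_model] -> compressed latent [batch, seq_len, rank]
        idx = self.pick_rank(hidden.shape[1])
        return self.down[idx](hidden)

latent = LatentKVCompressor()(torch.randn(1, 8192, 1024))
print(latent.shape)  # torch.Size([1, 8192, 256]) under the schedule above
```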

Refined DeepSeekMoE: Auxiliary-Loss-Free, Higher Capacity

On the MoE side, DeepSeek-V3 drops the auxiliary-loss approach from V2. Instead of an
explicit penalty term, each expert acquires a dynamic bias b_i. If an expert is overloaded at a
step, b_i decreases; if underloaded, b_i increases. The gating decision then adds b_i to the
token's affinity:

s′_{i,t} = s_{i,t} + b_i

Key Improvements:

No Token Dropping: V2 occasionally dropped tokens if certain experts got overloaded, but the new bias-based method keeps everything.
More Activated Experts: They raise the number of routed experts from 6 to 8 per
token, improving representational power.

Higher Stability: By removing auxiliary losses, they avoid potential interference with
the main training objective, focusing purely on the intrinsic gating signals plus bias
adjustments.

Hence, the final feed-forward module is a combination of a small set of shared experts plus
up to 8 specialized experts chosen adaptively.
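
Here is a small sketch of the bias-adjusted routing idea: the bias b_i is added to each expert's affinity only for the routing decision, and is nudged down for overloaded experts and up for underloaded ones. The number of experts, the bias step size, and the shapes are illustrative assumptions, not DeepSeek's actual values.

```python
# Sketch of auxiliary-loss-free gating: route with s'_{i,t} = s_{i,t} + b_i and
# adjust b_i from observed expert load. Step size and shapes are illustrative.
import torch

num_experts, top_k, bias_step = 64, 8, 1e-3
bias = torch.zeros(num_experts)

def route(affinity: torch.Tensor) -> torch.Tensor:
    """affinity: [tokens, num_experts] gating scores s_{i,t}. Returns chosen expert ids."""
    global bias
    adjusted = affinity + bias                      # s'_{i,t} = s_{i,t} + b_i
    chosen = adjusted.topk(top_k, dim=-1).indices   # 8 routed experts per token

    # Measure load and nudge the bias: overloaded experts get a lower bias, and vice versa.
    load = torch.bincount(chosen.flatten(), minlength=num_experts).float()
    target = chosen.numel() / num_experts
    bias = bias - bias_step * torch.sign(load - target)
    return chosen

experts = route(torch.randn(512, num_experts))
print(experts.shape)  # torch.Size([512, 8])
```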

Co-Designed Frameworks: FP8, DualPipe, and PTX Optimizations

Scaling an MoE model to 671B demanded HPC-level solutions for training and inference.
The authors emphasize:

Through the co-design of algorithms, frameworks, and hardware, we overcome
the communication bottleneck in cross-node MoE training, achieving near-full
computation-communication overlap.

-- DeepSeek-V3 Tech. Report, p.5

FP8 Mixed Precision

They adopt an FP8 data format for General Matrix Multiplications (GEMMs), halving memory.
The risk is a reduced numeric range, so they offset it with two measures (sketched in code after the list):

Block-wise scaling (e.g., 1x128 or 128x128 tiles).
Periodic “promotion” to FP32 after short accumulation intervals to avoid overflow/underflow.
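
The sketch below mimics both mitigations in spirit only: a per-tile scale sized so a 128x128 block fills the narrow FP8 range, and partial sums promoted into an FP32 accumulator at short intervals. FP16 stands in for FP8 here, since real FP8 GEMMs live in fused CUDA/PTX kernels; the interval and tensor sizes are illustrative assumptions.

```python
# Conceptual sketch only: per-tile scaling for a narrow-range format, plus
# periodic promotion of low-precision partial sums into an FP32 accumulator.
# FP16 is used as a stand-in for FP8; real kernels are fused CUDA/PTX code.
import torch

FP8_E4M3_MAX = 448.0  # largest representable magnitude in FP8 E4M3

def tile_scale(tile: torch.Tensor) -> float:
    """Scale that maps the tile's largest magnitude onto the FP8 E4M3 range."""
    return FP8_E4M3_MAX / tile.abs().max().clamp(min=1e-12).item()

def dot_with_periodic_promotion(a: torch.Tensor, b: torch.Tensor, interval: int = 128) -> torch.Tensor:
    """Dot product where low-precision partial sums are flushed into an FP32
    accumulator every `interval` elements (the 'promotion' described above)."""
    acc32 = torch.zeros((), dtype=torch.float32)
    for start in range(0, a.numel(), interval):
        chunk = a[start:start + interval].half() * b[start:start + interval].half()
        acc32 += chunk.sum().float()  # promote the short accumulation to FP32
    return acc32

a, b = torch.randn(4096), torch.randn(4096)
print(tile_scale(a.view(64, 64)), dot_with_periodic_promotion(a, b).item())
```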

DualPipe Parallelism

They propose DualPipe to overlap forward/backward computation with the MoE all-to-all
dispatch. It rearranges pipeline stages to ensure that network communication (particularly
across InfiniBand) is hidden behind local matrix multiplications.

PTX-Level & Warp Specialization

To fully exploit InfiniBand (IB) and NVLink:

They tune warp-level instructions in PTX (a level lower than CUDA), auto-tuning the
chunk size for all-to-all dispatch.
They dynamically partition Streaming Multiprocessors (SMs) into communication vs. compute tasks
so that token dispatch never stalls local GEMMs.

As a result, training costs were cut to 2.8M H800 GPU hours per run - low for a 14.8T token
corpus.

Outcomes

The resulting DeepSeek-V3 excels at code, math, and some multilingual tasks,
outperforming other open-source LLMs of similar scale. Deep HPC co-design (FP8,
DualPipe, PTX-level optimization) plus refined MLA/MoE implementation achieve extreme
scale with stable training.

DeepSeek-R1: Reinforcement Learning for Deeper Reasoning

It's worth noting that both DeepSeek-R1 and DeepSeek-R1-Zero are architecturally identical
to DeepSeek-V3 (but use the pretrained-only base version). The only difference between these
models is how post-training is handled.

Emergent Reasoning Behaviors Through RL-Only

All prior DeepSeek releases used SFT (plus occasional RL). By contrast, DeepSeek-R1-Zero
tries an extreme: no supervised warmup, just RL from the base model. They adopt Group
Relative Policy Optimization (GRPO), which:

1. Samples a group of old-policy outputs o_1, ..., o_G
2. Scores each with a reward (in this case, rule-based)
3. Normalizes the advantage A_i by group mean/stdev
4. Optimizes a clipped PPO-like objective

The reward function for the R1 models is rule-based - a simple weighted sum of two
components (a minimal sketch follows the list):

Accuracy Reward - if the task has an objective correct answer (e.g. a math problem,
coding task, etc.), correctness is verified using mathematical equation solvers for step-
by-step proof checking, and code execution & test cases for code correctness
verification
Format Reward - the model is rewarded for following a structured reasoning process
using explicit reasoning markers <think></think> and <answer></answer>
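
A minimal sketch of a rule-based reward of this kind: a weighted sum of an accuracy check and a format check on the <think>/<answer> markers. The weights, regexes, and toy verifier are illustrative assumptions, not DeepSeek's actual reward code; a real accuracy check would run equation solvers or execute code against test cases.

```python
# Minimal sketch of a rule-based reward: accuracy + format, with a toy verifier.
# Weights, regexes, and the verifier are illustrative assumptions.
import re

def format_reward(output: str) -> float:
    """1.0 if the output wraps its reasoning and answer in the expected tags."""
    has_think = re.search(r"<think>.*?</think>", output, re.DOTALL) is not None
    has_answer = re.search(r"<answer>.*?</answer>", output, re.DOTALL) is not None
    return 1.0 if (has_think and has_answer) else 0.0

def accuracy_reward(output: str, reference: str) -> float:
    """Toy verifier: exact match on the extracted answer. Real verifiers would
    use equation solvers or code execution against test cases."""
    m = re.search(r"<answer>(.*?)</answer>", output, re.DOTALL)
    return 1.0 if m and m.group(1).strip() == reference.strip() else 0.0

def reward(output: str, reference: str, w_acc: float = 0.9, w_fmt: float = 0.1) -> float:
    return w_acc * accuracy_reward(output, reference) + w_fmt * format_reward(output)

sample = "<think>2 + 2 is 4</think><answer>4</answer>"
print(reward(sample, "4"))  # 1.0 under these illustrative weights
```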

The relative advantage A_i for a given output is calculated as:

A_i = (r_i - mean({r_1, r_2, ..., r_G})) / std({r_1, r_2, ..., r_G})

where r_i is the reward calculated for the given output. The model's policy is updated to
favor responses with higher rewards while constraining changes using a clipping function
which ensures that the new policy remains close to the old.
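
The sketch below ties these pieces together: group-normalized advantages feeding a PPO-style clipped surrogate objective. The group size, clip range, and toy tensors are illustrative assumptions, and GRPO's additional KL-regularization term toward a reference policy is omitted for brevity.

```python
# Sketch of the GRPO-style update: group-relative advantages plus a clipped
# PPO-like objective. Group size, clip range, and tensors are illustrative.
import torch

def grpo_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
              rewards: torch.Tensor, clip_eps: float = 0.2) -> torch.Tensor:
    """logp_new/logp_old: per-output log-probabilities under the new/old policy.
    rewards: rule-based rewards r_1..r_G for one group of G sampled outputs."""
    # Group-relative advantage: normalize by the group's mean and std.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # PPO-style clipped surrogate: keep the new policy close to the old one.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    return -torch.min(unclipped, clipped).mean()

# One group of G = 8 sampled outputs for the same prompt.
rewards = torch.tensor([1.0, 0.0, 1.0, 0.1, 0.0, 1.0, 0.0, 0.1])
logp_old = torch.randn(8)
logp_new = logp_old + 0.05 * torch.randn(8)
print(grpo_loss(logp_new, logp_old, rewards))
```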

In other words: the authors created a testing/verification harness around the model, which
they exercised using reinforcement learning, and gently guided the model using simple
Accuracy and Format rewards. In doing so, emergent reasoning behaviors were observed:

Self-verification - the model double-checks its own answers
Extended chain-of-thought - the model learns to explain its reasoning more thoroughly
Exploratory reasoning - the model tries different approaches before converging on an answer
Reflection - the model starts questioning its own solutions and adjusting reasoning paths dynamically

R1-Zero is probably the most interesting outcome of the R1 paper for researchers because it
learned complex chain-of-thought patterns from raw reward signals alone. However, the
model exhibited notable issues:

Readability Problems: Because it never saw any human-curated language style, its
outputs were sometimes jumbled or mixed multiple languages.
Instability in Non-Reasoning Tasks: Lacking SFT data for general conversation, R1-
Zero would produce valid solutions for math or code but be awkward on simpler Q&A or
safety prompts.
Limited Domain: Rule-based rewards worked well for verifiable tasks (math/coding),
but handling creative/writing tasks demanded broader coverage.

Hence, the authors concluded that while “pure RL” yields strong reasoning in verifiable tasks,
the model’s overall user-friendliness was lacking. This led them to DeepSeek-R1: an
alignment pipeline combining small cold-start data, RL, rejection sampling, and more RL, to
“fill in the gaps” from R1-Zero’s deficits.

Refined Reasoning Through SFT + RL

DeepSeek-R1 addresses R1-Zero's limitations by injecting a small amount of supervised
data before RL and weaving in additional alignment steps.

Stage 1: “Cold-Start” SFT

They gather a small set (~thousands) of curated, “human-friendly” chain-of-thought examples
covering common sense Q&A, basic math, standard instruction tasks, etc. Then, they do a
short SFT pass on the base model. This ensures the model acquires:

Better readability: Polished language style and formatting.
Non-reasoning coverage: Some conversation, factual QA, or creative tasks not easily rewarded purely by rule-based checks.

In essence, the authors realized you can avoid the “brittleness” of a zero-SFT approach by
giving the model a seed of user-friendly behaviors.

Stage 2: Reasoning-Oriented RL

Next, as in R1-Zero, they apply large-scale RL for tasks like math and code. The difference is
that now the model starts from a “cold-start SFT” checkpoint—so it retains decent language
style while still learning verifiable tasks from a rule-based or tool-based reward. This RL
stage fosters the same emergent chain-of-thought expansions but without the random
“language mixing” or bizarre structure.

Stage 3: Rejection Sampling + Additional SFT

Once that RL converges, they generate multiple completions per prompt from the RL
checkpoint. Using a combination of automatic verifiers and some human checks, they pick
the best outputs (“rejection sampling”) and build a new SFT dataset. They also incorporate
standard writing/factual/safety data from DeepSeek-V3 to keep the model balanced in non-
verifiable tasks. Finally, they re-fine-tune the base model on this curated set.

This step addresses the “spotty coverage” problem even further: The best RL answers
become training targets, so the model improves at chain-of-thought and clarity.
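
A small sketch of this rejection-sampling loop under stated assumptions: `generate` and `verify` are hypothetical stand-ins for the RL checkpoint and the automatic verifiers, and a real pipeline would also include human checks, deduplication, and the additional V3-sourced data mentioned above.

```python
# Sketch of rejection sampling: sample several completions per prompt, keep the
# best verified one, and turn it into a new supervised fine-tuning pair.
# `generate` and `verify` are hypothetical stand-ins, not real APIs.
from typing import Callable, List, Tuple

def rejection_sample(prompts: List[str],
                     generate: Callable[[str, int], List[str]],
                     verify: Callable[[str, str], float],
                     samples_per_prompt: int = 16) -> List[Tuple[str, str]]:
    sft_pairs = []
    for prompt in prompts:
        completions = generate(prompt, samples_per_prompt)
        scored = [(verify(prompt, c), c) for c in completions]
        best_score, best = max(scored, key=lambda sc: sc[0])
        if best_score > 0:                      # keep only verified outputs
            sft_pairs.append((prompt, best))    # becomes a supervised target
    return sft_pairs

# Toy stand-ins so the sketch runs end to end.
demo_pairs = rejection_sample(
    prompts=["What is 2 + 2?"],
    generate=lambda p, n: ["<think>2 + 2 = 4</think><answer>4</answer>"] * n,
    verify=lambda p, c: 1.0 if "<answer>4</answer>" in c else 0.0,
)
print(demo_pairs)
```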

Stage 4: RL for “All Scenarios”

Lastly, they do another RL pass on diverse prompts—not just math/code but general
helpfulness, safety, or role-playing tasks. Rewards may come from a combination of rule-
based checks and large “preference” models (trained from user preference pairs). The final
result is a model that:

Retains strong chain-of-thought for verifiable tasks,
Aligns to broad user requests in everyday usage,
Maintains safer, more controlled outputs.

Connecting the Arcs: Efficiency & Emergence

Despite covering different angles - scaling laws, MoE, HPC scheduling, and large-scale RL -
DeepSeek's work consistently follows these arcs:

1. Cost and Memory Efficiency
They systematically design methods (MLA, MoE gating, device-limited routing, FP8 training, DualPipe) to maximize hardware utilization even in constrained environments.
HPC-level scheduling (PTX instructions, warp specialization) hides communication overhead and overcomes the limitations imposed by limited interconnect speeds on H800s.
2. Sparsity + HPC Co-Design
From V2 to V3, we see an evolving mixture-of-experts approach, culminating in a
671B-parameter model feasible on H800 clusters.
The authors repeatedly stress that HPC co-design is the only path to cheaply train
multi-hundred-billion-parameter LLMs.
3. Emergent Reasoning
R1 pushes beyond standard supervised training, letting RL signals shape deep
chain-of-thought. The synergy between pre-trained scale and targeted post-
training yields advanced reasoning patterns like reflection or multi-step
verification.

Taken as a whole, the DeepSeek series highlights how architecture, algorithms, frameworks,
and hardware must be co-designed to handle LLM training at trillion-token scales. Looking to
the future, it indicates that toolchain builders may want to find ways to capture some of these
HPC optimizations as part of the model compilation path or training apparatus, and AI
research teams may want to work closely with HPC expertise even in the early days of
architecture ideation.

Footnotes
1: Non-embedding FLOPs are the number of FLOPs (Floating Point Operations) used for
pre-training certain layers of the transformer (non-embedding). The authors found
only some layers contributed to the scaling formula.

2: A model consists of billions of internal variables, which are called its parameters. These
parameters gain their values (weights) during training. Before training, developers will set a
number of different variables that control the training process itself; these are called
hyperparameters.
