DeepSeek
Srinivas Katharguppe
January 2025
1 Introduction
Here is a step-by-step, consolidated example that starts from the logits of the
neural network and goes all the way to the weight update in the context of
DeepSeek-R1-Zero and its reinforcement learning (RL) framework. The example
assumes only one output (o1 ) for simplicity; concrete numerical values for the
old policy (πold ) and the reference policy (πref ) are assumed where needed.
—
Step 1: Neural Network Output (Logits)
The neural network outputs logits for each token in the vocabulary. For
simplicity, assume:
- Vocabulary size = 3 tokens: [A, B, C],
- Logits for the current question q: logits = [2.0, 1.0, 0.5].
—
Step 2: Convert Logits to Probabilities Using Softmax
To compute the probabilities of generating each token, apply the softmax
function:
\[
\pi_\theta(t \mid q) = \frac{e^{\mathrm{logit}_t}}{\sum_{t'} e^{\mathrm{logit}_{t'}}},
\]
which gives approximately:
\[
\pi_\theta(A \mid q) \approx 0.628, \quad \pi_\theta(B \mid q) \approx 0.231, \quad \pi_\theta(C \mid q) \approx 0.140.
\]
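As a minimal sketch (assuming NumPy; the logits are the toy values from Step 1), the softmax computation can be reproduced as follows:

```python
import numpy as np

# Toy logits for the 3-token vocabulary [A, B, C] from Step 1.
logits = np.array([2.0, 1.0, 0.5])

# Softmax: exponentiate (shifted by the max for numerical stability) and normalize.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

print(probs)  # approx [0.628, 0.231, 0.140]; the value for A is used in later steps
```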
—
Step 3: Sample an Output
Using the probability distribution, sample an output o1 . Assume the model
generates o1 = A.
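A minimal sketch of this sampling step; the NumPy generator and its seed are assumptions made only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed, for reproducibility only

tokens = ["A", "B", "C"]
logits = np.array([2.0, 1.0, 0.5])
probs = np.exp(logits) / np.exp(logits).sum()  # same softmax as Step 2

# Draw one output o1 from the categorical distribution over tokens.
o1 = rng.choice(tokens, p=probs)
print(o1)  # the rest of the example assumes the sampled token is "A"
```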
—
Step 4: Compute the Probability Ratio
The probability ratio compares the current policy πθ to the old policy πold .
Assume:
- πold (A|q) = 0.5 (probability of A under the old policy),
- πθ (A|q) = 0.628 (probability of A under the current policy).
The probability ratio is:
\[
r_1 = \frac{\pi_\theta(A \mid q)}{\pi_{\mathrm{old}}(A \mid q)} = \frac{0.628}{0.5} = 1.256.
\]
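In implementations the ratio is commonly computed from log-probabilities for numerical stability; a minimal sketch with the assumed probabilities above:

```python
import math

log_p_new = math.log(0.628)  # log πθ(A|q), assumed value from Step 2
log_p_old = math.log(0.5)    # log πold(A|q), assumed value

ratio = math.exp(log_p_new - log_p_old)
print(round(ratio, 3))  # 1.256
```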
—
Step 5: Assign Rewards
Assign a reward R1 based on the quality of the output (written R1 to avoid
clashing with the probability ratio r1 from Step 4). Assume:
- R1 = 1.0 (high reward for a correct output).
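For instance, a rule-based reward could be sketched as below; the exact rule (an exact-match check against a reference answer) is a hypothetical illustration, and only the resulting value R1 = 1.0 is used in the example:

```python
def rule_based_reward(model_answer: str, reference_answer: str) -> float:
    """Hypothetical rule-based reward: 1.0 for a correct answer, 0.0 otherwise."""
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0

print(rule_based_reward("A", "A"))  # 1.0, the value assumed for R1
```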
—
Step 6: Compute the Advantage
The advantage A1 measures how much better or worse the output is compared
to the average performance of the group. Since we are assuming only one
output (G = 1), the advantage simplifies to (see the sketch below for the
general group case):
\[
A_1 = R_1 - \mathrm{mean}(\{R_1\}) = R_1 - R_1 = 0.
\]
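A minimal sketch of the group-relative advantage for a general group size G, following the mean/std normalization used in GRPO; the extra reward values in the second call are assumptions made for illustration:

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """Group-relative advantages: subtract the group mean.

    GRPO additionally divides by the group standard deviation; with a
    single output (G = 1) the centered value is already 0, so the
    advantage is 0, as in Step 6.
    """
    r = np.asarray(rewards, dtype=float)
    centered = r - r.mean()
    if len(r) > 1:
        centered = centered / (r.std() + eps)
    return centered

print(group_advantages([1.0]))            # [0.] -> the G = 1 case above
print(group_advantages([1.0, 0.0, 0.5]))  # hypothetical group of G = 3 outputs
```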
—
Step 7: Compute the Objective Term
The clipped objective term for o1 is:
\[
\min\bigl(r_1 A_1,\ \mathrm{clip}(r_1,\, 1-\epsilon,\, 1+\epsilon)\, A_1\bigr).
\]
Since A1 = 0, both arguments of the min are zero, so this term contributes 0
to the objective.
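A minimal sketch of this clipped surrogate term, with ε = 0.2 assumed (the example does not fix a value for ε):

```python
def clipped_term(ratio, advantage, eps=0.2):
    """PPO/GRPO-style clipped surrogate term for a single output."""
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped_ratio * advantage)

# With the values from the example: ratio r1 = 1.256, advantage A1 = 0.
print(clipped_term(1.256, 0.0))  # 0.0 -> the term vanishes when A1 = 0
```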
—
Step 8: Add KL-Divergence Regularization
Compute the KL-divergence between the current policy πθ and a reference
policy πref . Assume:
- πref (A|q) = 0.6,
- πθ (A|q) = 0.628.
The KL-divergence contribution for A is:
\[
D_{KL}(A) = \pi_\theta(A \mid q) \cdot \Bigl(\log\frac{\pi_\theta(A \mid q)}{\pi_{\mathrm{ref}}(A \mid q)} - 1\Bigr).
\]
Substituting the assumed values:
\[
D_{KL}(A) = 0.628 \cdot \Bigl(\log\frac{0.628}{0.6} - 1\Bigr).
\]
First, compute $\log\frac{0.628}{0.6}$:
\[
\log\frac{0.628}{0.6} = \log(1.0467) \approx 0.0457.
\]
Then:
\[
D_{KL}(A) = 0.628 \cdot (0.0457 - 1) = 0.628 \cdot (-0.9543) \approx -0.599.
\]
Scaling by β = 0.01 gives:
\[
\beta\, D_{KL}(A) = 0.01 \times (-0.599) \approx -0.006.
\]
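A minimal sketch of this per-token KL term exactly as written above, together with the β scaling (note that implementations may use a different KL estimator):

```python
import math

def kl_term(p_theta, p_ref):
    """Per-token KL contribution as used in this example:
    p_theta * (log(p_theta / p_ref) - 1)."""
    return p_theta * (math.log(p_theta / p_ref) - 1.0)

beta = 0.01
d_kl = kl_term(0.628, 0.6)
print(round(d_kl, 3))         # approx -0.599
print(round(beta * d_kl, 3))  # approx -0.006
```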
—
Step 9: Combine Terms
For the single output (G = 1), the GRPO objective combines the clipped term
from Step 7 with the KL penalty from Step 8:
\[
J_{\mathrm{GRPO}}(\theta) = \min\bigl(r_1 A_1,\ \mathrm{clip}(r_1, 1-\epsilon, 1+\epsilon)\, A_1\bigr) - \beta\, D_{KL}(A) = 0 - (-0.006) = 0.006.
\]
—
Step 10: Compute the Gradient
To update the weights, compute the gradient of the objective with respect
to the model parameters θ. In practice this gradient is obtained by
backpropagating through the objective term computed in Step 9.
—
Step 11: Update the Weights
Use gradient ascent to update the weights, with an assumed learning rate η = 0.01:
\[
\theta_{\mathrm{new}} = \theta_{\mathrm{old}} + \eta\, \nabla_\theta J_{\mathrm{GRPO}}(\theta).
\]
Each parameter therefore moves a small step in the direction that increases the
objective; a sketch of this computation with an automatic-differentiation
framework follows.
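The gradient and the weight update are normally handled by an automatic-differentiation framework. The following is a minimal PyTorch sketch of one GRPO-style update for this toy example; the tiny linear "policy", the question embedding, and the constants (ε = 0.2, β = 0.01, η = 0.01) are assumptions made for illustration, not DeepSeek's actual implementation:

```python
import torch

torch.manual_seed(0)

# A toy "policy": a single linear layer mapping a 4-dim question
# embedding to logits over the 3-token vocabulary [A, B, C].
policy = torch.nn.Linear(4, 3)
q = torch.randn(4)  # stand-in embedding for the question q

# Assumed constants from the worked example.
old_prob_A = torch.tensor(0.5)     # pi_old(A|q)
ref_prob_A = torch.tensor(0.6)     # pi_ref(A|q)
reward = 1.0                       # R1
eps, beta, lr = 0.2, 0.01, 0.01    # clip range, KL weight, learning rate

# Steps 2-3: current probability of the sampled token o1 = A (index 0).
probs = torch.softmax(policy(q), dim=-1)
p_A = probs[0]

# Step 6: group-relative advantage for a group of size G = 1.
rewards = torch.tensor([reward])
advantage = rewards[0] - rewards.mean()  # A1 = 0

# Steps 4 and 7: probability ratio and clipped surrogate term.
ratio = p_A / old_prob_A
clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
surrogate = torch.minimum(ratio * advantage, clipped * advantage)

# Step 8: KL term as written in this example: p * (log(p / p_ref) - 1).
kl = p_A * (torch.log(p_A / ref_prob_A) - 1.0)

# Step 9: combined objective for the single output.
objective = surrogate - beta * kl

# Steps 10-11: gradient ascent on the objective.
objective.backward()
with torch.no_grad():
    for param in policy.parameters():
        param += lr * param.grad
        param.grad = None

print(float(objective))
```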
—
Summary
This example demonstrates how the model updates its weights step by step:
1. Compute logits and convert them to probabilities using softmax.
2. Sample an output and compute the probability ratio.
3. Assign rewards and compute the advantage.
4. Apply clipping to stabilize updates.
5. Add KL-divergence regularization to penalize deviations from the reference policy.
6. Combine terms to compute the objective.
7. Compute the gradient and update the weights.
This process ensures that the model improves its reasoning capabilities while
maintaining training stability and staying close to the reference policy.