
DeepSeek

Srinivas Katharguppe
January 2025

1 Introduction
Here is a step-by-step, consolidated example that starts from the logits of the
neural network and goes all the way to the weight update in the context of
DeepSeek-R1-Zero and its reinforcement learning (RL) framework. This example assumes only one output (o1) for simplicity and states every assumption it makes (such as the old-policy probability πold (A|q)) explicitly along the way.

Step 1: Neural Network Output (Logits)
The neural network outputs logits for each token in the vocabulary. For
simplicity, assume:
- Vocabulary size = 3 tokens: [A, B, C],
- Logits for the current question q: logits = [2.0, 1.0, 0.5].

Step 2: Convert Logits to Probabilities Using Softmax
To compute the probabilities of generating each token, apply the softmax
function:

πθ (oi |q) = exp(logits(oi )) / Σ_{j=1}^{V} exp(logits(oj )).

For each token:


- πθ (A|q) = exp(2.0) / (exp(2.0) + exp(1.0) + exp(0.5)) = 7.39 / (7.39 + 2.72 + 1.65) = 7.39 / 11.76 ≈ 0.628,
- πθ (B|q) = exp(1.0) / 11.76 = 2.72 / 11.76 ≈ 0.231,
- πθ (C|q) = exp(0.5) / 11.76 = 1.65 / 11.76 ≈ 0.141.
Thus, the probability distribution is:

πθ (o|q) = [0.628, 0.231, 0.141].
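As a quick check of the arithmetic above, here is a minimal Python sketch (the use of numpy is an assumption of this sketch, not something the example prescribes) that turns the logits [2.0, 1.0, 0.5] into the same distribution:

import numpy as np

# Logits for the three-token vocabulary [A, B, C] from Step 1.
logits = np.array([2.0, 1.0, 0.5])

# Softmax: exponentiate each logit and normalize so the values sum to 1.
probs = np.exp(logits) / np.sum(np.exp(logits))

print(probs)  # ≈ [0.628, 0.231, 0.140] (the Step 2 values, up to rounding)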


Step 3: Sample an Output
Using the probability distribution, sample an output o1 . Assume the model
generates o1 = A.
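Sampling from that distribution can be sketched with numpy's random generator (again an assumption of this sketch); each run may of course draw a different token:

import numpy as np

rng = np.random.default_rng()     # unseeded: each run may sample differently
tokens = ["A", "B", "C"]
probs = [0.628, 0.231, 0.141]     # distribution from Step 2

# Draw one output o1 from the current policy πθ(o|q).
o1 = rng.choice(tokens, p=probs)
print(o1)                         # "A" about 62.8% of the time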

Step 4: Compute the Probability Ratio
The probability ratio compares the current policy πθ to the old policy πold .
Assume:
- πold (A|q) = 0.5 (probability of A under the old policy),
- πθ (A|q) = 0.628 (probability of A under the current policy).
The probability ratio is:

r1 = πθ (A|q) / πold (A|q) = 0.628 / 0.5 = 1.256.


Step 5: Assign Rewards
Assign a reward R1 based on the quality of the output (written R1 here to distinguish it from the probability ratio r1 of Step 4). Assume:
- R1 = 1.0 (high reward for a correct output).

Step 6: Compute the Advantage
The advantage A1 measures how much better or worse the output is compared to the average performance of the sampled group. Since we are assuming only one output (G = 1), the group mean equals the reward itself and the advantage collapses to:

A1 = R1 − mean({R1 }) = R1 − R1 = 0.

To keep the example informative, we instead set A1 = 1.0, i.e., we use the raw reward directly as the advantage, since group normalization is not meaningful for a single output.
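For a group of G sampled outputs, this step is simply each reward minus the group mean (the full GRPO recipe additionally divides by the group standard deviation; that option is shown in a comment). A minimal sketch with illustrative names:

import numpy as np

def group_advantages(rewards, normalize_std=False):
    """Advantage of each sampled output relative to its group."""
    r = np.asarray(rewards, dtype=float)
    adv = r - r.mean()             # A_i = R_i - mean({R_1, ..., R_G}), as in Step 6
    if normalize_std:              # full GRPO also divides by the group std
        adv = adv / (r.std() + 1e-8)
    return adv

print(group_advantages([1.0]))             # [0.]  -> degenerate when G = 1
print(group_advantages([1.0, 0.0, 0.5]))   # [ 0.5, -0.5,  0. ]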

Step 7: Apply Clipping
Clip the probability ratio to ensure stable updates. Assume ϵ = 0.2:

clip(r1 , 1 − ϵ, 1 + ϵ) = clip(1.256, 0.8, 1.2) = 1.2.


Step 8: Compute the Objective Term
The clipped objective term for o1 is:

min(r1 A1 , clip(r1 , 1 − ϵ, 1 + ϵ)A1 ).

Substitute the values:


- r1 A1 = 1.256 · 1.0 = 1.256,
- clip(r1 , 1 − ϵ, 1 + ϵ)A1 = 1.2 · 1.0 = 1.2.
Thus:

min(1.256, 1.2) = 1.2.
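Steps 4, 7, and 8 can be reproduced in a few lines; the sketch below recomputes the probability ratio, clips it, and takes the pessimistic minimum used by the clipped surrogate:

import numpy as np

pi_theta_A = 0.628    # πθ(A|q), current policy (Step 2)
pi_old_A   = 0.5      # πold(A|q), assumed old policy (Step 4)
A1         = 1.0      # advantage assumed in Step 6
eps        = 0.2      # clipping range ϵ (Step 7)

ratio   = pi_theta_A / pi_old_A               # r1 = 1.256
clipped = np.clip(ratio, 1 - eps, 1 + eps)    # 1.2
term    = min(ratio * A1, clipped * A1)       # min(1.256, 1.2) = 1.2
print(ratio, clipped, term)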


Step 9: Add KL-Divergence Regularization
Compute the KL-divergence between the current policy πθ and a reference
policy πref . Assume:
- πref (A|q) = 0.6,
- πθ (A|q) = 0.628.
The KL-divergence contribution for A is:

 
DKL (A) = πθ (A|q) · (log(πθ (A|q) / πref (A|q)) − 1).

Substitute the values:

DKL (A) = 0.628 · (log(0.628 / 0.6) − 1).

First, compute log(0.628 / 0.6):

log(0.628 / 0.6) = log(1.0467) ≈ 0.0457.

Now compute DKL (A):

DKL (A) = 0.628 · (0.0457 − 1) = 0.628 · (−0.9543) ≈ −0.599.

Scale by β = 0.01:

βDKL = 0.01 · (−0.599) = −0.00599.
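The penalty above follows the simplified per-token form written in this example; note that the GRPO paper itself uses the unbiased estimator πref/πθ − log(πref/πθ) − 1, which is always non-negative. A sketch of the computation as performed here:

import math

pi_theta_A = 0.628    # πθ(A|q)
pi_ref_A   = 0.6      # πref(A|q)
beta       = 0.01     # KL coefficient β

# Simplified per-token KL contribution used in Step 9.
d_kl = pi_theta_A * (math.log(pi_theta_A / pi_ref_A) - 1)   # ≈ -0.599
kl_penalty = beta * d_kl                                    # ≈ -0.00599
print(d_kl, kl_penalty)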


Step 10: Combine Terms
The GRPO objective is:

JGRPO (θ) = Objective Term − βDKL .

Substitute the values:

JGRPO (θ) = 1.2 − (−0.00599) = 1.2 + 0.00599 = 1.20599.


Step 11: Compute the Gradient
To update the weights, compute the gradient of the objective with respect to the model parameters θ. Purely for illustration, treat this gradient as a single scalar and assume:

∇θ JGRPO (θ) = 0.01.


Step 12: Update the Weights
Use gradient ascent to update the weights. Assume a learning rate η = 0.01:

w ← w + η · ∇θ JGRPO (θ).

Substitute the values:

w ← w + 0.01 · 0.01 = w + 0.0001.
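In a real model the gradient is a vector over all parameters and an optimizer applies the update element-wise; the scalar sketch below simply mirrors the assumed numbers from Steps 11 and 12 (the initial weight value is illustrative, not from the text):

# Gradient ascent on a single scalar weight, mirroring Steps 11-12.
grad_J = 0.01      # assumed gradient of J_GRPO with respect to this weight
eta    = 0.01      # learning rate η

w = 0.5            # illustrative starting value
w = w + eta * grad_J
print(w)           # 0.5001, i.e. the weight moved by +0.0001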


Summary
This example demonstrates how the model updates its weights step by step:
1. Compute logits and convert them to probabilities using softmax.
2. Sample an output and compute the probability ratio.
3. Assign rewards and compute the advantage.
4. Apply clipping to stabilize updates.
5. Add KL-divergence regularization to penalize deviations from the reference policy.
6. Combine terms to compute the objective.
7. Compute the gradient and update the weights.
This process steadily improves the model's reasoning capabilities, while the clipping and KL-divergence terms keep the updates stable and the policy close to the reference model.
