12 LLM Notes
Herman Kamper
High-level introduction
Intro to large language models by Andrej Karpathy [slides]
GPT is just a transformer language model
Inference: Predict one word at a time and feed it back in
[Figure: at inference time the input so far, e.g. "<s> it looks", is fed to the model and the next word is sampled from its output distribution.]
The model outputs the next word and then feeds that in as its own
input to the next time step. In general, a model that takes its own
output as input is called autoregressive.
There are different ways of sampling the output word from the predicted
output distribution; a few standard strategies are sketched below.
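A minimal Python sketch of three common strategies (assuming PyTorch; the toy vocabulary and logits are made-up stand-ins for a real model's output):

    import torch

    torch.manual_seed(0)

    vocab = ["the", "cat", "sat", "on", "mat", "</s>"]        # toy vocabulary (made up)
    logits = torch.tensor([2.0, 0.5, 1.0, -1.0, 0.0, -2.0])   # model's unnormalised scores

    # 1. Greedy decoding: always take the most probable word
    greedy_word = vocab[int(torch.argmax(logits))]

    # 2. Sampling with a temperature: higher temperature gives a flatter distribution
    temperature = 0.8
    probs = torch.softmax(logits / temperature, dim=-1)
    sampled_word = vocab[int(torch.multinomial(probs, num_samples=1))]

    # 3. Top-k sampling: keep only the k most probable words, renormalise, then sample
    k = 3
    top_vals, top_idx = torch.topk(logits, k)
    top_probs = torch.softmax(top_vals, dim=-1)
    top_k_word = vocab[int(top_idx[torch.multinomial(top_probs, num_samples=1)])]

    print(greedy_word, sampled_word, top_k_word)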
Training time: Predict all the next words and backprop
[Figure: at training time the model predicts the next word at every position in parallel, giving outputs ŷ1, ..., ŷ8, with the sequence ending in the end-of-sentence token </s>.]
Masking ensures that the model cannot look into the future when
predicting the output at the current time step, i.e. what happens during
training resembles what happens during inference.¹
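A minimal sketch of causal masking (assuming PyTorch; the attention scores are just random stand-ins): scores for future positions are set to -inf before the softmax, so position t can only attend to positions up to t.

    import torch

    T = 5                                     # sequence length
    scores = torch.randn(T, T)                # unnormalised attention scores (random stand-ins)

    # Boolean mask that is True strictly above the diagonal, i.e. at "future" positions
    future = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    masked_scores = scores.masked_fill(future, float("-inf"))

    # After the softmax, each position only attends to itself and earlier positions
    attn = torch.softmax(masked_scores, dim=-1)
    print(attn)                               # row t has zeros in columns > t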
My illustrations are with words, but the language model (LM) probably
uses subword units like BPE. The basic architecture, I think, is still
the one from (Radford et al., 2018).
¹ See the explanation of masking in my transformers note.
From language model to assistant
Three steps to go from GPT to ChatGPT:
<prompt>
Can you help me with this code? It seems like there is
a bug.
print("hello world)
</prompt>
<answer>
It looks like you forgot to close the string passed to
the function print. You have to add a closing quote to
properly terminate the string. Here is the corrected
code:
print("hello world")
</answer>
You can think of this as supervised learning, but really it is still exactly
the same task as the task used during pretraining. The dataset is just
swapped out and training is continued.
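To make the point concrete, here is a rough sketch (assuming PyTorch; the token ids and logits are made-up stand-ins for a tokeniser and an LM): the fine-tuning loss is still just next-word cross-entropy, now over a concatenated prompt and answer.

    import torch
    import torch.nn.functional as F

    # Toy token ids for one training example: the prompt followed by the desired answer.
    # (A real system would get subword ids from a tokeniser; these numbers are made up.)
    prompt_ids = torch.tensor([12, 7, 93, 4])   # e.g. "<prompt> Can you help ... </prompt>"
    answer_ids = torch.tensor([55, 8, 21, 3])   # e.g. "<answer> It looks like ... </answer>"
    tokens = torch.cat([prompt_ids, answer_ids])

    vocab_size = 100
    # Stand-in for the LM's output: one row of next-word logits per input position
    logits = torch.randn(len(tokens) - 1, vocab_size, requires_grad=True)

    # Exactly the pretraining objective: predict token t+1 given the tokens up to t
    targets = tokens[1:]
    loss = F.cross_entropy(logits, targets)
    loss.backward()
    print(loss.item())

In practice the loss is often computed only over the answer tokens, but the objective itself is unchanged.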
3. Reinforcement learning from human feedback (RLHF)
Reinforcement learning (RL):

[Figure: the standard reinforcement learning loop between an agent and an environment.²]
But Rafailov et al. (2023) and others say that it isn't really necessary
to use reinforcement learning to update the model: with direct preference
optimization (DPO) the model is updated directly through supervised
learning on the human preference data, without a separate reward model
or an RL loop.
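Concretely, for a prompt $x_{1:T}$ with a human-preferred answer $y^{(w)}$ and a rejected answer $y^{(l)}$, their objective can be sketched (roughly, in notation close to this note's) as

$$J_\text{DPO}(\pi) = -\log \sigma\!\left(\beta \log \frac{P_\pi(y^{(w)} \mid x_{1:T})}{P_\theta(y^{(w)} \mid x_{1:T})} - \beta \log \frac{P_\pi(y^{(l)} \mid x_{1:T})}{P_\theta(y^{(l)} \mid x_{1:T})}\right)$$

where $\sigma$ is the sigmoid function, $P_\pi$ is the model being updated, $P_\theta$ is the original LM before any preference tuning, and $\beta$ is a scaling constant.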
² Figure by Andrej Karpathy.
More on RLHF
Learning a reward model from humans
For a prompt $x_{1:T}$, the reward model $h_\phi(x_{1:T}, y_{1:M}) \in \mathbb{R}$ should output a scalar score indicating how good the answer $y_{1:M}$ is.
We train the classifier using the negative log likelihood loss. With one
training example consisting of a prompt $x_{1:T}$, $B$ candidate answers
$\{y_{1:M}^{(k)}\}_{k=1}^{B}$³ and the selection $b$, the loss is

$$J(\phi) = -\log P_\phi\!\left(b \mid x_{1:T}, \{y_{1:M}^{(k)}\}_{k=1}^{B}\right) = -\log \frac{e^{h_\phi(x_{1:T},\, y_{1:M}^{(b)})}}{\sum_{j=1}^{B} e^{h_\phi(x_{1:T},\, y_{1:M}^{(j)})}}$$
³ Each candidate answer will have a different length $M^{(j)}$ that I denote explicitly here but will now drop.
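A minimal sketch of this loss for a single prompt (assuming PyTorch; the scores are random stand-ins for the reward model's outputs): treating the $B$ scalar scores as logits, the loss is just the cross-entropy against the index of the human-selected answer.

    import torch
    import torch.nn.functional as F

    B = 4                                         # number of candidate answers for one prompt
    scores = torch.randn(B, requires_grad=True)   # stand-in for h_phi(x, y^(k)), k = 1..B
    b = torch.tensor([2])                         # index of the answer the human selected

    # J(phi) = -log softmax(scores)[b], i.e. a standard cross-entropy loss over the B scores
    loss = F.cross_entropy(scores.unsqueeze(0), b)
    loss.backward()
    print(loss.item())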
RL using the learned reward model
We denote the parameters of our LM before RLHF as $\theta$, and the copy of the LM that we keep updating during RL (the policy) as $\pi$. The reward for a generated answer combines the learned reward model with a penalty that discourages the policy from drifting too far from the original LM:

$$r(x_{1:T}, y_{1:M}) = h_\phi(x_{1:T}, y_{1:M}) - \beta \log \frac{P_\pi(y_{1:M} \mid x_{1:T})}{P_\theta(y_{1:M} \mid x_{1:T})}$$
[Figure: the RLHF loop. The agent is the LM policy $\pi$, initialised from $\theta$. The environment supplies a prompt $x_{1:T}$, the agent's action is a generated answer $a = y_{1:M}$, and the environment returns the reward $r(x_{1:T}, y_{1:M})$.]
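As a rough sketch of how the reward for one sampled answer is computed (assuming PyTorch; the score and log probabilities are made-up stand-ins for values produced by the reward model, the policy and the original LM):

    import torch

    beta = 0.1                                # strength of the penalty term

    # Stand-ins for quantities computed by the actual models:
    reward_model_score = torch.tensor(1.7)    # h_phi(x_{1:T}, y_{1:M})
    policy_logprob = torch.tensor(-42.0)      # log P_pi(y_{1:M} | x_{1:T}) under the policy
    ref_logprob = torch.tensor(-40.0)         # log P_theta(y_{1:M} | x_{1:T}) under the original LM

    # r = h_phi - beta * log(P_pi / P_theta): the second term penalises drifting from the original LM
    r = reward_model_score - beta * (policy_logprob - ref_logprob)
    print(r.item())                           # this scalar reward is what the RL update tries to maximise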
Why learn a reward model instead of just asking humans?
An alternative to learning a reward model beforehand would have been
to ask humans directly to rate outputs as our model produces them.
We could then get the reward directly (and probably more accurately)
in our RL loop. But this would be much, much slower than first learning
a reward model, fixing it, and then running the RL loop very quickly
(since we don't have to wait for a slow human to respond).
Videos covered in this note
• Large language model training and inference (14 min)
• The difference between GPT and ChatGPT (13 min)
• Reinforcement learning from human feedback (RLHF) (15 min)
Further reading
I found Andrej Karpathy’s blog post Deep reinforcement learning:
Pong from pixels helpful.
References
A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, “Improving
language understanding by generative pre-training,” OpenAI, 2018.