
Language Models for Mathematicians

Vanny Khon, [email protected]
Boston University
October 23, 2024

In this note, I talk about deep learning based language models. I try to present the models as mathematically concisely as possible and, along the way, motivate the terms involved. We start with neural networks and work our way up to transformer-based language models, the model used in the making of ChatGPT. At the end, I explain my thoughts on why these things work and pose some questions that I think are important. I also try to show the coding part of all of this, and for that I use PyTorch.

I think Artificial Intelligence is one of the most consequential technologies of our lifetime. It is different from other technologies in that it is not too hard to learn. It has so many potential uses and misuses. So it is important that enough good people know it and, hopefully, we can steer it in a direction that is beneficial to all. And to be able to do something with this technology, we need to at least understand the models and be able to implement them. In fact, this note grows out of a class I am teaching to several friends in Cambodia, with the goal of improving language models for Khmer, my native language. The hope is to build a team that can facilitate and accelerate the development and deployment of large language models in Cambodia.

I recommend that the reader first read the "theory part" to get a conceptual understanding, and then proceed to the code and try to implement it.

Table of Contents
1. Introduction: Neural Network
1.1 Coding with PyTorch
2. Universal Approximation Theorems
2.1 Implementation
3. Gradient Descent Algorithms
3.1 Stochastic Gradient Descent
3.2 Training Neural Networks with PyTorch
4. Cross-Entropy Loss
5. Recurrent Neural Network: RNN
5.1 RNN
5.2 LSTM
5.3 GRU
6. RNN-based Language Model
6.1 The n-Gram Language Model
7. Transformer Model
7.1 Decoder Only Transformer
7.2 Training a Transformer: Masked Attention
7.3 Making a ChatBot: Fine Tuning
7.4 Universal Approximation Theorems for Transformers
8. My Thoughts
8.1 The Question of Understanding
8.2 The Question of Complexity
8.3 The Question of Reasoning

1. Introduction: Neural Network


Let's imagine that our body receives a signal, say through the eyes.
This signal goes through the eyes and flows from one group of
neurons to the next until it reaches a certain place where we make a
certain judgement about the signal, say whether it is a cat or a flower.
How to make a mathematical model of this process?

Let x be a signal. After being received, the input signal x goes to a first group of neurons N^1_1, N^1_2, ..., N^1_{k_1}, the neurons in the first layer. Each N^1_i processes the information it receives and passes it to a deeper group of neurons N^2_1, ..., N^2_{k_2}, the neurons in the second layer. This process continues until the signal reaches a certain depth L, with neurons N^L_1, ..., N^L_{k_L}, the neurons in the L-th layer, and the results from this last group are combined to produce an output.

[Figure: Information Flow in Neural Network — the input x = [x₁, x₂, x₃, x₄] enters the input layer and passes through two layers of neurons to the output, each layer computing a weighted sum.]

The simplest way to model a neuron N^i_j processing its inputs is to add all the inputs it receives to get a single number. However, just adding everything up is too restrictive. To give extra freedom to the model while still keeping it simple, we take a weighted sum instead. Let's say neuron N^i_j takes signals x^{i−1}_1, ..., x^{i−1}_{k_{i−1}} from N^{i−1}_1, ..., N^{i−1}_{k_{i−1}}, respectively. The result is

x^i_j = ∑_{l=1}^{k_{i−1}} a^i_{jl} x^{i−1}_l + b^i_j = (a^i_{j1}, ..., a^i_{jk_{i−1}}) · (x^{i−1}_1, ..., x^{i−1}_{k_{i−1}}) + b^i_j.

Here (a^i_{j1}, ..., a^i_{jk_{i−1}}) is the weight vector that neuron N^i_j uses to process the information given to it by the (i−1)-th layer; each neuron has its own weight vector. All of this can be efficiently packed as a matrix multiplication followed by adding a vector. Basically, the result passed from one layer of neurons to the next is an affine transformation of its input vector.

The above implies that we can use affine transformations to model how each layer of neurons processes its input and passes the result to the next. So to model information flow in the brain through neurons, we could just compose such maps. However, a composition of affine transformations is again an affine transformation, which is equivalent to having just one layer of neurons. To obtain a genuine notion of multiple layers, we let the output of each layer pass through a non-linear function before sending it forward.

Definition

A neural network is a function obtained by composing affine linear functions with non-linear functions. More precisely, a neural network of L layers is a function NN : R^n → R^m defined by

NN(x) = Af_L ∘ σ_{L−1} ∘ Af_{L−1} ∘ ⋯ ∘ σ_1 ∘ Af_1(x)

where each Af_i : R^{d_{i−1}} → R^{d_i} is an affine map, i.e., Af_i(x) = W_i x + b_i with W_i ∈ L(R^{d_{i−1}}, R^{d_i}) and b_i ∈ R^{d_i}, and d_0 = n and d_L = m. Each σ_i is a non-linear function applied pointwise to the output of Af_i.

Let me clarify what we mean by "applied pointwise" here. Say we


have a function f(x) : R → R, and v = (v1 , v2 , . . , vN ) is a vector.
Applying f on v pointwise is f(v) = (f(v1 ), f(v2 ), . . . , f(vN )).

In theory, one can choose the σ_i to be all different, but in practice people tend to use a single non-linear function throughout a neural network, i.e., σ_i = σ. The non-linear function σ : R → R is known as the activation function, and common choices are tanh(x), the sigmoid function, or ReLU, defined as follows:

tanh(x) = (e^x − e^{−x})/(e^x + e^{−x})
sigmoid(x) = e^x/(1 + e^x)
ReLU(x) = (x + |x|)/2

Note that ReLU(x) = 0 if x < 0 and ReLU(x) = x if x ≥ 0. If we use ReLU as the non-linear function, then when a neuron processes information and obtains a negative output, it passes 0 to the neurons in the next layer, i.e., it deactivates its output, and it only activates its output if the result is positive. I hope this justifies the name activation function used to refer to these non-linear functions.
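As a quick sanity check (a small sketch of my own, not part of the original exposition), we can verify numerically that the formula (x + |x|)/2 agrees with PyTorch's built-in ReLU, and evaluate the other two activations:

import torch

x = torch.linspace(-3, 3, 7)
# ReLU via the formula (x + |x|)/2 agrees with the built-in torch.relu
print(torch.allclose(torch.relu(x), 0.5 * (x + x.abs())))   # True
print(torch.sigmoid(x))   # pointwise sigmoid
print(torch.tanh(x))      # pointwise tanh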

1.1 Coding Neural Networks with PyTorch

One popular Python package used in deep learning is PyTorch, developed by Facebook, to which we now turn. Coding will be introduced throughout this note layer by layer.

In Python, when you want to use a package or a library, you have to import it. The packages that we need now are matplotlib.pyplot, used to plot graphs, and torch (PyTorch), used to create tensors. To import them, write the following in a code block.

import torch
import matplotlib.pyplot as plt

The basic objects in PyTorch are tensors, which are multi-dimensional arrays. There are quite a number of ways to create PyTorch tensors, and we recommend the reader consult the PyTorch documentation for details. Here we only show a few examples to give an idea of how it works.

a=torch.tensor([-1,2])   # create a tensor from a list of numbers using the torch package
b=torch.tensor([2,-7])

We can perform operations on tensors.

1. a+b performs pointwise addition
2. 3*a performs pointwise multiplication by a scalar
3. b**2 performs pointwise squaring
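For example (a small illustration using the tensors a and b defined above):

import torch

a = torch.tensor([-1, 2])
b = torch.tensor([2, -7])
print(a + b)    # tensor([ 1, -5])
print(3 * a)    # tensor([-3,  6])
print(b ** 2)   # tensor([ 4, 49])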

In [21… import torch
import matplotlib.pyplot as plt   # shorten matplotlib.pyplot to plt
a = torch.linspace(-4, 4, 30)     # create a tensor of 30 numbers evenly spaced between -4 and 4
plt.plot(a,a**2-3*a)              # plot a^2-3a against a
plt.title('graph of a^2-3a')
plt.xlabel('a')
plt.ylabel('a^2-3a')

Out[214… Text(0, 0.5, 'a^2-3a')

Once you create a tensor, you can slice it or extract various subsets of it. We show a few ways this can be done.

In [21… x=torch.rand(2,3)   # create a 2-by-3 tensor whose elements are random numbers in [0,1)
x1=x[1,:]                    # define a tensor x1 to be the second row of x
y1=x[:,2]                    # define a tensor y1 to be the third column of x
#Now we can print the above tensors
print(f'x= {x}')
print(f'x1={x1}')
print(f'y1={y1}')

x= tensor([[0.9607, 0.1908, 0.6339],


[0.9940, 0.0890, 0.3960]])
x1=tensor([0.9940, 0.0890, 0.3960])
y1=tensor([0.6339, 0.3960])

To create a neural network, we can use the package torch.nn from PyTorch. To use this package, we need to import it:

import torch.nn as nn

Once we have torch.nn, we can create an affine function with nn.Linear(m,n).

In [30… import torch.nn as nn
L=nn.Linear(2,3)   # create an affine map taking 2-dimensional input and producing 3-dimensional output

In mathematical terms, we create L(x) = Wx + b, where W is a 3-by-2 matrix and b ∈ R^3 (PyTorch stores the weight with shape (out_features, in_features), here 3 by 2). The entries of W and b are initialized with random numbers. If we want the bias term b = 0, we can write L=nn.Linear(2,3,bias=False). The call nn.Linear(2,3) produces an object that has several attributes and operations attached to it.

In [32… W=L.weight
b=L.bias
print(f'Weight W of L: {W}')
print(f'Bias b of L: {b}')

Weight W of L: Parameter containing:
tensor([[-0.2916, -0.7016],
        [ 0.3956,  0.5720],
        [ 0.3072,  0.3156]], requires_grad=True)
Bias b of L: Parameter containing:
tensor([-0.6838, -0.4431, -0.2809], requires_grad=True)

Let me explain a bit more about L=nn.Linear(n,m). Sure, it creates a linear (affine) map L ∈ L(R^n, R^m). But this is something created for use in machine learning, and during training of a neural network it is efficient to process several data points, called a batch, at the same time. So the natural input for L is a batch of vectors instead of a single vector, i.e., it expects input of shape [batch_size, input_dim]. Say we have a batch of vectors [v_1, v_2, ..., v_k], where v_i ∈ R^n; then the linear map L=nn.Linear(n,m) processes this stack in parallel and produces [L(v_1), ..., L(v_k)]. We will come back to this later. When we create L, what we get is an object. In Object-Oriented Programming (OOP), an object has attributes and methods. I tend to think of attributes as the internal state or data of the object, and methods as functions or actions we can perform with this internal state or data. The object nn.Linear(n,m) has a weight W and a bias b attached to it; that's expected. What is more, the weight W is a pair (weight, gradient), not just the weight, and the same holds for b.
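Here is a small illustration of the batch behaviour (the shapes are chosen arbitrarily for the example):

import torch
import torch.nn as nn

L = nn.Linear(2, 3)
batch = torch.rand(5, 2)   # a batch of 5 vectors in R^2
out = L(batch)             # L is applied to each of the 5 vectors in parallel
print(out.shape)           # torch.Size([5, 3])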

Let's create a two-layer neural network N : R → R, obtained by composing an affine map A1 : R → R^20, tanh, and an affine map A2 : R^20 → R.

In [36… class NeuralNet(nn.Module):
    def __init__(self):             # this is the part where we define the attributes
        super().__init__()          # this initializes the base class nn.Module
        self.A1=nn.Linear(1,20)     # create A1 as an affine map from 1 dimension to 20 dimensions
        self.A2=nn.Linear(20,1)     # create A2 as an affine map from 20 dimensions back to 1
    def forward(self,x):            # this is where we assemble all the parts/attributes
        y=self.A1(x)
        y=torch.tanh(y)
        y=self.A2(y)
        return y
N = NeuralNet()                     # this creates a neural network

Now we can plot the function our neural network produces.

In [41… x = torch.linspace(-20, 20, 300)   # create the domain
y = torch.zeros_like(x)                     # create a placeholder for y
with torch.no_grad():                       # stop PyTorch from tracking gradients here
    for i in range(len(x)):
        y[i] = N(x[i:i+1])                  # x[i] is a zero-dim tensor; x[i:i+1] is a tensor of shape [1]

#plot the result


plt.figure(figsize=(10,5))
plt.plot(x, y)
plt.show()

We will see how to train a neural network after we learn stochastic gradient descent.

2. Universal Approximation Theorems

A neural network is defined by its weight matrices W_i and bias vectors b_i, collectively known as parameters and denoted by P = (W_1, W_2, ..., W_L, b_1, ..., b_L). These parameters can be thought of as one long vector of dimension K for some K ∈ N, representing a point P in R^K. Each point corresponds uniquely to a neural network instance, denoted NN_P, which is structured according to the network's fixed architecture. This parameter-to-network mapping can be expressed as:

P ↦ NN_P

The primary task of machine learning is to select the parameter P that makes NN_P best approximate the function of interest. This point of view allows us to see ML as a technique of function approximation, much like Fourier series and Taylor expansions. One obvious question is what kind of functions can be approximated by neural networks. The answer to this question comes from universal approximation theorems, to which we now turn. First, we look at a theorem due to G. Cybenko in his paper Approximation by Superpositions of a Sigmoidal Function.

Universal Approximation Theorem, G. Cybenko: The space

N_2 = ⋃_{N=1}^∞ { ∑_{i=1}^N α_i σ(y_i^T x + b_i) | y_i ∈ R^d, α_i, b_i ∈ R }

is dense in C(I_d, R), where σ : R → R is a fixed discriminatory function, to be defined below, and I_d is the unit cube in dimension d.

Definition: We say that σ is discriminatory if, for a measure µ ∈ M(I_d),

∫_{I_d} σ(y^T x + θ) dµ(x) = 0

for all y ∈ R^d and θ ∈ R implies that µ = 0.

Proof

Note that

N_2 = ⋃_{N=1}^∞ { ∑_{i=1}^N α_i σ(y_i^T x + b_i) | y_i ∈ R^d }

is a linear subspace of C(I_d, R). Assume it is not dense; then its closure N̄_2 is a proper closed subspace of C. We can define a linear functional ℓ : N̄_2 → R to be zero. Since N̄_2 is a proper subspace, the Hahn-Banach theorem allows us to extend ℓ to all of C such that ℓ(f) = 0 if f ∈ N̄_2 and ℓ(f) ≠ 0 for some f ∉ N̄_2. In particular, there exists f_0 ∈ C such that ℓ(f_0) ≠ 0.

Now, by the Riesz-Markov-Kakutani representation theorem, there exists a finite signed Borel measure µ on I_d such that

ℓ(f) = ∫ f dµ

for all f ∈ C. In particular, ∫ σ(y^T x + b) dµ = 0 for all y ∈ R^d and b ∈ R. But σ is discriminatory, so this implies µ = 0. However, we know that ∫ f_0 dµ = ℓ(f_0) ≠ 0, which is a contradiction.

The next stage of the proof is to show that the activation functions we listed above are discriminatory with respect to the Lebesgue measure. This is true, but we won't prove it here.

We can equip the space C(I_d, R) with the L^p norm, and there are similar results by Kurt Hornik in 1990 in his paper Approximation capabilities of multilayer feedforward networks.

Theorem, K. Hornik: The space

N_2 = ⋃_{N=1}^∞ { ∑_{i=1}^N α_i σ(y_i^T x + b_i) | y_i ∈ R^d }

is dense in C(I_d, R) when C is equipped with the L^p norm, 1 ≤ p ≤ ∞, and σ is a continuous, bounded and non-constant function.

2.1 Implementation
Let's consider the task of finding a two-layer neural network to
approximate cos(x) for x ∈ [−10, 10] with respect to the L2 norm.
Given ε > 0, the Universal Approximation Theorem tells us that there
exists a neural network which is ε-close to cos(x) on this interval in
the L2 norm. The question then is which neural network?
Unfortunately, there is no straightforward way to identify it directly.
What we can do is fix an architecture for the network and then search
for the parameters P that work best. We start with a random initial
guess for P and use gradient descent to iteratively improve this
guess.

Let's set this up mathematically. As an example, we first select the architecture by setting the first layer to have dimension 13 and the second layer to have dimension 5. This means we are looking for a function

NN_P(x) = W_3 (σ(W_2 σ(W_1 x + b_1) + b_2)) + b_3

where W_1 is a matrix that takes R^1 to R^13, W_2 is a matrix taking R^13 to R^5, W_3 is a matrix that takes R^5 back to R, and b_1 ∈ R^13, b_2 ∈ R^5 and b_3 ∈ R. In this case, P = (W_1, W_2, W_3, b_1, b_2, b_3) is a point in R^{13·1 + 13·5 + 5·1 + 13 + 5 + 1} = R^102.
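We can double-check the count of 102 parameters in PyTorch (a small sketch of the architecture just described, assuming tanh as the activation σ):

import torch.nn as nn

net = nn.Sequential(nn.Linear(1, 13), nn.Tanh(), nn.Linear(13, 5), nn.Tanh(), nn.Linear(5, 1))
num_params = sum(p.numel() for p in net.parameters())
print(num_params)   # 102 = 13*1 + 13*5 + 5*1 + 13 + 5 + 1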

Given P ∈ R^102, we construct a function NN_P(x) to approximate the cosine function. The error of this approximation is measured by the loss function, defined as:

loss(P) = ∫_{−10}^{10} (cos(x) − NN_P(x))^2 dx

The task is to minimize loss : R^102 → R. Of course we can't hope to find the P that truly is the minimum, since the dimension is so large. What we can hope for is to find P such that loss(P) < ε for an acceptable ε > 0.

Next, we need to be more realistic. We cannot compute loss(P) directly because the function NN_P(x) is difficult, if not impossible, to integrate analytically. To address this issue, we define a loss function at each point as:

l(P, x) = |cos(x) − NN_P(x)|^2

and then approximate the overall loss function loss(P) by:

loss_D(P) = (1/N) ∑_{i=1}^N l(P, x_i)

where D = {x_1, ..., x_N} is a set of points sampled from [−10, 10]. Here, the subscript D indicates that the loss function depends on the data set D. Since this loss function is based on the data, it is called the empirical loss. Note that as N → ∞, for points sampled uniformly from [−10, 10], 20·loss_D(P) → loss(P).

Let's summarize where we are now. We have a dataset D. For each data point x_i ∈ [−10, 10], we can evaluate the point-wise loss for each parameter set P. Averaging these, we obtain the empirical loss:

loss_D(P) = (1/N) ∑_{i=1}^N |cos(x_i) − NN_P(x_i)|^2

While this empirical loss is computable, it remains challenging to optimize with the critical-point method. A method commonly used for this optimization is called gradient descent.

3. Gradient Descent Algorithm


Let's remind ourselves what gradient descent is. Consider
f : Rd → R. The goal is to find a point P where f(P ) is the global
minimum of f . The gradient descent algorithm suggests the following
steps:

1. Start at a point P0 .
2. Compute the gradient ∇f(P0 ).
3. Take a small step η > 0 in the direction opposite to the gradient,
i.e., P1 = P0 − η∇f(P0 ).
4. Repeat the above process.

The number η is known as the learning rate. The reason we move in


the direction opposite to the gradient is that the function decreases
fastest in this direction.
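As a toy illustration of these steps (my own sketch, using f(P) = ‖P‖^2, learning rate η = 0.1 and 50 iterations):

import torch

def f(P):
    return (P ** 2).sum()          # a simple convex function with minimum at 0

P = torch.tensor([3.0, -2.0], requires_grad=True)
eta = 0.1
for _ in range(50):
    loss = f(P)
    loss.backward()                # compute the gradient of f at the current P
    with torch.no_grad():
        P -= eta * P.grad          # step in the direction opposite to the gradient
    P.grad.zero_()                 # reset the accumulated gradient
print(P)                           # close to the minimum at (0, 0)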

We now state and prove that if f is a nice function, the gradient descent algorithm converges to the minimum. Note that most functions we encounter in machine learning are far from nice.

Theorem: Let f : R^d → R be a strongly convex, bounded below, non-constant differentiable function whose derivative Df : R^d → R^d is Lipschitz. Then f has a unique minimum, which we denote by P_c. Moreover, for η sufficiently small, the sequence

P_{n+1} = P_n − η∇f(P_n)

converges to P_c for any initial point P_0.

Definition: Let f : R^d → R. We say f is convex if for all P, Q ∈ R^d and λ ∈ [0, 1] we have

f(λP + (1 − λ)Q) ≤ λf(P) + (1 − λ)f(Q),

which implies

f(P) − f(Q) ≥ ∇f(Q)^T (P − Q)

if f is differentiable. A function f is strongly convex if there exists µ > 0 such that

f(P) ≥ f(Q) + ∇f(Q)^T (P − Q) + (µ/2)‖P − Q‖^2.

And F : R^d → R^d is Lipschitz if there exists L > 0 such that

‖F(P) − F(Q)‖ ≤ L‖P − Q‖

for any P, Q ∈ R^d.

Note: Convexity of f with Df Lipschitz does not imply uniqueness of a minimizer; f(x, y) = x^2 is an example of such a function.

Proof Recall that the gradient descent update rule is given by:

Pk+1 = Pk − η∇f(Pk )

where η is the step size (learning rate). We aim to show that this
sequence Pk converges to the global minimum.

Since f is strongly convex, there exists a constant µ > 0 such that for any P and Q,

f(Q) ≥ f(P) + ∇f(P)^T (Q − P) + (µ/2)‖P − Q‖^2.

Using the Lipschitz continuity of ∇f , we also have that there exists


L > 0 such that:

∥∇f(P ) − ∇f(Q)∥ ≤ L∥P − Q∥.

Now, consider the difference in function values at consecutive points in the gradient descent sequence:

f(P_{k+1}) ≤ f(P_k) + ∇f(P_k)^T (P_{k+1} − P_k) + (L/2)‖P_{k+1} − P_k‖^2.

Substituting P_{k+1} = P_k − η∇f(P_k) into this inequality:

f(P_{k+1}) ≤ f(P_k) − η‖∇f(P_k)‖^2 + (Lη^2/2)‖∇f(P_k)‖^2.

This simplifies to:

f(P_{k+1}) ≤ f(P_k) − (η − Lη^2/2)‖∇f(P_k)‖^2.

For convergence, we require η to satisfy η < 2/L. In this case, the term η − Lη^2/2 is positive, and thus f(P_{k+1}) ≤ f(P_k), implying that the function values decrease with each iteration of gradient descent.

Since f(P) is bounded below and the sequence f(P_k) is decreasing, it converges, say to m_f.

The inequality

f(P_{k+1}) ≤ f(P_k) − (η − Lη^2/2)‖∇f(P_k)‖^2

can be rewritten, using ‖P_{k+1} − P_k‖ = η‖∇f(P_k)‖, as

0 ≤ (1/η − L/2)‖P_{k+1} − P_k‖^2 ≤ f(P_k) − f(P_{k+1}).

Summing this expression from k = 0 to N, we obtain

0 ≤ (1/η − L/2) ∑_{k=0}^N ‖P_{k+1} − P_k‖^2 ≤ f(P_0) − f(P_{N+1}) ≤ f(P_0) − m_f.

This means the series ∑_k ‖P_{k+1} − P_k‖^2 is convergent, and therefore the sequence {P_k} converges to a point, which we call P_c. We leave it for keen readers to entertain themselves by proving uniqueness.

3.1 Stochastic Gradient Descent

Consider the problem of training a neural network NN(x) to learn a function F : R → R. We don't know F, but we have N samples of it, i.e., data about F. Let D = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)} be the data we collected about F.

Let's say we fix a neural network architecture; in other words, we fix the number of layers and the number of neurons the network has. This specification dictates the size of the parameter P. Let's say P ∈ R^K. Then the empirical loss function is

loss_D(P) = (1/N) ∑_{i=1}^N |F(x_i) − NN_P(x_i)|^2 = (1/N) ∑_{i=1}^N |y_i − NN_P(x_i)|^2.
This is a function from R^K to R. Training the neural network means using gradient descent (GD) to update the parameter P. Now, if N is large, computing the loss function over the whole data set can be computationally costly, and the network can overfit the data. Stochastic gradient descent was proposed to address these issues. Instead of computing the loss function on all N data points, we compute it on a subset of size n, where n is much smaller than N.

Now fix n < N; the stochastic gradient descent algorithm is as follows:

1. Select n random data points from D; this gives a subset D̃ ⊂ D.
2. Form the loss function with respect to D̃, loss_D̃(P).
3. Use gradient descent to update P.
4. Select n new random data points from D; call the new subset D̃ again.
5. Form loss_D̃(P).
6. Update P using gradient descent.
7. Repeat the above steps.

The number n used to compute the stochastic loss function is called the batch size. Suppose we have N = 1000 data points; if we take batch size n = 20, then to go through all the data points we must run SGD 1000/20 = 50 times. When we have run through all the data once, we say one epoch has passed.
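Here is a minimal sketch of one such epoch (my own illustration, assuming the cosine data from before, N = 1000, batch size n = 20, and a small tanh network):

import torch
import torch.nn as nn

N, n, lr = 1000, 20, 0.01
X = torch.linspace(-10, 10, N).view(-1, 1)
Y = torch.cos(X)
net = nn.Sequential(nn.Linear(1, 13), nn.Tanh(), nn.Linear(13, 1))

perm = torch.randperm(N)                     # shuffle the data once per epoch
for start in range(0, N, n):                 # 1000 / 20 = 50 updates per epoch
    idx = perm[start:start + n]              # indices of the current mini-batch (the subset D-tilde)
    loss = ((net(X[idx]) - Y[idx]) ** 2).mean()
    net.zero_grad()
    loss.backward()                          # gradient of the mini-batch loss
    with torch.no_grad():
        for p in net.parameters():
            p.data -= lr * p.grad            # one gradient-descent step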

Remark: An interesting question is how we know that the evolution of P_k under the SGD algorithm converges to a stable position, let alone a global minimum.

3.2 Training Neural Networks with PyTorch

Let's say we want to find a neural network to approximate cos(x) on I = [−10, 10]. First, we create data consisting of cosine values on I, and then train a fixed-size neural network to approximate this data.

In [66… #Import libraries


import torch
import torch.nn as nn
import matplotlib.pyplot as plt
import numpy

#Generate data of cosine


x=torch.linspace(-10,10,200)
y=torch.cos(x)   # PyTorch's cosine function applies pointwise on a tensor
plt.figure(figsize=(10,3))
plt.plot(x,y)

Out[66]: [<matplotlib.lines.Line2D at 0x15c94aa50>]


Next, we create a neural network with 3 layers. The input starts with 1
dimension, passes through a 40-dimensional space, then a 15-
dimensional space, and finally maps to a 1-dimensional output (a real
number).

In [69… # Define a model


class NeuralNetwork(nn.Module):
def __init__(self):
super(NeuralNetwork,self).__init__()
self.A1=nn.Linear(1,40)
self.A2=nn.Linear(40,15)
self.A3=nn.Linear(15,1)
def forward(self,x):
y=torch.tanh(self.A1(x))
y=torch.tanh(self.A2(y))
y=self.A3(y)
return y
net=NeuralNetwork()

In [71… # plotting the network before training


z=torch.zeros(len(x))
with torch.no_grad():
for i in range(len(x)):
z[i]=net(x[i:i+1])
plt.figure(figsize=(10,4))
plt.plot(x,z)

Out[71]: [<matplotlib.lines.Line2D at 0x15cbac500>]


In [73… #Define loss function
def loss_fn(a,b):
return (a-b)**2
#choose learning rate
lr=0.01

In [75… # Start the gradient descent steps
for _ in range(40000):
    idx = torch.randint(0, len(x), (1,)).item()   # select a random index between 0 and len(x)-1
    input=x[idx:idx+1]                            # get the corresponding point x
    output=y[idx]                                 # get the value of cosine at that point
    predicted_out=net(input)                      # value computed by the neural network
    loss=loss_fn(predicted_out,output)            # compute the error
    net.zero_grad()                               # set the gradients of all parameters to zero
    loss.backward()                               # compute the gradient of the loss
    with torch.no_grad():                         # stop tracking gradients
        for param in net.parameters():
            param.data -= lr * param.grad         # one gradient descent step

# Now evaluate how the model changed after training


with torch.no_grad():
predicted_by_net=net(x.view(-1,1))
plt.figure(figsize=(12,3))
plt.plot(x,y,label='Cosine')
plt.plot(x,predicted_by_net,label='NeuralNet')
plt.xlabel('Input')
plt.ylabel('Output')
plt.title('Cosine vs NeuralNet')
plt.legend()

Out[75]: <matplotlib.legend.Legend at 0x15cbb62a0>


Note that our batch size is 1. That's why I wanted the network to take 40000 gradient descent steps.

One might think that the network "learns" the cosine function. I think this is not the case; what the network does is imitate the data it sees, and that's all there is to it. Let's plot the neural network's prediction on a domain larger than the one it has seen.

In [78… x_l=torch.linspace(-20,20,1000) # create a larger domain


y_cos=torch.cos(x_l)
with torch.no_grad():
y_net=net(x_l.view(-1,1))
plt.figure(figsize=(13,3))
plt.plot(x_l,y_cos,label='Cosine')
plt.plot(x_l,y_net,label='NeuralNet')
plt.xlabel('Input')
plt.ylabel('Output')
plt.legend()

Out[78]: <matplotlib.legend.Legend at 0x15c991f10>

An easy argument showing that a neural network as defined above can never approximate cos(x) on all of R, regardless of how big it is, is that such a neural network has only finitely many critical points, while the cosine function has infinitely many critical points.

Now, let me explain the zero_grad issue. In PyTorch, when we create a tensor W, it can come as a pair (W, grad), and this is always the case for the parameters of neural network objects. For example, in the neural network we defined above, there are several parameters. If we do net.A1, this gives us a linear map nn.Linear(1,40), which is an object of its own. Now when we do net.A1.weight, we get the weight of A1, and associated to this weight is grad. This means we have a pair (W, grad). The call loss.backward() computes the derivative of the loss function and updates the gradient of every parameter involved in computing the loss function, so W gets its gradient updated. In the beginning, before any gradient has been computed, grad = 0, and loss.backward() updates the gradient by accumulation, i.e., 0 + grad. Note that this doesn't change the weight W itself. To change W, we must perform a gradient descent step. After the gradient descent step, the pair (W, grad) becomes (W_new, grad). If we then call loss.backward() again, we would get the pair (W_new, grad + grad(W_new)); to avoid this problem, we need to reset the pair (W_new, grad) to (W_new, 0) before we call loss.backward() again, and we can do this with net.zero_grad(), which sets the gradients of all parameters to zero.
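A small illustration of this accumulation behaviour (my own sketch, not from the original code):

import torch
import torch.nn as nn

L = nn.Linear(1, 1)
x = torch.tensor([[1.0]])

loss = L(x).sum()
loss.backward()
g1 = L.weight.grad.clone()       # gradient after the first backward()

loss = L(x).sum()
loss.backward()                  # accumulates: grad is now g1 + g1
print(L.weight.grad, 2 * g1)     # the two printed values agree

L.zero_grad()                    # reset gradients (set to zero or None, depending on the PyTorch version)
print(L.weight.grad)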

4. Loss Functions for Classification Tasks: Cross-Entropy

Setup: We have a set D = ⋃_{i=1}^N D_i, a disjoint union of subsets D_i. The function F : D → R^N defined by F(x) = e_i if x ∈ D_i, where e_1, e_2, ..., e_N are the standard unit vectors, allows us to classify a point in D based on which subset D_i it belongs to. For each x, the vector value F(x) can be thought of as a probability distribution on {D_1, D_2, ..., D_N}. To use a neural network for a classification task means to find a neural network NN : D → R^N that approximates F on D. The output NN(x) ∈ R^N should be thought of as a probability distribution.

We now need a way to compare two probability distributions. This will allow us to measure how well a neural network does the classification job.

Definition

Let P, Q : X → [0, 1] be probability distributions over X. The cross entropy H(P, Q) is defined as

H(P, Q) = − ∑_{x∈X} P(x) log(Q(x))

It would be nice if the cross entropy were a metric. However, it is not; one obvious issue is that it is not symmetric, i.e., H(P, Q) ≠ H(Q, P) in general. What makes the cross entropy a good candidate for measuring how close Q is to P is the following inequality: H(P, Q) ≥ H(P, P). This means that the smaller we can make the cross entropy, the closer we make Q to P.

Proof: Note that

H(P, Q) = − ∑_{x∈X} P(x) log Q(x)

can be rewritten as

H(P, Q) = − ∑_{x∈X} ( P(x) log (Q(x)/P(x)) + P(x) log P(x) ),

which simplifies to

H(P, Q) = H(P, P) − ∑_{x∈X} P(x) log (Q(x)/P(x)).

Now we need to prove that

D_KL(P‖Q) = − ∑_{x∈X} P(x) log (Q(x)/P(x)) ≥ 0.

The quantity D_KL(P‖Q) is called the Kullback-Leibler (KL) divergence. It is 0 if and only if P = Q.

Now, log is a concave function, therefore it satisfies Jensen's inequality:

∑_{i=1}^n α_i log(x_i) ≤ log(∑_{i=1}^n α_i x_i)

whenever x_i ∈ [0, ∞] and α_i ∈ [0, 1] satisfy ∑_{i=1}^n α_i = 1.

Applying this with α_i = P(x_i) and x_i replaced by Q(x_i)/P(x_i), we get:

−D_KL(P‖Q) ≤ log( ∑_{x∈X} P(x) · Q(x)/P(x) ) = log(1) = 0

Thus, D_KL(P‖Q) ≥ 0.

Back to using neural networks NN to approximate probability distributions. We know that the output of the neural network is a vector NN(x) = (s_1, s_2, ..., s_N) in R^N, which in general is not a probability distribution. To deal with this problem, we pass the output of the neural network through the softmax function, defined as:

Definition: The function SoftMax : R^n → I^n is defined by

SoftMax(x_1, …, x_n) = (p_1, …, p_n) = (e^{x_1}/∑_{i=1}^n e^{x_i}, …, e^{x_n}/∑_{i=1}^n e^{x_i})
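A small numerical check (my own illustration): softmax turns raw network outputs into a distribution Q, and the inequality H(P, Q) ≥ H(P, P) holds for any target distribution P.

import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, -1.0, 0.5])
Q = F.softmax(logits, dim=0)            # a probability distribution (sums to 1)
P = torch.tensor([0.7, 0.2, 0.1])       # some target distribution

H_PQ = -(P * Q.log()).sum()             # cross entropy H(P, Q)
H_PP = -(P * P.log()).sum()             # entropy H(P, P)
print(Q.sum(), H_PQ >= H_PP)            # tensor(1.), tensor(True)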

5. Recurrent Neural Network

Imagine we have a collection of news articles C = {A_1, A_2, ..., A_K} and we want to classify them into L = {sport, politics, health}. Each news article is a string of words. To process these articles, we need a way to represent each word numerically, so that we can use mathematical techniques to do the classification task.

Assume A_i = v_1 ... v_L, where the v_k are the words used to write A_i and L is the length of A_i. Let W = {w_1, w_2, ..., w_N} be the set of vocabulary used to compose {A_1, ..., A_K}, ordered in some manner. In the parlance of natural language processing, C is referred to as a corpus and W is the dictionary.

The goal is to represent the w_i numerically. One naive way to do this is to use the number 1 to represent w_1, 2 to represent w_2, and so on. But the natural numbers have an order structure, which we can't quite see with words. So instead we should use vectors to represent words. More precisely, we want to find an embedding E : W → R^d for some d. One way to do this is to take d = |W| = N and send w_i to the standard unit vector e_i ∈ R^N; this is known as one-hot encoding. However, there are several issues with this method. First, the embedding dimension is large; for example, there are around 600,000 words in English. Second, R^N is a metric space, and we would hope that words w_i and w_j that are similar in meaning are encoded closer together than those that are not, but every pair of distinct words has distance √2 in the one-hot encoding. Continuous word embeddings deal with the issues mentioned about the one-hot encoding: there is d ≪ |W| and a map E : W → R^d that encodes each word as a vector, with words of similar meaning staying close to one another. This is not a mathematical statement but rather an observed fact. We recommend the paper A Neural Probabilistic Language Model by Bengio et al. for details.

The classification problem mentioned at the beginning can now be seen as the classification of sequences of various lengths, where each term of the sequence is in R^d. A more general setup is: given a finite sequence x_1, x_2, ..., x_L in R^d, find a map f to the output space Y, which can be continuous or discrete. This is an example of time series analysis. To deal with time series, we introduce the recurrent neural network.

5.1 Recurrent Neural Network (RNN)

Consider the text A_1 = x_1, x_2, ..., x_{L_{A_1}}, where x_i ∈ R^d. Any model that can classify this text should look at the whole text to do so. Let's try to formulate this question of classifying text mathematically. Basically, we want a function f that takes as input a finite sequence in R^d and outputs a vector in R^3, such that if the sequence comes from a sport, health, or politics article, f maps it to e_1, e_2 and e_3, respectively. We cannot find such a function exactly; what we can do is approximate it.

One thing we could do is train a several-layer neural network that takes the concatenation of the x_i, that is, one long vector. However, here we run into the same problem of a large-dimensional space. The recurrent neural network was proposed to process sequences of any length while requiring a relatively small number of parameters.

A recurrent neural network (RNN) processing the sequence x_1, x_2, ..., x_T is defined as follows:

h_t = f(x_t, h_{t−1})
y_t = g(h_t)

for all 1 ≤ t ≤ T, with h_t ∈ R^H for some H and h_0 = 0. We should think of h_t as the memory of the system up to time t and y_t as the output. Before anything happens, h_0 = 0 and the system contains no information; at time t = 1, the system looks at the first word x_1 and produces a memory h_1. Next, it looks at the second word x_2 and uses the previous memory h_1 to produce a new memory h_2. The process repeats until the end of the sequence. In the classification task we mentioned earlier, we would use y_T to predict the class the news article belongs to; h_T is what the system has obtained after looking at all the words that make up A_1.

We think of y_t as the output of the network, the h_t are known as the hidden vectors, and H is called the hidden dimension.

To truly earn the name neural network, f and g should be compositions of affine maps and non-linear activation functions. One choice is the following:

f(x, h) = σ(W_x x + W_h h + b_h)
g(h) = W_y h + b_y
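Here is a minimal sketch of this recurrence (my own illustration, with assumed sizes d = 8 for the input, H = 16 for the hidden state, and output dimension d; PyTorch also provides this as nn.RNN / nn.RNNCell):

import torch
import torch.nn as nn

d, H, T = 8, 16, 5
Wx = nn.Linear(d, H, bias=False)   # W_x
Wh = nn.Linear(H, H)               # W_h together with the bias b_h
Wy = nn.Linear(H, d)               # W_y together with the bias b_y

xs = torch.randn(T, d)             # a sequence x_1, ..., x_T
h = torch.zeros(H)                 # h_0 = 0
for t in range(T):
    h = torch.tanh(Wx(xs[t]) + Wh(h))   # h_t = f(x_t, h_{t-1})
    y = Wy(h)                           # y_t = g(h_t)
print(h.shape, y.shape)            # torch.Size([16]) torch.Size([8])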

Now we present a version of the universal approximation theorem for RNNs.

Theorem: Let F_1(x_1), F_2(x_1, x_2), ..., F_T(x_1, ..., x_T) be continuous functions with x_i ∈ A ⊂ R^d and outputs in R^m. If A is compact, then for any ε > 0 there exist appropriate weights W_x, W_h, W_y, b_h, b_y such that

sup_{1≤t≤T} sup_{x_i∈A} |F_t(x_1, ..., x_t) − y_t| ≤ ε,

where y_t is the output of the recurrent neural network with weights W_x, W_h, W_y, b_h and b_y.

We will not prove this theorem here.

Other Variants of RNN

When training an RNN, we need to approximate the gradient of the model, which is a composition of functions. The longer the input sequence, the deeper the composition, and this leads to errors in the approximation. At least two phenomena can occur: either the gradient becomes close to zero, leading to slow training, or the gradient becomes too large, leading to unstable learning. To address these issues, various variants of the RNN have been introduced. We will introduce two of them here; from now on we will refer to all of them as RNNs, and all we care about is the hidden state h_t.

5.2 Long Short Term Memory: LSTM


An LSTM consists of several key components called gates or cells.
These components include the forget gate, input gate, cell state
update, and output gate.

Given the input at time step t, xt , the previous hidden state ht−1 , and
the previous cell state Ct−1 , the LSTM cell performs the following
operations:

1. Forget Gate:

ft = σ(Wf xt + Uf ht−1 + bf )

where σ is the sigmoid function.

2. Input Gate:

it = σ(Wi xt + Ui ht−1 + bi )
3. Cell State Update:

C̃_t = tanh(W_C x_t + U_C h_{t−1} + b_C)

The new cell state is computed as:

C_t = f_t ⊙ C_{t−1} + i_t ⊙ C̃_t

where ⊙ denotes element-wise multiplication.

4. Output Gate:

ot = σ(Wo xt + Uo ht−1 + bo )

5. Hidden State Update:

ht = ot ⊙ tanh(Ct )

In summary, the equations for the LSTM cell are:

f_t = σ(W_f x_t + U_f h_{t−1} + b_f)
i_t = σ(W_i x_t + U_i h_{t−1} + b_i)
C̃_t = tanh(W_C x_t + U_C h_{t−1} + b_C)
C_t = f_t ⊙ C_{t−1} + i_t ⊙ C̃_t
o_t = σ(W_o x_t + U_o h_{t−1} + b_o)
h_t = o_t ⊙ tanh(C_t)

Here, W_f, W_i, W_C, W_o are the weight matrices for the respective gates, U_f, U_i, U_C, U_o are the weight matrices for the hidden state, and b_f, b_i, b_C, b_o are the bias vectors.
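PyTorch's nn.LSTM implements exactly these equations; here is a brief shape illustration (sizes are assumed for the example):

import torch
import torch.nn as nn

d, H, T, batch = 8, 16, 10, 4
lstm = nn.LSTM(input_size=d, hidden_size=H, batch_first=True)

x = torch.randn(batch, T, d)      # a batch of sequences of length T
out, (h_T, C_T) = lstm(x)         # out[:, t, :] holds h_t for each sequence
print(out.shape, h_T.shape, C_T.shape)
# torch.Size([4, 10, 16]) torch.Size([1, 4, 16]) torch.Size([1, 4, 16])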

5.3 Gated Recurrent Unit


A GRU consists of two main gates that manage the update of the
hidden state: the reset gate and the update gate.

Given the input at time step t, xt , and the previous hidden state ht−1 ,
the GRU cell performs the following operations:
1. Reset Gate:

rt = σ(Wr xt + Ur ht−1 + br )

where σ is the sigmoid function.

2. Update Gate:

zt = σ(Wz xt + Uz ht−1 + bz )

3. Candidate Hidden State:

h̃_t = tanh(W_h x_t + U_h (r_t ⊙ h_{t−1}) + b_h)

where ⊙ denotes element-wise multiplication.

4. Hidden State Update: The new hidden state is a linear interpolation between the previous hidden state and the candidate hidden state:

h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t

In summary, the equations for the GRU cell are:

r_t = σ(W_r x_t + U_r h_{t−1} + b_r)
z_t = σ(W_z x_t + U_z h_{t−1} + b_z)
h̃_t = tanh(W_h x_t + U_h (r_t ⊙ h_{t−1}) + b_h)
h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t

Here, W_r, W_z, W_h are the weight matrices for the respective gates, U_r, U_z, U_h are the weight matrices for the hidden state, and b_r, b_z, b_h are the bias vectors.

6. RNN-Based Language Model

A language model is a function that takes a sequence of words as input and outputs a new word. This allows one to get yet another new word using the word just generated and the previous words, and one can keep producing as many words as one wants this way. This is called an auto-regressive model.

In this section, we will build a character-level language model based on an RNN. What we mean is that we will use one of the RNN models to process a sequence and use the hidden state to produce the resulting prediction.

The strategy is as follows. First we one-hot encode each character, i.e., if the list of characters is {c_1, c_2, ..., c_K}, then one-hot(c_j) = e_j ∈ R^K. Next we project this embedding into another space R^d via a map E : R^K → R^d; this process is called continuous embedding, and it can be done by multiplying e_j by a matrix E, which is a learnable parameter. This whole process allows us to turn a sequence of characters into a sequence of vectors in R^d, and from here we can use an RNN to approximate

P(w_{n+1} | w_1, ..., w_n).

This conditional probability is a map P : (w_1, ..., w_n) → R^K.

An RNN can take a sequence of arbitrary length and make a prediction. If an RNN could learn to predict the next word based on dependencies over a long sequence, that would be best. But we can't take too long a sequence, since the longer the sequence is, the more difficult it is for us to approximate the gradient, and hence there is no effective learning.

The maximum length of sequence we choose to process is called the block size. Now let's say we have block_size = 32; this means we want to use 32 characters to predict the next character. Let T be the text corpus we use for training. During training, we take a chunk of 33 characters from T, w_1 ... w_32 w_33, and use w_1 w_2 ... w_32 as input and w_33 as output. The RNN goes through the embeddings x_1, x_2, ..., x_32 of w_1 w_2 ... w_32 to produce h_1, h_2, ..., h_32, and we use h_32 to predict w_33 via the map g : R^H → R^K defined by g(h) = SoftMax(W_y h + b_y).

Now, if we want our model to take any sequence of length less than 32 and predict the next word, the data should look like this: x_1 predicts w_2, x_1 x_2 predicts w_3, and so on. The good news is that we don't need to do more work for this: to compute h_32, we have already computed h_1, h_2, ..., so what we should do is apply g to each of these hidden vectors. Now the input and output look like w_1 w_2 ... w_32 and w_2 ... w_33, respectively, and the loss function is

(1/32) ∑_{k=2}^{33} H(e_{w_k}, g(h_{k−1}))

Another thing to note is that we probably want to train on a batch of data at a time; this means we should sample the text at several places at once, and at each place take 33 characters.

Next we want to talk about continuous embedding. Earlier, we mentioned that we should do one-hot encoding and then map it to a low-dimensional space. This is unnecessary in PyTorch. Let me explain. Say we have the text T = "hello"; then the dictionary for this text is {e, h, l, o}. We can label e as 0, h as 1, l as 2 and o as 3, and the text is translated to 10223. Note that the vocabulary size is 4. The object nn.Embedding(4,3) takes input of shape (batch_size, sequence_length) with each entry an integer and produces output of shape (batch_size, sequence_length, 3 = embedding_dimension). So basically, as soon as we encode a text as a sequence of integers, we can feed it to nn.Embedding(4,3) to get a continuous embedding. Note that 4 here is the size of the vocabulary.
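A quick illustration of the "hello" example (my own sketch): with the vocabulary {e, h, l, o} labeled 0 to 3, the text becomes the index sequence [1, 0, 2, 2, 3].

import torch
import torch.nn as nn

emb = nn.Embedding(4, 3)                  # vocabulary size 4, embedding dimension 3
idx = torch.tensor([[1, 0, 2, 2, 3]])     # shape (batch_size=1, sequence_length=5)
vectors = emb(idx)
print(vectors.shape)                      # torch.Size([1, 5, 3])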

In [87… # importing the necessary packages
import torch
import torch.nn as nn                                    # neural network layers
import torch.optim as optim                              # optimizers
import torch.nn.functional as F                          # functional operations such as softmax and cross-entropy
import numpy

#hyperparameters
seq_length=30
batch_size=32
device='cuda' if torch.cuda.is_available() else 'cpu'    # tells the computation where to run (GPU if available)
n_embd=128                                               # embedding dimension
hidden_size=128
num_layer=2                                              # one can stack RNN layers on top of each other
learning_rate=0.001
max_iters=10000                                          # number of training iterations
eval_interval=1000                                       # how often we evaluate the loss
eval_iters=1000

The following text is from a news article about Grothendieck, "'He was in mystic delirium': was this hermit mathematician a forgotten genius whose ideas could transform AI – or a lonely madman?", published by The Guardian.

In [90… text="One day in September 2014, in a hamlet in the French Pyrenean foothills, J

In [92… vocabs=sorted(set(text))
char_to_idx={char:idx for idx,char in enumerate(vocabs)}   # translate from character to index
idx_to_char={idx:char for idx,char in enumerate(vocabs)}   # translate from index to character
encode= lambda s: [char_to_idx[char] for char in s]        # a function that encodes a string as a list of indices
decode= lambda l: ''.join(idx_to_char[idx] for idx in l)   # converts a sequence of indices back to a string
data=torch.tensor(encode(text),dtype=torch.long)           # use encode to turn the whole text into a tensor of indices
n=int(0.8*len(text))                                       # make a cut to get 80% of the text
train_data=data[:n]                                        # this part of the data is used for training
val_data=data[n:]                                          # this part of the data is used for validation
vocab_size=len(vocabs)

# We create a function that fetches a batch of data for stochastic gradient descent
def get_batch(split):
    data=train_data if split == 'train' else val_data      # two splits: train and validation
    idx=torch.randint(len(data)-seq_length, (batch_size,)) # an array of random starting positions
    x=torch.stack([data[i:i+seq_length] for i in idx])     # get a batch of input sequences
    y=torch.stack([data[i+1:i+seq_length+1] for i in idx]) # get a batch of target sequences (shifted by one)
    x,y=x.to(device),y.to(device)                          # move x and y to the device
    return x,y

In [94… @torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()                          # set model to evaluation mode
    for split in ['train', 'val']:        # there are 2 types of loss: train and validation
        losses = torch.zeros(eval_iters)  # create a tensor to hold the losses
        for k in range(eval_iters):       # we compute the loss eval_iters times
            X, Y = get_batch(split)       # get a batch of data
            logits, loss = model(X, Y)    # forward pass
            losses[k] = loss.item()       # store the loss for each evaluation
        out[split] = losses.mean().item() # compute the mean loss for this split
    model.train()                         # set model back to training mode
    return out

In [96… class RNNLanguageModel(nn.Module):

    def __init__(self):                                      # the attribute part of the model
        super().__init__()                                    # initialize the base class nn.Module
        self.embd_table = nn.Embedding(vocab_size, n_embd)    # continuous embedding table
        self.lstm = nn.LSTM(n_embd, hidden_size, num_layers=num_layer, batch_first=True)
        self.proj = nn.Linear(hidden_size, vocab_size)        # the map g projecting hidden states to vocabulary logits

    def forward(self, idx, targets=None):                     # during inference, targets is None
        B, T = idx.shape                                      # get_batch produces input of shape (batch, sequence length)
        tok_emb = self.embd_table(idx)                        # turn indices into continuous embeddings
        output, _ = self.lstm(tok_emb)                        # the LSTM processes the embedded sequence
        logits = self.proj(output)                            # project each h_t to vocabulary logits

        if targets is None:                                   # during inference there is no loss
            loss = None
        else:                                                 # during training, compute the loss
            B, T, C = logits.shape
            logits = logits.view(B * T, C)                    # there are B*T predictions
            targets = targets.view(B * T)                     # and B*T ground-truth targets
            loss = F.cross_entropy(logits, targets)

        return logits,loss

    def generate(self, idx, generate_length, seq_length):     # idx is the initial context
        hidden = None                                         # the first step starts with a fresh hidden state
        for _ in range(generate_length):
            idx_cond = idx[:, -seq_length:]                   # use at most seq_length characters of context
            tok_emb = self.embd_table(idx_cond)
            output, hidden = self.lstm(tok_emb, hidden)
            logits = self.proj(output[:, -1, :])              # select the logits of the last position
            probs = F.softmax(logits, dim=-1)                 # apply softmax to get probabilities
            idx_next = torch.multinomial(probs, num_samples=1) # sample the next index from the distribution
            idx = torch.cat((idx, idx_next), dim=1)           # append the predicted index to the sequence
        generated_text=decode(idx[0].tolist())
        return generated_text
model=RNNLanguageModel()                                      # this instantiates the model
total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total trainable parameters: {total_params}")
print(f"The text length is {len(text)}")
print(f"The ratio of parameters to text is {total_params/len(text):.4f}")

Total trainable parameters: 280640


The text length is 15069
The ratio of parameters to text is 18.6237

In [98… # create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
for iter in range(max_iters):
    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")
    # sample a batch of data
    xb, yb = get_batch('train')
    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

step 0: train loss 4.1654, val loss 4.1654


step 1000: train loss 1.4833, val loss 2.0562
step 2000: train loss 0.8643, val loss 2.4494
step 3000: train loss 0.5319, val loss 2.8772
step 4000: train loss 0.3966, val loss 3.2577
step 5000: train loss 0.3432, val loss 3.5259
step 6000: train loss 0.3147, val loss 3.6953
step 7000: train loss 0.2999, val loss 3.8593
step 8000: train loss 0.2877, val loss 3.9589
step 9000: train loss 0.2799, val loss 4.0387
step 9999: train loss 0.2718, val loss 4.1391

In [10… # The training loss becomes very small: the model essentially memorizes the data
# generate from the model
context = torch.ones((1, 3), dtype=torch.long, device=device)
#print(decode(model.generate(context,500,seq_length)[0].tolist()))
print(model.generate(context,500,seq_length))

'''s. But inwered out; he politely refused to receive most of them. When he
did exchange words, he sometimes mention the dooble making it polstebook lo
gging the names of the topos could be key to building the next generation o
f AI, and has harm to move in his ligens and your future than you will ever
know. But these wild preoccupations took him to dark places he told one vis
itor that there were entities inside his house that might harm him. Grothen
diecks genius defied his attempts at erasing his ow

6.1 The n-Gram Language Model

We've seen that the RNN model above attempts to approximate P(w_{L+1} | w_1, ..., w_L). Another way to get this probability distribution is to actually count it, since this is a finite system. This is known as the n-gram model. Allow me to explain the 3-gram model, and then you'll understand the n-gram model. Let's say we want to predict the next word based on the 3 words before it, in a way that imitates a text T, i.e., T is the training data. First, we can ask what the possible combinations of 3 words are. If the text T is based on 1000 words, then there are 1000^3 possible 3-word sequences. Next, for each combination w_1 w_2 w_3, we count its occurrences in the text, count which words come after it and with what frequency, and from here we can compute the exact empirical probability.

Our goal here is not to study the n-gram model itself, but to use it to help us better understand what the RNN and other language models do. A small sketch of the counting procedure is given below.
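Here is a minimal sketch of the 3-gram counting just described (my own toy example, not from the note):

from collections import Counter, defaultdict

# For every 3-word context in a toy text T, count which word follows and
# turn the counts into an empirical next-word distribution.
T = "the cat likes fish and the cat likes milk and the dog likes fish".split()

counts = defaultdict(Counter)
for i in range(len(T) - 3):
    context = tuple(T[i:i + 3])        # the 3 preceding words
    counts[context][T[i + 3]] += 1     # the word that follows them

def next_word_distribution(w1, w2, w3):
    c = counts[(w1, w2, w3)]
    total = sum(c.values())
    return {w: n / total for w, n in c.items()} if total else {}

print(next_word_distribution("the", "cat", "likes"))   # {'fish': 0.5, 'milk': 0.5}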

Now, what are some of the problems encountered by the n-gram model? One main issue is that the n-gram model gives a uniform distribution for any sequence of words that doesn't appear in the text. Consider creating a 2-gram model based on a text in which the sequence "cat loves" doesn't exist; when we ask the model to predict what word comes next after this sequence, it will predict that anything can happen with equal probability. We don't want something like that. The RNN model takes as input a sequence of vectors, and it can process any sequence, so it doesn't face the problem that the n-gram model has. One way one hopes the RNN gives good predictions is as follows: during embedding, as the model looks through the text, it learns that the embeddings of the words "like" and "love" are close, and it has seen the sequence "cat likes" in the text; this allows it to give a non-uniform distribution for P(· | cat loves) based on what it has seen for "cat likes". One question one can ask is: does the RNN do more than a modification of the n-gram model? By a modified n-gram model here we mean something like this: the model first regroups all possible 3-word sequences based on similarity of meaning and then runs the 3-gram model on those groups.

The hope here is that the embedding learned during the training of the RNN embeds each word in R^d in such a way that words with similar meanings end up nearby.

7. Transformer Model

The transformer was first introduced by Vaswani et al. in 2017 in the widely cited paper "Attention Is All You Need", where the authors tested it as a neural-network-based machine translation model. One of the key ingredients in the transformer model is the self-attention mechanism. Andrej Karpathy thinks of the attention mechanism as a way for words in a sentence to communicate with each other. The following presentation is how I like to think about transformers.

7.1 Decoder-only transformer

In the RNN-based language model above, the RNN takes a sequence of words w_1 w_2 ... w_N, turns it into a sequence of vectors x_1 x_2 ... x_N via continuous word embedding, and condenses it to get the sequence of hidden vectors h_1, h_2, ..., h_N; we use h_N to predict the next word. We can think of h_i as containing richer and richer information about the sequence x_1 ... x_i as i increases.

The way the RNN condenses the sequence x_1 x_2 ... x_N is highly non-linear and highly non-commutative, which makes it hard to approximate its partial derivatives, leading to ineffective learning. The transformer model is somewhat less non-commutative and looks more linear, and in practice it turns out to be a lot easier to train.

Now say we have a sequence x_1, x_2, ..., x_N. The vector

v = x_1 + x_2 + ... + x_N

could contain the information of the sequence x_1 x_2 ... x_N, but we probably shouldn't hope that this information is sufficient to make a next-word prediction. One way to upgrade this and still keep things "linear" is to take

v = a_1 x_1 + a_2 x_2 + ... + a_N x_N

where a_1, a_2, ..., a_N are some real numbers. An appropriate choice of a_1, ..., a_N would give us a representation of the sentence x_1 ... x_N. The transformer model suggests something somewhat different: it suggests we should upgrade the embedding of each word x_i in the sentence to incorporate the context it is in. This context-enrichment process takes the embedding vector x_i ∈ R^d to a context-aware vector y_i = a_{i1} x_1 + a_{i2} x_2 + ... + a_{iN} x_N of the same dimension. The coefficient a_{ij} should depend on both x_i and x_j. The idea is that a_{ij} tells us how much x_i attends to x_j, hence the name attention mechanism. Furthermore, instead of combining the vectors x_j themselves, the model uses another linear transform of x_j, called the value vector and denoted v_j. Here is a rough summary of how the model works.

1. Convert x_j to q_j, k_j and v_j, called the query, key and value of x_j, respectively.
2. Compute the dot products e_{ij} = (q_i, k_j), called the scores.
3. Turn the score vector E^i = (e_{i1}, ..., e_{iN}) into a vector of convex-combination coefficients via softmax, i.e., A^i = SoftMax(E^i), where A^i is called the attention.
4. Compute the convex combination y_i = a_{i1} v_1 + ... + a_{iN} v_N, where A^i = (a_{i1}, ..., a_{iN}).

We have now transformed x_1, x_2, ..., x_N into y_1, y_2, ..., y_N. We want to think of y_i as an enriched version of x_i, based on the context sequence x_1 ... x_N.

Let me explain what I think the motivation could be; people in the ML community have several different ways of thinking about this, and here I present what I think is happening.

First, how do we convert x_i to key, query and value vectors? This is done by multiplying x_i by matrices W_K, W_Q, and W_V, respectively, and these matrices are learned during training. This allows us to give different representations of x_i for different usages. Computing the dot product to get e_{ij} allows us to measure a sense of the interaction/relation between x_i and x_j, and this computation isn't the direct dot product of these two vectors but rather of their other representations. Obviously, the computation of y_i is highly non-linear, since it involves the SoftMax function. However, because we only use it for the linear coefficients, this makes y look like a "linear combination" of x. Next, we note that y_i is independent of the order of the x_j; this is rather different from the case of the RNN.

The process we've seen so far is called self-attention (well, almost). In this note, I refer to it simply as attention. Next we write down the equations for the attention mechanism in a more efficient matrix form and add correction terms. First we form the matrix X = [x_1, x_2, ..., x_N]; the equations for the keys, queries, values, scores and attention are as follows:

1. K = W_K X, Q = W_Q X, V = W_V X
2. Score matrix: E = K^T Q / √d
3. Attention = SoftMax(K^T Q / √d)
4. Attn(X) = SoftMax(K^T Q / √d) V = Attention × V
The division by √d, where d is the dimension of the embedding, controls the variance of the scores, avoiding the issue of the softmax either just averaging the v_i or just selecting one vector. As we discussed earlier, we tend to think of the computation of the attention as a linear process, so to introduce some non-linearity, we pass the attended vectors through a feed-forward neural network.

FFN(Y) = W_2 ReLU(W_1 Y + b_1) + b_2

where W_1 ∈ L(R^d, R^h), W_2 ∈ L(R^h, R^d), and b_1, b_2 are vectors of the appropriate dimensions. This feed-forward network processes each word of Y separately, so at this stage there is no interaction between words.
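To make this concrete, here is a minimal single-head self-attention sketch followed by the feed-forward step (my own illustration; note that here the N tokens are stored as rows of X, the usual PyTorch convention, rather than columns, so the score matrix is Q Kᵀ / √d):

import torch
import torch.nn as nn
import torch.nn.functional as F

d, N = 16, 6
WQ = nn.Linear(d, d, bias=False)
WK = nn.Linear(d, d, bias=False)
WV = nn.Linear(d, d, bias=False)
ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))

X = torch.randn(N, d)                  # a sequence of N embedding vectors
Q, K, V = WQ(X), WK(X), WV(X)          # queries, keys, values
E = Q @ K.T / d ** 0.5                 # N x N score matrix
A = F.softmax(E, dim=-1)               # each row of A sums to 1 (the attention)
Y = A @ V                              # y_i = sum_j a_ij v_j
out = ffn(Y)                           # position-wise feed-forward network
print(A.sum(dim=-1), out.shape)        # rows of A sum to 1; out has shape N x d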
The composition of attention and feed-forward is called an attention block. In general, we compose several attention blocks, so we index the output of each block as X^i, with X^0 being the input:

X^{i+1} = FFN(Attn(X^i))
X^0 = X

To make a next-word prediction, we can use the last vector in the last layer X^L; let's call it x^L_N. We can do this with a linear classifier, i.e., use a linear projection to get the logits of the prediction, logits = W x^L_N, and P(· | x_1 ... x_N) = SoftMax(logits) is the probability distribution, where W ∈ L(R^d, R^K) and K is the vocabulary size.

The model described above is called a decoder-only transformer. We recommend the "Attention Is All You Need" paper for those who want to understand the encoder-decoder transformer.

Multi-Head Attention

People like to think that the continuous word vector of a word contains most of the information of that word. Now, a word has several aspects. The transformer helps us enrich the information of the word vector x_i ∈ R^d in the context of the sentence x_1 ... x_N. It might be too much to ask this enrichment to do well when x_i contains so much information. So maybe we should project x_i down to a smaller subspace R^{d_h}, which we hope selects a certain aspect of linguistic information, and apply attention in that subspace to enrich the contextual information of that aspect. This is the justification for an attention head. We should also have many of these attention heads working in parallel, and at the end we should assemble them back together to get the full information.

Mathematically, we are seeking for Wki ∈ L(Rd , Rdh ) for 1 ≤ k ≤ h


, ie we make h attention heads. Apply attention to these subspace to
get Hki = Attn(Wki (X)). As a result we produce is H1i , . . . , Hhi .
Next we concatenate these attention heads vertically to get H,

⎡ ⎤
H1i
⎢ H2 ⎥
H=⎢ ⎥
i

⎢ ⎥
⎢ ⋮ ⎥
⎢ ⎥
⎣ Hi ⎦
h

and to bring the resulting vectors back to the embedding dimension, we
multiply by W_O ∈ L(R^{h·d_h}, R^d). The result, W_O H, is known as
multi-head attention. Since multi-head attention is the common practice,
we will simply refer to it as attention and denote it by Attn(X).
Multi-head attention can be seen as allowing different heads to talk to
one another, since W_O H is equivalent to

Σ_{k=1}^{h} W_O^k Attn(W_k^i X^i),

where W_O^k is the k-th block of columns of W_O.
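
As a quick numerical check of this identity, here is a small sketch in which
the head outputs are random stand-ins for Attn(W_k X) and all dimensions are
arbitrary.

import torch

# Check that W_O H equals the sum of per-head projections W_O^k H_k, where
# W_O^k is the k-th block of columns of W_O. The head outputs H_k are random
# stand-ins for Attn(W_k X); all sizes are arbitrary.
d, h, d_h, N = 8, 4, 2, 5
H_heads = [torch.randn(d_h, N) for _ in range(h)]
W_O = torch.randn(d, h * d_h)

H = torch.cat(H_heads, dim=0)                    # stack the heads vertically
full = W_O @ H                                   # multi-head attention output

per_head = sum(W_O[:, k*d_h:(k+1)*d_h] @ H_heads[k] for k in range(h))
assert torch.allclose(full, per_head, atol=1e-5)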

To aid training, the dynamical equation of the attention block is
modified further:

X^{i+1} = LayNorm(FFN(LayNorm(Attn(X^i)) + X^i)) + X^i,    X^0 = X + E

Here E is a fixed (or possibly learnable) matrix, added to make the
model non-commutative, i.e. not permutation equivariant. The LayNorm
refers to normalizing each word vector (each column of the matrix) to
have mean 0 and unit variance, and then rescaling by γ, which is a
learnable parameter. The addition of X^i back into the output is called
a residual connection; people found success doing this in image
processing, and it has been used in many other parts of ML since. Some
people don't think of the residual connection as adding the input back.
Instead, they think of the process of attention as information
enrichment: the input X flows in a stream, and as it passes an attention
block, the block processes part of X, finds something useful, and adds
it back to the stream.
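
As a small illustration, here is a check of what LayerNorm computes, against
PyTorch's nn.LayerNorm (whose learnable gain and bias start at 1 and 0).

import torch
import torch.nn as nn

# What LayerNorm computes: each d-dimensional word vector is normalized to
# mean 0 and unit variance, then scaled by a learnable gain (initialized to
# ones) and shifted by a learnable bias (initialized to zeros).
d = 8
x = torch.randn(3, d)                    # three word vectors, one per row
ln = nn.LayerNorm(d)

mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
manual = (x - mean) / torch.sqrt(var + ln.eps) * ln.weight + ln.bias
assert torch.allclose(ln(x), manual, atol=1e-5)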

There is another version of the transformer that has both an encoder and
a decoder part. However, we will not talk about it here, and we refer
the reader to the original paper, "Attention Is All You Need".

7.2 Training a Transformer: Masked Attention


When we trained the RNN, our input was of the form x_1 x_2 ... x_N and
the output was x_2 x_3 ... x_{N+1}. The RNN processes x_1 x_2 ... x_N
and produces the context representations h_1 h_2 ... h_N, and then we
use a linear classifier with h_i as input to predict x_{i+1}.

Let's say we have a transformer of L layers, processing an input
sequence X = X^0 = [x^0_1 ... x^0_N] and producing the output
X^L = [x^L_1 ... x^L_N], where x^L_i is an enriched version of x^0_i
based on the context sentence [x^0_1 ... x^0_N]. We want to use x^L_i,
playing the role of h_i, to predict x_{i+1} as we did in the case of the
RNN. However, in the making of x^L_i, we use the whole sentence X, so it
is a cheat to use x^L_i to predict x_{i+1} when i < N. To avoid this
situation, masked attention is introduced. Recall that, for a fixed i,
the attended vector is y_i = α_1 v_1 + α_2 v_2 + ... + α_N v_N. To
prevent y_i from depending on x_j with j > i, we can just require
α_j = 0 if j > i. This can be achieved by adding a triangular matrix T
to the score E, where T_ij = −∞ if j > i and T_ij = 0 if j ≤ i. Another
way to do this is to set the elements above the diagonal of the
attention score E to −∞; this is the method used in the code given
below.

When we train the model, we use x^L_i to predict x_{i+1}; this forces
x^L_i to be a representation of the sequence x_1 ... x_i rather than
just a contextually enriched version of the embedding x_i.
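
Here is a tiny illustration of the masking trick on a 4×4 score matrix, using
the same row-as-query convention as the code below.

import torch
import torch.nn.functional as F

# Masked attention scores. Following the convention of the code below
# (row i = query i), entries above the diagonal correspond to future positions
# j > i; setting them to -inf gives them softmax weight exactly 0.
T = 4
E = torch.randn(T, T)                             # raw attention scores
mask = torch.tril(torch.ones(T, T)) == 0          # True strictly above the diagonal
A = F.softmax(E.masked_fill(mask, float('-inf')), dim=-1)
print(A)   # row i has zeros in columns j > i and still sums to 1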

The following code is copied from Andrej Karpathy's GPT2 code.


You can visit his GitHub repository here. The strategy of the code is
as follows:

1. Create attention head object


2. Assemble attention heads to create Multihead Attention object
3. Create Feed Forward Layer object
4. Create an Attention Block object by composing Multihead
Attention with a Feed Forward Layer
5. Create a GPT2 object by composing multiple attention blocks
and followed by a linear classifier

Note that the GPT2 object has three parts. The first two are expected:
the part that declares the various components of the transformer, and
the forward part which assembles these components. The last part is the
generator.

In [10… import torch


import torch.nn as nn
from torch.nn import functional as F
# hyperparameters
batch_size = 32 # number of independent sequences processed in parallel
block_size = 30 # what is the maximum context length for predictions?
max_iters = 10000
eval_interval = 1000
learning_rate = 1e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 1000
n_embd = 124
n_head = 4
n_layer = 4
dropout = 0.0
# ------------

The data preparation and loss function parts are the same as we had for
the RNN above, so we won't repeat them here. Also, we use the same text
used to train the RNN.

In [11… class Head(nn.Module):


""" one head of self-attention """

def __init__(self, head_size):


super().__init__()
self.key = nn.Linear(n_embd, head_size, bias=False)
self.query = nn.Linear(n_embd, head_size, bias=False)
self.value = nn.Linear(n_embd, head_size, bias=False)
self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
self.dropout = nn.Dropout(dropout)

def forward(self, x):


B,T,C = x.shape
k = self.key(x) # (B,T,C)
q = self.query(x) # (B,T,C)
# compute attention scores
wei = q @ k.transpose(-2,-1) * C**-0.5 # (B,T,C) @ (B,C,T) -> (B,T,T)
wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # Convert the scores above the diagonal to -inf
wei = F.softmax(wei, dim=-1) # (B, T, T)
wei = self.dropout(wei)
v = self.value(x) # (B,T,C)
out = wei @ v # Multiply weight with value
return out

class MultiHeadAttention(nn.Module):

def __init__(self, num_heads, head_size):


super().__init__()
self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
self.proj = nn.Linear(n_embd, n_embd)
self.dropout = nn.Dropout(dropout)

def forward(self, x):


out = torch.cat([h(x) for h in self.heads], dim=-1) # compute attention in each head and concatenate
out = self.dropout(self.proj(out))
return out

class FeedFoward(nn.Module):

def __init__(self, n_embd):


super().__init__()
self.net = nn.Sequential(
nn.Linear(n_embd, 4 * n_embd),
nn.ReLU(),
nn.Linear(4 * n_embd, n_embd),
nn.Dropout(dropout),
)

def forward(self, x):


return self.net(x)

#Now assemble various component to get attention block


class Block(nn.Module):
def __init__(self, n_embd, n_head):
super().__init__()
head_size = n_embd // n_head
self.sa = MultiHeadAttention(n_head, head_size)
self.ffwd = FeedFoward(n_embd)
self.ln1 = nn.LayerNorm(n_embd)
self.ln2 = nn.LayerNorm(n_embd)

def forward(self, x):


x = x + self.sa(self.ln1(x))
x = x + self.ffwd(self.ln2(x))
return x

# Create GPT2 Language Model


class GPT2(nn.Module):

def __init__(self):
super().__init__()
self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
self.position_embedding_table = nn.Embedding(block_size, n_embd)
self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
self.ln_f = nn.LayerNorm(n_embd) # final layer norm
self.lm_head = nn.Linear(n_embd, vocab_size)

def forward(self, idx, targets=None):


B, T = idx.shape

# idx and targets are both (B,T) tensor of integers


tok_emb = self.token_embedding_table(idx) # (B,T,C)
pos_emb = self.position_embedding_table(torch.arange(T, device=device))
x = tok_emb + pos_emb
x = self.blocks(x)
x = self.ln_f(x)
logits = self.lm_head(x)

if targets is None:
loss = None
else:
B, T, C = logits.shape
logits = logits.view(B*T, C)
targets = targets.view(B*T)
loss = F.cross_entropy(logits, targets)

return logits, loss

def generate(self, idx, max_new_tokens):


# idx is (B, T) array of indices in the current context
for _ in range(max_new_tokens):
# crop idx to the last block_size tokens
idx_cond = idx[:, -block_size:]
# get the predictions
logits, loss = self(idx_cond)
# focus only on the last time step
logits = logits[:, -1, :] # becomes (B, C)
# apply softmax to get probabilities
probs = F.softmax(logits, dim=-1) # (B, C)
# sample from the distribution
idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
# append sampled index to the running sequence
idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
generated_text=decode(idx[0].tolist())
return generated_text

model = GPT2()
m = model.to(device)
# print the number of parameters in the model
total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total trainable parameters: {total_params}")
print(f"The text length is {len(text)}")
print(f"The ratio of parameters to text is {total_params/len(text):.4f}")

Total trainable parameters: 762912


The text length is 15069
The ratio of parameters to text is 50.6279

In [11… # create a PyTorch optimizer


optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):

# every once in a while evaluate the loss on train and val sets
if iter % eval_interval == 0 or iter == max_iters - 1:
losses = estimate_loss()
print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses[

# sample a batch of data


xb, yb = get_batch('train')

# evaluate the loss


logits, loss = model(xb, yb)
optimizer.zero_grad(set_to_none=True)
loss.backward()
optimizer.step()

# generate from the model


context = torch.zeros((1, 1), dtype=torch.long, device=device)
print((m.generate(context, max_new_tokens=500)))

step 0: train loss 4.3377, val loss 4.3254


step 1000: train loss 0.5441, val loss 3.1649
step 2000: train loss 0.3394, val loss 3.9940
step 3000: train loss 0.3038, val loss 4.2778
step 4000: train loss 0.2909, val loss 4.5257
step 5000: train loss 0.2833, val loss 4.5417
step 6000: train loss 0.2776, val loss 4.7480
step 7000: train loss 0.2698, val loss 4.7692
step 8000: train loss 0.2625, val loss 4.8587
step 9000: train loss 0.2640, val loss 5.0643
step 9999: train loss 0.2592, val loss 4.9776
the broader conceptual structure around them to make them surrender their
solutions all at once. He compared the two approaches to using a hammer to
crack a walnut versus soaking it patiently in water until it opens naturall
y. He was above the road beyond are the snow-covered Pyrenees a promise of
a higher reality. Matthieu answers the door weare for war in a purensugglin
g to framed scroll of Chinese script stands on a sideboard, next trees le v
illage of Mormoiron. In the subsequent years, Groth

7.3 Making a ChatBot: Fine Tuning


The training we showed above is known as pre-training. What it means is
that we adjust the weights of the transformer to imitate our language.
At this stage, the transformer is not yet a chatbot, i.e., it is not
designed to give answers to the questions we ask. Instead, it is
learning general language patterns. To turn the transformer into a
chatbot, another round of training, called fine-tuning, is needed.

The idea of fine-tuning is this. Let’s say we have a pre-trained


transformer T that was trained on a large text corpus and is able to
produce coherent sentences. However, while it knows how to
generate meaningful language, it doesn’t yet know how to respond to
specific types of questions or follow conversational patterns. Now we
want it to answer questions in a certain way. To achieve this, we
collect a set of question-answer pairs

F = {(q1 , a1 ), … , (qN , aN )}.

The answers ai might reflect a specific tone, style, or format we


prefer (e.g., polite responses or concise technical explanations). The
fine-tuning process involves training the transformer on this dataset
F , adjusting its weights to mimic the behavior reflected in the
answers ai .
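
As a rough illustration only, a minimal fine-tuning loop on such pairs could
look like the following sketch. It reuses the GPT2 model, the encode function
and the hyperparameters from the previous section; the question-answer pairs
and the "question/answer" template are made-up examples (we assume their
characters appear in the model's character vocabulary), the loss is restricted
to the answer tokens via cross_entropy's ignore_index, and a real setup would
batch the data and use a proper instruction format.

import torch
import torch.nn.functional as F

# A minimal fine-tuning sketch, not a full recipe. It assumes the GPT2 `model`,
# the character-level `encode` function, and block_size/device from above;
# the pairs and the template are made-up examples.
pairs = [("who is the author", "grothendieck"),
         ("where does he live", "mormoiron")]

def make_example(q, a):
    prompt = encode("question. " + q + " answer. ")
    answer = encode(a)
    ids = torch.tensor(prompt + answer, dtype=torch.long)
    x, y = ids[:-1].clone(), ids[1:].clone()
    y[: len(prompt) - 1] = -100                 # only answer tokens carry loss
    x, y = x[-block_size:], y[-block_size:]     # respect the context length
    return x.unsqueeze(0).to(device), y.unsqueeze(0).to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
for step in range(100):
    for q, a in pairs:
        x, y = make_example(q, a)
        logits, _ = model(x)                    # (1, T, vocab_size)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1),
                               ignore_index=-100)
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()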

In addition to mimicking the answers, fine-tuning helps the


transformer learn the structure of the interaction between questions
and answers, refining its ability to respond in a coherent and context-
aware manner. The hope is that it can generalize beyond the training
examples qi to answer new questions q that it hasn’t seen before.
Fine-tuning can also involve techniques like reinforcement learning
with feedback (e.g., using human preferences to further align its
behavior), which helps improve its performance in more complex
scenarios.

This process ensures the transformer goes from generating general


language to behaving like a chatbot, giving responses that align with
the desired conversational behavior.
7.4 Universal Approximation Theorems for
Transformers
Naturally, one would want to prove that transformers can approximate any
function, and in fact such theorems exist. Basically, they prove that a
certain class of transformers can approximate any continuous
sequence-to-sequence function. There are also people who prove that
certain classes of transformers cannot approximate such functions. These
theorems are interesting in their own right, and we refer the interested
reader to them. We recommend two papers: 1) Are Transformers universal
approximators of sequence-to-sequence functions? by C. Yun et al., and
2) Your Transformer May Not Be as Powerful as You Expect by Luo et al.

8. My thoughts about language models and
some open questions
There is a lot of discussion about whether or not a language model
understands the language it is processing. A quick objection is that we
need to clarify what we mean by "understand", and this leads to a
philosophical rabbit hole. I will just ignore that and state my view,
which is that there isn't anything deep about current language models; I
think the success of LLMs is due to the fact that our language is not as
complicated as we thought! I believe there are some fundamental
quantities about language waiting to be discovered. I'll explain all of
this below.

8.1 The question of understanding


Imagine we have a set P = {p_1, p_2, …, p_N} of distinct points in the
plane. We cluster them into P = ⋃_{i=1}^{k} P_i, with k < N; the
clusters might be thought of as different parts of speech. Equip the set
{P_1, P_2, …, P_k} with a partial order R. Now, we define a finite
sentence p_{s_1} … p_{s_l} of points in P such that consecutive points
belong to sets P_i that respect the order relation R. This puts some
structure on the sentences one can produce.

Let's say we create many of these sentences and glue them together to
form a large text, denoted by T. Now we want to train a transformer to
create sentences in the manner we defined earlier. One would probably
guess that if N and k are very large, and the structure of R is
complicated, then it might be hard for the transformer to produce
correct sentences. However, if N is not too large, and R doesn't have
too many sub-relations, then sure enough the transformer will remember
T.
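
As a toy illustration (and nothing more), here is one way such a text T could
be generated, where a simple successor relation on the clusters stands in for
the partial order R; all the specific choices below are arbitrary.

import random

# A toy instance of the construction above. The points are labelled 0..N-1 and
# clustered into k parts; as a stand-in for the partial order R, a sentence is
# only allowed to step from cluster P_i to cluster P_{(i+1) mod k}.
random.seed(0)
N, k = 12, 3
clusters = [[p for p in range(N) if p % k == i] for i in range(k)]

def sentence(length=6):
    c = random.randrange(k)                 # start in a random cluster
    words = []
    for _ in range(length):
        words.append(random.choice(clusters[c]))
        c = (c + 1) % k                     # consecutive points respect the relation
    return words

T = [sentence() for _ in range(5)]          # glue many sentences into a text T
print(T)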

If a transformer can remember T, would we say that it understands the
language we described above? This question doesn't make sense, since
that language is not anything to be understood to start with; it is just
tracing through a collection of points in the plane. So essentially,
what the transformer does is remember the structure it sees, and it does
this by mimicking what it sees during gradient descent. So a better name
would be machine imitation.

8.2 The question of complexity: Why does it work?


As speculated in the previous section, if R is not too complex and N is
not too large, then it is reasonable to expect that we can train a
transformer to remember the language generated via R. The question
is how to make all this precise and how to quantify this complexity. It
would be interesting if such a quantity exists—one that can serve as a
necessary and sufficient condition to prove that a language is
parameterizable by a transformer. (You can read more about this topic
in an abstract nonsense note I made called Complexity of
Parametrisable Language ) . If such a measure is found, it might open
up possibilities for exploring other architectures beyond transformers.
Also, I am not saying that our language is a form of ordered relation;
the ordered relation R is just used as an illustration.
The human mind is rather simple; that is what I've observed about my
mind, the only mind I can observe. My thoughts are rather shallow and
small in scope. What is spoken by such a mind must be repetitive, and
therefore a sense of structure, which is grammar, emerges.

My belief is that during training, a language model does two things: (1)
searches for good representations of words that reduce the
complexity of the language, and (2) approximates the relationships
between words, similar to an n-gram model. These two steps happen
in parallel, as this is how gradient descent works.

Recall that the attention equation is yi = ∑ αj vj . During training, the


model adjusts both the embedding of xj and the coefficient αj . I
believe that this adjustment reorganizes and regroups words in a way
that makes it easier for the model to cluster sentences, thereby
enabling it to perform a form of n-gram modeling. However, the model
can only achieve this if the language we provide is simple enough.

Sometimes we see a language model generate good reasoning. People


have been talking about emergent phenomena in language models—
saying that when the model is large enough, it suddenly can do certain
things. My belief is that larger models can approximate the n-gram
model better. As the probabilities become closer and closer to the true
probability distribution, some forms of reasoning seen in the training
text emerge, along with "facts" from the training. The reason I use
quotation marks is that these are solely based on the training text.
Imagine a prolific author writing fiction equivalent in size to the entire
text corpus used to train large language models, and we train the
model on this story. The model would then give us answers solely
based on the facts within that book, while remaining completely
oblivious to facts from the real world.

8.3 The Question of Reasoning: What else can it do?


Can we design a language model that can perform mathematical
reasoning? I think the answer depends on how complicated the
reasoning is. If we can define reasoning as a form of language, and
assume we can also define language complexity, this would allow us to
determine whether reasoning can be modeled.

As mentioned earlier, I believe the fact that language models work well
suggests that our language is not as complex as we once thought.
Now, let's consider the question of persuasion. Suppose we want to
design a machine that can persuade people on social media. How
would we do this? Perhaps a good way to measure how much
someone likes an argument is by tracking how long they engage with it
and, ultimately, whether they react positively or just move past it. We
could try to fine-tune a language model to generate sentences based
on two inputs: (1) the idea we want to persuade people about, and (2)
the person we want the model to convince. At this point, instead of
minimizing loss, we would maximize a reward function, which should
combine the time spent engaging with the argument and the action
taken afterward. Would this approach succeed? I think the answer
depends on how complex the human mind is in terms of belief
formation.

The example above is gloomy, so let me offer a more positive


application of language models. Imagine we want to build an AI teacher
that aligns with student sentiment and encourages students to
continue learning. We could apply a similar framework to the
persuasion machine described earlier. In developed countries, there is
a lot of skepticism around such AI teachers, but in developing
countries, where many students lack access to education, such AI
teachers would certainly be better than nothing.

I first got interested in LLMs when I tried to teach my little sister


English. Because I don’t have much time to prepare lesson plans for
her, I ask ChatGPT to turn a particular topic into a lesson, and with
some modifications, we go through the lesson. It is pretty fun to study
things with my sister, but there are a lot of things I think she should
know, and my time is very limited. The issue my sister and I encounter
is not unique to us. The lack of good teachers is a major issue. Many
people pay lots of money sending their kids to private classes and still
don’t get good results. The same issue happens with medical
consultation/advice, agriculture, law, and much more. I personally do
not believe we can build AI that can help us cure new diseases, but I
believe we can make AI help us well in low-risk tasks, and many of
these tasks are important in developing countries—maybe in
developed countries too.
