
NOTES ON THE MATHEMATICS OF LARGE TRANSFORMER

LANGUAGE MODEL ARCHITECTURE

SPENCER BECKER-KAHN

Introduction
From a mathematical point of view, the building and training of a large transformer
language model (LLM) is the construction of a certain function, from some Euclidean space
to another, that has certain interesting properties. And it may therefore be surprising
to find that many key papers announcing significant new LLMs seem reluctant to simply
spell out the details of the function that they have constructed in plain mathematical
language or indeed even in complete pseudo-code. The latter form of this complaint is
the subject of the recent article of Phuong and Hutter [1]. Here, we focus on one aspect
of the former perspective and seek to give a relatively ‘pure’ mathematical description of
the architecture of an LLM. To do so adequately, on the one hand we seek a sufficiently
high level of accuracy, i.e. the mathematical framework that we build must faithfully
represent or be sufficiently analogous to the true nature of real-world models. On the
other hand, we seek mathematical elegance. And not simply for its own sake: One hope
with something like this is that a good mathematical framework can form a foundation
for doing certain kinds of conceptual reasoning effectively. And the elegance we seek when
building the framework - economy of expression, flexibility, getting just the right level of
generality etc. - is genuinely important for the utility of the framework.
Trainable Parameters. Like all such models in machine learning, the construction
initially describes a family of functions indexed by some set Θ = R^N called the parameter
space. There is then a separate process - the training of the model - in which a particular
value θ ∈ Θ is selected using a training algorithm. Each dimension of Θ corresponds to
the possible values of an individual trainable parameter. We will draw attention to such
parameters as we introduce them, as opposed to attempting to give a definition of Θ up
front. A complete description of the training algorithm will possibly be the subject of a
follow-up.

1. Tokenization, Encodings and Embeddings


The transformer T will take as input a matrix t ∈ R^{n×n_vocab}, each row of which is
a standard basis vector of R^{n_vocab} (and is therefore all zeroes except for one ‘1’). The
so-called ‘decoder’ part of the transformer - the ‘deep learning’ bit - acts on matrices
X ∈ R^{n×d} for some d < n_vocab. In this section we will explain how t corresponds to a
slice of text and how the mapping t ↦ X works.
1.1. Tokenization. Let C denote a corpus of real text: A large collection of different
artefacts (articles, books, blog posts, text conversations etc.) from many different places
on the internet. We will use this to define a vocabulary for the transformer. We will

think of the corpus as a single string α_1 α_2 α_3 ⋯ = (α_i)_{i∈I} that uses a special character
to separate the individual artefacts in the corpus, i.e. the special character marks where
one thing ends and the next bit of text begins. Every other character is drawn from an
alphabet A.
We will produce the vocabulary via an iterative process, starting with the set V_0
consisting of all 256 bytes, i.e. all 2^8 strings of length 8 consisting of only zeroes or
ones, together with special characters to denote the end of words and the end of individual
artefacts. The UTF-8 encoding E_{UTF-8} = E is a standard way to map elements of A to
short strings of bytes. It can be thought of as an injective map E : A → V_0 × V_0 × V_0 × V_0.
By replacing every character α in the corpus C by its image E(α) under the map E, we
produce a new corpus C_0 (or rather a ‘translation’ of the original corpus) that is written
entirely using symbols from the vocabulary V_0, i.e. we set C_0 := (E(α_i))_{i∈I}. Next, to
produce the set V_1 from V_0, we take the pair of symbols (b, b′) which occurs consecutively
the greatest number of times in C_0 and add the single new symbol bb′ to the vocabulary,
i.e. we set V_1 = V_0 ∪ {bb′}. And we replace every occurrence of (b, b′) in the corpus C_0
by the new symbol bb′, resulting in the corpus C_1. Then of course we continue iteratively,
producing V_{i+1} and C_{i+1} from V_i and C_i in the same way. Notice that the size of the
vocabulary is increased by one at each step, i.e. |V_{i+1}| = |V_i| + 1. We fix a parameter
n_vocab and terminate the process at V := V_m with |V_m| = n_vocab. We refer to V as the
vocabulary. An element of the vocabulary is called a token. The result is that:
(1†) With the vocabulary V having been fixed, we now have a canonical way of taking
any string S of real text and mapping it to a (finite) sequence of elements from
the fixed vocabulary V.
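The iterative merging procedure just described is essentially byte-level byte-pair encoding (BPE). To make it concrete, here is a minimal Python sketch of the vocabulary-construction loop; the toy corpus, the omission of the special end-of-word and end-of-artefact symbols, and the stopping size n_vocab are illustrative assumptions rather than anything fixed by these notes.

from collections import Counter

def build_vocab(corpus_symbols, n_vocab):
    # corpus_symbols: the corpus C_0 written as a list of base symbols (here: the bytes
    # of a UTF-8 encoded string); n_vocab: target vocabulary size (illustrative).
    vocab = {bytes([b]) for b in range(256)}           # V_0: all 256 single bytes
    corpus = list(corpus_symbols)
    while len(vocab) < n_vocab:
        # count how often each pair of symbols occurs consecutively in the corpus
        pairs = Counter(zip(corpus, corpus[1:]))
        if not pairs:
            break
        (b1, b2), _ = pairs.most_common(1)[0]          # most frequent pair (b, b')
        merged = b1 + b2                               # the new symbol bb'
        vocab.add(merged)                              # V_{i+1} = V_i ∪ {bb'}
        # replace every occurrence of (b, b') in the corpus by the merged symbol
        new_corpus, i = [], 0
        while i < len(corpus):
            if i + 1 < len(corpus) and corpus[i] == b1 and corpus[i + 1] == b2:
                new_corpus.append(merged)
                i += 2
            else:
                new_corpus.append(corpus[i])
                i += 1
        corpus = new_corpus
    return vocab, corpus

# toy example: treat a short UTF-8 string as the corpus C_0
text = "the cat sat on the mat. the cat sat."
symbols = [bytes([b]) for b in text.encode("utf-8")]
V, C_final = build_vocab(symbols, n_vocab=256 + 10)    # ten merge steps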
1.2. Embedding and Unembedding. Then we fix what is called a one-hot encoding
of the vocabulary, which is simply a bijection between the set V and the standard
orthonormal basis of R^{n_vocab}. Denote this by σ : V → {e_j}_{j=1}^{n_vocab}. We will sometimes also
refer to σ(V) ⊂ R^{n_vocab} as the set of one-hot tokens.
(2†) With the vocabulary and the one-hot encoding having been fixed, we now have
a canonical way of taking any string S of real text and mapping it to a (finite)
sequence (t_1(S), t_2(S), . . . , t_N(S)) of one-hot vectors in R^{n_vocab}, where N = N(S).
We will refer to this as tokenization.
The height n of the matrices that the decoder takes as input is a parameter n_ctx = n
which is called the size of the context window. So
(3†) With the vocabulary, the one-hot encoding and the parameter n_ctx = n having
been fixed, we now have a canonical way of taking any string S of real text that
satisfies N(S) = n and forming a matrix t ∈ R^{n×n_vocab}, the rows of which are the
one-hot tokens t_1(S), t_2(S), . . . , t_n(S).
Next, we embed the one-hot encoding in a smaller vector space. Specifically, we choose a
parameter d < n_vocab and a (d × n_vocab) projection matrix

    W_E : R^{n_vocab} → R^d

called the token embedding. The entries of the matrix W_E are trainable parameters (and
after training an embedding like this is referred to as a learned embedding). We refer
to W_E(σ(V)) as the set of embedded tokens. Notice that the embedded tokens are the
columns of W_E.

Given a matrix t of one-hot tokens, as in (3†), the first thing that the transformer does
is act on each row of t by the embedding matrix, i.e.

(1.1)    t ↦ (Id ⊗ W_E) t = t W_E^T = X ∈ R^{n×d},        (Embedding)

where X is the (n × d) matrix whose rows are the embedded vectors x_1, . . . , x_n ∈ R^d.

Let us also define now the (n_vocab × d) matrix

    W_U : R^d → R^{n_vocab}

called the unembedding, the entries of which are also trainable parameters. This acts on
the rows of a matrix X ∈ R^{n×d}, i.e.

(1.2)    X ↦ (Id ⊗ W_U) X = X W_U^T ∈ R^{n×n_vocab},        (Unembedding)

where the rows of the resulting (n × n_vocab) matrix are denoted τ_1, . . . , τ_n ∈ R^{n_vocab}.

Sometimes one insists that W_U = W_E^T, a constraint that we refer to as the embedding
and unembedding being ‘tied’.
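To make (1.1) and (1.2) concrete, here is a minimal NumPy sketch that builds the one-hot matrix t from a sequence of token indices and applies a randomly initialised (i.e. untrained) embedding and unembedding; the sizes n_vocab, d and n are small illustrative values, not anything prescribed by these notes.

import numpy as np

rng = np.random.default_rng(0)
n_vocab, d, n = 50, 8, 5           # illustrative sizes: |V|, embedding dimension, context window

# a sequence of n token indices, standing in for sigma applied to a tokenized string S
token_ids = np.array([3, 17, 42, 3, 9])

# t in R^{n x n_vocab}: each row is a one-hot (standard basis) vector
t = np.zeros((n, n_vocab))
t[np.arange(n), token_ids] = 1.0

# W_E is (d x n_vocab); its columns are the embedded tokens
W_E = rng.normal(size=(d, n_vocab))
X = t @ W_E.T                      # (1.1): X in R^{n x d}, row i is the embedding of token i

# W_U is (n_vocab x d) with its own trainable entries; setting W_U = W_E.T would 'tie' them
W_U = rng.normal(size=(n_vocab, d))
logits = X @ W_U.T                 # (1.2): back to R^{n x n_vocab}

assert X.shape == (n, d) and logits.shape == (n, n_vocab)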
Remark 1.1. One also typically includes some kind of positional encoding. There are
several competing ways of doing this which we will not review here and this is one of a
few aspects of these models that we essentially ignore in our mathematical framework.
Let us just say that the simplest positional encoding is just to define another n × d matrix
P the entries of which are all trainable parameters and to simply add it on to X, i.e.
during the embedding step, the transformer maps t ↦ t W_E^T + P. We have opted to leave
out any further discussion of positional encodings.

2. Feedforward Layers
We will describe the multilayer perceptron by first describing a generalized feedforward
architecture. We will not initially need these objects in the full generality presented here,
but doing so serves to illustrate the underlying mathematical structure at the level
of the individual vertices or ‘artificial neurons’ better than would be achieved by directly
describing the multilayer perceptron in the way it is usually thought of.
2.1. Basic General Definitions. Given a directed, acyclic graph G = (V, E) with min-
imum degree at least one (i.e. with the property that every vertex belongs to at least one
edge), we write V = I ∪ H ∪ O where:
• I is the set of vertices which have no incoming edges; these are called the input
vertices.
• O is the set of vertices which have no outgoing edges; these are called the output
vertices.

• H := V \ (I ∪ O).
A feed-forward artificial neural network N is a pair (G, A) where
(1) G = (V, E) is a finite, directed acyclic graph with minimum degree at least one
called the architecture of N ; and
(2) A = {σv : R → R : v ∈ V \ (I ∪ O)} is a family of functions called the activation
functions for N .
In a feed-forward artificial neural network, the input and output vertices are labelled,
so that we may write I = {I_1, . . . , I_d}, where d = |I|, and O = {O_1, . . . , O_{d′}}, where
d′ = |O|. Given a choice of weights w_e ∈ R for e ∈ E and biases b_v ∈ R for v ∈ V \ I, the
network defines a function m : R^d → R^{d′} in the following way (for now we have omitted
the dependence of m on N, (w_e)_{e∈E}, and (b_v)_{v∈V\I} from the notation): Given x ∈ R^d,
for each input vertex I_j we define z_{I_j} : R^d → R by

    z_{I_j}(x) = x_j,

and we adopt the convention that σ_v is the identity (so that a_v = z_v) for every v ∈ I.
Then, for any v ∈ V \ I, the preactivation at v is given by

(2.1)    z_v(x) := b_v + Σ_{e=(v′,v)∈E} w_e σ_{v′}(z_{v′}(x)),

and the activation at v is:

(2.2)    a_v(x) := σ_v(z_v(x)).
Thus

(2.3)    a_v(x) = σ_v( b_v + Σ_{e=(v′,v)∈E} w_e a_{v′}(x) ).

The output of the function m is given by the preactivations at the output vertices, i.e.

    m(x) := ( z_{O_1}(x), . . . , z_{O_{d′}}(x) ).
The parameter space of the network N is a set Θ = Θ_N, each point of which represents
a choice of weights and biases, i.e. Θ is the collection of all θ = (w, b) where w =
(w_e)_{e∈E} ∈ R^E is a set of weights for the network and b = (b_v)_{v∈V\I} ∈ R^{V\I} is a set of biases.
The weights and biases are referred to as the trainable parameters of the network.
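To make the general definition concrete, the following Python sketch evaluates such a network on a toy directed acyclic graph by computing (2.1) and (2.2) vertex by vertex in topological order; the particular graph, weights, biases and activation function are illustrative assumptions only.

import math
from graphlib import TopologicalSorter    # standard library, Python 3.9+

# a toy architecture: input vertices I1, I2, one hidden vertex h1, one output vertex O1
edges = {("I1", "h1"), ("I2", "h1"), ("I1", "O1"), ("h1", "O1")}
weights = {e: 0.5 for e in edges}                     # w_e, illustrative values
biases = {"h1": 0.1, "O1": -0.2}                      # b_v for every non-input vertex
sigma = math.tanh                                     # one activation function for the hidden vertices
inputs, outputs = ["I1", "I2"], ["O1"]

def forward(x):
    # x: list of input values, one per input vertex
    preds = {v: [] for v in biases}                   # incoming neighbours of each non-input vertex
    for (u, v) in edges:
        preds[v].append(u)
    order = TopologicalSorter({v: preds.get(v, []) for v in set(biases) | set(inputs)}).static_order()
    z, a = {}, {}
    for v in order:
        if v in inputs:
            z[v] = x[inputs.index(v)]                 # z_{I_j}(x) = x_j
            a[v] = z[v]                               # convention: identity activation at inputs
        else:
            # (2.1): the preactivation is the bias plus the weighted sum of upstream activations
            z[v] = biases[v] + sum(weights[(u, v)] * a[u] for u in preds[v])
            if v not in outputs:
                a[v] = sigma(z[v])                    # (2.2); output activations are never needed
    return [z[o] for o in outputs]                    # m(x): preactivations at the output vertices

print(forward([1.0, -1.0]))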
2.2. The Fully Connected Multi-Layer Perceptron. The multilayer perceptron (MLP)
is a feed-forward artificial neural network for which:
(1) The architecture G = (V, E) is layered: This means that the vertex set can be
written as a disjoint union V = V^{(0)} ∪ · · · ∪ V^{(L)} for some L ≥ 1 called the
depth of the network (this is the ‘deep’ in deep learning), with I = V^{(0)},
O = V^{(L)} and E ⊂ ⋃_{l=0}^{L−1} V^{(l)} × V^{(l+1)};
(2) The activation functions are all equal to the same function σ.
In these notes we will be concerned only with MLPs that are fully connected, which means
that

(2.4)    E = ⋃_{l=0}^{L−1} V^{(l)} × V^{(l+1)}.
The integer n_l := |V^{(l)}| is called the width of the lth layer and we write v_i^{(l)} for the ith
vertex in the lth layer of the network. Given θ = (w, b) ∈ Θ, write b_i^{(l)} := b_{v_i^{(l)}}, so that b^{(l)}
is a vector containing all the biases at the lth layer of the network, and for l = 1, . . . , L let
W^{(l)} ∈ R^{n_l × n_{l−1}} denote the matrix whose entries are given by

    w_{ij}^{(l)} := w_{(v_i^{(l)}, v_j^{(l−1)})}.

This is called a weight matrix.


The special structure of the MLP means that it is fruitful to describe it in terms of
how it maps one layer to the next, as opposed to using a description that stays only at
the level of the individual vertices. Given x ∈ R^d, define the lth layer preactivation to be
the vector z^{(l)}(x) ∈ R^{n_l} whose components are given by

    z_i^{(l)}(x) := z_{v_i^{(l)}}(x).

And define the lth layer activation to be the vector a^{(l)}(x) ∈ R^{n_l} given by

    a_i^{(l)}(x) := a_{v_i^{(l)}}(x).
Then (2.1) implies that

(2.5)    z^{(l+1)}(x) = b^{(l+1)} + W^{(l+1)} σ( z^{(l)}(x) ),

where in this equation and henceforth we will use the convention that the activation
function σ : R → R acts component-wise when applied to a vector (and here of course
W^{(l+1)} acts by matrix multiplication on σ(z^{(l)}(x))), i.e. if we were to write out the
components in full, then we would have:

(2.6)    z_i^{(l+1)}(x) = b_i^{(l+1)} + Σ_{j=1}^{n_l} w_{ij}^{(l+1)} σ( z_j^{(l)}(x) )

for i = 1, . . . , n_{l+1} and l = 0, . . . , L − 1 (where, per the convention above, σ acts as the
identity on the input-layer preactivations, i.e. σ(z^{(0)}(x)) = x). And (2.3) implies that

(2.7)    a^{(l+1)}(x) = σ( b^{(l+1)} + W^{(l+1)} a^{(l)}(x) ).

So for the multilayer perceptron, the map which takes the activations at a given layer
as inputs and then outputs the activations at the next layer is an affine transformation
followed by the application of the activation function component-wise. In this context,
the space R^{n_{l+1}} - that which contains the (l + 1)th layer activations a^{(l+1)}(x) - is called
activation space. As in the general case, in these notes, the function that the MLP
implements is given by the preactivations at the final layer, i.e. m(x) = z^{(L)}(x).
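The layer recursion (2.5)-(2.7) translates directly into code. Here is a minimal NumPy sketch of the resulting map m : R^{n_0} → R^{n_L}; the widths, the choice of σ and the random (untrained) weights are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(1)
widths = [4, 16, 4]                  # n_0, n_1, n_2: a depth L = 2 MLP with n_0 = n_L
sigma = np.tanh                      # the common activation function, applied component-wise

# W^{(l)} has shape (n_l, n_{l-1}); b^{(l)} has shape (n_l,)
Ws = [rng.normal(size=(widths[l], widths[l - 1])) for l in range(1, len(widths))]
bs = [rng.normal(size=(widths[l],)) for l in range(1, len(widths))]

def mlp(x):
    # implements (2.5): z^{(l+1)} = b^{(l+1)} + W^{(l+1)} sigma(z^{(l)}),
    # with sigma acting as the identity at the input layer and not applied at the final layer
    z = x
    for l, (W, b) in enumerate(zip(Ws, bs)):
        a = z if l == 0 else sigma(z)    # activations of the previous layer
        z = b + W @ a
    return z                             # m(x) = z^{(L)}(x): the final-layer preactivations

x = rng.normal(size=(widths[0],))
print(mlp(x).shape)                      # (4,)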
Remark 2.1. One may actually want to implement activations at the final layer, but it
seems better to leave the definition like this; it is easy to consider σ ◦ m if one needs to.
Often in theoretical work, we needn’t bother with biases at all; they can be simulated by
instead creating an additional input dimension at each layer which only ever receives the
input ‘1’. Then the action of adding a bias can be replicated via a weight on the new
input edge. ( It’s not clear to me this is really any simpler than just leaving the biases
there but it explains why you sometimes don’t see them. )
In the architecture that we are building up to, we will consider MLPs with d = n_0 =
n_L = d′ (i.e. the input and output dimensions are the same) and it is common in practice
to have L = 2, though there is no real reason to insist on that for a theoretical treatment.

2.3. Feedforward Layers. With n, d ≥ 1 fixed, given a matrix X ∈ R^{n×d} and an MLP
m with input dimension n_0 = d (and, as above, output dimension n_L = d), we will write
m(X) to denote the matrix obtained by letting m act independently on each row of X.
And given such an MLP we define a feedforward layer ff_m by

    ff_m : R^{n×d} → R^{n×d}
           X ↦ X + m(X).
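A corresponding sketch of ff_m, written self-contained with a small depth-2 MLP acting on each row of X followed by the residual addition; the sizes and random weights are again illustrative.

import numpy as np

rng = np.random.default_rng(2)
n, d, hidden = 5, 4, 16                        # context window, model width, MLP hidden width
W1, b1 = rng.normal(size=(hidden, d)), rng.normal(size=(hidden,))
W2, b2 = rng.normal(size=(d, hidden)), rng.normal(size=(d,))

def m(X):
    # the MLP acting independently on each row of X (vectorised over rows)
    return np.tanh(X @ W1.T + b1) @ W2.T + b2

def ff(X):
    # the feedforward layer: X |-> X + m(X)
    return X + m(X)

X = rng.normal(size=(n, d))
print(ff(X).shape)                             # (5, 4)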

3. Attention Layers
In this section we will describe the action of attention layers. The key structural
difference between attention layers and feedforward layers is that whereas a feedforward
layer processes each row of a matrix X ∈ Rn×d independently, an attention layer performs
operations ‘across’ the rows of the matrix.
3.1. Patterns and Heads. With n, d ≥ 1 fixed, define the softmax function softmax :
R^{n×n} → R^{n×n} to act on a matrix A = [a_{ij}] row-wise by the formula

    softmax(A)_{ij} = e^{a_{ij}} / Σ_{q=1}^{n} e^{a_{iq}}.

We will write softmax∗ for a modified version of the softmax function using what is known
as autoregressive masking. The modified formula is:

    softmax∗(A)_{ij} = 0 if j > i,   and   softmax∗(A)_{ij} = e^{a_{ij}} / Σ_{q≤i} e^{a_{iq}} otherwise.

We can think of this as being given by softmax∗(A) = softmax(A + M), where M is an
(n × n) matrix called a mask, with entries

    m_{ij} = 0 if i ≥ j,   and   m_{ij} = −∞ otherwise,

i.e. M is zero on and below the diagonal and −∞ strictly above it:

        ⎛ 0  −∞  −∞  ⋯  −∞ ⎞
        ⎜ 0   0  −∞  ⋯  −∞ ⎟
    M = ⎜ ⋮        ⋱   ⋱   ⋮ ⎟
        ⎜ 0   ⋯   ⋯   0  −∞ ⎟
        ⎝ 0   ⋯   ⋯   ⋯   0 ⎠
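Here is a minimal NumPy sketch of softmax∗ computed via the mask M, together with two quick checks of its defining properties; the matrix size is an arbitrary choice.

import numpy as np

def softmax_rows(A):
    # row-wise softmax, with the usual max-subtraction for numerical stability
    Z = np.exp(A - A.max(axis=-1, keepdims=True))
    return Z / Z.sum(axis=-1, keepdims=True)

def masked_softmax(A):
    # softmax*: entries above the diagonal are sent to zero, each row renormalised over j <= i
    n = A.shape[0]
    M = np.where(np.tril(np.ones((n, n))) > 0, 0.0, -np.inf)   # the mask M
    return softmax_rows(A + M)

rng = np.random.default_rng(3)
A = rng.normal(size=(4, 4))
S = masked_softmax(A)
assert np.allclose(np.triu(S, k=1), 0.0)        # zero strictly above the diagonal
assert np.allclose(S.sum(axis=1), 1.0)          # each row sums to one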
An attention head is a function

(3.1)    h : R^{n×d} → R^{n×d}

of the form

(3.2)    h(X) = ( softmax∗( X W_qk^h X^T ) ⊗ W_ov^h ) X,

where W_qk^h is a d × d matrix called the query-key matrix and where W_ov^h is a d × d
matrix called the output-value matrix. The attention pattern of the head h is the function
A^h : R^{n×d} → R^{n×n} given by

(3.3)    A^h(X) = softmax∗( X W_qk^h X^T ),

so that once we take into account this definition and the way that the tensor product
acts, we have

(3.4)    h(X) = A^h(X) X ( W_ov^h )^T.

The entries of the query-key matrix and the output-value matrix are all trainable parameters,
and both matrices are constrained to be products of two low-rank projections:
There is another fixed integer d_h < d called the dimension of the attention head, and
there are four projection matrices W_q^h, W_k^h, W_v^h : R^d → R^{d_h} and W_o^h : R^{d_h} → R^d,
which are called the query, key, value and output matrices respectively, and for which
W_qk^h := (W_q^h)^T W_k^h and W_ov^h := W_o^h W_v^h. So:

(3.5)    W_qk^h : R^d --W_k^h--> R^{d_h} --(W_q^h)^T--> R^d

and

(3.6)    W_ov^h : R^d --W_v^h--> R^{d_h} --W_o^h--> R^d.
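Putting (3.2)-(3.6) together, here is a minimal NumPy sketch of a single attention head built from randomly initialised low-rank factors; the sizes n, d and d_h are illustrative, and (as in these notes) no 1/√d_h scaling or positional information is included.

import numpy as np

rng = np.random.default_rng(4)
n, d, d_h = 5, 8, 2                                    # context window, model width, head dimension

# the four projection matrices: W_q, W_k, W_v map R^d -> R^{d_h}; W_o maps R^{d_h} -> R^d
W_q, W_k, W_v = (rng.normal(size=(d_h, d)) for _ in range(3))
W_o = rng.normal(size=(d, d_h))

W_qk = W_q.T @ W_k                                     # query-key matrix, d x d
W_ov = W_o @ W_v                                       # output-value matrix, d x d

def masked_softmax_rows(A):
    A = np.where(np.tril(np.ones_like(A)) > 0, A, -np.inf)
    Z = np.exp(A - A.max(axis=-1, keepdims=True))
    return Z / Z.sum(axis=-1, keepdims=True)

def head(X):
    pattern = masked_softmax_rows(X @ W_qk @ X.T)      # (3.3): A^h(X), an n x n matrix
    return pattern @ X @ W_ov.T                        # (3.4): h(X) = A^h(X) X (W_ov^h)^T

X = rng.normal(size=(n, d))
print(head(X).shape)                                    # (5, 8)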

3.2. Attention Layers. An attention layer is defined via an attention multi-head,
which is a set H of attention heads all with the same dimension, i.e. such that there is
some integer d_H with d_h = d_H for every h ∈ H. Given such an attention multi-head, we
define an attention layer attn_H by

    attn_H : R^{n×d} → R^{n×d}
             X ↦ X + Σ_{h∈H} h(X).

Remark 3.1. (W_qk^h as a Bilinear Form.) Notice that the expression X W_qk^h X^T can be
thought of as the result of applying a bilinear form to each pair of rows in the matrix X,
i.e. if we write x_1, . . . , x_n ∈ R^d for the (row) vectors given by the rows of X, and write
⟨v, v′⟩_h := v W_qk^h (v′)^T for any two such vectors v, v′ ∈ R^d, then the attention pattern is

    A^h(X) = softmax∗( [ ⟨x_i, x_j⟩_h ]_{i,j=1}^{n} ),

i.e. softmax∗ applied to the (n × n) matrix whose (i, j) entry is ⟨x_i, x_j⟩_h.

Remark 3.2. (Composition of Attention Heads.) The product of two (or more) attention
heads behaves in a similar way to a single true attention head via the fact that

    ( A^{h_1} ⊗ W_ov^{h_1} )( A^{h_2} ⊗ W_ov^{h_2} ) = A^{h_1} A^{h_2} ⊗ W_ov^{h_1} W_ov^{h_2}.

Thus the product behaves like an attention head with attention pattern given by A^{h_1∘h_2} =
A^{h_1} A^{h_2} and with output-value matrix given by W_ov^{h_1∘h_2} = W_ov^{h_1} W_ov^{h_2}. Again this is making
use of the fact that for a fixed attention pattern, attention heads have a linear nature.
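With the attention patterns held fixed, as the remark assumes, the identity amounts to (A_1 ⊗ W_1)((A_2 ⊗ W_2)X) = A_1 A_2 X (W_1 W_2)^T for every X, which is easy to check numerically; the following sketch does so for random matrices of illustrative sizes.

import numpy as np

rng = np.random.default_rng(5)
n, d = 5, 8
A1, A2 = rng.normal(size=(n, n)), rng.normal(size=(n, n))     # two (fixed) attention patterns
W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))     # two output-value matrices
X = rng.normal(size=(n, d))

def tensor_act(A, W, X):
    # the action of A (x) W on X, as in (3.4): X |-> A X W^T
    return A @ X @ W.T

lhs = tensor_act(A1, W1, tensor_act(A2, W2, X))               # apply head 2, then head 1
rhs = tensor_act(A1 @ A2, W1 @ W2, X)                          # the composite head of Remark 3.2
assert np.allclose(lhs, rhs)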

4. The Core Decoder Architecture


4.1. Residual Blocks. Given a set H of attention heads and an MLP m, a residual
block B = B(H, m) : R^{n×d} → R^{n×d} is defined to be the composition of the attention
layer attn_H with the feedforward layer ff_m, i.e.

(4.1)    B(H, m) = ff_m ∘ attn_H.

Recalling the definitions

    attn_H : X ↦ X + Σ_{h∈H} h(X);   and
    ff_m : X ↦ X + m(X),

we of course have

(4.2)    B(X) = X + Σ_{h∈H} h(X) + m( X + Σ_{h∈H} h(X) ).

Remark 4.1. Some modern architectures choose to do these operations in ‘parallel’ and
instead use blocks that compute X ↦ X + ff(X) + attn(X), but here we will stick with
the composition which is used by, for example, the GPT architectures.
4.2. The Decoder Stack. With n, d ≥ 1 fixed as before, fix another integer 𝚗 (written in
‘typewriter’ font to distinguish it from the context-window size n), which will be the number
of blocks in the transformer. Let {H_i}_{i=1}^{𝚗} be a set of attention multi-heads
such that each multi-head has the same number of heads (i.e. with |H_i| = n_heads for
i = 1, . . . , 𝚗) and the same dimension (i.e. d_{H_i} = d_heads for i = 1, . . . , 𝚗). And let
{m_i}_{i=1}^{𝚗} be a set of MLPs, each of which has L layers. The decoder stack D is the
composition of the corresponding 𝚗 blocks {B(H_i, m_i)}_{i=1}^{𝚗}:

(4.3)    D : R^{n×d} --B(H_1,m_1)--> R^{n×d} --B(H_2,m_2)--> ⋯ --B(H_{𝚗−1},m_{𝚗−1})--> R^{n×d} --B(H_𝚗,m_𝚗)--> R^{n×d}.
4.3. The Full Transformer. Now, given a matrix t ∈ R^{n×n_vocab} (of one-hot tokens as
described in (3†) of Section 1.2), we will define the full transformer, which we will denote
by T, as first acting on t via the embedding (together with the positional encoding), then
via the decoder stack, and then finally via the unembedding. So:

(4.4)    T : R^{n×n_vocab} --(Id ⊗ W_E)--> R^{n×d} --D--> R^{n×d} --(Id ⊗ W_U)--> R^{n×n_vocab},

where the three maps are the embedding, the decoder stack and the unembedding respectively.
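Finally, here is a compact NumPy sketch wiring the pieces of (4.4) together: embedding, a stack of residual blocks (each a masked single-head attention layer followed by a feedforward layer) and unembedding. All sizes and the random, untrained weights are illustrative assumptions, and the positional encoding, layer normalisation and 1/√d_h scaling used in practice are omitted, as they are not part of the framework above.

import numpy as np

rng = np.random.default_rng(6)
n_vocab, d, n, d_h, n_blocks, hidden = 50, 16, 6, 4, 2, 32

def masked_softmax_rows(A):
    A = np.where(np.tril(np.ones_like(A)) > 0, A, -np.inf)
    Z = np.exp(A - A.max(axis=-1, keepdims=True))
    return Z / Z.sum(axis=-1, keepdims=True)

def make_block():
    # one residual block B(H, m) with a single attention head and a depth-2 MLP
    p = {"W_q": rng.normal(size=(d_h, d)), "W_k": rng.normal(size=(d_h, d)),
         "W_v": rng.normal(size=(d_h, d)), "W_o": rng.normal(size=(d, d_h)),
         "W1": rng.normal(size=(hidden, d)), "b1": rng.normal(size=(hidden,)),
         "W2": rng.normal(size=(d, hidden)), "b2": rng.normal(size=(d,))}
    def block(X):
        W_qk, W_ov = p["W_q"].T @ p["W_k"], p["W_o"] @ p["W_v"]
        pattern = masked_softmax_rows(X @ W_qk @ X.T)          # (3.3)
        X = X + pattern @ X @ W_ov.T                           # attention layer (one head)
        m = np.tanh(X @ p["W1"].T + p["b1"]) @ p["W2"].T + p["b2"]
        return X + m                                           # feedforward layer
    return block

W_E = rng.normal(size=(d, n_vocab))                            # embedding, d x n_vocab
W_U = rng.normal(size=(n_vocab, d))                            # unembedding, n_vocab x d
blocks = [make_block() for _ in range(n_blocks)]

def transformer(token_ids):
    t = np.eye(n_vocab)[token_ids]                             # one-hot matrix t, n x n_vocab
    X = t @ W_E.T                                              # embedding (1.1)
    for block in blocks:                                       # decoder stack (4.3)
        X = block(X)
    return X @ W_U.T                                           # unembedding (1.2)

logits = transformer(np.array([3, 17, 42, 3, 9, 7]))
print(logits.shape)                                            # (6, 50)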

Notation. Note that in the literature, the notation n_layers is often used where we have
used the ‘typewriter’ font 𝚗.

References
[1] Mary Phuong and Marcus Hutter. Formal algorithms for transformers. arXiv preprint
arXiv:2207.09238, 2022.
