
NOTES ON THE MATHEMATICS OF LARGE TRANSFORMER

LANGUAGE MODEL ARCHITECTURE

SPENCER BECKER-KAHN

Introduction
From a mathematical point of view, the building and training of a large transformer
language model (LLM) is the construction of a certain function, from some Euclidean space
to another, that has certain interesting properties. And it may therefore be surprising
to find that many key papers announcing significant new LLMs seem reluctant to simply
spell out the details of the function that they have constructed in plain mathematical
language or indeed even in complete pseudo-code. The latter form of this complaint is
the subject of the recent article of Phuong and Hutter [1]. Here, we focus on one aspect
of the former perspective and seek to give a relatively ‘pure’ mathematical description of
the architecture of an LLM. To do so adequately, on the one hand we seek a sufficiently
high level of accuracy, i.e. the mathematical framework that we build must faithfully
represent or be sufficiently analogous to the true nature of real-world models. On the
other hand, we seek mathematical elegance. And not simply for its own sake: One hope
with something like this is that a good mathematical framework can form a foundation
for doing certain kinds of conceptual reasoning effectively. And the elegance we seek when
building the framework - economy of expression, flexibility, getting just the right level of
generality etc. - is genuinely important for the utility of the framework.
Trainable Parameters. Like all such models in machine learning, the construction
initially describes a family of functions indexed by some set Θ = R^N called the parameter
space. There is then a separate process - the training of the model - in which a particular
value θ ∈ Θ is selected using a training algorithm. Each dimension of Θ corresponds to
the possible values of an individual trainable parameter. We will draw attention to such
parameters as we introduce them, as opposed to attempting to give a definition of Θ up
front. A complete description of the training algorithm will possibly be the subject of a
follow-up.

1. Tokenization, Encodings and Embeddings


The transformer T will take as input a matrix t ∈ R^{n×n_vocab}, each row of which is
a standard basis vector of R^{n_vocab} (and is therefore all zeroes except for one ‘1’). The
so-called ‘decoder’ part of the transformer - the ‘deep learning’ bit - acts on matrices
X ∈ R^{n×d} for some d < n_vocab. In this section we will explain how t corresponds to a
slice of text and how the mapping t ↦ X works.
1.1. Tokenization. Let C denote a corpus of real text: A large collection of different
artefacts (articles, books, blog posts, text conversations etc.) from many different places
on the internet. We will use this to define a vocabulary for the transformer. We will

think of the corpus as a single string α_1 α_2 α_3 ⋯ = (α_i)_{i∈I} that uses a special character
to separate the individual artefacts in the corpus, i.e. the special character marks where
one thing ends and the next bit of text begins. Every other character is drawn from an
alphabet A.
We will produce the vocabulary via an iterative process, starting with the set V_0
consisting of all 256 bytes, i.e. all 2^8 strings of length 8 consisting of only zeroes or
ones, together with special characters to denote the end of words and the end of individual
artefacts. The UTF-8 encoding E_{UTF-8} = E is a standard way to map elements of A to
short strings of bytes. It can be thought of as an injective map E : A → V_0 × V_0 × V_0 × V_0.
By replacing every character α in the corpus C by its image E(α) under the map E, we
produce a new corpus C_0 (or rather a ‘translation’ of the original corpus) that is written
entirely using symbols from the vocabulary V_0, i.e. we set C_0 := (E(α_i))_{i∈I}. Next, to
produce the set V_1 from V_0, we take the pair of symbols (b, b′) which occurs consecutively
the greatest number of times in C_0 and add the single new symbol bb′ to the vocabulary,
i.e. we set V_1 = V_0 ∪ {bb′}. And we replace every occurrence of (b, b′) in the corpus C_0
by the new symbol bb′, resulting in the corpus C_1. Then of course we continue iteratively,
producing V_{i+1} and C_{i+1} from V_i and C_i in the same way. Notice that the size of the
vocabulary is increased by one at each step, i.e. |V_{i+1}| = |V_i| + 1. We fix a parameter
n_vocab and terminate the process at V := V_m with |V_m| = n_vocab. We refer to V as the
vocabulary. An element of the vocabulary is called a token. The result is that:
(1†) With the vocabulary V having been fixed, we now have a canonical way of taking
any string S of real text and mapping it to a (finite) sequence of elements from
the fixed vocabulary V.
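The iterative merging procedure just described is essentially byte-level byte-pair encoding (BPE). To make it concrete, here is a minimal Python sketch of the vocabulary-construction loop; the toy corpus, the omission of the special end-of-word and end-of-artefact symbols, and the stopping size n_vocab are illustrative assumptions rather than anything fixed by these notes.

from collections import Counter

def build_vocab(corpus_symbols, n_vocab):
    # corpus_symbols: the corpus C_0 written as a list of base symbols (here: the bytes
    # of a UTF-8 encoded string); n_vocab: target vocabulary size (illustrative).
    vocab = {bytes([b]) for b in range(256)}           # V_0: all 256 single bytes
    corpus = list(corpus_symbols)
    while len(vocab) < n_vocab:
        # count how often each pair of symbols occurs consecutively in the corpus
        pairs = Counter(zip(corpus, corpus[1:]))
        if not pairs:
            break
        (b1, b2), _ = pairs.most_common(1)[0]          # most frequent pair (b, b')
        merged = b1 + b2                               # the new symbol bb'
        vocab.add(merged)                              # V_{i+1} = V_i ∪ {bb'}
        # replace every occurrence of (b, b') in the corpus by the merged symbol
        new_corpus, i = [], 0
        while i < len(corpus):
            if i + 1 < len(corpus) and corpus[i] == b1 and corpus[i + 1] == b2:
                new_corpus.append(merged)
                i += 2
            else:
                new_corpus.append(corpus[i])
                i += 1
        corpus = new_corpus
    return vocab, corpus

# toy example: treat a short UTF-8 string as the corpus C_0
text = "the cat sat on the mat. the cat sat."
symbols = [bytes([b]) for b in text.encode("utf-8")]
V, C_final = build_vocab(symbols, n_vocab=256 + 10)    # ten merge steps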
1.2. Embedding and Unembedding. Then we fix what is called a one-hot encoding
of the vocabulary, which is simply a bijection between the set V and the standard
orthonormal basis of R^{n_vocab}. Denote this by σ : V → {e_j}_{j=1}^{n_vocab}. We will sometimes also
refer to σ(V) ⊂ R^{n_vocab} as the set of one-hot tokens.
(2†) With the vocabulary and the one-hot encoding having been fixed, we now have
a canonical way of taking any string S of real text and mapping it to a (finite)
sequence (t_1(S), t_2(S), . . . , t_N(S)) of one-hot vectors in R^{n_vocab}, where N = N(S).
We will refer to this as tokenization.
The height n of the matrices that the decoder takes as input is a parameter n_ctx = n
which is called the size of the context window. So
(3†) With the vocabulary, the one-hot encoding and the parameter n_ctx = n having
been fixed, we now have a canonical way of taking any string S of real text that
satisfies N(S) = n and forming a matrix t ∈ R^{n×n_vocab}, the rows of which are the
one-hot tokens t_1(S), t_2(S), . . . , t_n(S).
Next, we embed the one-hot encoding in a smaller vector space. Specifically, we choose a
parameter d < n_vocab and a (d × n_vocab) projection matrix

    W_E : R^{n_vocab} → R^d

called the token embedding. The entries of the matrix W_E are trainable parameters (and
after training an embedding like this is referred to as a learned embedding). We refer
to W_E(σ(V)) as the set of embedded tokens. Notice that the embedded tokens are the
columns of W_E.

Given a matrix t of one-hot tokens, as in (3†), the first thing that the transformer does
is act on each row of t by the embedding matrix, i.e.

(1.1)    t ↦ (Id ⊗ W_E) t = t W_E^T = X ∈ R^{n×d},        (Embedding)

where X is the (n × d) matrix whose rows are the embedded vectors x_1, . . . , x_n ∈ R^d.

Let us also define now the (n_vocab × d) matrix

    W_U : R^d → R^{n_vocab}

called the unembedding, the entries of which are also trainable parameters. This acts on
the rows of a matrix X ∈ R^{n×d}, i.e.

(1.2)    X ↦ (Id ⊗ W_U) X = X W_U^T ∈ R^{n×n_vocab},        (Unembedding)

where the rows of the resulting (n × n_vocab) matrix are denoted τ_1, . . . , τ_n ∈ R^{n_vocab}.

Sometimes one insists that W_U = W_E^T, a constraint that we refer to as the embedding
and unembedding being ‘tied’.
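To make (1.1) and (1.2) concrete, here is a minimal NumPy sketch that builds the one-hot matrix t from a sequence of token indices and applies a randomly initialised (i.e. untrained) embedding and unembedding; the sizes n_vocab, d and n are small illustrative values, not anything prescribed by these notes.

import numpy as np

rng = np.random.default_rng(0)
n_vocab, d, n = 50, 8, 5           # illustrative sizes: |V|, embedding dimension, context window

# a sequence of n token indices, standing in for sigma applied to a tokenized string S
token_ids = np.array([3, 17, 42, 3, 9])

# t in R^{n x n_vocab}: each row is a one-hot (standard basis) vector
t = np.zeros((n, n_vocab))
t[np.arange(n), token_ids] = 1.0

# W_E is (d x n_vocab); its columns are the embedded tokens
W_E = rng.normal(size=(d, n_vocab))
X = t @ W_E.T                      # (1.1): X in R^{n x d}, row i is the embedding of token i

# W_U is (n_vocab x d) with its own trainable entries; setting W_U = W_E.T would 'tie' them
W_U = rng.normal(size=(n_vocab, d))
logits = X @ W_U.T                 # (1.2): back to R^{n x n_vocab}

assert X.shape == (n, d) and logits.shape == (n, n_vocab)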
Remark 1.1. One also typically includes some kind of positional encoding. There are
several competing ways of doing this which we will not review here and this is one of a
few aspects of these models that we essentially ignore in our mathematical framework.
Let us just say that the simplest positional encoding is just to define another n × d matrix
P the entries of which are all trainable parameters and to simply add it on to X, i.e.
during the embedding step, the transformer maps t ↦ t W_E^T + P. We have opted to leave
out any further discussion of positional encodings.

2. Feedforward Layers
We will describe the multilayer perceptron by first describing a generalized feedforward
architecture. We will not initially need these objects in the full generality presented here,
but doing so serves to illustrate the underlying mathematical structure at the level
of the individual vertices or ‘artificial neurons’ better than would be achieved by directly
describing the multilayer perceptron in the way it is usually thought of.
2.1. Basic General Definitions. Given a directed, acyclic graph G = (V, E) with min-
imum degree at least one (i.e. with the property that every vertex belongs to at least one
edge), we write V = I ∪ H ∪ O where:
• I is the set of vertices which have no incoming edges; these are called the input
vertices.
• O is the set of vertices which have no outgoing edges; these are called the output
vertices.

• H := V \ (I ∪ O).
A feed-forward artificial neural network N is a pair (G, A) where
(1) G = (V, E) is a finite, directed acyclic graph with minimum degree at least one
called the architecture of N ; and
(2) A = {σv : R → R : v ∈ V \ (I ∪ O)} is a family of functions called the activation
functions for N .
In a feed-forward artificial neural network, the input and output vertices are labelled,
so that we may write I = {I_1, . . . , I_d}, where d = |I|, and O = {O_1, . . . , O_{d′}}, where
d′ = |O|. Given a choice of weights w_e ∈ R for e ∈ E and biases b_v ∈ R for v ∈ V \ I, the
network defines a function m : R^d → R^{d′} in the following way (for now we have omitted
the dependence of m on N, (w_e)_{e∈E}, and (b_v)_{v∈V\I} from the notation): Given x ∈ R^d,
for each input vertex I_j we define z_{I_j} : R^d → R by

    z_{I_j}(x) = x_j,

and we adopt the convention that σ_v is the identity (so that a_v = z_v) for every v ∈ I.
Then, for any v ∈ V \ I, the preactivation at v is given by

(2.1)    z_v(x) := b_v + Σ_{e=(v′,v)∈E} w_e σ_{v′}(z_{v′}(x)),

and the activation at v is:

(2.2)    a_v(x) := σ_v(z_v(x)).
Thus

(2.3)    a_v(x) = σ_v( b_v + Σ_{e=(v′,v)∈E} w_e a_{v′}(x) ).

The output of the function m is given by the preactivations at the output vertices, i.e.

    m(x) := ( z_{O_1}(x), . . . , z_{O_{d′}}(x) ).
The parameter space of the network N is a set Θ = Θ_N, each point of which represents
a choice of weights and biases, i.e. Θ is the collection of all θ = (w, b) where w =
(w_e)_{e∈E} ∈ R^E is a set of weights for the network and b = (b_v)_{v∈V\I} ∈ R^{V\I} is a set of biases.
The weights and biases are referred to as the trainable parameters of the network.
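To make the general definition concrete, the following Python sketch evaluates such a network on a toy directed acyclic graph by computing (2.1) and (2.2) vertex by vertex in topological order; the particular graph, weights, biases and activation function are illustrative assumptions only.

import math
from graphlib import TopologicalSorter    # standard library, Python 3.9+

# a toy architecture: input vertices I1, I2, one hidden vertex h1, one output vertex O1
edges = {("I1", "h1"), ("I2", "h1"), ("I1", "O1"), ("h1", "O1")}
weights = {e: 0.5 for e in edges}                     # w_e, illustrative values
biases = {"h1": 0.1, "O1": -0.2}                      # b_v for every non-input vertex
sigma = math.tanh                                     # one activation function for the hidden vertices
inputs, outputs = ["I1", "I2"], ["O1"]

def forward(x):
    # x: list of input values, one per input vertex
    preds = {v: [] for v in biases}                   # incoming neighbours of each non-input vertex
    for (u, v) in edges:
        preds[v].append(u)
    order = TopologicalSorter({v: preds.get(v, []) for v in set(biases) | set(inputs)}).static_order()
    z, a = {}, {}
    for v in order:
        if v in inputs:
            z[v] = x[inputs.index(v)]                 # z_{I_j}(x) = x_j
            a[v] = z[v]                               # convention: identity activation at inputs
        else:
            # (2.1): the preactivation is the bias plus the weighted sum of upstream activations
            z[v] = biases[v] + sum(weights[(u, v)] * a[u] for u in preds[v])
            if v not in outputs:
                a[v] = sigma(z[v])                    # (2.2); output activations are never needed
    return [z[o] for o in outputs]                    # m(x): preactivations at the output vertices

print(forward([1.0, -1.0]))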
2.2. The Fully Connected Multi-Layer Perceptron. The multilayer perceptron (MLP)
is a feed-forward artificial neural network for which:
(1) The architecture G = (V, E) is layered: This means that the vertex set can be
written as a disjoint union V = V^{(0)} ∪ · · · ∪ V^{(L)} for some L ≥ 1 called the
depth of the network (this is the ‘deep’ in deep learning), with I = V^{(0)},
O = V^{(L)} and E ⊂ ⋃_{l=0}^{L−1} V^{(l)} × V^{(l+1)};
(2) The activation functions are all equal to the same function σ.
In these notes we will be concerned only with MLPs that are fully connected, which means
that

(2.4)    E = ⋃_{l=0}^{L−1} V^{(l)} × V^{(l+1)}.
The integer n_l := |V^{(l)}| is called the width of the lth layer and we write v_i^{(l)} for the ith
vertex in the lth layer of the network. Given θ = (w, b) ∈ Θ, write b_i^{(l)} := b_{v_i^{(l)}}, so that b^{(l)}
is a vector containing all the biases at the lth layer of the network, and for l = 1, . . . , L let
W^{(l)} ∈ R^{n_l × n_{l−1}} denote the matrix whose entries are given by

    w_{ij}^{(l)} := w_{(v_i^{(l)}, v_j^{(l−1)})}.

This is called a weight matrix.


The special structure of the MLP means that it is fruitful to describe it in terms of
how it maps one layer to the next, as opposed to using a description that stays only at
the level of the individual vertices. Given x ∈ R^d, define the lth layer preactivation to be
the vector z^{(l)}(x) ∈ R^{n_l} whose components are given by

    z_i^{(l)}(x) := z_{v_i^{(l)}}(x).

And define the lth layer activation to be the vector a^{(l)}(x) ∈ R^{n_l} given by

    a_i^{(l)}(x) := a_{v_i^{(l)}}(x).
Then (2.1) implies that

(2.5)    z^{(l+1)}(x) = b^{(l+1)} + W^{(l+1)} σ( z^{(l)}(x) ),

where in this equation and henceforth we will use the convention that the activation
function σ : R → R acts component-wise when applied to a vector (and here of course
W^{(l+1)} acts by matrix multiplication on σ(z^{(l)}(x))), i.e. if we were to write out the
components in full, then we would have:

(2.6)    z_i^{(l+1)}(x) = b_i^{(l+1)} + Σ_{j=1}^{n_l} w_{ij}^{(l+1)} σ( z_j^{(l)}(x) )

for i = 1, . . . , n_{l+1} and l = 0, . . . , L − 1 (where, per the convention above, σ acts as the
identity on the input-layer preactivations, i.e. σ(z^{(0)}(x)) = x). And (2.3) implies that

(2.7)    a^{(l+1)}(x) = σ( b^{(l+1)} + W^{(l+1)} a^{(l)}(x) ).

So for the multilayer perceptron, the map which takes the activations at a given layer
as inputs and then outputs the activations at the next layer is an affine transformation
followed by the application of the activation function component-wise. In this context,
the space R^{n_{l+1}} - that which contains the (l + 1)th layer activations a^{(l+1)}(x) - is called
activation space. As in the general case, in these notes, the function that the MLP
implements is given by the preactivations at the final layer, i.e. m(x) = z^{(L)}(x).
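The layer recursion (2.5)-(2.7) translates directly into code. Here is a minimal NumPy sketch of the resulting map m : R^{n_0} → R^{n_L}; the widths, the choice of σ and the random (untrained) weights are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(1)
widths = [4, 16, 4]                  # n_0, n_1, n_2: a depth L = 2 MLP with n_0 = n_L
sigma = np.tanh                      # the common activation function, applied component-wise

# W^{(l)} has shape (n_l, n_{l-1}); b^{(l)} has shape (n_l,)
Ws = [rng.normal(size=(widths[l], widths[l - 1])) for l in range(1, len(widths))]
bs = [rng.normal(size=(widths[l],)) for l in range(1, len(widths))]

def mlp(x):
    # implements (2.5): z^{(l+1)} = b^{(l+1)} + W^{(l+1)} sigma(z^{(l)}),
    # with sigma acting as the identity at the input layer and not applied at the final layer
    z = x
    for l, (W, b) in enumerate(zip(Ws, bs)):
        a = z if l == 0 else sigma(z)    # activations of the previous layer
        z = b + W @ a
    return z                             # m(x) = z^{(L)}(x): the final-layer preactivations

x = rng.normal(size=(widths[0],))
print(mlp(x).shape)                      # (4,)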
Remark 2.1. One may actually want to implement activations at the final layer, but it
seems better to leave the definition like this; it is easy to consider σ ◦ m if one needs to.
Often in theoretical work, we needn’t bother with biases at all; they can be simulated by
instead creating an additional input dimension at each layer which only ever receives the
input ‘1’. Then the action of adding a bias can be replicated via a weight on the new
input edge. ( It’s not clear to me this is really any simpler than just leaving the biases
there but it explains why you sometimes don’t see them. )
In the architecture that we are building up to, we will consider MLPs with d = n_0 =
n_L = d′ (i.e. the input and output dimensions are the same) and it is common in practice
to have L = 2, though there is no real reason to insist on that for a theoretical treatment.

2.3. Feedforward Layers. With n, d ≥ 1 fixed, given a matrix X ∈ R^{n×d} and an MLP
m with input dimension n_0 = d (and, as above, output dimension n_L = d), we will write
m(X) to denote the matrix obtained by letting m act independently on each row of X.
And given such an MLP we define a feedforward layer ff_m by

    ff_m : R^{n×d} → R^{n×d}
           X ↦ X + m(X).
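A corresponding sketch of ff_m, written self-contained with a small depth-2 MLP acting on each row of X followed by the residual addition; the sizes and random weights are again illustrative.

import numpy as np

rng = np.random.default_rng(2)
n, d, hidden = 5, 4, 16                        # context window, model width, MLP hidden width
W1, b1 = rng.normal(size=(hidden, d)), rng.normal(size=(hidden,))
W2, b2 = rng.normal(size=(d, hidden)), rng.normal(size=(d,))

def m(X):
    # the MLP acting independently on each row of X (vectorised over rows)
    return np.tanh(X @ W1.T + b1) @ W2.T + b2

def ff(X):
    # the feedforward layer: X |-> X + m(X)
    return X + m(X)

X = rng.normal(size=(n, d))
print(ff(X).shape)                             # (5, 4)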

3. Attention Layers
In this section we will describe the action of attention layers. The key structural
difference between attention layers and feedforward layers is that whereas a feedforward
layer processes each row of a matrix X ∈ Rn×d independently, an attention layer performs
operations ‘across’ the rows of the matrix.
3.1. Patterns and Heads. With n, d ≥ 1 fixed, define the softmax function softmax :
R^{n×n} → R^{n×n} to act on a matrix A = [a_{ij}] row-wise by the formula

    softmax(A)_{ij} = e^{a_{ij}} / Σ_{q=1}^{n} e^{a_{iq}}.

We will write softmax∗ for a modified version of the softmax function using what is known
as autoregressive masking. The modified formula is:

    softmax∗(A)_{ij} = 0 if j > i,   and   softmax∗(A)_{ij} = e^{a_{ij}} / Σ_{q≤i} e^{a_{iq}} otherwise.

We can think of this as being given by softmax∗(A) = softmax(A + M), where M is an
(n × n) matrix called a mask, with entries

    m_{ij} = 0 if i ≥ j,   and   m_{ij} = −∞ otherwise,

i.e. M is zero on and below the diagonal and −∞ strictly above it:

        ⎛ 0  −∞  −∞  ⋯  −∞ ⎞
        ⎜ 0   0  −∞  ⋯  −∞ ⎟
    M = ⎜ ⋮        ⋱   ⋱   ⋮ ⎟
        ⎜ 0   ⋯   ⋯   0  −∞ ⎟
        ⎝ 0   ⋯   ⋯   ⋯   0 ⎠
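Here is a minimal NumPy sketch of softmax∗ computed via the mask M, together with two quick checks of its defining properties; the matrix size is an arbitrary choice.

import numpy as np

def softmax_rows(A):
    # row-wise softmax, with the usual max-subtraction for numerical stability
    Z = np.exp(A - A.max(axis=-1, keepdims=True))
    return Z / Z.sum(axis=-1, keepdims=True)

def masked_softmax(A):
    # softmax*: entries above the diagonal are sent to zero, each row renormalised over j <= i
    n = A.shape[0]
    M = np.where(np.tril(np.ones((n, n))) > 0, 0.0, -np.inf)   # the mask M
    return softmax_rows(A + M)

rng = np.random.default_rng(3)
A = rng.normal(size=(4, 4))
S = masked_softmax(A)
assert np.allclose(np.triu(S, k=1), 0.0)        # zero strictly above the diagonal
assert np.allclose(S.sum(axis=1), 1.0)          # each row sums to one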
An attention head is a function

(3.1)    h : R^{n×d} → R^{n×d}

of the form

(3.2)    h(X) = ( softmax∗( X W_qk^h X^T ) ⊗ W_ov^h ) X,

where W_qk^h is a d × d matrix called the query-key matrix and where W_ov^h is a d × d
matrix called the output-value matrix. The attention pattern of the head h is the function
A^h : R^{n×d} → R^{n×n} given by

(3.3)    A^h(X) = softmax∗( X W_qk^h X^T ),

so that once we take into account this definition and the way that the tensor product
acts, we have

(3.4)    h(X) = A^h(X) X ( W_ov^h )^T.

The entries of the query-key matrix and the output-value matrix are all trainable parameters,
and both matrices are constrained to be products of two low-rank projections:
There is another fixed integer d_h < d called the dimension of the attention head, and
there are four projection matrices W_q^h, W_k^h, W_v^h : R^d → R^{d_h} and W_o^h : R^{d_h} → R^d,
which are called the query, key, value and output matrices respectively, and for which
W_qk^h := (W_q^h)^T W_k^h and W_ov^h := W_o^h W_v^h. So:

(3.5)    W_qk^h : R^d --W_k^h--> R^{d_h} --(W_q^h)^T--> R^d

and

(3.6)    W_ov^h : R^d --W_v^h--> R^{d_h} --W_o^h--> R^d.
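Putting (3.2)-(3.6) together, here is a minimal NumPy sketch of a single attention head built from randomly initialised low-rank factors; the sizes n, d and d_h are illustrative, and (as in these notes) no 1/√d_h scaling or positional information is included.

import numpy as np

rng = np.random.default_rng(4)
n, d, d_h = 5, 8, 2                                    # context window, model width, head dimension

# the four projection matrices: W_q, W_k, W_v map R^d -> R^{d_h}; W_o maps R^{d_h} -> R^d
W_q, W_k, W_v = (rng.normal(size=(d_h, d)) for _ in range(3))
W_o = rng.normal(size=(d, d_h))

W_qk = W_q.T @ W_k                                     # query-key matrix, d x d
W_ov = W_o @ W_v                                       # output-value matrix, d x d

def masked_softmax_rows(A):
    A = np.where(np.tril(np.ones_like(A)) > 0, A, -np.inf)
    Z = np.exp(A - A.max(axis=-1, keepdims=True))
    return Z / Z.sum(axis=-1, keepdims=True)

def head(X):
    pattern = masked_softmax_rows(X @ W_qk @ X.T)      # (3.3): A^h(X), an n x n matrix
    return pattern @ X @ W_ov.T                        # (3.4): h(X) = A^h(X) X (W_ov^h)^T

X = rng.normal(size=(n, d))
print(head(X).shape)                                    # (5, 8)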

3.2. Attention Layers. An attention layer is defined via an attention multi-head,
which is a set H of attention heads all with the same dimension, i.e. such that there is
some integer d_H with d_h = d_H for every h ∈ H. Given such an attention multi-head, we
define an attention layer attn_H by

    attn_H : R^{n×d} → R^{n×d}
             X ↦ X + Σ_{h∈H} h(X).

Remark 3.1. (W_qk^h as a Bilinear Form.) Notice that the expression X W_qk^h X^T can be
thought of as the result of applying a bilinear form to each pair of rows in the matrix X,
i.e. if we write x_1, . . . , x_n ∈ R^d for the (row) vectors given by the rows of X, and write
⟨v, v′⟩_h := v W_qk^h (v′)^T for any two such vectors v, v′ ∈ R^d, then the attention pattern is

    A^h(X) = softmax∗( [ ⟨x_i, x_j⟩_h ]_{i,j=1}^{n} ),

i.e. softmax∗ applied to the (n × n) matrix whose (i, j) entry is ⟨x_i, x_j⟩_h.

Remark 3.2. (Composition of Attention Heads.) The product of two (or more) attention
heads behaves in a similar way to a single true attention head via the fact that

    ( A^{h_1} ⊗ W_ov^{h_1} )( A^{h_2} ⊗ W_ov^{h_2} ) = A^{h_1} A^{h_2} ⊗ W_ov^{h_1} W_ov^{h_2}.

Thus the product behaves like an attention head with attention pattern given by A^{h_1∘h_2} =
A^{h_1} A^{h_2} and with output-value matrix given by W_ov^{h_1∘h_2} = W_ov^{h_1} W_ov^{h_2}. Again this is making
use of the fact that for a fixed attention pattern, attention heads have a linear nature.
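With the attention patterns held fixed, as the remark assumes, the identity amounts to (A_1 ⊗ W_1)((A_2 ⊗ W_2)X) = A_1 A_2 X (W_1 W_2)^T for every X, which is easy to check numerically; the following sketch does so for random matrices of illustrative sizes.

import numpy as np

rng = np.random.default_rng(5)
n, d = 5, 8
A1, A2 = rng.normal(size=(n, n)), rng.normal(size=(n, n))     # two (fixed) attention patterns
W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))     # two output-value matrices
X = rng.normal(size=(n, d))

def tensor_act(A, W, X):
    # the action of A (x) W on X, as in (3.4): X |-> A X W^T
    return A @ X @ W.T

lhs = tensor_act(A1, W1, tensor_act(A2, W2, X))               # apply head 2, then head 1
rhs = tensor_act(A1 @ A2, W1 @ W2, X)                          # the composite head of Remark 3.2
assert np.allclose(lhs, rhs)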

4. The Core Decoder Architecture


4.1. Residual Blocks. Given a set H of attention heads and an MLP m, a residual
block B = B(H, m) : R^{n×d} → R^{n×d} is defined to be the composition of the attention
layer attn_H with the feedforward layer ff_m, i.e.

(4.1)    B(H, m) = ff_m ∘ attn_H.

Recalling the definitions

    attn_H : X ↦ X + Σ_{h∈H} h(X);   and
    ff_m : X ↦ X + m(X),

we of course have

(4.2)    B(X) = X + Σ_{h∈H} h(X) + m( X + Σ_{h∈H} h(X) ).

Remark 4.1. Some modern architectures choose to do these operations in ‘parallel’ and
instead use blocks that compute X ↦ X + ff(X) + attn(X), but here we will stick with
the composition which is used by, for example, the GPT architectures.
4.2. The Decoder Stack. With n, d ≥ 1 fixed as before, fix another integer 𝚗 (written in
‘typewriter’ font to distinguish it from the context-window size n), which will be the number
of blocks in the transformer. Let {H_i}_{i=1}^{𝚗} be a set of attention multi-heads
such that each multi-head has the same number of heads (i.e. with |H_i| = n_heads for
i = 1, . . . , 𝚗) and the same dimension (i.e. d_{H_i} = d_heads for i = 1, . . . , 𝚗). And let
{m_i}_{i=1}^{𝚗} be a set of MLPs, each of which has L layers. The decoder stack D is the
composition of the corresponding 𝚗 blocks {B(H_i, m_i)}_{i=1}^{𝚗}:

(4.3)    D : R^{n×d} --B(H_1,m_1)--> R^{n×d} --B(H_2,m_2)--> ⋯ --B(H_{𝚗−1},m_{𝚗−1})--> R^{n×d} --B(H_𝚗,m_𝚗)--> R^{n×d}.
4.3. The Full Transformer. Now, given a matrix t ∈ R^{n×n_vocab} (of one-hot tokens as
described in (3†) of Section 1.2), we will define the full transformer, which we will denote
by T, as first acting on t via the embedding (together with the positional encoding), then
via the decoder stack, and then finally via the unembedding. So:

(4.4)    T : R^{n×n_vocab} --(Id ⊗ W_E)--> R^{n×d} --D--> R^{n×d} --(Id ⊗ W_U)--> R^{n×n_vocab},

where the three maps are the embedding, the decoder stack and the unembedding respectively.
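Finally, here is a compact NumPy sketch wiring the pieces of (4.4) together: embedding, a stack of residual blocks (each a masked single-head attention layer followed by a feedforward layer) and unembedding. All sizes and the random, untrained weights are illustrative assumptions, and the positional encoding, layer normalisation and 1/√d_h scaling used in practice are omitted, as they are not part of the framework above.

import numpy as np

rng = np.random.default_rng(6)
n_vocab, d, n, d_h, n_blocks, hidden = 50, 16, 6, 4, 2, 32

def masked_softmax_rows(A):
    A = np.where(np.tril(np.ones_like(A)) > 0, A, -np.inf)
    Z = np.exp(A - A.max(axis=-1, keepdims=True))
    return Z / Z.sum(axis=-1, keepdims=True)

def make_block():
    # one residual block B(H, m) with a single attention head and a depth-2 MLP
    p = {"W_q": rng.normal(size=(d_h, d)), "W_k": rng.normal(size=(d_h, d)),
         "W_v": rng.normal(size=(d_h, d)), "W_o": rng.normal(size=(d, d_h)),
         "W1": rng.normal(size=(hidden, d)), "b1": rng.normal(size=(hidden,)),
         "W2": rng.normal(size=(d, hidden)), "b2": rng.normal(size=(d,))}
    def block(X):
        W_qk, W_ov = p["W_q"].T @ p["W_k"], p["W_o"] @ p["W_v"]
        pattern = masked_softmax_rows(X @ W_qk @ X.T)          # (3.3)
        X = X + pattern @ X @ W_ov.T                           # attention layer (one head)
        m = np.tanh(X @ p["W1"].T + p["b1"]) @ p["W2"].T + p["b2"]
        return X + m                                           # feedforward layer
    return block

W_E = rng.normal(size=(d, n_vocab))                            # embedding, d x n_vocab
W_U = rng.normal(size=(n_vocab, d))                            # unembedding, n_vocab x d
blocks = [make_block() for _ in range(n_blocks)]

def transformer(token_ids):
    t = np.eye(n_vocab)[token_ids]                             # one-hot matrix t, n x n_vocab
    X = t @ W_E.T                                              # embedding (1.1)
    for block in blocks:                                       # decoder stack (4.3)
        X = block(X)
    return X @ W_U.T                                           # unembedding (1.2)

logits = transformer(np.array([3, 17, 42, 3, 9, 7]))
print(logits.shape)                                            # (6, 50)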

Notation. Note that in the literature, the notation n_layers is often used where we have
used the ‘typewriter’ font 𝚗.

References
[1] Mary Phuong and Marcus Hutter. Formal algorithms for transformers. arXiv preprint
arXiv:2207.09238, 2022.
