“Andriy has this almost supernatural talent for shrinking epic AI concepts down to bite-sized, ‘Ah, now I get it!’ moments.”
― Jorge Torres, CEO at MindsDB
“Andriy paints for us, in 100 marvelous strokes, the journey from linear algebra basics to the implementation of transformers.”
― Florian Douetteau, Co-founder and CEO at Dataiku
Featuring a foreword by Tomáš Mikolov and back cover text by Vint Cerf
The Hundred-Page Language Models Book
Andriy Burkov
Copyright © 2025 Andriy Burkov. All rights reserved.
1. Read First, Buy Later: You are welcome to freely read and share this book with others by
preserving this copyright notice. However, if you find the book valuable or continue to use it,
you must purchase your own copy. This ensures fairness and supports the author.
2. No Unauthorized Use: No part of this work—its text, structure, or derivatives—may be used
to train artificial intelligence or machine learning models, nor to generate any content on
websites, apps, or other services, without the author’s explicit written consent. This restriction
applies to all forms of automated or algorithmic processing.
3. Permission Required: If you operate any website, app, or service and wish to use any portion of this work for the purposes mentioned above—or for any other use beyond personal reading—you must first obtain the author’s explicit written permission. No exceptions or implied licenses are granted.
4. Enforcement: Any violation of these terms is copyright infringement. It may be pursued legally in any jurisdiction. By reading or distributing this book, you agree to abide by these conditions.
ISBN 978-1-7780427-2-0
Publisher: True Positive Inc.
To my family, with love
“Language is the source of misunderstandings.”
―Antoine de Saint-Exupéry, The Little Prince
“In mathematics you don't understand things. You just get used to them.”
―John von Neumann
Foreword
The first time I got involved in language modeling was already two decades ago. I wanted to improve some of my data compression algorithms and found out about n-gram statistics. A very simple concept, but so hard to beat! Then I quickly gained another motivation—since my childhood, I had been interested in artificial intelligence. I had a vision of machines that would understand patterns in our world that are hidden from our limited minds. It would be so exciting to talk with such a superintelligence. And I realized that language modeling could be a way towards such AI.

I started searching for others sharing this vision and found the works of Solomonoff and Schmidhuber, and the Hutter Prize competition organized by Matt Mahoney. They all wrote about the AI-completeness of language modeling, and I knew I had to try to make it work. But the world was very different than it is today. Language modeling was considered a dead research direction, and I heard countless times that I should give up, as nothing would ever beat n-grams on large data.
I completed my master's thesis on neural language models, as these models were quite similar to what I had previously developed for data compression, and I believed that distributed representations, which could be applied to any language, were the right way to go. This infuriated a local linguist, who declared my ideas to be total nonsense, as language modeling had to be addressed from the linguistics point of view and each language had to be treated differently.

However, I did not give up and continued working on my vision of AI-complete language models. Just the summer before starting my PhD, I came up with the idea to generate text from these neural models. I was amazed by how much better this text was than text generated from n-gram models. That was the summer of 2007, and I quickly realized that the only person excited about this at the Brno University of Technology was actually me. But I did not give up anyway.
In the following years, I developed a number of algorithms to make neural language models more useful. To convince others of their qualities, I published the open-source toolkit RNNLM in 2010. It had the first-ever implementations of neural text generation, gradient clipping, dynamic evaluation, model adaptation (nowadays called fine-tuning), and other tricks such as hierarchical softmax and splitting infrequent words into subword units. However, the result I was most proud of was when I could demonstrate in my PhD thesis that neural language models not only beat n-grams on large datasets—something widely considered to be impossible at the time—but that the improvements actually kept increasing with the amount of training data. This happened for the first time after something like fifty years of language modeling research, and I still remember the disbelief on the faces of famous researchers when I showed them my work.
Fast forward some fifteen years, and I'm amazed by how much the world has changed. The mindset has completely flipped—what used to be an obscure technology in a dead research direction is now thriving and gets the attention of the CEOs of the largest companies in the world. Language models are everywhere today. With all this hype, I think it is more important than ever to actually understand this technology.

Young students who want to learn about language modeling are flooded with information. Thus, I was delighted when I learned about Andriy's project to write a short book of only one hundred pages that would cover some of the most important ideas. I think the book is a good start for anyone new to language modeling who aspires to improve on the state of the art—and if someone tells you that everything that could be invented in language modeling has already been discovered, don't believe it.
Preface
My interest in text began in the late 1990s during my teenage years, building dynamic websites
using Perl and HTML. This early experience with coding and organizing text into structured formats sparked my fascination with how text could be processed and transformed. Over the years, I
advanced to building web scrapers and text aggregators, developing systems to extract structured
data from webpages. The challenge of processing and understanding text led me to explore more
complex applications, including designing chatbots that could understand and address user needs.
The challenge of extracting meaning from words intrigued me. The complexity of the task only
fueled my determination to “crack” it, using every tool at my disposal—ranging from regular expressions and scripting languages to text classifiers and named entity recognition models.
The rise of large language models (LLMs) transformed everything. For the first time, computers
could converse with us fluently and follow verbal instructions with remarkable precision. However,
like any tool, their immense power comes with limitations. Some are easy to spot, but others are
more subtle, requiring deep expertise to handle properly. Attempting to build a skyscraper without
fully understanding your tools will only result in a pile of concrete and steel. The same holds true
for language models. Approaching large-scale text processing tasks or creating reliable products
for paying users requires precision and knowledge—guesswork simply isn’t an option.
• Production deployment: Topics like model serving, API development, scaling for high
traffic, monitoring, and cost optimization are not covered. The code examples focus on understanding the concepts rather than production readiness.
• Enterprise applications: This book won’t guide you through building commercial LLM
applications, handling user data, or integrating with existing systems.
Book Structure
To make this book engaging and to deepen the reader’s understanding, I decided to discuss language modeling as a whole, including approaches that are often overlooked in modern literature. While Transformer-based LLMs dominate the spotlight, earlier approaches like count-based methods and recurrent neural networks (RNNs) remain effective for some tasks.
Learning the math of the Transformer architecture may seem overwhelming for someone starting from scratch. By revisiting these foundational methods, my goal is to gradually build up the reader’s intuition and mathematical understanding, making the transition to modern Transformer architectures feel like a natural progression rather than an intimidating leap.
The book is divided into six chapters, progressing from fundamentals to advanced topics:
• Chapter 1 covers machine learning basics, including key concepts like AI, models, neural networks, and gradient descent. Even if you’re familiar with these topics, the chapter provides important foundations for understanding language models.
• Chapter 2 introduces language modeling fundamentals, exploring text representation methods like bag of words and word embeddings, as well as count-based language models and evaluation techniques.
• Chapter 3 focuses on recurrent neural networks, covering their implementation, training, and application as language models.
• Chapter 4 provides a detailed exploration of the Transformer architecture, including key components like self-attention, position embeddings, and practical implementation.
• Chapter 5 examines large language models (LLMs), discussing why scale matters, fine-tuning techniques, practical applications, and important considerations around hallucinations, copyright, and ethics.
• Chapter 6 concludes with further reading on advanced topics like mixture of experts, model compression, preference-based alignment, and vision language models, providing direction for continued learning.
Most chapters contain working code examples you can run and modify. While only essential code
appears in the book, complete code is available as Jupyter notebooks on the book’s website, with
notebooks referenced in relevant sections. All code in notebooks remains compatible with the latest
stable versions of Python, PyTorch, and other libraries.
The notebooks run on Google Colab, which at the time of writing offers free access to computing
resources including GPUs and TPUs. These resources, though, aren’t guaranteed and have usage
limits that may vary. Some examples might require extended GPU access, potentially involving
wait times for availability. If the free tier proves limiting, Colab’s pay-as-you-go option lets you
purchase compute credits for reliable GPU access. While these credits are relatively affordable by
North American standards, costs may be significant depending on your location.
For those familiar with the Linux command line, GPU cloud services provide another option
through pay-per-time virtual machines with one or more GPUs. The book’s wiki maintains current
information on free and paid notebook or GPU rental services.
Verbatim terms and blocks indicate code, code fragments, or code execution outputs. Bold terms
link to the book’s term index, and occasionally highlight algorithm steps.
In this book, we use pip3 to ensure the packages are installed for Python 3. On most modern
systems, you can use pip instead if it's already set up for Python 3.
Acknowledgements
The high quality of this book would be impossible without volunteer editors. I especially thank
Erman Sert, Viet Hoang Tran Duong, Alex Sherstinsky, Kelvin Sundli, and Mladen Korunoski for
their systematic contributions.
I am also grateful to Alireza Bayat Makou, Taras Shalaiko, Domenico Siciliani, Preethi Raju, Srikumar Sundareshwar, Mathieu Nayrolles, Abhijit Kumar, Giorgio Mantovani, Abhinav Jain, Steven Finkelstein, Ryan Gaughan, Ankita Guha, Harmanan Kohli, Daniel Gross, Kea Kohv, Marcus Oliveira, Tracey Mercier, Prabin Kumar Nayak, Saptarshi Datta, Gurgen R. Hayrapetyan, Sina Abdidizaji, Federico Raimondi Cominesi, Santos Salinas, Anshul Kumar, Arash Mirbagheri, Roman Stanek, Jeremy Nguyen, Efim Shuf, and Manoj Pillai for their help.
If this is your first time exploring language models, I envy you a little—it’s truly magical to discover
how machines learn to understand the world through natural language.
I hope you enjoy reading this book as much as I enjoyed writing it.
Now grab your tea or coffee, and let’s begin!
Chapter 1. Machine Learning Basics
This chapter starts with a brief overview of how artificial intelligence has evolved, explains what a
machine learning model is, and presents the four steps of the machine learning process. Then, it
covers some math basics like vectors and matrices, introduces neural networks, and wraps up with
optimization methods like gradient descent and automatic differentiation.
Between 1975 and 1980, and again between 1987 and 2000, AI went through two “winters” where enthusiasm and funding dropped. Research outcomes did not meet high hopes set by initial successes, so investors and policymakers lost confidence. Many projects were halted or slowed down, leading to a significant decline in AI research and development across academia and industry.
During the first AI winter, even the term “AI” became somewhat taboo. Many researchers rebranded their work as “informatics,” “knowledge-based systems,” or “pattern recognition” to avoid association with AI’s perceived failures.
Enthusiasm for AI has grown steadily since the early 1990s. Interest surged around 2012, particularly in machine learning, driven by advances in computational power, access to large datasets, and improvements in neural network algorithms and frameworks. These developments led to increased funding and a significant AI boom.
Although the focus of artificial intelligence research has evolved, the core goal remains the same:
to create methods that enable machines to solve problems previously considered solvable only by
humans. This is how the term will be used throughout this book.
The term machine learning was introduced in 1959 by Arthur Samuel. In his paper, “Some Studies
in Machine Learning Using the Game of Checkers,” he described it as “programming computers to
learn from experience.”
Early AI researchers primarily focused on symbolic methods and rule-based systems—an approach
later dubbed good old-fashioned AI (GOFAI)—but over time, the field increasingly embraced
machine learning approaches, with neural networks emerging as a particularly powerful technique.
Neural networks, inspired by the brain, aimed to learn patterns directly from examples. One pioneering model, together with an algorithm to train it, the Perceptron, was introduced by Frank Rosenblatt in 1958. It became a key step toward later advancements. The Perceptron defines a decision boundary, a line that separates examples of two classes (e.g., spam and not spam): examples falling on one side of the line are assigned to one class, and those on the other side to the other class.
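To make the idea concrete, here is a minimal sketch (not from the book) of how a Perceptron-style classifier produces a prediction; the feature values, weights, and bias are invented purely for illustration:

    # A Perceptron-style prediction: compute a weighted sum of the features plus a
    # bias, and assign the class based on which side of the decision boundary the
    # example falls on. In practice, the weights and bias are learned from data.
    def perceptron_predict(features, weights, bias):
        weighted_sum = sum(w * x for w, x in zip(weights, features)) + bias
        return 1 if weighted_sum >= 0 else 0  # 1 = "spam", 0 = "not spam"

    # Two made-up emails described by [number of links, number of exclamation marks]
    print(perceptron_predict([7, 12], weights=[0.9, 0.4], bias=-5.0))  # prints 1
    print(perceptron_predict([1, 0], weights=[0.9, 0.4], bias=-5.0))   # prints 0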
Decision trees and random forests represent important evolutionary steps in machine learning.
Decision trees, introduced in 1963 by John Sonquist and James Morgan and later advanced by
Ross Quinlan’s ID3 algorithm in 1986, split data into subsets through a tree-like structure. Each
node represents a question about the data, each branch is an answer, and each leaf provides a
prediction. While these models are easy to understand, they can struggle with overfitting, where
they adapt too closely to training data, reducing their ability to perform well on new, unseen data.
To address this limitation, Leo Breiman introduced the random forest algorithm in 2001. A random forest builds multiple decision trees using random subsets of data and combines their outputs. This approach improves predictive accuracy and reduces overfitting. Random forests remain widely used for their reliability and performance.
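As a brief aside (not part of the book's code), here is a minimal sketch of training a random forest with scikit-learn, assuming the library is installed; the tiny dataset is invented for illustration only:

    # A random forest: an ensemble of decision trees, each trained on a random
    # subset of the data, whose predictions are combined by majority vote.
    from sklearn.ensemble import RandomForestClassifier

    # Invented toy data: [area in m^2, bedrooms] -> 1 if "expensive", else 0
    X = [[150, 2], [200, 3], [260, 4], [80, 1], [120, 2], [300, 5]]
    y = [0, 1, 1, 0, 0, 1]

    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X, y)
    print(model.predict([[240, 3]]))  # e.g. [1]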
Support vector machines (SVMs), introduced in 1992 by Vladimir Vapnik and his colleagues,
were another significant step forward. SVMs identify the optimal hyperplane that separates data
points of different classes with the widest margin. The introduction of kernel methods allowed
SVMs to manage complex, non-linear patterns by mapping data into higher-dimensional spaces,
making it easier to find a suitable separating hyperplane. These advances made SVMs central to
machine learning research.
Today, machine learning is a subfield of AI focused on creating algorithms that learn from collections of examples. These examples can come from nature, be designed by humans, or be generated
by other algorithms. The process involves gathering a dataset and building a model from it, which
is then used to solve a problem.
1.2. Model
A model is typically represented by a mathematical equation:
𝑦 = 𝑓(𝑥)
Here, 𝑥 is the input, 𝑦 is the output, and 𝑓 represents a function of 𝑥. A function is a named rule
that describes how one set of values is related to another. Formally, a function 𝑓 maps inputs from
the domain to outputs in the codomain, ensuring each input has exactly one output. The function
uses a specific rule or formula to transform the input into the output.
In machine learning, the goal is to compile a dataset of examples and use them to build 𝑓, so
when 𝑓 is applied to a new, unseen 𝑥, it produces a 𝑦 that gives meaningful insight into 𝑥.
To estimate a house’s price based on its area, the dataset might include (area, price) pairs such as {(150, 200), (200, 600), …}. Here, the area is measured in m², and the price is in thousands of dollars.
Curly brackets denote a set. A set containing 𝑁 elements, ranging from 𝑥₁ to 𝑥_𝑁, is expressed as {𝑥ᵢ}_{𝑖=1}^𝑁.
Imagine we own a house with an area of 250 m² (about 2691 square feet). To find a function 𝑓 that returns a reasonable price for this house, testing every possible function is infeasible. Instead, we select a specific structure for 𝑓 and focus on functions that match this structure.
Let’s define the structure for 𝑓 as:

𝑓(𝑥) ≝ 𝑤𝑥 + 𝑏,    (1.1)

which describes a linear function of 𝑥. The formula 𝑤𝑥 + 𝑏 is a linear transformation of 𝑥.
The notation ≝ means “equals by definition” or “is defined as.”
For linear functions, determining 𝑓 requires only two values: 𝑤 and 𝑏. These are called the parameters or weights of the model.
In other texts, 𝑤 might be referred to as the slope, coefficient, or weight term. Similarly, 𝑏 may be called the intercept, constant term, or bias. In this book, we’ll stick to “weight” for 𝑤 and “bias” for 𝑏, as these terms are widely used in machine learning. When the meaning is clear, “parameters” and “weights” will be used interchangeably.
For instance, when 𝑤 = 2/3 and 𝑏 = 1, the linear function is shown below:
Here, the bias shifts the graph vertically, so the line crosses the 𝑦-axis at 𝑦 = 1. The weight determines the slope, meaning the line rises by 2 units for every 3 units it moves to the right.
Even with a simple model like 𝑓(𝑥) = 𝑤𝑥 + 𝑏, the parameters 𝑤 and 𝑏 can take infinitely many
values. To find the best ones, we need a way to measure optimality. A natural choice is to minimize
the average prediction error when estimating house prices from area. Specifically, we want 𝑓(𝑥) =
𝑤𝑥 + 𝑏 to generate predictions that match the actual prices as closely as possible.
Let our dataset be {(𝑥ᵢ, 𝑦ᵢ)}_{𝑖=1}^𝑁, where 𝑁 is the size of the dataset and {(𝑥₁, 𝑦₁), (𝑥₂, 𝑦₂), …, (𝑥_𝑁, 𝑦_𝑁)} are individual examples, with each 𝑥ᵢ being the input and the corresponding 𝑦ᵢ being the target.
When examples contain both inputs and targets, the learning process is called supervised. This
book focuses on supervised machine learning.
Other machine learning types include unsupervised learning, where models learn pat-
terns from inputs alone, and reinforcement learning, where models learn by interact-
ing with environments and receiving rewards or penalties for their actions.
When 𝑓(𝑥) is applied to 𝑥ᵢ, it generates a predicted value ŷᵢ. We can define the prediction error err(ŷᵢ, 𝑦ᵢ) for a given example (𝑥ᵢ, 𝑦ᵢ) as:

err(ŷᵢ, 𝑦ᵢ) ≝ (ŷᵢ − 𝑦ᵢ)²    (1.2)

This expression, called squared error, equals 0 when ŷᵢ = 𝑦ᵢ. This makes sense: there is no error if the predicted price matches the actual price. The further ŷᵢ deviates from 𝑦ᵢ, the larger the error becomes. Squaring ensures the error is always positive, whether the prediction overshoots or undershoots.
We define 𝑤* and 𝑏* as the optimal parameter values for 𝑤 and 𝑏 in our function 𝑓: the values that minimize the average price prediction error across our dataset. This error is calculated using the following expression:

(err(ŷ₁, 𝑦₁) + err(ŷ₂, 𝑦₂) + ⋯ + err(ŷ_𝑁, 𝑦_𝑁)) / 𝑁

Let’s rewrite the above expression by expanding each err(⋅):

((ŷ₁ − 𝑦₁)² + (ŷ₂ − 𝑦₂)² + ⋯ + (ŷ_𝑁 − 𝑦_𝑁)²) / 𝑁

Let’s assign the name 𝐽(𝑤, 𝑏) to our expression, turning it into a function. For our three-example dataset {(150, 200), (200, 600), (260, 500)}, with ŷᵢ = 𝑤𝑥ᵢ + 𝑏, this gives:

𝐽(𝑤, 𝑏) ≝ (1/3)[(150𝑤 + 𝑏 − 200)² + (200𝑤 + 𝑏 − 600)² + (260𝑤 + 𝑏 − 500)²]    (1.3)

Let’s plot it:
Navigate to the book’s wiki and, from the file thelmbook.com/py/1.1, retrieve the code used to generate the plot of 𝐽(𝑤, 𝑏). Run the code and rotate the resulting graph to observe the minimum.
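If you want to check the numbers by hand, here is a small sketch (separate from the book's thelmbook.com/py/1.1 notebook) that computes 𝐽(𝑤, 𝑏) for the three training examples:

    # Mean squared error loss J(w, b) over the three training examples.
    # Areas are in m^2 and prices in thousands of dollars, as in the running example.
    dataset = [(150, 200), (200, 600), (260, 500)]

    def J(w, b):
        return sum((w * x + b - y) ** 2 for x, y in dataset) / len(dataset)

    print(J(0.0, 0.0))      # loss for an arbitrary starting point
    print(J(2.58, -91.76))  # loss for the optimal parameters found below (~15403.19)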
Now we need to derive the expressions for ∂𝐽/∂𝑤 and ∂𝐽/∂𝑏. Notice that 𝐽(𝑤, 𝑏) is a composition of the following functions:
• Functions 𝑑₁ ≝ 150𝑤 + 𝑏 − 200, 𝑑₂ ≝ 200𝑤 + 𝑏 − 600, 𝑑₃ ≝ 260𝑤 + 𝑏 − 500 are linear functions of 𝑤 and 𝑏;
• Functions err₁ ≝ 𝑑₁², err₂ ≝ 𝑑₂², err₃ ≝ 𝑑₃² are quadratic functions of 𝑑₁, 𝑑₂, and 𝑑₃;
• Function 𝐽 ≝ (1/3)(err₁ + err₂ + err₃) is a linear function of err₁, err₂, and err₃.
A composition of functions means the output of one function becomes the input to another. For example, with two functions 𝑓 and 𝑔, you first apply 𝑔 to 𝑥, then apply 𝑓 to the result. This is written as 𝑓(𝑔(𝑥)), which means you calculate 𝑔(𝑥) first and then use that result as the input for 𝑓.
In our loss function 𝐽(𝑤, 𝑏), the process starts by computing the linear functions for 𝑑₁, 𝑑₂, and 𝑑₃ using the current values of 𝑤 and 𝑏. These outputs are then passed into the quadratic functions err₁, err₂, and err₃. The final step is averaging these results to compute 𝐽.
Using the sum rule and the constant multiple rule of differentiation, ∂𝐽/∂𝑤 is given by:

∂𝐽/∂𝑤 = (1/3)(∂err₁/∂𝑤 + ∂err₂/∂𝑤 + ∂err₃/∂𝑤),

where ∂err₁/∂𝑤, ∂err₂/∂𝑤, and ∂err₃/∂𝑤 are the partial derivatives of err₁, err₂, and err₃ with respect to 𝑤.
The sum rule of differentiation states that the derivative of the sum of two functions equals the sum of their derivatives: d/d𝑥 [𝑓(𝑥) + 𝑔(𝑥)] = d/d𝑥 𝑓(𝑥) + d/d𝑥 𝑔(𝑥).
The constant multiple rule of differentiation states that the derivative of a constant multiplied by a function equals the constant times the derivative of the function: d/d𝑥 [𝑐 ⋅ 𝑓(𝑥)] = 𝑐 ⋅ d/d𝑥 𝑓(𝑥).
By applying the chain rule of differentiation, the partial derivatives of err₁, err₂, and err₃ with respect to 𝑤 are:

∂err₁/∂𝑤 = 2𝑑₁ ⋅ ∂𝑑₁/∂𝑤 = 2 ⋅ (150𝑤 + 𝑏 − 200) ⋅ 150,
∂err₂/∂𝑤 = 2𝑑₂ ⋅ ∂𝑑₂/∂𝑤 = 2 ⋅ (200𝑤 + 𝑏 − 600) ⋅ 200,
∂err₃/∂𝑤 = 2𝑑₃ ⋅ ∂𝑑₃/∂𝑤 = 2 ⋅ (260𝑤 + 𝑏 − 500) ⋅ 260.

The chain rule of differentiation states that the derivative of a composite function 𝑓(𝑔(𝑥)), written as d/d𝑥 [𝑓(𝑔(𝑥))], is the product of the derivative of 𝑓 with respect to 𝑔 and the derivative of 𝑔 with respect to 𝑥: d/d𝑥 [𝑓(𝑔(𝑥))] = d𝑓/d𝑔 ⋅ d𝑔/d𝑥.

Then,

∂𝐽/∂𝑤 = (1/3)(2 ⋅ 150 ⋅ (150𝑤 + 𝑏 − 200) + 2 ⋅ 200 ⋅ (200𝑤 + 𝑏 − 600) + 2 ⋅ 260 ⋅ (260𝑤 + 𝑏 − 500))

Therefore,

∂𝐽/∂𝑤 = (1/3)(260200𝑤 + 1220𝑏 − 560000)
Similarly, we find ∂𝐽/∂𝑏:

∂𝐽/∂𝑏 = (1/3)(2 ⋅ (150𝑤 + 𝑏 − 200) + 2 ⋅ (200𝑤 + 𝑏 − 600) + 2 ⋅ (260𝑤 + 𝑏 − 500))
       = (1/3)(1220𝑤 + 6𝑏 − 2600)
Setting the partial derivatives to 0 results in the following system of equations:

(1/3)(260200𝑤 + 1220𝑏 − 560000) = 0
(1/3)(1220𝑤 + 6𝑏 − 2600) = 0

Simplifying the system and using substitution to solve for the variables gives the optimal values: 𝑤* = 2.58 and 𝑏* = −91.76.
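As a quick check (not part of the book's code), the same values can be obtained by solving the simplified 2×2 linear system with NumPy, assuming NumPy is installed:

    # Solve the system obtained by setting the partial derivatives to zero:
    #   260200*w + 1220*b = 560000
    #     1220*w +    6*b =   2600
    import numpy as np

    A = np.array([[260200.0, 1220.0],
                  [1220.0, 6.0]])
    c = np.array([560000.0, 2600.0])

    w_star, b_star = np.linalg.solve(A, c)
    print(round(w_star, 2), round(b_star, 2))  # approximately 2.58 -91.76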
The resulting model 𝑓(𝑥) = 2.58𝑥 − 91.76 is shown in the plot below. It includes the three examples (blue dots), the model itself (red solid line), and a prediction for a new house with an area of 240 m² (dotted orange lines).
A vertical blue dashed line shows the square root of the model’s prediction error compared to the actual price.¹ Smaller errors mean the model fits the data better. The loss, which aggregates these errors, measures how well the model aligns with the dataset.
When we calculate the loss using our model’s training dataset (called the training set), we obtain the training loss. For our model, this training loss is defined by Equation 1.3. Using our learned parameter values, we can now compute the loss for the training set:

𝐽(2.58, −91.76) = [(2.58 ⋅ 150 − 91.76 − 200)² + (2.58 ⋅ 200 − 91.76 − 600)² + (2.58 ⋅ 260 − 91.76 − 500)²] / 3
                = 15403.19.
The square root of this value is approximately 124.1, indicating an average prediction error of around $124,100. The interpretation of whether a loss value is high or low depends on the specific business context and comparative benchmarks. Neural networks and other non-linear models, which we explore later in this chapter, typically achieve lower loss values.

¹ It’s the square root of the error because our error, as defined in Equation 1.2, is the square of the difference between the predicted price and the real price of the house. It’s common practice to take the square root of the mean squared error because it expresses the error in the same units as the target variable (price in this case). This makes it easier to interpret the error value.
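Tying the numbers together, here is a small sketch (mine, not the book's) that recomputes the training loss and its square root with the learned parameters, and prices the new 240 m² house:

    import math

    # Learned parameters from the worked example
    w_star, b_star = 2.58, -91.76

    dataset = [(150, 200), (200, 600), (260, 500)]
    train_loss = sum((w_star * x + b_star - y) ** 2 for x, y in dataset) / len(dataset)
    print(round(train_loss, 2))             # ~15403.19
    print(round(math.sqrt(train_loss), 1))  # ~124.1, i.e. about $124,100 on average

    # Prediction for a new 240 m^2 house (price in thousands of dollars)
    print(round(w_star * 240 + b_star, 2))  # ~527.44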
1.4. Vector
To predict a house price, knowing its area alone isn’t enough. Factors like the year of construction
or the number of bedrooms and bathrooms also matter. Suppose we use two attributes: (1) area
and (2) number of bedrooms. In this case, the input 𝐱 becomes a feature vector. This vector includes two features, also called dimensions or components:

𝐱 ≝ [𝑥^(1), 𝑥^(2)]

In this book, vectors are represented with lowercase bold letters, such as 𝐱 or 𝐰. For a given house 𝐱, 𝑥^(1) represents its size in square meters, and 𝑥^(2) represents the number of bedrooms.
The dimensionality of the vector, or its size, refers to the number of components it contains. Here,
𝐱 has two components, so its dimensionality is 2.
With two features, our linear model needs three parameters: the weights 𝑤^(1) and 𝑤^(2), and the bias 𝑏. The weights can be grouped into a vector:

𝐰 ≝ [𝑤^(1), 𝑤^(2)]
The linear model can then be written compactly as:
𝑦 = 𝐰 ⋅ 𝐱 + 𝑏, (1.4)
where 𝐰 ⋅ 𝐱 is a dot product of two vectors (also known as scalar product). It is defined as:
𝐰 ⋅ 𝐱 ≝ ∑_{𝑗=1}^{𝐷} 𝑤^(𝑗) 𝑥^(𝑗)
The dot product combines two vectors of the same dimensionality to produce a scalar, a number
like 22, 0.67, or −10.5. Scalars in this book are denoted by italic lowercase or uppercase letters,
such as 𝑥 or 𝐷. The expression 𝐰 ⋅ 𝐱 + 𝑏 generalizes the idea of a linear transformation to vectors.
The equation above uses capital-sigma notation, where 𝐷 represents the dimensionality of the input, and 𝑗 runs from 1 to 𝐷. For example, in the 2-dimensional house scenario, ∑_{𝑗=1}^{2} 𝑤^(𝑗) 𝑥^(𝑗) = 𝑤^(1)𝑥^(1) + 𝑤^(2)𝑥^(2).
Although the capital-sigma notation suggests the dot product might be implemented as a loop, modern computers handle it much more efficiently. Optimized linear algebra libraries like BLAS and cuBLAS compute the dot product using low-level, highly optimized methods. These libraries leverage hardware acceleration and parallel processing, achieving speeds far beyond a simple loop.
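As a small illustration (not from the book), here is the dot product written both as an explicit loop mirroring the capital-sigma definition and as a vectorized NumPy call, which dispatches to an optimized backend:

    import numpy as np

    w = [0.5, 2.0]    # weights, e.g. for area and number of bedrooms
    x = [150.0, 2.0]  # features of one house

    # Explicit loop, following the definition term by term
    dot_loop = sum(w_j * x_j for w_j, x_j in zip(w, x))

    # Vectorized version backed by an optimized linear algebra library
    dot_vec = np.dot(np.array(w), np.array(x))

    print(dot_loop, dot_vec)  # both print 79.0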
The sum of two vectors 𝐚 and 𝐛, both with the same dimensionality 𝐷, is defined as:

𝐚 + 𝐛 ≝ (𝑎^(1) + 𝑏^(1), 𝑎^(2) + 𝑏^(2), …, 𝑎^(𝐷) + 𝑏^(𝐷))
The calculation for a sum of two 3-dimensional vectors is illustrated below:
In this chapter’s illustrations, the numbers in the cells indicate the position of an element within an input or output matrix, or a vector. They do not represent actual values.
The norm of a vector 𝐱, denoted ‖𝐱‖, represents its length or magnitude. It is defined as the square root of the sum of the squares of its components:

‖𝐱‖ ≝ √(∑_{𝑗=1}^{𝐷} (𝑥^(𝑗))²)
A unit vector is a vector whose norm equals 1; any non-zero vector can be turned into a unit vector by dividing it by its norm.
Unit vectors are valuable because their dot product equals the cosine of the angle between them, and computing dot products is efficient. When documents are represented as unit vectors, finding similar ones becomes fast by calculating the dot product between the query vector and document vectors. This is how vector search engines and libraries like Milvus, Qdrant, and Weaviate operate.
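To make this concrete, here is a brief sketch (mine, not the book's) that normalizes two made-up document vectors and uses their dot product as a cosine similarity:

    import numpy as np

    def to_unit_vector(v):
        # Divide a vector by its norm so that the result has length 1
        return v / np.linalg.norm(v)

    # Two invented document vectors (e.g., simple feature counts)
    doc_a = to_unit_vector(np.array([3.0, 1.0, 0.0]))
    doc_b = to_unit_vector(np.array([2.0, 2.0, 1.0]))

    # For unit vectors, the dot product equals the cosine of the angle between them
    print(round(float(np.dot(doc_a, doc_b)), 3))  # ~0.843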
As dimensions increase, the number of parameters in a linear model becomes too large to solve manually. Furthermore, in high-dimensional spaces, we cannot visually verify if data follows a linear pattern. Even if we could visualize beyond three dimensions, we would still need more flexible models to handle data that linear models cannot fit.
The next section explores non-linear models, with a focus on neural networks—the foundation for
understanding large language models, which are a specialized neural network architecture.
To overcome this, we introduce non-linearity. For a one-dimensional input, the model becomes:
𝑦 = 𝜙(𝑤𝑥 + 𝑏)
The function 𝜙 is a fixed non-linear function, known as an activation. Common choices are:
1) ReLU (rectified linear unit): ReLU(𝑧) ≝ max(0, 𝑧), which outputs non-negative values and is widely used in neural networks;
2) Sigmoid: 𝜎(𝑧) ≝ 1 / (1 + 𝑒^(−𝑧)), which outputs values between 0 and 1, making it suitable for binary classification (e.g., classifying spam emails as 1 and non-spam as 0);
3) Tanh (hyperbolic tangent): tanh(𝑧) ≝ (𝑒^𝑧 − 𝑒^(−𝑧)) / (𝑒^𝑧 + 𝑒^(−𝑧)), which outputs values between −1 and 1.
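For reference, here is a minimal sketch (not from the book) of the three activations implemented directly with PyTorch tensor operations:

    import torch

    z = torch.tensor([-2.0, 0.0, 3.0])

    relu = torch.clamp(z, min=0.0)         # max(0, z)
    sigmoid = 1.0 / (1.0 + torch.exp(-z))  # squashes values into (0, 1)
    tanh = (torch.exp(z) - torch.exp(-z)) / (torch.exp(z) + torch.exp(-z))  # (-1, 1)

    print(relu)     # tensor([0., 0., 3.])
    print(sigmoid)  # ~tensor([0.1192, 0.5000, 0.9526])
    print(tanh)     # ~tensor([-0.9640, 0.0000, 0.9951])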
The structure 𝜙(𝑤𝑥 + 𝑏) enables learning non-linear models but can’t capture all non-linear curves. By nesting these functions, we build more expressive models. For instance, let 𝑓₁(𝑥) ≝ 𝜙(𝑎𝑥 + 𝑏) and 𝑓₂(𝑧) ≝ 𝜙(𝑐𝑧 + 𝑑). A composite model combining 𝑓₁ and 𝑓₂ is:

𝑦 = 𝑓₂(𝑓₁(𝑥)) = 𝜙(𝑐𝜙(𝑎𝑥 + 𝑏) + 𝑑)

Here, the input 𝑥 is first transformed linearly using parameters 𝑎 and 𝑏, then passed through the non-linear function 𝜙. The result is further transformed linearly with parameters 𝑐 and 𝑑, followed by another application of 𝜙.
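To see the nesting in code, here is a tiny sketch (mine, with arbitrary parameter values) of the composite model, using the sigmoid as 𝜙:

    import torch

    def phi(z):
        return torch.sigmoid(z)  # the fixed non-linear activation

    # Arbitrary illustrative parameters; in practice they are learned from data
    a, b = 1.5, -0.5  # first unit
    c, d = 2.0, 0.1   # second unit

    def composite_model(x):
        hidden = phi(a * x + b)     # f1(x) = phi(a*x + b)
        return phi(c * hidden + d)  # f2(f1(x)) = phi(c*phi(a*x + b) + d)

    print(composite_model(torch.tensor(0.8)))  # a value in (0, 1), roughly 0.81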
Below is the graph representation of the composite model 𝑦 = 𝑓₂(𝑓₁(𝑥)):
A computational graph represents the structure of a model. The computational graph above
shows two non-linear units (blue rectangles), often referred to as artificial neurons. Each unit
contains two trainable parameters—a weight and a bias—represented by grey circles. The left arrow ← denotes that the value on the right is assigned to the variable on the left. This graph illustrates a basic neural network with two layers, each containing one unit. Most neural networks in
practice are built with more layers and multiple units per layer.
Suppose we have a two-dimensional input, an input layer with three units, and an output layer
with a single unit. The computational graph appears as follows:
This structure represents a feedforward neural network (FNN), where information flows in one
direction—left to right—without loops. When units in each layer connect to all units in the subsequent layer, as shown above, we call it a multilayer perceptron (MLP). A layer where each unit
connects to all units in both adjacent layers is termed a fully connected layer, or dense layer.
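As an illustration (not the book's code), the network just described (a two-dimensional input, a layer of three units, and an output layer with a single unit) could be written in PyTorch roughly like this; the choice of activation is an assumption, since the description does not fix one:

    import torch
    import torch.nn as nn

    # A small MLP: 2 input features -> 3 hidden units -> 1 output unit.
    # Every layer here is fully connected (dense).
    mlp = nn.Sequential(
        nn.Linear(2, 3),  # layer with three units
        nn.ReLU(),        # non-linearity applied by each of the three units
        nn.Linear(3, 1),  # output layer with a single unit
        nn.Sigmoid(),     # output unit's activation (illustrative choice)
    )

    x = torch.tensor([[250.0, 3.0]])  # one example with two features
    print(mlp(x))                     # a single (untrained, random) output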
In Chapter 3, we will explore recurrent neural networks (RNNs). Unlike FNNs, RNNs have loops,
where outputs from a layer are used as inputs to the same layer.
Convolutional neural networks (CNNs) are feedforward neural networks with convolutional layers that are not fully connected. While initially designed for image processing, they are effective for tasks like document classification in text data. To learn more about CNNs, refer to the additional materials in the book’s wiki.
To simplify diagrams, individual neural units can be replaced with squares. Using this approach,
the above network can be represented more compactly as follows: