Chapter 1. Machine Learning Basics


This chapter introduces the fundamental concepts of machine learning. Starting with the evolution
of artificial intelligence and machine learning, it defines a model and presents the four-step
machine learning process. The chapter then covers essential mathematical foundations, including
vectors and matrices, before examining neural networks. It concludes with key optimization
techniques, focusing on gradient descent and automatic differentiation.

1.1. AI and Machine Learning


The term “artificial intelligence” (AI) was first introduced in 1955 during a workshop led by John
McCarthy, focusing on exploring how machines could use language, form concepts, solve problems
like humans, and improve over time. Building on these ideas, Joseph Weizenbaum developed the
first chatbot, ELIZA, in 1966. ELIZA simulated conversations by detecting patterns in user input and
replying with preprogrammed responses, giving the impression of understanding.

In AI’s early years, researchers were overly optimistic about achieving human-level intelligence. In
1965, Herbert Simon, a Turing Award recipient, predicted that “machines will be capable, within
twenty years, of doing any work a man can do.” However, progress was slower than expected,
leading to periods of reduced funding and interest, known as “AI winters.”

Interestingly, since the 1950s, experts have consistently predicted that human-level AI would be
achieved in about 25 years.

Two major AI winters occurred in 1974–1980 and 1987–2000. These periods were marked by
notable setbacks, including the failure of machine translation in 1966, poor outcomes from
DARPA’s Speech Understanding Research program at Carnegie Mellon (1971–1975), and reduced
AI funding in the UK after the 1973 Lighthill report. In the 1990s, many expert systems—computer
programs that simulated human decision-making using predefined rules and domain-specific
logic—were abandoned due to high costs and limited success.

During the first AI winter, even the term “AI” became somewhat taboo. Many
researchers rebranded their work as “informatics,” “knowledge-based systems,” or
“pattern recognition” to avoid association with AI’s perceived failures.

Enthusiasm for AI has grown steadily since the early 1990s. Interest surged around 2012,
particularly in machine learning, driven by advances in computational power, access to large
datasets, and improvements in neural network algorithms and frameworks. These developments
led to increased funding and a significant AI boom.

Although the focus of artificial intelligence research has evolved, the core goal remains the same:
to create methods that enable machines to solve problems previously considered solvable only by
humans. This is how the term will be used throughout this book.

The term “machine learning” was introduced in 1959 by Arthur Samuel. In his paper, “Some Studies
in Machine Learning Using the Game of Checkers,” he described it as “programming computers to
learn from experience.”

Early AI researchers primarily focused on symbolic methods and rule-based systems—an approach
later dubbed good old-fashioned AI (GOFAI)—but over time, the field increasingly embraced
machine learning approaches, with neural networks emerging as a particularly powerful technique.

Neural networks, inspired by the brain, aimed to learn patterns directly from examples. One
foundational model, the perceptron, was introduced by Frank Rosenblatt in 1958. It became a key
step toward later advancements. The perceptron defines a decision boundary, a line that separates
examples of two classes (e.g., spam and not spam).

Decision trees and random forests represent important evolutionary steps in machine learning.
Decision trees, introduced in the 1960s and later advanced by Ross Quinlan’s ID3 algorithm in
1986, split data into subsets through a tree-like structure. Each node represents a question about
the data, each branch is an answer, and each leaf provides a prediction. While these models are easy
to understand, they can struggle with overfitting, where they adapt too closely to training data,
reducing their ability to perform well on new, unseen data.

To address this limitation, Leo Breiman introduced the random forest algorithm in 2001. A random
forest builds multiple decision trees using random subsets of data and combines their outputs. This
approach improves predictive accuracy and reduces overfitting. Random forests remain widely
used for their reliability and performance.

Support vector machines (SVMs), introduced in the 1990s by Vladimir Vapnik and his colleagues,
were another significant step forward. SVMs identify the optimal hyperplane that separates data
points of different classes with the widest margin. The introduction of kernel methods allowed
SVMs to manage complex, non-linear patterns by mapping data into higher-dimensional spaces,
making it easier to find a suitable separating hyperplane. These advances made SVMs central to
machine learning research.

Today, machine learning is a subfield of AI focused on creating algorithms that learn from
collections of examples. These examples can come from nature, be designed by humans, or be
generated by other algorithms. The process involves gathering a dataset and building a model from
it, which is then used to solve the problem.

I will use “learning” and “machine learning” interchangeably to save keystrokes.

1.2. Model
A model is typically represented by a mathematical equation:

𝑦 = 𝑓(𝑥)

Here, 𝑥 is the input, 𝑦 is the output, and 𝑓 represents a function of 𝑥. A function is a named rule
that describes how one set of values is related to another. Formally, a function 𝑓 maps inputs from
the domain to outputs in the codomain, ensuring each input has exactly one output. The function
uses a specific rule or formula to transform the input into the output.

In machine learning, the goal is to compile a dataset of examples and use them to build 𝑓, so when
𝑓 is applied to a new, unseen 𝑥, it produces a 𝑦 that gives meaningful insight into 𝑥.

To predict a house’s price based on its area, the dataset might include (area, price) pairs such as
{(150,200), (200,600), … }. Here, the area is measured in m², and the price is in thousands of dollars.

Curly brackets denote a set. A set containing 𝑁 elements, ranging from 𝑥1 to 𝑥𝑁, is
expressed as {𝑥𝑖}ᵢ₌₁ᴺ.

Imagine we own a house with an area of 250 m² (about 2691 square feet). To derive a function 𝑓
that provides a reasonable price for this house, testing every possible function is infeasible. Instead,
we select a specific structure for 𝑓 and focus on functions that match this structure.

Let’s define the structure for 𝑓 as:

𝑓(𝑥) ≝ 𝑤𝑥 + 𝑏,    (1.1)

which describes a linear function of 𝑥. The formula 𝑤𝑥 + 𝑏 is a linear transformation of 𝑥.

The notation ≝ means “equals by definition” or “is defined as.”

For linear functions, determining 𝑓 requires only two values: 𝑤 and 𝑏. These are called the
parameters or weights of the model.

In other texts, 𝑤 might be referred to as the slope, coefficient, or weight term. Similarly, 𝑏 may be
called the intercept, constant term, or bias. In this book, we’ll stick to “weight” for 𝑤 and “bias”
for 𝑏, as these terms are widely used in machine learning. When the meaning is clear, “parameters”
and “weights” will be used interchangeably.
For instance, take 𝑤 = 2/3 and 𝑏 = 1.

Here, the bias shifts the graph vertically, so the line crosses the 𝑦-axis at 𝑦 = 1. The weight
determines the slope, meaning the line rises by 2 units for every 3 units it moves to the right.

Mathematically, the function 𝑓(𝑥) = 𝑤𝑥 + 𝑏 is an affine transformation, rather than
a linear one, since a true linear transformation requires 𝑏 = 0. In machine learning,
however, we often call such models “linear” as long as the parameters appear linearly
in the equation. This means 𝑤 and 𝑏 are only multiplied by the inputs or constants and
added—they don’t multiply each other, get raised to powers, or appear inside functions
like sin(𝑤) or 𝑒^𝑏.

Even with a simple model like 𝑓(𝑥) = 𝑤𝑥 + 𝑏, the parameters 𝑤 and 𝑏 can take infinitely many
values. To find the optimal ones, we need an optimality criterion. A reasonable choice is to minimize
the average error when predicting house prices based on area. In this case, we aim for 𝑓(𝑥) = 𝑤𝑥 +
𝑏 to make predictions as close as possible to the actual prices.

Let our dataset be {(𝑥𝑖, 𝑦𝑖)}ᵢ₌₁ᴺ, where 𝑁 is the size of the dataset and (𝑥1, 𝑦1), (𝑥2, 𝑦2), …, (𝑥𝑁, 𝑦𝑁)
are individual examples. In machine learning, 𝑥𝑖 is called the input, and 𝑦𝑖 is the target. When every
example includes both an input and a target, the learning process is known as supervised. This
book’s focus is supervised machine learning.

Other machine learning types include unsupervised learning, where models learn
patterns from inputs alone, and reinforcement learning, where models learn by
interacting with environments and receiving rewards or penalties for their actions.

When 𝑓(𝑥) is applied to 𝑥𝑖 , it generates a predicted value 𝑦̃𝑖 . We can define the error err(𝑦̃𝑖 , 𝑦𝑖 ) for
a given example (𝑥𝑖 , 𝑦𝑖 ) as:
err(𝑦̃𝑖, 𝑦𝑖) ≝ (𝑦̃𝑖 − 𝑦𝑖)²    (1.2)

This expression, called the squared error, equals 0 when 𝑦̃𝑖 = 𝑦𝑖 . This makes sense: there’s no
error if the predicted price matches the actual price. The further 𝑦̃𝑖 deviates from 𝑦𝑖 , the larger the
error becomes. Squaring ensures the error is always positive, whether the prediction overshoots
or undershoots.

We define 𝑤 ∗ and 𝑏∗ as the optimal parameter values for 𝑓, which minimize the average price
prediction error for the dataset using the following expression:

(err(𝑦̃1, 𝑦1) + err(𝑦̃2, 𝑦2) + ⋯ + err(𝑦̃𝑁, 𝑦𝑁)) / 𝑁

Let’s rewrite the above expression by expanding each err(⋅):

((𝑦̃1 − 𝑦1)² + (𝑦̃2 − 𝑦2)² + ⋯ + (𝑦̃𝑁 − 𝑦𝑁)²) / 𝑁

Let’s assign the name 𝐽(𝑤, 𝑏) to our expression, turning it into a function:

𝐽(𝑤, 𝑏) ≝ ((𝑤𝑥1 + 𝑏 − 𝑦1)² + (𝑤𝑥2 + 𝑏 − 𝑦2)² + ⋯ + (𝑤𝑥𝑁 + 𝑏 − 𝑦𝑁)²) / 𝑁    (1.3)

In the equation defining 𝐽(𝑤, 𝑏), which represents the average prediction error, the values of 𝑥𝑖 and
𝑦𝑖 for each 𝑖 from 1 to 𝑁 are known since they come from the dataset. The unknowns are 𝑤 and 𝑏.
To determine the optimal 𝑤∗ and 𝑏∗, we need to minimize 𝐽(𝑤, 𝑏). As this function is a convex
quadratic in two variables, it has a single minimum.

The expression in Equation 1.3 is referred to as the loss function in the machine learning problem
of linear regression. In this case, the loss function is the mean squared error, or MSE.

To find the optimum (minimum or maximum) of a function, we calculate its first derivative. When
we reach the optimum, the first derivative equals zero. For functions of two or more variables, like
the loss function 𝐽(𝑤, 𝑏), we compute partial derivatives with respect to each variable. We denote
these as ∂𝐽/∂𝑤 for 𝑤 and ∂𝐽/∂𝑏 for 𝑏.

To determine 𝑤∗ and 𝑏∗, we solve the following system of two equations:


∂𝐽/∂𝑤 = 0,
∂𝐽/∂𝑏 = 0
Fortunately, the mean squared error function’s structure and the model’s linearity allow us to solve
this system of equations analytically. To illustrate, consider a dataset with three examples:
(𝑥1 , 𝑦1 ) = (150,200), (𝑥2 , 𝑦2 ) = (200,600), and (𝑥3 , 𝑦3 ) = (260,500). For this dataset, the loss
function is:

𝐽(𝑤, 𝑏) ≝ ((150𝑤 + 𝑏 − 200)² + (200𝑤 + 𝑏 − 600)² + (260𝑤 + 𝑏 − 500)²) / 3
Let’s plot it:

Navigate to the book’s wiki and, from the file thelmbook.com/py/1.1, retrieve the code
used to generate a plot of this loss surface; run the code and rotate the graph in 3D to
better observe the minimum.

Now we need to derive the expressions for ∂𝐽/∂𝑤 and ∂𝐽/∂𝑏. Notice that 𝐽(𝑤, 𝑏) is a composition of
the following functions:

• Functions 𝑑1 ≝ 150𝑤 + 𝑏 − 200, 𝑑2 ≝ 200𝑤 + 𝑏 − 600, and 𝑑3 ≝ 260𝑤 + 𝑏 − 500 are
linear functions of 𝑤 and 𝑏;
• Functions err1 ≝ 𝑑1², err2 ≝ 𝑑2², and err3 ≝ 𝑑3² are quadratic functions of 𝑑1, 𝑑2, and 𝑑3;
• Function 𝐽 ≝ (1/3)(err1 + err2 + err3) is a linear function of err1, err2, and err3.

A composition of functions means the output of one function becomes the input to
another. For example, with two functions 𝑓 and 𝑔, you first apply 𝑔 to 𝑥, then apply 𝑓
to the result. This is written as 𝑓(𝑔(𝑥)), which means you calculate 𝑔(𝑥) first and then
use that result as the input for 𝑓.

In our loss function 𝐽(𝑤, 𝑏), the process starts by computing the linear functions for 𝑑1 , 𝑑2 , and 𝑑3
using the current values of 𝑤 and 𝑏. These outputs are then passed into the quadratic functions
err1, err2, and err3. The final step is averaging these results to compute 𝐽.
Using the sum rule and the constant multiple rule of differentiation, ∂𝐽/∂𝑤 is given by:

∂𝐽/∂𝑤 = (1/3)(∂err1/∂𝑤 + ∂err2/∂𝑤 + ∂err3/∂𝑤),

where ∂err1/∂𝑤, ∂err2/∂𝑤, and ∂err3/∂𝑤 are the partial derivatives of err1, err2, and err3 with
respect to 𝑤.

The sum rule of differentiation states that the derivative of the sum of two functions
equals the sum of their derivatives: ∂/∂𝑥 [𝑓(𝑥) + 𝑔(𝑥)] = ∂𝑓(𝑥)/∂𝑥 + ∂𝑔(𝑥)/∂𝑥.
The constant multiple rule of differentiation states that the derivative of a constant
multiplied by a function equals the constant times the derivative of the function:
∂/∂𝑥 [𝑐 ⋅ 𝑓(𝑥)] = 𝑐 ⋅ ∂𝑓(𝑥)/∂𝑥.

By applying the chain rule of differentiation, the partial derivatives of err1, err2, and err3 with
respect to 𝑤 are:

The chain rule of differentiation states that the derivative of a composite function
𝑓(𝑔(𝑥)), written as ∂/∂𝑥 [𝑓(𝑔(𝑥))], is the product of the derivative of 𝑓 with respect to 𝑔
and the derivative of 𝑔 with respect to 𝑥, or: ∂/∂𝑥 [𝑓(𝑔(𝑥))] = (∂𝑓/∂𝑔) ⋅ (∂𝑔/∂𝑥).

Then,

∂err1/∂𝑤 = 2 ⋅ (150𝑤 + 𝑏 − 200) ⋅ 150, ∂err2/∂𝑤 = 2 ⋅ (200𝑤 + 𝑏 − 600) ⋅ 200,
∂err3/∂𝑤 = 2 ⋅ (260𝑤 + 𝑏 − 500) ⋅ 260

Therefore,

∂𝐽/∂𝑤 = (1/3)(2 ⋅ 150 ⋅ (150𝑤 + 𝑏 − 200) + 2 ⋅ 200 ⋅ (200𝑤 + 𝑏 − 600) + 2 ⋅ 260 ⋅ (260𝑤 + 𝑏 − 500))
       = (1/3)(260200𝑤 + 1220𝑏 − 560000)

Similarly, we find ∂𝐽/∂𝑏:

∂𝐽/∂𝑏 = (1/3)(2 ⋅ (150𝑤 + 𝑏 − 200) + 2 ⋅ (200𝑤 + 𝑏 − 600) + 2 ⋅ (260𝑤 + 𝑏 − 500))
       = (1/3)(1220𝑤 + 6𝑏 − 2600)
Setting the partial derivatives to 0, which is required to locate the optimal values, results in the
following system of equations:
(1/3)(260200𝑤 + 1220𝑏 − 560000) = 0,
(1/3)(1220𝑤 + 6𝑏 − 2600) = 0
Simplifying the system and using substitution to solve for the variables gives the optimal values:
𝑤 ∗ = 2.58 and 𝑏∗ = −91.76.
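
To verify these values numerically, here is a minimal sketch (not from the book) that solves the
same least-squares problem with NumPy; the variable names are illustrative:

import numpy as np

# Solve the system ∂J/∂w = 0, ∂J/∂b = 0 for the three-example dataset
# via least squares.
x = np.array([150.0, 200.0, 260.0])
y = np.array([200.0, 600.0, 500.0])

A = np.stack([x, np.ones_like(x)], axis=1)    # columns: x and 1 (for the bias b)
(w, b), *_ = np.linalg.lstsq(A, y, rcond=None)
print(w, b)                                   # ~2.58, ~-91.76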

The resulting model 𝑓(𝑥) = 2.58𝑥 − 91.76 is shown in the plot below. It includes the three
examples (blue dots), the model itself (red solid line), and a prediction for a new house with an area
of 240 m2 (dotted orange lines).

A vertical blue dashed line shows the square root of the model’s prediction error compared to the
actual price.¹ Smaller errors mean the model fits the data better. The loss, which aggregates these
errors, measures how well the model aligns with the dataset.

¹ It’s the square root of the error because our error, as defined in Equation 1.2, is the square of the difference
between the predicted price and the real price of the house. It’s common practice to take the square root of the
mean squared error because it expresses the error in the same units as the target variable (price in this case). This
makes it easier to interpret the error value.

When we calculate the loss using the same dataset that trained the model, the result is called the
training loss. The dataset used for training is referred to as the training dataset or training set.
For our model, the training loss is defined by Equation 1.3. Now, we can use the learned parameter
values to compute the loss for the training set:

𝐽(2.58, −91.76) = ((2.58 ⋅ 150 − 91.76 − 200)² + (2.58 ⋅ 200 − 91.76 − 600)²
                 + (2.58 ⋅ 260 − 91.76 − 500)²) / 3
               = 15403.19.
The square root of this value is approximately 124.1, indicating an average prediction error of
around $124,100. The interpretation of whether a loss value is high or low depends on the specific
business context and comparative benchmarks. Neural networks and other non-linear models,
which we explore later in this chapter, typically achieve lower loss values.

1.3. Four-Step Machine Learning Process


At this stage, you should clearly understand the four steps involved in supervised learning:

1. Collect a dataset: For example, (𝑥1, 𝑦1) = (150,200), (𝑥2, 𝑦2) = (200,600), and
(𝑥3, 𝑦3) = (260,500).

2. Define the model’s structure: For example, 𝑦 = 𝑤𝑥 + 𝑏.
3. Define the loss function: Such as Equation 1.3.
4. Minimize the loss: Minimize the loss function on the dataset.

In our example, we minimized the loss manually by solving a system of two equations with two
variables. This approach works for small systems. However, as models grow in complexity—such
as large language models with billions of parameters—manual approach becomes infeasible. Let’s
now introduce new concepts that will help us address this challenge.

1.4. Vector
To predict a house price, knowing its area alone isn’t enough. Factors like the year of construction
or the number of bedrooms and bathrooms also matter. Suppose we use two attributes: (1) area
and (2) number of bedrooms. In this case, the input 𝐱 becomes a feature vector. This vector
includes two features, also called dimensions or components:
𝐱 ≝ [𝑥 (1), 𝑥 (2)]⊤
In this book, vectors are represented with lowercase bold letters, such as 𝐱 or 𝐰. For a given house
𝐱, 𝑥 (1) represents its size in square meters, and 𝑥 (2) represents the number of bedrooms. The
dimensionality of the vector, or its size, refers to the number of components it contains. Here, 𝐱
has two components, so its dimensionality is 2.

A vector is usually represented as a column of numbers, called a column vector.
However, in text, it is often written as its transpose, 𝐱⊤. Transposing a column vector
converts it into a row vector. For example, 𝐱⊤ ≝ [𝑥 (1), 𝑥 (2)].

With two features, our linear model needs three parameters: the weights 𝑤 (1) and 𝑤 (2) , and the
bias 𝑏. The weights can be grouped into a vector:
𝐰 ≝ [𝑤 (1), 𝑤 (2)]⊤
The linear model can then be written compactly as:

𝑦 = 𝐰 ⋅ 𝐱 + 𝑏, (1.4)

where 𝐰 ⋅ 𝐱 is a dot product of two vectors (also known as scalar product). It is defined as:
𝐰 ⋅ 𝐱 ≝ ∑ⱼ₌₁ᴰ 𝑤 (𝑗) 𝑥 (𝑗)

The dot product combines two vectors of the same dimensionality to produce a scalar, a single
number like 22, 0.67, or −10.5. Scalars in this book are denoted by italic lowercase or uppercase
letters, such as 𝑥 or 𝐷. The expression 𝐰 ⋅ 𝐱 + 𝑏 generalizes the idea of a linear transformation to
vectors.

The equation uses capital-sigma notation, where 𝐷 represents the dimensionality of the
input, and 𝑗 runs from 1 to 𝐷. For example, in the 2-dimensional house scenario,
∑ⱼ₌₁² 𝑤 (𝑗) 𝑥 (𝑗) = 𝑤 (1) 𝑥 (1) + 𝑤 (2) 𝑥 (2).

Although the capital-sigma notation suggests the dot product might be implemented as
a loop, modern computers handle it much more efficiently. Optimized linear algebra
libraries like BLAS and cuBLAS compute the dot product using low-level, highly
optimized methods. These libraries leverage hardware acceleration and parallel
processing, achieving speeds far beyond a simple manual loop.

The sum of two vectors 𝐚 and 𝐛, both with the same dimensionality 𝐷, is defined as:
𝐚 + 𝐛 ≝ (𝑎(1) + 𝑏(1), 𝑎(2) + 𝑏(2), … , 𝑎(𝐷) + 𝑏(𝐷))⊤

The calculation for a sum of two 3-dimensional vectors is illustrated below:²

The element-wise product of two vectors 𝐚 and 𝐛 of dimensionality 𝐷 is defined as:

𝐚 ⊙ 𝐛 ≝ (𝑎(1) ⋅ 𝑏(1), 𝑎(2) ⋅ 𝑏(2), … , 𝑎(𝐷) ⋅ 𝑏(𝐷))⊤

The computation of the element-wise product for two 3-dimensional vectors is shown below:

² In this chapter’s illustrations, the numbers in the cells indicate the position of an element within an input or
output matrix, or a vector. They do not represent actual values.

The norm of a vector 𝐱, denoted ∥ 𝐱 ∥, represents its length or magnitude. It is defined as the
square root of the sum of the squares of its components:

∥𝐱∥ ≝ √(∑ⱼ₌₁ᴰ (𝑥 (𝑗))²)

For a 2-dimensional vector 𝐱, the norm is:

∥𝐱∥ = √((𝑥 (1))² + (𝑥 (2))²)

The cosine of the angle 𝜃 between two vectors 𝐱 and 𝐲 is defined as:
cos(𝜃) = (𝐱 ⋅ 𝐲) / (∥𝐱∥ ∥𝐲∥)    (1.5)

The cosine of the angle between two vectors quantifies their similarity. For instance, two houses
with similar areas and bedroom counts will have a cosine similarity close to 1. Cosine similarity
is widely used to compare words or documents represented as embedding vectors. This will be
discussed further in Section 2.2.

A zero vector has all components equal to zero. A unit vector has a length of 1. To convert any
non-zero vector 𝐱 into a unit vector 𝐱̂, you divide the vector by its norm:
𝐱̂ = 𝐱 / ∥𝐱∥

Dividing a vector by a number results in a new vector where each component of the original vector
is divided by that number.
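
For illustration, the vector operations above are one-liners in NumPy; a sketch with arbitrary
example values:

import numpy as np

a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

norm_a = np.linalg.norm(a)                                     # ||a|| = 5.0
cos_theta = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))  # Equation 1.5: 0.96
a_hat = a / np.linalg.norm(a)                                  # unit vector: [0.6, 0.8]
print(np.linalg.norm(a_hat))                                   # 1.0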

A unit vector preserves the direction of the original vector but has a length of 1. The figure below
demonstrates this with 2-dimensional examples. On the left, aligned vectors have cos(𝜃) = 0.78.
On the right, nearly orthogonal vectors have cos(𝜃) = −0.02.

Unit vectors are valuable because their dot product equals the cosine of the angle
between them, and computing dot products is efficient. When documents are
represented as unit vectors, finding similar ones becomes fast by calculating the dot
product between the query vector and document vectors. This is how vector search
engines and libraries like Faiss, Qdrant, and Weaviate operate.

As dimensions increase, the number of parameters in a linear model becomes too large to solve
manually. These models also face inherent limitations—they can only fit data that follows a straight
line or its higher-dimensional analogues like planes and hyperplanes. (This problem is illustrated
in the next section.)

In high-dimensional spaces, we cannot visually verify if data follows a linear pattern. Even if we
could visualize beyond three dimensions, we would still need more flexible models to handle data
that linear models cannot fit.

The next section explores non-linear models, with a focus on neural networks—the foundation for
understanding large language models, which are a specialized neural network architecture.

1.5. Neural Network


A neural network differs from a linear model in two fundamental ways: (1) it applies fixed non-
linear functions to the outputs of trainable linear functions, and (2) its structure is deeper,
combining multiple functions hierarchically through layers. Let’s illustrate these differences.

Linear models like 𝑤𝑥 + 𝑏 or 𝐰 ⋅ 𝐱 + 𝑏 cannot solve many machine learning problems effectively.
Even if we combine them into a composite function 𝑓2 (𝑓1 (𝑥)), a composite function of linear
functions remains linear. This is straightforward to verify.
Let’s define 𝑦1 ≝ 𝑓1(𝑥) = 𝑎1𝑥 and 𝑦2 ≝ 𝑓2(𝑦1) = 𝑎2𝑦1. Here, 𝑓2 depends on 𝑓1, making it a
composite function. We can rewrite 𝑓2 as:

𝑦2 = 𝑎2 𝑦1 = 𝑎2 (𝑎1 𝑥) = (𝑎2 𝑎1 )𝑥
Since 𝑎1 and 𝑎2 are constants, we can define 𝑎3 ≝ 𝑎1𝑎2, so 𝑦2 = 𝑎3𝑥, which is linear.

A straight line often fails to capture patterns in one-dimensional data, as demonstrated when linear
regression is applied to non-linear data.

To overcome this, we introduce non-linearity. For a 1D input, the model becomes:

𝑦 = 𝜙(𝑤𝑥 + 𝑏)

The function 𝜙 is a fixed non-linear function, known as an activation. Common choices are:
1) ReLU (rectified linear unit): ReLU(𝑧) ≝ max(0, 𝑧), which outputs non-negative values
and is widely used in neural networks;
2) Sigmoid: 𝜎(𝑧) ≝ 1/(1 + 𝑒^(−𝑧)), which outputs values between 0 and 1, making it suitable for
binary classification (e.g., classifying spam emails as 1 and non-spam as 0);
3) Tanh (hyperbolic tangent): tanh(𝑧) ≝ (𝑒^𝑧 − 𝑒^(−𝑧))/(𝑒^𝑧 + 𝑒^(−𝑧)), which outputs values
between −1 and 1.

In these equations, 𝑒 denotes Euler’s number, approximately 2.72.

These functions are widely used due to their mathematical properties, simplicity, and effectiveness
in diverse applications.
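
Since the original plot isn’t reproduced here, a minimal NumPy sketch of the three activations:

import numpy as np

def relu(z):                     # max(0, z), element-wise
    return np.maximum(0.0, z)

def sigmoid(z):                  # 1 / (1 + e^(-z)), outputs in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):                     # (e^z - e^-z) / (e^z + e^-z), outputs in (-1, 1)
    return np.tanh(z)

z = np.array([-2.0, 0.0, 2.0])
print(relu(z), sigmoid(z), tanh(z), sep="\n")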

The structure 𝜙(𝑤𝑥 + 𝑏) enables learning non-linear models but can’t capture all non-linear
curves. By nesting these functions, we build more expressive models. For instance, let
𝑓1(𝑥) ≝ 𝜙(𝑎𝑥 + 𝑏) and 𝑓2(𝑧) ≝ 𝜙(𝑐𝑧 + 𝑑). A composite model combining 𝑓1 and 𝑓2 is:

𝑦 = 𝑓2 (𝑓1 (𝑥)) = 𝜙(𝑐𝜙(𝑎𝑥 + 𝑏) + 𝑑)

Here, the input 𝑥 is first transformed linearly using parameters 𝑎 and 𝑏, then passed through the
non-linear function 𝜙. The result is further transformed linearly with parameters 𝑐 and 𝑑, followed
by another application of 𝜙.
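
A minimal sketch of this composite model in Python, with sigmoid standing in for 𝜙 and arbitrary
(untrained) parameter values:

import math

def phi(z):                       # the activation; sigmoid is used here
    return 1.0 / (1.0 + math.exp(-z))

a, b, c, d = 0.5, -1.0, 2.0, 0.3  # illustrative values, not learned

def f1(x): return phi(a * x + b)  # first unit: linear transform, then phi
def f2(z): return phi(c * z + d)  # second unit

y = f2(f1(1.5))                   # y = phi(c * phi(a*x + b) + d)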

Below is the graph representation of the composite model 𝑦 = 𝑓2 (𝑓1 (𝑥)):

A computational graph represents the structure of a model. The computational graph above
shows two non-linear units (blue rectangles), often referred to as artificial neurons. Each unit
contains two trainable parameters—a weight and a bias—represented by grey circles. The left
arrow ← denotes that the value on the right is assigned to the variable on the left. This graph
illustrates a basic neural network with two layers, each containing one unit. Most neural networks
in practice are built with more layers and multiple units per layer.

Suppose we have a two-dimensional input. The input layer contains three units, while the output
layer has a single unit. The network’s structure appears as follows:

Figure 1.1: A neural network with two layers.

This structure represents a feedforward neural network (FNN), where information flows in one
direction—left to right—without loops. When units in each layer connect to all units in the
subsequent layer, as shown above, we call it a multilayer perceptron (MLP). A layer where each
unit connects to all units in both adjacent layers is termed a fully connected layer, or dense layer.

In Chapter 3, we will explore recurrent neural networks (RNNs). Unlike FNNs, RNNs have loops,
where outputs from a layer are used as inputs to the same layer.

Convolutional neural networks (CNNs) are feedforward neural networks with
convolutional layers that are not fully connected. While initially designed for image
processing, they are effective for tasks like document classification in text data. To learn
more about CNNs, refer to the additional materials in the book’s wiki.

To simplify diagrams, individual neural units can be replaced with squares. Using this approach,
the above network can be represented more compactly as follows:

If you think this simple model is too weak, look at the figure below. It contains three plots
demonstrating how increasing model size improves performance. The left plot shows a model with
2 units: one input, one output, and ReLU activations. The middle plot is a model with 4 units: three
inputs and one output. The right plot shows a much larger model with 100 units.

The ReLU activation function, despite its simplicity, was a breakthrough in machine
learning. Neural networks before 2012 relied on smooth activations like tanh and
sigmoid, which made training deep models increasingly difficult. We will return to this
subject in Chapter 4 on the Transformer neural network architecture.

Increasing the number of parameters helps the model approximate the data more accurately.
Experiments consistently show that adding more units per layer or increasing the number of layers
in a neural network improves its capacity to fit high-dimensional datasets, such as natural language,
voice, sound, image, and video data.

1.6. Matrix
Neural networks can handle high-dimensional datasets but require substantial memory and
computation. Calculating a layer’s transformation naïvely would involve iterating over thousands
of parameters per unit across thousands of units and dozens of layers, which is both slow and
resource-intensive. Using matrices makes the computations more efficient.

A matrix is a two-dimensional array of numbers arranged into rows and columns, which
generalizes the concept of vectors to higher dimensionalities. Formally, a matrix 𝐀 with 𝑚 rows and
𝑛 columns is written as:
𝐀 ≝ [ 𝑎1,1  𝑎1,2  ⋯  𝑎1,𝑛
      𝑎2,1  𝑎2,2  ⋯  𝑎2,𝑛
       ⋮     ⋮    ⋱    ⋮
      𝑎𝑚,1  𝑎𝑚,2  ⋯  𝑎𝑚,𝑛 ]

Here, 𝑎𝑖,𝑗 represents the element in the 𝑖-th row and 𝑗-th column of the matrix. The dimensions of
the matrix are expressed as 𝑚 × 𝑛 (read as “m by n”).

Matrices are fundamental in machine learning. They compactly represent data and weights and
enable efficient computation through operations such as addition, multiplication, and
transposition. In this book, matrices are represented with uppercase bold letters, such as 𝐗 or 𝐖.

The sum of two matrices 𝐀 and 𝐁 of the same dimensionality is defined element-wise as:
(𝐀 + 𝐁)𝑖,𝑗 ≝ 𝑎𝑖,𝑗 + 𝑏𝑖,𝑗

For example, for two 2 × 3 matrices 𝐀 and 𝐁, the addition is performed element by element.

The product of a matrix 𝐀 with dimensions 𝑚 × 𝑛 and a matrix 𝐁 with dimensions 𝑛 × 𝑝 is a matrix
𝐂 with dimensions 𝑚 × 𝑝 such that the value in row 𝑖 and column 𝑘 is given by:

(𝐂)𝑖,𝑘 = ∑ⱼ₌₁ⁿ 𝑎𝑖,𝑗 𝑏𝑗,𝑘

For example, for a 4 × 3 matrix 𝐀 and a 3 × 5 matrix 𝐁, the product is a 4 × 5 matrix.
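
A quick shape check of this rule in NumPy (a sketch; the entries are placeholders):

import numpy as np

A = np.arange(12, dtype=float).reshape(4, 3)  # a 4x3 matrix
B = np.arange(15, dtype=float).reshape(3, 5)  # a 3x5 matrix
C = A @ B                                     # matrix product
print(C.shape)                                # (4, 5)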

Transposing a matrix 𝐀 swaps its rows and columns, resulting in 𝐀⊤, where:

(𝐀⊤ )𝑖,𝑗 = 𝑎𝑗,𝑖

For example, for a 2 × 3 matrix 𝐀, its transpose 𝐀⊤ is a 3 × 2 matrix.

Matrix-vector multiplication is a special case of matrix multiplication. When an 𝑚 × 𝑛 matrix 𝐀
is multiplied by a vector 𝐱 of size 𝑛, the result is a vector 𝐲 = 𝐀𝐱 with 𝑚 components.

Each element 𝑦𝑖 of the resulting vector 𝐲 is computed as:

𝑦𝑖 = ∑ⱼ₌₁ⁿ 𝑎𝑖,𝑗 𝑥 (𝑗)

For example, a 4 × 3 matrix 𝐀 multiplied by a 3-dimensional vector 𝐱 produces a 4-dimensional vector.

The weights and biases in fully connected layers of neural networks can be compactly represented
using matrices and vectors, enabling the use of highly optimized linear algebra libraries. As a result,
matrix operations form the backbone of neural network training and inference.

Let’s express the model in Figure 1.1 using matrix notation. Let 𝐱 be the 2D input feature vector.
For the first layer, the weights and biases are represented as a 3 × 2 matrix 𝐖1 and a 3D vector 𝐛1 ,
respectively. The 3D output 𝐲1 of the first layer is given by:

𝐲1 = 𝜙(𝐖1 𝐱 + 𝐛1 ) (1.6)

The second layer also uses a weight matrix and a bias. The output 𝑦2 of the second layer is computed
using the output 𝐲1 from the first layer. The weight matrix for the second layer is a 1 × 3 matrix 𝐖2.
The bias for the second layer is a scalar 𝑏2,1. The model output corresponds to the output of the
second layer:

𝑦2 = 𝜙(𝐖2 𝐲1 + 𝑏2,1 ) (1.7)

Equation 1.6 and Equation 1.7 capture the operations from input to output in the neural network,
with each layer’s output serving as the input for the next.
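
A minimal NumPy sketch of Equations 1.6 and 1.7 with randomly initialized, untrained parameters
(the input values are illustrative):

import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)  # layer 1: 3 units, 2 inputs
W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=1)  # layer 2: 1 unit, 3 inputs

def phi(z):                       # sigmoid activation
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([250.0, 3.0])        # e.g., area and bedroom count
y1 = phi(W1 @ x + b1)             # Equation 1.6
y2 = phi(W2 @ y1 + b2)            # Equation 1.7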

1.7. Gradient Descent


Neural networks are typically large and composed of non-linear functions, which makes solving for
the minimum of the loss function analytically infeasible. Instead, the gradient descent algorithm is
widely used to minimize the loss, including in large language models.

Consider a practical example: binary classification. This task assigns input data to one of two
classes, like deciding if an email is spam or not, or detecting whether a website connection request
is a DDoS attack.

Our training dataset 𝒟 is {(𝐱𝑖, 𝑦𝑖)}ᵢ₌₁ᴺ, where 𝐱𝑖 are vectors of input features, and 𝑦𝑖 are the labels.
Each 𝑦𝑖, indexed from 1 to 𝑁, takes a value of 0 for “not spam” or 1 for “spam.” A well-trained model
should output 𝑦̃ close to 1 for spam inputs 𝐱 and close to 0 for non-spam inputs. We can define the
model as follows:

𝑦 = 𝜎(𝐰 ⋅ 𝐱 + 𝑏), (1.8)


𝐷 𝐷
where 𝐱 = [𝑥 (𝑗) ]𝑗=1 and 𝐰 = [𝑤 (𝑗) ]𝑗=1 are 𝐷-dimensional vectors, 𝑏 is a scalar, and 𝜎 is the sigmoid
defined in Section 1.5.

This model, called logistic regression, is commonly used for binary classification tasks. Unlike
linear regression, which produces outputs ranging from −∞ to ∞, logistic regression always
outputs values between 0 and 1. It can serve either as a standalone model or as the output layer in
a larger neural network.

Despite being over 80 years old, logistic regression remains one of the most widely
used algorithms in production machine learning systems.

A common choice for the loss function in this case is binary cross-entropy, also called logistic
loss. For a single example 𝑖, the binary cross-entropy loss is defined as:

loss(𝑦̃𝑖, 𝑦𝑖) ≝ −[𝑦𝑖 log(𝑦̃𝑖) + (1 − 𝑦𝑖) log(1 − 𝑦̃𝑖)]    (1.9)

In this equation, 𝑦𝑖 represents the actual label of the 𝑖-th example in the dataset, and 𝑦̃𝑖 is the
prediction score, a value between 0 and 1 that the model outputs for input vector 𝐱𝑖 . The function
log denotes the natural logarithm.

Loss functions are usually designed to penalize incorrect predictions while rewarding accurate
ones. To see why logistic loss works for logistic regression, consider two extreme cases:

1. Perfect prediction, when 𝑦𝑖 = 0 and 𝑦̃𝑖 = 0:

loss(0,0) = −[0 ⋅ log(0) + (1 − 0) ⋅ log(1 − 0)] = −log(1) = 0

Here, the loss is zero, which is good because the prediction matches the label perfectly.

2. Opposite prediction, when 𝑦𝑖 = 0 and 𝑦̃𝑖 = 1:

loss(1,0) = −[0 ⋅ log(1) + (1 − 0) ⋅ log(1 − 1)] = −log(0)

The logarithm of 0 is undefined, and as 𝑎 approaches 0, −log(𝑎) approaches infinity, representing
a severe loss for completely wrong predictions. However, since 𝑦̃𝑖, the output of the sigmoid
function, always remains strictly between 0 and 1, the loss stays finite.

For an entire dataset 𝒟, the loss is given by the average loss for all examples in the dataset:
loss𝒟 ≝ −(1/𝑁) ∑ᵢ₌₁ᴺ [𝑦𝑖 log(𝑦̃𝑖) + (1 − 𝑦𝑖) log(1 − 𝑦̃𝑖)]    (1.10)
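
Equation 1.10 translates directly into code. A minimal sketch, assuming every prediction is strictly
between 0 and 1 (as a sigmoid output is):

import numpy as np

def bce_loss(y_true, y_pred):
    # Average binary cross-entropy over the dataset (Equation 1.10).
    return -np.mean(y_true * np.log(y_pred)
                    + (1 - y_true) * np.log(1 - y_pred))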

To simplify the gradient descent derivation, we’ll stick to a single example, 𝑖, and rewrite the
equation by substituting the prediction score 𝑦̃𝑖 with the model’s expression for it:

loss(𝑦̃𝑖 , 𝑦𝑖 ) = −[𝑦𝑖 log(𝜎(𝑧𝑖 )) + (1 − 𝑦𝑖 )log(1 − 𝜎(𝑧𝑖 ))], where 𝑧𝑖 = 𝐰 ⋅ 𝐱𝑖 + 𝑏

To minimize loss(𝑦̃𝑖 , 𝑦𝑖 ), we calculate the partial derivatives with respect to each weight 𝑤 (𝑗) and
the bias 𝑏. We will use the chain rule because we have a composition of three functions:
• Function 1: 𝑧𝑖 ≝ 𝐰 ⋅ 𝐱𝑖 + 𝑏, a linear function involving the weights 𝐰 and the bias 𝑏;
• Function 2: 𝑦̃𝑖 ≝ 𝜎(𝑧𝑖) = 1/(1 + 𝑒^(−𝑧𝑖)), the sigmoid function applied to 𝑧𝑖;
• Function 3: loss(𝑦̃𝑖, 𝑦𝑖), as defined in Equation 1.9, which depends on 𝑦̃𝑖.

Notice that 𝐱𝑖 and 𝑦𝑖 are given: 𝐱𝑖 is the feature vector for example 𝑖, and 𝑦𝑖 ∈ {0,1} is
its label. The notation 𝑦𝑖 ∈ {0,1} means that 𝑦𝑖 belongs to the set {0,1} and, in this case,
indicates that 𝑦𝑖 can only be 0 or 1.

Let’s denote loss(𝑦̃𝑖 , 𝑦𝑖 ) as l𝑖 . For weights 𝑤 (𝑗) , the application of the chain rule gives us:

∂l𝑖/∂𝑤 (𝑗) = (∂l𝑖/∂𝑦̃𝑖) ⋅ (∂𝑦̃𝑖/∂𝑧𝑖) ⋅ (∂𝑧𝑖/∂𝑤 (𝑗)) = (𝑦̃𝑖 − 𝑦𝑖) ⋅ 𝑥𝑖 (𝑗)

For the bias 𝑏, we have:


∂l𝑖/∂𝑏 = (∂l𝑖/∂𝑦̃𝑖) ⋅ (∂𝑦̃𝑖/∂𝑧𝑖) ⋅ (∂𝑧𝑖/∂𝑏) = 𝑦̃𝑖 − 𝑦𝑖

This is where the beauty of machine learning math truly shines: the activation
function—sigmoid—and loss function—cross-entropy—both arise from 𝑒, Euler’s
number. Their functional properties serve distinct purposes: sigmoid ranges between
0 and 1, ideal for binary classification, while cross-entropy spans from 0 to ∞, perfect

Read first, buy later 29


DRAFT The Hundred-Page Language Models Book DRAFT

as a penalty. When combined, the exponential and logarithmic components elegantly


cancel, yielding a linear function—prized for its computational simplicity and
numerical stability. The book’s wiki provides the full derivation.

The partial derivatives with respect to 𝑤 (𝑗) and 𝑏 for a single example (𝐱𝑖, 𝑦𝑖) can be extended to
the entire dataset {(𝐱𝑖, 𝑦𝑖)}ᵢ₌₁ᴺ by summing the contributions from all examples and averaging them.
This follows from the sum rule and the constant multiple rule of differentiation:
∂loss/∂𝑤 (𝑗) = (1/𝑁) ∑ᵢ₌₁ᴺ [(𝑦̃𝑖 − 𝑦𝑖) ⋅ 𝑥𝑖 (𝑗)]
∂loss/∂𝑏 = (1/𝑁) ∑ᵢ₌₁ᴺ [𝑦̃𝑖 − 𝑦𝑖]    (1.11)

Averaging the losses for individual examples ensures that each example contributes equally to the
overall loss, regardless of the total number of examples.

The gradient is a vector that contains all the partial derivatives. The gradient of the loss function,
denoted as ∇loss, is defined as follows:

∇loss ≝ (∂loss/∂𝑤 (1), ∂loss/∂𝑤 (2), …, ∂loss/∂𝑤 (𝐷), ∂loss/∂𝑏)
If a gradient’s component is positive, this means that increasing the corresponding parameter will
increase the loss. Therefore, to minimize the loss, we should decrease that parameter.

The gradient descent algorithm uses the gradient of the loss function to iteratively update the
weights and bias, aiming to minimize the loss function. Here’s how it operates:

0. Initialize parameters: Start with random values of parameters 𝑤 (𝑗) and 𝑏.

1. Compute the predictions: For each training example (𝐱𝑖 , 𝑦𝑖 ), compute the predicted
value 𝑦̃𝑖 using the model:

𝑦̃𝑖 ← 𝜎(𝐰 ⋅ 𝐱𝑖 + 𝑏)

2. Compute the gradient: Calculate the partial derivatives of the loss function with respect
to each weight 𝑤 (𝑗) and the bias 𝑏 using Equation 1.11.

3. Update the weights and bias: Adjust the weights and bias in the direction that
decreases the loss function. This adjustment involves taking a small step in the opposite
direction of the gradient. The step size is controlled by the learning rate 𝜂 (explained
below):

𝑤 (𝑗) ← 𝑤 (𝑗) − 𝜂 ⋅ ∂loss/∂𝑤 (𝑗)

𝑏 ← 𝑏 − 𝜂 ⋅ ∂loss/∂𝑏

4. Calculate the loss: Calculate the logistic loss by substituting the updated values of 𝑤 (𝑗)
and 𝑏 into Equation 1.10.

5. Continue the iterative process: Repeat steps 1-4 for a set number of iterations (also
called steps) or until the loss value converges to a minimum.

Here’s a bit more detail to clarify the steps:

• Gradients are subtracted from parameters because they point in the direction of steepest
ascent in the loss function. Since our goal is to minimize loss, we move in the opposite
direction—hence, the subtraction.
• The learning rate 𝜂 is a positive value close to 0 and serves as a hyperparameter—not
learned by the model but set manually. It controls the step size of each update, and
finding its optimal value requires experimentation.
• Convergence occurs when subsequent iterations yield minimal decreases in loss. The
learning rate 𝜂 is crucial here: too small, and progress crawls; too large, and we risk
overshooting the minimum or even seeing the loss increase rather than decrease.
Choosing an appropriate 𝜂 is therefore essential for effective gradient descent.

Let’s illustrate the process with a simple dataset of 12 examples:

{((22,25), 0), ((25,35), 0), ((47,80), 1), ((52,95), 1), ((46,82), 1), ((56,90), 1),
((23,27), 0), ((30,50), 1), ((40,60), 1), ((39,57), 0), ((53,95), 1), ((48,88), 1)}

In this dataset, 𝐱 𝑖 contains two features: age (in years) and income (in thousands of dollars). The
objective is to predict whether a person will buy a product, with label 𝑦𝑖 being either 0 (will not
buy) or 1 (will buy).
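
As a from-scratch illustration of the steps above, here is a minimal NumPy sketch of gradient
descent on this dataset; the learning rate and step count are illustrative choices, not the book’s
exact settings:

import numpy as np

X = np.array([[22, 25], [25, 35], [47, 80], [52, 95], [46, 82], [56, 90],
              [23, 27], [30, 50], [40, 60], [39, 57], [53, 95], [48, 88]],
             dtype=float)
y = np.array([0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1], dtype=float)

w, b, lr = np.zeros(2), 0.0, 0.001
for step in range(500):
    y_pred = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # step 1: predictions
    grad_w = X.T @ (y_pred - y) / len(y)         # step 2: Equation 1.11, weights
    grad_b = np.mean(y_pred - y)                 # step 2: Equation 1.11, bias
    w -= lr * grad_w                             # step 3: update weights
    b -= lr * grad_b                             # step 3: update bias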

The loss evolution across gradient descent steps and the resulting trained model are shown in the
figure below:

The left plot shows the loss decreasing steadily during gradient descent optimization. The right plot
displays the trained model’s sigmoid function, with training examples positioned by their z-values
(𝑧𝑖 = 𝐰 ∗ ⋅ 𝐱𝑖 + 𝑏 ∗ ), where 𝐰 ∗ and 𝑏 ∗ are the learned weights and bias.

The 0.5 threshold was chosen based on the plot’s clear separation: all “will-buy” examples (blue
dots) lie above it, while all “will-not-buy” examples (red dots) fall below. For new inputs 𝐱, predict
using 𝑦̃ = 𝜎(𝐰 ∗ ⋅ 𝐱 + 𝑏 ∗ ). If 𝑦̃ < 0.5, predict “will not buy;” otherwise, “will buy.”
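
In code, the thresholding step is a one-liner. A sketch that reuses the w and b learned by the
NumPy gradient descent sketch above (the example input is hypothetical):

def predict(x):
    # Score a new input with the learned parameters, then threshold at 0.5.
    score = 1.0 / (1.0 + np.exp(-(np.asarray(x, dtype=float) @ w + b)))
    return 1 if score >= 0.5 else 0  # 1: "will buy", 0: "will not buy"

print(predict([50, 85]))  # a hypothetical 50-year-old earning $85k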

1.8. Automatic Differentiation


Gradient descent optimizes model parameters but requires partial derivative equations. Until now,
these derivatives had to be calculated by hand for each model. As models grow more complex,
particularly in neural networks with multiple layers, manual derivation becomes impractical.

This is where automatic differentiation (or autograd) comes in. Built into machine learning
frameworks like PyTorch and TensorFlow, this feature computes partial derivatives directly from
model-defining Python code. This eliminates manual derivation, even for the most sophisticated
models.

Modern automatic differentiation systems can handle derivatives for millions of
variables efficiently. Manual computation of these derivatives would be infeasible—
writing the equations alone could take years.

To use gradient descent in PyTorch, first install it with pip3 like this:

$ pip3 install torch

Now that PyTorch is installed, let’s import the dependencies:

import torch
import torch.nn as nn
import torch.optim as optim

The torch.nn module contains building blocks for creating models. When you use these
components, PyTorch automatically handles derivative calculations. For optimization algorithms
like gradient descent, the torch.optim module has what you need. Here’s how to implement
logistic regression in PyTorch:

model = nn.Sequential(
    nn.Linear(n_inputs, n_outputs), ➊
    nn.Sigmoid() ➋
)

Our model leverages PyTorch’s sequential API, which is well-suited for simple feedforward neural
networks where data flows sequentially through layers. Each layer’s output naturally becomes the
input for the subsequent layer. The more versatile module API, which we’ll cover in the next
chapter, enables the creation of models with multiple inputs, outputs, or loops.

The input layer, defined in line ➊ using nn.Linear, has input dimensionality (n_inputs) matching
the size of our feature vector 𝐱, while the output dimensionality (n_outputs) determines the
layer’s unit count. For our buy/no-buy classifier—a model assigning classes to inputs—we set
n_inputs to 2 since 𝐱 = [𝑥 (1), 𝑥 (2)]⊤. With the output 𝑧 being scalar, n_outputs becomes 1. Line
➋ transforms 𝑧 through the sigmoid function to produce the output score.

We then proceed to define our dataset, create the model instance, establish the binary cross-
entropy loss function, and set up the gradient descent algorithm:

inputs = torch.tensor([
[22, 25], [25, 35], [47, 80], [52, 95], [46, 82], [56, 90],
[23, 27], [30, 50], [40, 60], [39, 57], [53, 95], [48, 88]
], dtype=torch.float32) ➊

labels = torch.tensor([
[0], [0], [1], [1], [1], [1], [0], [1], [1], [0], [1], [1]
], dtype=torch.float32) ➋

model = nn.Sequential(
    nn.Linear(inputs.shape[1], 1),
    nn.Sigmoid()
)
optimizer = optim.SGD(model.parameters(), lr=0.001)
criterion = nn.BCELoss() # binary cross-entropy loss

In the above code block, we defined inputs and labels. The inputs form a matrix with 12 rows
and 2 columns, while the labels are a vector with 12 components. The shape attribute of the
inputs tensor returns its dimensionality:

>>> inputs.shape
torch.Size([12, 2])

Tensors are PyTorch’s core data structures—multi-dimensional arrays optimized for computation
on both CPU and GPU. Supporting automatic differentiation and flexible data reshaping, tensors
form the foundation for neural network operations. In our example, the inputs tensor contains 12
examples with 2 features each, while the labels tensor holds 12 examples with single labels.
Following standard convention, examples are arranged in rows and their features in columns.

If you’re not familiar with tensors, there’s an introductory chapter on tensors available
on the book’s wiki.

When creating tensors in PyTorch, specifying dtype=torch.float32 in lines ➊ and ➋ sets 32-bit
floating-point precision explicitly. This precision setting is essential for neural network
computations, including weight adjustments, activation functions, and gradient calculations.

The 32-bit floating-point precision is not the only option for neural networks.
Quantization, an advanced technique that uses lower-precision data types like 16-bit
or 8-bit floats and integers, helps reduce model size and improve computational
efficiency. For more information, refer to resources on model optimization and
deployment available on the book’s wiki.

The optim.SGD class implements gradient descent by taking a list of model parameters and
a learning rate as inputs.³ Since our model inherits from nn.Module, we can access all trainable
parameters through its parameters method.

PyTorch provides the binary cross-entropy loss function through nn.BCELoss().

Now, we have everything we need to start the training loop:

for step in range(500):
    optimizer.zero_grad() ➊
    loss = criterion(model(inputs), labels) ➋
    loss.backward() ➌
    optimizer.step() ➍

Line ➋ calculates the binary cross-entropy loss (Equation 1.10) by evaluating model predictions
against training labels. Line ➌ then uses backpropagation to compute the gradient of this loss with
respect to the model parameters.

Backpropagation applies differentiation rules, particularly the chain rule, to compute gradients
through deep composite functions. This algorithm forms the backbone of neural network training.
When PyTorch operates on tensors, it builds a computational graph as shown in Figure 1.1 from
Section 1.5. This graph tracks all operations performed on the tensors. The loss.backward() call
prompts PyTorch to traverse this graph and compute gradients via the chain rule, eliminating the
need for manual gradient derivation and implementation.

The flow of data from input to output through the computational graph constitutes the forward
pass, while the computation of gradients from output to input through backpropagation represents
the backward pass.

PyTorch accumulates gradients in the .grad attribute of parameters like weights and
biases. While this feature enables multiple gradient computations before parameter
updates—useful for recurrent neural networks (covered in Chapter 3)—our
implementation doesn’t require gradient accumulation. Line ➊ therefore clears the
gradients at each step’s beginning.

³ While 0.001 is a common default learning rate, optimal values vary by problem and dataset. Finding the best rate
involves systematically testing different values and comparing model performance.

Finally, in line ➍, parameter values are updated by subtracting the product of the learning rate and
the loss function’s partial derivatives, completing step 3 of the gradient descent algorithm
discussed earlier.

One of automatic differentiation’s key advantages is its flexibility with model switching—as long as
you’re using PyTorch’s components, you can readily swap between different architectures. For
instance, you could replace logistic regression with a basic two-layer FNN, defined through the
sequential API:

model = nn.Sequential(
    nn.Linear(inputs.shape[1], 100),
    nn.Sigmoid(),
    nn.Linear(100, labels.shape[1]),
    nn.Sigmoid()
)

In this setup, each of the 100 units in the first layer contains 2 weights and 1 bias, while the output
layer’s single unit has 100 weights and 1 bias. The automatic differentiation system handles
gradient computation internally, so the remaining code stays unchanged.
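
Once trained, either model can make predictions for new inputs. A minimal inference sketch using
the 0.5 threshold from Section 1.7:

# Inference sketch: disable gradient tracking and apply the 0.5 threshold.
with torch.no_grad():
    scores = model(inputs)           # prediction scores in (0, 1)
    preds = (scores >= 0.5).float()  # 1.0: "will buy", 0.0: "will not buy"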

The next chapter examines how to represent and process text data, beginning with fundamental
techniques for converting documents into numerical representations like bag-of-words and word
embeddings, followed by exploring the count-based language modeling approach.
