Chapter 1
In AI’s early years, researchers were overly optimistic about achieving human-level intelligence. In
1965, Herbert Simon, a Turing Award recipient, predicted that “machines will be capable, within
twenty years, of doing any work a man can do.” However, progress was slower than expected,
leading to periods of reduced funding and interest, known as “AI winters.”
Interestingly, since the 1950s, experts have consistently predicted that human-level AI would be achieved
in about 25 years:
Two major AI winters occurred in 1974–1980 and 1987–2000. These periods were marked by
notable setbacks, including the failure of machine translation in 1966, poor outcomes from
DARPA’s Speech Understanding Research program at Carnegie Mellon (1971–1975), and reduced
AI funding in the UK after the 1973 Lighthill report. In the 1990s, many expert systems—computer
programs that simulated human decision-making using predefined rules and domain-specific
logic—were abandoned due to high costs and limited success.
During the first AI winter, even the term “AI” became somewhat taboo. Many
researchers rebranded their work as “informatics,” “knowledge-based systems,” or
“pattern recognition” to avoid association with AI’s perceived failures.
Enthusiasm for AI has grown steadily since the early 1990s. Interest surged around 2012,
particularly in machine learning, driven by advances in computational power, access to large
datasets, and improvements in neural network algorithms and frameworks. These developments
led to increased funding and a significant AI boom.
Although the focus of artificial intelligence research has evolved, the core goal remains the same:
to create methods that enable machines to solve problems previously considered solvable only by
humans. This is how the term will be used throughout this book.
The term “machine learning” was introduced in 1959 by Arthur Samuel. In his paper, “Some Studies
in Machine Learning Using the Game of Checkers,” he described it as “programming computers to
learn from experience.”
Early AI researchers primarily focused on symbolic methods and rule-based systems—an approach
later dubbed good old-fashioned AI (GOFAI)—but over time, the field increasingly embraced
machine learning approaches, with neural networks emerging as a particularly powerful technique.
Neural networks, inspired by the brain, aimed to learn patterns directly from examples. One
foundational model, the perceptron, was introduced by Frank Rosenblatt in 1958. It became a key
step toward later advancements. The perceptron defines a decision boundary, a line that separates
examples of two classes (e.g., spam and not spam):
Decision trees and random forests represent important evolutionary steps in machine learning.
Decision trees, introduced in the 1960s and later advanced by Ross Quinlan’s ID3 algorithm in
1986, split data into subsets through a tree-like structure. Each node represents a question about
the data, each branch is an answer, and each leaf provides a prediction. While these models are easy
to understand, they can struggle with overfitting, where they adapt too closely to training data,
reducing their ability to perform well on new, unseen data.
To address this limitation, Leo Breiman introduced the random forest algorithm in 2001. A random
forest builds multiple decision trees using random subsets of data and combines their outputs. This
approach improves predictive accuracy and reduces overfitting. Random forests remain widely
used for their reliability and performance.
Support vector machines (SVMs), introduced in the 1990s by Vladimir Vapnik and his colleagues,
were another significant step forward. SVMs identify the optimal hyperplane that separates data
points of different classes with the widest margin. The introduction of kernel methods allowed
SVMs to manage complex, non-linear patterns by mapping data into higher-dimensional spaces,
making it easier to find a suitable separating hyperplane. These advances made SVMs central to
machine learning research.
Today, machine learning is a subfield of AI focused on creating algorithms that learn from
collections of examples. These examples can come from nature, be designed by humans, or be
generated by other algorithms. The process involves gathering a dataset and building a model from
it, which is then used to solve the problem.
1.2. Model
A model is typically represented by a mathematical equation:
𝑦 = 𝑓(𝑥)
Here, 𝑥 is the input, 𝑦 is the output, and 𝑓 represents a function of 𝑥. A function is a named rule
that describes how one set of values is related to another. Formally, a function 𝑓 maps inputs from
the domain to outputs in the codomain, ensuring each input has exactly one output. The function
uses a specific rule or formula to transform the input into the output.
In machine learning, the goal is to compile a dataset of examples and use them to build 𝑓, so when
𝑓 is applied to a new, unseen 𝑥, it produces a 𝑦 that gives meaningful insight into 𝑥.
To predict a house’s price based on its area, the dataset might include (area, price) pairs such as
{(150, 200), (200, 600), …}. Here, the area is measured in m², and the price is in thousands.
Imagine we own a house with an area of 250 m² (about 2691 square feet). To derive a function 𝑓
that provides a reasonable price for this house, testing every possible function is infeasible. Instead,
we select a specific structure for 𝑓 and focus on functions that match this structure.
A simple choice is a linear function:

$$f(x) \stackrel{\text{def}}{=} wx + b \qquad (1.1)$$

The notation $\stackrel{\text{def}}{=}$ means "equals by definition" or "is defined as."
For linear functions, determining 𝑓 requires only two values: 𝑤 and 𝑏. These are called the
parameters or weights of the model.
In other texts, 𝑤 might be referred to as the slope, coefficient, or weight term. Similarly, 𝑏 may be
called the intercept, constant term, or bias. In this book, we’ll stick to “weight” for 𝑤 and “bias”
for 𝑏, as these terms are widely used in machine learning. When the meaning is clear, “parameters”
and “weights” will be used interchangeably.
For instance, when $w = \frac{2}{3}$ and $b = 1$, the linear function is shown below:
Here, the bias shifts the graph vertically, so the line crosses the 𝑦-axis at 𝑦 = 1. The weight
determines the slope, meaning the line rises by 2 units for every 3 units it moves to the right.
Even with a simple model like 𝑓(𝑥) = 𝑤𝑥 + 𝑏, the parameters 𝑤 and 𝑏 can take infinitely many
values. To find the optimal ones, we need an optimality criterion. A reasonable choice is to minimize
the average error when predicting house prices based on area. In this case, we aim for 𝑓(𝑥) = 𝑤𝑥 +
𝑏 to make predictions as close as possible to the actual prices.
This way of learning from labeled input–output pairs is called supervised learning. Other
machine learning types include unsupervised learning, where models learn
patterns from inputs alone, and reinforcement learning, where models learn by
interacting with environments and receiving rewards or penalties for their actions.
When 𝑓(𝑥) is applied to 𝑥𝑖 , it generates a predicted value 𝑦̃𝑖 . We can define the error err(𝑦̃𝑖 , 𝑦𝑖 ) for
a given example (𝑥𝑖 , 𝑦𝑖 ) as:
$$\operatorname{err}(\tilde{y}_i, y_i) \stackrel{\text{def}}{=} (\tilde{y}_i - y_i)^2 \qquad (1.2)$$
This expression, called the squared error, equals 0 when 𝑦̃𝑖 = 𝑦𝑖 . This makes sense: there’s no
error if the predicted price matches the actual price. The further 𝑦̃𝑖 deviates from 𝑦𝑖 , the larger the
error becomes. Squaring ensures the error is always positive, whether the prediction overshoots
or undershoots.
We define $w^*$ and $b^*$ as the optimal parameter values for 𝑓, which minimize the average price
prediction error for the dataset of 𝑁 examples, given by the following expression:

$$J(w, b) \stackrel{\text{def}}{=} \frac{1}{N}\sum_{i=1}^{N}\bigl(\tilde{y}_i - y_i\bigr)^2 = \frac{1}{N}\sum_{i=1}^{N}\bigl(wx_i + b - y_i\bigr)^2 \qquad (1.3)$$
In the equation defining 𝐽(𝑤, 𝑏), which represents the average prediction error, the values of 𝑥𝑖 and
𝑦𝑖 for each 𝑖 from 1 to 𝑁 are known since they come from the dataset. The unknowns are 𝑤 and 𝑏.
To determine the optimal 𝑤 ∗ and 𝑏 ∗ , we need to minimize 𝐽(𝑤, 𝑏). As this function is quadratic in
two variables, calculus guarantees it has a single minimum.
Equation 1.3 is referred to as the loss function in the machine learning problem of
linear regression. In this case, the loss function is the mean squared error, or MSE.
To find the optimum (minimum or maximum) of a function, we calculate its first derivative. When
we reach the optimum, the first derivative equals zero. For functions of two or more variables, like
the loss function 𝐽(𝑤, 𝑏), we compute partial derivatives with respect to each variable. We denote
these as $\frac{\partial J}{\partial w}$ for 𝑤 and $\frac{\partial J}{\partial b}$ for 𝑏. At the optimum, both must equal zero:

$$\begin{cases} \dfrac{\partial J}{\partial w} = 0 \\[4pt] \dfrac{\partial J}{\partial b} = 0 \end{cases}$$
Fortunately, the mean squared error function’s structure and the model’s linearity allow us to solve
this system of equations analytically. To illustrate, consider a dataset with three examples:
$(x_1, y_1) = (150, 200)$, $(x_2, y_2) = (200, 600)$, and $(x_3, y_3) = (260, 500)$. For this dataset, the loss
function is:

$$J(w, b) = \frac{1}{3}\Bigl[(150w + b - 200)^2 + (200w + b - 600)^2 + (260w + b - 500)^2\Bigr]$$
Navigate to the book's wiki and, from the file thelmbook.com/py/1.1, retrieve the code used
to generate the above plot; run the code and rotate the graph in 3D to better observe
the minimum.
Now we need to derive the expressions for $\frac{\partial J}{\partial w}$ and $\frac{\partial J}{\partial b}$. Notice that $J(w, b)$ is a composition of the
following functions:

• Functions $d_1 \stackrel{\text{def}}{=} 150w + b - 200$, $d_2 \stackrel{\text{def}}{=} 200w + b - 600$, and $d_3 \stackrel{\text{def}}{=} 260w + b - 500$ are
linear functions of 𝑤 and 𝑏;
• Functions $\operatorname{err}_1 \stackrel{\text{def}}{=} d_1^2$, $\operatorname{err}_2 \stackrel{\text{def}}{=} d_2^2$, and $\operatorname{err}_3 \stackrel{\text{def}}{=} d_3^2$ are quadratic functions of $d_1$, $d_2$, and $d_3$;
• Function $J \stackrel{\text{def}}{=} \frac{1}{3}(\operatorname{err}_1 + \operatorname{err}_2 + \operatorname{err}_3)$ is a linear function of $\operatorname{err}_1$, $\operatorname{err}_2$, and $\operatorname{err}_3$.
A composition of functions means the output of one function becomes the input to
another. For example, with two functions 𝑓 and 𝑔, you first apply 𝑔 to 𝑥, then apply 𝑓
to the result. This is written as 𝑓(𝑔(𝑥)), which means you calculate 𝑔(𝑥) first and then
use that result as the input for 𝑓.
In our loss function 𝐽(𝑤, 𝑏), the process starts by computing the linear functions for 𝑑1 , 𝑑2 , and 𝑑3
using the current values of 𝑤 and 𝑏. These outputs are then passed into the quadratic functions
err1, err2, and err3. The final step is averaging these results to compute 𝐽.
Using the sum rule and the constant multiple rule of differentiation, $\frac{\partial J}{\partial w}$ is given by:

$$\frac{\partial J}{\partial w} = \frac{1}{3}\left(\frac{\partial \operatorname{err}_1}{\partial w} + \frac{\partial \operatorname{err}_2}{\partial w} + \frac{\partial \operatorname{err}_3}{\partial w}\right)$$
The sum rule of differentiation states that the derivative of the sum of two functions
equals the sum of their derivatives: $\frac{\partial}{\partial x}[f(x) + g(x)] = \frac{\partial}{\partial x}f(x) + \frac{\partial}{\partial x}g(x)$.

The constant multiple rule of differentiation states that the derivative of a constant
multiplied by a function equals the constant times the derivative of the function:
$\frac{\partial}{\partial x}[c \cdot f(x)] = c \cdot \frac{\partial}{\partial x}f(x)$.
By applying the chain rule of differentiation, the partial derivatives of $\operatorname{err}_1$, $\operatorname{err}_2$, and $\operatorname{err}_3$ with
respect to 𝑤 are:

$$\frac{\partial \operatorname{err}_1}{\partial w} = 2d_1 \cdot 150, \qquad \frac{\partial \operatorname{err}_2}{\partial w} = 2d_2 \cdot 200, \qquad \frac{\partial \operatorname{err}_3}{\partial w} = 2d_3 \cdot 260$$
The chain rule of differentiation states that the derivative of a composite function
$f(g(x))$, written as $\frac{\partial}{\partial x}[f(g(x))]$, is the product of the derivative of $f$ with respect to $g$
and the derivative of $g$ with respect to $x$: $\frac{\partial}{\partial x}[f(g(x))] = \frac{\partial f}{\partial g} \cdot \frac{\partial g}{\partial x}$.
Then,

$$\frac{\partial J}{\partial w} = \frac{1}{3}\bigl(2(150w + b - 200) \cdot 150 + 2(200w + b - 600) \cdot 200 + 2(260w + b - 500) \cdot 260\bigr)$$

Therefore,

$$\frac{\partial J}{\partial w} = \frac{1}{3}(260200w + 1220b - 560000)$$
Similarly, we find $\frac{\partial J}{\partial b}$:

$$\frac{\partial J}{\partial b} = \frac{1}{3}\bigl(2(150w + b - 200) + 2(200w + b - 600) + 2(260w + b - 500)\bigr) = \frac{1}{3}(1220w + 6b - 2600)$$
Setting the partial derivatives to 0, which is required to locate the optimal values, results in the
following system of equations:
$$\begin{cases} \frac{1}{3}(260200w + 1220b - 560000) = 0 \\[4pt] \frac{1}{3}(1220w + 6b - 2600) = 0 \end{cases}$$
Simplifying the system and using substitution to solve for the variables gives the optimal values:
𝑤 ∗ = 2.58 and 𝑏∗ = −91.76.
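The solution is easy to verify numerically. Below is a quick sketch (using numpy; this snippet is ours, not the code from the book's wiki) that solves the simplified system:

import numpy as np

# The system from above, rewritten as A @ [w, b] = c:
#   260200*w + 1220*b = 560000
#     1220*w +    6*b =   2600
A = np.array([[260200.0, 1220.0],
              [1220.0, 6.0]])
c = np.array([560000.0, 2600.0])

w_star, b_star = np.linalg.solve(A, c)
print(round(w_star, 2), round(b_star, 2))  # 2.58 -91.76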
The resulting model 𝑓(𝑥) = 2.58𝑥 − 91.76 is shown in the plot below. It includes the three
examples (blue dots), the model itself (red solid line), and a prediction for a new house with an area
of 240 m2 (dotted orange lines).
A vertical blue dashed line shows the square root of the model’s prediction error compared to the
actual price.¹ Smaller errors mean the model fits the data better. The loss, which aggregates these
errors, measures how well the model aligns with the dataset.
When we calculate the loss using the same dataset that trained the model, the result is called the
training loss. The dataset used for training is referred to as the training dataset or training set.
For our model, the training loss is defined by Equation 1.3. Now, we can use the learned parameter
values to compute the loss for the training set:
¹ It’s the square root of the error because our error, as defined in Equation 1.2, is the square of the difference
between the predicted price and the real price of the house. It’s common practice to take the square root of the
mean squared error because it expresses the error in the same units as the target variable (price in this case). This
makes it easier to interpret the error value.
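Carrying out that computation with the rounded parameter values gives the following (a small numpy sketch of ours; the exact number depends on rounding):

import numpy as np

x = np.array([150.0, 200.0, 260.0])  # areas
y = np.array([200.0, 600.0, 500.0])  # prices in thousands
w, b = 2.58, -91.76                  # learned parameters (rounded)

training_loss = np.mean((w * x + b - y) ** 2)
print(round(training_loss, 1))  # about 15403.2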
In our example, we minimized the loss manually by solving a system of two equations with two
variables. This approach works for small systems. However, as models grow in complexity—such
as large language models with billions of parameters—the manual approach becomes infeasible. Let’s
now introduce new concepts that will help us address this challenge.
1.4. Vector
To predict a house price, knowing its area alone isn’t enough. Factors like the year of construction
or the number of bedrooms and bathrooms also matter. Suppose we use two attributes: (1) area
and (2) number of bedrooms. In this case, the input 𝐱 becomes a feature vector. This vector
includes two features, also called dimensions or components:
$$\mathbf{x} \stackrel{\text{def}}{=} \begin{bmatrix} x^{(1)} \\ x^{(2)} \end{bmatrix}$$
In this book, vectors are represented with lowercase bold letters, such as 𝐱 or 𝐰. For a given house
𝐱, $x^{(1)}$ represents its size in square meters, and $x^{(2)}$ represents the number of bedrooms. The
dimensionality of the vector, or its size, refers to the number of components it contains. Here, 𝐱
has two components, so its dimensionality is 2.
With two features, our linear model needs three parameters: the weights $w^{(1)}$ and $w^{(2)}$, and the
bias 𝑏. The weights can be grouped into a vector:

$$\mathbf{w} \stackrel{\text{def}}{=} \begin{bmatrix} w^{(1)} \\ w^{(2)} \end{bmatrix}$$
The linear model can then be written compactly as:
𝑦 = 𝐰 ⋅ 𝐱 + 𝑏, (1.4)
where 𝐰 ⋅ 𝐱 is a dot product of two vectors (also known as scalar product). It is defined as:
$$\mathbf{w} \cdot \mathbf{x} \stackrel{\text{def}}{=} \sum_{j=1}^{D} w^{(j)} x^{(j)}$$
The dot product combines two vectors of the same dimensionality to produce a scalar, a single
number like 22, 0.67, or −10.5. Scalars in this book are denoted by italic lowercase or uppercase
letters, such as 𝑥 or 𝐷.

The equation uses capital-sigma notation, where 𝐷 represents the dimensionality of the input,
and 𝑗 runs from 1 to 𝐷. For example, in the 2-dimensional house scenario,
$\sum_{j=1}^{2} w^{(j)} x^{(j)} = w^{(1)}x^{(1)} + w^{(2)}x^{(2)}$.
Although the capital-sigma notation suggests the dot product might be implemented as
a loop, modern computers handle it much more efficiently. Optimized linear algebra
libraries like BLAS and cuBLAS compute the dot product using low-level, highly
optimized methods. These libraries leverage hardware acceleration and parallel
processing, achieving speeds far beyond a simple manual loop.
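To make this concrete, here is the dot product computed both as an explicit loop over components, following the capital-sigma definition, and with numpy's BLAS-backed routine (a sketch of ours, with made-up values):

import numpy as np

w = np.array([2.0, 3.0])
x = np.array([150.0, 4.0])

# Explicit loop over components:
loop_result = sum(w_j * x_j for w_j, x_j in zip(w, x))
# Vectorized version, delegated to an optimized linear algebra routine:
vectorized_result = np.dot(w, x)

print(loop_result, vectorized_result)  # 312.0 312.0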
The sum of two vectors 𝐚 and 𝐛, both with the same dimensionality 𝐷, is defined as:
$$\mathbf{a} + \mathbf{b} \stackrel{\text{def}}{=} \bigl(a^{(1)} + b^{(1)},\ a^{(2)} + b^{(2)},\ \ldots,\ a^{(D)} + b^{(D)}\bigr)^\top$$
The element-wise product of two vectors of the same dimensionality is a vector of the products of
their corresponding components. The computation of the element-wise product for two
3-dimensional vectors is shown below:²

² In this chapter’s illustrations, the numbers in the cells indicate the position of an element within an input or output
matrix, or a vector. They do not represent actual values.
The norm of a vector 𝐱, denoted ∥ 𝐱 ∥, represents its length or magnitude. It is defined as the
square root of the sum of the squares of its components:
$$\|\mathbf{x}\| \stackrel{\text{def}}{=} \sqrt{\sum_{j=1}^{D} \bigl(x^{(j)}\bigr)^2}$$
The cosine of the angle 𝜃 between two vectors 𝐱 and 𝐲 is defined as:
$$\cos(\theta) = \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\|\,\|\mathbf{y}\|} \qquad (1.5)$$
The cosine of the angle between two vectors quantifies their similarity. For instance, two houses
with similar areas and bedroom counts will have a cosine similarity close to 1. Cosine similarity
is widely used to compare words or documents represented as embedding vectors. This will be
discussed further in Section 2.2.
A zero vector has all components equal to zero. A unit vector has a length of 1. To convert any
non-zero vector 𝐱 into a unit vector 𝐱̂, you divide the vector by its norm:
$$\hat{\mathbf{x}} = \frac{\mathbf{x}}{\|\mathbf{x}\|}$$
Dividing a vector by a number results in a new vector where each component of the original vector
is divided by that number.
A unit vector preserves the direction of the original vector but has a length of 1. The figure below
demonstrates this with 2-dimensional examples. On the left, closely aligned vectors have cos(𝜃) = 0.78.
On the right, nearly orthogonal vectors have cos(𝜃) = −0.02.
Unit vectors are valuable because their dot product equals the cosine of the angle
between them, and computing dot products is efficient. When documents are
represented as unit vectors, finding similar ones becomes fast by calculating the dot
product between the query vector and document vectors. This is how vector search
engines and libraries like Faiss, Qdrant, and Weaviate operate.
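A minimal sketch of this idea, with two made-up house vectors:

import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)  # divide by the norm to get a unit vector

a = np.array([150.0, 3.0])  # house 1: area, bedrooms
b = np.array([160.0, 3.0])  # house 2: area, bedrooms

# For unit vectors, the dot product equals the cosine of the angle:
print(np.dot(normalize(a), normalize(b)))  # close to 1: similar houses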
As dimensions increase, the number of parameters in a linear model becomes too large to solve
manually. These models also face inherent limitations—they can only fit data that follows a straight
line or its higher-dimensional analogues like planes and hyperplanes. (This problem is illustrated
in the next section.)
In high-dimensional spaces, we cannot visually verify if data follows a linear pattern. Even if we
could visualize beyond three dimensions, we would still need more flexible models to handle data
that linear models cannot fit.
The next section explores non-linear models, with a focus on neural networks—the foundation for
understanding large language models, which are a specialized neural network architecture.
1.5. Neural Network

Linear models like 𝑤𝑥 + 𝑏 or 𝐰 ⋅ 𝐱 + 𝑏 cannot solve many machine learning problems effectively.
Even if we combine them into a composite function $f_2(f_1(x))$, a composite function of linear
functions remains linear. This is straightforward to verify.
Let’s define $y_1 \stackrel{\text{def}}{=} f_1(x) = a_1 x$ and $y_2 \stackrel{\text{def}}{=} f_2(y_1) = a_2 y_1$. Here, $f_2$ depends on $f_1$, making it a
composite function. We can rewrite $f_2$ as:

$$y_2 = a_2 y_1 = a_2(a_1 x) = (a_2 a_1)x$$

Since $a_1$ and $a_2$ are constants, we can define $a_3 \stackrel{\text{def}}{=} a_1 a_2$, so $y_2 = a_3 x$, which is linear.
A straight line often fails to capture patterns in one-dimensional data, as demonstrated when linear
regression is applied to non-linear data. One way to capture such patterns is to pass the output of
the linear model through a non-linear transformation:

$$y = \phi(wx + b)$$
The function 𝜙 is a fixed non-linear function, known as an activation. Common choices are:
1) ReLU (rectified linear unit): $\operatorname{ReLU}(z) \stackrel{\text{def}}{=} \max(0, z)$, which outputs non-negative values
and is widely used in neural networks;
2) Sigmoid: $\sigma(z) \stackrel{\text{def}}{=} \frac{1}{1 + e^{-z}}$, which outputs values between 0 and 1, making it suitable for
binary classification (e.g., classifying spam emails as 1 and non-spam as 0);
3) Tanh (hyperbolic tangent): $\tanh(z) \stackrel{\text{def}}{=} \frac{e^z - e^{-z}}{e^z + e^{-z}}$, which outputs values between −1 and 1.
These functions are widely used due to their mathematical properties, simplicity, and effectiveness
in diverse applications. This is what they look like:
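Since the plot isn’t reproduced here, a short numpy sketch shows the characteristic output ranges at a few sample inputs:

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.0, 2.0])
print(relu(z))     # [0. 0. 2.]: negative inputs are clipped to 0
print(sigmoid(z))  # approximately [0.12 0.5 0.88]: squashed into (0, 1)
print(np.tanh(z))  # approximately [-0.96 0. 0.96]: squashed into (-1, 1)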
The structure 𝜙(𝑤𝑥 + 𝑏) enables learning non-linear models but can’t capture all non-linear
curves. By nesting these functions, we build more expressive models. For instance, let
$f_1(x) \stackrel{\text{def}}{=} \phi(ax + b)$ and $f_2(z) \stackrel{\text{def}}{=} \phi(cz + d)$. A composite model combining $f_1$ and $f_2$ is:

$$y = f_2(f_1(x)) = \phi\bigl(c \cdot \phi(ax + b) + d\bigr)$$
Here, the input 𝑥 is first transformed linearly using parameters 𝑎 and 𝑏, then passed through the
non-linear function 𝜙. The result is further transformed linearly with parameters 𝑐 and 𝑑, followed
by another application of 𝜙.
A computational graph represents the structure of a model. The computational graph above
shows two non-linear units (blue rectangles), often referred to as artificial neurons. Each unit
contains two trainable parameters—a weight and a bias—represented by grey circles. The left
arrow ← denotes that the value on the right is assigned to the variable on the left. This graph
illustrates a basic neural network with two layers, each containing one unit. Most neural networks
in practice are built with more layers and multiple units per layer.
Suppose we have a two-dimensional input. The input layer contains three units, while the output
layer has a single unit. The network’s structure appears as follows:
This structure represents a feedforward neural network (FNN), where information flows in one
direction—left to right—without loops. When units in each layer connect to all units in the
subsequent layer, as shown above, we call it a multilayer perceptron (MLP). A layer where each
unit connects to all units in both adjacent layers is termed a fully connected layer, or dense layer.
In Chapter 3, we will explore recurrent neural networks (RNNs). Unlike FNNs, RNNs have loops,
where outputs from a layer are used as inputs to the same layer.
To simplify diagrams, individual neural units can be replaced with squares. Using this approach,
the above network can be represented more compactly as follows:
If you think this simple model is too weak, look at the figure below. It contains three plots
demonstrating how increasing model size improves performance. The left plot shows a model with
2 units: one input, one output, and ReLU activations. The middle plot is a model with 4 units: three
inputs and one output. The right plot shows a much larger model with 100 units:
The ReLU activation function, despite its simplicity, was a breakthrough in machine
learning. Neural networks before 2012 relied on smooth activations like tanh and
sigmoid, which made training deep models increasingly difficult. We will return to this
subject in Chapter 4 on the Transformer neural network architecture.
Increasing the number of parameters helps the model approximate the data more accurately.
Experiments consistently show that adding more units per layer or increasing the number of layers
in a neural network improves its capacity to fit high-dimensional datasets, such as natural language,
voice, sound, image, and video data.
1.6. Matrix
Neural networks can handle high-dimensional datasets but require substantial memory and
computation. Calculating a layer’s transformation naïvely would involve iterating over thousands
of parameters per unit across thousands of units and dozens of layers, which is both slow and
resource-intensive. Using matrices makes the computations more efficient.
A matrix is a two-dimensional array of numbers arranged into rows and columns, which
generalizes the concept of vectors to higher dimensionalities. Formally, a matrix 𝐀 with 𝑚 rows and
𝑛 columns is written as:
$$\mathbf{A} \stackrel{\text{def}}{=} \begin{bmatrix} a_{1,1} & a_{1,2} & \cdots & a_{1,n} \\ a_{2,1} & a_{2,2} & \cdots & a_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m,1} & a_{m,2} & \cdots & a_{m,n} \end{bmatrix}$$
Here, 𝑎𝑖,𝑗 represents the element in the 𝑖-th row and 𝑗-th column of the matrix. The dimensions of
the matrix are expressed as 𝑚 × 𝑛 (read as “m by n”).
Matrices are fundamental in machine learning. They compactly represent data and weights and
enable efficient computation through operations such as addition, multiplication, and
transposition. In this book, matrices are represented with uppercase bold letters, such as 𝐗 or 𝐖.
The sum of two matrices 𝐀 and 𝐁 of the same dimensionality is defined element-wise as:
$$(\mathbf{A} + \mathbf{B})_{i,j} \stackrel{\text{def}}{=} a_{i,j} + b_{i,j}$$
For example, for two 2 × 3 matrices 𝐀 and 𝐁, the addition works like this:
Transposing a matrix 𝐀 swaps its rows and columns, resulting in $\mathbf{A}^\top$, where $(\mathbf{A}^\top)_{i,j} \stackrel{\text{def}}{=} a_{j,i}$.
Multiplying an $m \times n$ matrix 𝐀 by an $n$-dimensional vector 𝐱 produces an $m$-dimensional vector
𝐲 whose components are:

$$y_i = \sum_{j=1}^{n} a_{i,j}\, x^{(j)}$$
The weights and biases in fully connected layers of neural networks can be compactly represented
using matrices and vectors, enabling the use of highly optimized linear algebra libraries. As a result,
matrix operations form the backbone of neural network training and inference.
Let’s express the model in Figure 1.1 using matrix notation. Let 𝐱 be the 2D input feature vector.
For the first layer, the weights and biases are represented as a 3 × 2 matrix 𝐖1 and a 3D vector 𝐛1 ,
respectively. The 3D output 𝐲1 of the first layer is given by:
$$\mathbf{y}_1 = \phi(\mathbf{W}_1 \mathbf{x} + \mathbf{b}_1) \qquad (1.6)$$
The second layer also uses a weight matrix and a bias. The output 𝑦2 of the second layer is computed
using the output 𝐲1 from the first layer. The weight matrix for the second layer is a 1 × 3 matrix 𝐖2.
The bias for the second layer is a scalar $b_{2,1}$. The model output corresponds to the output of the
second layer:

$$y_2 = \phi(\mathbf{W}_2\, \mathbf{y}_1 + b_{2,1}) \qquad (1.7)$$
Equation 1.6 and Equation 1.7 capture the operations from input to output in the neural network,
with each layer’s output serving as the input for the next.
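As a sketch, here are Equations 1.6 and 1.7 in numpy, with sigmoid standing in for 𝜙 and randomly initialized parameters (both choices are ours, for illustration only):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = np.array([150.0, 3.0])    # 2D input feature vector
W1 = rng.normal(size=(3, 2))  # first layer: 3x2 weight matrix
b1 = rng.normal(size=3)       # first layer: 3D bias vector
W2 = rng.normal(size=(1, 3))  # second layer: 1x3 weight matrix
b2 = rng.normal(size=1)       # second layer: scalar bias

y1 = sigmoid(W1 @ x + b1)     # Equation 1.6
y2 = sigmoid(W2 @ y1 + b2)    # Equation 1.7
print(y2)                     # the model's scalar output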
Consider a practical example: binary classification. This task assigns input data to one of two
classes, like deciding if an email is spam or not, or detecting whether a website connection request
is a DDoS attack.
Applying the sigmoid function to the output of a linear model gives:

$$\tilde{y} = \sigma(\mathbf{w} \cdot \mathbf{x} + b) \qquad (1.8)$$

This model, called logistic regression, is commonly used for binary classification tasks. Unlike
linear regression, which produces outputs ranging from −∞ to ∞, logistic regression always
outputs values between 0 and 1. It can serve either as a standalone model or as the output layer in
a larger neural network.
Despite being over 80 years old, logistic regression remains one of the most widely
used algorithms in production machine learning systems.
A common choice for the loss function in this case is binary cross-entropy, also called logistic
loss. For a single example 𝑖, the binary cross-entropy loss is defined as:
$$\operatorname{loss}(\tilde{y}_i, y_i) \stackrel{\text{def}}{=} -\bigl[y_i \log(\tilde{y}_i) + (1 - y_i)\log(1 - \tilde{y}_i)\bigr] \qquad (1.9)$$
In this equation, 𝑦𝑖 represents the actual label of the 𝑖-th example in the dataset, and 𝑦̃𝑖 is the
prediction score, a value between 0 and 1 that the model outputs for input vector 𝐱𝑖 . The function
log denotes the natural logarithm.
Loss functions are usually designed to penalize incorrect predictions while rewarding accurate
ones. To see why logistic loss works for logistic regression, consider two extreme cases. First,
suppose the label is $y_i = 1$ and the model outputs $\tilde{y}_i = 1$; the loss reduces to $-\log(\tilde{y}_i) = -\log(1) = 0$.
Here, the loss is zero, which is good because the prediction matches the label perfectly. At the
other extreme, if the model outputs $\tilde{y}_i$ close to 0 while $y_i = 1$, the term $-\log(\tilde{y}_i)$ grows very
large, strongly penalizing the confident but wrong prediction.
For an entire dataset 𝒟, the loss is given by the average loss for all examples in the dataset:
$$\operatorname{loss}_{\mathcal{D}} \stackrel{\text{def}}{=} -\frac{1}{N}\sum_{i=1}^{N}\bigl[y_i \log(\tilde{y}_i) + (1 - y_i)\log(1 - \tilde{y}_i)\bigr] \qquad (1.10)$$
To simplify the gradient descent derivation, we’ll stick to a single example, 𝑖, and rewrite the
equation by substituting the prediction score $\tilde{y}_i$ with the model’s expression for it:

$$\operatorname{loss}(\tilde{y}_i, y_i) = -\Bigl[y_i \log\bigl(\sigma(\mathbf{w} \cdot \mathbf{x}_i + b)\bigr) + (1 - y_i)\log\bigl(1 - \sigma(\mathbf{w} \cdot \mathbf{x}_i + b)\bigr)\Bigr]$$
To minimize $\operatorname{loss}(\tilde{y}_i, y_i)$, we calculate the partial derivatives with respect to each weight $w^{(j)}$ and
the bias 𝑏. We will use the chain rule because we have a composition of three functions:

• Function 1: $z_i \stackrel{\text{def}}{=} \mathbf{w} \cdot \mathbf{x}_i + b$, a linear function involving the weights 𝐰 and the bias 𝑏;
• Function 2: $\tilde{y}_i \stackrel{\text{def}}{=} \sigma(z_i) = \frac{1}{1 + e^{-z_i}}$, the sigmoid function applied to $z_i$;
• Function 3: $\operatorname{loss}(\tilde{y}_i, y_i)$, as defined in Equation 1.9, which depends on $\tilde{y}_i$.
Notice that 𝐱𝑖 and 𝑦𝑖 are given: 𝐱𝑖 is the feature vector for example 𝑖, and 𝑦𝑖 ∈ {0,1} is
its label. The notation 𝑦𝑖 ∈ {0,1} means that 𝑦𝑖 belongs to the set {0,1} and, in this case,
indicates that 𝑦𝑖 can only be 0 or 1.
Let’s denote $\operatorname{loss}(\tilde{y}_i, y_i)$ as $l_i$. For weights $w^{(j)}$, the application of the chain rule gives us:

$$\frac{\partial l_i}{\partial w^{(j)}} = \frac{\partial l_i}{\partial \tilde{y}_i} \cdot \frac{\partial \tilde{y}_i}{\partial z_i} \cdot \frac{\partial z_i}{\partial w^{(j)}} = (\tilde{y}_i - y_i)\, x_i^{(j)}$$
This is where the beauty of machine learning math truly shines: the activation
function (sigmoid) and the loss function (cross-entropy) both arise from 𝑒, Euler’s
number. Their functional properties serve distinct purposes: sigmoid ranges between
0 and 1, ideal for binary classification, while cross-entropy spans from 0 to ∞, perfect
for penalizing confidently wrong predictions.
The partial derivatives with respect to $w^{(j)}$ and 𝑏 for a single example $(\mathbf{x}_i, y_i)$ can be extended to
the entire dataset $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$ by summing the contributions from all examples and averaging them.
This follows from the sum rule and the constant multiple rule of differentiation:
$$\frac{\partial \operatorname{loss}}{\partial w^{(j)}} = \frac{1}{N}\sum_{i=1}^{N}\bigl[(\tilde{y}_i - y_i)\, x_i^{(j)}\bigr], \qquad \frac{\partial \operatorname{loss}}{\partial b} = \frac{1}{N}\sum_{i=1}^{N}\bigl[\tilde{y}_i - y_i\bigr] \qquad (1.11)$$
Averaging the losses for individual examples ensures that each example contributes equally to the
overall loss, regardless of the total number of examples.
The gradient is a vector that contains all the partial derivatives. The gradient of the loss function,
denoted as $\nabla \operatorname{loss}$, is defined as follows:

$$\nabla \operatorname{loss} \stackrel{\text{def}}{=} \left[\frac{\partial \operatorname{loss}}{\partial w^{(1)}}, \ldots, \frac{\partial \operatorname{loss}}{\partial w^{(D)}}, \frac{\partial \operatorname{loss}}{\partial b}\right]^\top$$
The gradient descent algorithm uses the gradient of the loss function to iteratively update the
weights and bias, aiming to minimize the loss function. Here’s how it operates:
1. Compute the predictions: For each training example (𝐱𝑖 , 𝑦𝑖 ), compute the predicted
value 𝑦̃𝑖 using the model:
𝑦̃𝑖 ← 𝜎(𝐰 ⋅ 𝐱𝑖 + 𝑏)
2. Compute the gradient: Calculate the partial derivatives of the loss function with respect
to each weight 𝑤 (𝑗) and the bias 𝑏 using Equation 1.11.
3. Update the weights and bias: Adjust the weights and bias in the direction that
decreases the loss function. This adjustment involves taking a small step in the opposite
direction of the gradient. The step size is controlled by the learning rate 𝜂 (explained
below):
$$w^{(j)} \leftarrow w^{(j)} - \eta\,\frac{\partial \operatorname{loss}}{\partial w^{(j)}}, \qquad b \leftarrow b - \eta\,\frac{\partial \operatorname{loss}}{\partial b}$$
4. Calculate the loss: Calculate the logistic loss by substituting the updated values of 𝑤 (𝑗)
and 𝑏 into Equation 1.10.
5. Continue the iterative process: Repeat steps 1–4 for a set number of iterations (also
called steps) or until the loss value converges to a minimum.
• Gradients are subtracted from parameters because they point in the direction of steepest
ascent in the loss function. Since our goal is to minimize loss, we move in the opposite
direction—hence, the subtraction.
• The learning rate 𝜂 is a positive value close to 0 and serves as a hyperparameter—not
learned by the model but set manually. It controls the step size of each update, and
finding its optimal value requires experimentation.
• Convergence occurs when subsequent iterations yield minimal decreases in loss. The
learning rate 𝜂 is crucial here: too small, and progress crawls; too large, and we risk
overshooting the minimum or even seeing the loss increase rather than decrease.
Choosing an appropriate 𝜂 is therefore essential for effective gradient descent.
To see these steps in action, consider the following dataset:

$$\bigl\{\,((22, 25), 0),\ ((25, 35), 0),\ ((47, 80), 1),\ ((52, 95), 1),\ ((46, 82), 1),\ ((56, 90), 1),\ ((23, 27), 0),\ ((30, 50), 1),\ ((40, 60), 1),\ ((39, 57), 0),\ ((53, 95), 1),\ ((48, 88), 1)\,\bigr\}$$
In this dataset, 𝐱 𝑖 contains two features: age (in years) and income (in thousands of dollars). The
objective is to predict whether a person will buy a product, with label 𝑦𝑖 being either 0 (will not
buy) or 1 (will buy).
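Before handing this job to a framework, the five steps above can be sketched from scratch in numpy using Equation 1.11 (a sketch of ours; the learning rate and step count are arbitrary choices, and loss tracking is omitted for brevity):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[22, 25], [25, 35], [47, 80], [52, 95], [46, 82], [56, 90],
              [23, 27], [30, 50], [40, 60], [39, 57], [53, 95], [48, 88]],
             dtype=float)
y = np.array([0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1], dtype=float)

w, b, eta = np.zeros(2), 0.0, 0.001
for step in range(10_000):
    y_tilde = sigmoid(X @ w + b)           # step 1: compute the predictions
    grad_w = X.T @ (y_tilde - y) / len(y)  # step 2: Equation 1.11 for the weights
    grad_b = np.mean(y_tilde - y)          # step 2: Equation 1.11 for the bias
    w -= eta * grad_w                      # step 3: update the weights
    b -= eta * grad_b                      # step 3: update the bias

print(w, b)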
The loss evolution across gradient descent steps and the resulting trained model are shown in the
figure below:
The left plot shows the loss decreasing steadily during gradient descent optimization. The right plot
displays the trained model’s sigmoid function, with training examples positioned by their z-values
(𝑧𝑖 = 𝐰 ∗ ⋅ 𝐱𝑖 + 𝑏 ∗ ), where 𝐰 ∗ and 𝑏 ∗ are the learned weights and bias.
The 0.5 threshold was chosen based on the plot’s clear separation: all “will-buy” examples (blue
dots) lie above it, while all “will-not-buy” examples (red dots) fall below. For new inputs 𝐱, predict
using 𝑦̃ = 𝜎(𝐰 ∗ ⋅ 𝐱 + 𝑏 ∗ ). If 𝑦̃ < 0.5, predict “will not buy;” otherwise, “will buy.”
This is where automatic differentiation (or autograd) comes in. Built into machine learning
frameworks like PyTorch and TensorFlow, this feature computes partial derivatives directly from
model-defining Python code. This eliminates manual derivation, even for the most sophisticated
models.
To use gradient descent in PyTorch, first install it with pip3:

pip3 install torch

Then, import the required modules:

import torch
import torch.nn as nn
import torch.optim as optim
The torch.nn module contains building blocks for creating models. When you use these
components, PyTorch automatically handles derivative calculations. For optimization algorithms
like gradient descent, the torch.optim module has what you need. Here’s how to implement
logistic regression in PyTorch:
model = nn.Sequential(
    nn.Linear(n_inputs, n_outputs),  ➊
    nn.Sigmoid()  ➋
)
Our model leverages PyTorch’s sequential API, which is well-suited for simple feedforward neural
networks where data flows sequentially through layers. Each layer’s output naturally becomes the
input for the subsequent layer. The more versatile module API, which we’ll cover in the next
chapter, enables the creation of models with multiple inputs, outputs, or loops.
The input layer, defined in line ➊ using nn.Linear, has input dimensionality (n_inputs) matching
the size of our feature vector 𝐱, while the output dimensionality (n_outputs) determines the
layer’s unit count. For our buy/no-buy classifier—a model assigning classes to inputs—we set
n_inputs to 2 since $\mathbf{x} = [x^{(1)}, x^{(2)}]^\top$. With the output 𝑧 being scalar, n_outputs becomes 1. Line
➋ transforms 𝑧 through the sigmoid function to produce the output score.
We then proceed to define our dataset, create the model instance, establish the binary cross-
entropy loss function, and set up the gradient descent algorithm:
inputs = torch.tensor([
    [22, 25], [25, 35], [47, 80], [52, 95], [46, 82], [56, 90],
    [23, 27], [30, 50], [40, 60], [39, 57], [53, 95], [48, 88]
], dtype=torch.float32)  ➊
labels = torch.tensor([
    [0], [0], [1], [1], [1], [1], [0], [1], [1], [0], [1], [1]
], dtype=torch.float32)  ➋
model = nn.Sequential(
    nn.Linear(inputs.shape[1], 1),
    nn.Sigmoid()
)
optimizer = optim.SGD(model.parameters(), lr=0.001)
criterion = nn.BCELoss()  # binary cross-entropy loss
In the above code block, we defined inputs and labels. The inputs form a matrix with 12 rows
and 2 columns, while the labels form a vector with 12 components. The shape attribute of the
inputs tensor returns its dimensionality:
>>> inputs.shape
torch.Size([12, 2])
Tensors are PyTorch’s core data structures—multi-dimensional arrays optimized for computation
on both CPU and GPU. Supporting automatic differentiation and flexible data reshaping, tensors
form the foundation for neural network operations. In our example, the inputs tensor contains 12
examples with 2 features each, while the labels tensor holds 12 examples with single labels.
Following standard convention, examples are arranged in rows and their features in columns.
If you’re not familiar with tensors, there’s an introductory chapter on tensors available
on the book’s wiki.
When creating tensors in PyTorch, specifying dtype=torch.float32 in lines ➊ and ➋ sets 32-bit
floating-point precision explicitly. This precision setting is essential for neural network
computations, including weight adjustments, activation functions, and gradient calculations.
The 32-bit floating-point precision is not the only option for neural networks.
Quantization, an advanced technique that uses lower-precision data types like 16-bit
or 8-bit floats and integers, helps reduce model size and improve computational
efficiency. For more information, refer to resources on model optimization and
deployment available on the book’s wiki.
The optim.SGD class implements gradient descent by taking a list of model parameters and a
learning rate as inputs.³ Since our model is an nn.Module, we can access all trainable
parameters through its parameters method.
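Training then runs as a loop over gradient descent steps. The numbered lines referenced below follow this minimal sketch (the number of steps is an arbitrary choice):

for step in range(1000):
    optimizer.zero_grad()              ➊
    outputs = model(inputs)
    loss = criterion(outputs, labels)  ➋
    loss.backward()                    ➌
    optimizer.step()                   ➍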
Line ➋ calculates the binary cross-entropy loss (Equation 1.10) by evaluating model predictions
against training labels. Line ➌ then uses backpropagation to compute the gradient of this loss with
respect to the model parameters.
Backpropagation applies differentiation rules, particularly the chain rule, to compute gradients
through deep composite functions. This algorithm forms the backbone of neural network training.
When PyTorch operates on tensors, it builds a computational graph as shown in Figure 1.1 from
Section 1.5. This graph tracks all operations performed on the tensors. The loss.backward() call
prompts PyTorch to traverse this graph and compute gradients via the chain rule, eliminating the
need for manual gradient derivation and implementation.
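A tiny sketch makes this concrete:

import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 3    # forward pass: PyTorch records the operations in a graph
y.backward()  # backward pass: gradients flow back via the chain rule
print(x.grad) # tensor(12.) since dy/dx = 3 * x**2 = 12 at x = 2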
The flow of data from input to output through the computational graph constitutes the forward
pass, while the computation of gradients from output to input through backpropagation represents
the backward pass.
PyTorch accumulates gradients in the .grad attribute of parameters like weights and
biases. While this feature enables multiple gradient computations before parameter
updates—useful for recurrent neural networks (covered in Chapter 3)—our
implementation doesn’t require gradient accumulation. Line ➊ therefore clears the
gradients at each step’s beginning.
³ While 0.001 is a common default learning rate, optimal values vary by problem and dataset. Finding the best rate
involves systematically testing different values and comparing model performance.
Finally, in line ➍, parameter values are updated by subtracting the product of the learning rate and
the loss function’s partial derivatives, completing step 3 of the gradient descent algorithm
discussed earlier.
One of automatic differentiation’s key advantages is its flexibility with model switching—as long as
you’re using PyTorch’s components, you can readily swap between different architectures. For
instance, you could replace logistic regression with a basic two-layer FNN, defined through the
sequential API:
model = nn.Sequential(
    nn.Linear(inputs.shape[1], 100),
    nn.Sigmoid(),
    nn.Linear(100, labels.shape[1]),
    nn.Sigmoid()
)
In this setup, each of the 100 units in the first layer contains 2 weights and 1 bias, while the output
layer’s single unit has 100 weights and 1 bias. The automatic differentiation system handles
gradient computation internally, so the remaining code stays unchanged.
The next chapter examines how to represent and process text data, beginning with fundamental
techniques for converting documents into numerical representations like bag-of-words and word
embeddings, followed by exploring the count-based language modeling approach.