“Andriy has this almost supernatural talent for shrinking epic AI concepts down to bite-sized, ‘Ah, now I get it!’ moments.”
― Jorge Torres, CEO at MindsDB
“Andriy paints for us, in 100 marvelous strokes, the journey from linear algebra basics to the implementation of transformers.”
― Florian Douetteau, Co-founder and CEO at Dataiku
Featuring a foreword by Tomáš Mikolov and back cover text by Vint Cerf
The Hundred-Page Language Models Book
Andriy Burkov
Copyright © 2025 Andriy Burkov. All rights reserved.
1. Read First, Buy Later: You are welcome to freely read and share this book with others by
preserving this copyright notice. However, if you find the book valuable or continue to use it,
you must purchase your own copy. This ensures fairness and supports the author.
2. No Unauthorized Use: No part of this work—its text, structure, or derivatives—may be used
to train artificial intelligence or machine learning models, nor to generate any content on
websites, apps, or other services, without the author’s explicit written consent. This restriction
applies to all forms of automated or algorithmic processing.
3. Permission Required: If you operate any website, app, or service and wish to use any portion of this work for the purposes mentioned above—or for any other use beyond personal reading—you must first obtain the author’s explicit written permission. No exceptions or implied licenses are granted.
4. Enforcement: Any violation of these terms is copyright infringement. It may be pursued legally in any jurisdiction. By reading or distributing this book, you agree to abide by these conditions.
ISBN 978-1-7780427-2-0
Publisher: True Positive Inc.
To my family, with love
“Language is the source of misunderstandings.”
―Antoine de Saint-Exupéry, The Little Prince
“In mathematics you don't understand things. You just get used to them.”
―John von Neumann
Foreword
The first time I got involved in language modeling was already two decades ago. I wanted to improve some of my data compression algorithms and found out about n-gram statistics. A very simple concept, but so hard to beat! Then I quickly gained another motivation—since my childhood, I had been interested in artificial intelligence. I had a vision of machines that would understand patterns in our world that are hidden from our limited minds. It would be so exciting to talk with such a superintelligence. And I realized that language modeling could be a way towards such AI.

I started searching for others sharing this vision and found the works of Solomonoff and Schmidhuber, and the Hutter Prize competition organized by Matt Mahoney. They all wrote about the AI-completeness of language modeling, and I knew I had to try to make it work. But the world was very different than it is today. Language modeling was considered a dead research direction, and I heard countless times that I should give up, as nothing would ever beat n-grams on large data.
I completed my master's thesis on neural language models, as these models were quite similar to what I had previously developed for data compression, and I believed that distributed representations, which could be applied to any language, were the right way to go. This infuriated a local linguist, who declared my ideas to be total nonsense, as language modeling had to be addressed from the linguistics point of view and each language had to be treated differently.

However, I did not give up and continued working on my vision of AI-complete language models. Just the summer before starting my PhD, I came up with the idea to generate text from these neural models. I was amazed by how much better this text was than text generated from n-gram models. That was the summer of 2007, and I quickly realized that the only person excited about this at the Brno University of Technology was actually me. But I did not give up anyway.
In the following years, I developed a number of algorithms to make neural language models more useful. To convince others of their qualities, I published the open-source toolkit RNNLM in 2010. It had the first-ever implementations of neural text generation, gradient clipping, dynamic evaluation, model adaptation (nowadays called fine-tuning), and other tricks such as hierarchical softmax and splitting infrequent words into subword units. However, the result I was most proud of was when I could demonstrate in my PhD thesis that neural language models not only beat n-grams on large datasets—something widely considered to be impossible at the time—but that the improvements actually kept increasing with the amount of training data. This happened for the first time after something like fifty years of language modeling research, and I still remember the disbelief on the faces of famous researchers when I showed them my work.
Fast forward some fifteen years, and I'm amazed by how much the world has changed. The mindset has completely flipped—what used to be an obscure technology in a dead research direction is now thriving and gets the attention of the CEOs of the largest companies in the world. Language models are everywhere today. With all this hype, I think it is more important than ever to actually understand this technology.

Young students who want to learn about language modeling are flooded with information. Thus, I was delighted when I learned about Andriy's project to write a short book of only one hundred pages that would cover some of the most important ideas. I think the book is a good start for anyone new to language modeling who aspires to improve on the state of the art—and if someone tells you that everything that could be invented in language modeling has already been discovered, don't believe it.
Preface
My interest in text began in the late 1990s during my teenage years, building dynamic websites
using Perl and HTML. This early experience with coding and organizing text into structured formats sparked my fascination with how text could be processed and transformed. Over the years, I
advanced to building web scrapers and text aggregators, developing systems to extract structured
data from webpages. The challenge of processing and understanding text led me to explore more
complex applications, including designing chatbots that could understand and address user needs.
The challenge of extracting meaning from words intrigued me. The complexity of the task only
fueled my determination to “crack” it, using every tool at my disposal—ranging from regular expressions and scripting languages to text classifiers and named entity recognition models.
The rise of large language models (LLMs) transformed everything. For the first time, computers
could converse with us fluently and follow verbal instructions with remarkable precision. However,
like any tool, their immense power comes with limitations. Some are easy to spot, but others are
more subtle, requiring deep expertise to handle properly. Attempting to build a skyscraper without
fully understanding your tools will only result in a pile of concrete and steel. The same holds true
for language models. Approaching large-scale text processing tasks or creating reliable products
for paying users requires precision and knowledge—guesswork simply isn’t an option.
• Production deployment: Topics like model serving, API development, scaling for high
traffic, monitoring, and cost optimization are not covered. The code examples focus on understanding the concepts rather than production readiness.
• Enterprise applications: This book won’t guide you through building commercial LLM
applications, handling user data, or integrating with existing systems.
Book Structure
To make this book engaging and to deepen the reader’s understanding, I decided to discuss language modeling as a whole, including approaches that are often overlooked in modern literature. While Transformer-based LLMs dominate the spotlight, earlier approaches like count-based methods and recurrent neural networks (RNNs) remain effective for some tasks.
Learning the math of the Transformer architecture may seem overwhelming for someone starting from scratch. By revisiting these foundational methods, my goal is to gradually build up the reader’s intuition and mathematical understanding, making the transition to modern Transformer architectures feel like a natural progression rather than an intimidating leap.
The book is divided into six chapters, progressing from fundamentals to advanced topics:
• Chapter 1 covers machine learning basics, including key concepts like AI, models, neural networks, and gradient descent. Even if you’re familiar with these topics, the chapter provides important foundations for understanding language models.
• Chapter 2 introduces language modeling fundamentals, exploring text representation methods like bag of words and word embeddings, as well as count-based language models and evaluation techniques.
• Chapter 3 focuses on recurrent neural networks, covering their implementation, training, and application as language models.
• Chapter 4 provides a detailed exploration of the Transformer architecture, including key components like self-attention, position embeddings, and practical implementation.
• Chapter 5 examines large language models (LLMs), discussing why scale matters, fine-tuning techniques, practical applications, and important considerations around hallucinations, copyright, and ethics.
• Chapter 6 concludes with further reading on advanced topics like mixture of experts, model compression, preference-based alignment, and vision language models, providing direction for continued learning.
Most chapters contain working code examples you can run and modify. While only essential code
appears in the book, complete code is available as Jupyter notebooks on the book’s website, with
notebooks referenced in relevant sections. All code in notebooks remains compatible with the latest
stable versions of Python, PyTorch, and other libraries.
The notebooks run on Google Colab, which at the time of writing offers free access to computing
resources including GPUs and TPUs. These resources, though, aren’t guaranteed and have usage
limits that may vary. Some examples might require extended GPU access, potentially involving
wait times for availability. If the free tier proves limiting, Colab’s pay-as-you-go option lets you
purchase compute credits for reliable GPU access. While these credits are relatively affordable by
North American standards, costs may be significant depending on your location.
For those familiar with the Linux command line, GPU cloud services provide another option
through pay-per-time virtual machines with one or more GPUs. The book’s wiki maintains current
information on free and paid notebook or GPU rental services.
Verbatim terms and blocks indicate code, code fragments, or code execution outputs. Bold terms
link to the book’s term index, and occasionally highlight algorithm steps.
In this book, we use pip3 to ensure the packages are installed for Python 3. On most modern
systems, you can use pip instead if it's already set up for Python 3.
Acknowledgements
The high quality of this book would be impossible without volunteer editors. I especially thank
Erman Sert, Viet Hoang Tran Duong, Alex Sherstinsky, Kelvin Sundli, and Mladen Korunoski for
their systematic contributions.
I am also grateful to Alireza Bayat Makou, Taras Shalaiko, Domenico Siciliani, Preethi Raju, Srikumar Sundareshwar, Mathieu Nayrolles, Abhijit Kumar, Giorgio Mantovani, Abhinav Jain, Steven Finkelstein, Ryan Gaughan, Ankita Guha, Harmanan Kohli, Daniel Gross, Kea Kohv, Marcus Oliveira, Tracey Mercier, Prabin Kumar Nayak, Saptarshi Datta, Gurgen R. Hayrapetyan, Sina Abdidizaji, Federico Raimondi Cominesi, Santos Salinas, Anshul Kumar, Arash Mirbagheri, Roman Stanek, Jeremy Nguyen, Efim Shuf, and Manoj Pillai for their help.
If this is your first time exploring language models, I envy you a little—it’s truly magical to discover
how machines learn to understand the world through natural language.
I hope you enjoy reading this book as much as I enjoyed writing it.
Now grab your tea or coffee, and let’s begin!
Chapter 1. Machine Learning Basics
This chapter starts with a brief overview of how artificial intelligence has evolved, explains what a
machine learning model is, and presents the four steps of the machine learning process. Then, it
covers some math basics like vectors and matrices, introduces neural networks, and wraps up with
optimization methods like gradient descent and automatic differentiation.
Between 1975 and 1980, and again between 1987 and 2000, AI went through two “winters” where enthusiasm and funding dropped. Research outcomes did not meet high hopes set by initial successes, so investors and policymakers lost confidence. Many projects were halted or slowed down, leading to a significant decline in AI research and development across academia and industry.
During the first AI winter, even the term “AI” became somewhat taboo. Many researchers rebranded their work as “informatics,” “knowledge-based systems,” or “pattern recognition” to avoid association with AI’s perceived failures.
Enthusiasm for AI has grown steadily since the early 1990s. Interest surged around 2012, particularly in machine learning, driven by advances in computational power, access to large datasets, and improvements in neural network algorithms and frameworks. These developments led to increased funding and a significant AI boom.
Although the focus of artificial intelligence research has evolved, the core goal remains the same:
to create methods that enable machines to solve problems previously considered solvable only by
humans. This is how the term will be used throughout this book.
The term machine learning was introduced in 1959 by Arthur Samuel. In his paper, “Some Studies
in Machine Learning Using the Game of Checkers,” he described it as “programming computers to
learn from experience.”
Early AI researchers primarily focused on symbolic methods and rule-based systems—an approach
later dubbed good old-fashioned AI (GOFAI)—but over time, the field increasingly embraced
machine learning approaches, with neural networks emerging as a particularly powerful technique.
Neural networks, inspired by the brain, aimed to learn patterns directly from examples. One pioneering model, together with an algorithm to train it, the Perceptron, was introduced by Frank Rosenblatt in 1958. It became a key step toward later advancements. The Perceptron defines a decision boundary, a line that separates examples of two classes (e.g., spam and not spam): examples falling on one side of the line are assigned to one class, and those on the other side to the other class.
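To make the idea concrete, here is a minimal sketch (not from the book) of how a Perceptron-style classifier produces a prediction; the feature values, weights, and bias are invented purely for illustration:

    # A Perceptron-style prediction: compute a weighted sum of the features plus a
    # bias, and assign the class based on which side of the decision boundary the
    # example falls on. In practice, the weights and bias are learned from data.
    def perceptron_predict(features, weights, bias):
        weighted_sum = sum(w * x for w, x in zip(weights, features)) + bias
        return 1 if weighted_sum >= 0 else 0  # 1 = "spam", 0 = "not spam"

    # Two made-up emails described by [number of links, number of exclamation marks]
    print(perceptron_predict([7, 12], weights=[0.9, 0.4], bias=-5.0))  # prints 1
    print(perceptron_predict([1, 0], weights=[0.9, 0.4], bias=-5.0))   # prints 0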
Decision trees and random forests represent important evolutionary steps in machine learning.
Decision trees, introduced in 1963 by John Sonquist and James Morgan and later advanced by
Ross Quinlan’s ID3 algorithm in 1986, split data into subsets through a tree-like structure. Each
node represents a question about the data, each branch is an answer, and each leaf provides a
prediction. While these models are easy to understand, they can struggle with overfitting, where
they adapt too closely to training data, reducing their ability to perform well on new, unseen data.
To address this limitation, Leo Breiman introduced the random forest algorithm in 2001. A random forest builds multiple decision trees using random subsets of data and combines their outputs. This approach improves predictive accuracy and reduces overfitting. Random forests remain widely used for their reliability and performance.
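As a brief aside (not part of the book's code), here is a minimal sketch of training a random forest with scikit-learn, assuming the library is installed; the tiny dataset is invented for illustration only:

    # A random forest: an ensemble of decision trees, each trained on a random
    # subset of the data, whose predictions are combined by majority vote.
    from sklearn.ensemble import RandomForestClassifier

    # Invented toy data: [area in m^2, bedrooms] -> 1 if "expensive", else 0
    X = [[150, 2], [200, 3], [260, 4], [80, 1], [120, 2], [300, 5]]
    y = [0, 1, 1, 0, 0, 1]

    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X, y)
    print(model.predict([[240, 3]]))  # e.g. [1]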
Support vector machines (SVMs), introduced in 1992 by Vladimir Vapnik and his colleagues,
were another significant step forward. SVMs identify the optimal hyperplane that separates data
points of different classes with the widest margin. The introduction of kernel methods allowed
SVMs to manage complex, non-linear patterns by mapping data into higher-dimensional spaces,
making it easier to find a suitable separating hyperplane. These advances made SVMs central to
machine learning research.
Today, machine learning is a subfield of AI focused on creating algorithms that learn from collections of examples. These examples can come from nature, be designed by humans, or be generated
by other algorithms. The process involves gathering a dataset and building a model from it, which
is then used to solve a problem.
1.2. Model
A model is typically represented by a mathematical equation:
𝑦 = 𝑓(𝑥)
Here, 𝑥 is the input, 𝑦 is the output, and 𝑓 represents a function of 𝑥. A function is a named rule
that describes how one set of values is related to another. Formally, a function 𝑓 maps inputs from
the domain to outputs in the codomain, ensuring each input has exactly one output. The function
uses a specific rule or formula to transform the input into the output.
In machine learning, the goal is to compile a dataset of examples and use them to build 𝑓, so
when 𝑓 is applied to a new, unseen 𝑥, it produces a 𝑦 that gives meaningful insight into 𝑥.
To estimate a house’s price based on its area, the dataset might include (area, price) pairs such as {(150, 200), (200, 600), …}. Here, the area is measured in m², and the price is in thousands of dollars.
Curly brackets denote a set. A set containing 𝑁 elements, ranging from 𝑥₁ to 𝑥_𝑁, is expressed as {𝑥ᵢ}_{𝑖=1}^𝑁.
Imagine we own a house with an area of 250 m² (about 2691 square feet). To find a function 𝑓 that returns a reasonable price for this house, testing every possible function is infeasible. Instead, we select a specific structure for 𝑓 and focus on functions that match this structure.
Let’s define the structure for 𝑓 as:

𝑓(𝑥) ≝ 𝑤𝑥 + 𝑏,    (1.1)

which describes a linear function of 𝑥. The formula 𝑤𝑥 + 𝑏 is a linear transformation of 𝑥.
The notation ≝ means “equals by definition” or “is defined as.”
For linear functions, determining 𝑓 requires only two values: 𝑤 and 𝑏. These are called the parameters or weights of the model.
In other texts, 𝑤 might be referred to as the slope, coefficient, or weight term. Similarly, 𝑏 may be called the intercept, constant term, or bias. In this book, we’ll stick to “weight” for 𝑤 and “bias” for 𝑏, as these terms are widely used in machine learning. When the meaning is clear, “parameters” and “weights” will be used interchangeably.
For instance, when 𝑤 = 2/3 and 𝑏 = 1, the linear function is shown below:
Here, the bias shifts the graph vertically, so the line crosses the 𝑦-axis at 𝑦 = 1. The weight determines the slope, meaning the line rises by 2 units for every 3 units it moves to the right.
Even with a simple model like 𝑓(𝑥) = 𝑤𝑥 + 𝑏, the parameters 𝑤 and 𝑏 can take infinitely many
values. To find the best ones, we need a way to measure optimality. A natural choice is to minimize
the average prediction error when estimating house prices from area. Specifically, we want 𝑓(𝑥) =
𝑤𝑥 + 𝑏 to generate predictions that match the actual prices as closely as possible.
Let our dataset be {(𝑥ᵢ, 𝑦ᵢ)}_{𝑖=1}^𝑁, where 𝑁 is the size of the dataset and {(𝑥₁, 𝑦₁), (𝑥₂, 𝑦₂), …, (𝑥_𝑁, 𝑦_𝑁)} are individual examples, with each 𝑥ᵢ being the input and the corresponding 𝑦ᵢ being the target.
When examples contain both inputs and targets, the learning process is called supervised. This
book focuses on supervised machine learning.
Other machine learning types include unsupervised learning, where models learn pat-
terns from inputs alone, and reinforcement learning, where models learn by interact-
ing with environments and receiving rewards or penalties for their actions.
When 𝑓(𝑥) is applied to 𝑥ᵢ, it generates a predicted value ŷᵢ. We can define the prediction error err(ŷᵢ, 𝑦ᵢ) for a given example (𝑥ᵢ, 𝑦ᵢ) as:

err(ŷᵢ, 𝑦ᵢ) ≝ (ŷᵢ − 𝑦ᵢ)²    (1.2)

This expression, called squared error, equals 0 when ŷᵢ = 𝑦ᵢ. This makes sense: there is no error if the predicted price matches the actual price. The further ŷᵢ deviates from 𝑦ᵢ, the larger the error becomes. Squaring ensures the error is always positive, whether the prediction overshoots or undershoots.
We define 𝑤* and 𝑏* as the optimal parameter values for 𝑤 and 𝑏 in our function 𝑓: the values that minimize the average price prediction error across our dataset. This error is calculated using the following expression:

(err(ŷ₁, 𝑦₁) + err(ŷ₂, 𝑦₂) + ⋯ + err(ŷ_𝑁, 𝑦_𝑁)) / 𝑁

Let’s rewrite the above expression by expanding each err(⋅):

((ŷ₁ − 𝑦₁)² + (ŷ₂ − 𝑦₂)² + ⋯ + (ŷ_𝑁 − 𝑦_𝑁)²) / 𝑁

Let’s assign the name 𝐽(𝑤, 𝑏) to our expression, turning it into a function. For our three-example dataset {(150, 200), (200, 600), (260, 500)}, with ŷᵢ = 𝑤𝑥ᵢ + 𝑏, this gives:

𝐽(𝑤, 𝑏) ≝ (1/3)[(150𝑤 + 𝑏 − 200)² + (200𝑤 + 𝑏 − 600)² + (260𝑤 + 𝑏 − 500)²]    (1.3)

Let’s plot it:
Navigate to the book’s wiki and, from the file thelmbook.com/py/1.1, retrieve the code used to generate the plot of 𝐽(𝑤, 𝑏). Run the code and rotate the resulting graph to observe the minimum.
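If you want to check the numbers by hand, here is a small sketch (separate from the book's thelmbook.com/py/1.1 notebook) that computes 𝐽(𝑤, 𝑏) for the three training examples:

    # Mean squared error loss J(w, b) over the three training examples.
    # Areas are in m^2 and prices in thousands of dollars, as in the running example.
    dataset = [(150, 200), (200, 600), (260, 500)]

    def J(w, b):
        return sum((w * x + b - y) ** 2 for x, y in dataset) / len(dataset)

    print(J(0.0, 0.0))      # loss for an arbitrary starting point
    print(J(2.58, -91.76))  # loss for the optimal parameters found below (~15403.19)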
Now we need to derive the expressions for ∂𝐽/∂𝑤 and ∂𝐽/∂𝑏. Notice that 𝐽(𝑤, 𝑏) is a composition of the following functions:
• Functions 𝑑₁ ≝ 150𝑤 + 𝑏 − 200, 𝑑₂ ≝ 200𝑤 + 𝑏 − 600, 𝑑₃ ≝ 260𝑤 + 𝑏 − 500 are linear functions of 𝑤 and 𝑏;
• Functions err₁ ≝ 𝑑₁², err₂ ≝ 𝑑₂², err₃ ≝ 𝑑₃² are quadratic functions of 𝑑₁, 𝑑₂, and 𝑑₃;
• Function 𝐽 ≝ (1/3)(err₁ + err₂ + err₃) is a linear function of err₁, err₂, and err₃.
A composition of functions means the output of one function becomes the input to another. For example, with two functions 𝑓 and 𝑔, you first apply 𝑔 to 𝑥, then apply 𝑓 to the result. This is written as 𝑓(𝑔(𝑥)), which means you calculate 𝑔(𝑥) first and then use that result as the input for 𝑓.
In our loss function 𝐽(𝑤, 𝑏), the process starts by computing the linear functions for 𝑑₁, 𝑑₂, and 𝑑₃ using the current values of 𝑤 and 𝑏. These outputs are then passed into the quadratic functions err₁, err₂, and err₃. The final step is averaging these results to compute 𝐽.
Using the sum rule and the constant multiple rule of differentiation, ∂𝐽/∂𝑤 is given by:

∂𝐽/∂𝑤 = (1/3)(∂err₁/∂𝑤 + ∂err₂/∂𝑤 + ∂err₃/∂𝑤),

where ∂err₁/∂𝑤, ∂err₂/∂𝑤, and ∂err₃/∂𝑤 are the partial derivatives of err₁, err₂, and err₃ with respect to 𝑤.
The sum rule of differentiation states that the derivative of the sum of two functions equals the sum of their derivatives: d/d𝑥 [𝑓(𝑥) + 𝑔(𝑥)] = d/d𝑥 𝑓(𝑥) + d/d𝑥 𝑔(𝑥).
The constant multiple rule of differentiation states that the derivative of a constant multiplied by a function equals the constant times the derivative of the function: d/d𝑥 [𝑐 ⋅ 𝑓(𝑥)] = 𝑐 ⋅ d/d𝑥 𝑓(𝑥).
By applying the chain rule of differentiation, the partial derivatives of err₁, err₂, and err₃ with respect to 𝑤 are:

∂err₁/∂𝑤 = 2𝑑₁ ⋅ ∂𝑑₁/∂𝑤 = 2 ⋅ (150𝑤 + 𝑏 − 200) ⋅ 150,
∂err₂/∂𝑤 = 2𝑑₂ ⋅ ∂𝑑₂/∂𝑤 = 2 ⋅ (200𝑤 + 𝑏 − 600) ⋅ 200,
∂err₃/∂𝑤 = 2𝑑₃ ⋅ ∂𝑑₃/∂𝑤 = 2 ⋅ (260𝑤 + 𝑏 − 500) ⋅ 260.

The chain rule of differentiation states that the derivative of a composite function 𝑓(𝑔(𝑥)), written as d/d𝑥 [𝑓(𝑔(𝑥))], is the product of the derivative of 𝑓 with respect to 𝑔 and the derivative of 𝑔 with respect to 𝑥: d/d𝑥 [𝑓(𝑔(𝑥))] = d𝑓/d𝑔 ⋅ d𝑔/d𝑥.

Then,

∂𝐽/∂𝑤 = (1/3)(2 ⋅ 150 ⋅ (150𝑤 + 𝑏 − 200) + 2 ⋅ 200 ⋅ (200𝑤 + 𝑏 − 600) + 2 ⋅ 260 ⋅ (260𝑤 + 𝑏 − 500))

Therefore,

∂𝐽/∂𝑤 = (1/3)(260200𝑤 + 1220𝑏 − 560000)
Similarly, we find ∂𝐽/∂𝑏:

∂𝐽/∂𝑏 = (1/3)(2 ⋅ (150𝑤 + 𝑏 − 200) + 2 ⋅ (200𝑤 + 𝑏 − 600) + 2 ⋅ (260𝑤 + 𝑏 − 500))
       = (1/3)(1220𝑤 + 6𝑏 − 2600)
Setting the partial derivatives to 0 results in the following system of equations:

(1/3)(260200𝑤 + 1220𝑏 − 560000) = 0
(1/3)(1220𝑤 + 6𝑏 − 2600) = 0

Simplifying the system and using substitution to solve for the variables gives the optimal values: 𝑤* = 2.58 and 𝑏* = −91.76.
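As a quick check (not part of the book's code), the same values can be obtained by solving the simplified 2×2 linear system with NumPy, assuming NumPy is installed:

    # Solve the system obtained by setting the partial derivatives to zero:
    #   260200*w + 1220*b = 560000
    #     1220*w +    6*b =   2600
    import numpy as np

    A = np.array([[260200.0, 1220.0],
                  [1220.0, 6.0]])
    c = np.array([560000.0, 2600.0])

    w_star, b_star = np.linalg.solve(A, c)
    print(round(w_star, 2), round(b_star, 2))  # approximately 2.58 -91.76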
The resulting model 𝑓(𝑥) = 2.58𝑥 − 91.76 is shown in the plot below. It includes the three examples (blue dots), the model itself (red solid line), and a prediction for a new house with an area of 240 m² (dotted orange lines).
A vertical blue dashed line shows the square root of the model’s prediction error compared to the actual price.¹ Smaller errors mean the model fits the data better. The loss, which aggregates these errors, measures how well the model aligns with the dataset.
When we calculate the loss using our model’s training dataset (called the training set), we obtain the training loss. For our model, this training loss is defined by Equation 1.3. Using our learned parameter values, we can now compute the loss for the training set:

𝐽(2.58, −91.76) = [(2.58 ⋅ 150 − 91.76 − 200)² + (2.58 ⋅ 200 − 91.76 − 600)² + (2.58 ⋅ 260 − 91.76 − 500)²] / 3
                = 15403.19.
The square root of this value is approximately 124.1, indicating an average prediction error of around $124,100. The interpretation of whether a loss value is high or low depends on the specific business context and comparative benchmarks. Neural networks and other non-linear models, which we explore later in this chapter, typically achieve lower loss values.

¹ It’s the square root of the error because our error, as defined in Equation 1.2, is the square of the difference between the predicted price and the real price of the house. It’s common practice to take the square root of the mean squared error because it expresses the error in the same units as the target variable (price in this case). This makes it easier to interpret the error value.
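Tying the numbers together, here is a small sketch (mine, not the book's) that recomputes the training loss and its square root with the learned parameters, and prices the new 240 m² house:

    import math

    # Learned parameters from the worked example
    w_star, b_star = 2.58, -91.76

    dataset = [(150, 200), (200, 600), (260, 500)]
    train_loss = sum((w_star * x + b_star - y) ** 2 for x, y in dataset) / len(dataset)
    print(round(train_loss, 2))             # ~15403.19
    print(round(math.sqrt(train_loss), 1))  # ~124.1, i.e. about $124,100 on average

    # Prediction for a new 240 m^2 house (price in thousands of dollars)
    print(round(w_star * 240 + b_star, 2))  # ~527.44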
1.4. Vector
To predict a house price, knowing its area alone isn’t enough. Factors like the year of construction
or the number of bedrooms and bathrooms also matter. Suppose we use two attributes: (1) area
and (2) number of bedrooms. In this case, the input 𝐱 becomes a feature vector. This vector includes two features, also called dimensions or components:

𝐱 ≝ [𝑥^(1), 𝑥^(2)]

In this book, vectors are represented with lowercase bold letters, such as 𝐱 or 𝐰. For a given house 𝐱, 𝑥^(1) represents its size in square meters, and 𝑥^(2) represents the number of bedrooms.
The dimensionality of the vector, or its size, refers to the number of components it contains. Here,
𝐱 has two components, so its dimensionality is 2.
With two features, our linear model needs three parameters: the weights 𝑤^(1) and 𝑤^(2), and the bias 𝑏. The weights can be grouped into a vector:

𝐰 ≝ [𝑤^(1), 𝑤^(2)]
The linear model can then be written compactly as:
𝑦 = 𝐰 ⋅ 𝐱 + 𝑏, (1.4)
where 𝐰 ⋅ 𝐱 is a dot product of two vectors (also known as scalar product). It is defined as:
𝐰 ⋅ 𝐱 ≝ ∑_{𝑗=1}^{𝐷} 𝑤^(𝑗) 𝑥^(𝑗)
The dot product combines two vectors of the same dimensionality to produce a scalar, a number
like 22, 0.67, or −10.5. Scalars in this book are denoted by italic lowercase or uppercase letters,
such as 𝑥 or 𝐷. The expression 𝐰 ⋅ 𝐱 + 𝑏 generalizes the idea of a linear transformation to vectors.
The equation above uses capital-sigma notation, where 𝐷 represents the dimensionality of the input, and 𝑗 runs from 1 to 𝐷. For example, in the 2-dimensional house scenario, ∑_{𝑗=1}^{2} 𝑤^(𝑗) 𝑥^(𝑗) = 𝑤^(1)𝑥^(1) + 𝑤^(2)𝑥^(2).
Although the capital-sigma notation suggests the dot product might be implemented as a loop, modern computers handle it much more efficiently. Optimized linear algebra libraries like BLAS and cuBLAS compute the dot product using low-level, highly optimized methods. These libraries leverage hardware acceleration and parallel processing, achieving speeds far beyond a simple loop.
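As a small illustration (not from the book), here is the dot product written both as an explicit loop mirroring the capital-sigma definition and as a vectorized NumPy call, which dispatches to an optimized backend:

    import numpy as np

    w = [0.5, 2.0]    # weights, e.g. for area and number of bedrooms
    x = [150.0, 2.0]  # features of one house

    # Explicit loop, following the definition term by term
    dot_loop = sum(w_j * x_j for w_j, x_j in zip(w, x))

    # Vectorized version backed by an optimized linear algebra library
    dot_vec = np.dot(np.array(w), np.array(x))

    print(dot_loop, dot_vec)  # both print 79.0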
The sum of two vectors 𝐚 and 𝐛, both with the same dimensionality 𝐷, is defined as:

𝐚 + 𝐛 ≝ (𝑎^(1) + 𝑏^(1), 𝑎^(2) + 𝑏^(2), …, 𝑎^(𝐷) + 𝑏^(𝐷))
The calculation for a sum of two 3-dimensional vectors is illustrated below:
In this chapter’s illustrations, the numbers in the cells indicate the position of an element within an input or output matrix, or a vector. They do not represent actual values.
The norm of a vector 𝐱, denoted ‖𝐱‖, represents its length or magnitude. It is defined as the square root of the sum of the squares of its components:

‖𝐱‖ ≝ √(∑_{𝑗=1}^{𝐷} (𝑥^(𝑗))²)
A unit vector is a vector whose norm equals 1; any non-zero vector can be turned into a unit vector by dividing it by its norm.
Unit vectors are valuable because their dot product equals the cosine of the angle between them, and computing dot products is efficient. When documents are represented as unit vectors, finding similar ones becomes fast by calculating the dot product between the query vector and document vectors. This is how vector search engines and libraries like Milvus, Qdrant, and Weaviate operate.
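To make this concrete, here is a brief sketch (mine, not the book's) that normalizes two made-up document vectors and uses their dot product as a cosine similarity:

    import numpy as np

    def to_unit_vector(v):
        # Divide a vector by its norm so that the result has length 1
        return v / np.linalg.norm(v)

    # Two invented document vectors (e.g., simple feature counts)
    doc_a = to_unit_vector(np.array([3.0, 1.0, 0.0]))
    doc_b = to_unit_vector(np.array([2.0, 2.0, 1.0]))

    # For unit vectors, the dot product equals the cosine of the angle between them
    print(round(float(np.dot(doc_a, doc_b)), 3))  # ~0.843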
As dimensions increase, the number of parameters in a linear model becomes too large to solve manually. Furthermore, in high-dimensional spaces, we cannot visually verify if data follows a linear pattern. Even if we could visualize beyond three dimensions, we would still need more flexible models to handle data that linear models cannot fit.
The next section explores non-linear models, with a focus on neural networks—the foundation for
understanding large language models, which are a specialized neural network architecture.
To overcome this, we introduce non-linearity. For a one-dimensional input, the model becomes:
𝑦 = 𝜙(𝑤𝑥 + 𝑏)
The function 𝜙 is a fixed non-linear function, known as an activation. Common choices are:
1) ReLU (rectified linear unit): ReLU(𝑧) ≝ max(0, 𝑧), which outputs non-negative values and is widely used in neural networks;
2) Sigmoid: 𝜎(𝑧) ≝ 1 / (1 + 𝑒^(−𝑧)), which outputs values between 0 and 1, making it suitable for binary classification (e.g., classifying spam emails as 1 and non-spam as 0);
3) Tanh (hyperbolic tangent): tanh(𝑧) ≝ (𝑒^𝑧 − 𝑒^(−𝑧)) / (𝑒^𝑧 + 𝑒^(−𝑧)), which outputs values between −1 and 1.
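For reference, here is a minimal sketch (not from the book) of the three activations implemented directly with PyTorch tensor operations:

    import torch

    z = torch.tensor([-2.0, 0.0, 3.0])

    relu = torch.clamp(z, min=0.0)         # max(0, z)
    sigmoid = 1.0 / (1.0 + torch.exp(-z))  # squashes values into (0, 1)
    tanh = (torch.exp(z) - torch.exp(-z)) / (torch.exp(z) + torch.exp(-z))  # (-1, 1)

    print(relu)     # tensor([0., 0., 3.])
    print(sigmoid)  # ~tensor([0.1192, 0.5000, 0.9526])
    print(tanh)     # ~tensor([-0.9640, 0.0000, 0.9951])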
The structure 𝜙(𝑤𝑥 + 𝑏) enables learning non-linear models but can’t capture all non-linear curves. By nesting these functions, we build more expressive models. For instance, let 𝑓₁(𝑥) ≝ 𝜙(𝑎𝑥 + 𝑏) and 𝑓₂(𝑧) ≝ 𝜙(𝑐𝑧 + 𝑑). A composite model combining 𝑓₁ and 𝑓₂ is:

𝑦 = 𝑓₂(𝑓₁(𝑥)) = 𝜙(𝑐𝜙(𝑎𝑥 + 𝑏) + 𝑑)

Here, the input 𝑥 is first transformed linearly using parameters 𝑎 and 𝑏, then passed through the non-linear function 𝜙. The result is further transformed linearly with parameters 𝑐 and 𝑑, followed by another application of 𝜙.
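To see the nesting in code, here is a tiny sketch (mine, with arbitrary parameter values) of the composite model, using the sigmoid as 𝜙:

    import torch

    def phi(z):
        return torch.sigmoid(z)  # the fixed non-linear activation

    # Arbitrary illustrative parameters; in practice they are learned from data
    a, b = 1.5, -0.5  # first unit
    c, d = 2.0, 0.1   # second unit

    def composite_model(x):
        hidden = phi(a * x + b)     # f1(x) = phi(a*x + b)
        return phi(c * hidden + d)  # f2(f1(x)) = phi(c*phi(a*x + b) + d)

    print(composite_model(torch.tensor(0.8)))  # a value in (0, 1), roughly 0.81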
Below is the graph representation of the composite model 𝑦 = 𝑓₂(𝑓₁(𝑥)):
A computational graph represents the structure of a model. The computational graph above
shows two non-linear units (blue rectangles), often referred to as artificial neurons. Each unit
contains two trainable parameters—a weight and a bias—represented by grey circles. The left arrow ← denotes that the value on the right is assigned to the variable on the left. This graph illustrates a basic neural network with two layers, each containing one unit. Most neural networks in
practice are built with more layers and multiple units per layer.
Suppose we have a two-dimensional input, an input layer with three units, and an output layer
with a single unit. The computational graph appears as follows:
This structure represents a feedforward neural network (FNN), where information flows in one
direction—left to right—without loops. When units in each layer connect to all units in the subsequent layer, as shown above, we call it a multilayer perceptron (MLP). A layer where each unit
connects to all units in both adjacent layers is termed a fully connected layer, or dense layer.
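As an illustration (not the book's code), the network just described (a two-dimensional input, a layer of three units, and an output layer with a single unit) could be written in PyTorch roughly like this; the choice of activation is an assumption, since the description does not fix one:

    import torch
    import torch.nn as nn

    # A small MLP: 2 input features -> 3 hidden units -> 1 output unit.
    # Every layer here is fully connected (dense).
    mlp = nn.Sequential(
        nn.Linear(2, 3),  # layer with three units
        nn.ReLU(),        # non-linearity applied by each of the three units
        nn.Linear(3, 1),  # output layer with a single unit
        nn.Sigmoid(),     # output unit's activation (illustrative choice)
    )

    x = torch.tensor([[250.0, 3.0]])  # one example with two features
    print(mlp(x))                     # a single (untrained, random) output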
In Chapter 3, we will explore recurrent neural networks (RNNs). Unlike FNNs, RNNs have loops,
where outputs from a layer are used as inputs to the same layer.
Convolutional neural networks (CNNs) are feedforward neural networks with convolutional layers that are not fully connected. While initially designed for image processing, they are effective for tasks like document classification in text data. To learn more about CNNs, refer to the additional materials in the book’s wiki.
To simplify diagrams, individual neural units can be replaced with squares. Using this approach,
the above network can be represented more compactly as follows: