
Chapter 2

Supervised learning

A supervised learning model defines a mapping from one or more inputs to one or more
outputs. For example, the input might be the age and mileage of a second-hand Toyota
Prius, and the output might be the estimated value of the car in dollars.
The model is just a mathematical equation; when the inputs are passed through this
equation, it computes the output, and this is termed inference. The model equation also
contains parameters. Different parameter values change the outcome of the computation;
the model equation describes a family of possible relationships between inputs and
outputs, and the parameters specify the particular relationship.
When we train or learn a model, we find parameters that describe the true relationship
between inputs and outputs. A learning algorithm takes a training set of input/output
pairs and manipulates the parameters until the inputs predict their corresponding outputs
as closely as possible. If the model works well for these training pairs, then we hope
it will make good predictions for new inputs where the true output is unknown.
The goal of this chapter is to expand on these ideas. First, we describe this framework
more formally and introduce some notation. Then we work through a simple example
in which we use a straight line to describe the relationship between input and output.
This linear model is both familiar and easy to visualize, but nevertheless illustrates all
the main ideas of supervised learning.

2.1 Supervised learning overview


In supervised learning, we aim to build a model that takes an input x and outputs a
prediction y. For simplicity, we assume that both the input x and output y are vectors
of a predetermined and fixed size and that the elements of each vector are always ordered
in the same way; in the Prius example above, the input x would always contain the age
of the car and then the mileage, in that order. This is termed structured or tabular data.
To make the prediction, we need a model f[•] that takes input x and returns y, so:

y = f[x]. (2.1)

When we compute the prediction y from the input x, we call this inference.
The model is just a mathematical equation with a fixed form. It represents a family
of different relations between the input and the output. The model also contains parameters ϕ.
The choice of parameters determines the particular relation between input and
output, so we should really write:

y = f[x, ϕ]. (2.2)


When we talk about learning or training a model, we mean that we attempt to find
parameters ϕ that make sensible output predictions from the input. We learn these
parameters using a training dataset of I pairs of input and output examples {xi , yi }. We
aim to select parameters that map each training input to its associated output as closely
as possible. We quantify the degree of mismatch in this mapping with the loss L. This
is a scalar value that summarizes how poorly the model predicts the training outputs
from their corresponding inputs for parameters ϕ.
We can treat the loss as a function L[ϕ] of these parameters. When we train the
model, we are seeking parameters ϕ̂ that minimize this loss function:¹

ϕ̂ = argmin_ϕ [ L[ϕ] ].     (2.3)

If the loss is small after this minimization, we have found model parameters that accurately
predict the training outputs yi from the training inputs xi.
After training a model, we must now assess its performance; we run the model on
separate test data to see how well it generalizes to examples that it didn’t observe during
training. If the performance is adequate, then we are ready to deploy the model.
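To make this recipe concrete in code, here is a minimal sketch of the full pipeline. The toy data, the choice of a quadratic model family, and the use of scipy.optimize.minimize as a generic minimizer are all illustrative assumptions; section 2.2 develops a specific example in detail.

```python
# A sketch of the supervised learning recipe: a parametric model f[x, phi], a scalar
# loss L[phi] over training pairs {x_i, y_i}, and a search for the minimizing parameters.
# The data, model family, and optimizer choice here are illustrative assumptions.
import numpy as np
from scipy.optimize import minimize

def f(x, phi):
    # One possible model family: quadratic curves with three parameters.
    return phi[0] + phi[1] * x + phi[2] * x ** 2

def loss(phi, x_train, y_train):
    # Scalar summary of how poorly the model predicts the training outputs (squared error).
    return np.sum((f(x_train, phi) - y_train) ** 2)

x_train = np.array([0.1, 0.4, 0.7, 1.0, 1.3])
y_train = np.array([0.3, 0.5, 0.8, 1.1, 1.2])

# Training: find the parameters that minimize the loss (equation 2.3).
result = minimize(loss, x0=np.zeros(3), args=(x_train, y_train))
phi_hat = result.x

# Testing: evaluate the loss on data the model did not observe during training.
x_test, y_test = np.array([0.2, 0.9]), np.array([0.35, 1.0])
print("test loss:", loss(phi_hat, x_test, y_test))
```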

2.2 Linear regression example

Let’s now make these ideas concrete with a simple example. We consider a model y =
f[x, ϕ] that predicts a single output y from a single input x. Then we develop a loss
function, and finally, we discuss model training.

2.2.1 1D linear regression model

A 1D linear regression model describes the relationship between input x and output y
as a straight line:

y = f[x, ϕ]
  = ϕ0 + ϕ1 x.     (2.4)
¹ More properly, the loss function also depends on the training data {xi, yi}, so we should
write L[{xi, yi}, ϕ], but this is rather cumbersome.


Figure 2.1 Linear regression model. For a given choice of parameters ϕ = [ϕ0, ϕ1], the
model makes a prediction for the output (y-axis) based on the input (x-axis). Different
choices for the y-intercept ϕ0 and the slope ϕ1 change these predictions (cyan, orange,
and gray lines). The linear regression model (equation 2.4) defines a family of
input/output relations (lines) and the parameters determine the member of the family
(the particular line). (Interactive figure)

This model has two parameters ϕ = [ϕ0 , ϕ1 ], where ϕ0 is the y-intercept of the line and ϕ1
is the slope. Different choices for the y-intercept and slope result in different relations
between input and output (figure 2.1). Hence, equation 2.4 defines a family of possible
input-output relations (all possible lines), and the choice of parameters determines the
member of this family (the particular line).
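In code, the model of equation 2.4 is a one-line function. The sketch below is illustrative: the parameter values are arbitrary and simply pick out three members of the family of lines, in the spirit of figure 2.1.

```python
import numpy as np

def f(x, phi):
    # 1D linear regression (equation 2.4): phi[0] is the y-intercept, phi[1] the slope.
    return phi[0] + phi[1] * x

x = np.linspace(0.0, 2.0, 5)
# Each choice of parameters selects a different member of the family of lines.
for phi in ([0.0, 1.0], [1.0, 0.5], [0.5, -0.4]):
    print(phi, f(x, phi))
```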

2.2.2 Loss

For this model, the training dataset (figure 2.2a) consists of I input/output pairs {xi , yi }.
Figures 2.2b–d show three lines defined by three sets of parameters. The green line
in figure 2.2d describes the data more accurately than the other two since it is much
closer to the data points. However, we need a principled approach for deciding which
parameters ϕ are better than others. To this end, we assign a numerical value to each
choice of parameters that quantifies the degree of mismatch between the model and the
data. We term this value the loss; a lower loss means a better fit.
The mismatch is captured by the deviation between the model predictions f[xi , ϕ]
(height of the line at xi ) and the ground truth outputs yi . These deviations are depicted
as orange dashed lines in figures 2.2b–d. We quantify the total mismatch, training error,
or loss as the sum of the squares of these deviations for all I training pairs:

L[ϕ] = Σ_{i=1}^{I} (f[xi, ϕ] − yi)²
     = Σ_{i=1}^{I} (ϕ0 + ϕ1 xi − yi)².     (2.5)

Since the best parameters minimize this expression, we call this a least-squares loss. The
squaring operation means that the direction of the deviation (i.e., whether the line is above
or below the data) is unimportant. There are also theoretical reasons for this choice which
we return to in chapter 5.
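A minimal sketch of this loss in code, using hypothetical training pairs rather than the dataset of figure 2.2a (so the printed values will not match the losses quoted in that figure):

```python
import numpy as np

def least_squares_loss(phi, x, y):
    # Equation 2.5: sum of squared deviations between the line's predictions and the outputs.
    return np.sum((phi[0] + phi[1] * x - y) ** 2)

# Hypothetical training pairs (not the dataset of figure 2.2a).
x_train = np.array([0.05, 0.20, 0.35, 0.50, 0.80, 1.05, 1.20, 1.40, 1.60, 1.75, 1.85, 1.95])
y_train = np.array([0.65, 0.80, 1.00, 0.95, 1.35, 1.30, 1.50, 1.55, 1.65, 1.70, 1.60, 1.80])

# Two poorly chosen lines and one reasonable one: the loss reflects the quality of fit.
for phi in ([1.5, -0.4], [0.2, 0.1], [0.7, 0.55]):
    print(phi, round(least_squares_loss(phi, x_train, y_train), 2))
```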


Figure 2.2 Linear regression training data, model, and loss. a) The training data
(orange points) consist of I = 12 input/output pairs {xi , yi }. b–d) Each panel
shows the linear regression model with different parameters. Depending on the
choice of y-intercept and slope parameters ϕ = [ϕ0 , ϕ1 ], the model errors (orange
dashed lines) may be larger or smaller. The loss L is the sum of the squares
of these errors. The parameters that define the lines in panels (b) and (c) have
large losses L = 7.07 and L = 10.28, respectively, because the models fit badly.
The loss L = 0.20 in panel (d) is smaller because the model fits well; in fact, this
has the smallest loss of all possible lines, so these are the optimal parameters.
(Interactive figure)


Figure 2.3 Loss function for linear regression model with the dataset in figure 2.2a.
a) Each combination of parameters ϕ = [ϕ0, ϕ1] has an associated loss. The resulting
loss function L[ϕ] can be visualized as a surface. The three circles represent the lines
from figure 2.2b–d. b) The loss can also be visualized as a heatmap, where brighter
regions represent larger losses; here we are looking straight down at the surface in (a)
from above and gray ellipses represent isocontours. The best fitting line (figure 2.2d)
has the parameters with the smallest loss (green circle).

The loss L is a function of the parameters ϕ; it will be larger when the model fit is
poor (figure 2.2b,c) and smaller when it is good (figure 2.2d). Considered in this light,
we term L[ϕ] the loss function or cost function (see notebook 2.1). The goal is to find
the parameters ϕ̂ that minimize this quantity:

ϕ̂ = argmin_ϕ [ L[ϕ] ]
   = argmin_ϕ [ Σ_{i=1}^{I} (f[xi, ϕ] − yi)² ]
   = argmin_ϕ [ Σ_{i=1}^{I} (ϕ0 + ϕ1 xi − yi)² ].     (2.6)

There are only two parameters (the y-intercept ϕ0 and slope ϕ1), so we can calculate
the loss for every combination of values and visualize the loss function as a surface
(figure 2.3; see problems 2.1–2.2). The “best” parameters are at the minimum of this surface.
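Because there are only two parameters, this brute-force evaluation is easy to sketch in code. The data and grid ranges below are hypothetical, so the resulting surface approximates the idea of figure 2.3 rather than reproducing it:

```python
import numpy as np

def loss(intercept, slope, x, y):
    # Least-squares loss (equation 2.5) for one (y-intercept, slope) combination.
    return np.sum((intercept + slope * x - y) ** 2)

# Hypothetical data standing in for the dataset of figure 2.2a.
x_train = np.array([0.1, 0.3, 0.6, 0.9, 1.2, 1.6])
y_train = np.array([0.4, 0.6, 0.9, 1.0, 1.3, 1.5])

# Evaluate the loss for every combination of values on a grid; plotting `surface`
# as a heatmap or 3D surface gives a picture like figure 2.3.
intercepts = np.linspace(-1.0, 2.0, 121)
slopes = np.linspace(-1.0, 2.0, 121)
surface = np.array([[loss(b, m, x_train, y_train) for m in slopes] for b in intercepts])

# The grid cell with the smallest loss approximates the best parameters.
i, j = np.unravel_index(np.argmin(surface), surface.shape)
print("approximate best y-intercept and slope:", intercepts[i], slopes[j])
```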


2.2.3 Training

The process of finding parameters that minimize the loss is termed model fitting, training,
or learning. The basic method is to choose the initial parameters randomly and then
improve them by “walking down” the loss function until we reach the bottom (figure 2.4).
One way to do this is to measure the gradient of the surface at the current position and
take a step in the direction that is most steeply downhill. Then we repeat this process
until the gradient is flat and we can improve no further.²
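A minimal sketch of this procedure for the least-squares loss is shown below. The data, step size, and iteration count are arbitrary illustrative choices, and the gradient expressions come from differentiating equation 2.5 with respect to the two parameters:

```python
import numpy as np

def gradient(phi, x, y):
    # Derivatives of the least-squares loss (equation 2.5) with respect to [phi0, phi1].
    residual = phi[0] + phi[1] * x - y
    return np.array([2.0 * np.sum(residual), 2.0 * np.sum(residual * x)])

# Hypothetical training data; step size and iteration count chosen just to make this run.
x_train = np.array([0.1, 0.3, 0.6, 0.9, 1.2, 1.6])
y_train = np.array([0.4, 0.6, 0.9, 1.0, 1.3, 1.5])

phi = np.random.default_rng(0).normal(size=2)   # random initial parameters
alpha = 0.01                                    # step size
for _ in range(2000):
    phi = phi - alpha * gradient(phi, x_train, y_train)   # step most steeply downhill
print("fitted y-intercept and slope:", phi)
```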

2.2.4 Testing

Having trained the model, we want to know how it will perform in the real world. We
do this by computing the loss on a separate set of test data. The degree to which the
prediction accuracy generalizes to the test data depends in part on how representative
and complete the training data is. However, it also depends on how expressive the model
is. A simple model like a line might not be able to capture the true relationship between
input and output. This is known as underfitting. Conversely, a very expressive model
may describe statistical peculiarities of the training data that are atypical and lead to
unusual predictions. This is known as overfitting.
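The sketch below illustrates these two failure modes numerically under some assumptions: the data are synthetic, and polynomials fitted with numpy.polyfit stand in for models of increasing expressiveness. Typically, the straight line underfits (large error on both sets), while the very expressive model matches the training points almost perfectly but generalizes worse.

```python
import numpy as np

rng = np.random.default_rng(1)

def true_relationship(x):
    # The (in practice unknown) relationship that generated the data.
    return np.sin(2.0 * x)

# Small synthetic training set and a larger held-out test set.
x_train = rng.uniform(0.0, 2.0, 10)
y_train = true_relationship(x_train) + 0.1 * rng.normal(size=10)
x_test = rng.uniform(0.0, 2.0, 100)
y_test = true_relationship(x_test) + 0.1 * rng.normal(size=100)

def mse(pred, y):
    return np.mean((pred - y) ** 2)

# A line (degree 1), a moderately expressive model, and a very expressive model.
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_error = mse(np.polyval(coeffs, x_train), y_train)
    test_error = mse(np.polyval(coeffs, x_test), y_test)
    print(f"degree {degree}: train error {train_error:.3f}, test error {test_error:.3f}")
```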

2.3 Summary

A supervised learning model is a function y = f[x, ϕ] that relates inputs x to outputs y.
The particular relationship is determined by parameters ϕ. To train the model, we
define a loss function L[ϕ] over a training dataset {xi , yi }. This quantifies the mismatch
between the model predictions f[xi , ϕ] and observed outputs yi as a function of the
parameters ϕ. Then we search for the parameters that minimize the loss. We evaluate
the model on a different set of test data to see how well it generalizes to new inputs.
Chapters 3–9 expand on these ideas. First, we tackle the model itself; 1D linear
regression has the obvious drawback that it can only describe the relationship between the
input and output as a straight line. Shallow neural networks (chapter 3) are only slightly
more complex than linear regression but describe a much larger family of input/output
relationships. Deep neural networks (chapter 4) are just as expressive but can describe
complex functions with fewer parameters and work better in practice.
Chapter 5 investigates loss functions for different tasks and reveals the theoretical
underpinnings of the least-squares loss. Chapters 6 and 7 discuss the training process.
Chapter 8 discusses how to measure model performance. Chapter 9 considers regularization
techniques, which aim to improve that performance.

² This iterative approach is not actually necessary for the linear regression model. Here,
it's possible to find closed-form expressions for the parameters. However, this gradient
descent approach works for more complex models where there is no closed-form solution and
where there are too many parameters to evaluate the loss for every combination of values.
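As an illustration of that footnote, the sketch below solves the 1D linear regression problem directly with hypothetical data; np.linalg.lstsq returns the least-squares minimizer without any iterative search:

```python
import numpy as np

# Hypothetical training data.
x = np.array([0.1, 0.3, 0.6, 0.9, 1.2, 1.6])
y = np.array([0.4, 0.6, 0.9, 1.0, 1.3, 1.5])

# Stack a column of ones (for the y-intercept) next to the inputs, then solve the
# linear least-squares problem directly; no iterative "walking downhill" is required.
X = np.stack([np.ones_like(x), x], axis=1)
phi_hat, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)
print("closed-form y-intercept and slope:", phi_hat)
```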


Figure 2.4 Linear regression training. The goal is to find the y-intercept and slope
parameters that correspond to the smallest loss. a) Iterative training algorithms
initialize the parameters randomly and then improve them by “walking downhill”
until no further improvement can be made. Here, we start at position 0 and move
a certain distance downhill (perpendicular to the contours) to position 1. Then
we re-calculate the downhill direction and move to position 2. Eventually, we
reach the minimum of the function (position 4). b) Each position 0–4 from panel
(a) corresponds to a different y-intercept and slope and so represents a different
line. As the loss decreases, the lines fit the data more closely. (Interactive figure)

Notes

Loss functions vs. cost functions: In much of machine learning and in this book, the terms
loss function and cost function are used interchangeably. However, more properly, a loss function
is the individual term associated with a data point (i.e., each of the squared terms on the right-
hand side of equation 2.5), and the cost function is the overall quantity that is minimized (i.e.,
the entire right-hand side of equation 2.5). A cost function can contain additional terms that
are not associated with individual data points (see section 9.1). More generally, an objective
function is any function that is to be maximized or minimized.

Generative vs. discriminative models: The models y = f[x, ϕ] in this chapter are
discriminative models. These make an output prediction y from real-world measurements x.
Another approach is to build a generative model x = g[y, ϕ], in which the real-world
measurements x are computed as a function of the output y (problem 2.3).
The generative approach has the disadvantage that it doesn’t directly predict y. To perform
inference, we must invert the generative equation as y = g⁻¹[x, ϕ], and this may be difficult.
However, generative models have the advantage that we can build in prior knowledge about how
the data were created. For example, if we wanted to predict the 3D position and orientation y
of a car in an image x, then we could build knowledge about car shape, 3D geometry, and light
transport into the function x = g[y, ϕ].
This seems like a good idea, but in fact, discriminative models dominate modern machine
learning; the advantage gained from exploiting prior knowledge in generative models is usually
trumped by learning very flexible discriminative models with large amounts of training data.

Problems
Problem 2.1 To walk “downhill” on the loss function (equation 2.5), we measure its gradient with
respect to the parameters ϕ0 and ϕ1 . Calculate expressions for the slopes ∂L/∂ϕ0 and ∂L/∂ϕ1 .

Problem 2.2 Show that we can find the minimum of the loss function in closed form by setting
the expression for the derivatives from problem 2.1 to zero and solving for ϕ0 and ϕ1 . Note that
this works for linear regression but not for more complex models; this is why we use iterative
model fitting methods like gradient descent (figure 2.4).

Problem 2.3∗ Consider reformulating linear regression as a generative model, so we have
x = g[y, ϕ] = ϕ0 + ϕ1 y. What is the new loss function? Find an expression for the inverse
function y = g⁻¹[x, ϕ] that we would use to perform inference. Will this model make the
same predictions as the discriminative version for a given training dataset {xi, yi}? One
way to establish this is to write code that fits a line to three data points using both
methods and see if the result is the same.
