Lecture 3 - Linear Regression
Supervised Learning:
Linear regression
Dr. Rajdip Nayek
Block V, Room 418D
Department of Applied Mechanics
Indian Institute of Technology Delhi
E-mail: [email protected]
Learning goals
• Think about the data points and the model parameters as vectors
• Solve the optimization problem using two different strategies: deriving a closed-form
solution, and applying gradient descent
• Write the algorithm in terms of linear algebra, so that we can think about it more
easily
• Make a linear algorithm more powerful using nonlinear basis functions, or features
Supervised learning
• Given a set of examples in the form (𝒙, 𝑡)
• Ex. 𝒙 is 𝐾-dimensional input, 𝑡 is scalar output
Supervised learning
• Two types of supervised learning
[Diagram: the model 𝑓(𝒙) maps an input 𝒙 from the domain (input) to an output 𝑡 in the range (output)]
An example of classification
• Problem: Will you enjoy an outdoor sport based on the weather?
More examples of classification/regression
[Table with columns: Problem, Input domain, Output range, Classification/Regression]
Key components of any ML algorithm
3. A loss function that quantifies how well (or badly) the model is doing
Linear regression
• Given several pairs $\{(\boldsymbol{x}^{(i)}, t^{(i)})\}_{i=1}^{N}$
• Find a relation between vector 𝒙 and 𝑡, given data $\{(\boldsymbol{x}^{(i)}, t^{(i)})\}_{i=1}^{N}$
• A linear model with 𝐾 input features:
$$y = w_1 x_1 + w_2 x_2 + \cdots + w_K x_K + b = \boldsymbol{w}^T \boldsymbol{x} + b$$
Linear regression: Problem setup
• How good is the model fit with respect to the data pairs $\{(\boldsymbol{x}^{(i)}, t^{(i)})\}_{i=1}^{N}$?
• To assess model fit, we define errors and accumulate them using a loss function
• Squared error for the $i$th example: $\left(t^{(i)} - y^{(i)}\right)^2$
$$\mathcal{L}(\boldsymbol{w}, b) = \frac{1}{2N} \sum_{i=1}^{N} \left(t^{(i)} - y^{(i)}\right)^2 = \frac{1}{2N} \sum_{i=1}^{N} \left(t^{(i)} - \boldsymbol{w}^T \boldsymbol{x}^{(i)} - b\right)^2$$
• The $\tfrac{1}{2}$ factor is for convenience and the $\tfrac{1}{N}$ factor takes the average over the full dataset
• Recall the linear model with 𝐾 input features: $y = \boldsymbol{w}^T \boldsymbol{x} + b$
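As a concrete illustration of the loss above, here is a minimal sketch in Python/NumPy (the function and variable names are my own, not from the slides) that accumulates the per-example squared errors and averages them:

```python
import numpy as np

def squared_error_loss(w, b, X, t):
    """Average squared-error loss: (1 / 2N) * sum_i (t^(i) - w^T x^(i) - b)^2."""
    N = X.shape[0]
    total = 0.0
    for i in range(N):                 # accumulate the per-example squared errors
        y_i = np.dot(w, X[i]) + b      # prediction y^(i) = w^T x^(i) + b
        total += (t[i] - y_i) ** 2
    return total / (2 * N)

# Toy usage: N = 3 examples, K = 2 features
X = np.array([[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]])
t = np.array([2.0, 0.0, 4.0])
print(squared_error_loss(np.array([1.0, 0.5]), 0.1, X, t))
```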
Vectorization
• We can organize all the data into a matrix 𝐗 with one row per example, and all the output predictions into a vector 𝒚
$$\mathbf{X} = \begin{bmatrix} \boldsymbol{x}^{(1)T} \\ \vdots \\ \boldsymbol{x}^{(N)T} \end{bmatrix} = \begin{bmatrix} 1 & 3 & -2 & 0.1 \\ \vdots & \vdots & \vdots & \vdots \\ 9 & 2 & 0 & -0.2 \end{bmatrix} \in \mathbb{R}^{N \times K}$$
(each row of $\mathbf{X}$ is one example; each column is one feature across all examples)
$$\mathcal{L}(\boldsymbol{w}, b) = \frac{1}{2N} \|\boldsymbol{t} - \boldsymbol{y}\|_2^2 = \frac{1}{2N} (\boldsymbol{t} - \boldsymbol{y})^T (\boldsymbol{t} - \boldsymbol{y})$$
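A vectorized sketch of the same loss, assuming $\mathbf{X}$ has one example per row (names are illustrative, not from the slides):

```python
import numpy as np

def squared_error_loss_vec(w, b, X, t):
    """Vectorized loss: L = (t - y)^T (t - y) / (2N), with y = Xw + 1b."""
    N = X.shape[0]
    y = X @ w + b          # all N predictions in one matrix-vector product
    r = t - y              # residual vector t - y
    return (r @ r) / (2 * N)
```

It computes exactly the same number as the loop version, but in a form that maps directly onto the matrix notation used from here on.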
Solving the optimization problem
• We would like to minimize the loss function!
• Recall from calculus class: the minimum of a smooth function occurs at a critical point, i.e. a point where the gradients are all 0
[Figure: loss curve ℒ(𝑤) with its minimum at a critical point]
• Direct solution: derive a formula that sets the gradient to 0. This works only in a handful of cases (e.g. linear regression)
• Set the gradient of the loss function with respect to the weights and bias to zero
• Collect the bias with the weights by appending a column of ones to $\mathbf{X}$, so that $\bar{\mathbf{X}} = [\mathbf{X} \;\; \mathbf{1}]$, $\bar{\boldsymbol{w}} = [\boldsymbol{w};\, b]$, and
$$\boldsymbol{y} = \mathbf{X}\boldsymbol{w} + \mathbf{1}b = \bar{\mathbf{X}}\bar{\boldsymbol{w}}$$
$$\frac{d\boldsymbol{y}}{d\bar{\boldsymbol{w}}} = \frac{d(\bar{\mathbf{X}}\bar{\boldsymbol{w}})}{d\bar{\boldsymbol{w}}} = \bar{\mathbf{X}}^T \quad \text{(denominator layout)}$$
$$\mathcal{L} = \frac{1}{2N}(\boldsymbol{t} - \boldsymbol{y})^T(\boldsymbol{t} - \boldsymbol{y})$$
$$\frac{d\mathcal{L}}{d\boldsymbol{y}} = -\frac{1}{N}(\boldsymbol{t} - \boldsymbol{y})$$
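To sanity-check the gradient $\frac{d\mathcal{L}}{d\bar{\boldsymbol{w}}} = -\frac{1}{N}\bar{\mathbf{X}}^T(\boldsymbol{t} - \boldsymbol{y})$ implied by these two derivatives, one can compare it against a finite-difference estimate. A sketch (function names are made up for this illustration):

```python
import numpy as np

def analytic_gradient(w_bar, X_bar, t):
    """dL/dw_bar = -(1/N) * X_bar^T (t - y), with y = X_bar @ w_bar."""
    N = X_bar.shape[0]
    return -X_bar.T @ (t - X_bar @ w_bar) / N

def numerical_gradient(w_bar, X_bar, t, eps=1e-6):
    """Central finite differences of the loss, one coordinate at a time."""
    N = X_bar.shape[0]
    loss = lambda w: np.sum((t - X_bar @ w) ** 2) / (2 * N)
    grad = np.zeros_like(w_bar)
    for j in range(w_bar.size):
        e = np.zeros_like(w_bar)
        e[j] = eps
        grad[j] = (loss(w_bar + e) - loss(w_bar - e)) / (2 * eps)
    return grad
```

The two gradients should agree to several decimal places for any choice of `w_bar`, `X_bar`, and `t`.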
Direct solution
• Use the chain rule for derivatives, $\dfrac{d\mathcal{L}}{d\bar{\boldsymbol{w}}} = \dfrac{d\boldsymbol{y}}{d\bar{\boldsymbol{w}}}\dfrac{d\mathcal{L}}{d\boldsymbol{y}}$, with $\dfrac{d\mathcal{L}}{d\boldsymbol{y}} = -\dfrac{1}{N}(\boldsymbol{t} - \boldsymbol{y})$, and set it to zero:
$$\Rightarrow\; -\frac{1}{N}\bar{\mathbf{X}}^T(\boldsymbol{t} - \boldsymbol{y}) = \boldsymbol{0}$$
$$\bar{\mathbf{X}}^T(\boldsymbol{t} - \boldsymbol{y}) = \boldsymbol{0}$$
$$\bar{\mathbf{X}}^T\boldsymbol{t} - \bar{\mathbf{X}}^T\bar{\mathbf{X}}\bar{\boldsymbol{w}} = \boldsymbol{0}$$
$$\bar{\mathbf{X}}^T\bar{\mathbf{X}}\bar{\boldsymbol{w}} = \bar{\mathbf{X}}^T\boldsymbol{t}$$
$$\bar{\boldsymbol{w}} = \left(\bar{\mathbf{X}}^T\bar{\mathbf{X}}\right)^{-1}\bar{\mathbf{X}}^T\boldsymbol{t}$$
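A minimal sketch of the direct solution in NumPy (illustrative names; in practice one solves the normal equations with `np.linalg.solve` rather than forming the inverse explicitly):

```python
import numpy as np

def fit_direct(X, t):
    """Solve the normal equations X_bar^T X_bar w_bar = X_bar^T t for w_bar = [w; b]."""
    N = X.shape[0]
    X_bar = np.hstack([X, np.ones((N, 1))])              # append a ones column for the bias
    w_bar = np.linalg.solve(X_bar.T @ X_bar, X_bar.T @ t)
    return w_bar[:-1], w_bar[-1]                          # weights w, bias b
```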
Iterative solution using gradient descent
• Gradient descent is an iterative algorithm, which means we apply an update repeatedly until some convergence criterion is met (e.g. the loss function stops changing much)
• The gradient direction is the direction of steepest ascent, i.e. the direction in which the function increases; gradient descent therefore steps in the opposite direction
• The update rule:
$$\bar{\boldsymbol{w}} \leftarrow \text{initialize}$$
$$\bar{\boldsymbol{w}} \leftarrow \bar{\boldsymbol{w}} - \alpha \frac{d\mathcal{L}(\bar{\boldsymbol{w}})}{d\bar{\boldsymbol{w}}}$$
$$\bar{\boldsymbol{w}} \leftarrow \bar{\boldsymbol{w}} + \frac{\alpha}{N}\bar{\mathbf{X}}^T(\boldsymbol{t} - \boldsymbol{y})$$
$$\bar{\boldsymbol{w}} \leftarrow \bar{\boldsymbol{w}} + \frac{\alpha}{N}\bar{\mathbf{X}}^T(\boldsymbol{t} - \bar{\mathbf{X}}\bar{\boldsymbol{w}})$$
• $\alpha$ is the learning rate. Larger values of $\alpha$ imply a greater change in $\bar{\boldsymbol{w}}$ per update
• Typical values are 0.01 or 0.001
• Gradient descent can get stuck in local optima; we will see more on this later
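The update above translates almost line for line into code. A sketch, assuming the same augmented design matrix as before (names are illustrative):

```python
import numpy as np

def fit_gradient_descent(X, t, alpha=0.01, num_iters=1000):
    """Repeat the update w_bar <- w_bar + (alpha/N) X_bar^T (t - X_bar w_bar)."""
    N = X.shape[0]
    X_bar = np.hstack([X, np.ones((N, 1))])   # augment with a ones column for the bias
    w_bar = np.zeros(X_bar.shape[1])          # initialize
    for _ in range(num_iters):                # in practice, also check a convergence criterion
        y = X_bar @ w_bar                     # current predictions
        w_bar = w_bar + (alpha / N) * X_bar.T @ (t - y)
    return w_bar[:-1], w_bar[-1]              # weights w, bias b
```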
Iterative vs direct solution
• By setting the gradients to zero, we compute the direct (or exact) solution. With
gradient descent, we approach it gradually
• For regression, the direct solution requires a matrix inverse, which can be computationally very costly for a large number of features:
$$\bar{\boldsymbol{w}} = \left(\bar{\mathbf{X}}^T\bar{\mathbf{X}}\right)^{-1}\bar{\mathbf{X}}^T\boldsymbol{t}$$
Nonlinear feature maps
• We can convert linear models into nonlinear models using nonlinear feature maps
$$y = \boldsymbol{w}^T \phi(\boldsymbol{x}) + b$$
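One possible feature map is a polynomial basis; the sketch below (with made-up names) shows that the regression machinery itself is unchanged, only the inputs are transformed:

```python
import numpy as np

def polynomial_features(x, degree=3):
    """Feature map phi(x) = [x, x^2, ..., x^degree] for scalar inputs x of shape (N,)."""
    return np.column_stack([x ** d for d in range(1, degree + 1)])

# e.g. reuse the earlier closed-form sketch on the transformed inputs:
# w, b = fit_direct(polynomial_features(x), t)
```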
Generalization
• To assess how well the model generalizes to unseen data, we divide the total set of 𝑁 examples into a training set and a test set
[Diagram: the total 𝑁 examples split into a training set (e.g. 70%) and a test set (e.g. 30%)]
• The test set is used at the very end, to estimate the generalization error of the final model, once all hyperparameters have been chosen
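A simple random split can be sketched as follows (function name and fractions are illustrative, not prescribed by the slides):

```python
import numpy as np

def train_test_split(X, t, train_fraction=0.7, seed=0):
    """Randomly assign the N examples to a training set and a test set."""
    N = X.shape[0]
    rng = np.random.default_rng(seed)
    idx = rng.permutation(N)                      # shuffle the example indices
    n_train = int(train_fraction * N)
    return (X[idx[:n_train]], t[idx[:n_train]],   # training inputs and targets
            X[idx[n_train:]], t[idx[n_train:]])   # test inputs and targets
```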