Linear Regression - SGD
§ Introduction to Learning
§ Linear Regression
§ Gradient Descent
§ Generalized Linear Regression
A Definition of ML
§ A pattern exists
§ We have data on it
Example: Home Price
} Housing price prediction
[Figure: scatter plot of house price ($ in 1000's, y-axis, 0 to 400) versus size in feet² (x-axis, 0 to 2500).]
} Perceptron example
Perceptron classifier
} Input 𝒙 = (x1, … , xd)
} Classifier:
} If ∑_{i=1}^{d} wi xi > threshold, then output 1
} else output −1
} Misclassified data point (𝒙^(n), y^(n)):
sign(𝒘^T 𝒙^(n)) ≠ y^(n)
Repeat
    Pick a misclassified data point (𝒙^(n), y^(n)) from the training data and update 𝒘:
        𝒘 = 𝒘 + y^(n) 𝒙^(n)
Until all training data points are correctly classified by g
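A minimal sketch of this perceptron learning algorithm in Python/NumPy; the function name and the random choice among misclassified points are illustrative, and the loop only terminates early when the data are linearly separable:

```python
import numpy as np

def perceptron_train(X, y, max_epochs=1000):
    """Perceptron learning algorithm sketch.

    X: (n, d) data matrix, y: labels in {-1, +1}.
    The threshold is absorbed as a bias by prepending a constant 1 feature.
    """
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend x0 = 1
    w = np.zeros(Xb.shape[1])
    for _ in range(max_epochs):
        # indices of currently misclassified points: sign(w^T x) != y
        mis = np.where(np.sign(Xb @ w) != y)[0]
        if mis.size == 0:
            break                      # all points classified correctly
        n = np.random.choice(mis)      # pick one misclassified point
        w = w + y[n] * Xb[n]           # update: w <- w + y^(n) x^(n)
    return w
```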
Perceptron learning algorithm: Example of weight update
[Figure: two panels in the (x1, x2) plane showing the weight vector before and after an update driven by a point's correct label.]
Experience (E) in ML
[Diagram: training experience (data) is fed to a learning algorithm, which produces a model.]
Linear Regression, Cost Function and Gradient Descent
Regression problem
} The goal is to make (real-valued) predictions given features
f : ℝ → ℝ,  f(x; 𝒘) = w0 + w1 x
} Multivariate: f : ℝ^d → ℝ,  f(𝒙; 𝒘) = w0 + w1 x1 + … + wd xd
[Diagram: training set 𝐷 → learning algorithm → parameters w0, w1.]
We need to
(1) measure how well f(x; 𝒘) approximates the target
(2) choose 𝒘 to minimize that error measure
[Figure: univariate data with a candidate line f(x; 𝒘); the vertical distance at each point is the residual y^(i) − f(x^(i); 𝒘).]
Squared error: (y^(i) − f(x^(i); 𝒘))²
Linear regression: univariate example
[Figure: the same univariate data and candidate line, with the residuals y^(i) − f(x^(i); 𝒘) highlighted.]
Cost function: J(𝒘) = ∑_{i=1}^{n} (y^(i) − f(x^(i); 𝒘))²
Regression: squared loss
} In the SSE cost function, we used the squared error as the prediction loss:
Loss(y, ŷ) = (y − ŷ)²,  where ŷ = f(𝒙; 𝒘)
} Minimize J(𝒘)
} Find the optimal f̂(𝒙) = f(𝒙; 𝒘̂), where 𝒘̂ = argmin_𝒘 J(𝒘)
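A small sketch of the squared loss and the corresponding SSE cost for the univariate model; the function names are illustrative:

```python
import numpy as np

def squared_loss(y, y_hat):
    """Per-example squared error: Loss(y, y_hat) = (y - y_hat)**2."""
    return (y - y_hat) ** 2

def sse_cost(w, x, y):
    """SSE cost J(w) = sum_i (y_i - (w0 + w1*x_i))**2 for univariate data."""
    w0, w1 = w
    return np.sum(squared_loss(y, w0 + w1 * x))
```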
Cost function: univariate example
f(x; w0, w1) = w0 + w1 x   (for fixed w0, w1, this is a function of x)
J(w0, w1)   (a function of the parameters w0, w1)
[Figures: a sequence of slides; on the left, the housing data (price in $1000's vs. size in feet²) with different candidate lines f(x; w0, w1), and on the right, the corresponding points on the surface/contour plot of J(w0, w1).]
Cost function optimization: univariate
J(𝒘) = ∑_{i=1}^{n} (y^(i) − w0 − w1 x^(i))²
∂J(𝒘)/∂w0 = 0
∂J(𝒘)/∂w1 = 0
Optimality conditions: univariate
J(𝒘) = ∑_{i=1}^{n} (y^(i) − w0 − w1 x^(i))²
∂J(𝒘)/∂w1 = 2 ∑_{i=1}^{n} (y^(i) − w0 − w1 x^(i)) (−x^(i)) = 0
∂J(𝒘)/∂w0 = 2 ∑_{i=1}^{n} (y^(i) − w0 − w1 x^(i)) (−1) = 0
Exercise!
} A system of two linear equations in the two unknowns w0 and w1 (a solution sketch follows below).
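A possible solution sketch in Python/NumPy: solving the two optimality conditions yields the familiar closed form for w1 and w0 (the helper name is illustrative):

```python
import numpy as np

def fit_univariate(x, y):
    """Solve the two optimality conditions for univariate least squares.

    Setting dJ/dw0 = 0 and dJ/dw1 = 0 gives:
        w1 = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))**2)
        w0 = mean(y) - w1 * mean(x)
    """
    x_bar, y_bar = x.mean(), y.mean()
    w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    w0 = y_bar - w1 * x_bar
    return w0, w1

# Example usage with toy data:
# x = np.array([1.0, 2.0, 3.0]); y = np.array([2.0, 2.5, 3.5])
# w0, w1 = fit_univariate(x, y)
```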
Cost function: multivariate for various features
} We have to minimize the empirical squared loss:
J(𝒘) = ∑_{i=1}^{n} (y^(i) − f(𝒙^(i); 𝒘))²
f(𝒙; 𝒘) = w0 + w1 x1 + … + wd xd
𝒘 = [w0, w1, … , wd]^T
𝒘̂ = argmin_{𝒘 ∈ ℝ^(d+1)} J(𝒘)
Cost function and optimal linear model
𝒚 = [y^(1), … , y^(n)]^T
𝑿 = [[1, x1^(1), … , xd^(1)]; [1, x1^(2), … , xd^(2)]; … ; [1, x1^(n), … , xd^(n)]]   (n × (d+1) design matrix)
𝒘 = [w0, w1, … , wd]^T
J(𝒘) = ‖𝒚 − 𝑿𝒘‖²
Minimizing cost function
Optimal linear weight vector (for the SSE cost function):
J(𝒘) = ‖𝒚 − 𝑿𝒘‖²
∇𝒘 J(𝒘) = −2𝑿^T(𝒚 − 𝑿𝒘)
∇𝒘 J(𝒘) = 𝟎 ⇒ 𝑿^T𝑿𝒘 = 𝑿^T𝒚
𝒘̂ = (𝑿^T𝑿)^(−1)𝑿^T𝒚
𝒘̂ = 𝑿^†𝒚,  where 𝑿^† = (𝑿^T𝑿)^(−1)𝑿^T is the pseudo-inverse of 𝑿
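A minimal NumPy sketch of the closed-form solution via the pseudo-inverse; the function name and the use of np.linalg.pinv (which also covers a singular 𝑿^T𝑿) are implementation choices, not part of the slides:

```python
import numpy as np

def fit_least_squares(X, y):
    """Closed-form least-squares fit w = (X^T X)^{-1} X^T y.

    X: (n, d) feature matrix without the constant column; a column of
    ones is prepended so that w[0] plays the role of w0.
    """
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.linalg.pinv(Xb) @ y       # X^dagger y
```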
Another approach for optimizing the sum squared error
} Iterative approach for solving the following optimization problem:
J(𝒘) = ∑_{i=1}^{n} (y^(i) − f(𝒙^(i); 𝒘))²
Gradient descent
} Cost function: J(𝒘)
} Optimization problem: 𝒘̂ = argmin_𝒘 J(𝒘)
} Steps:
} Start from 𝒘^0
} Repeat
} Update 𝒘^t to 𝒘^(t+1) in order to reduce J
} t ← t + 1
∇𝒘 J(𝒘) = [∂J(𝒘)/∂w1, ∂J(𝒘)/∂w2, … , ∂J(𝒘)/∂wd]
[Figure: the cost surface J(w0, w1) over the (w0, w1) plane.]
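A generic gradient descent loop, sketched in NumPy; the stopping rule, the step size, and the quadratic example cost are illustrative assumptions:

```python
import numpy as np

def gradient_descent(grad_J, w0, eta=0.01, n_steps=1000, tol=1e-8):
    """Generic gradient descent: w_{t+1} = w_t - eta * grad J(w_t)."""
    w = np.array(w0, dtype=float)
    for _ in range(n_steps):
        g = grad_J(w)
        w = w - eta * g
        if np.linalg.norm(g) < tol:    # stop when the gradient is ~0
            break
    return w

# Example on a simple convex quadratic J(w) = ||w - [1, 2]||^2:
# w_hat = gradient_descent(lambda w: 2 * (w - np.array([1.0, 2.0])),
#                          w0=[0.0, 0.0], eta=0.1)
```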
Problem of gradient descent with non-convex cost functions
[Figure: a non-convex cost surface J(w0, w1) with several local minima; depending on the starting point, gradient descent can get stuck in a local minimum.]
Gradient descent for SSE cost function
} Minimize J(𝒘)
𝒘^(t+1) = 𝒘^t − η ∇𝒘 J(𝒘^t)
𝒘^(t+1) = 𝒘^t + η ∑_{i=1}^{n} (y^(i) − 𝒘^T𝒙^(i)) 𝒙^(i)   (the factor of 2 from the gradient is absorbed into the learning rate η)
Gradient descent for SSE cost function
} Weight update rule for f(𝒙; 𝒘) = 𝒘^T𝒙:
𝒘^(t+1) = 𝒘^t + η ∑_{i=1}^{n} (y^(i) − 𝒘^T𝒙^(i)) 𝒙^(i)
Batch mode: each step considers all training data.
[Figure: contour plot of J(w0, w1).]
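A sketch of the batch update for the SSE cost; the learning rate and number of epochs are illustrative and typically need tuning (features may also need scaling for the iteration to converge):

```python
import numpy as np

def batch_gd_sse(X, y, eta=1e-3, n_epochs=500):
    """Batch gradient descent for the SSE cost of linear regression.

    Each step uses all training data:
        w <- w + eta * sum_i (y^(i) - w^T x^(i)) x^(i)
    X: (n, d) matrix; a constant column is prepended for w0.
    """
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(n_epochs):
        residuals = y - Xb @ w          # y^(i) - w^T x^(i) for all i
        w = w + eta * (Xb.T @ residuals)
    return w
```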
[Figures: a sequence of slides tracing gradient descent; on the left, the current hypothesis f(x; w0, w1) = w0 + w1 x plotted against the data, and on the right, the corresponding point on the contour plot of J(w0, w1), a function of the parameters (w0, w1).]
This example has been adapted from Prof. Ng's slides (ML Online Course, Stanford).
Stochastic gradient descent
} Example: linear regression with the SSE cost function
J^(i)(𝒘) = (y^(i) − 𝒘^T𝒙^(i))²
𝒘^(t+1) = 𝒘^t + η (y^(i) − 𝒘^T𝒙^(i)) 𝒙^(i)
This per-example update rule is known as Least Mean Squares (LMS).
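A sketch of the stochastic (per-example) LMS update; shuffling the examples each epoch and the specific learning rate are illustrative choices:

```python
import numpy as np

def lms_sgd(X, y, eta=1e-3, n_epochs=50, seed=0):
    """Stochastic gradient descent (LMS rule) for linear regression.

    Each update uses a single example:
        w <- w + eta * (y^(i) - w^T x^(i)) x^(i)
    """
    rng = np.random.default_rng(seed)
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    w = np.zeros(Xb.shape[1])
    n = Xb.shape[0]
    for _ in range(n_epochs):
        for i in rng.permutation(n):    # sweep examples in random order
            w = w + eta * (y[i] - Xb[i] @ w) * Xb[i]
    return w
```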
min_𝜽 ∑_{i=1}^{n} Loss(y^(i), f(𝒙^(i); 𝜽))   (empirical loss)
Empirical (training) loss = (1/n) ∑_{i=1}^{n} Loss(y^(i), f(𝒙^(i); 𝜽))
[Figures: results for training sets of size n = 10, n = 20, and n = 50.]
Linear regression: generalization
} Will the solution improve as the number of training examples increases?
} Why does the mean squared error stop decreasing after reaching a certain level?
Linear regression: types of errors
} Structural error: the error introduced by the limited function class; it remains even with infinite training data. Let 𝒘* denote the best linear weight vector in that limit.
} Approximation error: the additional error due to estimating the weights from a finite training set,
𝒘̂ = argmin_𝒘 ∑_{i=1}^{n} (y^(i) − 𝒘^T𝒙^(i))²
Approximation error: E_𝒙[(𝒘*^T𝒙 − 𝒘̂^T𝒙)²]
Polynomial regression (univariate, degree m): f(x; 𝒘) = w0 + w1 x + … + wm x^m
} Solution: 𝒘̂ = (𝑿′^T𝑿′)^(−1)𝑿′^T𝒚
𝒚 = [y^(1), … , y^(n)]^T
𝑿′ = [[1, x^(1), (x^(1))², … , (x^(1))^m]; [1, x^(2), (x^(2))², … , (x^(2))^m]; … ; [1, x^(n), (x^(n))², … , (x^(n))^m]]
𝒘 = [w0, w1, … , wm]^T
Polynomial regression: example
[Figures: polynomial fits of degree m = 1, 3, 5, and 7 to the same data.]
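A sketch of fitting such degree-m polynomials via the pseudo-inverse of the design matrix 𝑿′; np.vander is used here to build the powers of x, and the function name is illustrative:

```python
import numpy as np

def fit_polynomial(x, y, m):
    """Fit a degree-m polynomial by least squares.

    Builds the design matrix X' with rows [1, x, x^2, ..., x^m]
    and solves w = (X'^T X')^{-1} X'^T y via the pseudo-inverse.
    """
    Xp = np.vander(x, N=m + 1, increasing=True)   # columns 1, x, ..., x^m
    return np.linalg.pinv(Xp) @ y

# Comparing degrees as in the example (m values from the slide):
# for m in (1, 3, 5, 7):
#     w = fit_polynomial(x_train, y_train, m)
```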
Generalized linear
} Linear combination of fixed non-linear functions of the input vector:
f(𝒙; 𝒘) = w0 + w1 φ1(𝒙) + … + wm φm(𝒙)
} Polynomial regression (univariate) is the special case φj(x) = x^j
Generalized linear: optimization
J(𝒘) = ∑_{i=1}^{n} (y^(i) − f(𝒙^(i); 𝒘))²
     = ∑_{i=1}^{n} (y^(i) − 𝒘^T𝝓(𝒙^(i)))²
𝒚 = [y^(1), … , y^(n)]^T
𝚽 = [[1, φ1(𝒙^(1)), … , φm(𝒙^(1))]; [1, φ1(𝒙^(2)), … , φm(𝒙^(2))]; … ; [1, φ1(𝒙^(n)), … , φm(𝒙^(n))]]
𝒘 = [w0, w1, … , wm]^T
𝒘̂ = (𝚽^T𝚽)^(−1)𝚽^T𝒚
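A sketch of generalized linear regression for one assumed choice of basis, Gaussian (RBF) functions; this particular basis and the gamma parameter are not specified in the slides, and any other fixed non-linear φj could be substituted:

```python
import numpy as np

def design_matrix(X, centers, gamma=1.0):
    """Build Phi with rows [1, phi_1(x), ..., phi_m(x)] using Gaussian
    basis functions phi_j(x) = exp(-gamma * ||x - c_j||^2).
    X: (n, d) inputs; centers: (m, d) basis centers (an assumed choice).
    """
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # (n, m)
    Phi = np.exp(-gamma * d2)
    return np.hstack([np.ones((X.shape[0], 1)), Phi])

def fit_generalized_linear(X, y, centers, gamma=1.0):
    """Least-squares solution w = (Phi^T Phi)^{-1} Phi^T y."""
    Phi = design_matrix(X, centers, gamma)
    return np.linalg.pinv(Phi) @ y
```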
Resources
1. C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
2. Y. S. Abu-Mostafa, "Machine Learning," California Institute of Technology, 2012.
3. Machine Learning course slides, Dr. Soleymani, Sharif University of Technology.