Linear Regression
Machine Learning: Supervised Learning

[Figure: scatter plot of Height and Weight with a fitted line; the value read off the line is the predicted weight.]
Suppose we have a dataset giving the living areas and prices of houses in some specific city:

Living area (feet²)   Price ($1000s)
2104                  400
1600                  330
2400                  369
1416                  232
3000                  540
...                   ...
Introduction to Supervised Learning
Suppose we have a dataset giving the living areas and prices of houses in some specific city (the table above). How can we learn to predict the prices of other houses, as a function of the size of their living areas?
Notations
$x^{(i)}$ - input variables (features), e.g. living area
$y^{(i)}$ - output or target variable that we are trying to predict, e.g. price
$(x^{(i)}, y^{(i)})$ - a training example
$\{(x^{(i)}, y^{(i)}) : i = 1, \dots, m\}$ - the training set: the list of m training examples that we will use to learn
$\mathcal{X}$ - space of input values, $\mathcal{Y}$ - space of output values; here $\mathcal{X} = \mathcal{Y} = \mathbb{R}$
$\mathbf{x}$ (bold lowercase) - a vector; $\mathbf{X}$ (bold uppercase) - a matrix
Supervised Learning
[Diagram: the learning algorithm outputs a hypothesis h; x (living area of house) → h → predicted y (predicted price of house).]
Supervised Learning
Now the x's are two-dimensional vectors: $x_1^{(i)}$ = living area, $x_2^{(i)}$ = number of bedrooms.

Living area (feet²)   #bedrooms   Price ($1000s)
2104                  3           400
1600                  3           330
2400                  3           369
1416                  2           232
3000                  4           540
...                   ...         ...
Linear Regression
The x's are two-dimensional vectors: $x_1^{(i)}$ = living area, $x_2^{(i)}$ = number of bedrooms. For example, from the table above:
$\mathbf{x}^{(1)} = \begin{bmatrix} 2104 \\ 3 \end{bmatrix}, \; y^{(1)} = 400 \qquad \mathbf{x}^{(2)} = \begin{bmatrix} 1600 \\ 3 \end{bmatrix}, \; y^{(2)} = 330$
Linear Regression
Measure the distance from the line to the data, square each distance, and then add them up.
Linear Regression
Rotate the line a little bit more.
Linear Regression
$y = 0.1 + 0.78x$, where 0.1 is the y-axis intercept and 0.78 is the slope.
Linear Regression
The model: $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$
$\boldsymbol{\theta}$ is the model's parameter vector, containing the bias term $\theta_0$ and the feature weights $\theta_1$ to $\theta_n$.
n is the number of input variables (not counting $x_0$).
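As an illustration (not part of the original slides; the helper name is made up), the hypothesis can be computed as a single dot product once a constant feature $x_0 = 1$ is prepended to the input:

```python
import numpy as np

# Illustrative helper: h_theta(x) = theta_0*x_0 + theta_1*x_1 + ... + theta_n*x_n,
# using the convention x_0 = 1 so the bias term is part of the dot product.
def predict(theta, x):
    x_with_bias = np.concatenate(([1.0], x))  # prepend x_0 = 1
    return float(np.dot(theta, x_with_bias))

theta = np.array([0.1, 0.78])            # [bias, feature weight], as in y = 0.1 + 0.78x
print(predict(theta, np.array([2.0])))   # 0.1 + 0.78*2 = 1.66
```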
Linear Regression
$\theta_j := \theta_j - \eta \, \frac{\partial}{\partial \theta_j} \mathrm{MSE}(\boldsymbol{\theta})$
Perform this update simultaneously for all values of j = 0, ..., n.
$\eta$ is the learning rate.
Linear Regression: Gradient Descent
If the learning rate is too small, many iterations are needed until convergence.
Linear Regression: Gradient Descent
If the learning rate is too high, we might jump across the valley. This might make the algorithm diverge, failing to find a good solution.
Linear Regression: Gradient Descent
$\frac{\partial}{\partial \theta_j}\mathrm{MSE}(\boldsymbol{\theta}) = \frac{\partial}{\partial \theta_j}\,\frac{1}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)^2$

$= \frac{1}{m}\Bigl[\,2\bigl(h_\theta(x^{(1)}) - y^{(1)}\bigr)\frac{\partial}{\partial \theta_j}\bigl(h_\theta(x^{(1)}) - y^{(1)}\bigr) + \dots + 2\bigl(h_\theta(x^{(m)}) - y^{(m)}\bigr)\frac{\partial}{\partial \theta_j}\bigl(h_\theta(x^{(m)}) - y^{(m)}\bigr)\Bigr]$

$= \frac{1}{m}\Bigl[\,2\bigl(h_\theta(x^{(1)}) - y^{(1)}\bigr)\frac{\partial}{\partial \theta_j}\Bigl(\sum_{i=0}^{n}\theta_i x_i^{(1)} - y^{(1)}\Bigr) + \dots + 2\bigl(h_\theta(x^{(m)}) - y^{(m)}\bigr)\frac{\partial}{\partial \theta_j}\Bigl(\sum_{i=0}^{n}\theta_i x_i^{(m)} - y^{(m)}\Bigr)\Bigr]$

$= \frac{2}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\,x_j^{(i)}$
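A small sketch (illustrative names and random toy data, not from the slides) that implements the gradient formula just derived and checks it against a finite-difference estimate:

```python
import numpy as np

def mse(theta, X, y):
    return np.mean((X @ theta - y) ** 2)

def mse_gradient(theta, X, y):
    # (2/m) * sum_i (h(x^(i)) - y^(i)) * x_j^(i) for every j, in vectorized form
    m = len(y)
    return (2.0 / m) * X.T @ (X @ theta - y)

# Tiny random problem; X already contains the x_0 = 1 column.
rng = np.random.default_rng(0)
X = np.c_[np.ones(5), rng.normal(size=(5, 2))]
y = rng.normal(size=5)
theta = np.zeros(3)

analytic = mse_gradient(theta, X, y)
eps = 1e-6
numeric = np.array([(mse(theta + eps * np.eye(3)[j], X, y) - mse(theta, X, y)) / eps
                    for j in range(3)])
print(np.allclose(analytic, numeric, atol=1e-4))  # True: the derivation checks out
```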
Gradient Descent
Update Rule:
$\theta_j := \theta_j - \eta\,\frac{2}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\,x_j^{(i)}$

[Figure: Height vs. Weight scatter plot with the fitted line $h_\theta(x) = \theta_0 + \theta_1 x$.]

Training set:
$x^{(1)} = 0.5, \; y^{(1)} = 1.4$
$x^{(2)} = 2.3, \; y^{(2)} = 1.9$
$x^{(3)} = 2.9, \; y^{(3)} = 3.2$
Gradient Descent: Example
Height = intercept + slope * Weight, i.e. $h_\theta(x) = \theta_0 + \theta_1 x$

[Figure: the three training points plotted as Height vs. Weight.]

Training set:
$x^{(1)} = 0.5, \; y^{(1)} = 1.4$
$x^{(2)} = 2.3, \; y^{(2)} = 1.9$
$x^{(3)} = 2.9, \; y^{(3)} = 3.2$
Gradient Descent
Let's keep the slope fixed for now: slope = 0.64.
Give the intercept some random initial value: intercept = 0.
Training set: $x^{(1)} = 0.5,\, y^{(1)} = 1.4$; $x^{(2)} = 2.3,\, y^{(2)} = 1.9$; $x^{(3)} = 2.9,\, y^{(3)} = 3.2$
Gradient Descent
Training set: $x^{(1)} = 0.5,\, y^{(1)} = 1.4$; $x^{(2)} = 2.3,\, y^{(2)} = 1.9$; $x^{(3)} = 2.9,\, y^{(3)} = 3.2$
Predicted Height = 0 + 0.64 * Weight

[Figure: the line with intercept 0 and slope 0.64 drawn over the three training points (Height vs. Weight).]
Gradient Descent
Training set: $x^{(1)} = 0.5,\, y^{(1)} = 1.4$; $x^{(2)} = 2.3,\, y^{(2)} = 1.9$; $x^{(3)} = 2.9,\, y^{(3)} = 3.2$
Predicted Height = 0 + 0.64 * Weight
For each training point, take the residual between the predicted and the actual height, square it, and add the squares up to get the MSE for this intercept.

[Figure: left - the line with intercept 0 and slope 0.64 and its residuals to the three training points (Height vs. Weight); right - MSE plotted against the intercept, with the point for intercept = 0 marked.]
Gradient Descent
Intercept = 0.25

[Figure: left - the line with intercept 0.25 and slope 0.64 over the training points (Height vs. Weight); right - the MSE vs. intercept curve with the value for intercept = 0.25 added.]
Gradient Descent

[Figure: the same pair of plots for further intercept values; each candidate intercept contributes one point on the MSE vs. intercept curve.]
$\mathrm{MSE} = \frac{1}{3}\Bigl[\bigl(\text{intercept} + 0.64 \cdot 0.5 - 1.4\bigr)^2 + \bigl(\text{intercept} + 0.64 \cdot 2.3 - 1.9\bigr)^2 + \bigl(\text{intercept} + 0.64 \cdot 2.9 - 3.2\bigr)^2\Bigr]$

This gives us an equation for the MSE vs. intercept curve.

[Figure: the MSE vs. intercept curve traced by this equation.]
Gradient Descent
$\frac{\partial}{\partial\,\text{intercept}}\mathrm{MSE} = \frac{\partial}{\partial\,\text{intercept}}\,\frac{1}{3}\sum_{i=1}^{3}\bigl(\text{predicted}^{(i)} - \text{actual}^{(i)}\bigr)^2$

$= \frac{1}{3}\Bigl[\frac{\partial}{\partial\,\text{intercept}}\bigl(\text{intercept} + 0.64 \cdot 0.5 - 1.4\bigr)^2 + \frac{\partial}{\partial\,\text{intercept}}\bigl(\text{intercept} + 0.64 \cdot 2.3 - 1.9\bigr)^2 + \frac{\partial}{\partial\,\text{intercept}}\bigl(\text{intercept} + 0.64 \cdot 2.9 - 3.2\bigr)^2\Bigr]$

$= \frac{2}{3}\Bigl[\bigl(\text{intercept} + 0.64 \cdot 0.5 - 1.4\bigr) + \bigl(\text{intercept} + 0.64 \cdot 2.3 - 1.9\bigr) + \bigl(\text{intercept} + 0.64 \cdot 2.9 - 3.2\bigr)\Bigr]$
Gradient Descent
$\frac{\partial}{\partial\,\text{intercept}}\mathrm{MSE} = \frac{\partial}{\partial\,\text{intercept}}\,\frac{1}{3}\sum_{i=1}^{3}\bigl(\text{predicted}^{(i)} - \text{actual}^{(i)}\bigr)^2$
The closer we get to the optimal intercept, the closer the slope (the derivative) gets to zero.
Gradient Descent
$\eta = 0.1$

[Figure: the MSE vs. intercept curve.]
Gradient Descent
$\eta = 0.1$
$\text{intercept} := 0.19 - 0.1 \cdot (-1.5) = 0.19 + 0.15 = 0.34$

[Figure: the step from intercept = 0.19 to 0.34 on the MSE vs. intercept curve.]
Gradient Descent
$\eta = 0.1$
$\text{intercept} := 0.8 - 0.1 \cdot (-0.3) = 0.8 + 0.03 = 0.83$

[Figure: a smaller step near the minimum of the MSE vs. intercept curve, since the derivative is now small.]
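The slides do not include code for this example; the following sketch (variable names are mine) reproduces the intercept-only updates on the three training points, with the slope held at 0.64:

```python
# Sketch of the worked example above.
x = [0.5, 2.3, 2.9]   # Weight (inputs)
y = [1.4, 1.9, 3.2]   # Height (targets)
slope = 0.64          # kept fixed, as in the slides
eta = 0.1             # learning rate

def mse_derivative(intercept):
    """d(MSE)/d(intercept) = (2/3) * sum of (predicted - actual)."""
    return (2 / len(x)) * sum((intercept + slope * xi) - yi for xi, yi in zip(x, y))

intercept = 0.0
for step in range(100):
    grad = mse_derivative(intercept)
    intercept -= eta * grad          # e.g. 0.19 - 0.1*(-1.5) ≈ 0.34, as on the slide
    if abs(grad) < 1e-3:             # the slope of the MSE curve is (almost) zero
        break

print(round(intercept, 3))           # close to the intercept that minimizes the MSE
```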
Linear Regression: Batch Gradient Descent
Repeat until convergence {
    $\theta_j := \theta_j - \eta\,\frac{2}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\,x_j^{(i)}$   (for every j)
}
Uses the whole training set to compute the gradients at every step
Has to scan through the entire training set before taking a single step
Costly if m is large
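A possible NumPy sketch of batch gradient descent, using the whole training set for every update (the toy data and names are illustrative, not from the slides):

```python
import numpy as np

def batch_gradient_descent(X, y, eta=0.1, n_iters=1000):
    """X includes the x_0 = 1 column; returns the fitted parameter vector."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        gradient = (2.0 / m) * X.T @ (X @ theta - y)  # uses all m examples per step
        theta -= eta * gradient
    return theta

# Toy data: y ≈ 4 + 3x plus noise
rng = np.random.default_rng(42)
x = 2 * rng.random(100)
y = 4 + 3 * x + rng.normal(scale=0.5, size=100)
X = np.c_[np.ones(100), x]
print(batch_gradient_descent(X, y))   # close to [4, 3]
```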
Linear Regression: Stochastic Gradient Descent
Loop {
    for i = 1 to m {
        $\theta_j := \theta_j - \eta \cdot 2\,\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\,x_j^{(i)}$   (for every j)
    }
}
Computes the gradient based on a single training example
Makes progress with each example it looks at
Much faster
May not converge to the minimum, but the value it finds is generally a good approximation of the true minimum
Preferred when the training set is large
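A corresponding sketch of stochastic gradient descent (illustrative names and toy data; shuffling the examples each epoch is a common addition that the pseudocode above does not show):

```python
import numpy as np

def stochastic_gradient_descent(X, y, eta=0.01, n_epochs=50):
    """One parameter update per training example."""
    m, n = X.shape
    theta = np.zeros(n)
    rng = np.random.default_rng(0)
    for _ in range(n_epochs):
        for i in rng.permutation(m):            # visit examples in random order
            error = X[i] @ theta - y[i]
            theta -= eta * 2 * error * X[i]     # gradient from a single example
    return theta

# Toy data: y ≈ 4 + 3x plus noise, with the x_0 = 1 column included in X
rng = np.random.default_rng(42)
x = 2 * rng.random(100)
y = 4 + 3 * x + rng.normal(scale=0.5, size=100)
X = np.c_[np.ones(100), x]
print(stochastic_gradient_descent(X, y))  # roughly [4, 3], noisier than batch GD
```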
Linear Regression in Python
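The code shown on this slide is not reproduced here; as a minimal sketch, assuming NumPy and scikit-learn are available, a linear model can be fitted as follows:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: y ≈ 4 + 3x plus noise
rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X[:, 0] + rng.normal(scale=0.5, size=100)

model = LinearRegression()             # least-squares fit, no learning rate needed
model.fit(X, y)
print(model.intercept_, model.coef_)   # close to 4 and [3]
print(model.predict([[1.5]]))          # predicted y for x = 1.5
```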
Linear Regression: Normal Equation
Suppose we have:
m training examples $(\mathbf{x}^{(i)}, y^{(i)})$
n features, $\mathbf{x}^{(i)} = [x_1^{(i)}, x_2^{(i)}, \dots, x_n^{(i)}] \in \mathbb{R}^n$
Define the design matrix $\mathbf{X}$ to be the m-by-n matrix (m-by-(n+1) if we include the intercept term) that contains the training examples' input values in its rows:
$\mathbf{X} = \begin{bmatrix} (\mathbf{x}^{(1)})^T \\ \vdots \\ (\mathbf{x}^{(m)})^T \end{bmatrix} = \begin{bmatrix} x_1^{(1)} & \cdots & x_n^{(1)} \\ & \ddots & \\ x_1^{(m)} & \cdots & x_n^{(m)} \end{bmatrix}, \qquad \mathbf{y} = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(m)} \end{bmatrix} \in \mathbb{R}^m$
Linear Regression: Normal Equation
We express our problem in matrix form: $\mathbf{X}\boldsymbol{\theta} = \mathbf{y}$.
Note that there is also an intercept term $x_0^{(i)} = 1$, so
$\mathbf{x}^{(i)} = [x_0^{(i)}, x_1^{(i)}, \dots, x_n^{(i)}] \in \mathbb{R}^{n+1}$ and $\mathbf{X} = \begin{bmatrix} (\mathbf{x}^{(1)})^T \\ \vdots \\ (\mathbf{x}^{(m)})^T \end{bmatrix} = \begin{bmatrix} x_0^{(1)} & \cdots & x_n^{(1)} \\ & \ddots & \\ x_0^{(m)} & \cdots & x_n^{(m)} \end{bmatrix} \in \mathbb{R}^{m \times (n+1)}$
Linear Regression: Normal Equation
Let's expand $J(\boldsymbol{\theta})$:
$J(\boldsymbol{\theta}) = \lVert \mathbf{y} - \mathbf{X}\boldsymbol{\theta} \rVert^2 = (\mathbf{y} - \mathbf{X}\boldsymbol{\theta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\theta}) = \mathbf{y}^T\mathbf{y} - (\mathbf{X}\boldsymbol{\theta})^T\mathbf{y} - \mathbf{y}^T\mathbf{X}\boldsymbol{\theta} + (\mathbf{X}\boldsymbol{\theta})^T\mathbf{X}\boldsymbol{\theta} = \mathbf{y}^T\mathbf{y} - 2\boldsymbol{\theta}^T\mathbf{X}^T\mathbf{y} + \boldsymbol{\theta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\theta}$
Minimize $J(\boldsymbol{\theta})$ w.r.t. $\boldsymbol{\theta}$:
$\frac{\partial J(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} = -2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\boldsymbol{\theta} = 0$
$\mathbf{X}^T\mathbf{X}\,\boldsymbol{\theta} = \mathbf{X}^T\mathbf{y}$
$\boldsymbol{\theta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$
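A short NumPy sketch of the normal equation on toy data (np.linalg.solve avoids forming the inverse explicitly; the data and names are illustrative):

```python
import numpy as np

# Toy data: y ≈ 4 + 3x plus noise; X gets a column of ones for the intercept term.
rng = np.random.default_rng(42)
x = 2 * rng.random(100)
y = 4 + 3 * x + rng.normal(scale=0.5, size=100)
X = np.c_[np.ones(100), x]

# theta = (X^T X)^{-1} X^T y, solved as a linear system
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)                              # close to [4, 3]

# Equivalent closed-form fit via least squares:
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta_lstsq)
```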
What if your data is actually more complex than a simple straight line?
We can use a linear model to fit nonlinear data
A model is said to be linear if it is linear in parameters
$y = \theta_0 + \theta_1 x_1 + \theta_2 x_2$
$y = \theta_0 + \theta_1 x + \theta_2 x^2$
Add powers of each feature as new features, then train a linear model on
this extended set of features
Polynomial Regression
$y = 3.52 + 0.98x$
Polynomial Regression: Example
$x^{(i)} \rightarrow [\,x^{(i)},\, (x^{(i)})^2\,]$
$y = 2 + x + 0.5x^2$
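A possible sketch of this example (assuming scikit-learn; the noise level and sample size are made up): add the squared feature, then train an ordinary linear model on the extended features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Data generated from y = 2 + x + 0.5*x^2 plus noise, as in the example
rng = np.random.default_rng(0)
x = 6 * rng.random((100, 1)) - 3
y = 2 + x[:, 0] + 0.5 * x[:, 0] ** 2 + rng.normal(scale=0.5, size=100)

# Extend each x^(i) to [x^(i), (x^(i))^2], then fit a linear model
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(x)
model = LinearRegression().fit(X_poly, y)
print(model.intercept_, model.coef_)   # roughly 2 and [1, 0.5]
```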
$\mathrm{RMSE}(\mathbf{X}, h) = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\bigl(h(\mathbf{x}^{(i)}) - y^{(i)}\bigr)^2}$
The R² score (coefficient of determination) measures how much of the total variance of the dependent variable is explained (i.e. reduced) by the least-squares regression.
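A minimal sketch of both metrics (scikit-learn's sklearn.metrics also provides mean_squared_error and r2_score):

```python
import numpy as np

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

def r2_score(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total variance of the target
    return 1 - ss_res / ss_tot

y_true = np.array([1.4, 1.9, 3.2])
y_pred = np.array([1.3, 2.0, 3.0])
print(rmse(y_true, y_pred), r2_score(y_true, y_pred))
```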
Update Rule:
$\theta_j := \theta_j - \eta \sum_{i=1}^{m} w^{(i)}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\,x_j^{(i)}$
The magnitude of the update is proportional to the error and to the weight of the sample.
Locally Weighted Linear Regression
$w^{(i)} = \exp\!\left(-\frac{(x^{(i)} - x)^2}{2\tau^2}\right)$, where $\tau$ is the bandwidth parameter.
$\boldsymbol{\theta}$ is chosen giving a much higher weight to the (errors on) training examples close to the query point x.
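A sketch of locally weighted linear regression for one query point, using the closed-form weighted least-squares solution $\boldsymbol{\theta} = (\mathbf{X}^T W \mathbf{X})^{-1}\mathbf{X}^T W \mathbf{y}$ rather than the gradient update above (data and names are illustrative):

```python
import numpy as np

def lwlr_predict(x_query, X, y, tau=0.5):
    """Fit a weighted least-squares line around x_query and return its prediction.
    Weights: w^(i) = exp(-(x^(i) - x_query)^2 / (2*tau^2)), tau = bandwidth."""
    w = np.exp(-((X[:, 1] - x_query) ** 2) / (2 * tau ** 2))
    W = np.diag(w)
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return np.array([1.0, x_query]) @ theta

# Data from y = 2 + x + 0.5*x^2 plus noise, query point x = 2.5
rng = np.random.default_rng(0)
x = 6 * rng.random(100) - 3
y = 2 + x + 0.5 * x ** 2 + rng.normal(scale=0.3, size=100)
X = np.c_[np.ones(100), x]
print(lwlr_predict(2.5, X, y))   # roughly 2 + 2.5 + 0.5*2.5**2 = 7.625
```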
Linear Regression

[Figure: an ordinary linear regression fit to data generated from $y = 2 + x + 0.5x^2$.]

Locally Weighted Linear Regression

[Figure: locally weighted fits to the same $y = 2 + x + 0.5x^2$ data around the query point $x = 2.5$.]
Locally Weighted Linear Regression
2) Mean normalization: $x_i := \frac{x_i - \mu_i}{\max(x_i) - \min(x_i)}$ or $x_i := \frac{x_i - \mu_i}{s_i}$, where $\mu_i$ is the mean of feature $i$ and $s_i$ its standard deviation.
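A small sketch of mean normalization applied column-wise (using the standard-deviation variant; the range variant is noted in a comment):

```python
import numpy as np

# Normalize each feature (column) of X to zero mean and comparable scale.
def mean_normalize(X):
    mu = X.mean(axis=0)
    s = X.std(axis=0)              # alternatively: X.max(axis=0) - X.min(axis=0)
    return (X - mu) / s

X = np.array([[2104.0, 3.0],
              [1600.0, 3.0],
              [2400.0, 3.0],
              [1416.0, 2.0],
              [3000.0, 4.0]])
print(mean_normalize(X))           # living area and #bedrooms now on similar scales
```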