
CMPE 442 Introduction to Machine Learning
Supervised Learning
Linear Regression
Machine Learning

[Figure: scatter plot of Weight versus Height]

 We can fit a line to this data to show the trend.
 But we can also use that line to make predictions.
 If someone tells us their height, we can use the line to predict that person's weight.
 The fitted line is a simple form of Machine Learning, because we can use it to make predictions.
 Machine Learning is all about predictions and classifications.
Introduction to Supervised Learning

 Suppose we have a dataset giving the living areas and prices of houses in some specific city:

  Living area (feet2)   Prices (1000$s)
  2104                  400
  1600                  330
  2400                  369
  1416                  232
  3000                  540
  ...                   ...
Introduction to Supervised Learning

 How can we learn to predict the prices of other houses, as a function of the size of their living areas?
Notations

 x^(i) - input variables (features). Ex: living area
 y^(i) - output or target variable that we are trying to predict. Ex: price
 (x^(i), y^(i)) - a training example
 {(x^(i), y^(i)) : i = 1, ..., m} - the training set: a list of m training examples, the dataset that we will be using to learn
 X - space of input values, Y - space of output values; here X = Y = ℝ
 x (bold lowercase) - a vector
 X (bold uppercase) - a matrix
Supervised Learning

 Given a training set, the goal is to learn a function h : X → Y so that h(x) is a good predictor for the corresponding value of y.
 h is called the hypothesis.

  Training set → Learning algorithm → h
  x (living area of house) → h → predicted y (predicted price of house)
Supervised Learning

 Regression problem: the target variable that we are trying to predict is continuous.
 Classification problem: the target variable can take on only a small number of discrete values.
Linear Regression

  Living area (feet2)   #bedrooms   Prices (1000$s)
  2104                  3           400
  1600                  3           330
  2400                  3           369
  1416                  2           232
  3000                  4           540
  ...                   ...         ...
Linear Regression

 With two input features, the x's are two-dimensional vectors with components x_1^(i) (living area) and x_2^(i) (#bedrooms).
 For example, from the table above:
   x^(1) = [2104, 3]^T,  y^(1) = 400
   x^(2) = [1600, 3]^T,  y^(2) = 330
Linear Regression

 How are we going to represent the hypothesis h in a computer?
Linear Regression

 First, draw a line through the data.
 Second, measure the distance from the line to each data point, square each distance, and then add them up. The distance from the line to a data point is called a "residual".
 Third, rotate the line a little bit, and again sum up the squared residuals.
 Repeating this, the rotation with the smallest sum of squared residuals is the least squares fit. Superimposed on the original data, it is:
   y = 0.1 + 0.78x
 where 0.1 is the y-axis intercept and 0.78 is the slope.
Linear Regression

 A linear model is a linear function of the input x:
   h_θ(x) = θ_0 + θ_1 x_1 + θ_2 x_2
 The θ_i's are the parameters, parameterizing the space of linear functions mapping from X to Y.
 Setting x_0 = 1 (the intercept term):
   h(x) = Σ_{i=0}^{n} θ_i x_i = θ^T x
 θ is the model's parameter vector, containing the bias term θ_0 and the feature weights θ_1 to θ_n.
 n is the number of input variables (not counting x_0).
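As a quick sketch of the hypothesis h(x) = θ^T x in NumPy (the parameter and feature values here are illustrative, not from the slides):

```python
import numpy as np

theta = np.array([0.1, 0.78])   # [theta_0 (intercept), theta_1] - illustrative values
x = np.array([1.0, 2.0])        # x_0 = 1 prepended to a single feature x_1 = 2.0
h = theta @ x                   # h_theta(x) = theta^T x = 0.1 + 0.78 * 2.0
print(h)                        # 1.66
```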
Linear Regression

 Given a training set, how do we learn the parameters θ of h(x) = Σ_{i=0}^{n} θ_i x_i = θ^T x?
 Make h(x) close to y, at least for the training examples.
 Define a function that measures, for each value of the θ's, how close the h(x^(i))'s are to the corresponding y^(i)'s.
 The most common performance measure of a regression model is the Root Mean Square Error (RMSE):
   RMSE(X, h) = sqrt( (1/m) Σ_{i=1}^{m} (h(x^(i)) − y^(i))² )
 It is simpler to minimize the Mean Square Error (MSE):
   MSE(X, h) = (1/m) Σ_{i=1}^{m} (h(x^(i)) − y^(i))²
Linear Regression: Gradient Descent

 Gradient Descent is a generic optimization algorithm that finds an optimal solution.
 General idea: tweak the parameters iteratively in order to minimize a cost function.
   o Randomly initialize the parameters, then gradually improve them, taking one small step at a time, with each step attempting to decrease the cost function.
   o The size of these steps is determined by the learning rate hyperparameter.
   o Hyperparameter: a parameter whose value is set before the learning process begins.
Linear Regression: Gradient Descent

 The MSE cost function for Linear Regression is a convex function:
   there are no local minima, just one global minimum;
   it is continuous, and its slope never changes abruptly.
 Gradient Descent is therefore guaranteed to approach the global minimum if you wait long enough and the learning rate is not too high.
 Training a model means searching for a combination of model parameters that minimizes the cost function.
 It is a search in the model's parameter space.
Linear Regression: Gradient Descent

 Choose θ so as to minimize MSE(θ).
 Consider the Gradient Descent algorithm:
   Start with some initial guess for θ.
   Repeatedly change θ to make MSE(θ) smaller, until it converges:
     θ_j := θ_j − η · ∂/∂θ_j MSE(θ)
   Perform this update simultaneously for all values of j = 0, ..., n.
   η is the learning rate.
Linear Regression: Gradient Descent

 If the learning rate is too small, then many iterations are needed until convergence.
 If the learning rate is too high, then we might jump across the valley. This might make the algorithm diverge, failing to find a good solution.
Linear Regression: Gradient Descent

 Some cost functions have an irregular shape.
 If the random initialization starts the algorithm on the wrong side of such a shape, it will converge to a local minimum instead of the global one.
Gradient Descent

 ∂/∂θ_j MSE(θ) = ∂/∂θ_j (1/m) Σ_{i=1}^{m} (h(x^(i)) − y^(i))²
   = (1/m) [ 2 (h(x^(1)) − y^(1)) · ∂/∂θ_j (h(x^(1)) − y^(1)) + ... + 2 (h(x^(m)) − y^(m)) · ∂/∂θ_j (h(x^(m)) − y^(m)) ]
   = (1/m) [ 2 (h(x^(1)) − y^(1)) · ∂/∂θ_j (Σ_{k=0}^{n} θ_k x_k^(1) − y^(1)) + ... + 2 (h(x^(m)) − y^(m)) · ∂/∂θ_j (Σ_{k=0}^{n} θ_k x_k^(m) − y^(m)) ]
   = (2/m) Σ_{i=1}^{m} (h(x^(i)) − y^(i)) · x_j^(i)
Gradient Descent

 Update Rule:
   θ_j := θ_j − η · (2/m) Σ_{i=1}^{m} (h(x^(i)) − y^(i)) · x_j^(i)
 The magnitude of the update is proportional to the error.
Gradient Descent: Example

[Figure: scatter plot of Height versus Weight with a candidate fitted line]

 Height = intercept + slope * Weight
   h_θ(x) = θ_0 + θ_1 x

 Training set:
   x^(1) = 0.5,  y^(1) = 1.4
   x^(2) = 2.3,  y^(2) = 1.9
   x^(3) = 2.9,  y^(3) = 3.2
Gradient Descent

 Predicted Height = Intercept + Slope * Weight
 Let's keep the Slope fixed for now: Slope = 0.64.
 Give some random value to the Intercept: Intercept = 0.
Gradient Descent

 Predicted Height = 0 + 0.64 * Weight
 Predictions on the training set:
   h(x^(1)) = 0.64 * 0.5 = 0.32
   h(x^(2)) = 0.64 * 2.3 ≈ 1.47
   h(x^(3)) = 0.64 * 2.9 ≈ 1.86

 MSE(X, θ) = (1/3) ((0.32 − 1.4)² + (1.472 − 1.9)² + (1.856 − 3.2)²) ≈ 1.05


Gradient Descent

[Figure: left, the data with the current line; right, MSE plotted against the Intercept]

 Each candidate Intercept (for example, Intercept = 0.25) gives a different line through the data and a corresponding point on the MSE-versus-Intercept curve.
Gradient Descent

 As we increase the intercept gradually, the MSE decreases.
 Pick the line with the lowest MSE.
Gradient Descent

 Predicted height = intercept + slope * weight, with the slope fixed at 0.64.
 MSE = (1/3) Σ_{i=1}^{3} (predicted^(i) − actual^(i))²
     = (1/3) ((intercept + 0.64 * 0.5 − 1.4)²
            + (intercept + 0.64 * 2.3 − 1.9)²
            + (intercept + 0.64 * 2.9 − 3.2)²)
Gradient Descent

 This gives us an equation for the curve of MSE as a function of the intercept.
Gradient Descent

 We can take the derivative of this function and determine the slope of the MSE curve at any value of the intercept.
Gradient Descent

 ∂MSE/∂intercept = ∂/∂intercept (1/3) Σ_{i=1}^{3} (predicted^(i) − actual^(i))²
   = (1/3) ( ∂/∂intercept (intercept + 0.64 * 0.5 − 1.4)²
           + ∂/∂intercept (intercept + 0.64 * 2.3 − 1.9)²
           + ∂/∂intercept (intercept + 0.64 * 2.9 − 3.2)² )
Gradient Descent

 Applying the chain rule to each term (the inner derivative with respect to the intercept is 1):
   ∂/∂intercept (intercept + 0.64 * 0.5 − 1.4)² = 2 (intercept + 0.64 * 0.5 − 1.4) · 1
   ∂/∂intercept (intercept + 0.64 * 2.3 − 1.9)² = 2 (intercept + 0.64 * 2.3 − 1.9) · 1
   ∂/∂intercept (intercept + 0.64 * 2.9 − 3.2)² = 2 (intercept + 0.64 * 2.9 − 3.2) · 1

 For intercept = 0: ∂MSE/∂intercept ≈ −1.9.
 The closer we get to the optimal intercept, the closer the slope of the MSE curve gets to zero.
Gradient Descent

 intercept := intercept − η · ∂MSE/∂intercept
 η = 0.1
 intercept = 0 − 0.1 · (−1.9) = 0.19
Gradient Descent

 Evaluating the same derivative at intercept = 0.19 gives ∂MSE/∂intercept ≈ −1.5.
 The closer we get to the optimal intercept, the closer the slope of the MSE curve gets to zero.
Gradient Descent

 intercept = 0.19 − 0.1 · (−1.5) = 0.19 + 0.15 = 0.34
Gradient Descent

 After a few more steps, at intercept = 0.8, the derivative is ∂MSE/∂intercept ≈ −0.3.
 The closer we get to the optimal intercept, the closer the slope of the MSE curve gets to zero.
Gradient Descent

 intercept = 0.8 − 0.1 · (−0.3) = 0.8 + 0.03 = 0.83
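The sequence of intercept updates above can be reproduced with a short plain-Python sketch (slope held fixed at 0.64 and η = 0.1, as in the slides):

```python
# Gradient descent on the intercept only; the slope is held fixed at 0.64.
x = [0.5, 2.3, 2.9]   # Weight
y = [1.4, 1.9, 3.2]   # Height
slope, eta = 0.64, 0.1

def dmse_dintercept(intercept):
    # dMSE/dintercept = (1/3) * sum over examples of 2 * (intercept + slope*x_i - y_i)
    return sum(2 * (intercept + slope * xi - yi) for xi, yi in zip(x, y)) / len(x)

intercept = 0.0
for _ in range(100):
    intercept -= eta * dmse_dintercept(intercept)   # first steps: 0 -> ~0.19 -> ~0.34 -> ...

print(round(intercept, 3))   # 0.951, where the derivative is essentially zero
```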
Linear Regression: Batch Gradient Descent

Repeat until convergence {
    θ_j := θ_j − η · (2/m) Σ_{i=1}^{m} (h(x^(i)) − y^(i)) · x_j^(i)    (for every j)
}

 Uses the whole training set to compute the gradients at every step.
 Has to scan through the entire training set before taking a single step.
 Costly if m is large.
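A minimal NumPy sketch of this batch loop, run on the toy Height/Weight data (a fixed number of steps stands in for a convergence check):

```python
import numpy as np

def batch_gradient_descent(X, y, eta=0.1, n_steps=1000):
    """X: m-by-(n+1) design matrix with a leading column of 1s; y: m targets."""
    m, n_params = X.shape
    theta = np.zeros(n_params)                        # initial guess
    for _ in range(n_steps):
        gradients = (2 / m) * X.T @ (X @ theta - y)   # full-batch gradient of the MSE
        theta -= eta * gradients                      # simultaneous update of every theta_j
    return theta

X = np.array([[1.0, 0.5], [1.0, 2.3], [1.0, 2.9]])    # toy data from the slides
y = np.array([1.4, 1.9, 3.2])
print(batch_gradient_descent(X, y))                   # ≈ [0.949, 0.641]
```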
Linear Regression: Stochastic Gradient Descent

Loop {
    for i = 1 to m {
        θ_j := θ_j − η · 2 · (h(x^(i)) − y^(i)) · x_j^(i)    (for every j)
    }
}

 Computes the gradients based on a single training example at a time.
 Makes progress with each example it looks at.
 Much faster per step.
 May not converge exactly to the minimum, but the minimum it finds is generally a good approximation of the true minimum.
 Preferred when the training set is large.
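A corresponding stochastic sketch (one randomly chosen example per update; with a constant learning rate the parameters keep bouncing around the minimum, so the result is only approximate):

```python
import numpy as np

def stochastic_gradient_descent(X, y, eta=0.05, n_epochs=200, seed=0):
    rng = np.random.default_rng(seed)
    m, n_params = X.shape
    theta = np.zeros(n_params)
    for _ in range(n_epochs):
        for i in rng.permutation(m):               # visit examples in random order
            residual = X[i] @ theta - y[i]         # error on a single example
            theta -= eta * 2 * residual * X[i]     # update using that example alone
    return theta

X = np.array([[1.0, 0.5], [1.0, 2.3], [1.0, 2.9]])
y = np.array([1.4, 1.9, 3.2])
theta_sgd = stochastic_gradient_descent(X, y)
print(theta_sgd)   # hovers near the batch solution, roughly [0.95, 0.64]
```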
Linear Regression in Python
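The code from this slide is not preserved; one standard way to fit a least squares line in plain NumPy is `np.linalg.lstsq`, shown here on the toy Height/Weight data:

```python
import numpy as np

x = np.array([0.5, 2.3, 2.9])   # feature (Weight)
y = np.array([1.4, 1.9, 3.2])   # target (Height)

X = np.column_stack([np.ones_like(x), x])      # design matrix with an intercept column
theta, *_ = np.linalg.lstsq(X, y, rcond=None)  # least squares solution to X @ theta ≈ y
intercept, slope = theta
print(intercept, slope)                        # ≈ 0.949 0.641
print(intercept + slope * 2.0)                 # prediction for Weight = 2.0
```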
Linear Regression: Normal Equation

 A mathematical equation that gives the result directly: a closed-form solution.
 The minimization is performed explicitly, without resorting to an iterative algorithm.
 A one-step learning algorithm, as opposed to Gradient Descent.
Linear Regression: Normal Equation

 Suppose we have:
   m training examples (x^(i), y^(i));
   n features, x^(i) = [x_1^(i), x_2^(i), ..., x_n^(i)]^T ∈ ℝ^n.
 Define the design matrix X to be the m-by-n matrix (m-by-(n+1) if we include the intercept term) that contains the training examples' input values in its rows:

   X = [ (x^(1))^T ]   [ x_1^(1) ⋯ x_n^(1) ]
       [     ⋮     ] = [    ⋮    ⋱    ⋮    ]
       [ (x^(m))^T ]   [ x_1^(m) ⋯ x_n^(m) ]

   y = [y^(1), ..., y^(m)]^T ∈ ℝ^m

 We express our problem in matrix form: Xθ = y.
 Note that with an intercept term we set x_0^(i) = 1, so that x^(i) = [x_0^(i), x_1^(i), ..., x_n^(i)]^T ∈ ℝ^(n+1) and X ∈ ℝ^(m×(n+1)).
Linear Regression: Normal Equation

 Let θ̂ be the best-fit solution to Xθ ≈ y.
 Try to minimize the error e = y − Xθ (its entries are also called residuals).
 We take the squared norm of this error, so the objective is:
   J(θ) = ‖e‖² = ‖y − Xθ‖²
 So our problem is:
   θ̂ = argmin_θ J(θ) = argmin_θ ‖y − Xθ‖²
Linear Regression: Normal Equation

 Let's expand J(θ):
   J(θ) = ‖y − Xθ‖² = (y − Xθ)^T (y − Xθ) = y^T y − (Xθ)^T y − y^T Xθ + (Xθ)^T Xθ
        = y^T y − 2 θ^T X^T y + θ^T X^T X θ
 Minimize J(θ) w.r.t. θ:
   ∂J(θ)/∂θ = −2 X^T y + 2 X^T X θ = 0
   X^T X θ = X^T y
   θ̂ = (X^T X)^(−1) X^T y

 Computing this gets very slow when the number of features grows large.
 The equation is linear with regard to the number of instances in the training set, so it handles large training sets efficiently.
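The closed-form solution translates directly into NumPy; solving the normal equations with `np.linalg.solve` is a sketch that avoids forming the inverse explicitly (numerically safer):

```python
import numpy as np

def normal_equation(X, y):
    # Solve X^T X theta = X^T y rather than computing (X^T X)^(-1) directly
    return np.linalg.solve(X.T @ X, X.T @ y)

X = np.array([[1.0, 0.5], [1.0, 2.3], [1.0, 2.9]])   # intercept column of 1s + one feature
y = np.array([1.4, 1.9, 3.2])
theta_hat = normal_equation(X, y)
print(theta_hat)   # ≈ [0.949, 0.641], matching the least squares fit
```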
Polynomial Regression

 What if your data is actually more complex than a simple straight line?
 We can still use a linear model to fit nonlinear data.
 A model is said to be linear if it is linear in its parameters; all of the following are linear models:
   y = θ_0 + θ_1 x_1 + θ_2 x_2 + ε
   y = θ_0 + θ_1 x + θ_2 x² + ε
   y = θ_0 + θ_1 x_1 + θ_2 x_2 + θ_11 x_1² + θ_22 x_2² + θ_12 x_1 x_2 + ε
 Add powers of each feature as new features, then train a linear model on this extended set of features.
Polynomial Regression

 A kth order polynomial model in one variable is given by:
   y = θ_0 + θ_1 x + θ_2 x² + ... + θ_k x^k + ε
 A second order polynomial model in two variables is given by:
   y = θ_0 + θ_1 x_1 + θ_2 x_2 + θ_11 x_1² + θ_22 x_2² + θ_12 x_1 x_2 + ε
 An array containing n features can be transformed into an array containing (n + d)! / (d! n!) features, where d is the polynomial order.
Polynomial Regression: Example

 Let's generate some nonlinear data based on a simple quadratic equation:
   y = 2 + x + 0.5 x² + ε
 The x's are assigned randomly.
Polynomial Regression: Example

 A straight line fits nonlinear data poorly; this is a case of underfitting:
   fitted line: y = 3.52 + 0.98 x
Polynomial Regression: Example

 Add powers of each feature as new features:
   x^(i) → [x^(i), (x^(i))²]
 With the squared feature added, the fitted model is close to the generating quadratic:
   data: y = 2 + x + 0.5 x² + ε
   fitted: h(x) = 2.04 + 0.96 x + 0.51 x²
Evaluating the Performance of the Model

 RMSE is the square root of the average of the squared residuals:
   RMSE(X, h) = sqrt( (1/m) Σ_{i=1}^{m} (h(x^(i)) − y^(i))² )
 The R² score (coefficient of determination) explains how much of the total variance of the dependent variable can be reduced by using the least squares regression:
   R² = 1 − SS_res / SS_tot
 where SS_tot = Σ_i (y^(i) − ȳ)² and SS_res = Σ_i (h(x^(i)) − y^(i))².
Locally Weighted Linear Regression

 In the original linear regression algorithm, to make a prediction at a query point x, we would:
   1. Fit θ to minimize Σ_i (y^(i) − θ^T x^(i))²
   2. Output θ^T x
 In the locally weighted linear regression algorithm:
   1. Fit θ to minimize Σ_i w^(i) (y^(i) − θ^T x^(i))²
   2. Output θ^T x
Gradient Descent

 Weighted Update Rule:
   θ_j := θ_j − η Σ_{i=1}^{m} w^(i) (h(x^(i)) − y^(i)) · x_j^(i)
 The magnitude of the update is proportional to the error and to the weight of the sample.
Locally Weighted Linear Regression

 The w^(i)'s are non-negative weights.
 The weights depend on the particular point x at which we are trying to evaluate h:
   w^(i) = exp( −(x^(i) − x)² / (2τ²) )
 τ is the bandwidth parameter.
 If |x^(i) − x| is small, then w^(i) is close to 1.
 If |x^(i) − x| is large, then w^(i) is small.
 θ is chosen giving a much higher weight to the (errors on) training examples close to the query point x.
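A sketch of locally weighted regression for a single query point (one feature plus an intercept; the data and τ here are illustrative):

```python
import numpy as np

def lwr_predict(x_query, X, y, tau=0.5):
    """Fit a weighted least squares line around x_query, then predict there."""
    w = np.exp(-((X[:, 1] - x_query) ** 2) / (2 * tau**2))   # Gaussian weights
    W = np.diag(w)
    # Weighted normal equations: theta = (X^T W X)^(-1) X^T W y
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return np.array([1.0, x_query]) @ theta

x = np.linspace(-3, 3, 50)
y = 2 + x + 0.5 * x**2                     # noiseless quadratic, for illustration
X = np.column_stack([np.ones_like(x), x])
pred = lwr_predict(2.5, X, y)
print(pred)   # close to the true value 2 + 2.5 + 0.5 * 2.5**2 = 7.625
```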
Linear Regression

 For data generated from y = 2 + x + 0.5 x² + ε, the fitted model is h(x) = 2.04 + 0.96 x + 0.51 x².
 For a query point x = 2.5:
   h(2.5) = 2.04 + 0.96 · 2.5 + 0.51 · 2.5² ≈ 7.63
Locally Weighted Linear Regression

 For the same data, y = 2 + x + 0.5 x² + ε, consider the query point x_new = 2.5.
 Locally weighted regression fits a line using mainly the training points near x_new.
 This line is learnt specifically for x_new; a different query point would get its own line.
Locally Weighted Linear Regression

 The linear regression algorithm is a parametric learning algorithm:
   it has a fixed, finite number of parameters which are fit to the data;
   once we fit the θ's and store them away, we don't need to keep the training data for future predictions.
 Locally weighted linear regression is a non-parametric algorithm:
   to make predictions we need to keep the entire training set around;
   the amount of data we need to keep in order to represent the hypothesis h grows linearly with the size of the training set.
Question

 Is the gradient descent algorithm affected by the scales of the features?
Feature Normalization

 We can speed up gradient descent by normalizing the features (bringing them into approximately equal ranges):
   θ will descend quickly on small ranges.
 1) Feature scaling: x_j := x_j / (max(x_j) − min(x_j))
 2) Mean normalization: x_j := (x_j − μ_j) / (max(x_j) − min(x_j))  or  x_j := (x_j − μ_j) / σ_j
 For polynomial regression, feature normalization is even more important.
 It is not needed for the Normal Equation.
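A sketch of mean normalization with standard deviation scaling (often called standardization), applied per feature column to the housing data from earlier:

```python
import numpy as np

def standardize(X):
    """(x - mean) / std for each feature column."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Features on very different scales: living area (feet^2) and #bedrooms
X = np.array([[2104.0, 3.0], [1600.0, 3.0], [2400.0, 3.0], [1416.0, 2.0], [3000.0, 4.0]])
X_norm = standardize(X)
print(X_norm.mean(axis=0))   # ≈ [0, 0]
print(X_norm.std(axis=0))    # [1, 1]
```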
Summary on Linear Regression

 We considered the case when the hypothesis function is linear.
 We can make the hypothesis nonlinear by combining existing features to obtain more features, making the function quadratic, cubic, etc.
 Closed-form solution: the Normal Equation.
 Scaling the data is important for the gradient descent algorithm.
