Linear Regression
Machine Learning: Supervised Learning

[Figure: scatter plot of Height and Weight with a fitted line; the value read off the line is the predicted weight.]
Suppose we have a dataset giving the living areas and prices of houses in some specific city:

Living area (feet²)   Price ($1000s)
2104                  400
1600                  330
2400                  369
1416                  232
3000                  540
...                   ...
Introduction to Supervised Learning
Suppose we have a dataset giving the living areas and prices of houses in some specific city (the table above). How can we learn to predict the prices of other houses, as a function of the size of their living areas?
Notations
$x^{(i)}$ - input variables (features), e.g. living area
$y^{(i)}$ - output or target variable that we are trying to predict, e.g. price
$(x^{(i)}, y^{(i)})$ - a training example
$\{(x^{(i)}, y^{(i)}) : i = 1, \dots, m\}$ - the training set: the list of m training examples that we will use to learn
$\mathcal{X}$ - space of input values, $\mathcal{Y}$ - space of output values; here $\mathcal{X} = \mathcal{Y} = \mathbb{R}$
$\mathbf{x}$ (bold lowercase) - a vector; $\mathbf{X}$ (bold uppercase) - a matrix
Supervised Learning
[Diagram: the learning algorithm outputs a hypothesis h; x (living area of house) → h → predicted y (predicted price of house).]
Supervised Learning
Now the x's are two-dimensional vectors: $x_1^{(i)}$ = living area, $x_2^{(i)}$ = number of bedrooms.

Living area (feet²)   #bedrooms   Price ($1000s)
2104                  3           400
1600                  3           330
2400                  3           369
1416                  2           232
3000                  4           540
...                   ...         ...
Linear Regression
The x's are two-dimensional vectors: $x_1^{(i)}$ = living area, $x_2^{(i)}$ = number of bedrooms. For example, from the table above:
$\mathbf{x}^{(1)} = \begin{bmatrix} 2104 \\ 3 \end{bmatrix}, \; y^{(1)} = 400 \qquad \mathbf{x}^{(2)} = \begin{bmatrix} 1600 \\ 3 \end{bmatrix}, \; y^{(2)} = 330$
Linear Regression
Measure the distance from the line to the data, square each distance, and then add them up.
Linear Regression
Rotate the line a little bit more.
Linear Regression
$y = 0.1 + 0.78x$, where 0.1 is the y-axis intercept and 0.78 is the slope.
Linear Regression
The model: $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$
$\boldsymbol{\theta}$ is the model's parameter vector, containing the bias term $\theta_0$ and the feature weights $\theta_1$ to $\theta_n$.
n is the number of input variables (not counting $x_0$).
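As an illustration (not part of the original slides; the helper name is made up), the hypothesis can be computed as a single dot product once a constant feature $x_0 = 1$ is prepended to the input:

```python
import numpy as np

# Illustrative helper: h_theta(x) = theta_0*x_0 + theta_1*x_1 + ... + theta_n*x_n,
# using the convention x_0 = 1 so the bias term is part of the dot product.
def predict(theta, x):
    x_with_bias = np.concatenate(([1.0], x))  # prepend x_0 = 1
    return float(np.dot(theta, x_with_bias))

theta = np.array([0.1, 0.78])            # [bias, feature weight], as in y = 0.1 + 0.78x
print(predict(theta, np.array([2.0])))   # 0.1 + 0.78*2 = 1.66
```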
Linear Regression
$\theta_j := \theta_j - \eta \, \frac{\partial}{\partial \theta_j} \mathrm{MSE}(\boldsymbol{\theta})$
Perform this update simultaneously for all values of j = 0, ..., n.
$\eta$ is the learning rate.
Linear Regression: Gradient Descent
If the learning rate is too small, many iterations are needed until convergence.
Linear Regression: Gradient Descent
If the learning rate is too high, we might jump across the valley. This might make the algorithm diverge, failing to find a good solution.
Linear Regression: Gradient Descent
$\frac{\partial}{\partial \theta_j}\mathrm{MSE}(\boldsymbol{\theta}) = \frac{\partial}{\partial \theta_j}\,\frac{1}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)^2$

$= \frac{1}{m}\Bigl[\,2\bigl(h_\theta(x^{(1)}) - y^{(1)}\bigr)\frac{\partial}{\partial \theta_j}\bigl(h_\theta(x^{(1)}) - y^{(1)}\bigr) + \dots + 2\bigl(h_\theta(x^{(m)}) - y^{(m)}\bigr)\frac{\partial}{\partial \theta_j}\bigl(h_\theta(x^{(m)}) - y^{(m)}\bigr)\Bigr]$

$= \frac{1}{m}\Bigl[\,2\bigl(h_\theta(x^{(1)}) - y^{(1)}\bigr)\frac{\partial}{\partial \theta_j}\Bigl(\sum_{i=0}^{n}\theta_i x_i^{(1)} - y^{(1)}\Bigr) + \dots + 2\bigl(h_\theta(x^{(m)}) - y^{(m)}\bigr)\frac{\partial}{\partial \theta_j}\Bigl(\sum_{i=0}^{n}\theta_i x_i^{(m)} - y^{(m)}\Bigr)\Bigr]$

$= \frac{2}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\,x_j^{(i)}$
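A small sketch (illustrative names and random toy data, not from the slides) that implements the gradient formula just derived and checks it against a finite-difference estimate:

```python
import numpy as np

def mse(theta, X, y):
    return np.mean((X @ theta - y) ** 2)

def mse_gradient(theta, X, y):
    # (2/m) * sum_i (h(x^(i)) - y^(i)) * x_j^(i) for every j, in vectorized form
    m = len(y)
    return (2.0 / m) * X.T @ (X @ theta - y)

# Tiny random problem; X already contains the x_0 = 1 column.
rng = np.random.default_rng(0)
X = np.c_[np.ones(5), rng.normal(size=(5, 2))]
y = rng.normal(size=5)
theta = np.zeros(3)

analytic = mse_gradient(theta, X, y)
eps = 1e-6
numeric = np.array([(mse(theta + eps * np.eye(3)[j], X, y) - mse(theta, X, y)) / eps
                    for j in range(3)])
print(np.allclose(analytic, numeric, atol=1e-4))  # True: the derivation checks out
```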
Gradient Descent
Update Rule:
$\theta_j := \theta_j - \eta\,\frac{2}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\,x_j^{(i)}$

[Figure: Height vs. Weight scatter plot with the fitted line $h_\theta(x) = \theta_0 + \theta_1 x$.]

Training set:
$x^{(1)} = 0.5, \; y^{(1)} = 1.4$
$x^{(2)} = 2.3, \; y^{(2)} = 1.9$
$x^{(3)} = 2.9, \; y^{(3)} = 3.2$
Gradient Descent: Example
Height = intercept + slope * Weight, i.e. $h_\theta(x) = \theta_0 + \theta_1 x$

[Figure: the three training points plotted as Height vs. Weight.]

Training set:
$x^{(1)} = 0.5, \; y^{(1)} = 1.4$
$x^{(2)} = 2.3, \; y^{(2)} = 1.9$
$x^{(3)} = 2.9, \; y^{(3)} = 3.2$
Gradient Descent
Let's keep the slope fixed for now: slope = 0.64.
Give the intercept some random initial value: intercept = 0.
Training set: $x^{(1)} = 0.5,\, y^{(1)} = 1.4$; $x^{(2)} = 2.3,\, y^{(2)} = 1.9$; $x^{(3)} = 2.9,\, y^{(3)} = 3.2$
Gradient Descent
Training set: $x^{(1)} = 0.5,\, y^{(1)} = 1.4$; $x^{(2)} = 2.3,\, y^{(2)} = 1.9$; $x^{(3)} = 2.9,\, y^{(3)} = 3.2$
Predicted Height = 0 + 0.64 * Weight

[Figure: the line with intercept 0 and slope 0.64 drawn over the three training points (Height vs. Weight).]
Gradient Descent
Training set: $x^{(1)} = 0.5,\, y^{(1)} = 1.4$; $x^{(2)} = 2.3,\, y^{(2)} = 1.9$; $x^{(3)} = 2.9,\, y^{(3)} = 3.2$
Predicted Height = 0 + 0.64 * Weight
For each training point, take the residual between the predicted and the actual height, square it, and add the squares up to get the MSE for this intercept.

[Figure: left - the line with intercept 0 and slope 0.64 and its residuals to the three training points (Height vs. Weight); right - MSE plotted against the intercept, with the point for intercept = 0 marked.]
Gradient Descent
Intercept = 0.25

[Figure: left - the line with intercept 0.25 and slope 0.64 over the training points (Height vs. Weight); right - the MSE vs. intercept curve with the value for intercept = 0.25 added.]
Gradient Descent

[Figure: the same pair of plots for further intercept values; each candidate intercept contributes one point on the MSE vs. intercept curve.]
$\mathrm{MSE} = \frac{1}{3}\Bigl[\bigl(\text{intercept} + 0.64 \cdot 0.5 - 1.4\bigr)^2 + \bigl(\text{intercept} + 0.64 \cdot 2.3 - 1.9\bigr)^2 + \bigl(\text{intercept} + 0.64 \cdot 2.9 - 3.2\bigr)^2\Bigr]$

This gives us an equation for the MSE vs. intercept curve.

[Figure: the MSE vs. intercept curve traced by this equation.]
Gradient Descent
$\frac{\partial}{\partial\,\text{intercept}}\mathrm{MSE} = \frac{\partial}{\partial\,\text{intercept}}\,\frac{1}{3}\sum_{i=1}^{3}\bigl(\text{predicted}^{(i)} - \text{actual}^{(i)}\bigr)^2$

$= \frac{1}{3}\Bigl[\frac{\partial}{\partial\,\text{intercept}}\bigl(\text{intercept} + 0.64 \cdot 0.5 - 1.4\bigr)^2 + \frac{\partial}{\partial\,\text{intercept}}\bigl(\text{intercept} + 0.64 \cdot 2.3 - 1.9\bigr)^2 + \frac{\partial}{\partial\,\text{intercept}}\bigl(\text{intercept} + 0.64 \cdot 2.9 - 3.2\bigr)^2\Bigr]$

$= \frac{2}{3}\Bigl[\bigl(\text{intercept} + 0.64 \cdot 0.5 - 1.4\bigr) + \bigl(\text{intercept} + 0.64 \cdot 2.3 - 1.9\bigr) + \bigl(\text{intercept} + 0.64 \cdot 2.9 - 3.2\bigr)\Bigr]$
Gradient Descent
$\frac{\partial}{\partial\,\text{intercept}}\mathrm{MSE} = \frac{\partial}{\partial\,\text{intercept}}\,\frac{1}{3}\sum_{i=1}^{3}\bigl(\text{predicted}^{(i)} - \text{actual}^{(i)}\bigr)^2$
The closer we get to the optimal intercept, the closer the slope (the derivative) gets to zero.
Gradient Descent
$\eta = 0.1$

[Figure: the MSE vs. intercept curve.]
Gradient Descent
$\eta = 0.1$
$\text{intercept} := 0.19 - 0.1 \cdot (-1.5) = 0.19 + 0.15 = 0.34$

[Figure: the step from intercept = 0.19 to 0.34 on the MSE vs. intercept curve.]
Gradient Descent
$\eta = 0.1$
$\text{intercept} := 0.8 - 0.1 \cdot (-0.3) = 0.8 + 0.03 = 0.83$

[Figure: a smaller step near the minimum of the MSE vs. intercept curve, since the derivative is now small.]
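The slides do not include code for this example; the following sketch (variable names are mine) reproduces the intercept-only updates on the three training points, with the slope held at 0.64:

```python
# Sketch of the worked example above.
x = [0.5, 2.3, 2.9]   # Weight (inputs)
y = [1.4, 1.9, 3.2]   # Height (targets)
slope = 0.64          # kept fixed, as in the slides
eta = 0.1             # learning rate

def mse_derivative(intercept):
    """d(MSE)/d(intercept) = (2/3) * sum of (predicted - actual)."""
    return (2 / len(x)) * sum((intercept + slope * xi) - yi for xi, yi in zip(x, y))

intercept = 0.0
for step in range(100):
    grad = mse_derivative(intercept)
    intercept -= eta * grad          # e.g. 0.19 - 0.1*(-1.5) ≈ 0.34, as on the slide
    if abs(grad) < 1e-3:             # the slope of the MSE curve is (almost) zero
        break

print(round(intercept, 3))           # close to the intercept that minimizes the MSE
```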
Linear Regression: Batch Gradient Descent
Repeat until convergence {
    $\theta_j := \theta_j - \eta\,\frac{2}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\,x_j^{(i)}$   (for every j)
}
Uses the whole training set to compute the gradients at every step
Has to scan through the entire training set before taking a single step
Costly if m is large
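A possible NumPy sketch of batch gradient descent, using the whole training set for every update (the toy data and names are illustrative, not from the slides):

```python
import numpy as np

def batch_gradient_descent(X, y, eta=0.1, n_iters=1000):
    """X includes the x_0 = 1 column; returns the fitted parameter vector."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        gradient = (2.0 / m) * X.T @ (X @ theta - y)  # uses all m examples per step
        theta -= eta * gradient
    return theta

# Toy data: y ≈ 4 + 3x plus noise
rng = np.random.default_rng(42)
x = 2 * rng.random(100)
y = 4 + 3 * x + rng.normal(scale=0.5, size=100)
X = np.c_[np.ones(100), x]
print(batch_gradient_descent(X, y))   # close to [4, 3]
```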
Linear Regression: Stochastic Gradient Descent
Loop {
    for i = 1 to m {
        $\theta_j := \theta_j - \eta \cdot 2\,\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\,x_j^{(i)}$   (for every j)
    }
}
Computes the gradient based on a single training example
Makes progress with each example it looks at
Much faster
May not converge to the minimum, but the value it finds is generally a good approximation of the true minimum
Preferred when the training set is large
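A corresponding sketch of stochastic gradient descent (illustrative names and toy data; shuffling the examples each epoch is a common addition that the pseudocode above does not show):

```python
import numpy as np

def stochastic_gradient_descent(X, y, eta=0.01, n_epochs=50):
    """One parameter update per training example."""
    m, n = X.shape
    theta = np.zeros(n)
    rng = np.random.default_rng(0)
    for _ in range(n_epochs):
        for i in rng.permutation(m):            # visit examples in random order
            error = X[i] @ theta - y[i]
            theta -= eta * 2 * error * X[i]     # gradient from a single example
    return theta

# Toy data: y ≈ 4 + 3x plus noise, with the x_0 = 1 column included in X
rng = np.random.default_rng(42)
x = 2 * rng.random(100)
y = 4 + 3 * x + rng.normal(scale=0.5, size=100)
X = np.c_[np.ones(100), x]
print(stochastic_gradient_descent(X, y))  # roughly [4, 3], noisier than batch GD
```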
Linear Regression in Python
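The code shown on this slide is not reproduced here; as a minimal sketch, assuming NumPy and scikit-learn are available, a linear model can be fitted as follows:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: y ≈ 4 + 3x plus noise
rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X[:, 0] + rng.normal(scale=0.5, size=100)

model = LinearRegression()             # least-squares fit, no learning rate needed
model.fit(X, y)
print(model.intercept_, model.coef_)   # close to 4 and [3]
print(model.predict([[1.5]]))          # predicted y for x = 1.5
```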
Linear Regression: Normal Equation
Suppose we have:
m training examples $(\mathbf{x}^{(i)}, y^{(i)})$
n features, $\mathbf{x}^{(i)} = [x_1^{(i)}, x_2^{(i)}, \dots, x_n^{(i)}] \in \mathbb{R}^n$
Define the design matrix $\mathbf{X}$ to be the m-by-n matrix (m-by-(n+1) if we include the intercept term) that contains the training examples' input values in its rows:
$\mathbf{X} = \begin{bmatrix} (\mathbf{x}^{(1)})^T \\ \vdots \\ (\mathbf{x}^{(m)})^T \end{bmatrix} = \begin{bmatrix} x_1^{(1)} & \cdots & x_n^{(1)} \\ & \ddots & \\ x_1^{(m)} & \cdots & x_n^{(m)} \end{bmatrix}, \qquad \mathbf{y} = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(m)} \end{bmatrix} \in \mathbb{R}^m$
Linear Regression: Normal Equation
We express our problem in matrix form: $\mathbf{X}\boldsymbol{\theta} = \mathbf{y}$.
Note that there is also an intercept term $x_0^{(i)} = 1$, so
$\mathbf{x}^{(i)} = [x_0^{(i)}, x_1^{(i)}, \dots, x_n^{(i)}] \in \mathbb{R}^{n+1}$ and $\mathbf{X} = \begin{bmatrix} (\mathbf{x}^{(1)})^T \\ \vdots \\ (\mathbf{x}^{(m)})^T \end{bmatrix} = \begin{bmatrix} x_0^{(1)} & \cdots & x_n^{(1)} \\ & \ddots & \\ x_0^{(m)} & \cdots & x_n^{(m)} \end{bmatrix} \in \mathbb{R}^{m \times (n+1)}$
Linear Regression: Normal Equation
Let's expand $J(\boldsymbol{\theta})$:
$J(\boldsymbol{\theta}) = \lVert \mathbf{y} - \mathbf{X}\boldsymbol{\theta} \rVert^2 = (\mathbf{y} - \mathbf{X}\boldsymbol{\theta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\theta}) = \mathbf{y}^T\mathbf{y} - (\mathbf{X}\boldsymbol{\theta})^T\mathbf{y} - \mathbf{y}^T\mathbf{X}\boldsymbol{\theta} + (\mathbf{X}\boldsymbol{\theta})^T\mathbf{X}\boldsymbol{\theta} = \mathbf{y}^T\mathbf{y} - 2\boldsymbol{\theta}^T\mathbf{X}^T\mathbf{y} + \boldsymbol{\theta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\theta}$
Minimize $J(\boldsymbol{\theta})$ w.r.t. $\boldsymbol{\theta}$:
$\frac{\partial J(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} = -2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\boldsymbol{\theta} = 0$
$\mathbf{X}^T\mathbf{X}\,\boldsymbol{\theta} = \mathbf{X}^T\mathbf{y}$
$\boldsymbol{\theta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$
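A short NumPy sketch of the normal equation on toy data (np.linalg.solve avoids forming the inverse explicitly; the data and names are illustrative):

```python
import numpy as np

# Toy data: y ≈ 4 + 3x plus noise; X gets a column of ones for the intercept term.
rng = np.random.default_rng(42)
x = 2 * rng.random(100)
y = 4 + 3 * x + rng.normal(scale=0.5, size=100)
X = np.c_[np.ones(100), x]

# theta = (X^T X)^{-1} X^T y, solved as a linear system
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)                              # close to [4, 3]

# Equivalent closed-form fit via least squares:
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta_lstsq)
```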
What if your data is actually more complex than a simple straight line?
We can use a linear model to fit nonlinear data
A model is said to be linear if it is linear in parameters
$y = \theta_0 + \theta_1 x_1 + \theta_2 x_2$
$y = \theta_0 + \theta_1 x + \theta_2 x^2$
Add powers of each feature as new features, then train a linear model on
this extended set of features
Polynomial Regression
$y = 3.52 + 0.98x$
Polynomial Regression: Example
$x^{(i)} \rightarrow [\,x^{(i)},\, (x^{(i)})^2\,]$
$y = 2 + x + 0.5x^2$
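A possible sketch of this example (assuming scikit-learn; the noise level and sample size are made up): add the squared feature, then train an ordinary linear model on the extended features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Data generated from y = 2 + x + 0.5*x^2 plus noise, as in the example
rng = np.random.default_rng(0)
x = 6 * rng.random((100, 1)) - 3
y = 2 + x[:, 0] + 0.5 * x[:, 0] ** 2 + rng.normal(scale=0.5, size=100)

# Extend each x^(i) to [x^(i), (x^(i))^2], then fit a linear model
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(x)
model = LinearRegression().fit(X_poly, y)
print(model.intercept_, model.coef_)   # roughly 2 and [1, 0.5]
```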
$\mathrm{RMSE}(\mathbf{X}, h) = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\bigl(h(\mathbf{x}^{(i)}) - y^{(i)}\bigr)^2}$
The R² score (coefficient of determination) measures how much of the total variance of the dependent variable is explained (i.e. reduced) by the least-squares regression.
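A minimal sketch of both metrics (scikit-learn's sklearn.metrics also provides mean_squared_error and r2_score):

```python
import numpy as np

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

def r2_score(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total variance of the target
    return 1 - ss_res / ss_tot

y_true = np.array([1.4, 1.9, 3.2])
y_pred = np.array([1.3, 2.0, 3.0])
print(rmse(y_true, y_pred), r2_score(y_true, y_pred))
```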
Update Rule:
$\theta_j := \theta_j - \eta \sum_{i=1}^{m} w^{(i)}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\,x_j^{(i)}$
The magnitude of the update is proportional to the error and to the weight of the sample.
Locally Weighted Linear Regression
$w^{(i)} = \exp\!\left(-\frac{(x^{(i)} - x)^2}{2\tau^2}\right)$, where $\tau$ is the bandwidth parameter.
$\boldsymbol{\theta}$ is chosen giving a much higher weight to the (errors on) training examples close to the query point x.
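A sketch of locally weighted linear regression for one query point, using the closed-form weighted least-squares solution $\boldsymbol{\theta} = (\mathbf{X}^T W \mathbf{X})^{-1}\mathbf{X}^T W \mathbf{y}$ rather than the gradient update above (data and names are illustrative):

```python
import numpy as np

def lwlr_predict(x_query, X, y, tau=0.5):
    """Fit a weighted least-squares line around x_query and return its prediction.
    Weights: w^(i) = exp(-(x^(i) - x_query)^2 / (2*tau^2)), tau = bandwidth."""
    w = np.exp(-((X[:, 1] - x_query) ** 2) / (2 * tau ** 2))
    W = np.diag(w)
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return np.array([1.0, x_query]) @ theta

# Data from y = 2 + x + 0.5*x^2 plus noise, query point x = 2.5
rng = np.random.default_rng(0)
x = 6 * rng.random(100) - 3
y = 2 + x + 0.5 * x ** 2 + rng.normal(scale=0.3, size=100)
X = np.c_[np.ones(100), x]
print(lwlr_predict(2.5, X, y))   # roughly 2 + 2.5 + 0.5*2.5**2 = 7.625
```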
Linear Regression

[Figure: an ordinary linear regression fit to data generated from $y = 2 + x + 0.5x^2$.]

Locally Weighted Linear Regression

[Figure: locally weighted fits to the same $y = 2 + x + 0.5x^2$ data around the query point $x = 2.5$.]
Locally Weighted Linear Regression
2) Mean normalization: $x_i := \frac{x_i - \mu_i}{\max(x_i) - \min(x_i)}$ or $x_i := \frac{x_i - \mu_i}{s_i}$, where $\mu_i$ is the mean of feature $i$ and $s_i$ its standard deviation.
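A small sketch of mean normalization applied column-wise (using the standard-deviation variant; the range variant is noted in a comment):

```python
import numpy as np

# Normalize each feature (column) of X to zero mean and comparable scale.
def mean_normalize(X):
    mu = X.mean(axis=0)
    s = X.std(axis=0)              # alternatively: X.max(axis=0) - X.min(axis=0)
    return (X - mu) / s

X = np.array([[2104.0, 3.0],
              [1600.0, 3.0],
              [2400.0, 3.0],
              [1416.0, 2.0],
              [3000.0, 4.0]])
print(mean_normalize(X))           # living area and #bedrooms now on similar scales
```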