Lecture 2. Regression

Learning Systems (DT8008)

Regression

Dr. Mohamed-Rafik Bouguelia


[email protected]

Halmstad University
(1) Linear Regression
Example of Linear Regression with One Feature
Linear Regression with one feature
• Suppose that we are given the following dataset:

  House size in feet² (x)    House price in 1000$ (y)
  x^(1) = 2104               y^(1) = 460
  x^(2) = 1416               y^(2) = 230
  x^(3) = 1534               y^(3) = 315
  x^(4) = 852                y^(4) = 178
  …                          …

• Question: Given a new house with a size of 1250 feet², how do we predict its price?

1. Assume that the relation between size and price is linear.
2. Find a line that fits the training dataset well.

[Figure: scatter plot of price (1000$) vs. size (feet²) with a fitted line; θ_0 is the intercept of the line and θ_1 = a/b is its slope (rise a over run b). The prediction for 1250 feet² is read off the line.]

• How do we find the best parameters θ_0 and θ_1 (i.e. the best fitting line)?
Linear Regression with one feature
• The model (hypothesis) is h_θ(x) = θ_0 + θ_1·x.
• Choose θ_0, θ_1 so that h_θ(x^(i)) is close to y^(i) for our training examples (x^(i), y^(i)), ∀i = 1 … n.
• We want to find the parameters θ = (θ_0, θ_1) that minimize the error function E(θ).
• Error function E(θ_0, θ_1): the mean squared error cost function.
[Figure: the training points and a candidate line h_θ on the price (1000$) vs. size (feet²) plot.]
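One standard way to write this cost (assuming the common 1/(2n) convention; some texts use 1/n instead) is:

$$E(\theta_0, \theta_1) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2, \qquad h_\theta(x^{(i)}) = \theta_0 + \theta_1 x^{(i)}$$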
Linear Regression with one feature
• To simplify the explanation, let's first assume that θ_0 = 0, so our model h_θ is of the form: h_θ(x) = θ_1·x.
• In this case, we only need to find the optimal value of θ_1:
      minimize over θ_1:  E(θ_1)
[Figure: the fitted line now passes through the origin of the price (1000$) vs. size (feet²) plot.]
Linear Regression with one feature
h_θ(x) = θ_1·x
• Hypothesis function (model): for a fixed θ_1, this is a function of the input x.
• Error (cost) function E(θ_1): a function of the parameter θ_1 (the mean squared error).
[Figure: left, the line h_θ(x) over the price-vs-size data, with the residuals h_θ(x^(i)) − y^(i) marked; right, the bowl-shaped curve E(θ_1), whose minimum error is reached at the optimal θ_1.]
Linear Regression with one feature
h_θ(x) = θ_0 + θ_1·x
• Hypothesis function (model): for fixed θ_0, θ_1, this is a function of the input x.
• Error (cost) function E(θ_0, θ_1): a function of the parameters θ_0, θ_1.
[Figure: left, the fitted line h_θ(x) over the price-vs-size data; right, the error surface E(θ_0, θ_1) shown both as a 3D bowl over the (θ_0, θ_1) plane and as a contour plot of E(θ_0, θ_1).]
Linear Regression with one feature
• To find the parameters that minimize the error function, we can use an optimization algorithm called Gradient Descent.
• Gradient Descent is a general optimization algorithm; it is not specific to this particular error function.
• We will see later that, for this specific error function, we can solve the minimization problem without the need to use the gradient descent algorithm.

Optimization problem:   minimize E(θ_0, θ_1) over θ_0, θ_1
Error/cost function:    E(θ_0, θ_1), the mean squared error defined above
Hypothesis function:    h_θ(x) = θ_0 + θ_1·x
Gradient Descent
Algorithm for optimization

Gradient Descent – Basic idea
1. Start with some values for the parameters θ_0, θ_1.
2. Keep updating θ_0, θ_1 to reduce E(θ_0, θ_1) until we hopefully end up at a minimum.
   – At each update, how do we decide if we should increase or decrease each of the parameters?

Note: for the purpose of explanation, the non-convex error function E(θ_0, θ_1) shown in this example is not the mean squared error (MSE). The MSE function is convex.
[Figure: a non-convex error surface E(θ_0, θ_1) plotted over the (θ_0, θ_1) plane, with several valleys.]
Gradient Descent – Basic idea
– At each update, how do we decide if we should increase or decrease each of the parameters?  → next slide
• Depending on the initial parameter values, we might end up at a different (local) minimum.
[Figure: the same non-convex error surface E(θ_0, θ_1), with one of its local minima marked.]
Gradient Descent – Basic idea
• Example with a convex error function (the MSE): there is only one (global) minimum.
[Figure: the bowl-shaped MSE surface E(θ_0, θ_1) over the (θ_0, θ_1) plane.]
Gradient Descent – Algorithm
Repeat until convergence:
    θ_j := θ_j − α · ∂E(θ_0, θ_1)/∂θ_j        (for j = 0 and j = 1)
where α > 0 is the learning rate and ∂E/∂θ_j is the derivative of E with respect to θ_j.

• At each iteration, we need to update θ_0 and θ_1 simultaneously (at the same time).
  Correct (simultaneous update):
      temp0 := θ_0 − α · ∂E/∂θ_0
      temp1 := θ_1 − α · ∂E/∂θ_1
      θ_0 := temp0
      θ_1 := temp1
  Incorrect (θ_1 would be updated using the already-modified θ_0):
      θ_0 := θ_0 − α · ∂E/∂θ_0
      θ_1 := θ_1 − α · ∂E/∂θ_1
Gradient Descent – Algorithm
Assume for now that we have only one parameter, θ_1.
• Pick some initial value for θ_1.
• We want to update θ_1:   θ_1 := θ_1 − α · dE(θ_1)/dθ_1
• Derivative: look at the slope of the (red) line which is tangent to the function at that point.
[Figure: the curve E(θ_1) with the tangent line drawn at the current value of θ_1.]
Gradient Descent – Algorithm
• If the initial value of θ_1 lies to the right of the minimum, the derivative dE/dθ_1 ≥ 0 (positive slope). In this case, since the derivative is positive and α ≥ 0, θ_1 will decrease, getting closer to the optimal value.
• If our initial value of θ_1 was too small (i.e. on the left side), the derivative dE/dθ_1 ≤ 0 (negative slope). In this case, since the derivative is negative and α ≥ 0, θ_1 will increase, getting closer to the optimal value.
[Figure: the curve E(θ_1) with a positive-slope tangent to the right of the minimum and a negative-slope tangent to the left.]
Gradient Descent – Algorithm
 Reasonably small value of 𝛼𝛼  Very large value of 𝛼𝛼

𝐸𝐸 𝜃𝜃1 𝐸𝐸 𝜃𝜃1

𝜃𝜃1 𝜃𝜃1
Notice that as we approach a local minimum, If 𝛼𝛼 is too large, it may fail to
gradient descent will automatically take smaller converge, or may even diverge.
steps (why?). So, no need to decrease 𝛼𝛼 over time.
26
Gradient Descent – Algorithm
 Reasonably small value of 𝛼𝛼  Very large value of 𝛼𝛼

𝐸𝐸 𝜃𝜃1 𝐸𝐸 𝜃𝜃1

The derivative (slop of


the tangent line) get’s
closer to zero as we get
closer to the minimum.

𝜃𝜃1 𝜃𝜃1
Notice that as we approach a local minimum, If 𝛼𝛼 is too large, it may fail to
gradient descent will automatically take smaller converge, or may even diverge.
steps (why?). So, no need to decrease 𝛼𝛼 over time.
27
Gradient Descent – Local minimum
• Assume that the current value of θ_1 is at a local optimum (a local minimum here).
• At a local minimum the derivative dE/dθ_1 equals 0, so the update θ_1 := θ_1 − α · 0 leaves θ_1 unchanged: gradient descent has converged, and we get stuck at this local minimum.
[Figure: a non-convex curve E(θ_1), with the current value of θ_1 sitting at a local minimum where the tangent line is horizontal.]
Using Gradient Descent for Linear Regression
(with one feature)
Gradient descent for linear regression
• We need the derivative of E(θ_0, θ_1) with respect to θ_0, and the derivative of E(θ_0, θ_1) with respect to θ_1.
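Assuming the MSE with the 1/(2n) factor written earlier, the two derivatives work out to:

$$\frac{\partial E(\theta_0,\theta_1)}{\partial \theta_0} = \frac{1}{n}\sum_{i=1}^{n}\left(h_\theta(x^{(i)}) - y^{(i)}\right),
\qquad
\frac{\partial E(\theta_0,\theta_1)}{\partial \theta_1} = \frac{1}{n}\sum_{i=1}^{n}\left(h_\theta(x^{(i)}) - y^{(i)}\right) x^{(i)}$$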
Gradient descent for linear regression
• Pick some initial values for θ_0 and θ_1.
• Repeat until convergence: simultaneously update θ_0 and θ_1 using the gradient descent rule with the two derivatives above.
  o This is Batch Gradient Descent for linear regression with 1 feature.
  o "Batch" = each step of the GD uses all training examples.
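A minimal sketch of this batch gradient descent loop in Python/NumPy, using the toy dataset from the earlier slides (the learning rate, the number of iterations, and the feature scaling step are illustrative assumptions, not values from the lecture):

```python
import numpy as np

# Toy dataset from the slides: house size (feet^2) -> price (1000$)
x = np.array([2104.0, 1416.0, 1534.0, 852.0])
y = np.array([460.0, 230.0, 315.0, 178.0])

# Scale the feature (see the feature scaling section at the end of the lecture)
# so that a simple fixed learning rate converges.
x_s = (x - x.mean()) / x.std()

theta0, theta1 = 0.0, 0.0   # initial parameter values
alpha = 0.1                 # learning rate (assumed value)

for _ in range(1000):
    h = theta0 + theta1 * x_s                 # predictions on all examples
    grad0 = (h - y).mean()                    # dE/dtheta0 (1/(2n) MSE convention)
    grad1 = ((h - y) * x_s).mean()            # dE/dtheta1
    # simultaneous update of both parameters
    theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1

# Predict the price of a 1250 feet^2 house (query scaled the same way)
print(theta0 + theta1 * (1250.0 - x.mean()) / x.std())
```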
Gradient descent for linear regression
• Batch gradient descent uses all the training examples to update the model parameters.
• Online (also called stochastic) gradient descent updates the model parameters based on one training example at a time (a small sketch follows below). This can be useful when:
  – e.g. (1) you don't have all the training dataset beforehand; your training examples arrive one by one over time, as a stream.
  – e.g. (2) your training dataset is very big (it is computationally expensive to use batch GD, or the dataset doesn't fit in memory).
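A sketch of the corresponding online/stochastic update, with the same toy data and the same assumed hyper-parameters style as the batch sketch above; each step uses a single example:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([2104.0, 1416.0, 1534.0, 852.0])
y = np.array([460.0, 230.0, 315.0, 178.0])
x_s = (x - x.mean()) / x.std()              # scaled feature, as before

theta0, theta1, alpha = 0.0, 0.0, 0.01      # smaller learning rate (assumed)

for _ in range(200):                        # passes over the (shuffled) data
    for i in rng.permutation(len(x_s)):     # one training example at a time
        err = theta0 + theta1 * x_s[i] - y[i]
        theta0, theta1 = theta0 - alpha * err, theta1 - alpha * err * x_s[i]

print(theta0, theta1)
```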
Multivariate Linear Regression

(with multiple features)

Multivariate linear regression

  Size (x_1)   Nb rooms (x_2)   Location (x_3)   Nb floors (x_4)   Age (x_5)   …   Price (y)
  2104         6                2                2                 45          …   460
  1416         5                10               1                 40          …   230
  1534         5                3                2                 30          …   315
  852          4                2                1                 35          …   178
  …            …                …                …                 …           …   …

• The model now has d features: h_θ(x) = θ_0 + θ_1·x_1 + θ_2·x_2 + … + θ_d·x_d
• For convenience of notation, define x_0 = 1.
  Think of it as an additional feature which equals 1 for all data-points: x_0^(i) = 1, ∀i = 1 … n
• With x = (x_0, x_1, …, x_d) and θ = (θ_0, θ_1, …, θ_d), the model can be written compactly as:
      h_θ(x) = θᵀx
Multivariate linear regression
Gradient descent: repeat until convergence, simultaneously updating all θ_j for j = 0, …, d:
    θ_j := θ_j − α · ∂E(θ)/∂θ_j = θ_j − α · (1/n) · Σ_{i=1..n} (h_θ(x^(i)) − y^(i)) · x_j^(i)
Note: x_0^(i) = 1, so for j = 0 this reduces to the θ_0 update seen earlier.
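A vectorized sketch of this update for d features, with the x_0 = 1 column prepended; X, y, alpha and the iteration count are placeholders to fill in, not values from the lecture:

```python
import numpy as np

def batch_gd(X, y, alpha=0.1, iters=1000):
    """Batch gradient descent for linear regression with multiple features.
    X: (n, d) matrix of features, y: (n,) vector of targets."""
    n = len(y)
    Xb = np.hstack([np.ones((n, 1)), X])      # prepend the x_0 = 1 column
    theta = np.zeros(Xb.shape[1])             # theta_0, ..., theta_d
    for _ in range(iters):
        grad = Xb.T @ (Xb @ theta - y) / n    # all d+1 derivatives at once
        theta -= alpha * grad                 # simultaneous update
    return theta
```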
More about gradient descent – Convergence & Selecting α
• For a sufficiently small α, E(θ) (on the training set) should decrease at every iteration.
• One can consider that gradient descent has converged (and thus stop) if E(θ) decreases by less than ε (a small number, e.g. 0.0001) in one iteration.
• If E(θ) increases, or oscillates up and down, as the iterations proceed, you should use a smaller α.
• Note: if α is too small, GD can be slow to converge.
[Figure: plots of E(θ) against the number of GD iterations: a steadily decreasing curve for a well-chosen α, and increasing or oscillating curves for the cases where a smaller α is needed.]
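One compact way to implement this stopping criterion, reusing the vectorized setup from the earlier sketch; alpha, eps and max_iters are assumed defaults:

```python
import numpy as np

def gd_until_converged(X, y, alpha=0.1, eps=1e-4, max_iters=10_000):
    n = len(y)
    Xb = np.hstack([np.ones((n, 1)), X])
    theta = np.zeros(Xb.shape[1])
    prev_cost = np.inf
    for _ in range(max_iters):
        residual = Xb @ theta - y
        cost = (residual ** 2).mean() / 2          # MSE with the 1/(2n) factor
        if prev_cost - cost < eps:                 # decreased by less than eps
            break
        prev_cost = cost
        theta -= alpha * Xb.T @ residual / n
    return theta
```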
Linear Regression with the Normal Equation
(without using gradient descent)
Linear regression with the normal equation
• A method to solve for θ analytically.
• The derivative at the optimal point equals 0. So, set the derivatives to 0:
      ∂E(θ)/∂θ_j = 0   for j = 0, 1, …, d
  and solve for θ_0, θ_1, …, θ_d.
• The solution will be:
      θ = (XᵀX)⁻¹ Xᵀ y
  where X is the n × (d+1) matrix of training inputs (one example per row, including the x_0 = 1 column), y is the vector of training outputs, and (XᵀX)⁻¹Xᵀ is the pseudo-inverse of X.
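A sketch of the normal equation in NumPy, using np.linalg.pinv for the pseudo-inverse and the size/price toy data from the earlier slides:

```python
import numpy as np

x = np.array([2104.0, 1416.0, 1534.0, 852.0])
y = np.array([460.0, 230.0, 315.0, 178.0])

X = np.column_stack([np.ones_like(x), x])   # n x 2 matrix: [1, size]
theta = np.linalg.pinv(X.T @ X) @ X.T @ y   # theta = (X^T X)^-1 X^T y

print(theta)                    # [theta_0, theta_1]
print(theta @ [1.0, 1250.0])    # predicted price for a 1250 feet^2 house
```

Note that, unlike the gradient descent sketches above, no learning rate and no iterations are needed here.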
Gradient Descent vs. Normal Equation (for linear regression)

Gradient Descent:
• Need to choose α.
• Needs several iterations.
• Works well even when d is large.

Normal Equation:
• No need to choose α.
• No need to iterate.
• Needs to compute (XᵀX)⁻¹, which is slow when d is very large.
• Does not apply to other optimization problems.
(2) Non-linear Regression
Non-linear regression
• The output h_θ(x) is a non-linear function of x
  – e.g. a polynomial function of degree > 1
• Examples with one variable:
      h_θ(x) = θ_0 + θ_1·x + θ_2·x²
      h_θ(x) = θ_0 + θ_1·x + θ_2·x² + θ_3·x³
[Figure: two scatter plots of y vs. x, one fitted with a quadratic curve and one with a cubic curve.]
Non-linear regression
• Examples of polynomial regression models with two features x_1, x_2.
• A second order polynomial would be:
      h_θ(x) = θ_0 + θ_1·x_1 + θ_2·x_2 + θ_3·x_1·x_2 + θ_4·x_1² + θ_5·x_2²
Generalized Linear Model for Non-linear Regression
Generalized Linear Model
• h_θ(x) = θ_0 + θ_1·x_1 + θ_2·x_2 + θ_3·x_1·x_2 + θ_4·x_1² + θ_5·x_2²
  Define the new features  z_1 = x_1,  z_2 = x_2,  z_3 = x_1·x_2,  z_4 = x_1²,  z_5 = x_2².
• Then h_θ(x) = θᵀz = θ_0 + θ_1·z_1 + θ_2·z_2 + θ_3·z_3 + θ_4·z_4 + θ_5·z_5
  – This can be seen as just creating and adding new features based on the two original features x_1, x_2.
  – We can still find the parameters θ of the non-linear model (in x) using a linear model based on z.
  – So, we can use the methods we studied previously in the linear regression lecture (see the sketch below).
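A minimal sketch of this idea: build the new features z from x_1, x_2 and then fit an ordinary linear model, here with the normal equation from the previous section; X1, X2 and y are placeholder arrays, not lecture data:

```python
import numpy as np

def fit_glm(X1, X2, y):
    """Fit h(x) = th0 + th1*x1 + th2*x2 + th3*x1*x2 + th4*x1^2 + th5*x2^2
    by linear regression on the derived features z."""
    Z = np.column_stack([np.ones_like(X1), X1, X2, X1 * X2, X1**2, X2**2])
    return np.linalg.pinv(Z.T @ Z) @ Z.T @ y     # normal equation on z

# Hypothetical usage with made-up arrays:
# theta = fit_glm(np.array([1., 2., 3.]), np.array([2., 0., 1.]),
#                 np.array([5., 4., 9.]))
```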
Generalized Linear Model
• The new features can be of any kind (they are also called basis functions)
  – e.g. log(x_1), x_2, x_1·x_2, x_1², x_2³, …
• This requires a good guess of relevant features for your problem.
• Example (predicting the price of a plot of land):
  – h_θ(x) = θ_0 + θ_1·x_1 + θ_2·x_2   (linear in the two original features)
  – New feature: land_area = depth × frontage = x_1·x_2
  – h_θ(x) = θ_0 + θ_1·x_1·x_2   (linear in the new feature)
K-Nearest-Neighbors (KNN) for Non-linear Regression
K-Nearest-Neighbors Regression
• KNN is a non-parametric model.
  – This does not mean parameter-free (e.g. K is a hyper-parameter).
• Parametric: we select a hypothesis space and adjust a fixed set of parameters using the training data.
• Non-parametric: the model is not characterized by a fixed set of parameters.
• In KNN, we just save/memorize the training dataset (there is no training as such).
• To make a prediction on a new data-point x, we look at the k most similar (i.e. closest/nearest) data-points from the training dataset. We can take:
  – the average output of these k examples,
  – or a weighted average output of these k examples.
• We need a distance measure (the inverse of a similarity measure).
K-Nearest-Neighbors Regression
• Predicting using KNN: output the average of the outputs of the k nearest neighbors,
      ŷ = (1/k) · Σ_{i=1..k} y^(l_i)
• Predicting using Weighted KNN: output a weighted average, where w_{l_i} is the weight given to the i-th nearest neighbor (e.g. larger for closer neighbors),
      ŷ = Σ_{i=1..k} w_{l_i}·y^(l_i) / Σ_{i=1..k} w_{l_i}
• Simple KNN is the same as weighted KNN with w_{l_i} = 1.
K-Nearest-Neighbors Regression – Algorithm
• Given:
  – Training dataset { (x^(1), y^(1)), (x^(2), y^(2)), …, (x^(n), y^(n)) }
  – Testing data-point x whose output we want to predict.
• Algorithm:
  – Compute ||x − x^(i)|| for i = 1, …, n
  – Select the k training examples closest to x (with the smallest distance to x):
        { (x^(l_1), y^(l_1)), (x^(l_2), y^(l_2)), …, (x^(l_k), y^(l_k)) }
  – Output the mean (or weighted mean) of y^(l_1), y^(l_2), …, y^(l_k)
[Figure: predictions of KNN with uniform weights (k=5) and of weighted KNN (k=5) on a 1D dataset, y vs. x_1.]
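A small sketch of this algorithm in NumPy; the inverse-distance weights, the default k, and the commented usage data are illustrative assumptions:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=5, weighted=False):
    """Predict the output for x as the (weighted) mean of its k nearest neighbors."""
    dists = np.linalg.norm(X_train - x, axis=1)      # ||x - x^(i)|| for all i
    nearest = np.argsort(dists)[:k]                  # indices l_1, ..., l_k
    if not weighted:
        return y_train[nearest].mean()
    w = 1.0 / (dists[nearest] + 1e-12)               # e.g. inverse-distance weights
    return np.sum(w * y_train[nearest]) / np.sum(w)

# Hypothetical usage on a tiny 1-feature dataset:
# X_train = np.array([[850.], [1400.], [1550.], [2100.]])
# y_train = np.array([178., 230., 315., 460.])
# print(knn_predict(X_train, y_train, np.array([1250.]), k=2))
```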
Kernel Regression
(Non-linear Regression)
Kernel Regression
• Very similar to the weighted KNN method.
• A common kernel regression model is the Nadaraya-Watson estimator, with a Gaussian kernel function whose parameter σ is the width of the Gaussian.
• Data-points closer to x have a larger weight, i.e. influence the prediction more.
• Larger values of σ imply that more data-points will influence the prediction.
• Too small a σ may lead to overfitting. Too large a σ may lead to underfitting.
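One standard way to write the Nadaraya-Watson estimator with a Gaussian kernel (the exact notation may differ from the original slide):

$$\hat{y}(x) = \frac{\sum_{i=1}^{n} k\!\left(x, x^{(i)}\right) y^{(i)}}{\sum_{i=1}^{n} k\!\left(x, x^{(i)}\right)},
\qquad
k(x, x') = \exp\!\left(-\frac{\lVert x - x' \rVert^2}{2\sigma^2}\right)$$

where σ is the width of the Gaussian.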
Kernel Regression
• Example with KNN Regression and Kernel Regression:
[Figure: predictions of simple KNN (left) and kernel regression (right) on the same 1D dataset, y vs. x_1.]
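A sketch of the corresponding prediction in NumPy; sigma is a hyper-parameter you would tune, and the default below is only an example:

```python
import numpy as np

def kernel_regression_predict(X_train, y_train, x, sigma=0.5):
    """Nadaraya-Watson prediction with a Gaussian kernel of width sigma."""
    sq_dists = np.sum((X_train - x) ** 2, axis=1)        # ||x - x^(i)||^2
    w = np.exp(-sq_dists / (2.0 * sigma ** 2))           # Gaussian weights
    return np.sum(w * y_train) / np.sum(w)               # weighted average of outputs
```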
Features Normalization / Scaling
Features Normalization
• Feature normalization is a preprocessing step used to normalize the range of the features.
• It is important when the features have very different scales.
  – For example, if the values of feature 1 are ∈ [0, 1] but the values of feature 2 are ∈ [120, 190], then normalizing the features is important.
• Motivation:
  – Suppose that some ML algorithm computes the Euclidean distance between two points. If one of the features has a broad range of values, the distance will be governed by this particular feature. Therefore, the range of all features should be normalized so that each feature contributes approximately proportionately to the final distance.
  – For instance, the distance between the points (0.11, 182) and (0.52, 179) is
        ‖(0.11, 182) − (0.52, 179)‖ = √((0.11 − 0.52)² + (182 − 179)²) ≈ 3.027,
    which is almost entirely determined by the second feature.
min-max Features scaling

    v'_j = (v_j − min(v_j)) / (max(v_j) − min(v_j))

• v_j is a column (corresponding to feature j) of the data matrix X.
• v'_j holds the normalized values of feature j. These values will be ∈ [0, 1].
min-max Features scaling
[Figure: scatter plots of the data before and after min-max feature scaling.]
Features Standardization

    v'_j = (v_j − mean(v_j)) / stdev(v_j)

• v_j is a column (corresponding to feature j) of the data matrix X.
• v'_j holds the standardized values of feature j. These values will have mean 0 and standard deviation 1 (unlike min-max scaling, they are not restricted to [0, 1]).
• To standardize, we just subtract the mean and divide by the standard deviation.
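A sketch of both normalization methods applied column-wise to a data matrix X, in plain NumPy (in practice, libraries such as scikit-learn provide equivalent scalers):

```python
import numpy as np

def min_max_scale(X):
    """Scale each feature (column) of X to the range [0, 1]."""
    return (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

def standardize(X):
    """Give each feature (column) of X zero mean and unit standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Example with two features on very different scales:
X = np.array([[0.11, 182.0],
              [0.52, 179.0],
              [0.40, 165.0]])
print(min_max_scale(X))
print(standardize(X))
```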
Features Standardization
[Figure: scatter plots of the data before and after standardization.]
• NOTE: do not rescale or standardize the output (target variable).
    min-max scaling:    v'_j = (v_j − min(v_j)) / (max(v_j) − min(v_j))
    standardization:    v'_j = (v_j − mean(v_j)) / stdev(v_j)
