Lecture 2. Regression

Learning Systems (DT8008)

Regression

Dr. Mohamed-Rafik Bouguelia


[email protected]

Halmstad University
(1) Linear Regression
Example of Linear Regression with One Feature
Linear Regression with one feature
• Suppose that we are given the following dataset:

  House size in feet² (x)    House price in 1000$ (y)
  x^(1) = 2104               y^(1) = 460
  x^(2) = 1416               y^(2) = 230
  x^(3) = 1534               y^(3) = 315
  x^(4) = 852                y^(4) = 178
  …                          …

• Question: Given a new house with a size of 1250 feet², how do we predict its price?

1. Assume that the relation between size and price is linear.
2. Find a line that fits the training dataset well.

[Figure: scatter plot of price (1000$) vs. size (feet²) with a fitted line; θ_0 is the intercept of the line and θ_1 = a/b is its slope (rise a over run b). The prediction for 1250 feet² is read off the line.]

• How do we find the best parameters θ_0 and θ_1 (i.e. the best fitting line)?
Linear Regression with one feature
• The model (hypothesis) is h_θ(x) = θ_0 + θ_1·x.
• Choose θ_0, θ_1 so that h_θ(x^(i)) is close to y^(i) for our training examples (x^(i), y^(i)), ∀i = 1 … n.
• We want to find the parameters θ = (θ_0, θ_1) that minimize the error function E(θ).
• Error function E(θ_0, θ_1): the mean squared error cost function.
[Figure: the training points and a candidate line h_θ on the price (1000$) vs. size (feet²) plot.]
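One standard way to write this cost (assuming the common 1/(2n) convention; some texts use 1/n instead) is:

$$E(\theta_0, \theta_1) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2, \qquad h_\theta(x^{(i)}) = \theta_0 + \theta_1 x^{(i)}$$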
Linear Regression with one feature
• To simplify the explanation, let's first assume that θ_0 = 0, so our model h_θ is of the form: h_θ(x) = θ_1·x.
• In this case, we only need to find the optimal value of θ_1:
      minimize over θ_1:  E(θ_1)
[Figure: the fitted line now passes through the origin of the price (1000$) vs. size (feet²) plot.]
Linear Regression with one feature
h_θ(x) = θ_1·x
• Hypothesis function (model): for a fixed θ_1, this is a function of the input x.
• Error (cost) function E(θ_1): a function of the parameter θ_1 (the mean squared error).
[Figure: left, the line h_θ(x) over the price-vs-size data, with the residuals h_θ(x^(i)) − y^(i) marked; right, the bowl-shaped curve E(θ_1), whose minimum error is reached at the optimal θ_1.]
Linear Regression with one feature
h_θ(x) = θ_0 + θ_1·x
• Hypothesis function (model): for fixed θ_0, θ_1, this is a function of the input x.
• Error (cost) function E(θ_0, θ_1): a function of the parameters θ_0, θ_1.
[Figure: left, the fitted line h_θ(x) over the price-vs-size data; right, the error surface E(θ_0, θ_1) shown both as a 3D bowl over the (θ_0, θ_1) plane and as a contour plot of E(θ_0, θ_1).]
Linear Regression with one feature
• To find the parameters that minimize the error function, we can use an optimization algorithm called Gradient Descent.
• Gradient Descent is a general optimization algorithm; it is not specific to this particular error function.
• We will see later that, for this specific error function, we can solve the minimization problem without the need to use the gradient descent algorithm.

Optimization problem:   minimize E(θ_0, θ_1) over θ_0, θ_1
Error/cost function:    E(θ_0, θ_1), the mean squared error defined above
Hypothesis function:    h_θ(x) = θ_0 + θ_1·x
Gradient Descent
Algorithm for optimization

Gradient Descent – Basic idea
1. Start with some values for the parameters θ_0, θ_1.
2. Keep updating θ_0, θ_1 to reduce E(θ_0, θ_1) until we hopefully end up at a minimum.
   – At each update, how do we decide if we should increase or decrease each of the parameters?

Note: for the purpose of explanation, the non-convex error function E(θ_0, θ_1) shown in this example is not the mean squared error (MSE). The MSE function is convex.
[Figure: a non-convex error surface E(θ_0, θ_1) plotted over the (θ_0, θ_1) plane, with several valleys.]
Gradient Descent – Basic idea
– At each update, how do we decide if we should increase or decrease each of the parameters?  → next slide
• Depending on the initial parameter values, we might end up at a different (local) minimum.
[Figure: the same non-convex error surface E(θ_0, θ_1), with one of its local minima marked.]
Gradient Descent – Basic idea
• Example with a convex error function (the MSE): there is only one (global) minimum.
[Figure: the bowl-shaped MSE surface E(θ_0, θ_1) over the (θ_0, θ_1) plane.]
Gradient Descent – Algorithm
Repeat until convergence:
    θ_j := θ_j − α · ∂E(θ_0, θ_1)/∂θ_j        (for j = 0 and j = 1)
where α > 0 is the learning rate and ∂E/∂θ_j is the derivative of E with respect to θ_j.

• At each iteration, we need to update θ_0 and θ_1 simultaneously (at the same time).
  Correct (simultaneous update):
      temp0 := θ_0 − α · ∂E/∂θ_0
      temp1 := θ_1 − α · ∂E/∂θ_1
      θ_0 := temp0
      θ_1 := temp1
  Incorrect (θ_1 would be updated using the already-modified θ_0):
      θ_0 := θ_0 − α · ∂E/∂θ_0
      θ_1 := θ_1 − α · ∂E/∂θ_1
Gradient Descent – Algorithm
Assume for now that we have only one parameter, θ_1.
• Pick some initial value for θ_1.
• We want to update θ_1:   θ_1 := θ_1 − α · dE(θ_1)/dθ_1
• Derivative: look at the slope of the (red) line which is tangent to the function at that point.
[Figure: the curve E(θ_1) with the tangent line drawn at the current value of θ_1.]
Gradient Descent – Algorithm
• If the initial value of θ_1 lies to the right of the minimum, the derivative dE/dθ_1 ≥ 0 (positive slope). In this case, since the derivative is positive and α ≥ 0, θ_1 will decrease, getting closer to the optimal value.
• If our initial value of θ_1 was too small (i.e. on the left side), the derivative dE/dθ_1 ≤ 0 (negative slope). In this case, since the derivative is negative and α ≥ 0, θ_1 will increase, getting closer to the optimal value.
[Figure: the curve E(θ_1) with a positive-slope tangent to the right of the minimum and a negative-slope tangent to the left.]
Gradient Descent – Algorithm
 Reasonably small value of 𝛼𝛼  Very large value of 𝛼𝛼

𝐸𝐸 𝜃𝜃1 𝐸𝐸 𝜃𝜃1

𝜃𝜃1 𝜃𝜃1
Notice that as we approach a local minimum, If 𝛼𝛼 is too large, it may fail to
gradient descent will automatically take smaller converge, or may even diverge.
steps (why?). So, no need to decrease 𝛼𝛼 over time.
26
Gradient Descent – Algorithm
 Reasonably small value of 𝛼𝛼  Very large value of 𝛼𝛼

𝐸𝐸 𝜃𝜃1 𝐸𝐸 𝜃𝜃1

The derivative (slop of


the tangent line) get’s
closer to zero as we get
closer to the minimum.

𝜃𝜃1 𝜃𝜃1
Notice that as we approach a local minimum, If 𝛼𝛼 is too large, it may fail to
gradient descent will automatically take smaller converge, or may even diverge.
steps (why?). So, no need to decrease 𝛼𝛼 over time.
27
Gradient Descent – Local minimum
• Assume that the current value of θ_1 is at a local optimum (a local minimum here).
• At a local minimum the derivative dE/dθ_1 equals 0, so the update θ_1 := θ_1 − α · 0 leaves θ_1 unchanged: gradient descent has converged, and we get stuck at this local minimum.
[Figure: a non-convex curve E(θ_1), with the current value of θ_1 sitting at a local minimum where the tangent line is horizontal.]
Using Gradient Descent for Linear Regression
(with one feature)
Gradient descent for linear regression
• We need the derivative of E(θ_0, θ_1) with respect to θ_0, and the derivative of E(θ_0, θ_1) with respect to θ_1.
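Assuming the MSE with the 1/(2n) factor written earlier, the two derivatives work out to:

$$\frac{\partial E(\theta_0,\theta_1)}{\partial \theta_0} = \frac{1}{n}\sum_{i=1}^{n}\left(h_\theta(x^{(i)}) - y^{(i)}\right),
\qquad
\frac{\partial E(\theta_0,\theta_1)}{\partial \theta_1} = \frac{1}{n}\sum_{i=1}^{n}\left(h_\theta(x^{(i)}) - y^{(i)}\right) x^{(i)}$$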
Gradient descent for linear regression
• Pick some initial values for θ_0 and θ_1.
• Repeat until convergence: simultaneously update θ_0 and θ_1 using the gradient descent rule with the two derivatives above.
  o This is Batch Gradient Descent for linear regression with 1 feature.
  o "Batch" = each step of the GD uses all training examples.
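A minimal sketch of this batch gradient descent loop in Python/NumPy, using the toy dataset from the earlier slides (the learning rate, the number of iterations, and the feature scaling step are illustrative assumptions, not values from the lecture):

```python
import numpy as np

# Toy dataset from the slides: house size (feet^2) -> price (1000$)
x = np.array([2104.0, 1416.0, 1534.0, 852.0])
y = np.array([460.0, 230.0, 315.0, 178.0])

# Scale the feature (see the feature scaling section at the end of the lecture)
# so that a simple fixed learning rate converges.
x_s = (x - x.mean()) / x.std()

theta0, theta1 = 0.0, 0.0   # initial parameter values
alpha = 0.1                 # learning rate (assumed value)

for _ in range(1000):
    h = theta0 + theta1 * x_s                 # predictions on all examples
    grad0 = (h - y).mean()                    # dE/dtheta0 (1/(2n) MSE convention)
    grad1 = ((h - y) * x_s).mean()            # dE/dtheta1
    # simultaneous update of both parameters
    theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1

# Predict the price of a 1250 feet^2 house (query scaled the same way)
print(theta0 + theta1 * (1250.0 - x.mean()) / x.std())
```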
Gradient descent for linear regression
• Batch gradient descent uses all the training examples to update the model parameters.
• Online (also called stochastic) gradient descent updates the model parameters based on one training example at a time (a small sketch follows below). This can be useful when:
  – e.g. (1) you don't have all the training dataset beforehand; your training examples arrive one by one over time, as a stream.
  – e.g. (2) your training dataset is very big (it is computationally expensive to use batch GD, or the dataset doesn't fit in memory).
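A sketch of the corresponding online/stochastic update, with the same toy data and the same assumed hyper-parameters style as the batch sketch above; each step uses a single example:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([2104.0, 1416.0, 1534.0, 852.0])
y = np.array([460.0, 230.0, 315.0, 178.0])
x_s = (x - x.mean()) / x.std()              # scaled feature, as before

theta0, theta1, alpha = 0.0, 0.0, 0.01      # smaller learning rate (assumed)

for _ in range(200):                        # passes over the (shuffled) data
    for i in rng.permutation(len(x_s)):     # one training example at a time
        err = theta0 + theta1 * x_s[i] - y[i]
        theta0, theta1 = theta0 - alpha * err, theta1 - alpha * err * x_s[i]

print(theta0, theta1)
```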
Multivariate Linear Regression

(with multiple features)

Multivariate linear regression

  Size (x_1)   Nb rooms (x_2)   Location (x_3)   Nb floors (x_4)   Age (x_5)   …   Price (y)
  2104         6                2                2                 45          …   460
  1416         5                10               1                 40          …   230
  1534         5                3                2                 30          …   315
  852          4                2                1                 35          …   178
  …            …                …                …                 …           …   …

• The model now has d features: h_θ(x) = θ_0 + θ_1·x_1 + θ_2·x_2 + … + θ_d·x_d
• For convenience of notation, define x_0 = 1.
  Think of it as an additional feature which equals 1 for all data-points: x_0^(i) = 1, ∀i = 1 … n
• With x = (x_0, x_1, …, x_d) and θ = (θ_0, θ_1, …, θ_d), the model can be written compactly as:
      h_θ(x) = θᵀx
Multivariate linear regression
Gradient descent: repeat until convergence, simultaneously updating all θ_j for j = 0, …, d:
    θ_j := θ_j − α · ∂E(θ)/∂θ_j = θ_j − α · (1/n) · Σ_{i=1..n} (h_θ(x^(i)) − y^(i)) · x_j^(i)
Note: x_0^(i) = 1, so for j = 0 this reduces to the θ_0 update seen earlier.
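A vectorized sketch of this update for d features, with the x_0 = 1 column prepended; X, y, alpha and the iteration count are placeholders to fill in, not values from the lecture:

```python
import numpy as np

def batch_gd(X, y, alpha=0.1, iters=1000):
    """Batch gradient descent for linear regression with multiple features.
    X: (n, d) matrix of features, y: (n,) vector of targets."""
    n = len(y)
    Xb = np.hstack([np.ones((n, 1)), X])      # prepend the x_0 = 1 column
    theta = np.zeros(Xb.shape[1])             # theta_0, ..., theta_d
    for _ in range(iters):
        grad = Xb.T @ (Xb @ theta - y) / n    # all d+1 derivatives at once
        theta -= alpha * grad                 # simultaneous update
    return theta
```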
More about gradient descent – Convergence & Selecting α
• For a sufficiently small α, E(θ) (on the training set) should decrease at every iteration.
• One can consider that gradient descent has converged (and thus stop) if E(θ) decreases by less than ε (a small number, e.g. 0.0001) in one iteration.
• If E(θ) increases, or oscillates up and down, as the iterations proceed, you should use a smaller α.
• Note: if α is too small, GD can be slow to converge.
[Figure: plots of E(θ) against the number of GD iterations: a steadily decreasing curve for a well-chosen α, and increasing or oscillating curves for the cases where a smaller α is needed.]
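One compact way to implement this stopping criterion, reusing the vectorized setup from the earlier sketch; alpha, eps and max_iters are assumed defaults:

```python
import numpy as np

def gd_until_converged(X, y, alpha=0.1, eps=1e-4, max_iters=10_000):
    n = len(y)
    Xb = np.hstack([np.ones((n, 1)), X])
    theta = np.zeros(Xb.shape[1])
    prev_cost = np.inf
    for _ in range(max_iters):
        residual = Xb @ theta - y
        cost = (residual ** 2).mean() / 2          # MSE with the 1/(2n) factor
        if prev_cost - cost < eps:                 # decreased by less than eps
            break
        prev_cost = cost
        theta -= alpha * Xb.T @ residual / n
    return theta
```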
Linear Regression with the Normal Equation
(without using gradient descent)
Linear regression with the normal equation
• A method to solve for θ analytically.
• The derivative at the optimal point equals 0. So, set the derivatives to 0:
      ∂E(θ)/∂θ_j = 0   for j = 0, 1, …, d
  and solve for θ_0, θ_1, …, θ_d.
• The solution will be:
      θ = (XᵀX)⁻¹ Xᵀ y
  where X is the n × (d+1) matrix of training inputs (one example per row, including the x_0 = 1 column), y is the vector of training outputs, and (XᵀX)⁻¹Xᵀ is the pseudo-inverse of X.
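A sketch of the normal equation in NumPy, using np.linalg.pinv for the pseudo-inverse and the size/price toy data from the earlier slides:

```python
import numpy as np

x = np.array([2104.0, 1416.0, 1534.0, 852.0])
y = np.array([460.0, 230.0, 315.0, 178.0])

X = np.column_stack([np.ones_like(x), x])   # n x 2 matrix: [1, size]
theta = np.linalg.pinv(X.T @ X) @ X.T @ y   # theta = (X^T X)^-1 X^T y

print(theta)                    # [theta_0, theta_1]
print(theta @ [1.0, 1250.0])    # predicted price for a 1250 feet^2 house
```

Note that, unlike the gradient descent sketches above, no learning rate and no iterations are needed here.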
Gradient Descent vs. Normal Equation (for linear regression)

Gradient Descent:
• Need to choose α.
• Needs several iterations.
• Works well even when d is large.

Normal Equation:
• No need to choose α.
• No need to iterate.
• Needs to compute (XᵀX)⁻¹, which is slow when d is very large.
• Does not apply to other optimization problems.
(2) Non-linear Regression
Non-linear regression
• The output h_θ(x) is a non-linear function of x
  – e.g. a polynomial function of degree > 1
• Examples with one variable:
      h_θ(x) = θ_0 + θ_1·x + θ_2·x²
      h_θ(x) = θ_0 + θ_1·x + θ_2·x² + θ_3·x³
[Figure: two scatter plots of y vs. x, one fitted with a quadratic curve and one with a cubic curve.]
Non-linear regression
• Examples of polynomial regression models with two features x_1, x_2.
• A second order polynomial would be:
      h_θ(x) = θ_0 + θ_1·x_1 + θ_2·x_2 + θ_3·x_1·x_2 + θ_4·x_1² + θ_5·x_2²
Generalized Linear Model for Non-linear Regression
Generalized Linear Model
• h_θ(x) = θ_0 + θ_1·x_1 + θ_2·x_2 + θ_3·x_1·x_2 + θ_4·x_1² + θ_5·x_2²
  Define the new features  z_1 = x_1,  z_2 = x_2,  z_3 = x_1·x_2,  z_4 = x_1²,  z_5 = x_2².
• Then h_θ(x) = θᵀz = θ_0 + θ_1·z_1 + θ_2·z_2 + θ_3·z_3 + θ_4·z_4 + θ_5·z_5
  – This can be seen as just creating and adding new features based on the two original features x_1, x_2.
  – We can still find the parameters θ of the non-linear model (in x) using a linear model based on z.
  – So, we can use the methods we studied previously in the linear regression lecture (see the sketch below).
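A minimal sketch of this idea: build the new features z from x_1, x_2 and then fit an ordinary linear model, here with the normal equation from the previous section; X1, X2 and y are placeholder arrays, not lecture data:

```python
import numpy as np

def fit_glm(X1, X2, y):
    """Fit h(x) = th0 + th1*x1 + th2*x2 + th3*x1*x2 + th4*x1^2 + th5*x2^2
    by linear regression on the derived features z."""
    Z = np.column_stack([np.ones_like(X1), X1, X2, X1 * X2, X1**2, X2**2])
    return np.linalg.pinv(Z.T @ Z) @ Z.T @ y     # normal equation on z

# Hypothetical usage with made-up arrays:
# theta = fit_glm(np.array([1., 2., 3.]), np.array([2., 0., 1.]),
#                 np.array([5., 4., 9.]))
```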
Generalized Linear Model
• The new features can be of any kind (they are also called basis functions)
  – e.g. log(x_1), x_2, x_1·x_2, x_1², x_2³, …
• This requires a good guess of relevant features for your problem.
• Example (predicting the price of a plot of land):
  – h_θ(x) = θ_0 + θ_1·x_1 + θ_2·x_2   (linear in the two original features)
  – New feature: land_area = depth × frontage = x_1·x_2
  – h_θ(x) = θ_0 + θ_1·x_1·x_2   (linear in the new feature)
K-Nearest-Neighbors (KNN) for Non-linear Regression
K-Nearest-Neighbors Regression
• KNN is a non-parametric model.
  – This does not mean parameter-free (e.g. K is a hyper-parameter).
• Parametric: we select a hypothesis space and adjust a fixed set of parameters using the training data.
• Non-parametric: the model is not characterized by a fixed set of parameters.
• In KNN, we just save/memorize the training dataset (there is no training as such).
• To make a prediction on a new data-point x, we look at the k most similar (i.e. closest/nearest) data-points from the training dataset. We can take:
  – the average output of these k examples,
  – or a weighted average output of these k examples.
• We need a distance measure (the inverse of a similarity measure).
K-Nearest-Neighbors Regression
• Predicting using KNN: output the average of the outputs of the k nearest neighbors,
      ŷ = (1/k) · Σ_{i=1..k} y^(l_i)
• Predicting using Weighted KNN: output a weighted average, where w_{l_i} is the weight given to the i-th nearest neighbor (e.g. larger for closer neighbors),
      ŷ = Σ_{i=1..k} w_{l_i}·y^(l_i) / Σ_{i=1..k} w_{l_i}
• Simple KNN is the same as weighted KNN with w_{l_i} = 1.
K-Nearest-Neighbors Regression – Algorithm
• Given:
  – Training dataset { (x^(1), y^(1)), (x^(2), y^(2)), …, (x^(n), y^(n)) }
  – Testing data-point x whose output we want to predict.
• Algorithm:
  – Compute ||x − x^(i)|| for i = 1, …, n
  – Select the k training examples closest to x (with the smallest distance to x):
        { (x^(l_1), y^(l_1)), (x^(l_2), y^(l_2)), …, (x^(l_k), y^(l_k)) }
  – Output the mean (or weighted mean) of y^(l_1), y^(l_2), …, y^(l_k)
[Figure: predictions of KNN with uniform weights (k=5) and of weighted KNN (k=5) on a 1D dataset, y vs. x_1.]
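A small sketch of this algorithm in NumPy; the inverse-distance weights, the default k, and the commented usage data are illustrative assumptions:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=5, weighted=False):
    """Predict the output for x as the (weighted) mean of its k nearest neighbors."""
    dists = np.linalg.norm(X_train - x, axis=1)      # ||x - x^(i)|| for all i
    nearest = np.argsort(dists)[:k]                  # indices l_1, ..., l_k
    if not weighted:
        return y_train[nearest].mean()
    w = 1.0 / (dists[nearest] + 1e-12)               # e.g. inverse-distance weights
    return np.sum(w * y_train[nearest]) / np.sum(w)

# Hypothetical usage on a tiny 1-feature dataset:
# X_train = np.array([[850.], [1400.], [1550.], [2100.]])
# y_train = np.array([178., 230., 315., 460.])
# print(knn_predict(X_train, y_train, np.array([1250.]), k=2))
```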
Kernel Regression
(Non-linear Regression)
Kernel Regression
• Very similar to the weighted KNN method.
• A common kernel regression model is the Nadaraya-Watson estimator, with a Gaussian kernel function whose parameter σ is the width of the Gaussian.
• Data-points closer to x have a larger weight, i.e. influence the prediction more.
• Larger values of σ imply that more data-points will influence the prediction.
• Too small a σ may lead to overfitting. Too large a σ may lead to underfitting.
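One standard way to write the Nadaraya-Watson estimator with a Gaussian kernel (the exact notation may differ from the original slide):

$$\hat{y}(x) = \frac{\sum_{i=1}^{n} k\!\left(x, x^{(i)}\right) y^{(i)}}{\sum_{i=1}^{n} k\!\left(x, x^{(i)}\right)},
\qquad
k(x, x') = \exp\!\left(-\frac{\lVert x - x' \rVert^2}{2\sigma^2}\right)$$

where σ is the width of the Gaussian.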
Kernel Regression
• Example with KNN Regression and Kernel Regression:
[Figure: predictions of simple KNN (left) and kernel regression (right) on the same 1D dataset, y vs. x_1.]
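A sketch of the corresponding prediction in NumPy; sigma is a hyper-parameter you would tune, and the default below is only an example:

```python
import numpy as np

def kernel_regression_predict(X_train, y_train, x, sigma=0.5):
    """Nadaraya-Watson prediction with a Gaussian kernel of width sigma."""
    sq_dists = np.sum((X_train - x) ** 2, axis=1)        # ||x - x^(i)||^2
    w = np.exp(-sq_dists / (2.0 * sigma ** 2))           # Gaussian weights
    return np.sum(w * y_train) / np.sum(w)               # weighted average of outputs
```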
Features Normalization / Scaling
Features Normalization
• Feature normalization is a preprocessing step used to normalize the range of the features.
• It is important when the features have very different scales.
  – For example, if the values of feature 1 are ∈ [0, 1] but the values of feature 2 are ∈ [120, 190], then normalizing the features is important.
• Motivation:
  – Suppose that some ML algorithm computes the Euclidean distance between two points. If one of the features has a broad range of values, the distance will be governed by this particular feature. Therefore, the range of all features should be normalized so that each feature contributes approximately proportionately to the final distance.
  – For instance, the distance between the points (0.11, 182) and (0.52, 179) is
        ‖(0.11, 182) − (0.52, 179)‖ = √((0.11 − 0.52)² + (182 − 179)²) ≈ 3.027,
    which is almost entirely determined by the second feature.
min-max Features scaling

    v'_j = (v_j − min(v_j)) / (max(v_j) − min(v_j))

• v_j is a column (corresponding to feature j) of the data matrix X.
• v'_j holds the normalized values of feature j. These values will be ∈ [0, 1].
min-max Features scaling
[Figure: scatter plots of the data before and after min-max feature scaling.]
Features Standardization

    v'_j = (v_j − mean(v_j)) / stdev(v_j)

• v_j is a column (corresponding to feature j) of the data matrix X.
• v'_j holds the standardized values of feature j. These values will have mean 0 and standard deviation 1 (unlike min-max scaling, they are not restricted to [0, 1]).
• To standardize, we just subtract the mean and divide by the standard deviation.
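A sketch of both normalization methods applied column-wise to a data matrix X, in plain NumPy (in practice, libraries such as scikit-learn provide equivalent scalers):

```python
import numpy as np

def min_max_scale(X):
    """Scale each feature (column) of X to the range [0, 1]."""
    return (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

def standardize(X):
    """Give each feature (column) of X zero mean and unit standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Example with two features on very different scales:
X = np.array([[0.11, 182.0],
              [0.52, 179.0],
              [0.40, 165.0]])
print(min_max_scale(X))
print(standardize(X))
```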
Features Standardization
[Figure: scatter plots of the data before and after standardization.]
• NOTE: do not rescale or standardize the output (target variable).
    min-max scaling:    v'_j = (v_j − min(v_j)) / (max(v_j) − min(v_j))
    standardization:    v'_j = (v_j − mean(v_j)) / stdev(v_j)
