Lecture 2. Regression
Halmstad University
(1) Linear Regression
Example of Linear Regression with One Feature
Linear Regression with one feature
• Suppose that we are given the following dataset:

  House size in feet² (x)    House price in 1000$ (y)
  x^(1) = 2104               y^(1) = 460
  x^(2) = 1416               y^(2) = 230
  x^(3) = 1534               y^(3) = 315
  x^(4) = 852                y^(4) = 178
  …                          …

Question: Given a new house with a size of 1250 feet², how do we predict its price?

[Scatter plot: Price (1000$) versus Size (feet²), with the new house at 1250 feet² marked]
Linear Regression with one feature
1. Assume that the relation between size and price is linear.

[Same scatter plot with a straight line fitted through the data points]
Linear Regression with one feature
2. Model the price as a linear function of the size: h_θ(x) = θ₀ + θ₁x, where θ₀ is the intercept
   of the line and θ₁ = a/b is its slope (the rise a over the run b).
• How do we find the best parameters θ₀ and θ₁ (i.e. the best fitting line)?

[Plot: Price (1000$) versus Size (feet²) with a candidate line of intercept θ₀ and slope θ₁ = a/b;
 the unknown price at 1250 feet² is marked with "?"]
Linear Regression with one feature
• Choose θ₀, θ₁ so that h_θ(x^(i)) is close to y^(i) for our training examples (x^(i), y^(i)), i = 1, …, n
• We want to find the parameters θ = (θ₀, θ₁) that minimize the error function E(θ).

[Plot: Price (1000$) versus Size (feet²) with a candidate fitted line]
Linear Regression with one feature
• To simplify the explanation, let's first assume that θ₀ = 0, so our model h_θ is of the form: h_θ(x) = θ₁x
• Optimization problem: minimize E(θ₁) over θ₁

[Plot: Price versus Size (feet²) with lines through the origin for different values of θ₁]
Linear Regression with one feature
h_θ(x) = θ₁x
• Hypothesis function (model): for a fixed θ₁, h_θ(x) is a function of the input x.
• Error (cost) function: E(θ₁) is a function of the parameter θ₁; we look for the value of θ₁ with minimum error.

[Left plots: the line h_θ(x) over the data for several candidate values of θ₁;
 right plots: the corresponding points on the curve E(θ₁), with the minimum-error θ₁ marked]

Linear Regression with one feature
h_θ(x) = θ₀ + θ₁x
• Hypothesis function (model): for fixed θ₀, θ₁, h_θ(x) is a function of the input x.
• Error (cost) function: E(θ₀, θ₁) is a function of the parameters θ₀, θ₁.

[Left plot: Price versus Size (feet²) with the line h_θ(x); right plot: the error surface E(θ₀, θ₁) over the (θ₀, θ₁) plane]
Linear Regression with one feature
h_θ(x) = θ₀ + θ₁x
• Hypothesis function (model): for fixed θ₀, θ₁, h_θ(x) is a function of the input x.
• Error (cost) function: E(θ₀, θ₁) is a function of the parameters θ₀, θ₁.

[Left plot: Price versus Size (feet²) with the line h_θ(x); right plot: contour plot of E(θ₀, θ₁) in the (θ₀, θ₁) plane]
Linear Regression with one feature
• To find the parameters that minimize the error function, we can use an optimization algorithm called Gradient Descent.
• Gradient Descent is a general optimization algorithm; it is not specific to this particular error function.
• We will see later that for this specific error function, we can solve the minimization problem without
  the need to use the gradient descent algorithm.

Hypothesis function:   h_θ(x) = θ₀ + θ₁x
Error/cost function:   E(θ₀, θ₁) = (1/n) Σᵢ (h_θ(x^(i)) − y^(i))²   (mean squared error over the n training examples)
Optimization problem:  minimize E(θ₀, θ₁) over θ₀, θ₁
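To make this concrete, here is a minimal Python/NumPy sketch (not from the slides) of the hypothesis and of the MSE error function above, evaluated on the four example houses; the 1/n factor follows the MSE convention written above.

```python
import numpy as np

# Training data from the slides: house sizes (feet^2) and prices (1000$).
x = np.array([2104.0, 1416.0, 1534.0, 852.0])
y = np.array([460.0, 230.0, 315.0, 178.0])

def h(theta0, theta1, x):
    """Hypothesis function: a straight line."""
    return theta0 + theta1 * x

def E(theta0, theta1, x, y):
    """Mean squared error of the line (theta0, theta1) on the training set."""
    return np.mean((h(theta0, theta1, x) - y) ** 2)

# Error of an arbitrary candidate line, e.g. theta0 = 0, theta1 = 0.2:
print(E(0.0, 0.2, x, y))
```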
Gradient Descent
Algorithm for optimization
Gradient Descent – Basic idea
1. Start with some values for the parameters θ₀, θ₁
2. Keep updating θ₀, θ₁ to reduce E(θ₀, θ₁) until we hopefully end up at a minimum.
   – At each update, how do we decide if we should increase or decrease each of the parameters?

Note: for the purpose of explanation, the non-convex error function E(θ₀, θ₁) presented in this
example is not the mean squared error (MSE). The MSE function is convex.

[3D surface plot of a non-convex error function E(θ₀, θ₁) over the (θ₀, θ₁) plane]
Gradient Descent – Basic idea
– At each update, how do we decide if we should increase or decrease each of the parameters? → next slide
– Depending on the initial parameter values, we might end up at a different (local) minimum.

[Surface plot of E(θ₀, θ₁) with descent paths from two different initializations ending at different local minima]
Gradient Descent – Basic idea
• Example with a convex error function (the MSE): only one (global) minimum.

[Bowl-shaped surface plot of the MSE error E(θ₀, θ₁) over the (θ₀, θ₁) plane]
Gradient Descent – Algorithm
Repeat until convergence:
   θⱼ := θⱼ − α · ∂E(θ₀, θ₁)/∂θⱼ      for j = 0 and j = 1
where α > 0 is the learning rate and ∂E/∂θⱼ is the derivative of E with respect to θⱼ.
• At each iteration, we need to update θ₀ and θ₁ simultaneously (at the same time).
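A small Python sketch of this simultaneous-update rule, assuming the gradient of E is available as a function (the error function in the example below is hypothetical, chosen only so the result is easy to check):

```python
import numpy as np

def gradient_descent(grad_E, theta_init, alpha=0.01, n_iters=1000):
    """Generic gradient descent: theta_j := theta_j - alpha * dE/dtheta_j.

    grad_E(theta) must return the vector of partial derivatives of E.
    All parameters are updated simultaneously (one vector assignment).
    """
    theta = np.array(theta_init, dtype=float)
    for _ in range(n_iters):
        theta = theta - alpha * grad_E(theta)  # simultaneous update of all theta_j
    return theta

# Example: minimize E(theta) = (theta0 - 3)^2 + (theta1 + 1)^2.
grad = lambda t: np.array([2 * (t[0] - 3), 2 * (t[1] + 1)])
print(gradient_descent(grad, theta_init=[0.0, 0.0]))  # approaches [3, -1]
```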
Gradient Descent – Algorithm
Assume for now that we have only one parameter θ₁.

[Sequence of plots of E(θ₁): starting from an initial θ₁, each update moves θ₁ one step in the
 direction that decreases E(θ₁), until it settles near a minimum]
Gradient Descent – Algorithm
Assume for now that we have only one parameter θ₁.
• If the derivative dE/dθ₁ ≥ 0 (positive slope), the update θ₁ := θ₁ − α · dE/dθ₁ decreases θ₁.
• If the derivative dE/dθ₁ ≤ 0 (negative slope), the update increases θ₁.
In both cases, θ₁ moves towards the minimum of E(θ₁).

[Two plots of E(θ₁): the current θ₁ to the right of the minimum (positive slope) and to the left of it (negative slope)]
Gradient Descent – Algorithm
• Reasonably small value of α: notice that as we approach a local minimum, gradient descent will
  automatically take smaller steps (why?). So, there is no need to decrease α over time.
• Very large value of α: if α is too large, gradient descent may fail to converge, or may even diverge.

[Two plots of E(θ₁): converging steps for a reasonably small α, overshooting/diverging steps for a very large α]
Gradient Descent – Local minimum
• If the current value of θ₁ sits at a local minimum, the derivative there is 0, so the update
  leaves θ₁ unchanged: gradient descent stays at that local minimum.

[Plot of a non-convex function of θ₁ with the current value of θ₁ at a local minimum]
Gradient descent for linear regression
• Plug the MSE error function into the gradient descent update rule. The partial derivatives are:
   ∂E/∂θ₀ = (2/n) Σᵢ (h_θ(x^(i)) − y^(i))
   ∂E/∂θ₁ = (2/n) Σᵢ (h_θ(x^(i)) − y^(i)) · x^(i)
• Pick some initial values for θ₀ and θ₁.
• Repeat until convergence: update θ₀ and θ₁ simultaneously using these derivatives (see the sketch below).
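Putting the pieces together, a minimal Python/NumPy sketch of gradient descent for one-feature linear regression on the example houses; the learning rate, iteration count and the min-max scaling of the feature are choices made for this sketch, not values from the slides.

```python
import numpy as np

# House sizes (feet^2) and prices (1000$) from the slides.
x = np.array([2104.0, 1416.0, 1534.0, 852.0])
y = np.array([460.0, 230.0, 315.0, 178.0])

# Scale the feature so that gradient descent converges with a simple alpha
# (see the feature scaling section at the end of the lecture).
x_scaled = (x - x.min()) / (x.max() - x.min())

theta0, theta1 = 0.0, 0.0    # initial values
alpha, n_iters = 0.5, 5000   # learning rate and number of iterations
n = len(x_scaled)

for _ in range(n_iters):
    residual = theta0 + theta1 * x_scaled - y        # h_theta(x^(i)) - y^(i)
    grad0 = (2.0 / n) * residual.sum()               # dE/dtheta0
    grad1 = (2.0 / n) * (residual * x_scaled).sum()  # dE/dtheta1
    # Simultaneous update of both parameters.
    theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1

# Predict the price (in 1000$) of a 1250 feet^2 house, scaled the same way.
x_new = (1250.0 - x.min()) / (x.max() - x.min())
print(theta0 + theta1 * x_new)
```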
Multivariate Linear Regression

Multivariate linear regression
• Each house is now described by several features:

  Size (x₁)   Nb rooms (x₂)   Location (x₃)   Nb floors (x₄)   Age (x₅)   …   Price (y)
  2104        6               2               2                45         …   460
  1416        5               10              1                40         …   230
  1534        5               3               2                30         …   315
  852         4               2               1                35         …   178
  …           …               …               …                …          …   …

• The model becomes h_θ(x) = θᵀx = θ₀ + θ₁x₁ + θ₂x₂ + … + θ_d x_d   (with the convention x₀ = 1)
Multivariate linear regression
Gradient descent:
   θⱼ := θⱼ − α · ∂E(θ)/∂θⱼ = θⱼ − α · (2/n) Σᵢ (h_θ(x^(i)) − y^(i)) · xⱼ^(i)
• Simultaneously update all θⱼ for j = 0, …, d   (with x₀^(i) = 1 for every example i)
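A vectorized Python/NumPy sketch of the same update for the multivariate case; the data matrix reuses the small example table above, and standardizing the features plus the chosen α and iteration count are assumptions of this sketch.

```python
import numpy as np

# Data matrix from the table above: one row per house, one column per feature
# (size, nb rooms, location, nb floors, age), and the target prices y.
X = np.array([[2104, 6, 2, 2, 45],
              [1416, 5, 10, 1, 40],
              [1534, 5, 3, 2, 30],
              [852, 4, 2, 1, 35]], dtype=float)
y = np.array([460.0, 230.0, 315.0, 178.0])

# Standardize the features, then prepend the constant column x0 = 1.
X = (X - X.mean(axis=0)) / X.std(axis=0)
X = np.hstack([np.ones((X.shape[0], 1)), X])

n = X.shape[0]
theta = np.zeros(X.shape[1])     # initial parameters theta_0 ... theta_d
alpha, n_iters = 0.05, 5000

for _ in range(n_iters):
    residual = X @ theta - y                 # h_theta(x^(i)) - y^(i) for all i
    grad = (2.0 / n) * (X.T @ residual)      # dE/dtheta_j for all j at once
    theta = theta - alpha * grad             # simultaneous update of all theta_j

print(theta)
```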
More about gradient descent
Convergence & Selecting α
• For a sufficiently small α, the error E(θ) (on the training set) should decrease at every iteration.
• If E(θ) instead increases or oscillates from one iteration to the next, you should use a smaller α.

[Plots of E(θ) versus the number of gradient descent iterations: a decreasing curve for a good α,
 and increasing or oscillating curves for an α that is too large]
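A small sketch of this convergence check in Python, recording E(θ) after every iteration; the tiny example reuses the scaled one-feature house data, and α and the iteration count are assumptions.

```python
import numpy as np

def gradient_descent_with_history(X, y, alpha, n_iters):
    """Run gradient descent and record E(theta) after every iteration."""
    n = len(y)
    theta = np.zeros(X.shape[1])
    history = []
    for _ in range(n_iters):
        residual = X @ theta - y
        theta = theta - alpha * (2.0 / n) * (X.T @ residual)
        history.append(np.mean((X @ theta - y) ** 2))  # error after the update
    return theta, history

# Column of ones (x0 = 1) plus the min-max-scaled house sizes; prices as targets.
X = np.hstack([np.ones((4, 1)), np.array([[1.0], [0.45], [0.54], [0.0]])])
y = np.array([460.0, 230.0, 315.0, 178.0])

theta, hist = gradient_descent_with_history(X, y, alpha=0.5, n_iters=50)
# For a sufficiently small alpha the error should decrease at every iteration.
print(all(e2 <= e1 for e1, e2 in zip(hist, hist[1:])))  # expect True
```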
Linear regression with the normal equation
• Method to solve for θ analytically.
• The derivative at the optimal point equals 0. So, set the derivatives to 0,
   ∂E(θ)/∂θⱼ = 0   for j = 0, 1, …, d,
  and solve for θ₀, θ₁, …, θ_d.
• The solution can be written with the pseudo-inverse of the data matrix X:
   θ = (XᵀX)⁻¹ Xᵀ y
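A minimal Python/NumPy sketch of this normal-equation solution on the one-feature house example, using NumPy's pseudo-inverse:

```python
import numpy as np

# Design matrix: a column of ones (x0 = 1) followed by the feature (house size).
X = np.array([[1.0, 2104.0],
              [1.0, 1416.0],
              [1.0, 1534.0],
              [1.0, 852.0]])
y = np.array([460.0, 230.0, 315.0, 178.0])

# Normal equation: theta = (X^T X)^{-1} X^T y, i.e. the pseudo-inverse of X times y.
theta = np.linalg.pinv(X) @ y
print(theta)                        # [theta0, theta1]
print(X @ theta)                    # fitted prices for the training houses
print(theta[0] + theta[1] * 1250)   # predicted price (1000$) for a 1250 feet^2 house
```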
Gradient Descent vs. Normal Equation
(for linear regression)
(2) Non-linear Regression
Non-linear regression
• The output h_θ(x) is a non-linear function of the input x
• e.g. a polynomial function of degree > 1

[Two plots of y versus x: a linear fit and a non-linear (polynomial) fit]
Non-linear regression
• Examples of polynomial regression models with two features x₁, x₂
• A second order polynomial would be:
   h_θ(x) = θ₀ + θ₁x₁ + θ₂x₂ + θ₃x₁x₂ + θ₄x₁² + θ₅x₂²
Generalized Linear Model
for Non-linear Regression
Generalized Linear Model
• h_θ(x) = θ₀ + θ₁x₁ + θ₂x₂ + θ₃x₁x₂ + θ₄x₁² + θ₅x₂²
• h_θ(x) = θᵀz, where z = (1, x₁, x₂, x₁x₂, x₁², x₂²) is a vector of new features built from the
  two original features x₁, x₂.
• This can be seen as just creating and adding new features based on the two original features x₁, x₂.
• We can still find the parameters θ of the non-linear model (in x) using a linear model based on z.
  So, we can use the methods we studied previously for linear regression (see the sketch below).
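For instance, a short Python/NumPy sketch that builds the second-order features z and then fits θ exactly as in linear regression; the toy dataset below is hypothetical, generated only to check that the known coefficients are recovered.

```python
import numpy as np

def poly2_features(x1, x2):
    """Map the two original features to the second-order basis z = (1, x1, x2, x1*x2, x1^2, x2^2)."""
    return np.column_stack([np.ones_like(x1), x1, x2, x1 * x2, x1 ** 2, x2 ** 2])

# Hypothetical toy dataset: y depends non-linearly on x1, x2.
rng = np.random.default_rng(0)
x1 = rng.uniform(-1, 1, 50)
x2 = rng.uniform(-1, 1, 50)
y = 1.0 + 2.0 * x1 - 3.0 * x1 * x2 + 0.5 * x2 ** 2 + rng.normal(0, 0.05, 50)

Z = poly2_features(x1, x2)        # new "linear" features z
theta = np.linalg.pinv(Z) @ y     # fit exactly as in linear regression
print(np.round(theta, 2))         # approx [1.0, 2.0, 0.0, -3.0, 0.0, 0.5], up to noise
```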
Generalized Linear Model
• The new features can be of any kind (also called: basis functions)
  – e.g. log x₁, x₂, x₁x₂, x₁², x₂³, …
• Requires a good guess of relevant features for your problem …

[Plot: price as a non-linear function of a feature, fitted using such basis functions]
K-Nearest-Neighbors (KNN)
for Non-linear Regression
K-Nearest-Neighbors Regression
• KNN is a non-parametric model
  – This does not mean parameter-free (e.g. K is a hyper-parameter).
• Parametric: we select a hypothesis space and adjust a fixed set of parameters using the training data.
• Non-parametric: the model is not characterized by a fixed set of parameters.
K-Nearest-Neighbors Regression
[Plots of KNN regression predictions: y as a function of x₁]
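A minimal Python/NumPy sketch of (unweighted) KNN regression, assuming the usual rule of averaging the targets of the K nearest training points; the toy data below is hypothetical.

```python
import numpy as np

def knn_regress(x_query, X_train, y_train, k=3):
    """Predict y for x_query as the average target of the k nearest neighbours."""
    dists = np.linalg.norm(X_train - x_query, axis=1)   # Euclidean distances to all training points
    nearest = np.argsort(dists)[:k]                     # indices of the k closest points
    return y_train[nearest].mean()

# Hypothetical 1-D example: a noisy sine curve.
rng = np.random.default_rng(0)
X_train = np.sort(rng.uniform(0, 6, 40)).reshape(-1, 1)
y_train = np.sin(X_train[:, 0]) + rng.normal(0, 0.1, 40)

print(knn_regress(np.array([3.0]), X_train, y_train, k=5))  # roughly sin(3.0) ~ 0.14
```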
Kernel Regression
(Non-linear Regression)
Kernel Regression
• Very similar to the weighted KNN method.
• A common kernel regression model is the Nadaraya-Watson estimator, with a Gaussian kernel function:
   h(x) = Σᵢ k(x, x^(i)) · y^(i) / Σᵢ k(x, x^(i)),   with k(x, x′) = exp(−‖x − x′‖² / (2σ²))
  where σ is the width of the Gaussian.

[Plots of kernel regression predictions: y as a function of x₁, for different widths σ of the Gaussian]
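A minimal Python/NumPy sketch of the Nadaraya-Watson estimator with a Gaussian kernel; the toy data and the kernel width σ are assumptions of this sketch.

```python
import numpy as np

def nadaraya_watson(x_query, X_train, y_train, sigma=0.3):
    """Nadaraya-Watson estimate: a weighted average of all training targets,
    with Gaussian weights that decay with the distance to x_query.
    sigma is the width of the Gaussian kernel."""
    d2 = np.sum((X_train - x_query) ** 2, axis=1)   # squared distances to all training points
    w = np.exp(-d2 / (2.0 * sigma ** 2))            # Gaussian kernel weights
    return np.sum(w * y_train) / np.sum(w)

# Same hypothetical noisy-sine data as in the KNN example.
rng = np.random.default_rng(0)
X_train = np.sort(rng.uniform(0, 6, 40)).reshape(-1, 1)
y_train = np.sin(X_train[:, 0]) + rng.normal(0, 0.1, 40)

print(nadaraya_watson(np.array([3.0]), X_train, y_train, sigma=0.3))
```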
Feature Normalization / Scaling
Feature Normalization
• Feature normalization is a preprocessing step used to normalize the range of the features.
• Motivation:
  – Suppose that some ML algorithm computes the Euclidean distance between two points. If one of
    the features has a broad range of values, the distance will be governed by this particular
    feature. Therefore, the range of all features should be normalized so that each feature
    contributes approximately proportionately to the final distance.
  – Example: for the two points (0.11, 182) and (0.52, 179), the Euclidean distance is
    ‖(0.11, 182) − (0.52, 179)‖ = √((0.11 − 0.52)² + (182 − 179)²) ≈ 3.027,
    which is almost entirely determined by the second feature.
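The same computation as a short Python/NumPy snippet, reproducing the ≈ 3.027 distance above:

```python
import numpy as np

# Two points whose features live on very different scales, as in the example above.
a = np.array([0.11, 182.0])
b = np.array([0.52, 179.0])

print(np.linalg.norm(a - b))   # ~3.027, dominated by the second feature
print(np.abs(a - b))           # per-feature differences: [0.41, 3.0]
```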
Min-max feature scaling
   v′ⱼ = (vⱼ − min(vⱼ)) / (max(vⱼ) − min(vⱼ))
• vⱼ is a column (corresponding to feature j) from the data matrix X.
• v′ⱼ contains the normalized values of feature j. These values will be ∈ [0, 1].
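A short Python/NumPy sketch of min-max scaling applied column by column; the small data matrix below is hypothetical.

```python
import numpy as np

# Data matrix X: one row per example, one column per feature.
X = np.array([[0.11, 182.0],
              [0.52, 179.0],
              [0.30, 175.0]])

# Min-max scaling, column by column: the training values end up in [0, 1].
X_min = X.min(axis=0)
X_max = X.max(axis=0)
X_scaled = (X - X_min) / (X_max - X_min)
print(X_scaled)

# A new example must be scaled with the SAME min/max computed on the training data.
x_new = np.array([0.40, 180.0])
print((x_new - X_min) / (X_max - X_min))
```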
Min-max feature scaling
[Scatter plots of the data before and after feature scaling]
Feature standardization
   v′ⱼ = (vⱼ − mean(vⱼ)) / stdev(vⱼ)
• vⱼ is a column (corresponding to feature j) from the data matrix X.
• v′ⱼ contains the standardized values of feature j. These values will have mean 0 and standard deviation 1.
• To standardize, we just subtract the mean and divide by the standard deviation.
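A short Python/NumPy sketch of standardization, verifying that each standardized feature has mean 0 and standard deviation 1; the data matrix is the same hypothetical example as above.

```python
import numpy as np

X = np.array([[0.11, 182.0],
              [0.52, 179.0],
              [0.30, 175.0]])

# Standardization, column by column: subtract the mean, divide by the standard deviation.
mu = X.mean(axis=0)
sigma = X.std(axis=0)
X_std = (X - mu) / sigma

print(X_std.mean(axis=0))   # ~0 for every feature
print(X_std.std(axis=0))    # 1 for every feature
```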
Feature standardization
[Scatter plots of the data before and after standardization]
• NOTE: do not rescale or standardize the output (target variable) y; only the feature columns vⱼ of X:
   Min-max scaling:    v′ⱼ = (vⱼ − min(vⱼ)) / (max(vⱼ) − min(vⱼ))
   Standardization:    v′ⱼ = (vⱼ − mean(vⱼ)) / stdev(vⱼ)