Lecture 3 - Linear Regression
Supervised Learning:
Linear regression
Dr. Rajdip Nayek
Block V, Room 418D
Department of Applied Mechanics
Indian Institute of Technology Delhi
E-mail: [email protected]
Learning goals
• Think about the data points and the model parameters as vectors
• Solve the optimization problem using two different strategies: deriving a closed-form
solution, and applying gradient descent
• Write the algorithm in terms of linear algebra, so that we can think about it more
easily
• Make a linear algorithm more powerful using nonlinear basis functions, or features
Supervised learning
• Given a set of examples in the form (𝒙, 𝑡)
• Ex. 𝒙 is 𝐾-dimensional input, 𝑡 is scalar output
Supervised learning
• Two types of supervised learning
[Diagram: the model 𝑓(𝒙) maps an input 𝒙 from the domain (input) to an output 𝑡 in the range (output)]
An example of classification
• Problem: Will you enjoy an outdoor sport based on the weather?
More examples of classification/regression
[Table with columns: Problem, Input domain, Output range, Classification/Regression]
Key components of any ML algorithm
3. A loss function that quantifies how well (or badly) the model is doing
Linear regression
• Given several pairs $\{(\boldsymbol{x}^{(i)}, t^{(i)})\}_{i=1}^{N}$
• Find a relation between vector 𝒙 and 𝑡, given data $\{(\boldsymbol{x}^{(i)}, t^{(i)})\}_{i=1}^{N}$
• A linear model with 𝐾 input features:
$$y = w_1 x_1 + w_2 x_2 + \cdots + w_K x_K + b = \boldsymbol{w}^T \boldsymbol{x} + b$$
Linear regression: Problem setup
• How good is the model fit with respect to the data pairs $\{(\boldsymbol{x}^{(i)}, t^{(i)})\}_{i=1}^{N}$?
• To assess model fit, we define errors and accumulate them using a loss function
• Squared error for the $i$th example: $\left(t^{(i)} - y^{(i)}\right)^2$
$$\mathcal{L}(\boldsymbol{w}, b) = \frac{1}{2N} \sum_{i=1}^{N} \left(t^{(i)} - y^{(i)}\right)^2 = \frac{1}{2N} \sum_{i=1}^{N} \left(t^{(i)} - \boldsymbol{w}^T \boldsymbol{x}^{(i)} - b\right)^2$$
• The $\tfrac{1}{2}$ factor is for convenience and the $\tfrac{1}{N}$ factor takes the average over the full dataset
• Recall the linear model with 𝐾 input features: $y = \boldsymbol{w}^T \boldsymbol{x} + b$
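As a concrete illustration of the loss above, here is a minimal sketch in Python/NumPy (the function and variable names are my own, not from the slides) that accumulates the per-example squared errors and averages them:

```python
import numpy as np

def squared_error_loss(w, b, X, t):
    """Average squared-error loss: (1 / 2N) * sum_i (t^(i) - w^T x^(i) - b)^2."""
    N = X.shape[0]
    total = 0.0
    for i in range(N):                 # accumulate the per-example squared errors
        y_i = np.dot(w, X[i]) + b      # prediction y^(i) = w^T x^(i) + b
        total += (t[i] - y_i) ** 2
    return total / (2 * N)

# Toy usage: N = 3 examples, K = 2 features
X = np.array([[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]])
t = np.array([2.0, 0.0, 4.0])
print(squared_error_loss(np.array([1.0, 0.5]), 0.1, X, t))
```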
Vectorization
• We can organize all the data into a matrix 𝐗 with one row per example, and all the output predictions into a vector 𝒚
$$\mathbf{X} = \begin{bmatrix} \boldsymbol{x}^{(1)T} \\ \vdots \\ \boldsymbol{x}^{(N)T} \end{bmatrix} = \begin{bmatrix} 1 & 3 & -2 & 0.1 \\ \vdots & \vdots & \vdots & \vdots \\ 9 & 2 & 0 & -0.2 \end{bmatrix} \in \mathbb{R}^{N \times K}$$
(each row of $\mathbf{X}$ is one example; each column is one feature across all examples)
$$\mathcal{L}(\boldsymbol{w}, b) = \frac{1}{2N} \|\boldsymbol{t} - \boldsymbol{y}\|_2^2 = \frac{1}{2N} (\boldsymbol{t} - \boldsymbol{y})^T (\boldsymbol{t} - \boldsymbol{y})$$
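A vectorized sketch of the same loss, assuming $\mathbf{X}$ has one example per row (names are illustrative, not from the slides):

```python
import numpy as np

def squared_error_loss_vec(w, b, X, t):
    """Vectorized loss: L = (t - y)^T (t - y) / (2N), with y = Xw + 1b."""
    N = X.shape[0]
    y = X @ w + b          # all N predictions in one matrix-vector product
    r = t - y              # residual vector t - y
    return (r @ r) / (2 * N)
```

It computes exactly the same number as the loop version, but in a form that maps directly onto the matrix notation used from here on.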
Solving the optimization problem
• We would like to minimize the loss function!
• Recall from calculus class: the minimum of a smooth function occurs at a critical point, i.e. a point where the gradients are all 0
[Figure: loss curve ℒ(𝑤) with its minimum at a critical point]
• Direct solution: derive a formula that sets the gradient to 0. This works only in a handful of cases (e.g. linear regression)
• Set the gradient of the loss function with respect to the weights and bias to zero
• Collect the bias with the weights by appending a column of ones to $\mathbf{X}$, so that $\bar{\mathbf{X}} = [\mathbf{X} \;\; \mathbf{1}]$, $\bar{\boldsymbol{w}} = [\boldsymbol{w};\, b]$, and
$$\boldsymbol{y} = \mathbf{X}\boldsymbol{w} + \mathbf{1}b = \bar{\mathbf{X}}\bar{\boldsymbol{w}}$$
$$\frac{d\boldsymbol{y}}{d\bar{\boldsymbol{w}}} = \frac{d(\bar{\mathbf{X}}\bar{\boldsymbol{w}})}{d\bar{\boldsymbol{w}}} = \bar{\mathbf{X}}^T \quad \text{(denominator layout)}$$
$$\mathcal{L} = \frac{1}{2N}(\boldsymbol{t} - \boldsymbol{y})^T(\boldsymbol{t} - \boldsymbol{y})$$
$$\frac{d\mathcal{L}}{d\boldsymbol{y}} = -\frac{1}{N}(\boldsymbol{t} - \boldsymbol{y})$$
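To sanity-check the gradient $\frac{d\mathcal{L}}{d\bar{\boldsymbol{w}}} = -\frac{1}{N}\bar{\mathbf{X}}^T(\boldsymbol{t} - \boldsymbol{y})$ implied by these two derivatives, one can compare it against a finite-difference estimate. A sketch (function names are made up for this illustration):

```python
import numpy as np

def analytic_gradient(w_bar, X_bar, t):
    """dL/dw_bar = -(1/N) * X_bar^T (t - y), with y = X_bar @ w_bar."""
    N = X_bar.shape[0]
    return -X_bar.T @ (t - X_bar @ w_bar) / N

def numerical_gradient(w_bar, X_bar, t, eps=1e-6):
    """Central finite differences of the loss, one coordinate at a time."""
    N = X_bar.shape[0]
    loss = lambda w: np.sum((t - X_bar @ w) ** 2) / (2 * N)
    grad = np.zeros_like(w_bar)
    for j in range(w_bar.size):
        e = np.zeros_like(w_bar)
        e[j] = eps
        grad[j] = (loss(w_bar + e) - loss(w_bar - e)) / (2 * eps)
    return grad
```

The two gradients should agree to several decimal places for any choice of `w_bar`, `X_bar`, and `t`.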
Direct solution
• Use the chain rule for derivatives, $\dfrac{d\mathcal{L}}{d\bar{\boldsymbol{w}}} = \dfrac{d\boldsymbol{y}}{d\bar{\boldsymbol{w}}}\dfrac{d\mathcal{L}}{d\boldsymbol{y}}$, with $\dfrac{d\mathcal{L}}{d\boldsymbol{y}} = -\dfrac{1}{N}(\boldsymbol{t} - \boldsymbol{y})$, and set it to zero:
$$\Rightarrow\; -\frac{1}{N}\bar{\mathbf{X}}^T(\boldsymbol{t} - \boldsymbol{y}) = \boldsymbol{0}$$
$$\bar{\mathbf{X}}^T(\boldsymbol{t} - \boldsymbol{y}) = \boldsymbol{0}$$
$$\bar{\mathbf{X}}^T\boldsymbol{t} - \bar{\mathbf{X}}^T\bar{\mathbf{X}}\bar{\boldsymbol{w}} = \boldsymbol{0}$$
$$\bar{\mathbf{X}}^T\bar{\mathbf{X}}\bar{\boldsymbol{w}} = \bar{\mathbf{X}}^T\boldsymbol{t}$$
$$\bar{\boldsymbol{w}} = \left(\bar{\mathbf{X}}^T\bar{\mathbf{X}}\right)^{-1}\bar{\mathbf{X}}^T\boldsymbol{t}$$
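A minimal sketch of the direct solution in NumPy (illustrative names; in practice one solves the normal equations with `np.linalg.solve` rather than forming the inverse explicitly):

```python
import numpy as np

def fit_direct(X, t):
    """Solve the normal equations X_bar^T X_bar w_bar = X_bar^T t for w_bar = [w; b]."""
    N = X.shape[0]
    X_bar = np.hstack([X, np.ones((N, 1))])              # append a ones column for the bias
    w_bar = np.linalg.solve(X_bar.T @ X_bar, X_bar.T @ t)
    return w_bar[:-1], w_bar[-1]                          # weights w, bias b
```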
Iterative solution using gradient descent
• Gradient descent is an iterative algorithm, which means we apply an update repeatedly until some convergence criterion is met (e.g. the loss function stops changing much)
• The gradient direction is the direction of steepest ascent, i.e. the direction in which the function increases; gradient descent therefore steps in the opposite direction
• The update rule:
$$\bar{\boldsymbol{w}} \leftarrow \text{initialize}$$
$$\bar{\boldsymbol{w}} \leftarrow \bar{\boldsymbol{w}} - \alpha \frac{d\mathcal{L}(\bar{\boldsymbol{w}})}{d\bar{\boldsymbol{w}}}$$
$$\bar{\boldsymbol{w}} \leftarrow \bar{\boldsymbol{w}} + \frac{\alpha}{N}\bar{\mathbf{X}}^T(\boldsymbol{t} - \boldsymbol{y})$$
$$\bar{\boldsymbol{w}} \leftarrow \bar{\boldsymbol{w}} + \frac{\alpha}{N}\bar{\mathbf{X}}^T(\boldsymbol{t} - \bar{\mathbf{X}}\bar{\boldsymbol{w}})$$
• $\alpha$ is the learning rate. Larger values of $\alpha$ imply a greater change in $\bar{\boldsymbol{w}}$ per update
• Typical values are 0.01 or 0.001
• Gradient descent can get stuck in local optima; we will see more on this later
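The update above translates almost line for line into code. A sketch, assuming the same augmented design matrix as before (names are illustrative):

```python
import numpy as np

def fit_gradient_descent(X, t, alpha=0.01, num_iters=1000):
    """Repeat the update w_bar <- w_bar + (alpha/N) X_bar^T (t - X_bar w_bar)."""
    N = X.shape[0]
    X_bar = np.hstack([X, np.ones((N, 1))])   # augment with a ones column for the bias
    w_bar = np.zeros(X_bar.shape[1])          # initialize
    for _ in range(num_iters):                # in practice, also check a convergence criterion
        y = X_bar @ w_bar                     # current predictions
        w_bar = w_bar + (alpha / N) * X_bar.T @ (t - y)
    return w_bar[:-1], w_bar[-1]              # weights w, bias b
```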
Iterative vs direct solution
• By setting the gradients to zero, we compute the direct (or exact) solution. With
gradient descent, we approach it gradually
• For regression, the direct solution requires a matrix inverse, which can be computationally very costly for a large number of features:
$$\bar{\boldsymbol{w}} = \left(\bar{\mathbf{X}}^T\bar{\mathbf{X}}\right)^{-1}\bar{\mathbf{X}}^T\boldsymbol{t}$$
Nonlinear feature maps
• We can convert linear models into nonlinear models using nonlinear feature maps
$$y = \boldsymbol{w}^T \phi(\boldsymbol{x}) + b$$
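One possible feature map is a polynomial basis; the sketch below (with made-up names) shows that the regression machinery itself is unchanged, only the inputs are transformed:

```python
import numpy as np

def polynomial_features(x, degree=3):
    """Feature map phi(x) = [x, x^2, ..., x^degree] for scalar inputs x of shape (N,)."""
    return np.column_stack([x ** d for d in range(1, degree + 1)])

# e.g. reuse the earlier closed-form sketch on the transformed inputs:
# w, b = fit_direct(polynomial_features(x), t)
```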
Generalization
• To assess how well the model generalizes to unseen data, we divide the total set of 𝑁 examples into a training set and a test set
[Diagram: the total 𝑁 examples split into a training set (e.g. 70%) and a test set (e.g. 30%)]
• The test set is used at the very end, to estimate the generalization error of the final model, once all hyperparameters have been chosen
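A simple random split can be sketched as follows (function name and fractions are illustrative, not prescribed by the slides):

```python
import numpy as np

def train_test_split(X, t, train_fraction=0.7, seed=0):
    """Randomly assign the N examples to a training set and a test set."""
    N = X.shape[0]
    rng = np.random.default_rng(seed)
    idx = rng.permutation(N)                      # shuffle the example indices
    n_train = int(train_fraction * N)
    return (X[idx[:n_train]], t[idx[:n_train]],   # training inputs and targets
            X[idx[n_train:]], t[idx[n_train:]])   # test inputs and targets
```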