Linear Regression
Fundamentals of Data Science
29 October, 3, 5, 10 November 2020
Prof. Fabio Galasso
Motivating example: ALVINN '97 (An Autonomous Land Vehicle In a Neural Network), an early system that learned to steer a vehicle from driving data.
Outline
• Linear regression
One or multiple variables
Cost function
• Gradient descent, incl. batch/stochastic GD
• Normal equation
• MSE and Correlation
• Locally-Weighted Regression
• Probabilistic Interpretation of Least Squares
Linear Regression
[Figure: Housing prices (Portland, OR) — price (in 1000s of dollars) vs. size (feet²), with a regression line through the data points.]
Supervised learning: the "right answer" is given for each example in the data.
Regression problem: predict a real-valued output.
Training set of housing prices (Portland, OR):

    Size in feet² (x)    Price ($) in 1000's (y)
    2104                 460
    1416                 232
    1534                 315
    852                  178
    …                    …
Notation:
m = Number of training examples
x’s = “input” variable / features
y’s = “output” variable / “target” variable
Training Set → Learning Algorithm → hypothesis h.
Size of house (x) → h → estimated price (y).
How do we represent h?
Linear regression with one variable (univariate linear regression): h maps from x's to y's, with

    h_θ(x) = θ₀ + θ₁x
Multiple features (variables)
    Size (feet²)   Price ($1000)
    2104           460
    1416           232
    1534           315
    852            178
    …              …
Multiple features (variables)
    Size (feet²)   Number of bedrooms   Number of floors   Age of home (years)   Price ($1000)
    2104           5                    1                  45                    460
    1416           3                    2                  40                    232
    1534           3                    2                  30                    315
    852            2                    1                  36                    178
    …              …                    …                  …                     …
Notation:
n = number of features
x^(i) = input (features) of the i-th training example
x_j^(i) = value of feature j in the i-th training example
Hypothesis:
With one variable: h_θ(x) = θ₀ + θ₁x
With multiple variables: h_θ(x) = θ₀ + θ₁x₁ + θ₂x₂ + … + θₙxₙ
For convenience of notation, define x₀ = 1, so that h_θ(x) = θ₀x₀ + θ₁x₁ + … + θₙxₙ = θᵀx.
Multivariate linear regression.
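As a small illustration (not part of the original slides), the multivariate hypothesis is a single dot product in NumPy; the feature values come from the table above and the θ values are made up:

    import numpy as np

    # One training example from the table above: size, bedrooms, floors, age
    x = np.array([1.0, 2104.0, 5.0, 1.0, 45.0])        # x0 = 1 prepended for convenience
    theta = np.array([80.0, 0.1, 10.0, 3.0, -2.0])     # made-up parameter values

    h = theta @ x                                      # hypothesis h_theta(x) = theta^T x
    print(h)                                           # predicted price in $1000s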
Linear Regression:
Cost Function
Training Set:

    Size in feet² (x)    Price ($) in 1000's (y)
    2104                 460
    1416                 232
    1534                 315
    852                  178
    …                    …
Hypothesis: h_θ(x) = θ₀ + θ₁x
θ₀, θ₁: parameters
How to choose θ₀, θ₁?
[Figure: three example hypothesis lines h_θ(x) = θ₀ + θ₁x for three different choices of (θ₀, θ₁).]
Idea: choose θ₀, θ₁ so that h_θ(x) is close to y for our training examples (x, y).
Simplified setting (θ₀ = 0):
Hypothesis: h_θ(x) = θ₁x
Parameter: θ₁
Cost function: J(θ₁) = (1/2m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))²
Goal: minimize J(θ₁) over θ₁
(For fixed θ₁, h_θ(x) is a function of x; J(θ₁) is a function of the parameter θ₁.)

[Figures: left, the training points and the line h_θ(x) for a particular value of θ₁; right, the corresponding value of the cost J(θ₁) plotted against θ₁. Repeating this for several values of θ₁ traces out a bowl-shaped J(θ₁) curve whose minimum corresponds to the best-fitting line.]
Hypothesis: h_θ(x) = θ₀ + θ₁x
Parameters: θ₀, θ₁
Cost function: J(θ₀, θ₁) = (1/2m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))²
Goal: minimize J(θ₀, θ₁) over θ₀, θ₁
(For fixed θ₀, θ₁, h_θ(x) is a function of x; J(θ₀, θ₁) is a function of the parameters θ₀, θ₁.)

[Figures: left, the housing data (price ($) in 1000's vs. size in feet²) with the line h_θ(x) for particular values of θ₀, θ₁; right, J(θ₀, θ₁) shown as a 3-D surface and as contour plots. Each choice of θ₀, θ₁ corresponds to one line on the left and one point on the contour plot; the centre of the contours is the minimizing pair.]
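A minimal sketch (assumed, not from the slides) of computing J(θ₀, θ₁) on the training set above with NumPy:

    import numpy as np

    # Training set from the slides: size in feet^2 (x) and price in $1000s (y)
    x = np.array([2104.0, 1416.0, 1534.0, 852.0])
    y = np.array([460.0, 232.0, 315.0, 178.0])

    def cost(theta0, theta1, x, y):
        """Squared-error cost J(theta0, theta1) = (1/2m) * sum_i (h(x_i) - y_i)^2."""
        m = len(y)
        h = theta0 + theta1 * x                 # hypothesis evaluated on every example
        return np.sum((h - y) ** 2) / (2 * m)

    # Evaluate J for a few (arbitrary) parameter choices and compare
    for t0, t1 in [(0.0, 0.0), (0.0, 0.2), (50.0, 0.1)]:
        print(t0, t1, cost(t0, t1, x, y))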
Gradient Descent
Have some function J(θ₀, θ₁).
Want min over θ₀, θ₁ of J(θ₀, θ₁).

Outline:
• Start with some θ₀, θ₁ (e.g. θ₀ = 0, θ₁ = 0)
• Keep changing θ₀, θ₁ to reduce J(θ₀, θ₁), until we hopefully end up at a minimum

[Figures: surface plots of J(θ₀, θ₁); depending on where the descent starts, the downhill path can end up in different local minima.]
Gradient descent algorithm:

    repeat until convergence {
        θ_j := θ_j − α (∂/∂θ_j) J(θ₀, θ₁)    (for j = 0 and j = 1)
    }

Correct (simultaneous update):
    temp0 := θ₀ − α (∂/∂θ₀) J(θ₀, θ₁)
    temp1 := θ₁ − α (∂/∂θ₁) J(θ₀, θ₁)
    θ₀ := temp0
    θ₁ := temp1

Incorrect (sequential update):
    temp0 := θ₀ − α (∂/∂θ₀) J(θ₀, θ₁)
    θ₀ := temp0
    temp1 := θ₁ − α (∂/∂θ₁) J(θ₀, θ₁)    (this already uses the updated θ₀)
    θ₁ := temp1
Gradient descent algorithm: in the update θ_j := θ_j − α (∂/∂θ_j) J(θ), α is the learning rate.
If α is too small, gradient descent can be slow.
If α is too large, gradient descent can overshoot the minimum; it may fail to converge, or even diverge.
At a local optimum the derivative is zero, so the update leaves the current value of θ unchanged.
Gradient descent can converge to a local minimum even with the learning rate α fixed: as we approach a local minimum, the derivative shrinks and gradient descent automatically takes smaller steps, so there is no need to decrease α over time.
Gradient Descent:
for Linear Regression
Gradient descent algorithm applied to the linear regression model, h_θ(x) = θ₀ + θ₁x with J(θ₀, θ₁) = (1/2m) Σ_i (h_θ(x^(i)) − y^(i))². Computing the partial derivatives gives:

    repeat until convergence {
        θ₀ := θ₀ − α (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))
        θ₁ := θ₁ − α (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) x^(i)
    }

update θ₀ and θ₁ simultaneously.
[Figures: successive gradient-descent iterations shown both as hypothesis lines h_θ(x) over the housing data (for the current θ₀, θ₁) and as points moving across the contour plot of J(θ₀, θ₁) towards its minimum.]
“Batch” Gradient Descent
“Batch”: Each step of gradient descent
uses all the training examples.
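An illustrative sketch of batch gradient descent for the one-variable model on the housing data (assumptions: the 1/(2m) cost convention, an arbitrary learning rate and iteration count, and a scaled feature):

    import numpy as np

    x = np.array([2104.0, 1416.0, 1534.0, 852.0])      # size in feet^2
    y = np.array([460.0, 232.0, 315.0, 178.0])         # price in $1000s

    # Scale the feature so that a simple fixed learning rate works
    # (feature scaling is discussed later in these slides)
    x = (x - x.mean()) / x.std()

    theta0, theta1 = 0.0, 0.0
    alpha, m = 0.1, len(y)

    for _ in range(500):                               # fixed iteration budget for the sketch
        h = theta0 + theta1 * x                        # predictions on ALL m examples (batch)
        grad0 = np.sum(h - y) / m                      # dJ/dtheta0
        grad1 = np.sum((h - y) * x) / m                # dJ/dtheta1
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1   # simultaneous update

    print(theta0, theta1)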
Gradient Descent
• The gradient is computed over all the samples in the dataset
• This may be very slow
• Can we do something more efficient?
Stochastic Gradient Descent (SGD)
• For large training sets, evaluating the gradient over all samples may be expensive
• Stochastic or "online" gradient descent approximates the true gradient by the gradient at a single example

Pseudocode:
- Choose an initial vector of parameters θ^(0) and a learning rate α
- Repeat until convergence:
    • Randomly shuffle the examples in the training set
    • For k = 1, 2, …, m, do:
        θ^(i+1) = θ^(i) − α ∇J^(k)(θ^(i))
      where J^(k) is the cost on the k-th example alone and i counts the parameter updates
• Normally preferable: mini-batch gradient descent (see the sketch below)
    Consider a mini-batch of examples at each step
    This normally results in smoother convergence
    This is normally faster, thanks to vectorization libraries
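A rough NumPy sketch of the pseudocode above, extended to mini-batches (assumptions: X is a design matrix that already includes the x₀ = 1 column; α, batch size and epoch count are arbitrary):

    import numpy as np

    def minibatch_sgd(X, y, alpha=0.01, batch_size=2, epochs=100, seed=0):
        """Mini-batch SGD for linear regression; batch_size=1 gives plain SGD."""
        rng = np.random.default_rng(seed)
        m, n = X.shape
        theta = np.zeros(n)
        for _ in range(epochs):
            perm = rng.permutation(m)                       # randomly shuffle the examples
            for start in range(0, m, batch_size):
                idx = perm[start:start + batch_size]
                Xb, yb = X[idx], y[idx]
                grad = Xb.T @ (Xb @ theta - yb) / len(idx)  # gradient on this mini-batch only
                theta -= alpha * grad                       # theta := theta - alpha * grad J_k(theta)
        return theta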
Gradient Descent:
for Multiple Variables
Hypothesis: h_θ(x) = θᵀx = θ₀x₀ + θ₁x₁ + … + θₙxₙ (with x₀ = 1)
Parameters: θ = (θ₀, θ₁, …, θₙ)
Cost function: J(θ) = (1/2m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))²
Gradient descent:

    repeat {
        θ_j := θ_j − α (∂/∂θ_j) J(θ)
    }
    (simultaneously update for every j = 0, …, n)
Previously (n = 1):

    repeat {
        θ₀ := θ₀ − α (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))
        θ₁ := θ₁ − α (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) x^(i)
    }
    (simultaneously update θ₀ and θ₁)

New algorithm (n ≥ 1):

    repeat {
        θ_j := θ_j − α (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) x_j^(i)
    }
    (simultaneously update θ_j for j = 0, …, n)
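With the design matrix X (one row per example, first column x₀ = 1), the update of all θ_j can be written in one vectorized line; a sketch under the same 1/(2m) convention:

    import numpy as np

    def gradient_descent(X, y, alpha=0.01, iterations=1000):
        """Batch gradient descent for multivariate linear regression.

        X: (m, n+1) design matrix with a leading column of ones (x0 = 1); y: (m,) targets.
        """
        m, n = X.shape
        theta = np.zeros(n)
        for _ in range(iterations):
            grad = X.T @ (X @ theta - y) / m    # (1/m) sum_i (h(x_i) - y_i) x_i, all j at once
            theta -= alpha * grad               # simultaneous update of every theta_j
        return theta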
Feature Scaling
Idea: make sure features are on a similar scale.
E.g. x₁ = size (0–2000 feet²), x₂ = number of bedrooms (1–5). On the raw scale the contours of J(θ) are very elongated and gradient descent zig-zags slowly; after scaling the contours become more circular and it converges faster.
Feature Scaling
Get every feature into approximately a −1 ≤ x_j ≤ 1 range.

Mean normalization: replace x_j with x_j − μ_j so that features have approximately zero mean (do not apply this to x₀ = 1). In practice x_j := (x_j − μ_j) / s_j, where μ_j is the mean of feature j on the training set and s_j is its range (max − min) or its standard deviation. E.g. x₁ := (size − average size) / (range of sizes).
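A small sketch of mean normalization (assumed implementation; the standard deviation is used as s_j here, the range works as well):

    import numpy as np

    def scale_features(X):
        """Mean-normalize every column: x_j := (x_j - mu_j) / s_j (s_j = std here)."""
        mu = X.mean(axis=0)
        sigma = X.std(axis=0)
        return (X - mu) / sigma, mu, sigma      # keep mu, sigma to scale new inputs the same way

    # Housing features from the earlier table: size, bedrooms, floors, age
    X = np.array([[2104.0, 5.0, 1.0, 45.0],
                  [1416.0, 3.0, 2.0, 40.0],
                  [1534.0, 3.0, 2.0, 30.0],
                  [ 852.0, 2.0, 1.0, 36.0]])
    X_scaled, mu, sigma = scale_features(X)
    X_design = np.hstack([np.ones((X_scaled.shape[0], 1)), X_scaled])   # prepend x0 = 1, unscaled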
Gradient Descent:
Learning Rate
Gradient descent update: θ_j := θ_j − α (∂/∂θ_j) J(θ). Two practical questions:
- "Debugging": how to make sure gradient descent is working correctly.
- How to choose the learning rate α.
Making sure gradient descent is working correctly.
Plot J(θ) against the number of iterations: in a working run, J(θ) decreases on every iteration and flattens out as it converges. [Figure: J(θ) vs. number of iterations (0–400), decreasing and levelling off.]
Example automatic convergence test: declare convergence if J(θ) decreases by less than some small threshold ε (e.g. 10⁻³) in one iteration.
Making sure gradient descent is working correctly.
[Figures: J(θ) increasing with the number of iterations, or oscillating up and down.] In these cases gradient descent is not working: use a smaller α.
- For sufficiently small α, J(θ) should decrease on every iteration.
- But if α is too small, gradient descent can be slow to converge.
Summary:
- If α is too small: slow convergence.
- If α is too large: J(θ) may not decrease on every iteration; it may not converge.
To choose α, try a range of values spaced roughly by factors of 3–10 (e.g. …, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, …) and pick the largest value for which J(θ) still decreases steadily.
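One possible way (an assumed sketch, not prescribed by the slides) to record J(θ) per iteration and apply the convergence test while trying several learning rates:

    import numpy as np

    def run_gd(X, y, alpha, iterations=400, eps=1e-3):
        """Gradient descent that records J per iteration and applies the convergence test."""
        m, n = X.shape
        theta = np.zeros(n)
        history = []
        for _ in range(iterations):
            residual = X @ theta - y
            history.append(residual @ residual / (2 * m))      # J(theta) at this iteration
            theta -= alpha * (X.T @ residual) / m
            if len(history) > 1 and 0 <= history[-2] - history[-1] < eps:
                break                                          # J decreased by less than eps
        return theta, history

    # Try learning rates roughly a factor of 3 apart and compare the J curves, e.g.:
    # for alpha in [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0]:
    #     theta, history = run_gd(X_design, y, alpha)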
Linear Regression:
Features and Polynomial regression
Housing prices prediction: features can be combined or created, e.g. instead of using a plot's frontage and depth as two separate features, define the single feature area = frontage × depth and regress on that.
Polynomial regression: treat powers of a feature as additional features and fit a linear model in them, e.g. h_θ(x) = θ₀ + θ₁x + θ₂x² (quadratic) or h_θ(x) = θ₀ + θ₁x + θ₂x² + θ₃x³ (cubic). Feature scaling then becomes important, since x, x² and x³ have very different ranges. [Figure: price (y) vs. size (x) with polynomial fits.]
Choice of features: other nonlinear features are possible too, e.g. h_θ(x) = θ₀ + θ₁x + θ₂√x, which keeps increasing with x instead of eventually bending back down as a quadratic does. [Figure: price (y) vs. size (x) with the fitted curve.]
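A sketch of polynomial regression with NumPy's polyfit (which the slides mention later for least squares), using the (size, price) points from the earlier table and an arbitrarily chosen degree:

    import numpy as np

    # (size, price) points from the earlier housing table
    size = np.array([852.0, 1416.0, 1534.0, 2104.0])
    price = np.array([178.0, 232.0, 315.0, 460.0])

    # Quadratic fit h(x) = theta2*x^2 + theta1*x + theta0 (polyfit returns highest degree first)
    coeffs = np.polyfit(size, price, deg=2)
    h = np.poly1d(coeffs)
    print(h(1800.0))        # prediction for an 1800 feet^2 house under this quadratic model

    # Equivalently, x and x^2 can be treated as two features of a multivariate linear model;
    # feature scaling then matters, because x and x^2 have very different ranges.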
Dangers of (Polynomial) Regression
Overfitting and underfitting: a polynomial of very high degree can pass through every training point yet generalize poorly to new inputs (overfitting), while a model that is too simple cannot capture the underlying trend (underfitting).
Normal Equation
Gradient Descent

    Θ := Θ − α ∇_Θ J,    where  ∇_Θ J = [ ∂J/∂Θ₀, …, ∂J/∂Θₙ ]ᵀ ∈ ℝⁿ⁺¹

Normal equation: what about solving for Θ analytically?
Useful notation
• For f : ℝ^{m×n} → ℝ (so f(A) ∈ ℝ for A ∈ ℝ^{m×n}), define ∇_A f(A) as the m×n matrix of partial derivatives ∂f/∂A_{ij}.
• Trace: if A ∈ ℝ^{n×n}, tr A = Σ_{i=1}^{n} A_{ii}.
Facts
• Some facts of matrix derivatives (without proof):

    tr AB = tr BA
    tr ABC = tr CAB = tr BCA
    If f(A) = tr AB, then ∇_A tr AB = Bᵀ
    tr A = tr Aᵀ
    If a ∈ ℝ, tr a = a
    ∇_A tr ABAᵀC = CAB + CᵀABᵀ
m examples (x^(i), y^(i)); n features.

Cost function in matrix form: let X ∈ ℝ^{m×(n+1)} be the design matrix whose i-th row is (x^(i))ᵀ (with x₀^(i) = 1), and let y = [y^(1), …, y^(m)]ᵀ. Then

    XΘ − y = [ h(x^(1)) − y^(1) ; … ; h(x^(m)) − y^(m) ],    where Θᵀx^(i) = Σ_{j=0}^{n} Θ_j x_j^(i)

Recall that zᵀz = Σ_i z_i². Hence

    (1/2) (XΘ − y)ᵀ (XΘ − y) = (1/2) Σ_{i=1}^{m} (h_Θ(x^(i)) − y^(i))² = J(Θ)
Intuition: in 1D, J(θ) is a parabola in θ, so the minimum is attained where dJ/dθ = 0. In general, we solve for Θ analytically by setting ∇_Θ J(Θ) = 0.

Expanding J(Θ):

    J(Θ) = (1/2) (XΘ − y)ᵀ(XΘ − y) = (1/2) (ΘᵀXᵀXΘ − ΘᵀXᵀy − yᵀXΘ + yᵀy)

Taking the gradient (recall ∇_A tr ABAᵀC = CAB + CᵀABᵀ and ∇_A tr AB = Bᵀ):

    ∇_Θ J(Θ) = XᵀXΘ − Xᵀy

Normal equation: setting ∇_Θ J(Θ) = 0 gives

    XᵀXΘ = Xᵀy    ⟹    Θ = (XᵀX)⁻¹ Xᵀy
Examples: m = 4 training examples, n = 4 features.

    x₀   Size (feet²)   Number of bedrooms   Number of floors   Age of home (years)   Price ($1000)
    1    2104           5                    1                  45                    460
    1    1416           3                    2                  40                    232
    1    1534           3                    2                  30                    315
    1    852            2                    1                  36                    178

X is the 4×5 matrix formed by the feature columns (including x₀ = 1), y is the 4×1 vector of prices, and Θ = (XᵀX)⁻¹Xᵀy.
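A sketch of the normal equation on this example (assumption: the pseudo-inverse is used instead of a plain inverse, because with m = 4 examples and n + 1 = 5 parameters XᵀX is singular, the case discussed below):

    import numpy as np

    # Design matrix X (with the x0 = 1 column) and targets y from the table above
    X = np.array([[1.0, 2104.0, 5.0, 1.0, 45.0],
                  [1.0, 1416.0, 3.0, 2.0, 40.0],
                  [1.0, 1534.0, 3.0, 2.0, 30.0],
                  [1.0,  852.0, 2.0, 1.0, 36.0]])
    y = np.array([460.0, 232.0, 315.0, 178.0])

    # Literal normal equation, with the pseudo-inverse since X^T X is singular here
    theta_ne = np.linalg.pinv(X.T @ X) @ (X.T @ y)

    # Numerically preferable in practice: solve the least-squares problem directly
    theta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)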
Comparison, with m training examples and n features:

Gradient Descent:
• Need to choose α.
• Needs many iterations.
• Works well even when n is large.

Normal Equation:
• No need to choose α.
• No need to iterate.
• Need to compute (XᵀX)⁻¹, roughly an O(n³) operation.
• Slow if n is very large.
Normal equation
- What if XᵀX is non-invertible (singular / degenerate)?

Common causes:
• Redundant features (linearly dependent), e.g. x₁ = size in feet² and x₂ = size in m² (then x₁ ≈ 10.76 · x₂).
• Too many features, e.g. m ≤ n.
Remedies: delete some features, or use regularization; numerically, a pseudo-inverse (pinv) still returns a solution (see the sketch below).
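A small illustration (with an assumed conversion factor) of the redundant-feature case: size in m² is just a rescaling of size in feet², so XᵀX is singular, yet the pseudo-inverse still returns a usable minimum-norm solution:

    import numpy as np

    size_ft2 = np.array([2104.0, 1416.0, 1534.0, 852.0])
    size_m2 = size_ft2 * 0.092903                 # exactly a rescaling of size_ft2
    y = np.array([460.0, 232.0, 315.0, 178.0])

    X = np.column_stack([np.ones_like(size_ft2), size_ft2, size_m2])
    print(np.linalg.matrix_rank(X.T @ X))         # 2 < 3, so X^T X is singular
    theta = np.linalg.pinv(X.T @ X) @ (X.T @ y)   # the pseudo-inverse still returns a solution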
Linear Regression and Correlation
Summary So Far
▪ Given: Set of known (x,y) points
▪ Find: function f(x)=ax+b that “best fits” the
known points, i.e., f(x) is close to y
▪ Use function to predict y values for new x’s
➢ Also can be used to test correlation
Correlation and Causation
Correlation – Values track each other
• Height and Shoe Size
• Grades and Entrance Exam Scores
Causation – One value directly influences another
• Education Level → Starting Salary
• Temperature → Cold Drink Sales
Correlation and Causation (from Overview)
Correlation – Values track each other
• Height and Shoe Size
• Grades and Entrance Exam Scores
Find: function f(x)=ax+b that “best fits” the
known points, i.e., f(x) is close to y
The better the function
fits the points, the more
correlated x and y are
Regression and Correlation
The better the function fits the points,
the more correlated x and y are
▪ Linear functions only
▪ Correlation – Values track each other
Positively – when one goes up the other goes up
▪ Also negative correlation
When one goes up the other goes down
• Latitude versus temperature
• Car weight versus gas mileage
• Class absences versus final grade
Calculating Simple Linear Regression
Method of least squares
▪ Given a point and a line, the error for the point
is its vertical distance d from the line, and the
squared error is d 2
▪ Given a set of points and a line, the sum of squared errors (SSE) is the sum of the squared errors for all the points
▪ Goal: Given a set of points, find the line that
minimizes the SSE
Calculating Simple Linear Regression
Method of least squares

[Figure: five data points and a candidate line, with the vertical distances d₁ … d₅ from each point to the line marked.]

SSE = d₁² + d₂² + d₃² + d₄² + d₅²
Calculating Simple Linear Regression
Method of least squares

Goal: find the line that minimizes the SSE = d₁² + d₂² + d₃² + d₄² + d₅².

How to find it:
- Gradient descent
- Normal equation
- Software packages, e.g. NumPy polyfit (see the sketch below)
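A sketch of the "software packages" route with NumPy's polyfit (degree 1 gives the least-squares line; the data values are made up for illustration):

    import numpy as np

    # Made-up (x, y) points for illustration
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

    a, b = np.polyfit(x, y, deg=1)        # least-squares line f(x) = a*x + b
    residuals = y - (a * x + b)           # signed vertical distances d_i
    sse = np.sum(residuals ** 2)          # the SSE that this line minimizes
    print(a, b, sse)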
Measuring Correlation
More help from software packages…
Pearson's Product Moment Correlation (PPMC)
• "Pearson coefficient", "correlation coefficient"
• A value r between −1 and 1:
    1    maximum positive correlation
    0    no correlation
    −1   maximum negative correlation
• Swapping the x and y axes yields the same value
Coefficient of determination ("the better the function fits the points, the more correlated x and y are")
• r², R², "R squared"
• Measures the fit of any line/curve to a set of points
• Usually between 0 and 1
• For simple linear regression, R² = Pearson²
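A sketch of computing the Pearson coefficient and R² with NumPy (scipy.stats.pearsonr would work as well; the data points are the made-up ones from the sketch above):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

    r = np.corrcoef(x, y)[0, 1]                   # Pearson coefficient, between -1 and 1

    # Coefficient of determination of the least-squares line
    a, b = np.polyfit(x, y, deg=1)
    ss_res = np.sum((y - (a * x + b)) ** 2)       # residual sum of squares (SSE)
    ss_tot = np.sum((y - y.mean()) ** 2)          # total sum of squares
    r_squared = 1.0 - ss_res / ss_tot
    print(r, r ** 2, r_squared)                   # for simple linear regression, R^2 = r^2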
Correlation Game
http://aionet.eu/corguess (*)
Try to get:
Right answers ≥ 10, Guesses ≤ Right answers × 2
Anti-cheating: Pictures = Right answers + 1
(*) Improved version of the "Wilderdom correlation guessing game", thanks to participant Marcin Piotrowski from Poland
Other correlation games:
http://guessthecorrelation.com/
http://www.rossmanchance.com/applets/GuessCorrelation.html
http://www.istics.net/Correlations/
Locally-Weighted Regression
Recap
m examples (x^(i), y^(i)); n features; y^(i) ∈ ℝ; x₀ = 1.

    h_Θ(x) = Σ_{j=0}^{n} Θ_j x_j = Θᵀx        J(Θ) = (1/2) Σ_{i=1}^{m} (h_Θ(x^(i)) − y^(i))²
Recap: choice of features, e.g.
    Θ₀ + Θ₁x
    Θ₀ + Θ₁x + Θ₂x²
    Θ₀ + Θ₁x + Θ₂√x + Θ₃ log(x)
[Figure: price (y) vs. size (x) with the corresponding fitted curves.]
Recap: dangers of polynomial regression
Overfitting and Underfitting
Locally-weighted regression
• "Parametric learning algorithm": a fixed set of parameters is fit to the data.
• "Non-parametric learning algorithm": the number of parameters grows with the data (more data to keep in memory).
• Locally-weighted regression, also named Loess or Lowess, is non-parametric.
[Figure: data y vs. x with a curve fitted locally around each query point.]
Locally-weighted regression: to make a prediction at a query point x, fit Θ to minimize Σᵢ w^(i) (y^(i) − Θᵀx^(i))², where the weights w^(i) = exp(−(x^(i) − x)² / (2τ²)) are close to 1 for training points near x and close to 0 for points far away (τ is the bandwidth parameter); then output Θᵀx. The fit is redone for every query point, which is why the whole training set must be kept in memory.
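A minimal sketch of locally-weighted linear regression with the Gaussian weights above (the bandwidth τ and the synthetic data are assumptions for illustration; library implementations such as statsmodels' lowess differ in details):

    import numpy as np

    def lwr_predict(x_query, x, y, tau=0.5):
        """Locally-weighted linear regression prediction at a single query point."""
        w = np.exp(-(x - x_query) ** 2 / (2.0 * tau ** 2))   # ~1 near x_query, ~0 far away
        X = np.column_stack([np.ones_like(x), x])            # [1, x] design matrix
        W = np.diag(w)
        # Weighted least squares: minimize sum_i w_i * (y_i - theta^T x_i)^2
        theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
        return theta[0] + theta[1] * x_query                 # the fit is redone per query point

    rng = np.random.default_rng(0)
    x = np.sort(rng.uniform(0.0, 10.0, 50))
    y = np.sin(x) + 0.1 * rng.standard_normal(50)
    y_hat = np.array([lwr_predict(q, x, y, tau=0.5) for q in x])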
Probabilistic Interpretation of Least Squares
Why Least Squares
• Assume y^(i) = Θᵀx^(i) + ε^(i), where ε^(i) captures unmodelled effects and random noise, and the ε^(i) are assumed i.i.d. Gaussian: ε^(i) ~ N(0, σ²).
Why Least Squares
• Likelihood of the parameters: L(Θ) = P(y⃗ | X; Θ) = Π_{i=1}^{m} P(y^(i) | x^(i); Θ), by the independence of the ε^(i).
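A sketch of the standard derivation that the slide builds towards (assuming i.i.d. Gaussian noise ε^(i) ~ N(0, σ²), as above), written out in LaTeX:

    % Under the Gaussian noise assumption,
    % p(y^{(i)} \mid x^{(i)}; \Theta)
    %   = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\Big(-\frac{(y^{(i)} - \Theta^{T} x^{(i)})^{2}}{2\sigma^{2}}\Big).
    \begin{align*}
      L(\Theta) &= \prod_{i=1}^{m} p\big(y^{(i)} \mid x^{(i)}; \Theta\big) \\
      \ell(\Theta) = \log L(\Theta)
        &= m \log\frac{1}{\sqrt{2\pi}\,\sigma}
         - \frac{1}{2\sigma^{2}} \sum_{i=1}^{m} \big(y^{(i)} - \Theta^{T} x^{(i)}\big)^{2}
    \end{align*}
    % Maximizing \ell(\Theta) over \Theta is therefore the same as minimizing
    % \tfrac{1}{2}\sum_{i}(y^{(i)} - \Theta^{T}x^{(i)})^{2} = J(\Theta):
    % least squares is maximum likelihood under i.i.d. Gaussian noise
    % (the value of \sigma does not affect the argmax).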
References
• Chapter 3.1 in [Bishop, 2006. Pattern Recognition and Machine Learning]
Thank you
Acknowledgements: slides and material from Andrew Ng, Eric Xing, Matthew R. Gormley, Jessica Wu