Linear Regression
Yijun Zhao
Northeastern University
Fall 2016
Regression Examples
x (any attributes) =⇒ y (continuous value)
{age, major, gender, race} =⇒ GPA
{income, credit score, profession} =⇒ loan
{college, major, GPA} =⇒ future income
...
Regression Examples
Data often comes in, or can be converted into, matrix form:
Age Gender Race Major GPA
20 0 A Art 3.85
22 0 C Engineer 3.90
25 1 A Engineer 3.50
24 0 AA Art 3.60
19 1 H Art 3.70
18 1 C Engineer 3.00
30 0 AA Engineer 3.80
25 0 C Engineer 3.95
28 1 A Art 4.00
26 0 C Engineer 3.20
Formal Problem Setup
Given N observations
{(x1, y1), (x2, y2), . . . , (xN, yN)}
a regression problem tries to uncover the function
yi = f(xi), ∀ i = 1, 2, . . . , N
such that, for a new input value x∗, we can
accurately predict the corresponding value
y∗ = f(x∗).
Linear Regression
Assume the function f is a linear combination
of components in x
Formally, let x = (1, x1, x2, . . . , xd)^T; we have
y = ω0 + ω1 x1 + ω2 x2 + · · · + ωd xd = w^T x
where w = (ω0, ω1, ω2, . . . , ωd)^T
w is the parameter to estimate!
Prediction:
y∗ = w^T x∗
Visual Illustration
Figure: 1D and 2D linear regression
Error Measure
Mean Squared Error (MSE):
E(w) = (1/N) Σ_{n=1}^{N} (w^T xn − yn)^2
     = (1/N) ‖Xw − y‖^2
where X is the N × (d+1) matrix with rows x1^T, x2^T, . . . , xN^T
and y = (y1, y2, . . . , yN)^T
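As a concrete illustration, here is a minimal NumPy sketch (the toy data and variable names are illustrative, not from the slides) confirming that the per-observation sum and the matrix form compute the same quantity:

    import numpy as np

    rng = np.random.default_rng(0)
    X = np.hstack([np.ones((5, 1)), rng.normal(size=(5, 2))])  # N x (d+1), first column is x0 = 1
    y = rng.normal(size=5)
    w = rng.normal(size=3)

    # MSE as an explicit sum over the N observations ...
    mse_sum = np.mean([(w @ X[n] - y[n]) ** 2 for n in range(len(y))])
    # ... and as the matrix form (1/N) * ||Xw - y||^2
    mse_matrix = np.sum((X @ w - y) ** 2) / len(y)
    print(mse_sum, mse_matrix)  # the two numbers agree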
Minimizing Error Measure
E(w) = (1/N) ‖Xw − y‖^2
∇E(w) = (2/N) X^T (Xw − y) = 0
=⇒ X^T X w = X^T y
=⇒ w = X† y
where X† = (X^T X)^{−1} X^T is the
'pseudo-inverse' of X
LR Algorithm Summary
Ordinary Least Squares (OLS) Algorithm
Construct the matrix X and the vector y from
the dataset {(x1, y1), (x2, y2), . . . , (xN, yN)}
(each x includes x0 = 1) as follows:
X = matrix with rows x1^T, x2^T, . . . , xN^T,  y = (y1, y2, . . . , yN)^T
Compute X† = (X^T X)^{−1} X^T
Return w = X† y
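A minimal NumPy sketch of these steps (the helper name ols_fit and the toy data are illustrative, not part of the slides):

    import numpy as np

    def ols_fit(X_raw, y):
        """Ordinary least squares: w = (X^T X)^(-1) X^T y with a prepended bias column."""
        N = X_raw.shape[0]
        X = np.hstack([np.ones((N, 1)), X_raw])   # each row starts with x0 = 1
        X_pinv = np.linalg.inv(X.T @ X) @ X.T     # 'pseudo-inverse' (X^T X)^(-1) X^T
        return X_pinv @ y                         # w = X† y

    # Usage: recover known weights from noise-free data.
    rng = np.random.default_rng(1)
    X_raw = rng.normal(size=(100, 3))
    true_w = np.array([0.5, 2.0, -1.0, 3.0])      # (ω0, ω1, ω2, ω3)
    y = np.hstack([np.ones((100, 1)), X_raw]) @ true_w
    print(ols_fit(X_raw, y))                      # ≈ [0.5, 2.0, -1.0, 3.0]

In practice, np.linalg.pinv or np.linalg.lstsq is usually preferred when X^T X is close to singular.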
Gradient Descent
Why?
Minimize our target function E(w) by
moving in the direction of steepest descent
Gradient Descent
Gradient Descent Algorithm
Initialize the weights w(0) at time t = 0
for t = 0, 1, 2, . . . do
    Compute the gradient gt = ∇E(w(t))
    Set the direction to move: vt = −gt
    Update w(t + 1) = w(t) + η vt
    Iterate until it is time to stop
Return the final weights w
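A minimal sketch of this loop for the MSE objective, assuming a fixed learning rate η and a fixed iteration budget as the stopping rule (both are illustrative choices, not prescribed by the slides):

    import numpy as np

    def gradient_descent(X, y, eta=0.1, n_iters=1000):
        """Minimize E(w) = (1/N) ||Xw - y||^2 with fixed-step gradient descent."""
        N, d = X.shape
        w = np.zeros(d)                            # initialize w(0)
        for _ in range(n_iters):                   # for t = 0, 1, 2, ...
            g = (2.0 / N) * X.T @ (X @ w - y)      # gradient g_t = ∇E(w(t))
            w = w - eta * g                        # step in the direction v_t = -g_t
        return w                                   # final weights

Applied to the X and y built in the OLS sketch, this converges to essentially the same w as the closed-form solution.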
Gradient Descent
How does η affect the algorithm?
Use η = 0.1 (practical observation)
Use a variable step size: ηt = η ‖∇E‖
OLS or Gradient Descent?
Computational Complexity
OLS vs. Gradient Descent:
OLS is expensive when the number of features d is large!
Linear Regression
What is the Probabilistic Interpretation?
Normal Distribution
Figure: right-skewed, left-skewed, random, and normal distributions
Normal Distribution
mean = median = mode
symmetry about the center
x ∼ N(µ, σ^2) =⇒ f(x) = (1 / (σ√(2π))) e^{−(x−µ)^2 / (2σ^2)}
Central Limit Theorem
All things bell shaped!
Random occurrences over a large population
tend to wash out the asymmetry and
uniformity of individual events. A more
'natural' distribution ensues. The name for it is
the Normal distribution (the bell curve).
Formal definition: if (y1, . . . , yn) are i.i.d. and
0 < σy^2 < ∞, then when n is large the
distribution of ȳ is well approximated by a
normal distribution N(µy, σy^2 / n).
Central Limit Theorem
Example:
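In place of the example figure, here is a small simulation in the same spirit (the exponential distribution and the sample sizes are illustrative choices, not from the slides): individual draws are strongly skewed, yet their sample means look normal.

    import numpy as np

    rng = np.random.default_rng(2)
    # Individual draws come from a strongly right-skewed exponential distribution ...
    draws = rng.exponential(scale=1.0, size=(10_000, 50))
    # ... yet the mean of each row of n = 50 draws is approximately N(1, 1/50).
    sample_means = draws.mean(axis=1)
    print(sample_means.mean(), sample_means.var())   # ≈ 1.0 and ≈ 0.02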
LR: Probabilistic Interpretation
prob(yi | xi) = (1 / (σ√(2π))) e^{−(w^T xi − yi)^2 / (2σ^2)}
LR: Probabilistic Interpretation
Likelihood of the entire dataset:
L ∝ ∏_i e^{−(w^T xi − yi)^2 / (2σ^2)}
  = e^{−Σ_i (w^T xi − yi)^2 / (2σ^2)}
Maximize L ⇐⇒ Minimize Σ_i (w^T xi − yi)^2
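One intermediate step makes the equivalence explicit: taking logs of the likelihood above (a short sketch in LaTeX, using the same Gaussian noise model as the slide):

    \log L = \text{const} \;-\; \frac{1}{2\sigma^2} \sum_i \big(\mathbf{w}^T \mathbf{x}_i - y_i\big)^2

Since the logarithm is monotone and σ is fixed, maximizing L (or log L) over w is the same as minimizing the sum of squared errors, i.e. ordinary least squares.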
Non-linear Transformation
Linear is limited:
Linear models become powerful when we
consider non-linear feature transformations:
Xi = (1, xi, xi^2) =⇒ yi = ω0 + ω1 xi + ω2 xi^2
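A minimal sketch of this quadratic feature transformation (the helper name quadratic_features and the toy target are illustrative, not from the slides); the model is still linear in the weights, so OLS applies unchanged:

    import numpy as np

    def quadratic_features(x):
        """Map each scalar input x_i to the transformed vector (1, x_i, x_i^2)."""
        x = np.asarray(x, dtype=float)
        return np.column_stack([np.ones_like(x), x, x ** 2])

    # A linear model in the transformed features fits a quadratic target exactly.
    x = np.linspace(-1.0, 1.0, 50)
    y = 1.0 + 2.0 * x - 3.0 * x ** 2
    Z = quadratic_features(x)
    w = np.linalg.pinv(Z) @ y
    print(w)   # ≈ [1, 2, -3]  (ω0, ω1, ω2)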
Overfitting
How do we know we have overfit?
Ein : Error from the training data
Eout : Error from the test data
Example:
Overfitting
How to avoid overfitting?
Use more data
Evaluate on a parameter tuning set
Regularization
Regularization
Attempts to impose the "Occam's razor" principle
Add a penalty term for model complexity
Most commonly used:
L2 regularization (ridge regression) minimizes:
E(w) = ‖Xw − y‖^2 + λ ‖w‖^2
where λ ≥ 0 and ‖w‖^2 = w^T w
L1 regularization (LASSO) minimizes:
E(w) = ‖Xw − y‖^2 + λ |w|_1
where λ ≥ 0 and |w|_1 = Σ_{i=1}^{D} |ωi|
Regularization
L2: closed-form solution
w = (X^T X + λI)^{−1} X^T y
L1: no closed-form solution. Use quadratic
programming:
minimize ‖Xw − y‖^2  s.t.  ‖w‖_1 ≤ s
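A minimal NumPy sketch of the L2 closed-form solution above (the helper name ridge_fit is illustrative; this version penalizes every weight, including ω0, exactly as the formula is written, although some conventions exclude the bias):

    import numpy as np

    def ridge_fit(X, y, lam):
        """L2-regularized least squares: w = (X^T X + λ I)^(-1) X^T y."""
        d = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

λ = 0 recovers the OLS solution; larger λ shrinks the weights toward zero.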
L2 Regularization Example
Model Selection
Which model?
A central problem in supervised learning
Simple model: "underfits" the data
    Constant function
    Linear model applied to quadratic data
Complex model: "overfits" the data
    High-degree polynomials
    A model with hidden logic that fits the data completely
Bias-Variance Trade-off
Consider E[ (1/N) Σ_{n=1}^{N} (w^T xn − yn)^2 ];  let ŷn = w^T xn
E[ (ŷn − yn)^2 ] can be decomposed into (reading):
var{noise} + bias^2 + var{ŷ}
var{noise}: can't be reduced
bias^2 + var{ŷ} is what counts for prediction
High bias^2: model mismatch, often due to "underfitting"
High var{ŷ}: training set and test set mismatch, often due to "overfitting"
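For the reading, here is a compact LaTeX sketch of where the three terms come from, under the usual assumptions that y = f(x) + ε with E[ε] = 0 and var(ε) = σ², ε independent of the training set, and the expectation taken over training sets and noise:

    \mathbb{E}\big[(\hat{y} - y)^2\big]
      = \underbrace{\sigma^2}_{\mathrm{var}\{\text{noise}\}}
      + \underbrace{\big(\mathbb{E}[\hat{y}] - f(x)\big)^2}_{\mathrm{bias}^2}
      + \underbrace{\mathbb{E}\big[(\hat{y} - \mathbb{E}[\hat{y}])^2\big]}_{\mathrm{var}\{\hat{y}\}}

The cross terms vanish because ε has mean zero and is independent of ŷ, and E[ŷ − E[ŷ]] = 0.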
Bias-Variance Trade-off
Often: low bias ⇒ high variance
low variance ⇒ high bias
Trade-off:
How to choose λ?
But we still need to pick λ.
Use the test set data? NO!
Set aside another evaluation set
Small evaluation set ⇒ inaccurate error estimate
Large evaluation set ⇒ small training set
Cross-validation
Cross Validation (CV)
Divide the data into K folds
Alternately train on all folds except the k-th, and
test on the k-th fold, for k = 1, . . . , K
Cross Validation (CV)
How to choose K?
Common choices: K = 5, 10, or N (LOOCV)
Measure the average performance across the K folds
Computational cost: K folds × number of λ values
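A minimal sketch of K-fold cross-validation for choosing λ, reusing the ridge_fit sketch from the regularization slide (the fold construction and the candidate λ grid are illustrative assumptions):

    import numpy as np

    def cv_error(X, y, lam, K=5):
        """Average validation MSE of ridge regression over K folds."""
        N = X.shape[0]
        folds = np.array_split(np.random.default_rng(3).permutation(N), K)
        errors = []
        for k in range(K):
            val = folds[k]                                         # hold out the k-th fold
            train = np.hstack([folds[j] for j in range(K) if j != k])
            w = ridge_fit(X[train], y[train], lam)                 # fit on the other K-1 folds
            errors.append(np.mean((X[val] @ w - y[val]) ** 2))     # error on the held-out fold
        return np.mean(errors)

    # Pick the λ with the lowest average validation error over a candidate grid:
    # best_lam = min([0.01, 0.1, 1.0, 10.0], key=lambda lam: cv_error(X, y, lam))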
Learning Curve
A learning curve plots the performance of the
algorithm as a function of the training set size
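A minimal sketch of computing points on such a curve with OLS (the function name, the size grid, and the use of a fixed held-out test set are illustrative assumptions, not from the slides):

    import numpy as np

    def learning_curve(X_train, y_train, X_test, y_test, sizes):
        """Training and test MSE of OLS as a function of the number of training examples."""
        points = []
        for n in sizes:
            w = np.linalg.pinv(X_train[:n]) @ y_train[:n]             # fit on the first n examples
            e_in = np.mean((X_train[:n] @ w - y_train[:n]) ** 2)      # E_in: training error
            e_out = np.mean((X_test @ w - y_test) ** 2)               # E_out: test error
            points.append((n, e_in, e_out))
        return points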