Linear Regression

Yijun Zhao
Northeastern University

Fall 2016

Regression Examples
Any attributes x =⇒ continuous value y

{age, major, gender, race} ⇒ GPA
{income, credit score, profession} ⇒ loan
{college, major, GPA} ⇒ future income
. . .

Regression Examples
Data often has, or can be converted into, matrix form:

Age Gender Race Major GPA


20 0 A Art 3.85
22 0 C Engineer 3.90
25 1 A Engineer 3.50
24 0 AA Art 3.60
19 1 H Art 3.70
18 1 C Engineer 3.00
30 0 AA Engineer 3.80
25 0 C Engineer 3.95
28 1 A Art 4.00
26 0 C Engineer 3.20

Formal Problem Setup
Given N observations

{(x1 , y1 ), (x2 , y2 ), . . . , (xN , yN )}

a regression problem tries to uncover the function

yi = f (xi )   ∀ i = 1, 2, . . . , N

such that, for a new input value x∗ , we can accurately predict the corresponding value

y∗ = f (x∗ ).

Linear Regression
Assume the function f is a linear combination of the components of x

Formally, let x = (1, x1 , x2 , . . . , xd )T ; then

y = ω0 + ω1 x1 + ω2 x2 + · · · + ωd xd
  = wT x

where w = (ω0 , ω1 , ω2 , . . . , ωd )T

w is the parameter to estimate!

Prediction:  y∗ = wT x∗
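A minimal numpy sketch of the prediction step (the weight and feature values below are invented purely for illustration):

import numpy as np

w = np.array([0.5, 0.03, -0.2])       # hypothetical (w0, w1, w2); w0 is the intercept
x_new = np.array([1.0, 21.0, 3.7])    # new input with x0 = 1 prepended

y_pred = w @ x_new                    # y* = wT x*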

Visual Illustration

Figure: 1D and 2D linear regression

Error Measure
Mean Squared Error (MSE):
E (w) = (1/N) Σn=1..N (wT xn − yn )2
      = (1/N) ‖ Xw − y ‖2

where X is the N × (d + 1) matrix whose rows are x1T , x2T , . . . , xNT ,
and y = (y1 , y2 , . . . , yN )T
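A quick numerical check that the summation form and the norm form agree (random data, purely illustrative):

import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])   # rows are xnT, with x0 = 1
y = rng.normal(size=N)
w = rng.normal(size=d + 1)

mse_sum = np.mean((X @ w - y) ** 2)                 # (1/N) sum of (wT xn - yn)^2
mse_norm = np.linalg.norm(X @ w - y) ** 2 / N       # (1/N) ||Xw - y||^2
assert np.isclose(mse_sum, mse_norm)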

Minimizing Error Measure
E (w) = (1/N) ‖ Xw − y ‖2

∇E (w) = (2/N) XT (Xw − y) = 0

XT X w = XT y

w = X† y

where X† = (XT X)−1 XT is the 'pseudo-inverse' of X

LR Algorithm Summary
Ordinary Least Squares (OLS) Algorithm
Construct the matrix X and the vector y from the dataset
{(x1 , y1 ), (x2 , y2 ), . . . , (xN , yN )} (each x includes x0 = 1):
X is the matrix whose rows are x1T , x2T , . . . , xNT , and y = (y1 , y2 , . . . , yN )T
Compute X† = (XT X)−1 XT
Return w = X† y
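A minimal numpy sketch of the algorithm on synthetic data (the data-generating coefficients are made up; in practice np.linalg.lstsq or np.linalg.solve is numerically safer than forming the inverse explicitly):

import numpy as np

rng = np.random.default_rng(1)
N = 200
X_raw = rng.normal(size=(N, 2))
y = 1.0 + 2.0 * X_raw[:, 0] - 3.0 * X_raw[:, 1] + 0.1 * rng.normal(size=N)

X = np.hstack([np.ones((N, 1)), X_raw])      # each x includes x0 = 1
X_pinv = np.linalg.inv(X.T @ X) @ X.T        # X† = (XT X)^-1 XT
w = X_pinv @ y                               # w = X† y, roughly (1.0, 2.0, -3.0)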

Gradient Descent
Why?
Minimize our target function E (w) by moving downhill in the direction of steepest descent

Gradient Descent

Gradient Descent Algorithm


Initialize the weights w(0) at time t = 0
for t = 0, 1, 2, . . . do
    Compute the gradient gt = ∇E (w(t))
    Set the direction to move: vt = −gt
    Update w(t + 1) = w(t) + ηvt
    Iterate until it is time to stop
Return the final weights w
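A short numpy sketch of batch gradient descent on the MSE objective (the fixed learning rate and iteration count are illustrative choices, not prescribed by the slides):

import numpy as np

def gradient_descent(X, y, eta=0.1, n_iters=1000):
    """Minimize E(w) = (1/N) ||Xw - y||^2 by batch gradient descent."""
    N, D = X.shape
    w = np.zeros(D)                           # w(0)
    for _ in range(n_iters):
        g = (2.0 / N) * X.T @ (X @ w - y)     # gt = gradient of E at w(t)
        w = w - eta * g                       # w(t+1) = w(t) - eta * gt
    return w

With reasonably scaled features and a small enough η, this converges to the same w as the OLS closed form.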

Gradient Descent
How does η affect the algorithm?

Use η = 0.1 (a practical observation)

Use a variable step size: ηt = η ‖ ∇E ‖

OLS or Gradient Descent?

Computational Complexity

[Table: computational cost of OLS vs. Gradient Descent]

OLS is expensive when D is large: forming XT X costs O(N D2 ) and inverting it costs O(D3 )!

Linear Regression

What is the Probabilistic Interpretation?

Normal Distribution

[Figure: right-skewed, left-skewed, and random distributions vs. the normal distribution]

Normal Distribution
mean = median = mode
symmetry about the center
x ∼ N(µ, σ2 ) =⇒ f (x) = (1/(σ√(2π))) exp(−(x − µ)2 / (2σ2 ))
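A quick numerical check of the density formula against scipy's implementation (µ and σ chosen arbitrarily):

import numpy as np
from scipy.stats import norm

mu, sigma = 1.5, 2.0
x = np.linspace(-5.0, 8.0, 7)
pdf_manual = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
assert np.allclose(pdf_manual, norm.pdf(x, loc=mu, scale=sigma))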

Central Limit Theorem
All things bell shaped!
Random occurrences over a large population tend to wash out the
asymmetry and irregularity of individual events, and a more 'natural'
distribution ensues: the Normal distribution (the bell curve).

Formal definition: if (y1 , . . . , yn ) are i.i.d. with 0 < σy2 < ∞, then
when n is large the distribution of ȳ is well approximated by the
normal distribution N(µy , σy2 /n).
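A small simulation sketch of this statement using a skewed exponential population with µy = σy = 1 (sample size and repetition count are arbitrary):

import numpy as np

rng = np.random.default_rng(2)
n, reps = 200, 5000
# draw `reps` independent samples of size n, then take the mean of each sample
means = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)

print(means.mean())   # close to mu_y = 1
print(means.std())    # close to sigma_y / sqrt(n) = 1 / sqrt(200), about 0.071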

Central Limit Theorem
Example:

LR: Probabilistic Interpretation

LR: Probabilistic Interpretation

prob(yi |xi ) = (1/(√(2π) σ)) exp(−(wT xi − yi )2 / (2σ2 ))

LR: Probabilistic Interpretation
Likelihood of the entire dataset:
L ∝ ∏i exp(−(wT xi − yi )2 / (2σ2 ))

  = exp(−Σi (wT xi − yi )2 / (2σ2 ))

Taking the log turns the product into a (negative) sum of squared errors, so

Maximize L ⇐⇒ Minimize Σi (wT xi − yi )2

Non-linear Transformation

Linear is limited:

Linear models become powerful when we consider non-linear feature
transformations:

Xi = (1, xi , xi2 ) =⇒ yi = ω0 + ω1 xi + ω2 xi2
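A sketch that fits a quadratic with the same least-squares machinery applied to the transformed features (synthetic data, invented coefficients):

import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-3, 3, size=100)
y = 1.0 - 2.0 * x + 0.5 * x ** 2 + 0.2 * rng.normal(size=100)

X = np.column_stack([np.ones_like(x), x, x ** 2])   # Xi = (1, xi, xi^2)
w, *_ = np.linalg.lstsq(X, y, rcond=None)           # still ordinary least squares
print(w)                                            # roughly [1.0, -2.0, 0.5]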

Overfitting

Overfitting
How do we know we have overfit?
Ein : Error from the training data
Eout : Error from the test data
Example:

Overfitting
How to avoid overfitting?
Use more data
Evaluate on a parameter tuning set
Regularization

Regularization
Attempts to impose the "Occam's razor" principle
Add a penalty term for model complexity
Most commonly used:
L2 regularization (ridge regression) minimizes:
E (w) = ‖ Xw − y ‖2 + λ ‖ w ‖2
where λ ≥ 0 and ‖ w ‖2 = wT w
L1 regularization (LASSO) minimizes:
E (w) = ‖ Xw − y ‖2 + λ ‖ w ‖1
where λ ≥ 0 and ‖ w ‖1 = Σi=1..D |ωi |

Regularization
L2: closed-form solution

w = (XT X + λI)−1 XT y

L1: no closed-form solution; use quadratic programming:

minimize ‖ Xw − y ‖2   s.t.   ‖ w ‖1 ≤ s
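A minimal sketch of the L2 closed form (the λ in the usage comment is arbitrary; in practice the intercept weight ω0 is usually left unpenalized, a detail omitted here for brevity):

import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: w = (XT X + lam*I)^-1 XT y."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

# e.g. with the quadratic design matrix from the earlier sketch:
# w_ridge = ridge_fit(X, y, lam=1.0)

For L1 there is no such closed form; iterative solvers such as coordinate descent are typically used instead (e.g. scikit-learn's Lasso).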

L2 Regularization Example

Model Selection
Which model?
A central problem in supervised learning
Simple model: "underfits" the data
  Constant function
  Linear model applied to quadratic data
Complex model: "overfits" the data
  High-degree polynomials
  A model with enough hidden logic to fit the data completely

Bias-Variance Trade-off
Consider E[ (1/N) Σn=1..N (wT xn − yn )2 ];  let ŷ = wT xn

E[ (ŷ − yn )2 ] can be decomposed into (reading):

var{noise} + bias2 + var{ŷ}

var{noise}: can't be reduced
bias2 + var{ŷ} is what counts for prediction
High bias2 : model mismatch, often due to "underfitting"
High var{ŷ}: training set and test set mismatch, often due to "overfitting"

Bias-Variance Trade-off
Often: low bias ⇒ high variance
low variance ⇒ high bias
Trade-off:

How to choose λ?
But we still need to pick λ.
Use the test set data? NO!

Set aside another evaluation set

Small evaluation set ⇒ inaccurate error estimate
Large evaluation set ⇒ small training set

Cross-validation

Cross Validation (CV)
Divide the data into K folds
In turn, train on all folds except the k-th, and test on the k-th fold
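A sketch of K-fold cross-validation for choosing λ, reusing the ridge closed form from the regularization slide (K = 5 and the λ grid in the usage comment are illustrative):

import numpy as np

def ridge_fit(X, y, lam):                    # as in the earlier ridge sketch
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def cv_mse(X, y, lam, K=5, seed=0):
    """Average held-out MSE of ridge regression over K folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, K)
    errs = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        w = ridge_fit(X[train], y[train], lam)               # train on all folds but the k-th
        errs.append(np.mean((X[test] @ w - y[test]) ** 2))   # test on the held-out fold
    return np.mean(errs)

# pick the lambda with the lowest cross-validated error:
# best_lam = min([0.01, 0.1, 1.0, 10.0], key=lambda lam: cv_mse(X, y, lam))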

Cross Validation (CV)
How to choose K?
Common choices are K = 5, 10, or N (leave-one-out CV, LOOCV)

Measure the average performance across the K folds

Cost of computation: K folds × choices of λ

Learning Curve
A learning curve plots the performance of the algorithm as a function of
the size of the training data
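A sketch that computes the points of such a curve for OLS (the half-and-half train/test split and the grid of training sizes are arbitrary choices):

import numpy as np

def learning_curve(X, y, sizes, seed=0):
    """Return (n, train MSE, test MSE) for OLS fitted on the first n shuffled training rows."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    half = len(y) // 2
    train_idx, test_idx = idx[:half], idx[half:]
    points = []
    for n in sizes:
        sub = train_idx[:n]
        w, *_ = np.linalg.lstsq(X[sub], y[sub], rcond=None)
        train_err = np.mean((X[sub] @ w - y[sub]) ** 2)              # Ein
        test_err = np.mean((X[test_idx] @ w - y[test_idx]) ** 2)     # Eout
        points.append((n, train_err, test_err))
    return points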

Learning Curve
