
Tony Jebara, Columbia University

Machine Learning 4771

Instructor: Tony Jebara



Topic 2
•Regression
•Empirical Risk Minimization
•Least Squares
•Higher Order Polynomials
•Under-fitting / Over-fitting
•Cross-Validation

Regression
Supervised learning: classification and regression, f(x) = y
[Figure: scatter plots of labeled data, two classes of points for classification and a curve fit through points for regression]
Unsupervised learning: density/structure estimation, clustering, feature selection, anomaly detection
[Figure: scatter plots of unlabeled data]

Function Approximation
•Start with a training dataset of N (input, output) pairs:
  X = { (x_1, y_1), (x_2, y_2), …, (x_N, y_N) },   x ∈ R^D a column vector [x(1); x(2); …; x(D)],   y ∈ R^1
•Find a function f(x) to predict y from x that fits the training data well
[Figure: scatter of (x, y) points with a curve fit through them]
•Example: predict the price of a house in dollars, y, using x = [#rooms; latitude; longitude; …]
•Need: a) a way to evaluate how good a fit we have
       b) a class of functions in which to search for f(x)

Empirical Risk Minimization
•Idea: minimize 'loss' on the training data set
•Empirical = use the training set to find the best fit
•Define a loss function measuring how well we fit a single point:  L(y, f(x))
•Empirical Risk = the average loss over the dataset:
  R = (1/N) ∑_{i=1}^N L(y_i, f(x_i))
•Simplest loss: squared error from the y value
  L(y_i, f(x_i)) = ½ (y_i − f(x_i))²
•Other possible loss: absolute error
  L(y_i, f(x_i)) = | y_i − f(x_i) |
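
As a concrete illustration of these definitions (a sketch added for this transcript, not from the slides), the snippet below computes the empirical risk under the squared and absolute losses; the toy data and the candidate linear predictor are made up.

    import numpy as np

    # Toy training set: N pairs (x_i, y_i); values are arbitrary.
    x = np.array([0.0, 1.0, 2.0, 3.0])
    y = np.array([0.1, 0.9, 2.2, 2.8])

    def empirical_risk(f, x, y, loss="squared"):
        """Average loss of f over the training pairs (x_i, y_i)."""
        preds = f(x)
        if loss == "squared":
            per_point = 0.5 * (y - preds) ** 2   # L(y_i, f(x_i)) = 1/2 (y_i - f(x_i))^2
        else:
            per_point = np.abs(y - preds)        # L(y_i, f(x_i)) = |y_i - f(x_i)|
        return per_point.mean()                  # R = (1/N) sum_i L(y_i, f(x_i))

    f = lambda x: 0.95 * x + 0.05                # some candidate linear predictor
    print(empirical_risk(f, x, y, "squared"), empirical_risk(f, x, y, "absolute"))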

Linear Function Classes
•Linear is the simplest class of functions to search over:
  f(x; θ) = θᵀx + θ_0 = ∑_{d=1}^D θ_d x(d) + θ_0
•Start with x being 1-dimensional (D=1):
  f(x; θ) = θ_1 x + θ_0
•Plug in the above & minimize the empirical risk over θ:
  R(θ) = (1/2N) ∑_{i=1}^N (y_i − θ_1 x_i − θ_0)²
•Note: the minimum occurs where R(θ) gets flat (not always!)
•Note: where R(θ) is flat, the gradient ∇_θ R = 0

Min by Gradient=0
•Gradient=0 means the partial derivatives are all zero:
  ∇_θ R = [ ∂R/∂θ_0 ; ∂R/∂θ_1 ] = [ 0 ; 0 ]
•Take partials of the empirical risk R(θ) = (1/2N) ∑_{i=1}^N (y_i − θ_1 x_i − θ_0)²:
  ∂R/∂θ_0 = (1/N) ∑_{i=1}^N (y_i − θ_1 x_i − θ_0)(−1) = 0
  ∂R/∂θ_1 = (1/N) ∑_{i=1}^N (y_i − θ_1 x_i − θ_0)(−x_i) = 0
•Solving the first equation for the intercept:
  θ_0 = (1/N) ∑_i y_i − θ_1 (1/N) ∑_i x_i
•Substituting into the second equation:
  θ_1 ∑_i x_i² = ∑_i y_i x_i − θ_0 ∑_i x_i
  θ_1 = ( ∑_i y_i x_i − (1/N) ∑_i y_i ∑_i x_i ) / ( ∑_i x_i² − (1/N) ∑_i x_i ∑_i x_i )
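
A small NumPy sketch (not from the slides) that implements these two closed-form expressions directly and checks them against a generic least-squares solver; the synthetic data is made up.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, size=50)
    y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=50)   # noisy line: slope 2, intercept 1

    N = len(x)
    # theta_1 = (sum y_i x_i - (1/N) sum y_i sum x_i) / (sum x_i^2 - (1/N) (sum x_i)^2)
    theta1 = (np.sum(y * x) - np.sum(y) * np.sum(x) / N) / (np.sum(x**2) - np.sum(x)**2 / N)
    # theta_0 = (1/N) sum y_i - theta_1 (1/N) sum x_i
    theta0 = y.mean() - theta1 * x.mean()

    # Cross-check with least squares on the design matrix with rows [1, x_i]
    X = np.column_stack([np.ones(N), x])
    print(theta0, theta1)
    print(np.linalg.lstsq(X, y, rcond=None)[0])   # should agree: [theta0, theta1]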

Properties of the Solution
•Setting θ* as before gives the least squared error fit
•Define the error (residual) on each data point as:
  e_i = y_i − θ*_1 x_i − θ*_0
•Note property #1: at the solution, ∂R/∂θ_0 = (1/N) ∑_{i=1}^N (y_i − θ_1 x_i − θ_0) = 0
  …so the average error is zero: (1/N) ∑_i e_i = 0
•Note property #2: at the solution, ∂R/∂θ_1 = (1/N) ∑_{i=1}^N (y_i − θ_1 x_i − θ_0) x_i = 0
  …so the error is not correlated with the data: (1/N) ∑_i e_i x_i = (1/N) eᵀx = 0
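
A quick numerical check of these two properties (an illustration with made-up data, not from the slides): the least-squares residuals average to zero and are uncorrelated with the inputs.

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.uniform(-3, 3, size=200)
    y = -1.5 * x + 0.7 + rng.normal(scale=0.3, size=200)

    X = np.column_stack([np.ones_like(x), x])              # rows [1, x_i]
    theta0, theta1 = np.linalg.lstsq(X, y, rcond=None)[0]

    e = y - theta1 * x - theta0                            # residuals e_i
    print(e.mean())                # property #1: (1/N) sum_i e_i is ~ 0
    print(np.dot(e, x) / len(x))   # property #2: (1/N) e^T x is ~ 0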

Multi-Dimensional Regression
•More elegant/general to solve ∇_θ R = 0 with linear algebra
•Rewrite the empirical risk in vector-matrix notation:
  R(θ) = (1/2N) ∑_{i=1}^N (y_i − θ_1 x_i − θ_0)²
       = (1/2N) ∑_{i=1}^N ( y_i − [1  x_i] [θ_0; θ_1] )²
       = (1/2N) ‖ y − Xθ ‖²
  where y = [y_1; …; y_N] and X has rows [1  x_i]
•Can add more dimensions by adding columns to the X matrix and rows to the θ vector:
  the rows of X become [1  x_i(1)  …  x_i(D)]  and  θ = [θ_0; θ_1; …; θ_D]

Multi-Dimensional Regression
•More realistic dataset: many measurements
•Have N apartments, each with D measurements
•Each row of X is [#rooms; latitude; longitude; …] (with a leading 1 for the intercept):
  X = [ 1  x_1(1)  …  x_1(D)
        ⋮    ⋮    ⋱     ⋮
        1  x_N(1)  …  x_N(D) ]

Multi-Dimensional Regression
•Solving gradient=0:  ∇_θ R = 0
  ∇_θ ( (1/2N) ‖ y − Xθ ‖² ) = 0
  (1/2N) ∇_θ ( (y − Xθ)ᵀ(y − Xθ) ) = 0
  (1/2N) ∇_θ ( yᵀy − 2yᵀXθ + θᵀXᵀXθ ) = 0
  (1/2N) ( −2yᵀX + 2θᵀXᵀX ) = 0
•Useful vector derivative identities:
  ∂(uᵀθ)/∂θ = uᵀ     ∂(θᵀθ)/∂θ = 2θᵀ     ∂(θᵀAθ)/∂θ = θᵀ(A + Aᵀ)
•Transposing gives the normal equations:
  XᵀXθ = Xᵀy
  θ* = (XᵀX)⁻¹ Xᵀy
•In Matlab: "t=pinv(X)*y" or "t=X\y" or "t=inv(X'*X)*X'*y"
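
A NumPy counterpart of the Matlab one-liners above (an illustrative sketch; the random data is made up): all three routes give the same θ* when XᵀX is invertible.

    import numpy as np

    rng = np.random.default_rng(2)
    N, D = 100, 3
    X = np.column_stack([np.ones(N), rng.normal(size=(N, D))])    # leading column of 1s for theta_0
    true_theta = np.array([0.5, 1.0, -2.0, 3.0])
    y = X @ true_theta + rng.normal(scale=0.1, size=N)

    theta_pinv   = np.linalg.pinv(X) @ y                          # like t = pinv(X)*y
    theta_lstsq  = np.linalg.lstsq(X, y, rcond=None)[0]           # like t = X\y
    theta_normal = np.linalg.solve(X.T @ X, X.T @ y)              # solves X^T X theta = X^T y
    print(theta_pinv, theta_lstsq, theta_normal)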

Multi-Dimensional Regression
•Solving gradient=0 gives  XᵀXθ = Xᵀy,  so  θ* = (XᵀX)⁻¹ Xᵀy
•In Matlab: "t=pinv(X)*y" or "t=X\y" or "t=inv(X'*X)*X'*y"
•If the matrix X is skinny, the solution is probably unique
•If X is fat (more dimensions than points), we get multiple solutions for θ which give zero error
•The pseudoinverse (pinv(X)) returns the θ with zero error and the smallest norm:
  min_θ ‖θ‖²  such that  Xθ = y
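
To see the fat-X case concretely, here is a small sketch (not from the slides): with more columns than rows, many θ fit the data exactly, and the pseudoinverse picks the one with the smallest norm.

    import numpy as np

    rng = np.random.default_rng(3)
    N, D = 5, 20                            # fat X: more dimensions than points
    X = rng.normal(size=(N, D))
    y = rng.normal(size=N)

    theta_min_norm = np.linalg.pinv(X) @ y              # zero-error solution with smallest norm
    # Another zero-error solution: add a vector from the null space of X
    _, _, Vt = np.linalg.svd(X)
    theta_other = theta_min_norm + Vt[-1]               # Vt[-1] satisfies X @ Vt[-1] ~ 0
    print(np.allclose(X @ theta_min_norm, y), np.allclose(X @ theta_other, y))   # both fit exactly
    print(np.linalg.norm(theta_min_norm), np.linalg.norm(theta_other))           # min-norm is smaller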

2D Linear Regression
•Once the best θ* is found, we can plug it into the function:
  f(x; θ*) = θ*_2 x(2) + θ*_1 x(1) + θ*_0
•What would a fat X look like?



Polynomial Function Classes
•Back to 1-dimensional x (D=1), BUT nonlinear
•Polynomial:  f(x; θ) = ∑_{p=1}^P θ_p x^p + θ_0
•Writing the risk:
  R(θ) = (1/2N) ‖ y − Xθ ‖²   with X rows [1  x_i  x_i²  …  x_i^P]  and  θ = [θ_0; θ_1; …; θ_P]
•Order-P polynomial regression for a 1D variable is the same as P-dimensional linear regression!
•Construct a multidimensional x-vector from the scalar x_i:  x_i → [x_i^0  x_i^1  x_i^2  x_i^3]ᵀ
•More generally, use any features:  x_i → [φ_0(x_i)  φ_1(x_i)  φ_2(x_i)  φ_3(x_i)]ᵀ

Underfitting/Overfitting
•Try varying P. Higher P fits a more complex function class
•Observe R(θ*) drops with bigger P
[Figure: polynomial fits to the same data for P = 0, 1, 2, 5, 10, 15]
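
A sketch of the point made here (illustrative, not from the slides): build the order-P design matrix with columns x^0, x^1, …, x^P, fit by least squares, and watch the training risk fall as P grows.

    import numpy as np

    rng = np.random.default_rng(4)
    x = rng.uniform(-1, 1, size=30)
    y = np.sin(3 * x) + rng.normal(scale=0.1, size=30)       # nonlinear target with noise

    def poly_design(x, P):
        # Columns [x^0, x^1, ..., x^P]; same as P-dimensional linear regression.
        return np.vander(x, P + 1, increasing=True)

    for P in [0, 1, 2, 5, 10, 15]:
        X = poly_design(x, P)
        theta = np.linalg.lstsq(X, y, rcond=None)[0]
        R_train = 0.5 * np.mean((y - X @ theta) ** 2)
        print(P, R_train)                                    # training risk keeps dropping with P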



Evaluating The Regression
•Unfair to use the empirical (training) risk to find the best order P
•High P (relative to N) can overfit, even in the linear case!
•We want to minimize R(θ*) not on the training data but on future data
•Want the model to Generalize to future data
  True risk:  R_true(θ) = ∫ P(x, y) L(y, f(x; θ)) dx dy
•One approach: split the data into a training portion and a testing portion:
  {(x_1, y_1), …, (x_N, y_N)}   and   {(x_{N+1}, y_{N+1}), …, (x_{N+M}, y_{N+M})}
•Estimate θ* with the training loss:  R_train(θ) = (1/N) ∑_{i=1}^N L(y_i, f(x_i; θ))
•Evaluate P with the testing loss:  R_test(θ) = (1/M) ∑_{i=N+1}^{N+M} L(y_i, f(x_i; θ))

Crossvalidation
•Try fitting with different polynomial orders P
•Select the P which gives the lowest R_test(θ*)
[Figure: training and testing loss vs. P; underfitting at low P, overfitting at high P, best P in between]
•Think of P as a measure of the complexity of the model
•Higher order polynomials are more flexible and complex
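
A sketch of this selection procedure (made-up data, not from the slides): fit each P on a training split, evaluate on a held-out testing split, and keep the P with the lowest testing risk.

    import numpy as np

    rng = np.random.default_rng(5)
    x = rng.uniform(-1, 1, size=60)
    y = np.sin(3 * x) + rng.normal(scale=0.1, size=60)

    x_tr, y_tr = x[:40], y[:40]            # training portion
    x_te, y_te = x[40:], y[40:]            # testing portion

    def fit_and_risks(P):
        X_tr = np.vander(x_tr, P + 1, increasing=True)
        X_te = np.vander(x_te, P + 1, increasing=True)
        theta = np.linalg.lstsq(X_tr, y_tr, rcond=None)[0]
        R_tr = 0.5 * np.mean((y_tr - X_tr @ theta) ** 2)
        R_te = 0.5 * np.mean((y_te - X_te @ theta) ** 2)
        return R_tr, R_te

    risks = {P: fit_and_risks(P) for P in range(16)}
    best_P = min(risks, key=lambda P: risks[P][1])    # lowest R_test
    print(best_P, risks[best_P])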

Topic 3
•Additive Models and Linear Regression
•Sinusoids and Radial Basis Functions
•Classification
•Logistic Regression
•Gradient Descent

Polynomial Basis Functions
•To fit a P'th order polynomial function to multivariate data:
  concatenate columns of all monomials up to power P
•E.g. for 2-dimensional data and a 2nd order (quadratic) polynomial, each data row [x_i(1)  x_i(2)] becomes the row
  [ 1   x_i(1)   x_i(2)   x_i(1)x_i(1)   x_i(1)x_i(2)   x_i(2)x_i(2) ]
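
A sketch of that column construction (illustrative only; the helper name and the sample rows are made up):

    import numpy as np

    def quadratic_features(X2):
        """Map rows [x(1), x(2)] to [1, x(1), x(2), x(1)x(1), x(1)x(2), x(2)x(2)]."""
        x1, x2 = X2[:, 0], X2[:, 1]
        return np.column_stack([np.ones(len(X2)), x1, x2, x1 * x1, x1 * x2, x2 * x2])

    X2 = np.array([[1.0, 2.0],
                   [0.5, -1.0]])
    print(quadratic_features(X2))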

Sinusoidal Basis Functions
•More generally, we don't just have to deal with polynomials; use any set of basis functions:
  f(x; θ) = ∑_{p=1}^P θ_p φ_p(x) + θ_0
•These are generally called Additive Models
•Regression adds linear combinations of the basis functions
•For example: Fourier (sinusoidal) basis
  φ_{2k}(x_i) = sin(k x_i)     φ_{2k+1}(x_i) = cos(k x_i)
•Note: they don't have to form a basis per se; usually only a subset is used
[Figure: f(x) composed as θ_1 × (first basis function) + θ_2 × (second basis function) + θ_3 × (third basis function)]

Radial Basis Functions
•Can act as prototypes of the data itself:
  f(x; θ) = ∑_{k=1}^N θ_k exp( −‖x − x_k‖² / (2σ²) )
•Parameter σ = standard deviation (σ² = covariance) controls how wide the bumps are
  (what happens if it is too big or too small?)
•Also works in multi-dimensions
•Called RBF for short



Radial Basis Functions
•Each training point leads to a bump function:
  f(x; θ) = ∑_{k=1}^N θ_k exp( −‖x − x_k‖² / (2σ²) )
•Reuse the solution from linear regression:  θ* = (XᵀX)⁻¹ Xᵀy
•Can view the data instead as X, a big matrix of size N × N with entries
  X_{jk} = exp( −‖x_j − x_k‖² / (2σ²) )
•For RBFs, X is square and symmetric, so the solution is just
  ∇_θ R = 0  →  XᵀXθ = Xᵀy  →  Xθ = y  →  θ* = X⁻¹y
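
A minimal RBF-regression sketch along these lines (illustrative; the data and σ are made up): build the N × N matrix of Gaussian bumps between training points, solve Xθ = y, then predict at new points with the same bumps.

    import numpy as np

    rng = np.random.default_rng(6)
    x = np.sort(rng.uniform(0, 5, size=20))               # 1-D training inputs
    y = np.cos(x) + rng.normal(scale=0.05, size=20)
    sigma = 0.5

    def rbf_matrix(a, b, sigma):
        # Entry (j, k) = exp(-|a_j - b_k|^2 / (2 sigma^2))
        d2 = (a[:, None] - b[None, :]) ** 2
        return np.exp(-d2 / (2 * sigma ** 2))

    X = rbf_matrix(x, x, sigma)                           # square, symmetric N x N
    theta = np.linalg.solve(X, y)                         # X theta = y  =>  theta = X^{-1} y

    x_new = np.linspace(0, 5, 5)
    f_new = rbf_matrix(x_new, x, sigma) @ theta           # f(x) = sum_k theta_k exp(-|x - x_k|^2 / (2 sigma^2))
    print(f_new)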

Evaluating Our Learned Function
•We minimized empirical risk to get θ*
•How well does f(x; θ*) perform on future data?
•It should Generalize and have low True Risk:
  R_true(θ) = ∫ P(x, y) L(y, f(x; θ)) dx dy
•Can't compute the true risk, so instead use the Testing Empirical Risk
•We randomly split the data into training and testing portions:
  {(x_1, y_1), …, (x_N, y_N)}   and   {(x_{N+1}, y_{N+1}), …, (x_{N+M}, y_{N+M})}
•Find θ* with the training data:  R_train(θ) = (1/N) ∑_{i=1}^N L(y_i, f(x_i; θ))
•Evaluate it with the testing data:  R_test(θ) = (1/M) ∑_{i=N+1}^{N+M} L(y_i, f(x_i; θ))

Crossvalidation
•Try fitting with different radial basis function widths σ
•Select the σ which gives the lowest R_test(θ*)
[Figure: training and testing loss vs. 1/σ; underfitting at low 1/σ (wide bumps), overfitting at high 1/σ (thin bumps), best σ in between]
•Think of σ as a measure of the simplicity of the model
•Thinner RBFs are more flexible and complex

Regularized Risk Minimization
•Empirical Risk Minimization gave overfitting & underfitting
•We want to add a penalty for using too many theta values
•This gives us the Regularized Risk:
  R_regularized(θ) = R_empirical(θ) + Penalty(θ)
                   = (1/N) ∑_{i=1}^N L(y_i, f(x_i; θ)) + (λ/2N) ‖θ‖²
•Solution for the Regularized Risk with the least squares loss:
  ∇_θ R_regularized = 0  ⇒  ∇_θ ( (1/2N) ‖y − Xθ‖² + (λ/2N) ‖θ‖² ) = 0
  θ* = (XᵀX + λI)⁻¹ Xᵀy
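
A sketch of this regularized solution (illustrative data; whether to exclude the intercept from the penalty is a modeling choice, here all of θ is penalized as in the formula above):

    import numpy as np

    rng = np.random.default_rng(7)
    N, D = 40, 16
    X = np.column_stack([np.ones(N), rng.normal(size=(N, D - 1))])
    theta_true = rng.normal(size=D)
    y = X @ theta_true + rng.normal(scale=0.2, size=N)

    def ridge(X, y, lam):
        D = X.shape[1]
        # theta* = (X^T X + lambda I)^{-1} X^T y
        return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

    for lam in [0.0, 0.1, 10.0]:
        theta = ridge(X, y, lam)
        print(lam, np.linalg.norm(theta))   # larger lambda shrinks the norm of theta

    # lambda = 0 recovers ordinary empirical risk minimization (least squares)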

Regularized Risk Minimization
•Have D=16 features (or P=15 throughout)
•Try minimizing R_regularized(θ) to get θ* with different λ
•Note that λ=0 gives back Empirical Risk Minimization

Crossvalidation
•Try fitting with different regularization levels λ
•Select the λ which gives the lowest R_test(θ*)
[Figure: training and testing risk vs. 1/λ; underfitting at low 1/λ, overfitting at high 1/λ, best λ in between]
•λ measures the simplicity of the model
•Models with low λ are more flexible

From Regression To Classification
•Classification is another important learning problem
  Regression:      X = {(x_1, y_1), (x_2, y_2), …, (x_N, y_N)},  x ∈ R^D,  y ∈ R^1
  Classification:  X = {(x_1, y_1), (x_2, y_2), …, (x_N, y_N)},  x ∈ R^D,  y ∈ {0,1}
•E.g. given x = [tumor size, tumor density], predict y in {benign, malignant}
•Should we solve this as a least squares regression problem?
[Figure: scatter plot of two classes, x's and o's]

Classification vs. Regression
a) Classification needs binary answers like {0,1}
b) Least squares is an unfair measure of risk here
   e.g. Why penalize a correct but large positive y answer?
   e.g. Why penalize a correct but large negative y answer?
•Example: it is not good to use the regression output for a decision
  f(x) > 0.5 → Class 1,   f(x) < 0.5 → Class 0
  if f(x) = −3.8 and the correct class is 0, squared error still penalizes it…
[Figure: 1-D example with x's near y=1 and o's near y=0. We pay a hefty squared error loss here even if we got the correct classification result: the thick solid line model makes two mistakes while the dashed model is perfect.]

Classification vs. Regression


We will consider the following four steps to improve on naïve regression and get better classification learning:

1) Fix functions f(x) to give binary output (logistic neuron)


2) Fix our definition of the Risk we will minimize so that we
get good classification accuracy (logistic loss)

…and later on…

3) Make an even better fix on f(x) to binarize (perceptron)


4) Make an even better risk (perceptron loss)

Logistic Neuron (McCulloch-Pitts)
•To output binary values, use a squashing function g():
  Linear neuron:    f(x; θ) = θᵀx
  Logistic neuron:  f(x; θ) = g(θᵀx),   g(z) = (1 + exp(−z))⁻¹
•This squashing function is called the sigmoid or logistic function



Logistic Regression
•Given a classification problem with binary outputs:
  X = {(x_1, y_1), (x_2, y_2), …, (x_N, y_N)},  x ∈ R^D,  y ∈ {0,1}
•Use this function and output 1 if f(x) > 0.5 and 0 otherwise:
  f(x; θ) = (1 + exp(−θᵀx))⁻¹

Short hand for Linear Functions
•What happened to adding the intercept?
  f(x; θ) = θᵀx + θ_0
•Absorb it by appending θ_0 to the parameter vector and a constant 1 to the input:
  [θ(1), …, θ(D)]ᵀ [x(1), …, x(D)] + θ_0 = [θ_0, θ(1), …, θ(D)]ᵀ [1, x(1), …, x(D)]
  so we can keep writing f(x; θ) = θᵀx with the augmented θ and x

Logistic Regression
•Given a classification problem with binary outputs:
  X = {(x_1, y_1), (x_2, y_2), …, (x_N, y_N)},  x ∈ R^D,  y ∈ {0,1}
•Fix #1: use the f(x) below; output 1 if f(x) > 0.5 and 0 otherwise
  f(x; θ) = (1 + exp(−θᵀx))⁻¹
•Fix #2: instead of the squared loss, use the Logistic Loss
  L_log(y_i, f(x_i; θ)) = (y_i − 1) log(1 − f(x_i; θ)) − y_i log f(x_i; θ)
•This method is called Logistic Regression
•But Empirical Risk Minimization has no closed-form solution:
  R_emp(θ) = (1/N) ∑_{i=1}^N [ (y_i − 1) log(1 − f(x_i; θ)) − y_i log f(x_i; θ) ]

Logistic Regression
•With the logistic squashing function, minimizing R(θ) is harder:
  R_emp(θ) = (1/N) ∑_{i=1}^N [ (y_i − 1) log(1 − f(x_i; θ)) − y_i log f(x_i; θ) ]
  ∇_θ R = (1/N) ∑_{i=1}^N ( (1 − y_i)/(1 − f(x_i; θ)) − y_i/f(x_i; θ) ) f′(x_i; θ) = 0   ???
•Can't minimize the risk and find the best θ analytically!
•Let's try finding the best θ numerically
•Use the following to compute the gradient:
  f(x; θ) = (1 + exp(−θᵀx))⁻¹ = g(θᵀx)
•Here, g() is the logistic squashing function:
  g(z) = (1 + exp(−z))⁻¹,   g′(z) = g(z)(1 − g(z))

Gradient Descent
•Useful when we can't get the minimum solution in closed form
•The gradient points in the direction of fastest increase
•Take a step in the opposite direction!
•Gradient Descent Algorithm:
  choose a scalar step size η and a tolerance ε
  initialize θ^0 = small random vector
  θ^1 = θ^0 − η ∇_θ R_emp |_{θ^0},   t = 1
  while ‖θ^t − θ^{t−1}‖ ≥ ε  {  θ^{t+1} = θ^t − η ∇_θ R_emp |_{θ^t},   t = t + 1  }
•For an appropriate η, this will converge to a local minimum
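
A sketch of gradient descent applied to the logistic-regression risk above (made-up data; the step size, tolerance, and iteration cap are arbitrary choices). For the logistic loss the per-example gradient simplifies to (f(x_i; θ) − y_i) x_i, which is what the code uses.

    import numpy as np

    rng = np.random.default_rng(8)
    N = 200
    X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])        # rows [1, x(1), x(2)]
    w_true = np.array([-0.5, 2.0, -1.5])
    y = (1 / (1 + np.exp(-X @ w_true)) > rng.uniform(size=N)).astype(float)   # binary labels

    def f(X, theta):
        return 1.0 / (1.0 + np.exp(-X @ theta))          # logistic squashing of theta^T x

    def grad_risk(theta, X, y):
        # (1/N) sum_i (f(x_i; theta) - y_i) x_i  -- gradient of the logistic-loss risk
        return X.T @ (f(X, theta) - y) / len(y)

    eta, eps = 0.5, 1e-6                                 # step size and tolerance (arbitrary)
    theta = rng.normal(scale=0.01, size=X.shape[1])      # small random initialization
    for t in range(100000):
        step = eta * grad_risk(theta, X, y)
        theta = theta - step
        if np.linalg.norm(step) < eps:                   # stop when the update is tiny
            break
    print(theta)                                         # predict 1 where f(X, theta) > 0.5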

Logistic Regression
•Logistic regression gives better classification performance
•Its empirical risk is
  R_emp(θ) = (1/N) ∑_{i=1}^N [ (y_i − 1) log(1 − f(x_i; θ)) − y_i log f(x_i; θ) ]
•This R(θ) is convex, so gradient descent always converges to the same solution
•Make predictions using
  f(x; θ) = (1 + exp(−θᵀx))⁻¹
•Output 1 if f > 0.5
•Output 0 otherwise
