Lecture 1
Machine Learning 4771
Topic 2
•Regression
•Empirical Risk Minimization
•Least Squares
•Higher Order Polynomials
•Under-fitting / Over-fitting
•Cross-Validation
Tony Jebara, Columbia University
Regression
•Supervised learning: Classification and Regression, f(x)=y
•Unsupervised learning: Feature Selection and Anomaly Detection
[Figure: example scatter plots for classification, regression, feature selection, and anomaly detection]
Function Approximation
⎡ ⎤
•Start with training dataset ⎢ x (1) ⎥
⎢ ⎥
⎢ x (2) ⎥
{
X = (x1,y1 ), (x 2,y 2 ),…, (x N ,yN ) x ∈ R = ⎢⎢
D
} ⎥ y ∈ R1
⎥
⎢ ... ⎥
•Have N (input, output ) pairs ⎢ ⎥
⎢ x (D ) ⎥
•Find a function f(x) to predict y from x ⎢⎣ ⎥⎦
That fits the training data well
y x x
x x
x x
x x
x x
x
x x
Min by Gradient=0
•Gradient=0 means the partial derivatives are all 0:
  ∇_θ R = [ ∂R/∂θ_0 , ∂R/∂θ_1 ]^T = [ 0 , 0 ]^T
•Take partials of the empirical risk:
  R(θ) = (1/2N) Σ_{i=1}^N (y_i − θ_1 x_i − θ_0)²
  ∂R/∂θ_0 = (1/N) Σ_{i=1}^N (y_i − θ_1 x_i − θ_0)(−1) = 0
  ∂R/∂θ_1 = (1/N) Σ_{i=1}^N (y_i − θ_1 x_i − θ_0)(−x_i) = 0
•Solving the two equations gives:
  θ_0 = (1/N) Σ_i y_i − θ_1 (1/N) Σ_i x_i
  θ_1 Σ_i x_i² = Σ_i y_i x_i − θ_0 Σ_i x_i
  θ_1 = ( Σ_i y_i x_i − (1/N) Σ_i y_i Σ_i x_i ) / ( Σ_i x_i² − (1/N) Σ_i x_i Σ_i x_i )
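•A minimal MATLAB/Octave sketch (not from the slides; the data and noise level are made up): fit the 1D least-squares line with the closed-form formulas derived above and compare against polyfit.
  % Synthetic 1D data: true slope 2, intercept 1
  N = 100;
  x = linspace(0, 5, N)';
  y = 2*x + 1 + 0.3*randn(N, 1);

  sx  = sum(x);    sy  = sum(y);
  sxx = sum(x.^2); sxy = sum(x.*y);

  theta1 = (sxy - sy*sx/N) / (sxx - sx*sx/N);   % slope
  theta0 = sy/N - theta1*sx/N;                  % intercept

  p = polyfit(x, y, 1);                         % built-in check: p = [slope intercept]
  disp([theta1 theta0; p]);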
•At the optimum, ∂R/∂θ_1 = (1/N) Σ_{i=1}^N (y_i − θ_1 x_i − θ_0) x_i = 0
  …the error is not correlated with the data:
  (1/N) Σ_i e_i x_i = (1/N) e^T x = 0
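•A small self-contained check of this property (synthetic data, not from the slides):
  % After a least-squares fit, the residuals are uncorrelated with the inputs
  N = 100;
  x = rand(N, 1);
  y = 3*x + 0.5 + 0.2*randn(N, 1);
  theta = [ones(N,1), x] \ y;            % theta = [theta_0; theta_1]
  e = y - theta(1) - theta(2)*x;         % residuals
  disp(mean(e .* x));                    % ~0 up to numerical precision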
Multi-Dimensional Regression
•More elegant/general to do ∇_θ R = 0 with linear algebra
•Rewrite the empirical risk in vector-matrix notation:
  R(θ) = (1/2N) Σ_{i=1}^N (y_i − θ_1 x_i − θ_0)²
       = (1/2N) Σ_{i=1}^N ( y_i − [1  x_i] [θ_0 ; θ_1] )²
       = (1/2N) || y − Xθ ||²   with   y = [y_1 ; … ; y_N],   X = [1 x_1 ; … ; 1 x_N],   θ = [θ_0 ; θ_1]
•Can add more dimensions by adding columns to the X matrix and rows to the θ vector:
  each row of X becomes [1  x_i(1) … x_i(D)]  and  θ = [θ_0 ; θ_1 ; … ; θ_D]
Multi-Dimensional Regression
•More realistic dataset: many measurements
•Have N apartments, each with D measurements
•Each row of X is [#rooms, latitude, longitude, …]:
  X = [ 1  x_1(1) … x_1(D)
        ⋮
        1  x_N(1) … x_N(D) ]
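•A minimal sketch (made-up numbers): build this design matrix by prepending a column of ones to an N×D matrix of raw measurements.
  N = 5;  D = 3;
  A = [3 40.85 -73.96;                 % hypothetical rows: [#rooms latitude longitude]
       2 40.75 -73.99;
       4 40.81 -73.94;
       1 40.73 -74.00;
       3 40.78 -73.97];
  X = [ones(N, 1), A];                 % each row is [1  x_i(1) ... x_i(D)]
  disp(size(X));                       % N x (D+1)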
Multi-Dimensional Regression
•Solving gradient=0:  ∇_θ R = 0
  ∇_θ ( (1/2N) || y − Xθ ||² ) = 0
  (1/2N) ∇_θ ( (y − Xθ)^T (y − Xθ) ) = 0
  (1/2N) ∇_θ ( y^T y − 2 y^T Xθ + θ^T X^T Xθ ) = 0
  (1/2N) ( −2 y^T X + 2 θ^T X^T X ) = 0
  X^T X θ = X^T y
  θ* = (X^T X)^{-1} X^T y
•Useful matrix-derivative identities:
  ∂(u^T θ)/∂θ = u^T      ∂(θ^T θ)/∂θ = 2 θ^T      ∂(θ^T A θ)/∂θ = θ^T (A + A^T)
•In Matlab: “t=pinv(X)*y” or “t=X\y” or “t=inv(X’*X)*X’*y”
•If the matrix X is skinny (more points than dimensions), the solution is usually unique
•If X is fat (more dimensions than points), we get multiple solutions for θ which give zero error
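•A self-contained MATLAB/Octave sketch (synthetic data, not from the slides) comparing the three solvers on a skinny X:
  N = 50;  D = 3;
  X = [ones(N,1), rand(N, D)];            % skinny design matrix (N > D+1)
  theta_true = [1; 2; -1; 0.5];           % made-up weights
  y = X*theta_true + 0.1*randn(N, 1);

  t1 = pinv(X)*y;                         % pseudo-inverse
  t2 = X\y;                               % backslash (preferred in practice)
  t3 = inv(X'*X)*(X'*y);                  % normal equations, as on the slide
  disp([t1 t2 t3]);                       % all three agree for a skinny X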
2D Linear Regression
•Once the best θ* is found, we can plug it into the function:
  f(x; θ*) = θ*_2 x(2) + θ*_1 x(1) + θ*_0
•Writing the risk for order-P polynomial fitting of a 1D input:
  R(θ) = (1/2N) || y − Xθ ||²   with   X = [ 1  x_1 … x_1^P ; ⋮ ; 1  x_N … x_N^P ],   θ = [θ_0 ; θ_1 ; … ; θ_P]
•Order-P polynomial regression for a 1D variable is the same as P-dimensional linear regression!
•Construct a multidimensional x-vector from the scalar x_i:  x_i = [ x_i^0  x_i^1  x_i^2  x_i^3 ]^T
•More generally, use any basis functions:  x_i = [ φ_0(x_i)  φ_1(x_i)  φ_2(x_i)  φ_3(x_i) ]^T
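•A hedged sketch (synthetic data; P and the noise level are made up): order-P polynomial regression as linear regression on the expanded features [x^0 x^1 … x^P].
  N = 60;  P = 3;
  x = linspace(-1, 1, N)';
  y = sin(3*x) + 0.1*randn(N, 1);

  X = x .^ (0:P);                 % N x (P+1) design matrix (implicit expansion / broadcasting)
  theta = X \ y;                  % least-squares fit
  yhat  = X * theta;              % fitted values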
Underfitting/Overfitting
•Try varying P. Higher P fits a more complex function class
•Observe that R(θ*) drops with bigger P
•Split the data:  training set {(x_1,y_1), …, (x_N,y_N)}  and  testing set {(x_{N+1},y_{N+1}), …, (x_{N+M},y_{N+M})}
•Estimate θ* with the training loss:  R_train(θ) = (1/N) Σ_{i=1}^N L(y_i, f(x_i; θ))
Crossvalidation
•Try fitting with different polynomial orders P
•Select the P which gives the lowest R_test(θ*)
[Figure: loss vs. P; R_train(θ*) keeps decreasing while R_test(θ*) is U-shaped, marking underfitting at small P, overfitting at large P, and the best P at the minimum of R_test]
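•A hedged sketch (synthetic data; the split sizes and candidate orders are made up): pick P by the held-out test risk.
  N = 40;  M = 40;
  x = 2*rand(N+M, 1) - 1;
  y = sin(3*x) + 0.1*randn(N+M, 1);
  xtr = x(1:N);      ytr = y(1:N);          % training split
  xte = x(N+1:end);  yte = y(N+1:end);      % testing split

  Ps = 1:9;  Rtest = zeros(size(Ps));
  for k = 1:numel(Ps)
    P = Ps(k);
    theta = (xtr .^ (0:P)) \ ytr;                        % fit on training data
    Rtest(k) = mean((yte - (xte .^ (0:P))*theta).^2)/2;  % held-out squared-error risk
  end
  [~, best] = min(Rtest);
  fprintf('best P = %d\n', Ps(best));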
Machine Learning 4771
Topic 3
•Additive Models and Linear Regression
•Sinusoids and Radial Basis Functions
•Classification
•Logistic Regression
•Gradient Descent
Tony Jebara, Columbia University
[Figure: an additive model built as a weighted sum of basis functions, θ_1·(basis 1) + θ_2·(basis 2) + θ_3·(basis 3)]
•Each radial basis function is a Gaussian bump:  φ_j(x) = exp( −(1/2σ²) || x − x_j ||² )
•Parameter σ = standard deviation (σ² = variance) controls how wide the bumps are; what happens if it is too big/small?
•Reuse the solution from linear regression:  θ* = (X^T X)^{-1} X^T y
•Can view the data instead as X, a big N×N matrix whose entries are
  X_ij = exp( −(1/2σ²) || x_i − x_j ||² )
•For RBFs, X is square and symmetric, so the solution is just
  ∇_θ R = 0  →  X^T Xθ = X^T y  →  Xθ = y  →  θ* = X^{-1} y
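•A hedged MATLAB/Octave sketch (synthetic data; σ and N are made-up values): RBF regression with Gaussian bumps centered at the training points, giving the square N×N design matrix above.
  N = 30;  sigma = 0.3;
  x = linspace(0, 1, N)';
  y = sin(2*pi*x) + 0.05*randn(N, 1);

  D2 = (x - x').^2;                    % pairwise squared distances (implicit expansion)
  X  = exp(-D2 / (2*sigma^2));         % X(i,j) = exp(-||x_i - x_j||^2 / (2 sigma^2))
  theta = X \ y;                       % square X: solve X*theta = y
  yhat  = X * theta;                   % (near-)interpolates the training points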
•Split the data:  training set {(x_1,y_1), …, (x_N,y_N)}  and  testing set {(x_{N+1},y_{N+1}), …, (x_{N+M},y_{N+M})}
•Find θ* with the training data:  R_train(θ) = (1/N) Σ_{i=1}^N L(y_i, f(x_i; θ))
•Evaluate it with the testing data:  R_test(θ) = (1/M) Σ_{i=N+1}^{N+M} L(y_i, f(x_i; θ))
Crossvalidation
•Try fitting with different radial basis function widths σ
•Select the σ which gives the lowest R_test(θ*)
[Figure: loss vs. 1/σ; R_train(θ*) keeps decreasing while R_test(θ*) is U-shaped, marking underfitting at large σ, overfitting at small σ, and the best σ at the minimum of R_test]
•Penalize large θ by adding a regularization term to the empirical risk:
  R(θ) = (1/N) Σ_{i=1}^N L(y_i, f(x_i; θ)) + (λ/2N) ||θ||²
•With squared-error loss, the solution becomes  θ* = (X^T X + λI)^{-1} X^T y
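•A hedged sketch (reusing the synthetic RBF setup; λ is a made-up value): ridge-regularized least squares, θ* = (X'X + λI)^{-1} X'y.
  lambda = 1e-3;
  N = 30;  sigma = 0.3;
  x = linspace(0, 1, N)';
  y = sin(2*pi*x) + 0.05*randn(N, 1);

  X = exp(-(x - x').^2 / (2*sigma^2));          % N x N RBF design matrix
  theta = (X'*X + lambda*eye(N)) \ (X'*y);      % regularized solution
  yhat  = X * theta;                            % smoother fit than the unregularized one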
Crossvalidation
•Try fitting with different regularization levels λ
•Select the λ which gives the lowest R_test(θ*)
[Figure: risk vs. 1/λ; R_train(θ*) keeps decreasing while R_test(θ*) is U-shaped, marking underfitting at large λ, overfitting at small λ, and the best λ at the minimum of R_test]
Classification
•Given a dataset with binary outputs:
  X = {(x_1,y_1), (x_2,y_2), …, (x_N,y_N)},   x ∈ R^D,   y ∈ {0,1}
[Figure: scatter plot of two classes (x's and O's) in the input space]
•Linear function:  f(x; θ) = θ^T x
•Squash it through the logistic function:  f(x; θ) = g(θ^T x)  where  g(z) = (1 + exp(−z))^{-1}
Logistic Regression
•Given a classification problem with binary outputs
  X = {(x_1,y_1), (x_2,y_2), …, (x_N,y_N)},   x ∈ R^D,   y ∈ {0,1}
•Start from a linear function with a bias term:  f(x; θ) = θ^T x + θ_0
•Absorb the bias by prepending θ_0 to θ and a 1 to x:
  f(x; θ) = [θ(1); …; θ(D)]^T [x(1); …; x(D)] + θ_0 = [θ_0; θ(1); …; θ(D)]^T [1; x(1); …; x(D)] = θ^T x
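•A minimal sketch (random θ and x, purely illustrative): absorb the bias with an augmented input, then squash θ^T x through the logistic function.
  g = @(z) 1 ./ (1 + exp(-z));        % logistic squashing function
  D = 4;
  theta = randn(D+1, 1);              % [theta_0; theta(1); ...; theta(D)]
  x  = randn(D, 1);
  xa = [1; x];                        % augmented input [1; x(1); ...; x(D)]
  f  = g(theta' * xa);                % f(x; theta), a value in (0,1)
  disp(f);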
Logistic Regression
•With the logistic squashing function, minimizing R(θ) is harder:
  R_emp(θ) = (1/N) Σ_{i=1}^N [ (y_i − 1) log(1 − f(x_i; θ)) − y_i log f(x_i; θ) ]
  ∇_θ R = (1/N) Σ_{i=1}^N ( (1 − y_i)/(1 − f(x_i; θ)) − y_i/f(x_i; θ) ) f′(x_i; θ) = 0   ???
•Can’t minimize the risk and find the best θ analytically!
•Let’s try finding the best θ numerically.
•Use the following to compute the gradient:
  f(x; θ) = (1 + exp(−θ^T x))^{-1} = g(θ^T x)
Gradient Descent
•Useful when we can’t get the minimum solution in closed form
•The gradient points in the direction of fastest increase
•Take a step in the opposite direction!
  while ||θ_t − θ_{t−1}|| ≥ ε  {  θ_{t+1} = θ_t − η ∇_θ R_emp |_{θ_t} ,  t = t + 1  }
•For an appropriate step size η, this will converge to a local minimum
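•A hedged MATLAB/Octave sketch (synthetic labels; η, the tolerance, and the iteration cap are made-up values): gradient descent on the logistic-regression risk, using the simplification ∇_θ R_emp = (1/N) X^T (f − y) that follows from the risk above with f = g(Xθ).
  g = @(z) 1 ./ (1 + exp(-z));
  N = 200;  D = 2;
  X = [ones(N,1), randn(N, D)];               % augmented inputs (bias column of 1s)
  theta_true = [0.5; 2; -1];                  % made-up weights
  y = double(rand(N,1) < g(X*theta_true));    % synthetic binary labels

  eta = 0.5;  tol = 1e-6;  maxit = 50000;
  theta = zeros(D+1, 1);  step = inf;  it = 0;
  while step >= tol && it < maxit
    f     = g(X*theta);                       % f(x_i; theta) for all i
    grad  = (X' * (f - y)) / N;               % gradient of R_emp(theta)
    tnew  = theta - eta*grad;                 % step opposite the gradient
    step  = norm(tnew - theta);
    theta = tnew;  it = it + 1;
  end
  disp([theta theta_true]);                   % estimate vs. generating weights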
Logistic Regression
•Logistic regression gives better classification performance
•Its empirical risk is
  R_emp(θ) = (1/N) Σ_{i=1}^N [ (y_i − 1) log(1 − f(x_i; θ)) − y_i log f(x_i; θ) ]
  with  f(x; θ) = (1 + exp(−θ^T x))^{-1}
•This R(θ) is convex, so gradient descent always converges to the same solution