
Tony Jebara, Columbia University

Machine Learning 4771

Instructor: Tony Jebara



Topic 2
•Regression
•Empirical Risk Minimization
•Least Squares
•Higher Order Polynomials
•Under-fitting / Over-fitting
•Cross-Validation

Regression
Supervised learning: classification and regression, f(x) = y
[Figure: scatter plots of labeled data, two classes of points for classification and a curve fit through points for regression]
Unsupervised learning: density/structure estimation, clustering, feature selection, anomaly detection
[Figure: scatter plots of unlabeled data]

Function Approximation
•Start with a training dataset of N (input, output) pairs:
  X = { (x_1, y_1), (x_2, y_2), …, (x_N, y_N) },   x ∈ R^D a column vector [x(1); x(2); …; x(D)],   y ∈ R^1
•Find a function f(x) to predict y from x that fits the training data well
[Figure: scatter of (x, y) points with a curve fit through them]
•Example: predict the price of a house in dollars, y, using x = [#rooms; latitude; longitude; …]
•Need: a) a way to evaluate how good a fit we have
       b) a class of functions in which to search for f(x)

Empirical Risk Minimization
•Idea: minimize 'loss' on the training data set
•Empirical = use the training set to find the best fit
•Define a loss function measuring how well we fit a single point:  L(y, f(x))
•Empirical Risk = the average loss over the dataset:
  R = (1/N) ∑_{i=1}^N L(y_i, f(x_i))
•Simplest loss: squared error from the y value
  L(y_i, f(x_i)) = ½ (y_i − f(x_i))²
•Other possible loss: absolute error
  L(y_i, f(x_i)) = | y_i − f(x_i) |
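
As a concrete illustration of these definitions (a sketch added for this transcript, not from the slides), the snippet below computes the empirical risk under the squared and absolute losses; the toy data and the candidate linear predictor are made up.

    import numpy as np

    # Toy training set: N pairs (x_i, y_i); values are arbitrary.
    x = np.array([0.0, 1.0, 2.0, 3.0])
    y = np.array([0.1, 0.9, 2.2, 2.8])

    def empirical_risk(f, x, y, loss="squared"):
        """Average loss of f over the training pairs (x_i, y_i)."""
        preds = f(x)
        if loss == "squared":
            per_point = 0.5 * (y - preds) ** 2   # L(y_i, f(x_i)) = 1/2 (y_i - f(x_i))^2
        else:
            per_point = np.abs(y - preds)        # L(y_i, f(x_i)) = |y_i - f(x_i)|
        return per_point.mean()                  # R = (1/N) sum_i L(y_i, f(x_i))

    f = lambda x: 0.95 * x + 0.05                # some candidate linear predictor
    print(empirical_risk(f, x, y, "squared"), empirical_risk(f, x, y, "absolute"))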

Linear Function Classes
•Linear is the simplest class of functions to search over:
  f(x; θ) = θᵀx + θ_0 = ∑_{d=1}^D θ_d x(d) + θ_0
•Start with x being 1-dimensional (D=1):
  f(x; θ) = θ_1 x + θ_0
•Plug in the above & minimize the empirical risk over θ:
  R(θ) = (1/2N) ∑_{i=1}^N (y_i − θ_1 x_i − θ_0)²
•Note: the minimum occurs where R(θ) gets flat (not always!)
•Note: where R(θ) is flat, the gradient ∇_θ R = 0

Min by Gradient=0
•Gradient=0 means the partial derivatives are all zero:
  ∇_θ R = [ ∂R/∂θ_0 ; ∂R/∂θ_1 ] = [ 0 ; 0 ]
•Take partials of the empirical risk R(θ) = (1/2N) ∑_{i=1}^N (y_i − θ_1 x_i − θ_0)²:
  ∂R/∂θ_0 = (1/N) ∑_{i=1}^N (y_i − θ_1 x_i − θ_0)(−1) = 0
  ∂R/∂θ_1 = (1/N) ∑_{i=1}^N (y_i − θ_1 x_i − θ_0)(−x_i) = 0
•Solving the first equation for the intercept:
  θ_0 = (1/N) ∑_i y_i − θ_1 (1/N) ∑_i x_i
•Substituting into the second equation:
  θ_1 ∑_i x_i² = ∑_i y_i x_i − θ_0 ∑_i x_i
  θ_1 = ( ∑_i y_i x_i − (1/N) ∑_i y_i ∑_i x_i ) / ( ∑_i x_i² − (1/N) ∑_i x_i ∑_i x_i )
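
A small NumPy sketch (not from the slides) that implements these two closed-form expressions directly and checks them against a generic least-squares solver; the synthetic data is made up.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, size=50)
    y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=50)   # noisy line: slope 2, intercept 1

    N = len(x)
    # theta_1 = (sum y_i x_i - (1/N) sum y_i sum x_i) / (sum x_i^2 - (1/N) (sum x_i)^2)
    theta1 = (np.sum(y * x) - np.sum(y) * np.sum(x) / N) / (np.sum(x**2) - np.sum(x)**2 / N)
    # theta_0 = (1/N) sum y_i - theta_1 (1/N) sum x_i
    theta0 = y.mean() - theta1 * x.mean()

    # Cross-check with least squares on the design matrix with rows [1, x_i]
    X = np.column_stack([np.ones(N), x])
    print(theta0, theta1)
    print(np.linalg.lstsq(X, y, rcond=None)[0])   # should agree: [theta0, theta1]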

Properties of the Solution
•Setting θ* as before gives the least squared error fit
•Define the error (residual) on each data point as:
  e_i = y_i − θ*_1 x_i − θ*_0
•Note property #1: at the solution, ∂R/∂θ_0 = (1/N) ∑_{i=1}^N (y_i − θ_1 x_i − θ_0) = 0
  …so the average error is zero: (1/N) ∑_i e_i = 0
•Note property #2: at the solution, ∂R/∂θ_1 = (1/N) ∑_{i=1}^N (y_i − θ_1 x_i − θ_0) x_i = 0
  …so the error is not correlated with the data: (1/N) ∑_i e_i x_i = (1/N) eᵀx = 0
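
A quick numerical check of these two properties (an illustration with made-up data, not from the slides): the least-squares residuals average to zero and are uncorrelated with the inputs.

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.uniform(-3, 3, size=200)
    y = -1.5 * x + 0.7 + rng.normal(scale=0.3, size=200)

    X = np.column_stack([np.ones_like(x), x])              # rows [1, x_i]
    theta0, theta1 = np.linalg.lstsq(X, y, rcond=None)[0]

    e = y - theta1 * x - theta0                            # residuals e_i
    print(e.mean())                # property #1: (1/N) sum_i e_i is ~ 0
    print(np.dot(e, x) / len(x))   # property #2: (1/N) e^T x is ~ 0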

Multi-Dimensional Regression
•More elegant/general to solve ∇_θ R = 0 with linear algebra
•Rewrite the empirical risk in vector-matrix notation:
  R(θ) = (1/2N) ∑_{i=1}^N (y_i − θ_1 x_i − θ_0)²
       = (1/2N) ∑_{i=1}^N ( y_i − [1  x_i] [θ_0; θ_1] )²
       = (1/2N) ‖ y − Xθ ‖²
  where y = [y_1; …; y_N] and X has rows [1  x_i]
•Can add more dimensions by adding columns to the X matrix and rows to the θ vector:
  the rows of X become [1  x_i(1)  …  x_i(D)]  and  θ = [θ_0; θ_1; …; θ_D]

Multi-Dimensional Regression
•More realistic dataset: many measurements
•Have N apartments, each with D measurements
•Each row of X is [#rooms; latitude; longitude; …] (with a leading 1 for the intercept):
  X = [ 1  x_1(1)  …  x_1(D)
        ⋮    ⋮    ⋱     ⋮
        1  x_N(1)  …  x_N(D) ]

Multi-Dimensional Regression
•Solving gradient=0:  ∇_θ R = 0
  ∇_θ ( (1/2N) ‖ y − Xθ ‖² ) = 0
  (1/2N) ∇_θ ( (y − Xθ)ᵀ(y − Xθ) ) = 0
  (1/2N) ∇_θ ( yᵀy − 2yᵀXθ + θᵀXᵀXθ ) = 0
  (1/2N) ( −2yᵀX + 2θᵀXᵀX ) = 0
•Useful vector derivative identities:
  ∂(uᵀθ)/∂θ = uᵀ     ∂(θᵀθ)/∂θ = 2θᵀ     ∂(θᵀAθ)/∂θ = θᵀ(A + Aᵀ)
•Transposing gives the normal equations:
  XᵀXθ = Xᵀy
  θ* = (XᵀX)⁻¹ Xᵀy
•In Matlab: "t=pinv(X)*y" or "t=X\y" or "t=inv(X'*X)*X'*y"
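
A NumPy counterpart of the Matlab one-liners above (an illustrative sketch; the random data is made up): all three routes give the same θ* when XᵀX is invertible.

    import numpy as np

    rng = np.random.default_rng(2)
    N, D = 100, 3
    X = np.column_stack([np.ones(N), rng.normal(size=(N, D))])    # leading column of 1s for theta_0
    true_theta = np.array([0.5, 1.0, -2.0, 3.0])
    y = X @ true_theta + rng.normal(scale=0.1, size=N)

    theta_pinv   = np.linalg.pinv(X) @ y                          # like t = pinv(X)*y
    theta_lstsq  = np.linalg.lstsq(X, y, rcond=None)[0]           # like t = X\y
    theta_normal = np.linalg.solve(X.T @ X, X.T @ y)              # solves X^T X theta = X^T y
    print(theta_pinv, theta_lstsq, theta_normal)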

Multi-Dimensional Regression
•Solving gradient=0 gives  XᵀXθ = Xᵀy,  so  θ* = (XᵀX)⁻¹ Xᵀy
•In Matlab: "t=pinv(X)*y" or "t=X\y" or "t=inv(X'*X)*X'*y"
•If the matrix X is skinny, the solution is probably unique
•If X is fat (more dimensions than points), we get multiple solutions for θ which give zero error
•The pseudoinverse (pinv(X)) returns the θ with zero error and the smallest norm:
  min_θ ‖θ‖²  such that  Xθ = y
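
To see the fat-X case concretely, here is a small sketch (not from the slides): with more columns than rows, many θ fit the data exactly, and the pseudoinverse picks the one with the smallest norm.

    import numpy as np

    rng = np.random.default_rng(3)
    N, D = 5, 20                            # fat X: more dimensions than points
    X = rng.normal(size=(N, D))
    y = rng.normal(size=N)

    theta_min_norm = np.linalg.pinv(X) @ y              # zero-error solution with smallest norm
    # Another zero-error solution: add a vector from the null space of X
    _, _, Vt = np.linalg.svd(X)
    theta_other = theta_min_norm + Vt[-1]               # Vt[-1] satisfies X @ Vt[-1] ~ 0
    print(np.allclose(X @ theta_min_norm, y), np.allclose(X @ theta_other, y))   # both fit exactly
    print(np.linalg.norm(theta_min_norm), np.linalg.norm(theta_other))           # min-norm is smaller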

2D Linear Regression
•Once the best θ* is found, we can plug it into the function:
  f(x; θ*) = θ*_2 x(2) + θ*_1 x(1) + θ*_0
•What would a fat X look like?



Polynomial Function Classes
•Back to 1-dimensional x (D=1), BUT nonlinear
•Polynomial:  f(x; θ) = ∑_{p=1}^P θ_p x^p + θ_0
•Writing the risk:
  R(θ) = (1/2N) ‖ y − Xθ ‖²   with X rows [1  x_i  x_i²  …  x_i^P]  and  θ = [θ_0; θ_1; …; θ_P]
•Order-P polynomial regression for a 1D variable is the same as P-dimensional linear regression!
•Construct a multidimensional x-vector from the scalar x_i:  x_i → [x_i^0  x_i^1  x_i^2  x_i^3]ᵀ
•More generally, use any features:  x_i → [φ_0(x_i)  φ_1(x_i)  φ_2(x_i)  φ_3(x_i)]ᵀ

Underfitting/Overfitting
•Try varying P. Higher P fits a more complex function class
•Observe R(θ*) drops with bigger P
[Figure: polynomial fits to the same data for P = 0, 1, 2, 5, 10, 15]
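
A sketch of the point made here (illustrative, not from the slides): build the order-P design matrix with columns x^0, x^1, …, x^P, fit by least squares, and watch the training risk fall as P grows.

    import numpy as np

    rng = np.random.default_rng(4)
    x = rng.uniform(-1, 1, size=30)
    y = np.sin(3 * x) + rng.normal(scale=0.1, size=30)       # nonlinear target with noise

    def poly_design(x, P):
        # Columns [x^0, x^1, ..., x^P]; same as P-dimensional linear regression.
        return np.vander(x, P + 1, increasing=True)

    for P in [0, 1, 2, 5, 10, 15]:
        X = poly_design(x, P)
        theta = np.linalg.lstsq(X, y, rcond=None)[0]
        R_train = 0.5 * np.mean((y - X @ theta) ** 2)
        print(P, R_train)                                    # training risk keeps dropping with P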



Evaluating The Regression
•Unfair to use the empirical (training) risk to find the best order P
•High P (relative to N) can overfit, even in the linear case!
•We want to minimize R(θ*) not on the training data but on future data
•Want the model to Generalize to future data
  True risk:  R_true(θ) = ∫ P(x, y) L(y, f(x; θ)) dx dy
•One approach: split the data into a training portion and a testing portion:
  {(x_1, y_1), …, (x_N, y_N)}   and   {(x_{N+1}, y_{N+1}), …, (x_{N+M}, y_{N+M})}
•Estimate θ* with the training loss:  R_train(θ) = (1/N) ∑_{i=1}^N L(y_i, f(x_i; θ))
•Evaluate P with the testing loss:  R_test(θ) = (1/M) ∑_{i=N+1}^{N+M} L(y_i, f(x_i; θ))

Crossvalidation
•Try fitting with different polynomial orders P
•Select the P which gives the lowest R_test(θ*)
[Figure: training and testing loss vs. P; underfitting at low P, overfitting at high P, best P in between]
•Think of P as a measure of the complexity of the model
•Higher order polynomials are more flexible and complex
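
A sketch of this selection procedure (made-up data, not from the slides): fit each P on a training split, evaluate on a held-out testing split, and keep the P with the lowest testing risk.

    import numpy as np

    rng = np.random.default_rng(5)
    x = rng.uniform(-1, 1, size=60)
    y = np.sin(3 * x) + rng.normal(scale=0.1, size=60)

    x_tr, y_tr = x[:40], y[:40]            # training portion
    x_te, y_te = x[40:], y[40:]            # testing portion

    def fit_and_risks(P):
        X_tr = np.vander(x_tr, P + 1, increasing=True)
        X_te = np.vander(x_te, P + 1, increasing=True)
        theta = np.linalg.lstsq(X_tr, y_tr, rcond=None)[0]
        R_tr = 0.5 * np.mean((y_tr - X_tr @ theta) ** 2)
        R_te = 0.5 * np.mean((y_te - X_te @ theta) ** 2)
        return R_tr, R_te

    risks = {P: fit_and_risks(P) for P in range(16)}
    best_P = min(risks, key=lambda P: risks[P][1])    # lowest R_test
    print(best_P, risks[best_P])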

Topic 3
•Additive Models and Linear Regression
•Sinusoids and Radial Basis Functions
•Classification
•Logistic Regression
•Gradient Descent

Polynomial Basis Functions
•To fit a P'th order polynomial function to multivariate data:
  concatenate columns of all monomials up to power P
•E.g. for 2-dimensional data and a 2nd order (quadratic) polynomial, each data row [x_i(1)  x_i(2)] becomes the row
  [ 1   x_i(1)   x_i(2)   x_i(1)x_i(1)   x_i(1)x_i(2)   x_i(2)x_i(2) ]
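
A sketch of that column construction (illustrative only; the helper name and the sample rows are made up):

    import numpy as np

    def quadratic_features(X2):
        """Map rows [x(1), x(2)] to [1, x(1), x(2), x(1)x(1), x(1)x(2), x(2)x(2)]."""
        x1, x2 = X2[:, 0], X2[:, 1]
        return np.column_stack([np.ones(len(X2)), x1, x2, x1 * x1, x1 * x2, x2 * x2])

    X2 = np.array([[1.0, 2.0],
                   [0.5, -1.0]])
    print(quadratic_features(X2))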

Sinusoidal Basis Functions
•More generally, we don't just have to deal with polynomials; use any set of basis functions:
  f(x; θ) = ∑_{p=1}^P θ_p φ_p(x) + θ_0
•These are generally called Additive Models
•Regression adds linear combinations of the basis functions
•For example: Fourier (sinusoidal) basis
  φ_{2k}(x_i) = sin(k x_i)     φ_{2k+1}(x_i) = cos(k x_i)
•Note: they don't have to form a basis per se; usually only a subset is used
[Figure: f(x) composed as θ_1 × (first basis function) + θ_2 × (second basis function) + θ_3 × (third basis function)]

Radial Basis Functions
•Can act as prototypes of the data itself:
  f(x; θ) = ∑_{k=1}^N θ_k exp( −‖x − x_k‖² / (2σ²) )
•Parameter σ = standard deviation (σ² = covariance) controls how wide the bumps are
  (what happens if it is too big or too small?)
•Also works in multi-dimensions
•Called RBF for short



Radial Basis Functions
•Each training point leads to a bump function:
  f(x; θ) = ∑_{k=1}^N θ_k exp( −‖x − x_k‖² / (2σ²) )
•Reuse the solution from linear regression:  θ* = (XᵀX)⁻¹ Xᵀy
•Can view the data instead as X, a big matrix of size N × N with entries
  X_{jk} = exp( −‖x_j − x_k‖² / (2σ²) )
•For RBFs, X is square and symmetric, so the solution is just
  ∇_θ R = 0  →  XᵀXθ = Xᵀy  →  Xθ = y  →  θ* = X⁻¹y
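
A minimal RBF-regression sketch along these lines (illustrative; the data and σ are made up): build the N × N matrix of Gaussian bumps between training points, solve Xθ = y, then predict at new points with the same bumps.

    import numpy as np

    rng = np.random.default_rng(6)
    x = np.sort(rng.uniform(0, 5, size=20))               # 1-D training inputs
    y = np.cos(x) + rng.normal(scale=0.05, size=20)
    sigma = 0.5

    def rbf_matrix(a, b, sigma):
        # Entry (j, k) = exp(-|a_j - b_k|^2 / (2 sigma^2))
        d2 = (a[:, None] - b[None, :]) ** 2
        return np.exp(-d2 / (2 * sigma ** 2))

    X = rbf_matrix(x, x, sigma)                           # square, symmetric N x N
    theta = np.linalg.solve(X, y)                         # X theta = y  =>  theta = X^{-1} y

    x_new = np.linspace(0, 5, 5)
    f_new = rbf_matrix(x_new, x, sigma) @ theta           # f(x) = sum_k theta_k exp(-|x - x_k|^2 / (2 sigma^2))
    print(f_new)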

Evaluating Our Learned Function
•We minimized empirical risk to get θ*
•How well does f(x; θ*) perform on future data?
•It should Generalize and have low True Risk:
  R_true(θ) = ∫ P(x, y) L(y, f(x; θ)) dx dy
•Can't compute the true risk, so instead use the Testing Empirical Risk
•We randomly split the data into training and testing portions:
  {(x_1, y_1), …, (x_N, y_N)}   and   {(x_{N+1}, y_{N+1}), …, (x_{N+M}, y_{N+M})}
•Find θ* with the training data:  R_train(θ) = (1/N) ∑_{i=1}^N L(y_i, f(x_i; θ))
•Evaluate it with the testing data:  R_test(θ) = (1/M) ∑_{i=N+1}^{N+M} L(y_i, f(x_i; θ))

Crossvalidation
•Try fitting with different radial basis function widths σ
•Select the σ which gives the lowest R_test(θ*)
[Figure: training and testing loss vs. 1/σ; underfitting at low 1/σ (wide bumps), overfitting at high 1/σ (thin bumps), best σ in between]
•Think of σ as a measure of the simplicity of the model
•Thinner RBFs are more flexible and complex

Regularized Risk Minimization
•Empirical Risk Minimization gave overfitting & underfitting
•We want to add a penalty for using too many theta values
•This gives us the Regularized Risk:
  R_regularized(θ) = R_empirical(θ) + Penalty(θ)
                   = (1/N) ∑_{i=1}^N L(y_i, f(x_i; θ)) + (λ/2N) ‖θ‖²
•Solution for the Regularized Risk with the least squares loss:
  ∇_θ R_regularized = 0  ⇒  ∇_θ ( (1/2N) ‖y − Xθ‖² + (λ/2N) ‖θ‖² ) = 0
  θ* = (XᵀX + λI)⁻¹ Xᵀy
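
A sketch of this regularized solution (illustrative data; whether to exclude the intercept from the penalty is a modeling choice, here all of θ is penalized as in the formula above):

    import numpy as np

    rng = np.random.default_rng(7)
    N, D = 40, 16
    X = np.column_stack([np.ones(N), rng.normal(size=(N, D - 1))])
    theta_true = rng.normal(size=D)
    y = X @ theta_true + rng.normal(scale=0.2, size=N)

    def ridge(X, y, lam):
        D = X.shape[1]
        # theta* = (X^T X + lambda I)^{-1} X^T y
        return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

    for lam in [0.0, 0.1, 10.0]:
        theta = ridge(X, y, lam)
        print(lam, np.linalg.norm(theta))   # larger lambda shrinks the norm of theta

    # lambda = 0 recovers ordinary empirical risk minimization (least squares)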

Regularized Risk Minimization
•Have D=16 features (or P=15 throughout)
•Try minimizing R_regularized(θ) to get θ* with different λ
•Note that λ=0 gives back Empirical Risk Minimization

Crossvalidation
•Try fitting with different regularization levels λ
•Select the λ which gives the lowest R_test(θ*)
[Figure: training and testing risk vs. 1/λ; underfitting at low 1/λ, overfitting at high 1/λ, best λ in between]
•λ measures the simplicity of the model
•Models with low λ are more flexible

From Regression To Classification
•Classification is another important learning problem
  Regression:      X = {(x_1, y_1), (x_2, y_2), …, (x_N, y_N)},  x ∈ R^D,  y ∈ R^1
  Classification:  X = {(x_1, y_1), (x_2, y_2), …, (x_N, y_N)},  x ∈ R^D,  y ∈ {0,1}
•E.g. given x = [tumor size, tumor density], predict y in {benign, malignant}
•Should we solve this as a least squares regression problem?
[Figure: scatter plot of two classes, x's and o's]

Classification vs. Regression
a) Classification needs binary answers like {0,1}
b) Least squares is an unfair measure of risk here
   e.g. Why penalize a correct but large positive y answer?
   e.g. Why penalize a correct but large negative y answer?
•Example: it is not good to use the regression output for a decision
  f(x) > 0.5 → Class 1,   f(x) < 0.5 → Class 0
  if f(x) = −3.8 and the correct class is 0, squared error still penalizes it…
[Figure: 1-D example with x's near y=1 and o's near y=0. We pay a hefty squared error loss here even if we got the correct classification result: the thick solid line model makes two mistakes while the dashed model is perfect.]

Classification vs. Regression


We will consider the following four steps to improve on naïve regression and get better classification learning:

1) Fix functions f(x) to give binary output (logistic neuron)


2) Fix our definition of the Risk we will minimize so that we
get good classification accuracy (logistic loss)

…and later on…

3) Make an even better fix on f(x) to binarize (perceptron)


4) Make an even better risk (perceptron loss)

Logistic Neuron (McCulloch-Pitts)
•To output binary values, use a squashing function g():
  Linear neuron:    f(x; θ) = θᵀx
  Logistic neuron:  f(x; θ) = g(θᵀx),   g(z) = (1 + exp(−z))⁻¹
•This squashing function is called the sigmoid or logistic function



Logistic Regression
•Given a classification problem with binary outputs:
  X = {(x_1, y_1), (x_2, y_2), …, (x_N, y_N)},  x ∈ R^D,  y ∈ {0,1}
•Use this function and output 1 if f(x) > 0.5 and 0 otherwise:
  f(x; θ) = (1 + exp(−θᵀx))⁻¹

Short hand for Linear Functions
•What happened to adding the intercept?
  f(x; θ) = θᵀx + θ_0
•Absorb it by appending θ_0 to the parameter vector and a constant 1 to the input:
  [θ(1), …, θ(D)]ᵀ [x(1), …, x(D)] + θ_0 = [θ_0, θ(1), …, θ(D)]ᵀ [1, x(1), …, x(D)]
  so we can keep writing f(x; θ) = θᵀx with the augmented θ and x

Logistic Regression
•Given a classification problem with binary outputs:
  X = {(x_1, y_1), (x_2, y_2), …, (x_N, y_N)},  x ∈ R^D,  y ∈ {0,1}
•Fix #1: use the f(x) below; output 1 if f(x) > 0.5 and 0 otherwise
  f(x; θ) = (1 + exp(−θᵀx))⁻¹
•Fix #2: instead of the squared loss, use the Logistic Loss
  L_log(y_i, f(x_i; θ)) = (y_i − 1) log(1 − f(x_i; θ)) − y_i log f(x_i; θ)
•This method is called Logistic Regression
•But Empirical Risk Minimization has no closed-form solution:
  R_emp(θ) = (1/N) ∑_{i=1}^N [ (y_i − 1) log(1 − f(x_i; θ)) − y_i log f(x_i; θ) ]

Logistic Regression
•With the logistic squashing function, minimizing R(θ) is harder:
  R_emp(θ) = (1/N) ∑_{i=1}^N [ (y_i − 1) log(1 − f(x_i; θ)) − y_i log f(x_i; θ) ]
  ∇_θ R = (1/N) ∑_{i=1}^N ( (1 − y_i)/(1 − f(x_i; θ)) − y_i/f(x_i; θ) ) f′(x_i; θ) = 0   ???
•Can't minimize the risk and find the best θ analytically!
•Let's try finding the best θ numerically
•Use the following to compute the gradient:
  f(x; θ) = (1 + exp(−θᵀx))⁻¹ = g(θᵀx)
•Here, g() is the logistic squashing function:
  g(z) = (1 + exp(−z))⁻¹,   g′(z) = g(z)(1 − g(z))

Gradient Descent
•Useful when we can't get the minimum solution in closed form
•The gradient points in the direction of fastest increase
•Take a step in the opposite direction!
•Gradient Descent Algorithm:
  choose a scalar step size η and a tolerance ε
  initialize θ^0 = small random vector
  θ^1 = θ^0 − η ∇_θ R_emp |_{θ^0},   t = 1
  while ‖θ^t − θ^{t−1}‖ ≥ ε  {  θ^{t+1} = θ^t − η ∇_θ R_emp |_{θ^t},   t = t + 1  }
•For an appropriate η, this will converge to a local minimum
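
A sketch of gradient descent applied to the logistic-regression risk above (made-up data; the step size, tolerance, and iteration cap are arbitrary choices). For the logistic loss the per-example gradient simplifies to (f(x_i; θ) − y_i) x_i, which is what the code uses.

    import numpy as np

    rng = np.random.default_rng(8)
    N = 200
    X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])        # rows [1, x(1), x(2)]
    w_true = np.array([-0.5, 2.0, -1.5])
    y = (1 / (1 + np.exp(-X @ w_true)) > rng.uniform(size=N)).astype(float)   # binary labels

    def f(X, theta):
        return 1.0 / (1.0 + np.exp(-X @ theta))          # logistic squashing of theta^T x

    def grad_risk(theta, X, y):
        # (1/N) sum_i (f(x_i; theta) - y_i) x_i  -- gradient of the logistic-loss risk
        return X.T @ (f(X, theta) - y) / len(y)

    eta, eps = 0.5, 1e-6                                 # step size and tolerance (arbitrary)
    theta = rng.normal(scale=0.01, size=X.shape[1])      # small random initialization
    for t in range(100000):
        step = eta * grad_risk(theta, X, y)
        theta = theta - step
        if np.linalg.norm(step) < eps:                   # stop when the update is tiny
            break
    print(theta)                                         # predict 1 where f(X, theta) > 0.5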

Logistic Regression
•Logistic regression gives better classification performance
•Its empirical risk is
  R_emp(θ) = (1/N) ∑_{i=1}^N [ (y_i − 1) log(1 − f(x_i; θ)) − y_i log f(x_i; θ) ]
•This R(θ) is convex, so gradient descent always converges to the same solution
•Make predictions using
  f(x; θ) = (1 + exp(−θᵀx))⁻¹
•Output 1 if f > 0.5
•Output 0 otherwise
