Machine Learning

This document provides an overview of topics covered in Machine Learning 4771 taught by Tony Jebara at Columbia University. It includes tutorials on using Matlab for machine learning problems and reviews concepts like logistic neurons, gradient descent, perceptrons, stochastic gradient descent, and backpropagation in neural networks. It also discusses multi-layer neural networks and how they can handle more complex decisions compared to single layer networks.

Machine Learning 4771


Instructor: Tony Jebara

Topic 4

Tutorial: Matlab
Review: Logistic Neuron & Gradient Descent
Perceptron
Online Perceptron and Stochastic Gradient Descent
Convergence Guarantee
Perceptron vs. Linear Regression
Multi-Layer Neural Networks
Back-Propagation
Demo: LeNet

Tutorial: Matlab

Machine learning code is best written in Matlab
Online info to get started is available at: http://www.cs.columbia.edu/~jebara/tutorials.html
  How to get access to Matlab
  Matlab tutorials
  List of Matlab function calls
  Example code: homework #1 will use polyreg.m
General: help, lookfor, 1:N, rand, zeros, A', reshape, size
Math: max, min, cov, mean, norm, inv, pinv, det, sort, eye
Control: if, for, while, end, %, function, return, clear
Display: figure, clf, axis, close, plot, subplot, hold on, fprintf
Input/Output: load, save, ginput, print
BBS and TAs are also helpful
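As a quick warm-up with a few of the calls listed above, here is a minimal Matlab/Octave sketch; the data, sizes, and variable names are illustrative and not from the course materials:

N = 100; D = 2;                            % number of points and input dimension
X = rand(N, D);                            % N x D matrix of random inputs
y = 2*X(:,1) - X(:,2) + 0.1*randn(N,1);    % noisy linear targets

theta = pinv(X) * y;                       % least-squares fit via the pseudo-inverse
fprintf('learned weights: %.3f %.3f\n', theta(1), theta(2));

figure; clf;                               % plot predictions against targets
plot(y, X*theta, 'x'); hold on;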

Logistic Neuron

Linear neuron (McCulloch-Pitts):  f(x; \theta) = \theta^T x

Another choice of last node is a squashing function g(\cdot):

g(z) = \frac{1}{1 + \exp(-z)}

Logistic neuron:  f(x; \theta) = g(\theta^T x)

This squashing function is called the sigmoid or logistic function.
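A minimal sketch of the two neurons above in Matlab; the anonymous-function names are my own labels, not notation from the slides:

g = @(z) 1 ./ (1 + exp(-z));             % sigmoid / logistic squashing function
f_linear   = @(x, theta) theta' * x;     % McCulloch-Pitts linear neuron
f_logistic = @(x, theta) g(theta' * x);  % logistic neuron

theta = [1; -2];  x = [0.5; 0.3];
f_linear(x, theta)                       % unbounded real output
f_logistic(x, theta)                     % squashed into (0, 1)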

Logistic Neuron Gradient Descent

Example: [figure omitted]

Need to pick the step size scalar \eta well (each step reduces R)
Too small: slow; too large: unstable
Need to avoid flat regions in the \theta space (slow)
i.e. make sure we are operating in the linear regime of the squashing function
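To make the step-size discussion concrete, here is a sketch of plain gradient descent on the squared-error risk R(\theta) = \frac{1}{N} \sum_n \frac{1}{2}(y_n - g(\theta^T x_n))^2 for a logistic neuron; the toy data, the value of eta, and the iteration count are arbitrary illustrative choices:

g  = @(z) 1 ./ (1 + exp(-z));
gp = @(z) g(z) .* (1 - g(z));                 % derivative of the sigmoid

X = [randn(50,2)+1; randn(50,2)-1];           % toy data: two Gaussian blobs
y = [ones(50,1); zeros(50,1)];                % targets in {0,1} to match g's (0,1) range
N = size(X,1);

theta = zeros(2,1);
eta   = 0.5;                                  % step size: too small -> slow, too large -> unstable
for t = 1:200
    a     = X * theta;                        % activations theta'*x_n for all n
    grad  = -(1/N) * X' * ((y - g(a)) .* gp(a));   % dR/dtheta
    theta = theta - eta * grad;               % take a step downhill
end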

Classification vs. Regression

Why is regression & squared error bad for classification?

Regression:      X = \{(x_1,y_1), (x_2,y_2), \ldots, (x_N,y_N)\},  x \in \mathbb{R}^D,  y \in \mathbb{R}^1
Classification:  X = \{(x_1,y_1), (x_2,y_2), \ldots, (x_N,y_N)\},  x \in \mathbb{R}^D,  y \in \{-1,+1\}

E.g. x is the size & density of a tumor; predict y, whether it is benign or malignant.

Could convert this into a least squares regression problem.
[figure: two classes of points (x's and O's) separated by a linear boundary]

Why is this bad?

Squashing & Classification

We can convert regression output into a decision:
f(x) > 0 \Rightarrow Class 1,   f(x) < 0 \Rightarrow Class 0
If f(x) = 3.8 the class is 1, but squared error penalizes it.
[figure: a point far on the correct side of the boundary gives large regression error but good classification]

Fix 1) Squashing function g(\cdot) into \{-1,1\} with squared loss:

f(x; \theta) = g(\theta^T x)   and   L(y, f(x; \theta)) = \frac{1}{4} (y - f(x; \theta))^2

Or Fix 2) use Classification Loss (aka Classification Error):

L(y, f(x; \theta)) = step(-y \, f(x; \theta))
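A tiny numerical check of the slide's example point (f(x) = 3.8, true label y = +1); here the raw squared error is shown without the 1/4 scaling, since the output is not squashed into {-1,1}:

y = 1;  f = 3.8;                      % correctly classified, but far from the boundary
sq_err     = (y - f)^2                % = 7.84: squared error penalizes the correct answer
class_loss = double(sign(f) ~= y)     % = 0: classification loss step(-y*f) gives no penalty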

Perceptron (another Neuron)

A better choice of classification squashing function is

g(z) = -1 when z < 0,  +1 when z \geq 0,   so g(z) \in \{-1, 1\}

Equivalent to the classification loss function

R(\theta) = \frac{1}{4N} \sum_{i=1}^N \left( y_i - g(\theta^T x_i) \right)^2 = \frac{1}{N} \sum_{i=1}^N step(-y_i \theta^T x_i)

What does this R(\theta) function look like?
Qbert-like (a staircase): no gradient descent, since the gradient is zero except at edges where a label flips.

Perceptron

Instead of Classification Loss, let's try Perceptron Loss:

R(\theta) = \frac{1}{N} \sum_{i=1}^N step(-y_i \theta^T x_i)   \rightarrow   R_{per}(\theta) = -\frac{1}{N} \sum_{i \, misclassified} y_i \theta^T x_i

Instead of a staircase-shaped R we get a smooth, piece-wise linear R with a reasonable gradient:

\nabla R_{per}(\theta) = -\frac{1}{N} \sum_{i \, misclassified} y_i x_i

\theta^{t+1} = \theta^t - \eta \nabla R_{per} = \theta^t + \frac{\eta}{N} \sum_{i \, misclassified} y_i x_i
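A sketch of one possible batch implementation of this update, assuming synthetic separable data and eta = 1; none of the variable names come from the course code:

N = 100; D = 2;
X = [randn(N/2,D)+2; randn(N/2,D)-2];      % two well-separated blobs
y = [ones(N/2,1); -ones(N/2,1)];           % labels in {-1,+1}

theta = zeros(D,1);  eta = 1;
for t = 1:100
    miss = find(y .* (X*theta) <= 0);      % misclassified (or on-boundary) points
    if isempty(miss), break; end           % perceptron loss is zero: stop
    grad  = -(1/N) * X(miss,:)' * y(miss); % gradient of R_per(theta)
    theta = theta - eta * grad;            % theta <- theta + (eta/N)*sum_i y_i x_i
end
nerrors = sum(sign(X*theta) ~= y)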

Perceptron vs. Linear Regression

Linear regression gets close but doesn't classify perfectly: errors = 2, squared error = 0.139.
The perceptron can potentially get 0 errors: errors = 0, perceptron error = 0.

Stochastic Gradient Descent

Gradient Descent vs. Stochastic Gradient Descent:
Instead of computing the average gradient over all points and then taking a step,

\nabla R_{per}(\theta) = -\frac{1}{N} \sum_{i \, misclassified} y_i x_i

update using the gradient of each mis-classified point by itself:

\nabla R_{per}(\theta) = -y_i x_i   if i is mis-classified
\theta^{t+1} = \theta^t - \eta \nabla R_{per} = \theta^t + \eta \, y_i x_i   if i is mis-classified

Also, set \eta to 1 without loss of generality.

Online Perceptron

Iterate, cycling through the examples i = 1 \ldots N one at a time:
  If the ith example is properly classified:  \theta^{t+1} = \theta^t
  Else:  \theta^{t+1} = \theta^t + y_i x_i

Theorem: this converges to zero error in a finite total number of update steps t, assuming
1) all data lies inside a sphere of radius r:  \|x_i\| \leq r  \forall i
2) the data is separable by some \theta^* with margin \gamma:  y_i (\theta^*)^T x_i \geq \gamma  \forall i

Part 1) Look at the inner product of the current \theta^t with \theta^*.
Assume we just updated a mistake on point i:

(\theta^*)^T \theta^t = (\theta^*)^T (\theta^{t-1} + y_i x_i) = (\theta^*)^T \theta^{t-1} + y_i (\theta^*)^T x_i \geq (\theta^*)^T \theta^{t-1} + \gamma

After applying t such updates, we must get:  (\theta^*)^T \theta^t \geq t \gamma
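A minimal sketch of the online perceptron loop above, again on synthetic separable data; the epoch cap is an arbitrary safeguard:

N = 100; D = 2;
X = [randn(N/2,D)+2; randn(N/2,D)-2];
y = [ones(N/2,1); -ones(N/2,1)];

theta   = zeros(D,1);
updates = 0;
for epoch = 1:50
    mistakes = 0;
    for i = 1:N                               % cycle through examples one at a time
        if y(i) * (theta' * X(i,:)') <= 0     % i-th example is misclassified
            theta    = theta + y(i) * X(i,:)';    % perceptron update
            updates  = updates + 1;
            mistakes = mistakes + 1;
        end
    end
    if mistakes == 0, break; end              % converged: zero training error
end
updates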

Online Perceptron Proof

Part 1)  (\theta^*)^T \theta^t \geq t \gamma

Part 2)  \|\theta^t\|^2 = \|\theta^{t-1} + y_i x_i\|^2 = \|\theta^{t-1}\|^2 + 2 y_i (\theta^{t-1})^T x_i + \|x_i\|^2
         \leq \|\theta^{t-1}\|^2 + \|x_i\|^2        (since we only update mistakes, the middle term is non-positive)
         \leq \|\theta^{t-1}\|^2 + r^2
After t such updates:  \|\theta^t\|^2 \leq t r^2

Part 3)  Angle between the optimal & current solution:

\cos(\theta^*, \theta^t) = \frac{(\theta^*)^T \theta^t}{\|\theta^*\| \, \|\theta^t\|} \geq \frac{t \gamma}{\|\theta^*\| \sqrt{t r^2}}        (apply Part 1, then Part 2)

Since \cos \leq 1:   \frac{t \gamma}{\|\theta^*\| \sqrt{t r^2}} \leq 1   \Rightarrow   t \leq \frac{r^2 \|\theta^*\|^2}{\gamma^2}    so t is finite!
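As a sanity check of the bound t \leq r^2 \|\theta^*\|^2 / \gamma^2, the following sketch generates separable data from a known separator, runs the online perceptron, and compares the number of updates against the bound; treating the generating weight vector as \theta^* is an assumption made only for this illustration:

N = 200; D = 2;
theta_star = [1; -1];
X = randn(N, D);
X = X(abs(X * theta_star) > 0.3, :);          % keep only points with a clear margin
y = sign(X * theta_star);
N = size(X,1);

r     = max(sqrt(sum(X.^2, 2)));              % radius of the data sphere
gamma = min(y .* (X * theta_star));           % margin of theta_star on this data
bound = r^2 * norm(theta_star)^2 / gamma^2;

theta = zeros(D,1);  t = 0;
for epoch = 1:1000
    mistakes = 0;
    for i = 1:N
        if y(i) * (theta' * X(i,:)') <= 0
            theta = theta + y(i) * X(i,:)';  t = t + 1;  mistakes = mistakes + 1;
        end
    end
    if mistakes == 0, break; end
end
fprintf('updates made: %d, theoretical bound: %.1f\n', t, bound);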

Multi-Layer Neural Networks

What if we consider cascading multiple layers of network?
Each layer's output is the input to the next layer.
Each layer has its own weight parameters.
E.g.: each layer has linear nodes (not perceptron/logistic).

[figure: a 2-layer neural network]
The above neural net has 2 layers. What does this give?

Multi-Layer Neural Networks

Need to introduce non-linearities between the layers.
This avoids the previous redundant-linear-layer problem (a cascade of purely linear layers collapses to a single linear map).
The neural network can adjust the basis functions themselves:

f(x) = g\left( \sum_{i=1}^P \alpha_i \, g(\theta_i^T x) \right)
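A sketch of the forward pass of this two-layer network in Matlab; P, the weight values, and the input are arbitrary illustrative choices:

g = @(z) 1 ./ (1 + exp(-z));    % sigmoid non-linearity between layers

D = 3; P = 4;                   % input dimension and number of hidden units
Theta = randn(D, P);            % columns are the hidden-unit weights theta_i
alpha = randn(P, 1);            % output-layer weights alpha_i

x = randn(D, 1);
hidden = g(Theta' * x);         % hidden-layer outputs g(theta_i' * x)
f      = g(alpha' * hidden)     % network output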

Multi-Layer Neural Networks

A multi-layer network can handle more complex decisions:
1-layer: is linear, can't handle XOR (see the sketch after this slide)
Each layer adds more flexibility (but more parameters!)
Each node splits its input space with a linear hyperplane
2-layer: if the last layer is an AND operation, we get a convex hull
2-layer: can do almost anything a multi-layer network can, by fanning out the inputs at the 2nd layer

Note: without loss of generality, we can omit the 1 and \theta_0 (the bias terms).
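A hand-built illustration of the XOR point above: with inputs coded as -1/+1, a 2-layer network of sign units computes XOR, something no single linear layer can do. The particular weights below are my own choice, not from the slides:

gsign = @(z) sign(z) + (z == 0);      % sign squashing, mapping 0 to +1

W1 = [ 1 -1 -1;                       % hidden unit 1 fires only for (+1, -1)
      -1  1 -1];                      % hidden unit 2 fires only for (-1, +1)
w2 = [ 1  1  1];                      % output layer ORs the two hidden units

for x1 = [-1 1]
  for x2 = [-1 1]
    h = gsign(W1 * [x1; x2; 1]);      % hidden layer (constant bias input 1 appended)
    f = gsign(w2 * [h; 1]);           % output layer
    fprintf('x = (%+d,%+d)  ->  f = %+d\n', x1, x2, f);
  end
end

No single linear layer can reproduce this truth table, which is exactly the XOR limitation noted above.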

Back-Propagation

Gradient descent on the squared loss is done layer by layer.
Layers: input, hidden, output.  Parameters: \theta = \{w_{ij}, w_{jk}, w_{kl}\}

Forward pass:
a_j = \sum_i w_{ij} z_i,    z_j = g(a_j)
a_k = \sum_j w_{jk} z_j,    z_k = g(a_k)
a_l = \sum_k w_{kl} z_k,    z_l = g(a_l)

Each input x^n for n = 1..N generates its own a's and z's.
Back-Propagation splits each layer into its inputs & outputs:
get the gradient at the output, then back-track with the chain rule until the input.

Cost function:

R(\theta) = \frac{1}{N} \sum_n L\left(y^n, f(x^n)\right) = \frac{1}{N} \sum_n \frac{1}{2} \left( y^n - g\left( \sum_k w_{kl} \, g\left( \sum_j w_{jk} \, g\left( \sum_i w_{ij} x_i^n \right) \right) \right) \right)^2

Define  L^n := \frac{1}{2} \left( y^n - f(x^n) \right)^2

First compute the output layer derivative (Chain Rule):

\frac{\partial R}{\partial w_{kl}} = \frac{1}{N} \sum_n \frac{\partial L^n}{\partial a_l^n} \frac{\partial a_l^n}{\partial w_{kl}}
 = \frac{1}{N} \sum_n \frac{\partial}{\partial a_l^n} \left[ \frac{1}{2} \left( y^n - g(a_l^n) \right)^2 \right] \frac{\partial a_l^n}{\partial w_{kl}}
 = \frac{1}{N} \sum_n -\left( y^n - z_l^n \right) g'(a_l^n) \, z_k^n
 = \frac{1}{N} \sum_n \delta_l^n z_k^n,        where we define  \delta_l^n := -\left( y^n - z_l^n \right) g'(a_l^n)

Next, the hidden layer derivative (Multivariate Chain Rule):

\frac{\partial R}{\partial w_{jk}} = \frac{1}{N} \sum_n \frac{\partial L^n}{\partial a_k^n} \frac{\partial a_k^n}{\partial w_{jk}}
 = \frac{1}{N} \sum_n \left( \sum_l \frac{\partial L^n}{\partial a_l^n} \frac{\partial a_l^n}{\partial a_k^n} \right) \frac{\partial a_k^n}{\partial w_{jk}}
 = \frac{1}{N} \sum_n \left( \sum_l \delta_l^n \frac{\partial a_l^n}{\partial a_k^n} \right) z_j^n

Recall  a_l = \sum_k w_{kl} \, g(a_k),  so  \frac{\partial a_l^n}{\partial a_k^n} = w_{kl} \, g'(a_k^n):

 = \frac{1}{N} \sum_n \left( \sum_l \delta_l^n w_{kl} \right) g'(a_k^n) \, z_j^n = \frac{1}{N} \sum_n \delta_k^n z_j^n,        where we define  \delta_k^n := g'(a_k^n) \sum_l w_{kl} \delta_l^n

Any previous (input) layer derivative: repeat the formula!

\frac{\partial R}{\partial w_{ij}} = \frac{1}{N} \sum_n \frac{\partial L^n}{\partial a_j^n} \frac{\partial a_j^n}{\partial w_{ij}}
 = \frac{1}{N} \sum_n \left( \sum_k \delta_k^n \frac{\partial a_k^n}{\partial a_j^n} \right) \frac{\partial a_j^n}{\partial w_{ij}}
 = \frac{1}{N} \sum_n \left( \sum_k \delta_k^n w_{jk} \right) g'(a_j^n) \, z_i^n = \frac{1}{N} \sum_n \delta_j^n z_i^n

What is this last z? At the input layer it is just the input itself: z_i^n = x_i^n.

Back-Propagation

Again, take a small step in the direction opposite to the gradient:

w_{ij}^{t+1} = w_{ij}^t - \eta \frac{\partial R}{\partial w_{ij}},    w_{jk}^{t+1} = w_{jk}^t - \eta \frac{\partial R}{\partial w_{jk}},    w_{kl}^{t+1} = w_{kl}^t - \eta \frac{\partial R}{\partial w_{kl}}

Digits Demo: LeNet  http://yann.lecun.com

Problems with back-prop: the MLP over-fits.
Other problems: hard to interpret, a black box. What are the hidden inner layers doing?
Other main problem: we get minimum training error, not minimum testing error.
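A sketch of back-propagation training that puts the gradient formulas and the update rule together. To keep it short it uses a single hidden layer (the slides derive the two-hidden-layer case, but the delta recursion is identical); the toy data, eta, and iteration count are illustrative choices:

g  = @(z) 1 ./ (1 + exp(-z));
gp = @(z) g(z) .* (1 - g(z));                % g'(z)

N = 200; D = 2; H = 5;                       % points, inputs, hidden units
X = [randn(N/2,D)+1.5; randn(N/2,D)-1.5];    % toy two-class data
y = [ones(N/2,1); zeros(N/2,1)];             % targets in {0,1} to match the sigmoid output range

W1 = 0.1*randn(H, D);                        % input -> hidden weights (the w_ij)
W2 = 0.1*randn(1, H);                        % hidden -> output weights (the w_kl)
eta = 0.5;

for t = 1:2000
    % forward pass: a_j = sum_i w_ij z_i, z_j = g(a_j), etc.
    A1 = X * W1';        Z1 = g(A1);         % N x H hidden activations and outputs
    A2 = Z1 * W2';       Z2 = g(A2);         % N x 1 output

    % backward pass: delta at the output, then back through the chain rule
    D2 = -(y - Z2) .* gp(A2);                % delta_l = -(y - z_l) g'(a_l)
    D1 = (D2 * W2) .* gp(A1);                % delta_k = g'(a_k) sum_l w_kl delta_l

    gradW2 = (1/N) * D2' * Z1;               % dR/dw_kl = (1/N) sum_n delta_l z_k
    gradW1 = (1/N) * D1' * X;                % dR/dw_ij = (1/N) sum_n delta_j z_i

    W2 = W2 - eta * gradW2;                  % take a small step against the gradient
    W1 = W1 - eta * gradW1;
end

Z2 = g(g(X*W1') * W2');                      % final forward pass
training_error = mean((Z2 > 0.5) ~= y)

In practice one would also track error on held-out data, since, as noted above, back-prop gives minimum training error rather than minimum testing error.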

Minimum Training Error?

Is minimizing Empirical Risk the right thing?
Are Perceptrons and Neural Networks giving the best classifier?
We are getting minimum training error, not minimum testing error.

Perceptrons give a whole bunch of zero-error solutions; a better solution: SVMs.
