Machine Learning
Topic 4
Tutorial: Matlab
Review: Logistic Neuron & Gradient Descent
Perceptron
Online Perceptron and Stochastic Gradient Descent
Convergence Guarantee
Perceptron vs. Linear Regression
Multi-Layer Neural Networks
Back-Propagation
Demo: LeNet
Tutorial: Matlab
Machine learning code is best written in Matlab.
Online info to get started is available at:
http://www.cs.columbia.edu/~jebara/tutorials.html
  How to get access to Matlab
  Matlab tutorials
  List of Matlab function calls
Example code: homework #1 will use polyreg.m
  General: help, lookfor, 1:N, rand, zeros, A, reshape, size
  Math: max, min, cov, mean, norm, inv, pinv, det, sort, eye
  Control: if, for, while, end, %, function, return, clear
  Display: figure, clf, axis, close, plot, subplot, hold on, fprintf
  Input/Output: load, save, ginput, print
BBS and TAs are also helpful.
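For example, a tiny warm-up script using a few of the calls listed above (the variable names here are just illustrative, not part of the tutorial):

```matlab
% Minimal Matlab warm-up: random data, a few statistics, and a plot.
N = 100;                    % number of points
X = rand(N, 2);             % N-by-2 matrix of uniform random inputs
mu = mean(X);               % column-wise mean
C  = cov(X);                % 2x2 covariance matrix
figure; clf;
plot(X(:,1), X(:,2), 'x');  % scatter of the two columns
hold on;
plot(mu(1), mu(2), 'o');    % overlay the mean
fprintf('determinant of covariance = %g\n', det(C));
```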
Logistic Neuron
Linear neuron (McCullough-Pitts): $f(x;\theta) = \theta^T x$
Logistic neuron: $f(x;\theta) = g(\theta^T x)$ with squashing function $g(z) = \frac{1}{1+\exp(-z)}$
Need to pick the step size scalar well (each step reduces R):
  too small → slow, too large → unstable.
Need to avoid flat regions in the space (slow),
  i.e. make sure we are operating in the linear regime of the squashing function.
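As a concrete illustration, here is a minimal Matlab sketch of gradient descent on the logistic neuron under squared loss; the function name, the fixed iteration count T, and the constant step size eta are illustrative assumptions, not something prescribed by the slides:

```matlab
% Gradient descent for a logistic neuron with squared loss (sketch).
% X is N-by-D, y is N-by-1; eta is the step size scalar, T the number of steps.
function theta = logisticneuron(X, y, eta, T)
  [N, D] = size(X);
  theta = zeros(D, 1);
  g = @(z) 1 ./ (1 + exp(-z));            % squashing function
  for t = 1:T
    z    = X * theta;                     % N-by-1 pre-activations
    f    = g(z);                          % predictions f(x_i; theta)
    gp   = f .* (1 - f);                  % g'(z) = g(z)(1 - g(z))
    grad = -(2/N) * X' * ((y - f) .* gp); % dR/dtheta for squared loss
    theta = theta - eta * grad;           % small step against the gradient
  end
end
```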
Given data $\{(x_1,y_1),\ldots,(x_N,y_N)\}$ with $x \in \mathbb{R}^D$; instead of $y \in \mathbb{R}^1$, the outputs are now binary labels $y \in \{-1,+1\}$.
[Figure: scatter plot of the two classes, x's and O's.]
Measure the (scaled) squared loss of a squashing function with $g(z) \in \{-1,+1\}$:
$R(\theta) = \frac{1}{4N}\sum_{i=1}^{N}\big(y_i - g(\theta^T x_i)\big)^2 = \frac{1}{N}\sum_{i=1}^{N}\operatorname{step}\big(-y_i\,\theta^T x_i\big)$
i.e. the fraction of training points that are mis-classified.
Perceptron
A better choice of squashing function for classification is the sign function:
$g(z) = \begin{cases} -1 & \text{when } z < 0 \\ +1 & \text{when } z \ge 0 \end{cases}$
so that $g(z) \in \{-1,+1\}$ and the loss
$R(\theta) = \frac{1}{4N}\sum_{i=1}^{N}\big(y_i - g(\theta^T x_i)\big)^2 = \frac{1}{N}\sum_{i=1}^{N}\operatorname{step}\big(-y_i\,\theta^T x_i\big)$
directly counts classification errors.
What does this R(θ) function look like? Qbert-like (piecewise constant), so no gradient descent is possible:
the gradient is zero everywhere except at the edges where a label flips.
Perceptron
Instead of the classification loss $R(\theta) = \frac{1}{N}\sum_{i=1}^{N}\operatorname{step}(-y_i\,\theta^T x_i)$, let's try the Perceptron Loss:
$R_{per}(\theta) = \frac{1}{N}\sum_{i\,\mathrm{misclassified}} \big(-y_i\,\theta^T x_i\big)$
Its gradient is
$\frac{\partial R_{per}(\theta)}{\partial\theta} = -\frac{1}{N}\sum_{i\,\mathrm{misclassified}} y_i\, x_i$
so gradient descent takes the step
$\theta^{t+1} = \theta^t - \eta\,\frac{\partial R_{per}}{\partial\theta} = \theta^t + \frac{\eta}{N}\sum_{i\,\mathrm{misclassified}} y_i\, x_i$
The stochastic (single-example) version uses $R_{per}(\theta) = -y_i\,\theta^T x_i$ if $i$ is mis-classified, giving
$\theta^{t+1} = \theta^t - \eta\,\nabla R_{per} = \theta^t + \eta\, y_i\, x_i$  if $i$ is mis-classified.
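A minimal Matlab sketch of the batch version of this update (the function name, the step size eta, and the iteration cap T are illustrative assumptions):

```matlab
% Batch gradient descent on the perceptron loss (sketch).
% X is N-by-D, y is N-by-1 with entries in {-1,+1}.
function theta = perceptronbatch(X, y, eta, T)
  [N, D] = size(X);
  theta = zeros(D, 1);
  for t = 1:T
    margins = y .* (X * theta);          % y_i * theta' * x_i for every point
    mis     = margins <= 0;              % mis-classified points (zero margin counted as mistake)
    if ~any(mis), break; end             % stop when everything is correct
    grad  = -(1/N) * X(mis,:)' * y(mis); % dR_per/dtheta over the mistakes
    theta = theta - eta * grad;          % theta <- theta + (eta/N) sum y_i x_i
  end
end
```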
Online Perceptron
Iterate, cycling through the examples i = 1…N one at a time:
  If the i-th example is properly classified: $\theta^{t+1} = \theta^t$
  Else: $\theta^{t+1} = \theta^t + y_i x_i$
Theorem: this converges to zero error (a separator $\theta^*$) in a finite total number of update steps t, provided
  1) all data lie inside a sphere of radius r: $\|x_i\| \le r\ \ \forall i$
  2) the data are separable with margin $\gamma$: $y_i\,(\theta^*)^T x_i \ge \gamma\ \ \forall i$
Part 1) Look at the inner product of the current $\theta^t$ with $\theta^*$; assume we just updated on a mistake at point i:
$(\theta^*)^T \theta^t = (\theta^*)^T\big(\theta^{t-1} + y_i x_i\big) = (\theta^*)^T \theta^{t-1} + y_i\,(\theta^*)^T x_i \ \ge\ (\theta^*)^T \theta^{t-1} + \gamma$
Applying this over the t updates (starting from $\theta^0 = 0$): $(\theta^*)^T \theta^t \ge t\gamma$.
Part 2)
$\|\theta^t\|^2 = \|\theta^{t-1} + y_i x_i\|^2 = \|\theta^{t-1}\|^2 + 2\, y_i\,(\theta^{t-1})^T x_i + \|x_i\|^2 \ \le\ \|\theta^{t-1}\|^2 + r^2$
since point i was mis-classified by $\theta^{t-1}$, i.e. $y_i\,(\theta^{t-1})^T x_i \le 0$. Applying this over the t updates: $\|\theta^t\|^2 \le t\, r^2$.
Combining the two parts:
$\cos\angle(\theta^*,\theta^t) = \frac{(\theta^*)^T \theta^t}{\|\theta^*\|\,\|\theta^t\|} \ \ge\ \frac{t\gamma}{\sqrt{t\, r^2}\,\|\theta^*\|}$
Since $\cos \le 1$:
$\frac{t\gamma}{\sqrt{t\, r^2}\,\|\theta^*\|} \le 1 \quad\Rightarrow\quad t \le \frac{r^2\,\|\theta^*\|^2}{\gamma^2}$
so only a finite number of updates can ever be made.
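A minimal Matlab sketch of the online perceptron above, with a counter for the total number of updates t that the bound $t \le r^2\|\theta^*\|^2/\gamma^2$ limits (the function name and the cap on passes are illustrative assumptions):

```matlab
% Online perceptron: cycle through the examples, update only on mistakes (sketch).
% X is N-by-D, y is N-by-1 with entries in {-1,+1}.
function [theta, mistakes] = perceptrononline(X, y, maxpasses)
  [N, D] = size(X);
  theta = zeros(D, 1);
  mistakes = 0;                          % total updates t in the convergence bound
  for pass = 1:maxpasses
    errors = 0;
    for i = 1:N
      if y(i) * (X(i,:) * theta) <= 0    % i-th example mis-classified
        theta = theta + y(i) * X(i,:)';  % theta^{t+1} = theta^t + y_i x_i
        mistakes = mistakes + 1;
        errors = errors + 1;
      end
    end
    if errors == 0, break; end           % training data separated, stop
  end
end
```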
Multi-Layer Neural Networks
Chain neurons together: feed the outputs of P squashed linear neurons into another neuron,
$f(x) = g\Big(\textstyle\sum_{i=1}^{P} \alpha_i\, g\big(\theta_i^T x\big)\Big)$
Back-Propagation
Gradient descent on the squared loss is done layer by layer.
Layers: input, hidden, output. Parameters: $\theta = \{w_{ij},\, w_{jk},\, w_{kl}\}$
  $a_j = \sum_i w_{ij}\, z_i$    $z_j = g(a_j)$
  $a_k = \sum_j w_{jk}\, z_j$    $z_k = g(a_k)$
  $a_l = \sum_k w_{kl}\, z_k$    $z_l = g(a_l)$
Each input $x_n$ for n = 1…N generates its own a's and z's.
Back-Propagation splits each layer into its inputs & outputs:
get the gradient at the output and back-track with the chain rule until the input.
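A minimal Matlab sketch of this forward pass for a single input x, assuming a logistic squashing function (the function name and matrix shapes are illustrative assumptions):

```matlab
% Forward pass through the three weight layers w_ij, w_jk, w_kl (sketch).
% x is D-by-1; Wij is H1-by-D, Wjk is H2-by-H1, Wkl is 1-by-H2.
function [zl, zk, zj, al, ak, aj] = forwardpass(x, Wij, Wjk, Wkl)
  g  = @(z) 1 ./ (1 + exp(-z));  % squashing function
  zi = x;                        % inputs z_i
  aj = Wij * zi;  zj = g(aj);    % a_j = sum_i w_ij z_i,  z_j = g(a_j)
  ak = Wjk * zj;  zk = g(ak);    % a_k = sum_j w_jk z_j,  z_k = g(a_k)
  al = Wkl * zk;  zl = g(al);    % a_l = sum_k w_kl z_k,  z_l = g(a_l)
end
```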
Back-Propagation
Cost function:
$R(\theta) = \frac{1}{N}\sum_{n=1}^{N} L\big(y_n, f(x_n)\big) = \frac{1}{N}\sum_n \tfrac{1}{2}\Big(y_n - g\big(\textstyle\sum_k w_{kl}\, g\big(\sum_j w_{jk}\, g\big(\sum_i w_{ij}\, x_{ni}\big)\big)\big)\Big)^2$
define $L_n := \tfrac{1}{2}\big(y_n - f(x_n)\big)^2$

Output weights (Chain Rule):
$\frac{\partial R}{\partial w_{kl}} = \frac{1}{N}\sum_n \frac{\partial L_n}{\partial a_l^n}\,\frac{\partial a_l^n}{\partial w_{kl}} = \frac{1}{N}\sum_n \Big(-(y_n - z_l^n)\, g'(a_l^n)\Big)\, z_k^n = \frac{1}{N}\sum_n \delta_l^n\, z_k^n$
where we define $\delta_l^n := -(y_n - z_l^n)\, g'(a_l^n)$.

Hidden weights:
$\frac{\partial R}{\partial w_{jk}} = \frac{1}{N}\sum_n \frac{\partial L_n}{\partial a_k^n}\,\frac{\partial a_k^n}{\partial w_{jk}} = \frac{1}{N}\sum_n \Big(\textstyle\sum_l \frac{\partial L_n}{\partial a_l^n}\,\frac{\partial a_l^n}{\partial a_k^n}\Big)\, z_j^n = \frac{1}{N}\sum_n \Big(\textstyle\sum_l \delta_l^n\, w_{kl}\Big)\, g'(a_k^n)\, z_j^n = \frac{1}{N}\sum_n \delta_k^n\, z_j^n$
where $\delta_k^n := \big(\sum_l \delta_l^n\, w_{kl}\big)\, g'(a_k^n)$.

Input weights:
$\frac{\partial R}{\partial w_{ij}} = \frac{1}{N}\sum_n \frac{\partial L_n}{\partial a_j^n}\,\frac{\partial a_j^n}{\partial w_{ij}} = \frac{1}{N}\sum_n \Big(\textstyle\sum_k \frac{\partial L_n}{\partial a_k^n}\,\frac{\partial a_k^n}{\partial a_j^n}\Big)\, z_i^n = \frac{1}{N}\sum_n \Big(\textstyle\sum_k \delta_k^n\, w_{jk}\Big)\, g'(a_j^n)\, z_i^n = \frac{1}{N}\sum_n \delta_j^n\, z_i^n$
where $\delta_j^n := \big(\sum_k \delta_k^n\, w_{jk}\big)\, g'(a_j^n)$.
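A minimal Matlab sketch of these back-propagated gradients for a single example (x_n, y_n), assuming a logistic squashing function; the function name and matrix shapes are illustrative assumptions:

```matlab
% Back-propagated gradients for one example (x, y) through w_ij, w_jk, w_kl (sketch).
% x is D-by-1, y is O-by-1; Wij is H1-by-D, Wjk is H2-by-H1, Wkl is O-by-H2.
function [dWij, dWjk, dWkl] = backprop(x, y, Wij, Wjk, Wkl)
  g  = @(z) 1 ./ (1 + exp(-z));
  gp = @(a) g(a) .* (1 - g(a));            % g'(a) for the logistic squashing function
  % forward pass to get the a's and z's
  zi = x;
  aj = Wij * zi;  zj = g(aj);
  ak = Wjk * zj;  zk = g(ak);
  al = Wkl * zk;  zl = g(al);
  % back-propagate the deltas from output to input
  dl = -(y - zl) .* gp(al);                % delta_l = -(y - z_l) g'(a_l)
  dk = (Wkl' * dl) .* gp(ak);              % delta_k = (sum_l delta_l w_kl) g'(a_k)
  dj = (Wjk' * dk) .* gp(aj);              % delta_j = (sum_k delta_k w_jk) g'(a_j)
  % gradient of L_n w.r.t. each weight layer: delta times the layer's input z
  dWkl = dl * zk';
  dWjk = dk * zj';
  dWij = dj * zi';
end
```

Summing these single-example gradients over n = 1…N and dividing by N gives the $\partial R/\partial w$ terms above.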
Back-Propagation
Again, take a small step in the direction opposite to the gradient:
$w_{ij}^{t+1} = w_{ij}^{t} - \eta\,\frac{\partial R}{\partial w_{ij}} \qquad w_{jk}^{t+1} = w_{jk}^{t} - \eta\,\frac{\partial R}{\partial w_{jk}} \qquad w_{kl}^{t+1} = w_{kl}^{t} - \eta\,\frac{\partial R}{\partial w_{kl}}$
Digits Demo: LeNet  http://yann.lecun.com
Problems with back-prop: the MLP over-fits.
Other problems: hard to interpret, a black-box. What are the hidden inner layers doing?
Other main problem: it finds the minimum training error, not the minimum testing error.