20.NeuralNets Short
Jihoon Yang
Machine Learning Research Laboratory
Department of Computer Science & Engineering
Sogang University
Email: [email protected]
URL: mllab.sogang.ac.kr/people/jhyang.html
Neurons and Computation
[Figure: threshold neuron with inputs $x_0 = 1, x_1, \ldots, x_n$, synaptic weights $w_0, w_1, \ldots, w_n$, and output $y$]
$y = 1$ if $\sum_{i=0}^{n} w_i x_i \geq 0$; $y = -1$ otherwise
[Figure: decision boundary $w_1 x_1 + w_2 x_2 + w_0 = 0$ in the $(x_1, x_2)$ plane; the weight vector $(w_1, w_2)$ is normal to the boundary, class $C_1$ lies on the side where $w_1 x_1 + w_2 x_2 + w_0 > 0$ and class $C_2$ where $w_1 x_1 + w_2 x_2 + w_0 < 0$]
$\sum_{i=1}^{n} w_i x_i + w_0 = 0$ describes a hyperplane which divides the instance space $\mathbb{R}^n$ into two half-spaces:
$\{X_p \in \mathbb{R}^n \mid W \cdot X_p + w_0 > 0\}$ and $\{X_p \in \mathbb{R}^n \mid W \cdot X_p + w_0 < 0\}$
McCulloch-Pitts neuron or Threshold neuron
$y = \mathrm{sign}(W \cdot X + w_0) = \mathrm{sign}\left(\sum_{i=0}^{n} w_i x_i\right) = \mathrm{sign}(W^T X + w_0)$
where $\mathrm{sign}(v) = 1$ if $v \geq 0$, and $0$ otherwise
• Instance space $\mathbb{R}^n$: for any two points $X_1, X_2$ on the hyperplane $H$, $W \cdot (X_1 - X_2) = 0$, so $W$ is normal to any vector lying in $H$
• Example
  – $[w_0\ w_1\ w_2]^T = [-1\ {-1}\ 1]^T$, $X_p^T = [1\ 0]^T$: $W \cdot X_p + w_0 = -1 + (-1) = -2 < 0$, so $X_p$ is assigned to class $C_2$
• Example
  – Let w0 = -1.5; w1 = w2 = 1
  – In this case, the threshold neuron implements the logical AND function:

    x1  x2  g(X)   y
    0   0   -1.5  -1
    0   1   -0.5  -1
    1   0   -0.5  -1
    1   1    0.5   1
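As a quick check of this table, here is a minimal sketch (the helper name threshold_neuron and the +/-1 output convention are ours, not from the slides) that evaluates the threshold neuron with w0 = -1.5, w1 = w2 = 1 on the four Boolean inputs:

    # Minimal sketch: threshold neuron implementing logical AND
    def threshold_neuron(x, w, w0):
        g = w0 + sum(wi * xi for wi, xi in zip(w, x))   # g(X) = w0 + w1*x1 + w2*x2
        return 1 if g >= 0 else -1

    w, w0 = [1.0, 1.0], -1.5
    for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print(x, threshold_neuron(x, w, w0))            # -1, -1, -1, 1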
• Example: Exclusive OR is not linearly separable. Why?
[Figure: the four XOR patterns in the $(x_1, x_2)$ plane; no single line separates the two classes]
Terminology and Notation
$S^+ = \{X_k \mid (X_k, d_k) \in E \text{ and } d_k = 1\}$
$S^- = \{X_k \mid (X_k, d_k) \in E \text{ and } d_k = -1\}$
Goal: find $W^*$ such that $\forall X_p \in S^+,\ W^* \cdot X_p > 0$ and $\forall X_p \in S^-,\ W^* \cdot X_p < 0$

Perceptron algorithm:
1. Initialize $W \leftarrow [0\ 0\ \ldots\ 0]^T$
2. Repeat until no training example is misclassified {
3.   For each $(X_k, d_k) \in E$: compute $y_k = \mathrm{sign}(W \cdot X_k)$; if $y_k \neq d_k$, update $W \leftarrow W + (d_k - y_k) X_k$
   }
4. $W^* \leftarrow W$; Return $W^*$
Let
S+ = {(1, 1, 1), (1, 1, -1), (1, 0, -1)}
S- = {(1, -1, -1), (1, -1, 1), (1, 0, 1)}
W = (0 0 0)
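A minimal sketch of running the perceptron updates on this example; the first coordinate 1 of each vector acts as the bias input, and the loop structure is an assumption matching the algorithm above:

    # Minimal sketch: perceptron learning on the S+/S- example above
    import numpy as np

    S_plus  = [(1, 1, 1), (1, 1, -1), (1, 0, -1)]     # d = +1
    S_minus = [(1, -1, -1), (1, -1, 1), (1, 0, 1)]    # d = -1
    data = [(np.array(x), 1) for x in S_plus] + [(np.array(x), -1) for x in S_minus]

    W = np.zeros(3)                       # W = (0 0 0)
    changed = True
    while changed:                        # repeat until every example is classified correctly
        changed = False
        for X, d in data:
            y = 1 if W @ X >= 0 else -1   # y_k = sign(W . X_k)
            if y != d:
                W = W + (d - y) * X       # W <- W + (d_k - y_k) X_k
                changed = True
    print("W* =", W)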
Theorem:
Let $E = \{(X_k, d_k)\}$ be a training set where $X_k \in \mathbb{R}^n \times \{1\}$ and $d_k \in \{-1, 1\}$
Let $S^+ = \{X_k \mid (X_k, d_k) \in E \ \&\ d_k = 1\}$ and $S^- = \{X_k \mid (X_k, d_k) \in E \ \&\ d_k = -1\}$
If $S^+$ and $S^-$ are linearly separable, the perceptron algorithm terminates with a weight vector $W^*$ that separates them after a finite number of weight updates
[Figure: patterns labeled x and o in the input space X and their images under a feature map]
Exclusive OR revisited
[Figure: XOR patterns mapped into the feature space <z1, z2>]
• When mapped into the feature space <z1, z2>, C1 and C2 become
linearly separable. So a linear classifier with φ1(X) and φ2(X) as
inputs can be used to solve the XOR problem.
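To make this concrete, here is a small sketch with one possible feature map (z1 = x1 + x2, z2 = x1*x2 is our own choice; the slides leave φ1, φ2 unspecified) under which a single linear threshold separates the XOR classes:

    # Minimal sketch: XOR becomes linearly separable in a feature space <z1, z2>
    def phi(x1, x2):
        return (x1 + x2, x1 * x2)          # assumed feature map, for illustration only

    def linear_classifier(z1, z2):
        return 1 if z1 - 2 * z2 - 0.5 > 0 else -1   # one separating line in <z1, z2>

    for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print((x1, x2), linear_classifier(*phi(x1, x2)))   # -1, 1, 1, -1 (XOR)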
Learning in the Feature Space
$\gamma = \min_i \frac{y_i f(x_i)}{\|w\|}, \qquad f(x) = \langle w, x \rangle + b$
• Important insight:
The generalization error bound of a classifier trained on a separable data set is inversely related to its margin and is independent of the dimensionality of the input space!
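As a small illustration of the margin definition above, the sketch below computes the per-example and overall margin of a toy separable set for a fixed (w, b); the data and weights are made up for illustration:

    # Minimal sketch: geometric margin gamma = min_i y_i * (<w, x_i> + b) / ||w||
    import numpy as np

    X = np.array([[2.0, 2.0], [1.5, 2.5], [-1.0, -1.0], [-2.0, 0.0]])   # toy points
    y = np.array([1, 1, -1, -1])                                        # labels
    w, b = np.array([1.0, 1.0]), 0.0                                    # a separating hyperplane

    gamma_i = y * (X @ w + b) / np.linalg.norm(w)    # per-example margins gamma_i
    print(gamma_i, gamma_i.min())                    # margin of the training set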
Margin of a Training Set
$\gamma = \min_i \gamma_i$
• Minimize $\langle W, W \rangle$
  Subject to: $y_i (\langle W, X_i \rangle + b) \geq 1$
$f(x) \approx f(X_0) + \left.\frac{df}{dx}\right|_{x = X_0} (x - X_0)$

$X_C \leftarrow X_C - \eta \left[\frac{\partial f}{\partial x_0}, \frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n}\right]_{X = X_C}$  (why?)

$\Delta f = f(Z_1) - f(Z_0) \approx \left.\frac{df}{dZ}\right|_{Z = Z_0} \Delta Z$
[Figure: contours of $f(x_1, x_2)$; gradient descent starts from $X_C = (x_{1C}, x_{2C})$ and moves toward the optimum $X^*$]
Gradient descent/ascent is guaranteed to find the minimum/maximum when the function has a single minimum/maximum
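A minimal sketch of the gradient descent update above on a function with a single minimum (a quadratic bowl chosen for illustration):

    # Minimal sketch: gradient descent on f(x1, x2) = (x1 - 1)^2 + (x2 + 2)^2
    import numpy as np

    def grad_f(x):
        x1, x2 = x
        return np.array([2 * (x1 - 1), 2 * (x2 + 2)])   # gradient of the bowl

    x_c = np.array([4.0, 3.0])          # starting point X_C
    eta = 0.1                           # learning rate
    for _ in range(200):
        x_c = x_c - eta * grad_f(x_c)   # X_C <- X_C - eta * grad f(X_C)
    print(x_c)                          # approaches the minimum X* = (1, -2)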
• Minimize $\langle W, W \rangle$ subject to $y_i (\langle W, X_i \rangle + b) \geq 1$
• Lagrangian:
$L_p(w) = \frac{1}{2}\langle w, w \rangle - \sum_i \alpha_i \left[ y_i (\langle w, x_i \rangle + b) - 1 \right], \qquad \alpha_i \geq 0$
From Primal to Dual
Setting $\partial L_p / \partial w = 0$ gives $w = \sum_i \alpha_i y_i x_i$; setting $\partial L_p / \partial b = 0$ gives $\sum_i \alpha_i y_i = 0$
• Maximize:
$L_D = \frac{1}{2}\left\langle \sum_i \alpha_i y_i x_i,\ \sum_j \alpha_j y_j x_j \right\rangle - \sum_i \alpha_i \left[ y_i \left( \sum_j \alpha_j y_j \langle x_j, x_i \rangle + b \right) - 1 \right]$
$= \frac{1}{2}\sum_{ij} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle - \sum_{ij} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle - b \sum_i \alpha_i y_i + \sum_i \alpha_i$
$= -\frac{1}{2}\sum_{ij} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + \sum_i \alpha_i$
subject to $\sum_i \alpha_i y_i = 0$ and $\alpha_i \geq 0$
Maximize
$W(\alpha) = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$
subject to $0 \leq \alpha_i \leq C$ and $\sum_i \alpha_i y_i = 0$
$\frac{\partial W(\alpha)}{\partial \alpha_i} = 1 - y_i \sum_j \alpha_j y_j K(x_i, x_j)$
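The slides stop at the dual; as an illustration of what maximizing W(α) with a kernel buys in practice, here is a short sketch using scikit-learn (an external library, not part of the lecture) to fit an RBF-kernel SVM on the XOR patterns:

    # Minimal sketch: kernel SVM on XOR via scikit-learn (assumes sklearn is installed)
    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
    y = np.array([-1, 1, 1, -1])                   # XOR labels

    clf = SVC(kernel="rbf", C=10.0, gamma=1.0)     # solves the dual problem internally
    clf.fit(X, y)
    print(clf.predict(X))                          # expected: [-1  1  1 -1]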
• Strengths
– Training is relatively easy
– No local optima
– It scales relatively well to high dimensional data
– Tradeoff between classifier complexity and error can be
controlled explicitly
– Non-traditional data like strings and trees can be used as input
to SVM, instead of feature vectors
• Weaknesses
– Need to choose a “good” kernel function
$w_i \leftarrow w_i - \eta \frac{\partial E_S}{\partial w_i}$

$\frac{\partial E_S}{\partial w_i} = \frac{\partial}{\partial w_i}\left[\frac{1}{2}\sum_p e_p^2\right] = \frac{1}{2}\sum_p \frac{\partial e_p^2}{\partial w_i} = \frac{1}{2}\sum_p 2 e_p \frac{\partial e_p}{\partial w_i} = \sum_p e_p \frac{\partial e_p}{\partial w_i}$

where $e_p = d_p - y_p$ and $y_p = \sum_{j=0}^{n} w_j x_{jp}$

$\sum_p e_p \frac{\partial e_p}{\partial w_i} = \sum_p (d_p - y_p)\, \frac{\partial}{\partial w_i}\left(d_p - w_i x_{ip} - \sum_{j \neq i} w_j x_{jp}\right) = -\sum_p (d_p - y_p)\, x_{ip}$

$\Rightarrow \quad w_i \leftarrow w_i + \eta \sum_p (d_p - y_p)\, x_{ip}$
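A minimal sketch of this batch delta (LMS) rule on a synthetic linear problem (the data generation and learning rate are our own choices, just to exercise the final update):

    # Minimal sketch: batch LMS / delta rule for a linear unit y_p = sum_j w_j x_jp
    import numpy as np

    rng = np.random.default_rng(0)
    X = np.hstack([np.ones((50, 1)), rng.uniform(-1, 1, (50, 2))])   # x_0p = 1 (bias input)
    true_w = np.array([0.5, 2.0, -1.0])
    d = X @ true_w                                                   # targets d_p

    w = np.zeros(3)
    eta = 0.01
    for _ in range(500):
        y = X @ w                        # outputs y_p for all patterns
        w = w + eta * X.T @ (d - y)      # w_i <- w_i + eta * sum_p (d_p - y_p) x_ip
    print(w)                             # approaches true_w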
Learning Real-Valued Functions
$F(x_1, x_2, \ldots, x_n) = \sum_{j=1}^{L} \alpha_j\, \varphi\!\left(\sum_{i=1}^{N} w_{ji} x_i + \theta_j\right)$
• Unlike Kolmogorov’s theorem, UFAT requires only one kind of
nonlinearity to approximate any arbitrary nonlinear function to any
desired accuracy
• A single bias unit is connected to each unit other than the input
units
• Net input
$n_j = \sum_{i=1}^{N} x_i w_{ji} + w_{j0} = \sum_{i=0}^{N} x_i w_{ji} \equiv W_j \cdot X$
• Each output unit similarly computes its net activation based on the
hidden unit signals as:
$n_k = \sum_{j=1}^{n_H} y_j w_{kj} + w_{k0} = \sum_{j=0}^{n_H} y_j w_{kj} \equiv W_k \cdot Y$
• Let tkp be the k-th target (or desired) output for input pattern Xp, let zkp be the output produced by the k-th output node, and let W represent all the weights in the network
• Training error:
$E_S(\mathbf{W}) = \frac{1}{2}\sum_p \sum_{k=1}^{M} (t_{kp} - z_{kp})^2 = \sum_p E_p(\mathbf{W})$
$\frac{\partial E_p}{\partial w_{kj}} = \frac{\partial E_p}{\partial n_{kp}} \cdot \frac{\partial n_{kp}}{\partial w_{kj}}, \qquad \frac{\partial n_{kp}}{\partial w_{kj}} = y_{jp}$

$\frac{\partial E_p}{\partial n_{kp}} = \frac{\partial E_p}{\partial z_{kp}} \cdot \frac{\partial z_{kp}}{\partial n_{kp}} = -(t_{kp} - z_{kp})(1)$

$w_{kj} \leftarrow w_{kj} - \eta \frac{\partial E_p}{\partial w_{kj}} = w_{kj} + \eta\,(t_{kp} - z_{kp})\, y_{jp} = w_{kj} + \eta\, \delta_{kp}\, y_{jp}$
Generalized delta rule
$\frac{\partial E_p}{\partial w_{ji}} = \sum_{k=1}^{M} \frac{\partial E_p}{\partial z_{kp}} \cdot \frac{\partial z_{kp}}{\partial w_{ji}} = \sum_{k=1}^{M} \frac{\partial E_p}{\partial z_{kp}} \cdot \frac{\partial z_{kp}}{\partial y_{jp}} \cdot \frac{\partial y_{jp}}{\partial n_{jp}} \cdot \frac{\partial n_{jp}}{\partial w_{ji}}$

$= \sum_{k=1}^{M} \frac{\partial}{\partial z_{kp}}\left[\frac{1}{2}\sum_{l=1}^{M}(t_{lp} - z_{lp})^2\right] \cdot w_{kj} \cdot y_{jp}(1 - y_{jp}) \cdot x_{ip}$

$= -\sum_{k=1}^{M} \delta_{kp}\, w_{kj}\, y_{jp}(1 - y_{jp})\, x_{ip} = -\delta_{jp}\, x_{ip}$

where $\delta_{jp} = y_{jp}(1 - y_{jp}) \sum_{k=1}^{M} \delta_{kp}\, w_{kj}$

$w_{ji} \leftarrow w_{ji} + \eta\, \delta_{jp}\, x_{ip}$
Back propagation algorithm
$F(X_p) = \arg\max_k z_{kp}$
Classify a pattern by assigning it to the class that corresponds to
the index of the output node with the largest output for the
pattern
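To tie the derivations together, here is a compact sketch of a one-hidden-layer network trained with the update rules above (sigmoid hidden units, linear output units, per-pattern updates); the XOR data, network size, and learning rate are our own choices:

    # Minimal sketch: one-hidden-layer backpropagation (sigmoid hidden, linear output)
    import numpy as np

    rng = np.random.default_rng(0)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # input patterns (XOR)
    T = np.array([[1, 0], [0, 1], [0, 1], [1, 0]], dtype=float)   # one target row per pattern

    n_hid, eta = 4, 0.5
    W_hid = rng.normal(0, 0.5, (n_hid, 3))        # hidden weights, incl. bias w_j0
    W_out = rng.normal(0, 0.5, (2, n_hid + 1))    # output weights, incl. bias w_k0

    def sigmoid(v):
        return 1.0 / (1.0 + np.exp(-v))

    for _ in range(5000):
        for x, t in zip(X, T):
            x1 = np.append(1.0, x)                # x_0 = 1 (bias input)
            y = sigmoid(W_hid @ x1)               # hidden activations y_j
            y1 = np.append(1.0, y)
            z = W_out @ y1                        # linear output units z_k
            delta_k = t - z                       # delta_kp = t_kp - z_kp
            delta_j = y * (1 - y) * (W_out[:, 1:].T @ delta_k)   # generalized delta rule
            W_out += eta * np.outer(delta_k, y1)  # w_kj <- w_kj + eta * delta_kp * y_jp
            W_hid += eta * np.outer(delta_j, x1)  # w_ji <- w_ji + eta * delta_jp * x_ip

    for x in X:
        y1 = np.append(1.0, sigmoid(W_hid @ np.append(1.0, x)))
        print(x, np.argmax(W_out @ y1))           # F(X_p) = argmax_k z_kp; typically 0, 1, 1, 0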