
Artificial Neural Networks

Jihoon Yang
Machine Learning Research Laboratory
Department of Computer Science & Engineering
Sogang University
Email: [email protected]
URL: mllab.sogang.ac.kr/people/jhyang.html
Neurons and Computation

McCulloch-Pitts computational model of a neuron

(Figure: a neuron with inputs x1, ..., xn weighted by synaptic weights w1, ..., wn, a bias input x0 = 1 with weight w0, and output y.)

  y = +1 if Σ_{i=0}^{n} w_i x_i > 0
  y = −1 otherwise

When a neuron receives input signals from other neurons, its membrane voltage increases. When it exceeds a certain threshold, the neuron “fires” a burst of pulses.
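A minimal sketch of this unit in Python/NumPy (illustrative only; the weight values below are arbitrary, not taken from the slides):

```python
import numpy as np

def threshold_neuron(x, w, w0):
    """McCulloch-Pitts unit: output +1 if the weighted sum of the
    inputs plus the bias exceeds 0, otherwise output -1."""
    net = np.dot(w, x) + w0
    return 1 if net > 0 else -1

# Arbitrary example weights for a 2-input unit.
w, w0 = np.array([-1.0, 1.0]), -1.0
print(threshold_neuron(np.array([1.0, 0.0]), w, w0))  # net = -2 -> -1
print(threshold_neuron(np.array([0.0, 2.0]), w, w0))  # net =  1 -> +1
```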
Threshold neuron – Connection with Geometry

(Figure: the decision boundary w1x1 + w2x2 + w0 = 0 in the (x1, x2) plane, with normal vector (w1, w2); the half-space where w1x1 + w2x2 + w0 > 0 contains class C1 and the half-space where w1x1 + w2x2 + w0 < 0 contains class C2.)

  Σ_{i=1}^{n} w_i x_i + w_0 = 0

describes a hyperplane which divides the instance space ℝ^n into two half-spaces:

  {X_p ∈ ℝ^n | W·X_p + w_0 > 0}  and  {X_p ∈ ℝ^n | W·X_p + w_0 < 0}
McCulloch-Pitts neuron or Threshold neuron

  y = sign(W·X + w_0) = sign(Σ_{i=0}^{n} w_i x_i) = sign(W^T X + w_0)

where X = (x_1, x_2, ..., x_n)^T, W = (w_1, w_2, ..., w_n)^T, and

  sign(v) = +1 if v > 0
          = −1 otherwise

Threshold neuron – Connection with Geometry

• Instance space: ℝ^n

• Hypothesis space is the set of (n−1)-dimensional hyperplanes defined in the n-dimensional instance space

• A hypothesis is defined by Σ_{i=0}^{n} w_i x_i = 0

• Orientation of the hyperplane is governed by (w_1 ... w_n)^T

• W determines the orientation of the hyperplane H: given two points X1 and X2 on the hyperplane,

  W·(X1 − X2) = 0

  ⟹ W is normal to any vector lying in H
Threshold neuron as a pattern classifier

• The threshold neuron can be used to classify a set of instances into one of two classes C1, C2

• If the output of the neuron for input pattern Xp is +1 then Xp is assigned to class C1

• If the output is −1 then the pattern Xp is assigned to C2

• Example
  [w_0 w_1 w_2]^T = [−1 −1 1]^T
  X_p^T = [1 0]^T,  W·X_p + w_0 = −1 + (−1) = −2
  X_p is assigned to class C2

Threshold neuron – Connection with Logic

• Suppose the input space is {0,1}^n

• Then the threshold neuron computes a Boolean function f: {0,1}^n → {−1, 1}

• Example
  – Let w_0 = −1.5; w_1 = w_2 = 1
  – In this case, the threshold neuron implements the logical AND function

    x1  x2   g(X)    y
    0   0    −1.5   −1
    0   1    −0.5   −1
    1   0    −0.5   −1
    1   1     0.5    1

Threshold neuron – Connection with Logic

• Theorem: There exist functions that cannot be implemented by a single threshold neuron

• Example: Exclusive OR

  Why?

(Figure: the four XOR instances in the (x1, x2) plane; no single line separates the positive examples from the negative ones.)
Terminology and Notation

• Synonyms: Threshold function, Linearly separable function, Linear discriminant function

• Synonyms: Threshold neuron, McCulloch-Pitts neuron, Perceptron, Threshold Logic Unit (TLU)

• We often include w_0 as one of the components of W and incorporate x_0 as the corresponding component of X with the understanding that x_0 = 1; Then y = 1 if W·X > 0 and y = −1 otherwise

Learning Threshold functions

• A training example E_k is an ordered pair (X_k, d_k) where

  X_k = (x_0k x_1k .... x_nk)^T is an (n+1)-dimensional input pattern, and

  d_k = f(X_k) ∈ {−1, 1} is the desired output of the classifier and f is an unknown target function to be learned

• A training set E is simply a multi-set of examples

Learning Threshold functions

  S+ = {X_k | (X_k, d_k) ∈ E and d_k = 1}
  S− = {X_k | (X_k, d_k) ∈ E and d_k = −1}

• We say that a training set E is linearly separable if and only if

  ∃W* such that ∀X_p ∈ S+, W*·X_p > 0
  and ∀X_p ∈ S−, W*·X_p < 0

• Learning task: Given a linearly separable training set E, find a solution W* such that ∀X_p ∈ S+, W*·X_p > 0 and ∀X_p ∈ S−, W*·X_p < 0

Rosenblatt’s Perceptron Learning Algorithm

 0 0..... 0
T
1. Initialize W

2. Set learning rate  0

3. Repeat until a complete pass through E results in no weight


updates
For each training example Ek  E
{
yk  sign ( W  X k )

}
W  W  d k  yk X k

4. W*  W; Return W*
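The algorithm above maps almost line-for-line onto NumPy. This is a minimal sketch (not the author's code); each pattern X_k is assumed to carry the bias component x_0 = 1, and sign(0) is taken as −1 to match the worked example on the next slide:

```python
import numpy as np

def perceptron_train(X, d, eta=0.5, max_epochs=100):
    """Rosenblatt's perceptron rule. Each row of X already includes
    the bias component x0 = 1; labels d are +1 / -1."""
    w = np.zeros(X.shape[1])                  # step 1: W <- (0, ..., 0)
    for _ in range(max_epochs):               # step 3: repeated passes over E
        updated = False
        for xk, dk in zip(X, d):
            yk = 1 if np.dot(w, xk) > 0 else -1
            if yk != dk:                      # W <- W + eta * (dk - yk) * Xk
                w += eta * (dk - yk) * xk
                updated = True
        if not updated:                       # a full pass with no updates
            break
    return w                                  # step 4: return W*

# Training set from the example on the next slide (eta = 1/2).
X = np.array([[1, 1, 1], [1, 1, -1], [1, 0, -1],
              [1, -1, -1], [1, -1, 1], [1, 0, 1]], dtype=float)
d = np.array([1, 1, 1, -1, -1, -1])
print(perceptron_train(X, d))
```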

Perceptron Learning Algorithm – Example

Let
  S+ = {(1, 1, 1), (1, 1, -1), (1, 0, -1)}
  S- = {(1, -1, -1), (1, -1, 1), (1, 0, 1)}
  η = 1/2
  W = (0, 0, 0)

  Xk            dk   W          W·Xk   yk   Update?   Updated W
  (1, 1, 1)      1   (0, 0, 0)    0    -1   Yes       (1, 1, 1)
  (1, 1, -1)     1   (1, 1, 1)    1     1   No        (1, 1, 1)
  (1, 0, -1)     1   (1, 1, 1)    0    -1   Yes       (2, 1, 0)
  (1, -1, -1)   -1   (2, 1, 0)    1     1   Yes       (1, 2, 1)
  (1, -1, 1)    -1   (1, 2, 1)    0    -1   No        (1, 2, 1)
  (1, 0, 1)     -1   (1, 2, 1)    2     1   Yes       (0, 2, 0)
  (1, 1, 1)      1   (0, 2, 0)    2     1   No        (0, 2, 0)

Perceptron Convergence Theorem (Novikoff)

Theorem:
Let E = {(X_k, d_k)} be a training set where X_k ∈ {1} × ℝ^n and d_k ∈ {−1, 1}.
Let S+ = {X_k | (X_k, d_k) ∈ E & d_k = 1} and S− = {X_k | (X_k, d_k) ∈ E & d_k = −1}.

The perceptron algorithm is guaranteed to terminate after a bounded number t of weight updates with a weight vector W* such that ∀X_k ∈ S+, W*·X_k > γ and ∀X_k ∈ S−, W*·X_k < −γ for some γ > 0, whenever such W* ∈ ℝ^{n+1} and γ > 0 exist – that is, E is linearly separable.

The bound on the number t of weight updates is given by

  t ≤ (‖W*‖ L / γ)²  where L = max_{X_k ∈ S} ‖X_k‖ and S = S+ ∪ S−

Limitations of Perceptrons

• Perceptrons can only represent threshold functions

• Perceptrons can only learn linear decision boundaries

• What if the data are not linearly separable?


– Modify the learning procedure or the weight update equation?
(e.g. Pocket algorithm, Thermal perceptron)

– More complex networks?

– Non-linear transformations into a feature space where the data become separable?

Extending Linear Classifiers:
Learning in feature spaces
• Map data into a feature space where they are linearly separable

(Figure: instances x and o in the input space X are mapped by φ into a feature space Φ, x → φ(x), o → φ(o), where the two classes become linearly separable.)
Exclusive OR revisited

• In the feature (hidden) space:

  φ1(x1, x2) = e^{−‖X − W1‖²} = z1,  W1 = [1, 1]^T
  φ2(x1, x2) = e^{−‖X − W2‖²} = z2,  W2 = [0, 0]^T

(Figure: images of the XOR inputs in the (z1, z2) plane: (0,0) and (1,1) fall on one side of a linear decision boundary, while (0,1) and (1,0) map to the same point on the other side.)

• When mapped into the feature space <z1, z2>, C1 and C2 become
linearly separable. So a linear classifier with φ1(X) and φ2(X) as
inputs can be used to solve the XOR problem.
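A small numerical check of this construction (an illustrative sketch): the four XOR inputs are pushed through φ1 and φ2, and a hand-picked line in the (z1, z2) plane separates them. The separating rule z1 + z2 < 1 is an assumption chosen to match the figure, not a value given on the slide.

```python
import numpy as np

W1, W2 = np.array([1.0, 1.0]), np.array([0.0, 0.0])

def features(x):
    """phi_1, phi_2: Gaussian bumps centred at W1 = (1,1) and W2 = (0,0)."""
    z1 = np.exp(-np.sum((x - W1) ** 2))
    z2 = np.exp(-np.sum((x - W2) ** 2))
    return z1, z2

# A linear classifier in (z1, z2): hypothetical weights chosen so that
# z1 + z2 < 1 corresponds to class C1 (XOR = 1).
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    z1, z2 = features(np.array(x, dtype=float))
    label = 1 if z1 + z2 < 1.0 else -1
    print(x, (round(z1, 3), round(z2, 3)), label)
```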
Learning in the Feature Space

• High dimensional feature spaces

  X = (x_1, x_2, ..., x_n) → Φ(X) = (φ_1(X), φ_2(X), ..., φ_d(X))

  where typically d >> n, solve the problem of expressing complex functions

• But this introduces

  – Computational problem (working with very large vectors)
    → Solved using the kernel trick – implicit feature spaces
  – Generalization problem (curse of dimensionality)
    → Solved by maximizing the margin of separation – first implemented in SVM (Vapnik)

Margin Based Bounds on Error of Classification

• The error ε of classification function h for separable data sets is

  ε = O(d / l)

• Can prove margin based bound:

  ε = O((1/l)(L/γ)²)  where L = max_p ‖X_p‖

  γ = min_i (y_i f(x_i)) / ‖w‖,  f(x) = ⟨w, x⟩ + b
• Important insight:
Error of the classifier trained on a separable data set is inversely
proportional to its margin, and is independent of the dimensionality
of the input space!
Margin of a Training Set

• The functional margin of a training set

  γ = min_i γ_i  (the minimum functional margin γ_i over the examples)

• The geometric margin of a training set

  γ = min_i γ_i  (the minimum geometric margin γ_i over the examples)

Maximum Margin Separating Hyperplane

• γ = min_i γ_i is called the (functional) margin of (W, b) w.r.t. the data set S = {(X_i, y_i)}

• The margin of a training set S is the maximum geometric margin over all hyperplanes; a hyperplane realizing this maximum is a maximal margin hyperplane

(Figure: Maximal Margin Hyperplane)


Maximizing Margin ⇔ Minimizing ||W||

• Definition of hyperplane (W, b) does not change if we rescale it to (λW, λb), for λ > 0

• Functional margin depends on scaling, but geometric margin γ does not

• If we fix (by rescaling) the functional margin to 1, the geometric margin will be equal to 1/||W||

• Then, we can maximize the margin by minimizing the norm ||W||

Learning as Optimization

• Minimize

  ⟨W, W⟩

  subject to:

  y_i(⟨W, X_i⟩ + b) ≥ 1

• The problem of finding the maximal margin hyperplane is a constrained optimization (quadratic programming) problem

Taylor Series Approximation of Functions

Taylor series approximation of f(x)

If f(x) is differentiable, i.e., its derivatives

  df/dx, d²f/dx² = d/dx(df/dx), ..., dⁿf/dxⁿ

exist at x = X₀, and f(x) is continuous in the neighborhood of x = X₀, then

  f(x) = f(X₀) + (df/dx)|_{x=X₀} (x − X₀) + ..... + (1/n!)(dⁿf/dxⁿ)|_{x=X₀} (x − X₀)ⁿ + ...

A first-order approximation:

  f(x) ≈ f(X₀) + (df/dx)|_{x=X₀} (x − X₀)
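A quick numeric illustration of the first-order approximation (an assumed example, not from the slides), with f(x) = x³ and X₀ = 2:

```python
def f(x):
    return x ** 3        # the function being approximated

def df(x):
    return 3 * x ** 2    # its first derivative

X0 = 2.0
for x in (2.1, 2.5, 3.0):
    approx = f(X0) + df(X0) * (x - X0)   # first-order Taylor approximation
    print(x, f(x), approx)               # the gap grows as x moves away from X0
```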

Minimizing/Maximizing Multivariate Functions

To find X* that minimizes f(X), we change the current guess X_C in the direction of the negative gradient of f(X) evaluated at X_C:

  X_C ← X_C − η (∂f/∂x_0, ∂f/∂x_1, ..........., ∂f/∂x_n)|_{X=X_C}   (why?)

for small (ideally infinitesimally small) η

Minimizing/Maximizing Multivariate Functions

Suppose we move from Z₀ to Z₁. We want to ensure f(Z₁) < f(Z₀).

In the neighborhood of Z₀, using the Taylor series expansion, we can write

  f(Z₁) = f(Z₀ + ΔZ) ≈ f(Z₀) + (df/dZ)|_{Z=Z₀} ΔZ + ...

  Δf = f(Z₁) − f(Z₀) ≈ (df/dZ)|_{Z=Z₀} ΔZ

We want to make sure Δf < 0.

If we choose

  ΔZ = −η (df/dZ)|_{Z=Z₀},  then  Δf ≈ −η [(df/dZ)|_{Z=Z₀}]² ≤ 0
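A minimal gradient-descent sketch of this update (assumed example function, not from the slides): minimizing f(x1, x2) = (x1 − 1)² + (x2 + 2)², which has its single minimum at (1, −2).

```python
import numpy as np

def f(x):                        # f(x1, x2) = (x1 - 1)^2 + (x2 + 2)^2
    return (x[0] - 1) ** 2 + (x[1] + 2) ** 2

def grad_f(x):                   # analytic gradient of f
    return np.array([2 * (x[0] - 1), 2 * (x[1] + 2)])

x = np.array([5.0, 5.0])         # current guess X_C
eta = 0.1                        # small positive learning rate
for _ in range(100):
    x = x - eta * grad_f(x)      # X_C <- X_C - eta * grad f(X_C)

print(x, f(x))                   # close to the minimum (1, -2)
```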

Minimizing/Maximizing Functions

Gradient descent/ascent is guaranteed to find the minimum/maximum when the function has a single minimum/maximum.

(Figure: the surface f(x1, x2) with the current guess X_C = (x1C, x2C) moving downhill toward the minimum X*.)

Maximum Margin Hyperplane

• The problem of finding the maximal margin hyperplane is a constrained optimization problem

• Use Lagrange theory (extended by Karush, Kuhn, and Tucker – KKT)

• Minimize ⟨W, W⟩ subject to y_i(⟨W, X_i⟩ + b) ≥ 1

• Lagrangian:

  L_p(w) = ½⟨w, w⟩ − Σ_i α_i [y_i(⟨w, x_i⟩ + b) − 1],  α_i ≥ 0
From Primal to Dual

• Minimize L_p(w) with respect to (w, b), requiring that the derivatives of L_p(w) with respect to (w, b) all vanish, subject to the constraints α_i ≥ 0

• Differentiating L_p(w):

  ∂L_p/∂w = 0,  ∂L_p/∂b = 0

  ⟹ w = Σ_i α_i y_i x_i  and  Σ_i α_i y_i = 0

• Substituting these equality constraints back into L_p(w), we obtain a dual problem

The Dual Problem

• Maximize:

  L_D(w) = ½ ⟨Σ_i α_i y_i x_i, Σ_j α_j y_j x_j⟩ − Σ_i α_i [y_i(⟨Σ_j α_j y_j x_j, x_i⟩ + b) − 1]

         = ½ Σ_{ij} α_i α_j y_i y_j ⟨x_i, x_j⟩ − Σ_{ij} α_i α_j y_i y_j ⟨x_i, x_j⟩ − b Σ_i α_i y_i + Σ_i α_i

         = −½ Σ_{ij} α_i α_j y_i y_j ⟨x_i, x_j⟩ + Σ_i α_i

  subject to Σ_i α_i y_i = 0 and α_i ≥ 0

• Duality permits the use of kernels!

• This is a quadratic optimization problem: convex, no local minima
• Solvable in polynomial time

Karush-Kuhn-Tucker Conditions for SVM

• The KKT conditions state that optimal solutions α_i, (w, b) must satisfy

  α_i [y_i(⟨w, x_i⟩ + b) − 1] = 0

• Only the training samples X_i for which the functional margin = 1 can have nonzero α_i; they are called Support Vectors

• The optimal hyperplane can be expressed in the dual representation in terms of this subset of training samples – the support vectors

  f(x, α, b) = Σ_{i=1}^{l} y_i α_i ⟨x_i, x⟩ + b = Σ_{i∈SV} y_i α_i ⟨x_i, x⟩ + b

Support Vectors

Implementation Techniques

• Maximizing a quadratic function, subject to a linear equality and inequality constraints

  W(α) = Σ_i α_i − ½ Σ_{i,j} α_i α_j y_i y_j K(x_i, x_j)

  0 ≤ α_i ≤ C

  Σ_i α_i y_i = 0

  ∂W(α)/∂α_i = 1 − y_i Σ_j α_j y_j K(x_i, x_j)

Strengths and Weaknesses of SVM

• Strengths
– Training is relatively easy
– No local optima
– It scales relatively well to high dimensional data
– Tradeoff between classifier complexity and error can be
controlled explicitly
– Non-traditional data like strings and trees can be used as input
to SVM, instead of feature vectors

• Weaknesses
– Need to choose a “good” kernel function

Learning Linear Functions

Learning Linear Functions:
Computing Gradient

Learning Linear Functions:
Linear Algebra Solution

Learning Linear Functions:
Delta/Adaline/Widrow-Hoff/LMS(Least-Mean-Squared) Rule

  w_i ← w_i − η ∂E_S/∂w_i

  ∂E_S/∂w_i = ½ ∂/∂w_i (Σ_p e_p²) = ½ Σ_p ∂e_p²/∂w_i

            = ½ Σ_p 2 e_p (∂e_p/∂y_p)(∂y_p/∂w_i) = Σ_p e_p (−1) ∂/∂w_i (Σ_{j=0}^{n} w_j x_jp)

            = −Σ_p (d_p − y_p) ∂/∂w_i (w_i x_ip + Σ_{j≠i} w_j x_jp)

            = −Σ_p (d_p − y_p) x_ip

  ⟹ w_i ← w_i + η Σ_p (d_p − y_p) x_ip

where e_p = d_p − y_p and y_p = Σ_{j=0}^{n} w_j x_jp.
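A compact sketch of the resulting batch update (illustrative only, on an assumed synthetic data set): a linear unit trained with w_i ← w_i + η Σ_p (d_p − y_p) x_ip.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed synthetic task: targets generated by a "true" linear unit.
w_true = np.array([0.5, -1.0, 2.0])
X = np.hstack([np.ones((50, 1)), rng.normal(size=(50, 2))])  # x0 = 1 bias column
d = X @ w_true

w = np.zeros(3)
eta = 0.01
for _ in range(500):                 # batch delta / LMS updates
    y = X @ w                        # linear unit outputs y_p = W . X_p
    w += eta * X.T @ (d - y)         # w_i += eta * sum_p (d_p - y_p) x_ip

print(w)                             # approaches w_true = [0.5, -1.0, 2.0]
```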
Learning Real-Valued Functions

• Universal function approximation theorem

• Learning nonlinear functions using gradient descent in weight space

• Practical considerations and examples

Universal function approximation theorem
(Cybenko, 1989)
• Let φ: ℝ → ℝ be a non-constant, bounded (hence non-linear), monotone, continuous function. Let I_N be the N-dimensional unit hypercube in ℝ^N.

• Let C(I_N) = {f: I_N → ℝ} be the set of all continuous functions with domain I_N and range ℝ. Then for any function f ∈ C(I_N) and any ε > 0, ∃ an integer L and a set of real values α_j, θ_j, w_ji (1 ≤ j ≤ L; 1 ≤ i ≤ N) such that

  F(x_1, x_2, ..., x_N) = Σ_{j=1}^{L} α_j φ(Σ_{i=1}^{N} w_ji x_i + θ_j)

  is a uniform approximation of f – that is,

  ∀(x_1, ..., x_N) ∈ I_N, |F(x_1, ..., x_N) − f(x_1, ..., x_N)| < ε

Universal function approximation theorem (UFAT)

  F(x_1, x_2, ..., x_N) = Σ_{j=1}^{L} α_j φ(Σ_{i=1}^{N} w_ji x_i + θ_j)

• Unlike Kolmogorov’s theorem, UFAT requires only one kind of nonlinearity to approximate any arbitrary nonlinear function to any desired accuracy

• The sigmoid function satisfies the UFAT requirements

  φ(z) = 1 / (1 + e^{−az}),  a > 0;  lim_{z→−∞} φ(z) = 0;  lim_{z→+∞} φ(z) = 1

• Similar universal approximation properties can be guaranteed for other functions (e.g. radial basis functions)

Feed-forward neural networks

• A feed-forward 3-layer network consists of 3 layers of nodes
  – Input nodes
  – Hidden nodes
  – Output nodes

• Interconnected by modifiable weights from input nodes to the hidden nodes and the hidden nodes to the output nodes

• More general topologies (with more than 3 layers of nodes, or connections that skip layers – e.g., direct connections between input and output nodes) are also possible

Three-layer feed-forward neural network

• A single bias unit is connected to each unit other than the input units

• Net input

  n_j = Σ_{i=1}^{N} x_i w_ji + w_j0 = Σ_{i=0}^{N} x_i w_ji = W_j · X,

  where the subscript i indexes units in the input layer, j in the hidden; w_ji denotes the input-to-hidden layer weight at hidden unit j.

• The output of a hidden unit is a nonlinear function of its net input. That is, y_j = f(n_j), e.g.,

  y_j = 1 / (1 + e^{−n_j})

Three-layer feed-forward neural network

• Each output unit similarly computes its net activation based on the hidden unit signals as:

  n_k = Σ_{j=1}^{n_H} y_j w_kj + w_k0 = Σ_{j=0}^{n_H} y_j w_kj = W_k · Y,

  where the subscript k indexes units in the output layer and n_H denotes the number of hidden units

• The output can be a linear or nonlinear function of the net input, e.g.,

  y_k = n_k
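The two net-input equations translate directly into a forward pass. Below is a minimal sketch with assumed layer sizes and random weights; the bias terms are folded in as w_j0 and w_k0:

```python
import numpy as np

def sigmoid(n):
    return 1.0 / (1.0 + np.exp(-n))

rng = np.random.default_rng(1)
N, n_H, M = 3, 4, 2                        # input, hidden, output sizes (assumed)
W_hidden = rng.normal(size=(n_H, N + 1))   # row j: (w_j0, w_j1, ..., w_jN)
W_output = rng.normal(size=(M, n_H + 1))   # row k: (w_k0, w_k1, ..., w_k nH)

def forward(x):
    x = np.concatenate(([1.0], x))             # prepend bias input x0 = 1
    n_j = W_hidden @ x                         # net inputs of the hidden units
    y = np.concatenate(([1.0], sigmoid(n_j)))  # hidden outputs plus bias unit
    n_k = W_output @ y                         # net activations of output units
    return n_k                                 # linear outputs y_k = n_k

print(forward(np.array([0.2, -0.5, 1.0])))
```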

Computing nonlinear functions using a
feed-forward neural network

Realizing non-linearly separable class boundaries
using a 3-layer feed-forward neural network

Learning nonlinear functions

Given a training set determine:

• Network structure – number of hidden nodes or more generally, network topology
  • Start small and grow the network
  • Start with a sufficiently large network and prune away the unnecessary connections

• For a given structure, determine the parameters (weights) that minimize the error on the training samples (e.g., the mean squared error)

• For now, we focus on the latter

Generalized delta rule – error back-propagation

• Challenge – we know the desired outputs for nodes in the output layer, but not the hidden layer

• Need to solve the credit assignment problem – dividing the credit or blame for the performance of the output nodes among hidden nodes

• Generalized delta rule offers an elegant solution to the credit assignment problem in feed-forward neural networks in which each neuron computes a differentiable function of its inputs

• Solution can be generalized to other kinds of networks, including networks with cycles

Feed-forward networks

• Forward operation (computing output for a given input based on the current weights)

• Learning – modification of the network parameters (weights) to minimize an appropriate error measure

• Because each neuron computes a differentiable function of its inputs, if the error is a differentiable function of the network outputs, then the error is a differentiable function of the weights in the network – so we can perform gradient descent!

A fully connected 3-layer network

Generalized delta rule

• Let t_kp be the k-th target (or desired) output for input pattern X_p and z_kp be the output produced by the k-th output node, and let W represent all the weights in the network

• Training error:

  E_S(W) = ½ Σ_p Σ_{k=1}^{M} (t_kp − z_kp)² = Σ_p E_p(W)

• The weights are initialized with pseudo-random values and are changed in a direction that will reduce the error:

  Δw_ji = −η ∂E_S/∂w_ji,   Δw_kj = −η ∂E_S/∂w_kj

Generalized delta rule

η > 0 is a suitable learning rate; W ← W + ΔW

Hidden-to-output weights:

  ∂E_p/∂w_kj = (∂E_p/∂n_kp)·(∂n_kp/∂w_kj)

  ∂n_kp/∂w_kj = y_jp

  ∂E_p/∂n_kp = (∂E_p/∂z_kp)·(∂z_kp/∂n_kp) = (t_kp − z_kp)(−1)

  w_kj ← w_kj − η ∂E_p/∂w_kj = w_kj + η(t_kp − z_kp) y_jp = w_kj + η δ_kp y_jp
Generalized delta rule

Weights from input to hidden units

  ∂E_p/∂w_ji = Σ_{k=1}^{M} (∂E_p/∂z_kp)·(∂z_kp/∂w_ji) = Σ_{k=1}^{M} (∂E_p/∂z_kp)·(∂z_kp/∂y_jp)·(∂y_jp/∂n_jp)·(∂n_jp/∂w_ji)

             = Σ_{k=1}^{M} [∂/∂z_kp (½ Σ_{l=1}^{M} (t_lp − z_lp)²)] w_kj y_jp(1 − y_jp) x_ip

             = −Σ_{k=1}^{M} (t_kp − z_kp) w_kj y_jp(1 − y_jp) x_ip

             = −[Σ_{k=1}^{M} δ_kp w_kj] y_jp(1 − y_jp) x_ip = −δ_jp x_ip

  where δ_jp = [Σ_{k=1}^{M} δ_kp w_kj] y_jp(1 − y_jp)

  w_ji ← w_ji + η δ_jp x_ip
Back propagation algorithm

Start with small random initial weights

Until desired stopping criterion is satisfied do
  Select a training sample from S
  Compute the outputs of all nodes based on current weights and the input sample
  Compute the weight updates for output nodes
  Compute the weight updates for hidden nodes
  Update the weights
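Putting the two update rules together, here is a minimal per-sample backpropagation sketch for a network with sigmoid hidden units and linear outputs (assumed sizes and training data; not the author's code):

```python
import numpy as np

rng = np.random.default_rng(2)
N, n_H, M, eta = 2, 8, 1, 0.1
W1 = rng.normal(scale=0.5, size=(n_H, N + 1))   # input-to-hidden weights w_ji
W2 = rng.normal(scale=0.5, size=(M, n_H + 1))   # hidden-to-output weights w_kj

sigmoid = lambda n: 1.0 / (1.0 + np.exp(-n))

def forward(x):
    xb = np.concatenate(([1.0], x))     # input with bias component x0 = 1
    y = sigmoid(W1 @ xb)                # hidden outputs y_j
    yb = np.concatenate(([1.0], y))     # hidden outputs with bias unit
    return xb, y, yb, W2 @ yb           # linear outputs z_k

# Assumed training set: XOR with real-valued targets 0/1.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0.0], [1.0], [1.0], [0.0]])

for _ in range(10000):
    for x, t in zip(X, T):                      # one training sample at a time
        xb, y, yb, z = forward(x)
        delta_k = t - z                                   # output deltas (t_kp - z_kp)
        delta_j = (W2[:, 1:].T @ delta_k) * y * (1 - y)   # hidden deltas
        W2 += eta * np.outer(delta_k, yb)       # w_kj += eta * delta_k * y_j
        W1 += eta * np.outer(delta_j, xb)       # w_ji += eta * delta_j * x_i

for x in X:
    print(x, np.round(forward(x)[3], 2))        # typically close to 0, 1, 1, 0
```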

Using neural networks for classification

Network outputs are real-valued. How can we use the networks for classification?

  F(X_p) = argmax_k z_kp

Classify a pattern by assigning it to the class that corresponds to the index of the output node with the largest output for the pattern.
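In code the decision rule is a one-liner over the output vector (a sketch; network_outputs is a hypothetical array of z_kp values for one pattern):

```python
import numpy as np

# Hypothetical outputs of a 4-class network for one input pattern X_p.
network_outputs = np.array([0.12, 0.81, 0.35, 0.05])   # z_1p ... z_4p
predicted_class = int(np.argmax(network_outputs))      # F(X_p) = argmax_k z_kp
print(predicted_class)                                 # -> 1 (second output node)
```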
