NN Ch3
Associative Memory
• Two types of associations. For two patterns s and t
– hetero-association (s != t) : relating two different patterns
– auto-association (s = t): relating parts of a pattern with
other parts
• Architectures of NN associative memory
– single layer (with/without input layer)
– two layers (for bidirectional assoc.)
• Learning algorithms for AM
– Hebbian learning rule and its variations
– gradient descent
• Analysis
– storage capacity (how many patterns can be
remembered correctly in a memory)
– convergence
• AM as a model for human memory
Training Algorithms for Simple AM
• Network structure: single layer
– one output layer of non-linear units and one input layer
– similar to the simple network for classification in Ch. 2
[Diagram: single-layer network; inputs x_1, ..., x_n receive pattern s, outputs y_1, ..., y_m produce pattern t, fully connected by weights w_11, ..., w_nm]
• Goal of learning:
– to obtain a set of weights w_ij
– from a set of training pattern pairs {s:t}
– such that when s is applied to the input layer, t is computed
at the output layer
– for all training pairs s:t: $t_j = f(s^T w_j)$ for all $j$
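A minimal numpy sketch of this recall step; the choice f = np.sign is only an assumed placeholder for the unit nonlinearity:

```python
import numpy as np

def recall(s, W, f=np.sign):
    """Single-layer AM recall: t = f(s W), i.e. t_j = f(s^T w_j) for each j."""
    # f defaults to a sign function here; use whatever nonlinearity the units actually have.
    return f(s @ W)
```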
Hebbian rule
$W = \sum_{p=1}^{P} s^T(p)\, t(p)$

When a stored pattern s(k) is applied as input:

$s(k)\, W = \sum_{p=1}^{P} s(k)\, s^T(p)\, t(p) = s(k)\, s^T(k)\, t(k) + \sum_{p \ne k} s(k)\, s^T(p)\, t(p) = \|s(k)\|^2\, t(k) + \sum_{p \ne k} \big(s(k)\, s^T(p)\big)\, t(p)$

The first term is the principal term; the second is the cross-talk term.
• Principal term gives the association between s(k) and t(k).
• Cross-talk represents correlation between s(k):t(k) and other
training pairs. When cross-talk is large, s(k) will recall
something other than t(k).
• If all s(p) are orthogonal to each other, then $s(k)\, s^T(p) = 0$ for $p \ne k$, so no sample other than s(k):t(k) contributes to the result.
• There are at most n orthogonal vectors in an n-dimensional
space.
• Cross-talk increases when P increases.
• How many arbitrary training pairs can be stored in an AM?
– Can it be more than n (allowing some non-orthogonal patterns
while keeping cross-talk terms small)?
– Storage capacity (more later)
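To make the principal/cross-talk decomposition concrete, here is a small numpy sketch; the random bipolar patterns (and the sizes n, m, P) are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, P = 8, 3, 4
S = rng.choice([-1, 1], size=(P, n))   # bipolar input patterns s(p), made up for illustration
T = rng.choice([-1, 1], size=(P, m))   # associated output patterns t(p)

W = S.T @ T                            # Hebbian rule: W = sum_p s^T(p) t(p)

k = 0
principal  = (S[k] @ S[k]) * T[k]      # ||s(k)||^2 t(k)
cross_talk = sum((S[k] @ S[p]) * T[p] for p in range(P) if p != k)
print(np.array_equal(S[k] @ W, principal + cross_talk))   # True: s(k) W = principal + cross-talk
```

If the cross-talk term stays small relative to the principal term, thresholding s(k) W still recovers t(k).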
Example of hetero-associative memory
• Binary pattern pairs s:t with |s| = 4 and |t| = 2.
• Total weighted input to output units: $y\_in_j = \sum_i x_i\, w_{ij}$
• Activation function: threshold
  $y_j = 1$ if $y\_in_j > 0$; $y_j = 0$ if $y\_in_j \le 0$
• Weights are computed by the Hebbian rule (sum of outer products of all training pairs):
  $W = \sum_{p=1}^{P} s^T(p)\, t(p)$
• Training samples:
    p      s(p)          t(p)
    1      (1 0 0 0)     (1, 0)
    2      (1 1 0 0)     (1, 0)
    3      (0 0 0 1)     (0, 1)
    4      (0 0 1 1)     (0, 1)
$s^T(1)\, t(1) = \begin{bmatrix} 1 & 0 \\ 0 & 0 \\ 0 & 0 \\ 0 & 0 \end{bmatrix}$  $s^T(2)\, t(2) = \begin{bmatrix} 1 & 0 \\ 1 & 0 \\ 0 & 0 \\ 0 & 0 \end{bmatrix}$  $s^T(3)\, t(3) = \begin{bmatrix} 0 & 0 \\ 0 & 0 \\ 0 & 0 \\ 0 & 1 \end{bmatrix}$  $s^T(4)\, t(4) = \begin{bmatrix} 0 & 0 \\ 0 & 0 \\ 0 & 1 \\ 0 & 1 \end{bmatrix}$

Computing the weights:

$W = \begin{bmatrix} 2 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 2 \end{bmatrix}$
• Recall:
  – x = (1 0 0 0):  $x\,W = (2, 0)$, so $y_1 = 1,\ y_2 = 0$
  – x = (0 1 0 0):  $x\,W = (1, 0)$, so $y_1 = 1,\ y_2 = 0$  (similar to s(1) and s(2))
    (1 0 0 0) and (1 1 0 0) belong to class (1, 0); (0 0 0 1) and (0 0 1 1) belong to class (0, 1)
  – x = (0 1 1 0):  $x\,W = (1, 1)$, so $y_1 = 1,\ y_2 = 1$
    (0 1 1 0) is not sufficiently similar to either class
• The delta rule would give the same or similar results.
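A minimal numpy sketch that reproduces this example; the array names S, T, W and the recall helper are our own:

```python
import numpy as np

# Training pairs from the example: |s| = 4, |t| = 2
S = np.array([[1, 0, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 1]])
T = np.array([[1, 0],
              [1, 0],
              [0, 1],
              [0, 1]])

W = S.T @ T                      # Hebbian rule: sum of outer products s^T(p) t(p)

def recall(x, W):
    """Threshold activation: y_j = 1 if y_in_j > 0, else 0."""
    return (x @ W > 0).astype(int)

for x in [(1, 0, 0, 0), (0, 1, 0, 0), (0, 1, 1, 0)]:
    print(x, "->", recall(np.array(x), W))
# (1, 0, 0, 0) -> [1 0]
# (0, 1, 0, 0) -> [1 0]
# (0, 1, 1, 0) -> [1 1]   (not similar enough to either class)
```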
Example of auto-associative memory
• Same as hetero-associative nets, except t(p) = s(p).
• Used to recall a pattern from its noisy or incomplete version (pattern completion / pattern recovery).
• A single pattern s = (1, 1, 1, -1) is stored (weights computed
by Hebbian rule – outer product)
$W = \begin{bmatrix} 1 & 1 & 1 & -1 \\ 1 & 1 & 1 & -1 \\ 1 & 1 & 1 & -1 \\ -1 & -1 & -1 & 1 \end{bmatrix}$
• With $w_{jj} = 0$, the net input to unit j when a stored (bipolar) pattern a(k) is presented:
  $\sum_{i=1}^{n} a_i(k)\, w_{ij} = \sum_{i \ne j} a_i(k) \sum_{p=1}^{P} a_i(p)\, a_j(p)$
  $= \sum_{p=1}^{P} a_j(p) \sum_{i \ne j} a_i(k)\, a_i(p)$
  $= a_j(k) \sum_{i \ne j} a_i(k)\, a_i(k) + \sum_{p \ne k} a_j(p) \sum_{i \ne j} a_i(k)\, a_i(p)$
  $= a_j(k)\,(n-1) + \sum_{p \ne k} a_j(p) \sum_{i \ne j} a_i(k)\, a_i(p)$
  (the first term is the principal term, the second the cross-talk term; $a_i(k)\, a_i(k) = 1$ for bipolar patterns)
• $w_{ii} = 0$: same as the Hebbian rule, but with zero diagonal.
• For binary patterns: $w_{ij} = \sum_{p} \big(2 s_i(p) - 1\big)\big(2 s_j(p) - 1\big)$ for $i \ne j$, and $w_{ii} = 0$.
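A minimal numpy sketch of this auto-associative example; the noisy and incomplete probe vectors are our own choices:

```python
import numpy as np

s = np.array([1, 1, 1, -1])           # stored bipolar pattern

W = np.outer(s, s)                    # Hebbian outer product
np.fill_diagonal(W, 0)                # zero the diagonal, as discussed above

def recall(x, W):
    """One synchronous pass with a sign-like threshold (bipolar output)."""
    return np.where(x @ W >= 0, 1, -1)

noisy   = np.array([1, -1, 1, -1])    # one component flipped
partial = np.array([1, 1, 1, 0])      # one component missing (set to 0)
print(recall(noisy, W))               # [ 1  1  1 -1]
print(recall(partial, W))             # [ 1  1  1 -1]
```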
• Example recall steps (Discrete Hopfield Memory, asynchronous update, one unit at a time):
  – $Y_3$ is selected: $y\_in_3 = x_3 + \sum_j y_j\, w_{j3} = 1 + 1 = 2 > 0$, so $y_3 = 1$ and $Y = (1, 0, 1, 0)$
  – $Y_2$ is selected: $y\_in_2 = x_2 + \sum_j y_j\, w_{j2} = 0 + 2 = 2 > 0$, so $y_2 = 1$ and $Y = (1, 1, 1, 0)$
  – The stored pattern is correctly recalled.
Convergence Analysis of DHM
• Two questions:
1. Will a Hopfield AM converge (stop) for any given recall input?
2. Will a Hopfield AM converge to the stored pattern that is closest to the recall input?
• Hopfield answers the first question
  – by introducing an energy function for this model;
  – there is no satisfactory answer to the second question so far.
• Energy function:
– A notion from thermodynamic physical systems: such a system has a tendency to move toward a lower energy state.
– Also known as a Lyapunov function, after the Lyapunov theorem for the stability of a system of differential equations.
• In general, the energy function $E(y(t))$, where $y(t)$ is the state of the system at step (time) $t$, must satisfy two conditions:
  1. $E(t)$ is bounded from below: $E(t) \ge c$ for all $t$.
  2. $E(t)$ is monotonically nonincreasing: $\Delta E(t+1) = E(t+1) - E(t) \le 0$ (in the continuous version: $dE/dt \le 0$).
• The energy function defined for DHM:
  $E = -0.5 \sum_{i} \sum_{j \ne i} y_i\, y_j\, w_{ij} - \sum_i x_i\, y_i + \sum_i \theta_i\, y_i$
• Show $\Delta E(t+1) \le 0$:
  At $t+1$, unit $Y_k$ is selected for update:
  $\Delta y_k(t+1) = y_k(t+1) - y_k(t)$
  Note: $\Delta y_j(t+1) = 0$ for $j \ne k$ (only one unit can update at a time).
  $\Delta E(t+1) = E(t+1) - E(t)$
  $= \Big( -0.5 \sum_{i} \sum_{j \ne i} y_i(t+1)\, y_j(t+1)\, w_{ij} - \sum_i x_i\, y_i(t+1) + \sum_i \theta_i\, y_i(t+1) \Big)$
  $\;\;\; - \Big( -0.5 \sum_{i} \sum_{j \ne i} y_i(t)\, y_j(t)\, w_{ij} - \sum_i x_i\, y_i(t) + \sum_i \theta_i\, y_i(t) \Big)$
  The terms that differ in the two parts are those involving $y_k$: $\sum_{j \ne k} y_j\, w_{jk}\, y_k$, $\sum_{i \ne k} y_i\, w_{ki}\, y_k$, $x_k\, y_k$, and $\theta_k\, y_k$. Since $w_{jk} = w_{kj}$,
  $\Delta E(t+1) = -\Big[ \sum_{j \ne k} y_j(t)\, w_{jk} + x_k - \theta_k \Big]\, \Delta y_k(t+1) = -\big[\, y\_in_k - \theta_k \,\big]\, \Delta y_k(t+1)$
  Cases:
  – if $y_k$ is turned on ($y_k(t+1) > y_k(t)$): this happens only when $y\_in_k > \theta_k$, and $\Delta y_k(t+1) > 0$, so $\Delta E(t+1) < 0$
  – if $y_k$ is turned off ($y_k(t+1) < y_k(t)$): this happens only when $y\_in_k < \theta_k$, and $\Delta y_k(t+1) < 0$, so $\Delta E(t+1) < 0$
  – otherwise $y_k(t+1) = y_k(t)$, so $\Delta y_k(t+1) = 0$ and $\Delta E(t+1) = 0$
  For every $k$, either $\Delta y_k(t+1) = 0$ or $(y\_in_k - \theta_k)\, \Delta y_k(t+1) > 0$, hence $\Delta E(t+1) \le 0$.
1. Since E is bounded from below and non-increasing, and the network has only finitely many states, the state updates must converge (the network stops changing).
2. The state to which the system converges is a stable state: the system returns to it after a small perturbation. Such a state is called an attractor (each with its own basin of attraction).
3. The error function of BP learning is another example of an energy/Lyapunov function, because
• it is bounded from below (E > 0), and
• it is monotonically non-increasing (W is updated along the gradient-descent direction of E).
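A small numpy sketch of asynchronous DHM updates that tracks the energy; the stored pattern (1, 1, 1, 0) matches the recall example above, but the probe input (0, 0, 1, 0), the zero thresholds, and the bipolar-converted Hebbian weights are our own assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def energy(y, W, x, theta):
    """E = -0.5 * sum_{i != j} y_i y_j w_ij - sum_i x_i y_i + sum_i theta_i y_i."""
    return -0.5 * (y @ W @ y) - x @ y + theta @ y

s = np.array([1, 1, 1, 0])                 # stored binary pattern
W = np.outer(2 * s - 1, 2 * s - 1).astype(float)
np.fill_diagonal(W, 0)                     # zero-diagonal Hebbian weights

x = np.array([0.0, 0.0, 1.0, 0.0])         # recall probe (assumed), kept as external input
theta = np.zeros(4)
y = x.copy()                               # initial state

E = [energy(y, W, x, theta)]
for _ in range(3):                         # a few asynchronous sweeps
    for k in rng.permutation(len(y)):      # update one unit at a time, in random order
        y_in = x[k] + y @ W[:, k]
        if y_in > theta[k]:
            y[k] = 1
        elif y_in < theta[k]:
            y[k] = 0
        E.append(energy(y, W, x, theta))

print(y)                                       # [1. 1. 1. 0.]  -- the stored pattern
print(all(b <= a for a, b in zip(E, E[1:])))   # True: E never increases
```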
Capacity Analysis of DHM
• P: the maximum number of random patterns of dimension n that can be stored in a DHM of n nodes.
• Hopfield's observation: $P \approx 0.15\, n$, i.e. $P/n \approx 0.15$.
• Theoretical analysis: $P \approx \dfrac{n}{2 \log_2 n}$, i.e. $\dfrac{P}{n} \approx \dfrac{1}{2 \log_2 n}$.
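For a sense of scale, a quick check of the two estimates (n = 100 is an arbitrary choice):

```python
import math

n = 100
print(0.15 * n)                  # Hopfield's empirical estimate: 15.0 patterns
print(n / (2 * math.log2(n)))    # theoretical estimate: about 7.5 patterns
```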
[Diagram: hetero-association realized by a layered network: input1, input2 -> hidden1, hidden2 -> output1, output2]
Bidirectional AM (BAM)
• Architecture:
– Two layers of non-linear units: X-layer, Y-layer
– Units: discrete threshold or continuous sigmoid (patterns can be either binary or bipolar)
• Weights:
– $W_{n \times m} = \sum_{p=1}^{P} s^T(p)\, t(p)$ (Hebbian / outer product)
– Symmetric: $w_{ij} = w_{ji}$
– Convert binary patterns to bipolar when constructing W
• Recall:
– Bidirectional: either by X (to recall a Y) or by Y (to recall an X)
– Recurrent: $y(t) = \big( f(y\_in_1(t)), \ldots, f(y\_in_m(t)) \big)$,
  where $y\_in_j(t) = \sum_{i=1}^{n} w_{ij}\, x_i(t-1)$
  $x(t+1) = \big( f(x\_in_1(t+1)), \ldots, f(x\_in_n(t+1)) \big)$,
  where $x\_in_i(t+1) = \sum_{j=1}^{m} w_{ij}\, y_j(t)$
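A minimal BAM sketch under these definitions; the two stored bipolar pairs and the noisy probe are made-up illustrations, and a zero net input simply keeps the previous activation:

```python
import numpy as np

# Hypothetical bipolar training pairs (n = 6, m = 4), chosen only for illustration
S = np.array([[ 1,  1,  1, -1, -1, -1],
              [ 1, -1,  1, -1,  1, -1]])
T = np.array([[ 1,  1, -1, -1],
              [ 1, -1,  1, -1]])

W = S.T @ T                            # Hebbian / outer-product weights, shape (n, m)

def f(a, prev):
    """Bipolar threshold; keep the previous activation when the net input is 0."""
    return np.where(a > 0, 1, np.where(a < 0, -1, prev))

# Bidirectional, recurrent recall starting from a noisy X-layer pattern
x = np.array([-1, 1, 1, -1, -1, -1])   # S[0] with its first component flipped
y = np.zeros(4, dtype=int)
for _ in range(5):                     # alternate X -> Y -> X until stable
    y = f(x @ W, y)                    # y_in_j = sum_i w_ij x_i(t-1)
    x = f(W @ y, x)                    # x_in_i = sum_j w_ij y_j(t)

print(x)   # [ 1  1  1 -1 -1 -1]  (= S[0])
print(y)   # [ 1  1 -1 -1]        (= T[0])
```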