Machine Learning for Physicists
University of Erlangen-Nuremberg
& Max Planck Institute for the
Science of Light
Florian Marquardt
[email protected]
http://machine-learning-for-physicists.org
[Figure: a convolutional network applied along the time axis: input sequence at the bottom, output at the top]
Long memories with convolutional nets are challenging:
• would need large filter sizes
• even then, the required memory time would need to be known beforehand
• the memory time can be expanded efficiently by a multi-layer network with subsampling (pooling), but this is still problematic for precise long-term memory (see the estimate below)
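A rough estimate (our own illustration, assuming kernel size $k$ and a pooling factor $p$ after each of $L$ layers): the reachable memory time grows only like
$$T_{\text{memory}} \sim k\, p^{\,L-1}$$
time steps, and the same subsampling that buys the factor $p^{L-1}$ also coarsens the temporal resolution.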
[Figure: timeline with an early signal, a long stretch of no important signals, then "signal!", at which point the early signal must be recalled]
But: may be OK for some physics applications!
(problems local in time, with short memory)
Memory
Solution: Recurrent Neural Networks (RNN)
[Figure: RNN unrolled in time: an input and an output at each time step, with a recurrent "keep memory" connection from one time step to the next]
Advantage: in principle, this could give arbitrarily long memory!
Note: each circle may represent multiple neurons (i.e. a layer). Each arrow then represents all possible connections between those neurons.
Solution: Recurrent Neural Networks (RNN)
Note: the weights are not time-dependent, i.e. one needs to store only one set of weights (similar to a convolutional net); the update equation below makes this explicit.
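A simple ("vanilla") recurrent layer updates its hidden state as
$$h_t = f\bigl(W\, h_{t-1} + \tilde W\, x_t + b\bigr)$$
with a nonlinearity $f$ and the same weights $W$, $\tilde W$, $b$ reused at every time step (the notation $\tilde W$ is ours, for illustration).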
[Figure: the unrolled network (input, hidden, output at each time step); the training signal enters where the "correct answer" is known, and backpropagation runs backwards through time]
LSTM: Forget gate (delete)
[Figure: the memory cell content is multiplied (*) by the forget gate: $c_{t-1} \to c_t$]
keep: $c_t = 1 * c_{t-1}$
delete: $c_t = 0 * c_{t-1}$
LSTM: Forget gate (delete)
Backpropagation
[Figure: memory cell content $c_{t-1} \to (*) \to c_t$; the "forget gate f" is computed from the input $x_t$; * denotes the elementwise product]
The multiplication * splits the error backpropagation into two branches, according to the product rule:
$$\frac{\partial\,(f_j\, c_{t-1,j})}{\partial w_*} = \frac{\partial f_j}{\partial w_*}\, c_{t-1,j} + f_j\, \frac{\partial c_{t-1,j}}{\partial w_*}$$
(Note: if the time is not specified, we are referring to $t$)
LSTM: Forget gate (delete)
[Figure: the memory cell content is carried along from $t-1$ to $t$ to $t+1$, multiplied (*) at each step by the forget gate f computed from the input]
LSTM: Write new memory value
$$i = \sigma\bigl(W^{(i)} x_t + b^{(i)}\bigr), \qquad \tilde c_t = \tanh\bigl(W^{(c)} x_t + b^{(c)}\bigr)$$
[Figure: the "input gate i" controls how much of the new value $\tilde c_t$ is added to the memory cell]
both delete and write together:
$$c_t = f * c_{t-1} + i * \tilde c_t$$
(first term: forget the old value; second term: write the new value)
LSTM: Read (output) memory value
$$o = \sigma\bigl(W^{(o)} x_t + b^{(o)}\bigr), \qquad h_t = o * \tanh(c_t)$$
[Figure: the "output gate o", computed from the input $x_t$, controls how much of the memory cell content $c_t$ appears in the output $h_t$]
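The three gates combine into a single update step. A minimal numpy sketch, using the simplified gates of these slides (which depend only on $x_t$; the parameter names are chosen here for illustration):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, c_prev, params):
    """One LSTM update: forget, write, read."""
    Wf, bf, Wi, bi, Wc, bc, Wo, bo = params
    f = sigmoid(Wf @ x_t + bf)          # forget gate
    i = sigmoid(Wi @ x_t + bi)          # input gate
    c_tilde = np.tanh(Wc @ x_t + bc)    # candidate new memory value
    c_t = f * c_prev + i * c_tilde      # delete and write together
    o = sigmoid(Wo @ x_t + bo)          # output gate
    h_t = o * np.tanh(c_t)              # read the memory
    return c_t, h_t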
LSTM: exploit previous memory output ‘h’ (in the full LSTM, all gates receive $h_{t-1}$ in addition to $x_t$)
rnn.add(LSTM(10, return_sequences=True))

[Figure: a network with a 3-dimensional input, an LSTM layer of 5 units, and an LSTM layer of 2 units]

# imports needed to make this runnable:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM

def init_memory_net():
    global rnn, batchsize, timesteps
    rnn = Sequential()
    # note: batch_input_shape is (batchsize, timesteps, data_dim)
    rnn.add(LSTM(5, batch_input_shape=(None, timesteps, 3),
                 return_sequences=True))
    rnn.add(LSTM(2, return_sequences=True))
    rnn.compile(loss='mean_squared_error',
                optimizer='adam', metrics=['accuracy'])
Example: A network for recall
(see code on website; a toy data-generation sketch follows below)
[Figure: example input sequences over time: a "signal" channel carrying a value (e.g. 0.4) and a "tell" channel marking the time at which the value should be recalled]
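The actual code is on the website. Purely for illustration, training data for such a recall task could be generated along the following lines (the three input channels and two output channels match the network above, but the channel assignment is our guess):

import numpy as np

def make_recall_batch(batchsize, timesteps):
    """Hypothetical recall task: input channel 0 carries a random value
    at t=0, channel 1 flags the 'tell' time, channel 2 is a constant bias
    (assumption). Target: output the stored value at the flagged time."""
    x = np.zeros((batchsize, timesteps, 3))
    y = np.zeros((batchsize, timesteps, 2))   # only output channel 0 is used here
    values = np.random.uniform(size=batchsize)
    tell = np.random.randint(1, timesteps, size=batchsize)
    x[:, 0, 0] = values                        # the signal to remember
    x[np.arange(batchsize), tell, 1] = 1.0     # the 'tell' flag
    x[:, :, 2] = 1.0                           # bias channel (assumption)
    y[np.arange(batchsize), tell, 0] = values  # recall the value on request
    return x, y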
Output of the recall network, evolving during training (for a fixed input sequence)
[Figure: input channels SIGNAL, TELL, and TELL (delay 5), together with the network's RECALL output]
input sequence: T H E _ T H E O R Y _ O F _ G E
(characters in one-hot encoding, one character per time step)
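For concreteness, a minimal sketch of this encoding (the alphabet is built here from the sample string itself, just for illustration):

import numpy as np

text = "THE THEORY OF GE"                 # sample from the slide
alphabet = sorted(set(text))              # character inventory
char_to_index = {c: i for i, c in enumerate(alphabet)}

# one-hot matrix of shape (timesteps, alphabet size): one character per row
onehot = np.zeros((len(text), len(alphabet)))
for t, c in enumerate(text):
    onehot[t, char_to_index[c]] = 1.0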
Sample text generated, character by character, by a recurrent network trained on Tolstoy's "War and Peace" (the misspellings are the network's own):
we counter. He stutn co des. His stanted out one ofler that concossions and was
to gearang reay Jotrets and with fre colt otf paitt thin wall. Which das stimn
Aftair fall unsuch that the hall for Prince Velzonski's that me of
her hearly, and behs to so arwage fiving were to it beloge, pavu say falling misfort
how, and Gogition is so overelical and ofter.
"Why do what that day," replied Natasha, and wishing to himself the fact the
princess, Princess Mary was easier, fed in had oftened him.
Pierre aking his soul came to the packs and drove up his father-in-law women.
Homework
input 3+5=??
output ....08
input 7-5=??
output ....02
[Figure: one-hot encoding of words: each word (warm, three, apple, desk, cold, tree, nice, bird, that, two, was, one, sun, hot, full) gets its own basis vector; dimension = number of words in the dictionary]
car-cars ~ tree-trees
(subtracting the word vectors on each side yields approximately identical vectors)
[Figure: the vectors for "car", "cars", "tree", "trees"; the singular-to-plural difference vector is nearly the same for both word pairs]
Word vectors encode meaning
Mikolov, Yih, Zweig 2013
[Figure: word-vector analogies: "man" is to "woman" as "king" is to "queen" and "uncle" is to "aunt"; see the sketch below]
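The analogy can be checked with simple vector arithmetic; a toy sketch with made-up 2D vectors (real word2vec vectors have hundreds of dimensions):

import numpy as np

# made-up toy vectors, not real word2vec values:
vec = {"man": np.array([1.0, 0.0]), "woman": np.array([1.0, 1.0]),
       "king": np.array([3.0, 0.1]), "queen": np.array([3.0, 1.1])}

# 'man is to king as woman is to ?' becomes vector arithmetic:
candidate = vec["king"] - vec["man"] + vec["woman"]
closest = min(vec, key=lambda w: np.linalg.norm(vec[w] - candidate))
print(closest)  # -> 'queen' for these toy numbers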
Word vectors encode meaning
Mikolov et al. 2013, "Distributed Representations of Words and Phrases and their Compositionality"
[Figure 2 of that paper: "Two-dimensional PCA projection of the 1000-dimensional Skip-gram vectors of countries and their capital cities. The figure illustrates ability of the model to automatically organize concepts and learn implicitly …" Countries (Poland, Germany, France, Italy, Greece, Spain, Portugal) and their capitals (Warsaw, Berlin, Paris, Rome, Athens, Madrid, Lisbon) form parallel pairs.]
Word vectors in keras
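In keras, word vectors are provided by a trainable Embedding layer, whose weight matrix holds one vector per word. A minimal sketch (vocabulary size, sequence length, and vector dimension are placeholder values):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

vocab_size, maxlen = 10000, 20
model = Sequential()
model.add(Embedding(vocab_size, 50, input_length=maxlen))  # learns 50-dim word vectors
model.add(LSTM(32))                                        # processes the vector sequence
model.add(Dense(vocab_size, activation='softmax'))         # e.g. next-word prediction
model.compile(loss='categorical_crossentropy', optimizer='adam')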
[Figure: supervised learning vs. reinforcement learning: a smart teacher with a student who imitates the teacher, versus a student/scientist who tries out things on their own]
[Figure: the "agent" receives an observation of the "state" of the "environment" (fully observed vs. partially observed) and acts back on it with an action]
Self-driving cars, robotics:
Observe immediate environment & move
Games:
Observe board & place stone
Observe video screen & move player
[Diagram: the RL-agent's "policy" maps the observed state to an action; the RL-environment returns the next observation]
state = position x,y
action = move (direction)
reward for picking up a box
Policy Gradient
= REINFORCE (Williams 1992): the simplest model-free, general reinforcement learning technique
[Diagram: RL-agent and RL-environment, connected by state/observation and action via the "policy"]
Policy: $\pi_\theta(a_t|s_t)$ – the probability to pick action $a_t$, given the observed state $s_t$ at time $t$
[Diagram: the policy is implemented by a neural network that maps the observation to action probabilities]
Policy Gradient
Probabilistic policy: probability to take action $a$, given the current state $s$:
$$\pi_\theta(a|s)$$
Example (for one fixed state $s$):
  action   $\pi$
  down     0.1
  up       0.6
  left     0.2
  right    0.1
$$\frac{\partial \ln \pi_\theta(a_t|s_t)}{\partial \theta}$$
$$\frac{\partial \bar R}{\partial \theta} = E\!\left[R \sum_t \frac{\partial \ln \pi_\theta(a_t|s_t)}{\partial \theta}\right]$$
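In numpy-like pseudocode, the expectation is replaced by an average over sampled trajectories (all names here are illustrative, not from the lecture code):

import numpy as np

def policy_gradient_estimate(trajectories):
    """trajectories: list of (R, grad_log_pi) pairs, where R is the total
    return and grad_log_pi[t] = d ln pi_theta(a_t|s_t) / d theta."""
    samples = [R * np.sum(grad_log_pi, axis=0) for R, grad_log_pi in trajectories]
    return np.mean(samples, axis=0)   # stochastic estimate of dR_bar/dtheta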
$$X_k = (R - b_k)\, G_k$$
$$\mathrm{Var}[X_k] = E[X_k^2] - E[X_k]^2 = \min$$
$$\frac{\partial\, \mathrm{Var}[X_k]}{\partial b_k} = 0$$
$$b_k = \frac{E[G_k^2 R]}{E[G_k^2]}$$
$$G_k = \frac{\partial \ln P_\theta(\tau)}{\partial \theta_k}$$
$$\delta\theta_k = \eta\, E[G_k (R - b_k)]$$
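One intermediate step makes the result for $b_k$ transparent: since $E[G_k] = \sum_\tau \partial P_\theta(\tau)/\partial \theta_k = \partial 1/\partial \theta_k = 0$, the mean $E[X_k]$ does not depend on $b_k$, so minimizing the variance amounts to minimizing $E[X_k^2]$:
$$\frac{\partial}{\partial b_k} E\!\left[G_k^2 (R - b_k)^2\right] = -2\, E\!\left[G_k^2 (R - b_k)\right] = 0 \;\;\Rightarrow\;\; b_k = \frac{E[G_k^2 R]}{E[G_k^2]}$$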
For more in-depth treatment, see David Silver’s course on
reinforcement learning (University College London):
http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html
The simplest RL example ever
[Figure: reward vs. time]
The simplest RL example ever
(… extremely rare)
$$\pi_\theta(\text{up}), \qquad \pi_\theta(\text{up})\,\bigl(1 - \pi_\theta(\text{up})\bigr)$$
The simplest RL example ever
[Figure: probability $\pi_\theta(\text{up})$ vs. training time for 3 learning attempts: strong fluctuations!]
Relative spread of the gradient estimate $X'$, when averaging over $N$ samples:
$$\frac{\sqrt{\mathrm{Var}(X')}}{\langle X \rangle} \sim \frac{1}{\sqrt{N}}$$
For a sum $\Sigma X$ over a batch of $M$ trajectories:
$$\langle \Sigma X \rangle = M\, \langle X \rangle, \qquad \sqrt{\mathrm{Var}\, \Sigma X} = \sqrt{M}\, \sqrt{\mathrm{Var}\, X}$$
$$\text{relative spread } \frac{\sqrt{\mathrm{Var}\, \Sigma X}}{\langle \Sigma X \rangle} \sim \frac{1}{\sqrt{M}}$$
Homework
[Figure: homework environment with a “target site”]
policy $\pi_\theta(a|s)$, implemented by a neural network (input = state)
execute the action, record the new state
do one trajectory (in reality: a batch of trajectories)
set the desired outputs: nonzero only for the action actually taken, i.e. $P(a) = 0$ for all other actions $a$
update $\delta\theta = -\eta\, \frac{\partial C}{\partial \theta}$: this implements policy gradient
RL in keras: categorical cross-entropy trick
Encountered N states (during repeated runs).
After setting categorical cross-entropy as the cost function, just use the following simple line to implement policy gradient:

net.train_on_batch(observed_inputs, desired_outputs)

observed_inputs: array N x state-size; desired_outputs: array N x number-of-actions.
Here desired_outputs[j,a]=R for the state numbered j, if action a was taken during a run that gave overall return R.
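A self-contained sketch of how the pieces fit together (the network layout, sizes, and placeholder data are our assumptions, not the lecture's actual code):

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

num_actions, state_size = 4, 2
net = Sequential()
net.add(Dense(32, input_shape=(state_size,), activation='relu'))
net.add(Dense(num_actions, activation='softmax'))   # outputs pi_theta(a|s)
net.compile(loss='categorical_crossentropy', optimizer='adam')

# after a batch of runs (placeholder data standing in for recorded runs):
N = 10
observed_inputs = np.random.rand(N, state_size)      # the N encountered states
actions_taken = np.random.randint(num_actions, size=N)
returns = np.ones(N)                                 # overall return R of each run
desired_outputs = np.zeros((N, num_actions))
desired_outputs[np.arange(N), actions_taken] = returns   # desired_outputs[j,a] = R
net.train_on_batch(observed_inputs, desired_outputs)

This implements the policy gradient because the categorical cross-entropy is $C = -\sum_a \text{desired}[j,a] \ln \pi_\theta(a|s_j) = -R \ln \pi_\theta(a_j|s_j)$, so one gradient-descent step changes $\theta$ by $\eta\, R\, \partial \ln \pi_\theta(a_j|s_j)/\partial\theta$.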
AlphaGo
[Figure: t-SNE visualization of the last hidden layer]
Advanced Exercise
In thermal equilibrium, the probability for state $s$ is given by the Boltzmann distribution, with normalization $Z$ (the "partition function"):
$$P(s) = \frac{1}{Z}\, e^{-E(s)/k_B T}, \qquad Z = \sum_{s'} e^{-E(s')/k_B T}$$
Problem: for a many-body system, there are exponentially many states (for example $2^N$ spin states). Cannot go through all of them!
Monte Carlo approach
[Figure: a Markov chain of states $s_0 \to s_1 \to s_2 \to s_3 \to \dots$, with transition probabilities $P(s' \leftarrow s)$ and $P(s \leftarrow s')$; a sketch follows below]
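A generic Metropolis sketch for such a chain (the energy function E is left abstract; spin values ±1 are an assumption):

import numpy as np

def metropolis_chain(E, s, nsteps, kBT=1.0):
    """Generate a Markov chain of spin configurations with P(s) ~ exp(-E(s)/kBT)."""
    for _ in range(nsteps):
        s_new = s.copy()
        s_new[np.random.randint(len(s))] *= -1        # propose: flip one spin
        # accept with probability min(1, exp(-(E(s_new) - E(s)) / kBT))
        if np.random.rand() < np.exp(-(E(s_new) - E(s)) / kBT):
            s = s_new
        yield s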
“hidden” units h
“visible” units v
[Figure: the restricted Boltzmann machine: every hidden unit is connected to every visible unit, with no connections within a layer]
[Figure: the sampling chain $v \to h \to v' \to h' \to v''$; $v'$ is the "reconstruction"]
$$Z\, P(v) = \sum_h e^{-E(v,h)} = e^{\sum_i a_i v_i}\; \prod_j \bigl(1 + e^{z_j}\bigr)$$
$$\text{with: } z_j = b_j + \sum_i v_i\, w_{ij}$$
where we used:
$$e^{\sum_j X_j} = \prod_j e^{X_j}, \qquad \sum_h \ldots = \sum_{h_0=0,1}\; \sum_{h_1=0,1}\; \sum_{h_2=0,1} \cdots$$
Therefore:
$$P(h|v) = \frac{e^{-E(v,h)}}{Z\, P(v)} = \prod_j \frac{e^{z_j h_j}}{1 + e^{z_j}}$$
Product of probabilities! All the $h_j$ are independently distributed, with probabilities:
$$P(h_j = 1|v) = \frac{e^{z_j}}{1 + e^{z_j}} = \sigma(z_j) \qquad \text{(sigmoid)}$$
$$P(h_j = 0|v) = 1 - \sigma(z_j)$$
Building a Markov chain
Given some visible-units state vector $v$, calculate the probabilities
$$P(h_j = 1|v) = \frac{e^{z_j}}{1 + e^{z_j}} = \sigma(z_j) \qquad \text{(sigmoid)}$$
Then assign 1 or 0, according to these probabilities, to obtain the new hidden state vector $h$.
[Figure: the resulting chain $v \to v' \to v''$, starting from a training sample $v$; a numpy sketch follows below]
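One step of this chain in numpy (a sketch; a, b, w denote the visible biases, hidden biases, and weights, and the visible-given-hidden step uses the symmetric formula):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_h_given_v(v, b, w):
    z = b + v @ w                       # z_j = b_j + sum_i v_i w_ij
    return (np.random.rand(len(z)) < sigmoid(z)).astype(float)

def sample_v_given_h(h, a, w):
    z = a + w @ h                       # symmetric: z'_i = a_i + sum_j w_ij h_j
    return (np.random.rand(len(z)) < sigmoid(z)).astype(float)

# chain: v -> h -> v' -> h' -> v''  (v: training sample)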
[Figure: an RBM ("visible" units v, "hidden" units h) with an output layer (e.g. softmax) stacked on top]
Deep belief networks
Stack RBMs: first train a simple RBM, then use its hidden units as input to another RBM, and so on.
[Figure, bottom to top: visible units v, hidden units h1 (first RBM), hidden units h2 (second RBM), hidden units h3 (third RBM)]
“hidden” units h
“visible” units v = spins of the quantum model
Try to solve a quantum many-body problem (quantum spin model) using the following variational ansatz for the wave-function amplitudes:
$$\Psi(S) = \sum_h e^{\sum_j a_j \sigma_j^z + \sum_i b_i h_i + \sum_{ij} h_i W_{ij} \sigma_j^z}$$
Carleo & Troyer, Science 2017
$S = (\sigma_1^z, \sigma_2^z, \ldots, \sigma_N^z)$: one basis state in the many-body Hilbert space
$\sigma_j^z = \pm 1$, $h_i = \pm 1$ (in general, $a$, $b$, $W$ may be complex)
This is exactly (proportional to) the RBM representation for $P(v)$ [with $v = S$]!
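Just as for $P(v)$, the sum over the hidden variables can be carried out analytically; with $h_i = \pm 1$ (instead of 0/1), the factors $1 + e^z$ become $2\cosh z$:
$$\Psi(S) = e^{\sum_j a_j \sigma_j^z}\; \prod_i 2\cosh\!\Big(b_i + \sum_j W_{ij}\, \sigma_j^z\Big)$$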
Application to Quantum Physics
[Figure: RBM with "visible" units v = spins of the quantum model, "hidden" units h]
Minimize the energy
$$\frac{\langle \Psi | \hat H | \Psi \rangle}{\langle \Psi | \Psi \rangle}$$
by adapting the weights $W$ and the biases $a$ and $b$!
[requires an additional Monte Carlo simulation, to obtain a stochastic sampling of the gradient with respect to these parameters]
For example: sample the probabilities by using the Metropolis algorithm, with transition probabilities
$$P(S \to S') = \min\!\left(1,\; \left|\frac{\Psi(S')}{\Psi(S)}\right|^2\right)$$
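For completeness (standard variational Monte Carlo, not spelled out on the slide): the energy is estimated by averaging the "local energy" over samples $S$ drawn from $|\Psi(S)|^2$,
$$\frac{\langle \Psi|\hat H|\Psi\rangle}{\langle \Psi|\Psi\rangle} = \bigl\langle E_{\rm loc}(S) \bigr\rangle_{|\Psi|^2}, \qquad E_{\rm loc}(S) = \sum_{S'} H_{SS'}\, \frac{\Psi(S')}{\Psi(S)}$$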
Application to Quantum Physics
Carleo & Troyer, Science 2017
[Figure: error vs. number of 'filters']
http://machine-learning-for-physicists.org