Machine Learning for Physicists
University of Erlangen-Nuremberg
& Max Planck Institute for the
Science of Light
Florian Marquardt
[email protected]
http://machine-learning-for-physicists.org
[Figure: a convolutional network applied along the time axis: input sequence at the bottom, output at the top]
Long memories with convolutional nets are challenging:
• would need large filter sizes
• even then, the required memory time would need to be known beforehand
• the memory time can be expanded efficiently by a multi-layer network with subsampling (pooling), but this is still problematic for precise long-term memory (see the estimate below)
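A rough estimate (our own illustration, assuming kernel size $k$ and a pooling factor $p$ after each of $L$ layers): the reachable memory time grows only like
$$T_{\text{memory}} \sim k\, p^{\,L-1}$$
time steps, and the same subsampling that buys the factor $p^{L-1}$ also coarsens the temporal resolution.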
[Figure: timeline with an early signal, a long stretch of no important signals, then "signal!", at which point the early signal must be recalled]
But: may be OK for some physics applications!
(problems local in time, with short memory)
Memory
Solution: Recurrent Neural Networks (RNN)
[Figure: RNN unrolled in time: an input and an output at each time step, with a recurrent "keep memory" connection from one time step to the next]
Advantage: in principle, this could give arbitrarily long memory!
Note: each circle may represent multiple neurons (i.e. a layer). Each arrow then represents all possible connections between those neurons.
Solution: Recurrent Neural Networks (RNN)
Note: the weights are not time-dependent, i.e. one needs to store only one set of weights (similar to a convolutional net); the update equation below makes this explicit.
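A simple ("vanilla") recurrent layer updates its hidden state as
$$h_t = f\bigl(W\, h_{t-1} + \tilde W\, x_t + b\bigr)$$
with a nonlinearity $f$ and the same weights $W$, $\tilde W$, $b$ reused at every time step (the notation $\tilde W$ is ours, for illustration).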
[Figure: the unrolled network (input, hidden, output at each time step); the training signal enters where the "correct answer" is known, and backpropagation runs backwards through time]
LSTM: Forget gate (delete)
[Figure: the memory cell content is multiplied (*) by the forget gate: $c_{t-1} \to c_t$]
keep: $c_t = 1 * c_{t-1}$
delete: $c_t = 0 * c_{t-1}$
LSTM: Forget gate (delete)
Backpropagation
[Figure: memory cell content $c_{t-1} \to (*) \to c_t$; the "forget gate f" is computed from the input $x_t$; * denotes the elementwise product]
The multiplication * splits the error backpropagation into two branches, according to the product rule:
$$\frac{\partial\,(f_j\, c_{t-1,j})}{\partial w_*} = \frac{\partial f_j}{\partial w_*}\, c_{t-1,j} + f_j\, \frac{\partial c_{t-1,j}}{\partial w_*}$$
(Note: if the time is not specified, we are referring to $t$)
LSTM: Forget gate (delete)
[Figure: the memory cell content is carried along from $t-1$ to $t$ to $t+1$, multiplied (*) at each step by the forget gate f computed from the input]
LSTM: Write new memory value
$$i = \sigma\bigl(W^{(i)} x_t + b^{(i)}\bigr), \qquad \tilde c_t = \tanh\bigl(W^{(c)} x_t + b^{(c)}\bigr)$$
[Figure: the "input gate i" controls how much of the new value $\tilde c_t$ is added to the memory cell]
both delete and write together:
$$c_t = f * c_{t-1} + i * \tilde c_t$$
(first term: forget the old value; second term: write the new value)
LSTM: Read (output) memory value
$$o = \sigma\bigl(W^{(o)} x_t + b^{(o)}\bigr), \qquad h_t = o * \tanh(c_t)$$
[Figure: the "output gate o", computed from the input $x_t$, controls how much of the memory cell content $c_t$ appears in the output $h_t$]
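The three gates combine into a single update step. A minimal numpy sketch, using the simplified gates of these slides (which depend only on $x_t$; the parameter names are chosen here for illustration):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, c_prev, params):
    """One LSTM update: forget, write, read."""
    Wf, bf, Wi, bi, Wc, bc, Wo, bo = params
    f = sigmoid(Wf @ x_t + bf)          # forget gate
    i = sigmoid(Wi @ x_t + bi)          # input gate
    c_tilde = np.tanh(Wc @ x_t + bc)    # candidate new memory value
    c_t = f * c_prev + i * c_tilde      # delete and write together
    o = sigmoid(Wo @ x_t + bo)          # output gate
    h_t = o * np.tanh(c_t)              # read the memory
    return c_t, h_t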
LSTM: exploit previous memory output ‘h’ (in the full LSTM, all gates receive $h_{t-1}$ in addition to $x_t$)
rnn.add(LSTM(10, return_sequences=True))

[Figure: a network with a 3-dimensional input, an LSTM layer of 5 units, and an LSTM layer of 2 units]

# imports needed to make this runnable:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM

def init_memory_net():
    global rnn, batchsize, timesteps
    rnn = Sequential()
    # note: batch_input_shape is (batchsize, timesteps, data_dim)
    rnn.add(LSTM(5, batch_input_shape=(None, timesteps, 3),
                 return_sequences=True))
    rnn.add(LSTM(2, return_sequences=True))
    rnn.compile(loss='mean_squared_error',
                optimizer='adam', metrics=['accuracy'])
Example: A network for recall
(see code on website; a toy data-generation sketch follows below)
[Figure: example input sequences over time: a "signal" channel carrying a value (e.g. 0.4) and a "tell" channel marking the time at which the value should be recalled]
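The actual code is on the website. Purely for illustration, training data for such a recall task could be generated along the following lines (the three input channels and two output channels match the network above, but the channel assignment is our guess):

import numpy as np

def make_recall_batch(batchsize, timesteps):
    """Hypothetical recall task: input channel 0 carries a random value
    at t=0, channel 1 flags the 'tell' time, channel 2 is a constant bias
    (assumption). Target: output the stored value at the flagged time."""
    x = np.zeros((batchsize, timesteps, 3))
    y = np.zeros((batchsize, timesteps, 2))   # only output channel 0 is used here
    values = np.random.uniform(size=batchsize)
    tell = np.random.randint(1, timesteps, size=batchsize)
    x[:, 0, 0] = values                        # the signal to remember
    x[np.arange(batchsize), tell, 1] = 1.0     # the 'tell' flag
    x[:, :, 2] = 1.0                           # bias channel (assumption)
    y[np.arange(batchsize), tell, 0] = values  # recall the value on request
    return x, y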
Output of the recall network, evolving during training (for a fixed input sequence)
[Figure: input channels SIGNAL, TELL, and TELL (delay 5), together with the network's RECALL output]
input sequence: T H E _ T H E O R Y _ O F _ G E
(characters in one-hot encoding, one character per time step)
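For concreteness, a minimal sketch of this encoding (the alphabet is built here from the sample string itself, just for illustration):

import numpy as np

text = "THE THEORY OF GE"                 # sample from the slide
alphabet = sorted(set(text))              # character inventory
char_to_index = {c: i for i, c in enumerate(alphabet)}

# one-hot matrix of shape (timesteps, alphabet size): one character per row
onehot = np.zeros((len(text), len(alphabet)))
for t, c in enumerate(text):
    onehot[t, char_to_index[c]] = 1.0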
Sample text generated, character by character, by a recurrent network trained on Tolstoy's "War and Peace" (the misspellings are the network's own):
we counter. He stutn co des. His stanted out one ofler that concossions and was
to gearang reay Jotrets and with fre colt otf paitt thin wall. Which das stimn
Aftair fall unsuch that the hall for Prince Velzonski's that me of
her hearly, and behs to so arwage fiving were to it beloge, pavu say falling misfort
how, and Gogition is so overelical and ofter.
"Why do what that day," replied Natasha, and wishing to himself the fact the
princess, Princess Mary was easier, fed in had oftened him.
Pierre aking his soul came to the packs and drove up his father-in-law women.
Homework
input 3+5=??
output ....08
input 7-5=??
output ....02
[Figure: one-hot encoding of words: each word (warm, three, apple, desk, cold, tree, nice, bird, that, two, was, one, sun, hot, full) gets its own basis vector; dimension = number of words in the dictionary]
car-cars ~ tree-trees
(subtracting the word vectors on each side yields approximately identical vectors)
[Figure: the vectors for "car", "cars", "tree", "trees"; the singular-to-plural difference vector is nearly the same for both word pairs]
Word vectors encode meaning
Mikolov, Yih, Zweig 2013
[Figure: word-vector analogies: "man" is to "woman" as "king" is to "queen" and "uncle" is to "aunt"; see the sketch below]
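The analogy can be checked with simple vector arithmetic; a toy sketch with made-up 2D vectors (real word2vec vectors have hundreds of dimensions):

import numpy as np

# made-up toy vectors, not real word2vec values:
vec = {"man": np.array([1.0, 0.0]), "woman": np.array([1.0, 1.0]),
       "king": np.array([3.0, 0.1]), "queen": np.array([3.0, 1.1])}

# 'man is to king as woman is to ?' becomes vector arithmetic:
candidate = vec["king"] - vec["man"] + vec["woman"]
closest = min(vec, key=lambda w: np.linalg.norm(vec[w] - candidate))
print(closest)  # -> 'queen' for these toy numbers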
Word vectors encode meaning
Mikolov et al. 2013, "Distributed Representations of Words and Phrases and their Compositionality"
[Figure 2 of that paper: "Two-dimensional PCA projection of the 1000-dimensional Skip-gram vectors of countries and their capital cities. The figure illustrates ability of the model to automatically organize concepts and learn implicitly …" Countries (Poland, Germany, France, Italy, Greece, Spain, Portugal) and their capitals (Warsaw, Berlin, Paris, Rome, Athens, Madrid, Lisbon) form parallel pairs.]
Word vectors in keras
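In keras, word vectors are provided by a trainable Embedding layer, whose weight matrix holds one vector per word. A minimal sketch (vocabulary size, sequence length, and vector dimension are placeholder values):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

vocab_size, maxlen = 10000, 20
model = Sequential()
model.add(Embedding(vocab_size, 50, input_length=maxlen))  # learns 50-dim word vectors
model.add(LSTM(32))                                        # processes the vector sequence
model.add(Dense(vocab_size, activation='softmax'))         # e.g. next-word prediction
model.compile(loss='categorical_crossentropy', optimizer='adam')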
[Figure: supervised learning vs. reinforcement learning: a smart teacher with a student who imitates the teacher, versus a student/scientist who tries out things on their own]
[Figure: the "agent" receives an observation of the "state" of the "environment" (fully observed vs. partially observed) and acts back on it with an action]
Self-driving cars, robotics:
Observe immediate environment & move
Games:
Observe board & place stone
Observe video screen & move player
[Diagram: the RL-agent's "policy" maps the observed state to an action; the RL-environment returns the next observation]
state = position x,y
action = move (direction)
reward for picking up a box
Policy Gradient
= REINFORCE (Williams 1992): the simplest model-free, general reinforcement learning technique
[Diagram: RL-agent and RL-environment, connected by state/observation and action via the "policy"]
Policy: $\pi_\theta(a_t|s_t)$ – the probability to pick action $a_t$, given the observed state $s_t$ at time $t$
[Diagram: the policy is implemented by a neural network that maps the observation to action probabilities]
Policy Gradient
Probabilistic policy: probability to take action $a$, given the current state $s$:
$$\pi_\theta(a|s)$$
Example (for one fixed state $s$):
  action   $\pi$
  down     0.1
  up       0.6
  left     0.2
  right    0.1
$$\frac{\partial \ln \pi_\theta(a_t|s_t)}{\partial \theta}$$
$$\frac{\partial \bar R}{\partial \theta} = E\!\left[R \sum_t \frac{\partial \ln \pi_\theta(a_t|s_t)}{\partial \theta}\right]$$
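In numpy-like pseudocode, the expectation is replaced by an average over sampled trajectories (all names here are illustrative, not from the lecture code):

import numpy as np

def policy_gradient_estimate(trajectories):
    """trajectories: list of (R, grad_log_pi) pairs, where R is the total
    return and grad_log_pi[t] = d ln pi_theta(a_t|s_t) / d theta."""
    samples = [R * np.sum(grad_log_pi, axis=0) for R, grad_log_pi in trajectories]
    return np.mean(samples, axis=0)   # stochastic estimate of dR_bar/dtheta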
$$X_k = (R - b_k)\, G_k$$
$$\mathrm{Var}[X_k] = E[X_k^2] - E[X_k]^2 = \min$$
$$\frac{\partial\, \mathrm{Var}[X_k]}{\partial b_k} = 0$$
$$b_k = \frac{E[G_k^2 R]}{E[G_k^2]}$$
$$G_k = \frac{\partial \ln P_\theta(\tau)}{\partial \theta_k}$$
$$\delta\theta_k = \eta\, E[G_k (R - b_k)]$$
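One intermediate step makes the result for $b_k$ transparent: since $E[G_k] = \sum_\tau \partial P_\theta(\tau)/\partial \theta_k = \partial 1/\partial \theta_k = 0$, the mean $E[X_k]$ does not depend on $b_k$, so minimizing the variance amounts to minimizing $E[X_k^2]$:
$$\frac{\partial}{\partial b_k} E\!\left[G_k^2 (R - b_k)^2\right] = -2\, E\!\left[G_k^2 (R - b_k)\right] = 0 \;\;\Rightarrow\;\; b_k = \frac{E[G_k^2 R]}{E[G_k^2]}$$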
For more in-depth treatment, see David Silver’s course on
reinforcement learning (University College London):
http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html
The simplest RL example ever
[Figure: reward vs. time]
The simplest RL example ever
(… extremely rare)
$$\pi_\theta(\text{up}), \qquad \pi_\theta(\text{up})\,\bigl(1 - \pi_\theta(\text{up})\bigr)$$
The simplest RL example ever
[Figure: probability $\pi_\theta(\text{up})$ vs. training time for 3 learning attempts: strong fluctuations!]
Relative spread of the gradient estimate $X'$, when averaging over $N$ samples:
$$\frac{\sqrt{\mathrm{Var}(X')}}{\langle X \rangle} \sim \frac{1}{\sqrt{N}}$$
For a sum $\Sigma X$ over a batch of $M$ trajectories:
$$\langle \Sigma X \rangle = M\, \langle X \rangle, \qquad \sqrt{\mathrm{Var}\, \Sigma X} = \sqrt{M}\, \sqrt{\mathrm{Var}\, X}$$
$$\text{relative spread } \frac{\sqrt{\mathrm{Var}\, \Sigma X}}{\langle \Sigma X \rangle} \sim \frac{1}{\sqrt{M}}$$
Homework
[Figure: homework environment with a “target site”]
policy $\pi_\theta(a|s)$, implemented by a neural network (input = state)
execute the action, record the new state
do one trajectory (in reality: a batch of trajectories)
set the desired outputs: nonzero only for the action actually taken, i.e. $P(a) = 0$ for all other actions $a$
update $\delta\theta = -\eta\, \frac{\partial C}{\partial \theta}$: this implements policy gradient
RL in keras: categorical cross-entropy trick
Encountered N states (during repeated runs).
After setting categorical cross-entropy as the cost function, just use the following simple line to implement policy gradient:

net.train_on_batch(observed_inputs, desired_outputs)

observed_inputs: array N x state-size; desired_outputs: array N x number-of-actions.
Here desired_outputs[j,a]=R for the state numbered j, if action a was taken during a run that gave overall return R.
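A self-contained sketch of how the pieces fit together (the network layout, sizes, and placeholder data are our assumptions, not the lecture's actual code):

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

num_actions, state_size = 4, 2
net = Sequential()
net.add(Dense(32, input_shape=(state_size,), activation='relu'))
net.add(Dense(num_actions, activation='softmax'))   # outputs pi_theta(a|s)
net.compile(loss='categorical_crossentropy', optimizer='adam')

# after a batch of runs (placeholder data standing in for recorded runs):
N = 10
observed_inputs = np.random.rand(N, state_size)      # the N encountered states
actions_taken = np.random.randint(num_actions, size=N)
returns = np.ones(N)                                 # overall return R of each run
desired_outputs = np.zeros((N, num_actions))
desired_outputs[np.arange(N), actions_taken] = returns   # desired_outputs[j,a] = R
net.train_on_batch(observed_inputs, desired_outputs)

This implements the policy gradient because the categorical cross-entropy is $C = -\sum_a \text{desired}[j,a] \ln \pi_\theta(a|s_j) = -R \ln \pi_\theta(a_j|s_j)$, so one gradient-descent step changes $\theta$ by $\eta\, R\, \partial \ln \pi_\theta(a_j|s_j)/\partial\theta$.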
AlphaGo
[Figure: t-SNE visualization of the last hidden layer]
Advanced Exercise
In thermal equilibrium, the probability for state $s$ is given by the Boltzmann distribution, with normalization $Z$ (the "partition function"):
$$P(s) = \frac{1}{Z}\, e^{-E(s)/k_B T}, \qquad Z = \sum_{s'} e^{-E(s')/k_B T}$$
Problem: for a many-body system, there are exponentially many states (for example $2^N$ spin states). Cannot go through all of them!
Monte Carlo approach
[Figure: a Markov chain of states $s_0 \to s_1 \to s_2 \to s_3 \to \dots$, with transition probabilities $P(s' \leftarrow s)$ and $P(s \leftarrow s')$; a sketch follows below]
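A generic Metropolis sketch for such a chain (the energy function E is left abstract; spin values ±1 are an assumption):

import numpy as np

def metropolis_chain(E, s, nsteps, kBT=1.0):
    """Generate a Markov chain of spin configurations with P(s) ~ exp(-E(s)/kBT)."""
    for _ in range(nsteps):
        s_new = s.copy()
        s_new[np.random.randint(len(s))] *= -1        # propose: flip one spin
        # accept with probability min(1, exp(-(E(s_new) - E(s)) / kBT))
        if np.random.rand() < np.exp(-(E(s_new) - E(s)) / kBT):
            s = s_new
        yield s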
“hidden” units h
“visible” units v
[Figure: the restricted Boltzmann machine: every hidden unit is connected to every visible unit, with no connections within a layer]
[Figure: the sampling chain $v \to h \to v' \to h' \to v''$; $v'$ is the "reconstruction"]
$$Z\, P(v) = \sum_h e^{-E(v,h)} = e^{\sum_i a_i v_i}\; \prod_j \bigl(1 + e^{z_j}\bigr)$$
$$\text{with: } z_j = b_j + \sum_i v_i\, w_{ij}$$
where we used:
$$e^{\sum_j X_j} = \prod_j e^{X_j}, \qquad \sum_h \ldots = \sum_{h_0=0,1}\; \sum_{h_1=0,1}\; \sum_{h_2=0,1} \cdots$$
Therefore:
$$P(h|v) = \frac{e^{-E(v,h)}}{Z\, P(v)} = \prod_j \frac{e^{z_j h_j}}{1 + e^{z_j}}$$
Product of probabilities! All the $h_j$ are independently distributed, with probabilities:
$$P(h_j = 1|v) = \frac{e^{z_j}}{1 + e^{z_j}} = \sigma(z_j) \qquad \text{(sigmoid)}$$
$$P(h_j = 0|v) = 1 - \sigma(z_j)$$
Building a Markov chain
Given some visible-units state vector $v$, calculate the probabilities
$$P(h_j = 1|v) = \frac{e^{z_j}}{1 + e^{z_j}} = \sigma(z_j) \qquad \text{(sigmoid)}$$
Then assign 1 or 0, according to these probabilities, to obtain the new hidden state vector $h$.
[Figure: the resulting chain $v \to v' \to v''$, starting from a training sample $v$; a numpy sketch follows below]
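One step of this chain in numpy (a sketch; a, b, w denote the visible biases, hidden biases, and weights, and the visible-given-hidden step uses the symmetric formula):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_h_given_v(v, b, w):
    z = b + v @ w                       # z_j = b_j + sum_i v_i w_ij
    return (np.random.rand(len(z)) < sigmoid(z)).astype(float)

def sample_v_given_h(h, a, w):
    z = a + w @ h                       # symmetric: z'_i = a_i + sum_j w_ij h_j
    return (np.random.rand(len(z)) < sigmoid(z)).astype(float)

# chain: v -> h -> v' -> h' -> v''  (v: training sample)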
[Figure: an RBM ("visible" units v, "hidden" units h) with an output layer (e.g. softmax) stacked on top]
Deep belief networks
Stack RBMs: first train a simple RBM, then use its hidden units as input to another RBM, and so on.
[Figure, bottom to top: visible units v, hidden units h1 (first RBM), hidden units h2 (second RBM), hidden units h3 (third RBM)]
“hidden” units h
“visible” units v = spins of the quantum model
Try to solve a quantum many-body problem (quantum spin model) using the following variational ansatz for the wave-function amplitudes:
$$\Psi(S) = \sum_h e^{\sum_j a_j \sigma_j^z + \sum_i b_i h_i + \sum_{ij} h_i W_{ij} \sigma_j^z}$$
Carleo & Troyer, Science 2017
$S = (\sigma_1^z, \sigma_2^z, \ldots, \sigma_N^z)$: one basis state in the many-body Hilbert space
$\sigma_j^z = \pm 1$, $h_i = \pm 1$ (in general, $a$, $b$, $W$ may be complex)
This is exactly (proportional to) the RBM representation for $P(v)$ [with $v = S$]!
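Just as for $P(v)$, the sum over the hidden variables can be carried out analytically; with $h_i = \pm 1$ (instead of 0/1), the factors $1 + e^z$ become $2\cosh z$:
$$\Psi(S) = e^{\sum_j a_j \sigma_j^z}\; \prod_i 2\cosh\!\Big(b_i + \sum_j W_{ij}\, \sigma_j^z\Big)$$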
Application to Quantum Physics
[Figure: RBM with "visible" units v = spins of the quantum model, "hidden" units h]
Minimize the energy
$$\frac{\langle \Psi | \hat H | \Psi \rangle}{\langle \Psi | \Psi \rangle}$$
by adapting the weights $W$ and the biases $a$ and $b$!
[requires an additional Monte Carlo simulation, to obtain a stochastic sampling of the gradient with respect to these parameters]
For example: sample the probabilities by using the Metropolis algorithm, with transition probabilities
$$P(S \to S') = \min\!\left(1,\; \left|\frac{\Psi(S')}{\Psi(S)}\right|^2\right)$$
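For completeness (standard variational Monte Carlo, not spelled out on the slide): the energy is estimated by averaging the "local energy" over samples $S$ drawn from $|\Psi(S)|^2$,
$$\frac{\langle \Psi|\hat H|\Psi\rangle}{\langle \Psi|\Psi\rangle} = \bigl\langle E_{\rm loc}(S) \bigr\rangle_{|\Psi|^2}, \qquad E_{\rm loc}(S) = \sum_{S'} H_{SS'}\, \frac{\Psi(S')}{\Psi(S)}$$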
Application to Quantum Physics
Carleo & Troyer, Science 2017
[Figure: error vs. number of 'filters']
http://machine-learning-for-physicists.org