Computational Statistical Physics Exercise Sheet 5
Task 1: Read carefully through chapter 1.9 of the lecture notes and familiarize yourself with the
concepts of a neuron, the Hopfield Network and the Boltzmann Machine.
A Restricted Boltzmann Machine (RBM) is a neural network consisting of two layers of neurons in which every neuron of one layer is connected to every neuron of the other layer (inter-layer connections between every pair of neurons). Neurons within the same layer are not connected (no intra-layer connections). A schematic is presented in Figure 1.
Figure 1: Schematic of an RBM with visible nodes $v_1, \dots, v_{N_v}$, hidden nodes $h_1, \dots, h_{N_h}$, and weights $w_{ij}$ connecting the two layers.
One of the two layers is called the visible layer, while the other one is called the hidden layer. Interaction with the machine (input and output) can only occur via the visible layer. The hidden layer is not directly accessible. Moreover, the neurons are binary, i.e., they can only take one of two possible values, either 0 or 1.
Let us call the number of visible nodes $N_v$ and the number of hidden nodes $N_h$. Furthermore, denote the current value of the $j$-th node in the visible layer by $v_j$ and that of the $i$-th node in the hidden layer by $h_i$. With these definitions we can take a closer look at the dynamics of the system.
Given $v = (v_1, \dots, v_{N_v})$, the value of the $i$-th node in the hidden layer is set to 1 with probability
$p(h_i = 1 \mid v) = \sigma\Big(\sum_{j=1}^{N_v} w_{ij} v_j + b_i\Big)$
and otherwise it is set to 0. The coefficients $w_{ij}$ are called weights and the coefficients $b_i$ are called biases (of the hidden layer). $\sigma(x)$ is the sigmoid function
$\sigma(x) = \frac{1}{1 + e^{-x}},$
which maps any real number to the interval $(0, 1)$. Similarly, given the values $h = (h_1, \dots, h_{N_h})$ of the hidden layer, the value of the $j$-th visible node is determined by
$p(v_j = 1 \mid h) = \sigma\Big(\sum_{i=1}^{N_h} w_{ji} h_i + a_j\Big),$
where $a_j$ are the biases of the visible layer. Note that the weights are symmetric, i.e., $w_{ij} = w_{ji}$.
Due to these update rules the RBM is classified as a stochastic model.
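To make the stochastic update rules concrete, here is a minimal NumPy sketch of one layer update in each direction. The array names and shapes are our own illustrative choices ($W$ of shape $(N_h, N_v)$); they are not taken from the provided project files.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_hidden(v, W, b):
    # p(h_i = 1 | v) = sigmoid(sum_j w_ij v_j + b_i); then draw binary values
    p = sigmoid(W @ v + b)
    return (rng.random(p.shape) < p).astype(float)

def sample_visible(h, W, a):
    # p(v_j = 1 | h) uses the transposed (symmetric) weights: w_ji = w_ij
    p = sigmoid(W.T @ h + a)
    return (rng.random(p.shape) < p).astype(float)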
Task 2: State and explain the differences between a Hopfield Network, a Boltzmann Machine
and a Restricted Boltzmann Machine.
In the following we are going to use the RBM to generate 2D Ising configurations with $L = 32$ at a certain temperature $T$. We therefore choose the number of visible nodes to be $N_v = 32 \times 32$. Before samples can be drawn, the machine has to be trained. By training we mean updating the weights and biases according to our training data. This is done via contrastive divergence. The update rule for the weights is given by
$w_{ij} \to w_{ij} + \epsilon \big( \langle v_j h_i \rangle_\text{data} - \langle v_j h_i \rangle^{(k)}_\text{model} \big),$
where $\epsilon$ is a so-called learning rate. The expectation values are understood to be averages over the whole set of training data. The quantity $\langle v_j h_i \rangle_\text{data}$ is calculated by taking a vector $v$ from the training data and computing the corresponding vector $h$ as described above. For the quantity $\langle v_j h_i \rangle^{(k)}_\text{model}$ one takes a vector from the training data, computes the corresponding vector $h$, computes the new $v$, and then performs $k$ more back-and-forth operations. More information about contrastive divergence can be found at https://arxiv.org/pdf/1803.08823.pdf (p. 90 ff.).
For completeness, the update rules for the biases are given by
$a_j \to a_j + \epsilon \big( \langle v_j \rangle_\text{data} - \langle v_j \rangle^{(k)}_\text{model} \big),$
$b_i \to b_i + \epsilon \big( \langle h_i \rangle_\text{data} - \langle h_i \rangle^{(k)}_\text{model} \big).$
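As a rough sketch of how the three update rules fit together for a single training vector (reusing sample_hidden and sample_visible from the sketch above; in practice one averages the gradients over a mini-batch before updating the parameters):

def cd_k_update(v_data, W, a, b, k=1, eps=0.01):
    # Data side: hidden layer computed directly from the training vector.
    h_data = sample_hidden(v_data, W, b)
    # Model side: k back-and-forth (Gibbs) steps starting from the data.
    v, h = v_data, h_data
    for _ in range(k):
        v = sample_visible(h, W, a)
        h = sample_hidden(v, W, b)
    # Contrastive-divergence updates for weights and both bias vectors.
    W += eps * (np.outer(h_data, v_data) - np.outer(h, v))
    a += eps * (v_data - v)
    b += eps * (h_data - h)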
For the following tasks we provide you with a Python project which is missing some functionality that you have to implement. The main program is "main.py" and is structured as follows: First, the training data is extracted and brought into the right shape (implementation found in "ising_main.py"). Then, the RBM is set up and trained for one fixed temperature $T$ (implementation found in "my_RBM_tf2.py"). Finally, new Ising configurations are generated and stored in an external file.
Task 3: Implement the function "contr_divergence" in the class "RBM" in the file "my_RBM_tf2.py" as described above.
Hint: The following TensorFlow functions might be useful:
• tensorflow.sigmoid
• tensorflow.add
• tensorflow.tensordot
• tensorflow.transpose
• tensorflow.reshape
Check out the TensorFlow documentation (https://www.tensorflow.org/api_docs/python) for more information.
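For orientation, a single visible-to-hidden step could be written with these functions roughly as follows; the variable names are illustrative and need not match those used in "my_RBM_tf2.py".

import tensorflow as tf

def hidden_probabilities(v, weights, hidden_biases):
    # activation_i = sum_j w_ij v_j + b_i, then squash with the sigmoid
    activation = tf.add(tf.tensordot(weights, v, axes=1), hidden_biases)
    return tf.sigmoid(activation)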
Task 4: Use the training data stored in "ising_data.hdf5" to find the optimal weights and biases for your RBM. (Disclaimer: Training the machine may take quite a while depending on your computer.)
Hint: In Python, one possibility to access the Ising training data is via the h5py package.
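For example, assuming the configurations are stored as named datasets inside the file (the actual dataset names are not specified here, so inspect them first):

import h5py
import numpy as np

with h5py.File("ising_data.hdf5", "r") as f:
    print(list(f.keys()))        # inspect which datasets the file contains
    name = list(f.keys())[0]     # pick one dataset (name assumed, not given)
    data = np.array(f[name])     # load it as a numpy array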
Once the machine is trained it can be used to generate new samples. This can be done in the
following way:
1. Set the nodes in the visible layer to random values (either 0 or 1).
2. Let the machine evolve, i.e., go back and forth several times between the visible and hidden layer.
3. Read out the nodes in the visible layer. This is the desired sample.
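In code, these three steps might look like the following minimal sketch, again with illustrative names and reusing sample_hidden and sample_visible from the earlier sketch:

def generate_sample(W, a, b, n_steps=100):
    v = (rng.random(a.shape) < 0.5).astype(float)   # 1. random visible layer
    for _ in range(n_steps):                        # 2. evolve back and forth
        h = sample_hidden(v, W, b)
        v = sample_visible(h, W, a)
    return v                                        # 3. read out the sample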
Task 5: Use the RBM to obtain new Ising configurations and store these samples in a separate
file.
Task 6: Repeat Task 4 and Task 5 for at least two more temperatures.
Figure 2: Schematic of a feed-forward network with two hidden layers. Visible layers (green): input nodes $x_1, \dots, x_{N_v}$ and output nodes $y_1, \dots, y_4$. Hidden layers (blue): $h^{(1)}_1, \dots, h^{(1)}_{N_{h_1}}$ and $h^{(2)}_1, \dots, h^{(2)}_{N_{h_2}}$.

The second network we consider is the feed-forward network shown in Figure 2. The two outer layers (used for input and output) are visible while the two inner layers are hidden. In contrast to the RBM there is no back-and-forth flow of information between the
visible and hidden layers. Instead, information flows from the input to the output layer. That
is why this network is called a feed-forward network. Furthermore, we assume that the nodes
are not binary anymore but can take continuous values between 0 and 1 and that the dynamics
of the system is given by
$h^{(1)}_k = \sigma\Big(\sum_l w^{(1)}_{kl} x_l + b^{(1)}_k\Big),$
$h^{(2)}_j = \sigma\Big(\sum_k w^{(2)}_{jk} h^{(1)}_k + b^{(2)}_j\Big),$
$y_i = \sigma\Big(\sum_j w^{(3)}_{ij} h^{(2)}_j + b^{(3)}_i\Big),$
where $\sigma(x)$ is again the sigmoid function. Because the values of the nodes are uniquely determined (not set with a certain probability as in the RBM), the network is called deterministic. Since the goal is to map an Ising configuration to its corresponding temperature, the input layer is chosen to have $32 \times 32$ nodes while the output layer consists of 4 nodes (one for every possible temperature we would like to detect). Thus, the values of the nodes in the output layer can be interpreted as probabilities that the system has a certain temperature.
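A minimal NumPy sketch of this deterministic forward pass, with weight shapes chosen to match the equations above (all names are illustrative, and the sigmoid helper from the first sketch is reused):

def forward(x, W1, b1, W2, b2, W3, b3):
    # W1: (N_h1, N_v), W2: (N_h2, N_h1), W3: (4, N_h2)
    h1 = sigmoid(W1 @ x + b1)
    h2 = sigmoid(W2 @ h1 + b2)
    y = sigmoid(W3 @ h2 + b3)
    return y                     # 4 values, one per candidate temperature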
Training the machine again means that the weights and biases have to be adjusted. This is done in such a way that the so-called cost function (also loss function) is minimized. Given some input $i^{(d)}$ with expected output $o^{(d)}$, the (mean-squared) cost of this single training example is defined as
$C^{(d)} = \sum_{i=1}^{4} \big( y^{(d)}_i - o^{(d)}_i \big)^2.$
With this, the total cost function $C$ is defined as the average of all costs over the whole training data set:³
$C\big(w^{(1)}, b^{(1)}, w^{(2)}, b^{(2)}, w^{(3)}, b^{(3)}\big) = \frac{1}{N_\text{data}} \sum_{d=1}^{N_\text{data}} C^{(d)}.$
The most straightforward way to do the updates of the weights and biases is by using a steepest
descent method. However, such a method is usually slow because one has to average over all data
of the training set in every step. Therefore, similar to Exercise 1, we randomly divide the set
of training data into mini-batches and compute the gradient only for one of these mini-batches
in one step. This procedure is known as stochastic gradient descent. The update rule can be
stated in the following form:
$\begin{pmatrix} w^{(1)} \\ b^{(1)} \\ w^{(2)} \\ b^{(2)} \\ w^{(3)} \\ b^{(3)} \end{pmatrix} \to \begin{pmatrix} w^{(1)} \\ b^{(1)} \\ w^{(2)} \\ b^{(2)} \\ w^{(3)} \\ b^{(3)} \end{pmatrix} - \epsilon \begin{pmatrix} \partial_{w^{(1)}} \\ \partial_{b^{(1)}} \\ \partial_{w^{(2)}} \\ \partial_{b^{(2)}} \\ \partial_{w^{(3)}} \\ \partial_{b^{(3)}} \end{pmatrix} C,$
where $\epsilon$ is again a learning rate.
The gradient $\nabla C$ of the cost function can be computed via backpropagation, which is nothing but a repeated application of the chain rule.
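In the provided TensorFlow project one does not have to implement backpropagation by hand; a gradient tape can compute the required derivatives. A minimal sketch of one stochastic-gradient-descent step, assuming model is a Keras model implementing the network above:

import tensorflow as tf

def sgd_step(model, x_batch, o_batch, optimizer):
    with tf.GradientTape() as tape:
        y = model(x_batch, training=True)
        # mean-squared cost, averaged over the mini-batch
        cost = tf.reduce_mean(tf.reduce_sum((y - o_batch) ** 2, axis=1))
    grads = tape.gradient(cost, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return cost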
Task 1: State and derive the analytical expressions for $\partial C / \partial w^{(3)}_{i,j}$ and $\partial C / \partial w^{(2)}_{i,j}$.
Task 2: Build up the network and train your machine with the training data provided in "ising_data.hdf5".
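One way to build such a network is with Keras; the hidden-layer sizes below are arbitrary placeholder choices, not prescribed by the exercise.

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(32 * 32,)),
    tf.keras.layers.Dense(64, activation="sigmoid"),  # first hidden layer (size assumed)
    tf.keras.layers.Dense(64, activation="sigmoid"),  # second hidden layer (size assumed)
    tf.keras.layers.Dense(4, activation="sigmoid"),   # one output node per temperature
])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.1), loss="mse")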
Task 3: Use the samples you generated in Exercise 1 and determine the corresponding temperatures using the network from this exercise.
Task 4 (optional): In the end, machine learning is about trial and error, i.e., finding the best model to describe and successfully predict the kind of data you consider. Therefore, modify your feed-forward network and see which modifications yield the best results. There are several things you can change, to mention only a few (a sketch of one possible variation follows the list):
• Activation function/non-linearity (instead of the sigmoid function one can use the ReLU,
Softmax, etc.)
• Cost function (instead of the mean-squared cost function one can use the categorical cross-
entropy, etc.)
• ...
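For instance, one hypothetical variation combining two of these suggestions (ReLU hidden layers together with a softmax output and categorical cross-entropy):

import tensorflow as tf

model_alt = tf.keras.Sequential([
    tf.keras.Input(shape=(32 * 32,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(4, activation="softmax"),   # outputs sum to 1
])
model_alt.compile(optimizer="adam", loss="categorical_crossentropy")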
³ Note that in our case $N_\text{data} = 5000$.