Boltzmann Machine
Structure :
The Boltzmann Machine consists of visible (input) nodes & hidden nodes.
As in the picture above, the visible nodes are blue colored & the hidden nodes are red colored.
Every node is connected to every other node via synapses/linkages, and the network does not
discriminate between hidden nodes & visible nodes; it treats all nodes the same. Every linkage
carries a weight. There is no output layer because we are not predicting any output value. The
RBM (Restricted Boltzmann Machine) is bipartite, which means there are no intra-layer
connections: visible nodes connect only to hidden nodes & vice versa.
• As in the picture above, System A has some gas molecules packed into a high-density region in
one corner, while System B has the same number of molecules distributed uniformly throughout
the system. System B has the uniform density, and as per the Boltzmann distribution, System B
is more stable & has a higher probability of occurring.
• The Energy function E(v, h) is defined as

E(v, h) = − Σi ai·vi − Σj bj·hj − Σi,j vi·wi,j·hj

where E(v, h) is the energy of the state, vi = input of the state, hj = hidden state, ai, bj = the biases
of vi & hj respectively, wi,j = the element of the weight matrix associated with vi & hj.
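As a concrete illustration, here is a minimal Python sketch of this energy function. The state vectors, biases and weights below are made-up values, not from any trained model:

```python
# A minimal sketch of the RBM energy function defined above, assuming
# binary state vectors v, h and illustrative (invented) parameters.
import numpy as np

def energy(v, h, a, b, W):
    """E(v, h) = -sum_i a_i v_i - sum_j b_j h_j - sum_{i,j} v_i w_ij h_j."""
    return -(a @ v) - (b @ h) - (v @ W @ h)

v = np.array([1, 0, 1])          # visible (input) states
h = np.array([0, 1])             # hidden states
a = np.array([0.2, -0.1, 0.3])   # visible biases (illustrative values)
b = np.array([0.1, -0.2])        # hidden biases
W = np.random.default_rng(0).normal(0, 0.1, (3, 2))  # weight matrix

print(energy(v, h, a, b, W))
```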
• Now we will be assuming the energy of the EBM (Energy-Based Model) is defined by the weights
of the BM. Once the system is trained, the Restricted Boltzmann Machine always tries to find
the lowest-energy state possible. The probability of a state is given by the Boltzmann distribution,

p(x) = e^(−E(x)) / Z,

where Z is the normalizing partition sum. If we put the energy-model equation for E(x) into p, we
see that p is exponentially, inversely related to E(x), which matches our Boltzmann Machine
concept: the lower the energy, the higher the probability, & the higher the energy, the lower the
probability.
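A toy numerical check of this relationship; the three energy values are invented purely for illustration:

```python
# Lower-energy states get exponentially higher (unnormalized) probability.
import numpy as np

energies = np.array([4.0, 2.0, 0.5])   # three candidate states (made up)
unnorm_p = np.exp(-energies)           # Boltzmann weights e^(-E)
p = unnorm_p / unnorm_p.sum()          # normalize by the partition sum Z

for E, prob in zip(energies, p):
    print(f"E = {E:4.1f}  ->  p = {prob:.3f}")  # lowest E has highest p
```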
What are Boltzmann machines used for?
• The main aim of a Boltzmann machine is to optimize the solution of a problem: it optimizes the
weights and quantities related to the specific problem assigned to it. This technique is employed
when the main aim is to create a mapping and to learn from the attributes and target variables in
the data. If you seek to identify an underlying structure or pattern within the data, unsupervised
learning methods for this model are regarded as more useful. Some of the most widely used
unsupervised learning methods are clustering, dimensionality reduction, anomaly detection and
creating generative models.
• All of these techniques have a different objective of detecting patterns, like identifying latent
groupings, finding irregularities in the data, or even generating new samples from the available
data. You can even stack these networks in layers to build deep neural networks that capture
highly complicated statistics. Restricted Boltzmann machines are widely used in the domain of
imaging and image processing as well, because they have the ability to model the continuous
data that are common in natural images. They are even used to solve complicated
quantum-mechanical many-particle problems or classical statistical-physics problems like the
Ising and Potts classes of models.
• Boltzmann machines are non-deterministic (stochastic) generative Deep Learning models that
have only two kinds of nodes - hidden and visible nodes. They don’t have any output nodes,
and that’s what gives them their non-deterministic character. They learn patterns without the
typical 1-or-0 type output through which patterns are usually learned and optimized using
Stochastic Gradient Descent.
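A small sketch of what a stochastic node means in practice: the same net input does not yield a fixed 0/1 output, but fires with probability sigmoid(activation). The activation value below is an arbitrary assumption:

```python
# A hidden unit fires stochastically: same input, varying 0/1 outputs.
import numpy as np

rng = np.random.default_rng(42)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

activation = 0.8                  # net input b_j + sum_i v_i w_ij (made up)
p_on = sigmoid(activation)        # probability the unit turns on
samples = rng.random(10) < p_on   # ten stochastic draws of the same unit
print(p_on, samples.astype(int))
```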
• A major difference is that, unlike other traditional networks (ANNs/CNNs/RNNs) which don’t
have any connections between the input nodes, Boltzmann Machines have connections among
the input nodes. Every node is connected to all other nodes, irrespective of whether they are
input or hidden nodes. This enables them to share information among themselves and
self-generate subsequent data. You’d only measure what’s on the visible nodes, not what’s on
the hidden nodes. After the input is provided, Boltzmann machines are able to capture all the
parameters, patterns and correlations in the data. It is because of this that they are known as
deep generative models, and they fall into the class of Unsupervised Deep Learning.
• Like the RBM example above, which is designed as a recommender system: here a dataset with
movies as features & viewers as rows goes through the training process. Genre A, Genre B,
Actor X, Award Y and Director Z are the preferences expressed by the viewers. We will build a
network that recommends to us which movie a new viewer with certain parameters will prefer to
see. To build this recommender system, we train the network: to reach good accuracy the
network must repeatedly adjust the weights of the synapses between the nodes, & this is
exactly where the RBM model helps us.
• In the first step the input data is fed into the network. The features of the dataset are The Matrix,
Fight Club, Forrest Gump, Pulp Fiction and The Departed — these five movies. Viewers rated the
movies, & we have another dataset with parameters like the genre of the movie, whether it is
Oscar-winning, & the name of the actor/director. When we combine the datasets & feed them
into the input as in the picture above, the network receives the values of the input nodes row by
row. For the first movie, ‘The Matrix’, the network checks whether the hidden nodes match ‘The
Matrix’ or not. No matching values are found, because ‘The Matrix’ is neither a drama nor an
action movie, nor is it an Oscar-winning movie, & it features neither DiCaprio nor Tarantino. The
second movie, ‘Fight Club’, does not have any data. The third one, ‘Forrest Gump’, is a drama;
‘Titanic’ is also a drama, so the Drama hidden node is learnt from ‘Forrest Gump’ & ‘Titanic’. In
the same way, DiCaprio matches ‘Titanic’, & Oscar matches ‘Forrest Gump’ & ‘Titanic’. The
matched hidden nodes are colored yellow & the unmatched hidden nodes are colored red. So
the network now knows which input nodes are activated for which hidden nodes. Then
backward propagation happens: the RBM reconstructs the inputs based on the hidden nodes.
During training, if the reconstruction is incorrect, the weights are adjusted; then reconstruction
happens again, and again the weights are adjusted if it is incorrect. This process continues until
we achieve maximum accuracy in the network (a sketch of this forward & backward pass is
shown below).
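Here is a minimal sketch of that pass, assuming a tiny 5-movies-by-3-hidden-features RBM with randomly initialized (untrained) parameters; the shapes and values are illustrative only, and the weight update itself appears in the Contrastive Divergence sketch further down:

```python
# Visible ratings activate hidden "feature" nodes, and the hidden nodes
# then reconstruct the visible layer.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_visible, n_hidden = 5, 3            # e.g. 5 movies, 3 latent features
W = rng.normal(0, 0.1, (n_visible, n_hidden))
a = np.zeros(n_visible)               # visible biases
b = np.zeros(n_hidden)                # hidden biases

v0 = np.array([1.0, 0.0, 1.0, 1.0, 0.0])         # one viewer's ratings

p_h = sigmoid(v0 @ W + b)                        # which hidden nodes "match"
h0 = (rng.random(n_hidden) < p_h).astype(float)  # stochastic activation

p_v = sigmoid(h0 @ W.T + a)                      # reconstruction of the input
print("hidden probs:   ", p_h.round(2))
print("reconstruction: ", p_v.round(2))
```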
• During this reconstruction the vacant input nodes are filled with data by the network, which
gives us the recommendation of whether the new user will love to watch the movie or not. For
example, the second movie, ‘Fight Club’, will not be watched by a new viewer even if it is an
action movie, because this movie has none of the parameters that viewers liked in the other
movies. So the value ‘0’ is filled in at the input node for ‘Fight Club’. The last movie, ‘The
Departed’, learns from the hidden nodes & matches Drama, DiCaprio & Oscar. So a new viewer
will love to watch this movie, because it has parameters that viewers liked in the other movies
as well. So the value ‘1’ is filled in at the input node for ‘The Departed’ (a sketch of this fill-in
step follows).
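A sketch of this fill-in step, assuming W, a, b were already learned during training; random placeholders stand in for them here, so the printed recommendations are illustrative only:

```python
# A trained RBM's reconstruction fills the vacant input nodes for a new
# viewer: unrated movies get predicted probabilities, and a threshold
# turns them into recommend / skip.
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

movies = ["The Matrix", "Fight Club", "Forrest Gump", "Pulp Fiction", "The Departed"]
W = rng.normal(0, 0.1, (5, 3))        # placeholder for trained weights
a, b = np.zeros(5), np.zeros(3)

new_viewer = np.array([0.0, 0.0, 1.0, 0.0, 0.0])  # has only rated Forrest Gump

p_h = sigmoid(new_viewer @ W + b)      # infer latent preferences
p_v = sigmoid(p_h @ W.T + a)           # reconstruct all visible nodes

for movie, p in zip(movies, p_v):
    verdict = "1 (recommend)" if p > 0.5 else "0 (skip)"
    print(f"{movie:14s} -> {verdict}  (p={p:.2f})")
```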
Contrastive Divergence :
• This is the algorithm that allows the RBM to learn & update its weights. Plain gradient descent
will not work here directly, because the exact gradient of the log-likelihood involves an
expectation over the model distribution that we cannot compute. The first step is called Gibbs
Sampling: we have an input vector v0 & the probability p(h0|v0), where h0 = hidden values. Then
we use p(v|h0) to find v1. If the number of iterations = k, our reconstructed input vector = vk, and
the weight update is

Δwi,j ∝ <vi⁰hj⁰> − <viᵏhjᵏ>.
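A minimal sketch of one CD-k update for a single binary sample; the cd_k helper, the shapes and the learning rate are all illustrative assumptions, not from the text:

```python
# Run k Gibbs steps from v0, then move each weight toward <v_i h_j>^0
# and away from <v_i h_j>^k.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k(v0, W, a, b, k=1, lr=0.1):
    """One contrastive-divergence update for a single binary sample v0."""
    ph0 = sigmoid(v0 @ W + b)                    # p(h0 | v0): positive phase
    vk = v0.copy()
    for _ in range(k):                           # Gibbs sampling chain
        h = (rng.random(b.shape) < sigmoid(vk @ W + b)).astype(float)
        vk = sigmoid(h @ W.T + a)                # p(v | h) gives v1, ..., vk
    phk = sigmoid(vk @ W + b)                    # p(hk | vk): negative phase
    W += lr * (np.outer(v0, ph0) - np.outer(vk, phk))
    a += lr * (v0 - vk)
    b += lr * (ph0 - phk)
    return W, a, b

v0 = np.array([1.0, 0.0, 1.0, 1.0, 0.0])
W = rng.normal(0, 0.1, (5, 2))
a, b = np.zeros(5), np.zeros(2)
W, a, b = cd_k(v0, W, a, b, k=2)
print(W.round(3))
```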
• To understand this, let’s take a simple example of 5 input nodes & 5 hidden nodes. In the first
step each hidden node is computed from all the input nodes; this way all the hidden nodes are
created. Then all the hidden nodes reconstruct the input nodes one by one. After this update the
value of an input node is no longer the same as the previous input node, even though the
weights have not changed, because each input node is now reconstructed from all the hidden
nodes. In this way all the input nodes change. But remember that each of our hidden nodes is
also based on the input nodes, so the hidden nodes will change as well. This process continues
until the energy of the state is minimized, just as we have learnt from the ‘Energy-Based Model’.
The energy of the EBM is defined by the weights of the RBM. If we draw a curve of the energy of
the state versus the epoch, we get a monotonically decreasing curve:
[Figure: energy of the state versus training epoch, falling toward a minimum]
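A self-contained sketch that should reproduce the shape of this curve: repeated contrastive-divergence updates on one sample tend to drive the energy of the data state downward. All shapes, values and the learning rate are illustrative assumptions:

```python
# Track E(v0, h) across epochs while applying CD-1 updates.
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

v0 = np.array([1.0, 0.0, 1.0, 1.0, 0.0])
W = rng.normal(0, 0.1, (5, 3))
a, b = np.zeros(5), np.zeros(3)

for epoch in range(5):
    ph0 = sigmoid(v0 @ W + b)                    # positive phase
    h0 = (rng.random(3) < ph0).astype(float)
    v1 = sigmoid(h0 @ W.T + a)                   # reconstruction
    ph1 = sigmoid(v1 @ W + b)                    # negative phase
    W += 0.5 * (np.outer(v0, ph0) - np.outer(v1, ph1))   # CD-1 update
    a += 0.5 * (v0 - v1)
    b += 0.5 * (ph0 - ph1)
    ph = sigmoid(v0 @ W + b)
    E = -(a @ v0) - (b @ ph) - (v0 @ W @ ph)     # energy of the data state
    print(f"epoch {epoch}: E = {E:.3f}")
```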
• The further we go in contrastive divergence, the lower the energy of the state becomes & the
higher its probability, just as in the picture above: E@3rd state < E@2nd state < E@1st state.
The change of the log-probability with respect to the weights comes out as
<vi⁰hj⁰> − <vi∞hj∞>, where <vi⁰hj⁰> is the initial state of the system & <vi∞hj∞> is the final state
of the system. The process is called ‘Gibbs sampling’. We can pull the energy curve down by
Gibbs sampling, which technically means adjusting the weights while resampling our input
values to reach the minimum-energy state.