DL Mod 5
"connectionist" approach to learning arbitrary probability distributions over However, computing the partition function Z exactly is computationally
binary vectors. They were introduced in the 1980s by researchers like intractable. Therefore, the gradient of the likelihood must be approximated
Fahlman, Ackley, Hinton, and Sejnowski. Since then, variants of the original using techniques like contrastive divergence In the context of Boltzmann
Boltzmann machine that incorporate different types of variables have largely machines, learning is said to be local, meaning that the update rule for a
surpassed the original version in popularity. In this section, the focus is on weight connecting two units depends only on the statistics of those two units
explaining the binary Boltzmann machine, as well as discussing the issues under two different distributions:1.The distribution of the model
that arise during training and inference.Basic Definition of the Boltzmann Pmodel(v)P_ {\text{model} }(v)Pmodel(v), 2.The distribution of the data
Machine:The Boltzmann machine is defined over a d-dimensional binary P^data(v)\hat{P} _{\text{data}}(v)P^data(v). The rest of the network plays
random vector x∈{0,1}dx \in \{0, 1\}^dx∈{0,1}d. It is an energy-based a role in shaping the statistics, but the weight update does not require
model, meaning that the joint probability distribution over the variables is knowledge about the rest of the network or how those statistics were
defined in terms of an energy function: produced. This local learning rule is interesting from a biological
perspective, as it resembles Hebbian learning—the idea that "neurons that
fire together wire together" (Hebb, 1949). In this case, if two units
E(x) is the energy function,Z is the partition frequently activate together, their connection strength is increased, reflecting
function, which normalizes the distribution such that the sum of a biological learning mechanism.This local learning rule contrasts with
probabilities over all possible states equals 1: ∑xP(x)=1\sum_x P(x) = 1∑x other learning algorithms (like backpropagation) that require more complex
P(x)=1. machinery, such as maintaining secondary communication networks to
The energy function for the Boltzmann machine is typically defined as: transmit gradient information.
Negative Phase and Sampling:The negative phase of Boltzmann machine
learning is more complex and harder to explain from a biological
perspective. It involves sampling from the model’s distribution to compute
Where:U is the weight matrix containing the model parameters,
gradients. In contrast to the positive phase, which strengthens connections
b is the bias vector associated with each binary unit.
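To make the energy-based definition concrete, here is a minimal NumPy sketch that evaluates E(x) = −x^T U x − b^T x and recovers P(x) exactly by enumerating all 2^d states. The values of U, b, and the tiny dimension d are illustrative assumptions, not values from the text; the brute-force enumeration is only possible because d is tiny, which is precisely why Z is intractable for realistic models.

```python
import itertools
import numpy as np

def energy(x, U, b):
    """Boltzmann machine energy E(x) = -x^T U x - b^T x."""
    return -(x @ U @ x) - (b @ x)

# Toy parameters (illustrative only): d = 3 binary units.
rng = np.random.default_rng(0)
d = 3
U = rng.normal(scale=0.1, size=(d, d))
U = np.triu(U, k=1)          # keep only pairwise terms, no self-connections
U = U + U.T                  # symmetric weight matrix
b = rng.normal(scale=0.1, size=d)

# Enumerate all 2^d states to compute the partition function Z exactly.
states = np.array(list(itertools.product([0, 1], repeat=d)), dtype=float)
energies = np.array([energy(x, U, b) for x in states])
Z = np.sum(np.exp(-energies))        # 2^d terms: intractable for large d
probs = np.exp(-energies) / Z        # P(x) = exp(-E(x)) / Z

print(probs.sum())                   # ~1.0, confirming normalization
```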
Energy Function and Probability Distribution:
The energy function E(x) is used to define the probability distribution over the binary vector x. In simple terms, the Boltzmann machine uses this energy function to represent the relationships between the units in the model. The goal of training the machine is to adjust the parameters so that the model can effectively learn the underlying distribution of the data.

Boltzmann Machines with Latent Variables:
While the basic Boltzmann machine uses only observed variables, it becomes significantly more powerful when some of the variables are latent or hidden. In this case, the latent variables allow the model to capture higher-order interactions among the visible units. This turns the Boltzmann machine into a universal approximator of probability mass functions over discrete variables, meaning it can model complex, non-linear relationships between the observed and latent variables.

The Boltzmann machine can be formalized by splitting the units into two subsets: v (visible units) and h (latent or hidden units). The energy function for a Boltzmann machine with both visible and hidden units becomes:

E(v, h) = −v^T R v − v^T W h − h^T S h − v^T c − h^T b

where R, W, and S are the weight matrices for visible-visible, visible-hidden, and hidden-hidden interactions, respectively, and c and b are the bias vectors for the visible and hidden units.
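As a small illustration of the latent-variable energy above, the sketch below evaluates E(v, h) directly; the shapes and values of R, W, S, c, and b are toy assumptions for illustration. Note that R and S couple units within the same layer, which is exactly what the restricted variant discussed later removes.

```python
import numpy as np

def bm_energy(v, h, R, W, S, c, b):
    """General Boltzmann machine energy with visible units v and hidden units h:
    E(v, h) = -v^T R v - v^T W h - h^T S h - v^T c - h^T b
    """
    return -(v @ R @ v) - (v @ W @ h) - (h @ S @ h) - (v @ c) - (h @ b)

# Toy example (illustrative shapes only): 4 visible units, 3 hidden units.
rng = np.random.default_rng(1)
nv, nh = 4, 3
R = rng.normal(scale=0.1, size=(nv, nv))   # visible-visible weights
W = rng.normal(scale=0.1, size=(nv, nh))   # visible-hidden weights
S = rng.normal(scale=0.1, size=(nh, nh))   # hidden-hidden weights
c = rng.normal(scale=0.1, size=nv)         # visible biases
b = rng.normal(scale=0.1, size=nh)         # hidden biases

v = rng.integers(0, 2, size=nv).astype(float)
h = rng.integers(0, 2, size=nh).astype(float)
print(bm_energy(v, h, R, W, S, c, b))
```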
Boltzmann Machine Learning:
Computing the partition function Z exactly is computationally intractable, so the gradient of the likelihood must be approximated using techniques like contrastive divergence. In the context of Boltzmann machines, learning is said to be local, meaning that the update rule for a weight connecting two units depends only on the statistics of those two units under two different distributions:
1. The distribution of the model, P_model(v).
2. The distribution of the data, P̂_data(v).
The rest of the network plays a role in shaping those statistics, but the weight update does not require knowledge about the rest of the network or how the statistics were produced. This local learning rule is interesting from a biological perspective, as it resembles Hebbian learning: the idea that "neurons that fire together wire together" (Hebb, 1949). In this case, if two units frequently activate together, their connection strength is increased, reflecting a biological learning mechanism. The local learning rule contrasts with other learning algorithms (like backpropagation) that require more complex machinery, such as maintaining secondary communication networks to transmit gradient information.

Negative Phase and Sampling:
The negative phase of Boltzmann machine learning is more complex and harder to explain from a biological perspective. It involves sampling from the model's distribution to compute gradients. In contrast to the positive phase, which strengthens connections between frequently co-activated units, the negative phase updates the weights by sampling from the model's distribution and using this information to adjust the parameters in a way that reduces the discrepancy between the model and the data distribution. While the negative phase is computationally challenging, one possible explanation from a biological standpoint is that dream sleep or other forms of unconscious processing might serve as a form of negative-phase sampling. However, this idea remains speculative.
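A minimal sketch of this two-phase update for a fully visible binary Boltzmann machine is shown below. It is illustrative only: the helper names (gibbs_step, local_update), the Gibbs sweep count, and the learning rate are assumptions, not prescriptions from the text. The point is that the update for each weight U_ij depends only on the pairwise statistic ⟨x_i x_j⟩ under the data (positive phase) and under model samples (negative phase).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gibbs_step(x, U, b, rng):
    """One sweep of Gibbs sampling over all units of a fully visible binary
    Boltzmann machine with symmetric U and zero diagonal:
    P(x_i = 1 | x_rest) = sigmoid(2 * sum_{j != i} U_ij x_j + b_i)."""
    for i in range(len(x)):
        x_rest = x.copy()
        x_rest[i] = 0.0
        x[i] = float(rng.random() < sigmoid(2.0 * (U[i] @ x_rest) + b[i]))
    return x

def local_update(data, U, b, rng, lr=0.01, gibbs_sweeps=5):
    """Positive phase: pairwise statistics <x_i x_j> under the data.
    Negative phase: the same statistics under samples drawn from the model."""
    pos = data.T @ data / len(data)
    samples = rng.integers(0, 2, size=data.shape).astype(float)
    for _ in range(gibbs_sweeps):
        samples = np.array([gibbs_step(s, U, b, rng) for s in samples])
    neg = samples.T @ samples / len(samples)
    U += lr * (pos - neg)            # local rule: uses only statistics of unit pairs
    np.fill_diagonal(U, 0.0)         # keep no self-connections
    b += lr * (data.mean(axis=0) - samples.mean(axis=0))
    return U, b

# Toy usage (illustrative): 6 binary units, random binary "data".
rng = np.random.default_rng(0)
data = rng.integers(0, 2, size=(100, 6)).astype(float)
U, b = np.zeros((6, 6)), np.zeros(6)
for _ in range(30):
    U, b = local_update(data, U, b, rng)
```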
RESTRICTED BOLTZMANN MACHINES (RBMS), originally called harmonium by Smolensky (1986), are probabilistic graphical models that consist of two layers: a visible layer (v) representing observed variables and a hidden layer (h) representing latent or hidden variables. They are used as building blocks for deep generative models and are central to unsupervised learning tasks like dimensionality reduction, feature learning, and collaborative filtering.

An RBM is a bipartite graph, meaning there are two distinct sets of variables: the visible units and the hidden units. The key feature of this structure is that there are no connections between units within the same layer. This means that the visible units are not connected to each other, and the hidden units are not connected to each other. Instead, each visible unit is connected to every hidden unit, though sparse connections can be used in more advanced variants like convolutional RBMs.

RBMs are energy-based models where the joint probability distribution of the visible and hidden variables is specified by an energy function. The energy function determines how likely a configuration of visible and hidden units is. In the case of RBMs, the energy function is defined as:

E(v, h) = −∑_i c_i v_i − ∑_j b_j h_j − ∑_{i,j} v_i W_ij h_j

Here, v_i and h_j represent the visible and hidden units, respectively, c_i and b_j are bias terms, and W_ij are the weights connecting visible and hidden units.
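Because there are no intralayer connections, the units in one layer are conditionally independent given the other layer, so P(h_j = 1 | v) = sigmoid(b_j + ∑_i v_i W_ij) and P(v_i = 1 | h) = sigmoid(c_i + ∑_j W_ij h_j). The sketch below computes the RBM energy and these conditionals; the layer sizes and parameter values are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rbm_energy(v, h, W, c, b):
    """E(v, h) = -sum_i c_i v_i - sum_j b_j h_j - sum_{i,j} v_i W_ij h_j."""
    return -(c @ v) - (b @ h) - (v @ W @ h)

def p_h_given_v(v, W, b):
    """Hidden units are conditionally independent given v (no hidden-hidden links)."""
    return sigmoid(b + v @ W)

def p_v_given_h(h, W, c):
    """Visible units are conditionally independent given h (no visible-visible links)."""
    return sigmoid(c + W @ h)

# Toy example (illustrative shapes only): 4 visible units, 3 hidden units.
rng = np.random.default_rng(0)
nv, nh = 4, 3
W = rng.normal(scale=0.1, size=(nv, nh))
c = np.zeros(nv)          # visible biases
b = np.zeros(nh)          # hidden biases

v = rng.integers(0, 2, size=nv).astype(float)
h = rng.integers(0, 2, size=nh).astype(float)
print(rbm_energy(v, h, W, c, b))
print(p_h_given_v(v, W, b))   # vector of P(h_j = 1 | v)
print(p_v_given_h(h, W, c))   # vector of P(v_i = 1 | h)
```

These closed-form conditionals are what make block Gibbs sampling, and therefore contrastive divergence, efficient for RBMs.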
The RBM defines a joint probability distribution over the visible and hidden units:

P(v, h) = exp(−E(v, h)) / Z

where Z is the partition function, a normalizing constant defined as:

Z = ∑_v ∑_h exp(−E(v, h))

The partition function is computationally intractable to compute directly due to the large number of possible states. This makes exact computation of the probability distribution difficult. The intractability of Z is a well-known challenge in training RBMs, as evaluating the joint probability distribution requires summing over all possible states of the visible and hidden units.

RBMs can be extended to handle other types of units, such as continuous or real-valued units, and their latent variables can be stacked to form deeper models. When stacked, RBMs form models like Deep Belief Networks (DBNs) or Deep Boltzmann Machines (DBMs), which have multiple hidden layers and are used for more complex generative tasks.

In summary, RBMs are fundamental components in deep generative models, where they help capture dependencies between observable and latent variables. They are probabilistic models defined by an energy function and involve a partition function that is typically intractable. Despite this, RBMs can still be trained using methods like contrastive divergence, which approximates the gradient without ever computing the partition function and makes training feasible in practice.
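A minimal sketch of contrastive divergence with a single Gibbs step (CD-1) is shown below. The function name cd1_update, the learning rate, the batch handling, and the use of probabilities rather than samples for the reconstruction statistics are common simplifications assumed here, not details given in the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(v0, W, c, b, rng, lr=0.05):
    """One CD-1 update on a batch of binary visible vectors v0 (shape: batch x nv).

    Positive phase uses the data; negative phase uses a single Gibbs step
    v0 -> h0 -> v1 -> h1 starting from the data.
    """
    # Positive phase: hidden activations driven by the data.
    ph0 = sigmoid(b + v0 @ W)                      # P(h = 1 | v0)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)

    # Negative phase: one step of Gibbs sampling away from the data.
    pv1 = sigmoid(c + h0 @ W.T)                    # P(v = 1 | h0)
    ph1 = sigmoid(b + pv1 @ W)                     # P(h = 1 | v1), using probabilities

    # Gradient approximation: data statistics minus reconstruction statistics.
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / len(v0)
    c += lr * (v0 - pv1).mean(axis=0)
    b += lr * (ph0 - ph1).mean(axis=0)
    return W, c, b

# Toy usage (illustrative): 6 visible units, 4 hidden units, random binary "data".
rng = np.random.default_rng(0)
data = rng.integers(0, 2, size=(128, 6)).astype(float)
W = 0.01 * rng.normal(size=(6, 4))
c, b = np.zeros(6), np.zeros(4)
for _ in range(100):
    W, c, b = cd1_update(data, W, c, b, rng)
```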
DEEP BELIEF NETWORKS (DBNS) are a type of deep learning model introduced in 2006 by Geoffrey Hinton and others. They represent one of the first successful attempts at training deep architectures, which were previously considered difficult to optimize. DBNs are generative models composed of multiple layers of latent variables, which are typically binary, and visible units, which can be either binary or real-valued. There are no intralayer connections, and units in adjacent layers are connected, often in a fully connected manner.

In DBNs, the connections between the top two layers are undirected, while connections between all other layers are directed, with arrows pointing downward toward the data. The layers consist of weight matrices and bias vectors, and the DBN defines a probability distribution over the hidden and visible units.

Training a DBN is done by initially training a Restricted Boltzmann Machine (RBM) on the data, followed by training subsequent RBMs layer by layer, where each RBM models the distribution of the hidden units from the previous layer. This greedy, layer-wise training method can be repeated to add more layers to the DBN. Once trained, the DBN can be used for tasks such as generative modeling or to improve classification tasks.
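The sketch below outlines this greedy, layer-wise procedure. It reuses the cd1_update function from the RBM sketch above; the function name train_dbn, the layer sizes, the epoch count, and the choice to propagate hidden probabilities (rather than samples) to the next layer are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_dbn(data, layer_sizes, rng, epochs=100):
    """Greedy layer-wise pretraining: train an RBM on the data, then train the
    next RBM on the hidden representation of the previous one, and so on.
    Relies on cd1_update from the earlier contrastive divergence sketch."""
    layers = []
    x = data
    nv = x.shape[1]
    for nh in layer_sizes:
        W = 0.01 * rng.normal(size=(nv, nh))
        c, b = np.zeros(nv), np.zeros(nh)
        for _ in range(epochs):
            W, c, b = cd1_update(x, W, c, b, rng)   # train this layer's RBM
        layers.append((W, c, b))
        x = sigmoid(b + x @ W)    # hidden activations become the next layer's "data"
        nv = nh
    return layers

# Toy usage (illustrative): stack two RBMs with 8 and 4 hidden units.
rng = np.random.default_rng(0)
data = rng.integers(0, 2, size=(128, 6)).astype(float)
dbn = train_dbn(data, layer_sizes=[8, 4], rng=rng, epochs=50)
```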
Although DBNs have mostly fallen out of favor today, they are recognized for their crucial role in the deep learning revolution. They helped establish the viability of deep architectures by demonstrating that such models could successfully train and outperform previous methods like kernelized support vector machines on datasets such as MNIST.

Despite their historical significance, DBNs have practical limitations, such as intractable inference and challenges with evaluating or maximizing log-likelihoods due to the complexity of the underlying model. These issues make DBNs less commonly used in contemporary deep learning applications compared to other models. However, their introduction paved the way for the more widespread adoption of deep neural networks.