
IETE Technical Review

ISSN: 0256-4602 (Print) 0974-5971 (Online) Journal homepage: www.tandfonline.com/journals/titr20

An Overview of Deep Generative Models

Jungang Xu, Hui Li & Shilong Zhou

To cite this article: Jungang Xu, Hui Li & Shilong Zhou (2015) An Overview of Deep Generative
Models, IETE Technical Review, 32:2, 131-139, DOI: 10.1080/02564602.2014.987328
To link to this article: https://doi.org/10.1080/02564602.2014.987328

Published online: 20 Dec 2014.

An Overview of Deep Generative Models
Jungang Xu, Hui Li and Shilong Zhou
School of Computer and Control Engineering, University of Chinese Academy of Sciences, Beijing 101408, China

ABSTRACT
As an important category of deep models, deep generative models have attracted more and more attention since the proposal of Deep Belief Networks (DBNs) and the fast greedy training algorithm based on restricted Boltzmann machines (RBMs). In the past few years, many different deep generative models have been proposed and used in the area of Artificial Intelligence. In this paper, three important deep generative models, namely DBNs, the deep autoencoder, and the deep Boltzmann machine, are reviewed. In addition, some successful applications of deep generative models in image processing, speech recognition, and information retrieval are introduced and analysed.

Keywords:
Deep autoencoder, Deep belief networks, Deep Boltzmann machine, Deep generative model, Restricted Boltzmann machine

1. INTRODUCTION

The human brain can easily obtain important information from large amounts of perception data. Neuroscientists have found that the cerebral cortex obtains representations of the data by passing it through a multi-layer network model, instead of extracting features directly [1-3]. In other words, the brain recognizes the outside world from extracted and decomposed information, not from the direct projection on the retina. The images on the retina are first transformed, by extracting and computing layer by layer, into information that can be represented directly, and each layer of the human perception system reduces the data volume of the previous layer while preserving the valuable structural information. Artificial intelligence (AI) aims to build this kind of intelligence into machines, so that a machine can represent and understand perception data as the human brain does. Therefore, how to extract high-level representations from large amounts of data is a critical challenge for many AI problems, such as image processing, speech recognition, and natural language processing. As strongly suggested by theoretical and biological views, constructing such intelligent systems requires multi-layer nonlinear models with a deep structure.

A common example of a deep structure is the traditional multi-layer neural network. Back propagation (BP) is the first learning algorithm for this kind of deep network [4]. However, the BP algorithm needs a lot of labeled data and rarely works well in practice when there are more than three layers [5-9].

Generally, a deep model is composed of multiple nonlinear modules, so its loss function is almost always non-convex. Deep models are difficult to optimize because the learning process has to deal with many local optima; gradient-based optimization algorithms tend to get stuck in them, which kept the training of deep models at a bottleneck for a long time.

In contrast, the loss function of a shallow structure, such as a traditional neural network with only one hidden layer or a support vector machine, is usually convex, so parameter optimization is more effective. However, theoretical results show that shallow structures alone are not sufficient to obtain high-level features from large amounts of perception data [10,11]. In addition, both the BP algorithm and shallow models need a lot of labeled data. The resulting insight is that transforming the learning problem into a convex optimization problem is a better way to train deep models than attacking the non-convex problem directly.

In 2006, Hinton et al. proposed a deep generative model called deep belief networks (DBNs) together with a fast unsupervised learning algorithm for DBNs [12]. The idea of greedily stacking shallow structures during learning has inspired the whole machine learning field. First, the proposed algorithm can quickly find a good set of parameters for multi-layer nonlinear models. Second, only a small amount of labeled data is needed during the whole training period. Finally, the values of the hidden variables in the deepest layer become easy to calculate. The proposal of DBNs and the corresponding learning algorithm opened a new situation in the field of AI and moved a step closer to its final objective.

In this paper, we review deep generative models, including their history, architecture, and applications. The rest of the paper is organized as follows. Section 2 introduces the historical context of deep generative models. Section 3 describes the architecture of three typical deep generative models, including the DBN, the deep autoencoder, and the deep Boltzmann machine (DBM). Section 4 introduces and analyses some typical applications of deep generative models in Artificial Intelligence. Section 5 presents discussions and perspectives.

2. THE HISTORICAL CONTEXT OF DEEP MODELS

According to probability and statistics theory, there are two types of probabilistic model: the generative model and the discriminative model. A generative model models the joint distribution, while a discriminative model focuses on the conditional distribution. If we need to predict the class y of a given observation x, for example, we can use a generative model to calculate p(x, y) or a discriminative model to calculate p(y | x) [13,14]. Deep models are based on probabilistic models, so they can be categorized into three classes: deep generative models, deep discriminative models, and hybrid deep architectures. The history of deep models is shown in Figure 1. Generally, the theory of deep discriminative models is simpler than that of deep generative models, which are usually described as graphical models. However, deep discriminative models are usually trained in a supervised way, which is very difficult, whereas deep generative models can be trained in unsupervised ways, which gives them more potential. Hybrid deep architectures combine generative and discriminative models and have some promising real applications.

Figure 1: The history of deep models.
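
As a reminder of how the two views are connected (a standard probability identity, not something introduced by this paper), a generative model of the joint distribution can always be used for the same prediction task as a discriminative model, since

    p(y | x) = p(x, y) / p(x) = p(x, y) / sum_y' p(x, y')

The practical difference discussed in this section is therefore not what can in principle be computed, but which distribution is modeled directly and how much labeled data the training procedure needs.
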
2.1 Deep Generative Model

A deep generative model is usually represented as a graphical model [15]. The sigmoid belief network is a kind of deep generative model that was proposed and studied before 2006 and trained using variational approximations [16-19]. However, calculating the multi-layer joint distribution with this model is barely tractable [5]. At the beginning of the twenty-first century, Hinton et al. proposed a kind of deep generative model called deep belief networks, based on sigmoid belief networks [12]. Different from sigmoid belief networks, the top two layers of a DBN form a restricted Boltzmann machine (RBM) [20-23], which enables a fast training method. The fast unsupervised learning algorithm for DBNs greedily trains one layer at a time and finally obtains a multi-layer probabilistic model. More deep generative models similar to the DBN have since been proposed and trained by greedily stacking shallow structures, such as deep neural networks (DNNs), the deep autoencoder, the DBM, the recurrent neural network (RNN), and so on [24-26]. In recent years, deep generative models have drawn more and more attention and are used to solve AI problems, since the corresponding training algorithms are fairly fast and need little labeled data.

2.2 Deep Discriminative Model

As described in Section 1, training algorithms like BP have some limitations, including getting stuck easily in local optima and being time-consuming. Inspired by the structure of the visual system, LeCun et al. designed a deep discriminative model called the convolutional neural network (CNN) and a training method for it [27,28]. Although training deep models directly in a supervised way is very difficult, the CNN is an exception and is capable of finding good optima in the nonlinear parameter space. However, both traditional training methods like BP and the training algorithms of CNNs need large amounts of labeled data, which restricts their application in fields that lack labeled data, such as information retrieval.

2.3 Hybrid Deep Architecture

After the DBN was proposed, approaches that combine existing discriminative models with DBNs were developed. There are two ways to train a hybrid deep architecture. In the first, the goal is discrimination, assisted by the outcomes of generative architectures through better optimization and/or regularization. In the second, the parameters of any of the deep generative models in Section 2.1 are learned using discriminative criteria [29].


3. THE ARCHITECTURE OF DEEP GENERATIVE MODELS

Deep generative models are an important category of deep models; they have many advantages and have been used in many applications of AI. In this section, we introduce the structures and learning algorithms of three typical deep generative models: the deep belief network, the deep autoencoder, and the DBM. Most other deep generative models are based on these three.

Generally, the DBN was proposed earlier than the other two models; it is an improved version of the sigmoid belief network with a fast training algorithm. The deep autoencoder is a specific category of autoencoder that uses a training algorithm similar to that of the DBN, and it is popular because it is convenient to fine-tune. The DBM is an undirected graphical model, and its training algorithm is more complicated than those of the other two models.

3.1 Deep Belief Network

A DBN is similar to a sigmoid belief network, except that its top two layers form an RBM. The DBN considered here has four layers: a visible layer x and three hidden layers h1, h2, and h3; see Figure 2. Unlike the sigmoid belief network, which has a factorized prior probability P(h3) on the top layer, the top two layers of the DBN form an RBM distribution, which is an undirected graphical model with probability P(h2, h3). Therefore, the joint distribution of a DBN with l hidden layers is defined as Eq. (1):

    P(x, h^1, ..., h^l) = P(h^(l-1), h^l) * [ prod_{k=1}^{l-2} P(h^k | h^(k+1)) ] * P(x | h^1)    (1)

Figure 2: The graphical models of sigmoid belief networks and deep belief networks.

There are two phases in training a DBN: pre-training and fine-tuning. A DBN is trained by greedily stacking RBMs layer by layer during the pre-training phase to find a good region of parameter space. Then a supervised learning algorithm is used to search for the optimum within that region, which is called fine-tuning. During the pre-training phase, each layer is trained as an RBM. To approximate the posterior probability of each layer, the algorithm proceeds as follows: (1) sample h1 ~ Q(h1 | x) from the first RBM, where Q(h1 | x) is the approximating distribution of h1; (2) calculate h2 using the sample h1 as the input of the second RBM; and (3) repeat these two steps until the top layer is reached; see Figure 3.

Figure 3: The pre-training of deep belief networks with three hidden layers.
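
To make the procedure above concrete, the following is a minimal illustrative sketch of greedy layer-wise pre-training in Python/NumPy, with each RBM trained by one-step contrastive divergence (CD-1, the Gibbs-sampling approximation discussed in the rest of this subsection). It assumes binary units throughout; the layer sizes, learning rate, and epoch count are arbitrary toy values, and the code is a reconstruction for illustration, not the authors' implementation.

    # Greedy layer-wise DBN pre-training with CD-1 (illustrative sketch).
    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sample_bernoulli(p):
        return (rng.random(p.shape) < p).astype(np.float64)

    def train_rbm(data, n_hidden, epochs=10, lr=0.05):
        """Train one binary RBM on data (n_samples x n_visible) with CD-1."""
        n_visible = data.shape[1]
        W = 0.01 * rng.standard_normal((n_visible, n_hidden))
        b = np.zeros(n_visible)            # visible biases
        c = np.zeros(n_hidden)             # hidden biases
        for _ in range(epochs):
            v0 = data
            ph0 = sigmoid(v0 @ W + c)      # positive phase: Q(h | v0)
            h0 = sample_bernoulli(ph0)
            v1 = sample_bernoulli(sigmoid(h0 @ W.T + b))   # one Gibbs step
            ph1 = sigmoid(v1 @ W + c)      # negative phase statistics
            n = data.shape[0]
            W += lr * (v0.T @ ph0 - v1.T @ ph1) / n
            b += lr * (v0 - v1).mean(axis=0)
            c += lr * (ph0 - ph1).mean(axis=0)
        return W, b, c

    def pretrain_dbn(data, layer_sizes):
        """Steps (1)-(3) above: train an RBM, sample h ~ Q(h | x), move up."""
        params, layer_input = [], data
        for n_hidden in layer_sizes:
            W, b, c = train_rbm(layer_input, n_hidden)
            params.append((W, b, c))
            layer_input = sample_bernoulli(sigmoid(layer_input @ W + c))
        return params

    if __name__ == "__main__":
        x = (rng.random((500, 784)) < 0.1).astype(np.float64)   # toy binary data
        dbn = pretrain_dbn(x, layer_sizes=[256, 128, 64])       # three hidden layers
        print([w.shape for w, _, _ in dbn])

The point the sketch makes explicit is that each new RBM only ever sees samples of the hidden units of the layer below it, exactly as in steps (1)-(3); the stacked weight matrices are what the later fine-tuning stage starts from.
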
As the building block of DBN training, the RBM plays a very important role in deep learning. The RBM is a restricted type of Boltzmann machine (BM), which was introduced as a bidirectionally connected network of stochastic processing units [22]. A BM can be used to learn important aspects of an unknown probability distribution based on samples from that distribution. However, there are practical limitations to using BMs, because the learning process is difficult and time-consuming. The RBM was proposed to alleviate this problem by imposing restrictions on the network topology [30].

Although RBMs are famous for their expressive power and tractable inference [31], training an RBM can be difficult in practice. The difficulty derives from the intractability of the log-likelihood gradient, which is composed of a positive phase term and a negative phase term. Calculating the exact value of the negative phase term requires unbiased sampling from the model distribution, and running the sampler long enough to ensure convergence to stationarity is of exponential complexity. Therefore, additional approximations are usually introduced into the learning methods to yield more efficient algorithms. Gibbs sampling based approximations of the negative phase term often lead to divergence of the training procedure and result in spurious probability modes far from the training data [32]. Consequently, RBM learning algorithms based on Gibbs sampling, such as contrastive divergence (CD) [23,33,34], persistent contrastive divergence (PCD) [35], and fast persistent contrastive divergence (FPCD) [36], show very poor mixing [37,38]. Parallel tempering based approximations can suppress the divergence problem, but the bias of the approximations still remains [30,39,40]. Tempered transition [41], another extended ensemble Monte Carlo method besides parallel tempering [42], can be used to improve the mixing rate and help train RBMs more effectively [43].
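
The approximations just listed differ mainly in how the negative phase samples are obtained. The following short sketch (again illustrative NumPy code for binary units, with its own small helpers, not the authors' implementation) contrasts CD-k, which restarts a short Gibbs chain at the training data for every update, with PCD, which keeps a set of persistent "fantasy" chains running across updates:

    # Negative-phase samplers for a binary RBM (illustrative sketch).
    import numpy as np

    rng = np.random.default_rng(1)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def bernoulli(p):
        return (rng.random(p.shape) < p).astype(np.float64)

    def gibbs_step(v, W, b, c):
        """One full Gibbs sweep v -> h -> v of the RBM."""
        h = bernoulli(sigmoid(v @ W + c))
        return bernoulli(sigmoid(h @ W.T + b))

    def cd_negative_samples(v_data, W, b, c, k=1):
        """CD-k: the chain starts at the data and runs only k steps."""
        v = v_data.copy()
        for _ in range(k):
            v = gibbs_step(v, W, b, c)
        return v

    def pcd_negative_samples(persistent_v, W, b, c, k=1):
        """PCD: the chain continues from its state at the previous update."""
        v = persistent_v
        for _ in range(k):
            v = gibbs_step(v, W, b, c)
        return v                     # caller keeps this as the new chain state

    if __name__ == "__main__":
        nv, nh, n = 20, 10, 8
        W = 0.01 * rng.standard_normal((nv, nh))
        b, c = np.zeros(nv), np.zeros(nh)
        batch = (rng.random((n, nv)) < 0.3).astype(np.float64)
        fantasy = bernoulli(0.5 * np.ones((n, nv)))    # persistent chain state
        print(cd_negative_samples(batch, W, b, c).shape,
              pcd_negative_samples(fantasy, W, b, c).shape)

Either sampler plugs into the same gradient update as in the CD-1 sketch above; the only difference is where the Gibbs chain is initialized, which is the source of the differences in mixing behaviour discussed in the text.
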

After pre-training, the obtained parameters are used to initialize a deep network, and supervised learning algorithms like BP are then used to fine-tune the deep model.

3.2 Deep Autoencoder

An autoencoder typically has an input layer which represents the original data or features, one or more hidden layers that represent the transformed features, and an output layer which matches the input layer for reconstruction [29]. A deep autoencoder is an autoencoder with more than one hidden layer. A deep autoencoder has two parts: an adaptive, multilayer encoder network that transforms the high-dimensional data into a low-dimensional code, and a similar decoder network that recovers the data from the code [25].

An autoencoder is often trained using one of the many backpropagation variants. Initialized with random weights in the encoder and decoder, the parameters can be trained together by minimizing the discrepancy between the original data and its reconstruction. The required gradients are easily obtained by using the chain rule to backpropagate error derivatives first through the decoder network and then through the encoder network [44]. Although such training is often reasonably effective, there are fundamental problems when back-propagation is used to train networks with many hidden layers: once the errors have been back-propagated to the first few layers, they become minuscule, and training becomes quite ineffective [45-47]. However, training a deep autoencoder works well if the initial weights are close to a good solution. Such initial weights can be found by stacking RBMs layer by layer, just like the DBN pre-training algorithm. The pre-trained layers of feature detectors are then unfolded to produce encoder and decoder networks that initially use the same weights. The global fine-tuning stage then replaces stochastic activities by deterministic, real-valued probabilities and uses BP through the whole autoencoder to fine-tune the weights for optimal reconstruction; see Figure 4.

Figure 4: The training of a deep autoencoder.
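
The "unfolding" of pre-trained RBMs into an encoder-decoder pair can be sketched as follows (an illustrative NumPy fragment under the same assumptions as the earlier sketches; the list of (W, visible_bias, hidden_bias) triples is taken to come from stacked RBM pre-training, and deterministic probabilities are used in place of stochastic activities, as in the global fine-tuning stage described above):

    # Unrolling stacked RBMs into a deep autoencoder (illustrative sketch).
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def unroll_and_reconstruct(x, rbm_params):
        """Encoder: bottom-up through W_1..W_L; decoder: top-down through
        the tied (transposed) weights. Returns (code, reconstruction)."""
        h = x
        for W, _, c in rbm_params:            # encoder half
            h = sigmoid(h @ W + c)
        code = h
        v = code
        for W, b, _ in reversed(rbm_params):  # decoder half, tied weights
            v = sigmoid(v @ W.T + b)
        return code, v

    if __name__ == "__main__":
        rng = np.random.default_rng(2)
        x = (rng.random((100, 784)) < 0.1).astype(np.float64)
        shapes = [(784, 256), (256, 64)]
        params = [(0.01 * rng.standard_normal(s), np.zeros(s[0]), np.zeros(s[1]))
                  for s in shapes]            # stand-ins for pre-trained RBMs
        code, recon = unroll_and_reconstruct(x, params)
        # Reconstruction error that BP fine-tuning would then minimize
        print(code.shape, float(((x - recon) ** 2).mean()))

Fine-tuning treats this unrolled network as one deterministic autoencoder and backpropagates the reconstruction error first through the decoder and then through the encoder, as described above; after fine-tuning, the encoder and decoder weights are typically no longer tied.
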
3.3 Deep Boltzmann Machine

A DBM is similar to a DBN in that it contains many layers of hidden variables. However, a DBM is an undirected graphical model, in which all connections between layers are undirected and there are no connections between variables within the same layer [48,49].

The DBM is a special case of the general BM. Learning in a BM can be carried out by applying a stochastic approximation procedure that uses Markov chain Monte Carlo to approximate the model's expected sufficient statistics, which gives effective results. However, learning a DBM is rather slow if this maximum likelihood procedure for general BMs is used, particularly when the hidden units form layers that become increasingly remote from the visible units [50]. As with the DBN, a greedy layer-wise training algorithm has been proposed to initialize the model parameters of a DBM, in which multiple RBMs are stacked layer by layer greedily to form a deep model. However, the input is doubled for the lower-level RBM to compensate for the lack of top-down input into h1; see Figure 5. Greedily pre-training the weights of a DBM in this way serves two purposes. First, it initializes the weights to reasonable values. Second, it ensures that there is a fast way of performing approximate inference by a single upward pass through the stack of modified RBMs.

Figure 5: Deep Boltzmann machine and its pre-training.
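
For concreteness, the undirected structure can be written down for a DBM with a visible layer v and two hidden layers h1 and h2 (a standard formulation along the lines of [48]; bias terms are omitted here for brevity):

    E(v, h1, h2) = - v' W1 h1 - h1' W2 h2
    P(v, h1, h2) = exp( -E(v, h1, h2) ) / Z

where W1 and W2 are the inter-layer weight matrices, ' denotes transposition, and Z is the partition function obtained by summing exp(-E) over all configurations. Unlike the DBN joint distribution in Eq. (1), no layer acts as a directed generator of the one below it: adjacent layers interact symmetrically, which is why the greedy pre-training has to be modified (the doubled input above) and why inference combines bottom-up and top-down influences.
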
4. TYPICAL APPLICATIONS OF DEEP GENERATIVE MODELS

As mentioned above, deep generative models are widely used in the field of AI. In this section, we introduce a set of typical and successful applications of deep generative models in image processing, speech recognition, and information retrieval.

4.1 Selected Applications in Image Processing

Traditional image recognition technologies include wavelet transformation, Gabor filters, Bayesian network decision, etc. For example, a novel approach to recognizing facial expressions was proposed in reference [51], where the facial features are represented by a hybrid of the Gabor wavelet transform of an image and a local transitional pattern code. However, the effectiveness and efficiency of traditional image recognition technologies are still not very satisfactory. The DBN was proposed and tested on a simple image recognition task, the MNIST data-set of handwritten digits, which is a common data-set for machine learning and pattern recognition experiments [5,52-54]. The DBN showed promising results and outperformed most of the existing models. At the same time, the deep autoencoder was developed and demonstrated successfully on a dimensionality reduction task [27]. The parameters of the deep autoencoder are initialized by stacking multiple RBMs and training each RBM greedily, which allows deep autoencoder networks to learn low-dimensional codes that work much better than principal components analysis as a tool to reduce the dimensionality of data.

A modified DBN was developed in which the top-layer model uses a third-order Boltzmann machine [55]. This type of DBN was applied to the NORB database, a three-dimensional object recognition task. Later, two strategies to improve the robustness of the DBN were developed [56]. First, sparse connections in the first layer of the DBN are used as a way to regularize the model. Second, a probabilistic de-noising algorithm was developed. Both techniques are shown to be effective in improving robustness against occlusion and random noise in a noisy image recognition task. The DBN has also been successfully applied to create compact but meaningful representations of images for retrieval purposes [57]; on a large-collection image retrieval task, deep generative models also produced strong results.

A probabilistic version based on the DBM was developed to solve multimodal learning problems [58], in which a probability density is defined on the joint space of the multimodal inputs and the states of suitably defined latent variables are used as the representation. The advantage of this probabilistic formulation is that a missing modality's information can be filled in naturally by sampling from its conditional distribution. For bi-modal data consisting of images and text, the multimodal DBM is shown to outperform the deep multimodal autoencoder as well as the multimodal DBN in classification and information retrieval tasks [58,59].

4.2 Selected Applications in Speech Recognition

State-of-the-art hidden Markov model (HMM) systems with observation probabilities approximated by Gaussian mixture models (GMMs) have been used in speech recognition for a long time, while traditional neural networks were barely used because of their low performance.

A few years ago, a five-layer DBN was used to replace the Gaussian mixture component of the GMM-HMM, with the monophone state as the modeling unit for phone data [60]. Although monophones are generally accepted as a weaker phonetic representation than triphones, the DBN-HMM approach with monophones was shown to achieve higher phone recognition accuracy than state-of-the-art triphone GMM-HMM systems [61]. In more recent work, one popular type of sequence classification criterion, maximum mutual information, was successfully applied to learn DBN weights for the Texas Instruments and Massachusetts Institute of Technology (TIMIT) phone recognition task [62-64].

The DBN-HMM was extended from the monophone phonetic representation to the triphone, or context-dependent, counterpart and from phone recognition to large vocabulary speech recognition [65-71]. Experiments on the Bing mobile voice search data-set, collected under real usage scenarios, demonstrate that the triphone DBN-HMM significantly outperforms the state-of-the-art HMM system [60]. Three factors in addition to the DBN contribute to this success: the use of triphones as the DBN modeling units, the use of the best available triphone GMM-HMM to generate the alignment for each state in the triphones, and the tuning of the transition probabilities. The experiments also indicated that the decoding time of a five-layer DBN-HMM is almost the same as that of the state-of-the-art triphone GMM-HMM [68,69].


4.3 Selected Applications in Information Retrieval

Semantic hashing was the first method to model documents as high-level features using deep generative models [72,73]. Based on word-count features, the hidden variables in the final layer of a DBN give a much better representation of each document than the widely used latent semantic analysis and the traditional term frequency-inverse document frequency (TF-IDF) approach for information retrieval. Documents are mapped to a space of memory addresses in which semantically similar text documents are located at nearby addresses, so as to facilitate rapid document retrieval.

During pre-training, a constrained conditional Poisson model is used to model the word-count vectors, and then normal RBMs are stacked layer by layer up to the top layer. The deep model is then unrolled into a deep autoencoder and fine-tuned with the BP algorithm. After the deep model is trained, the retrieval process starts by mapping each query document into a binary code, obtained by performing a forward pass through the model and thresholding the result. Then the Hamming distances between the query's binary code and all other documents' binary codes are computed efficiently.
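
A minimal sketch of that retrieval step is given below (illustrative NumPy code; the trained deep encoder is stood in for by a hypothetical encode function returning real-valued top-layer activities, and the 0.5 threshold, 32-bit code length, and toy data are arbitrary choices, not values prescribed by the cited work):

    # Semantic-hashing style retrieval: threshold codes, rank by Hamming distance.
    import numpy as np

    def to_binary_code(activations, threshold=0.5):
        """Threshold real-valued top-layer activities into a binary code."""
        return (activations > threshold).astype(np.uint8)

    def hamming_rank(query_code, corpus_codes, top_k=5):
        """Indices of the top_k corpus documents closest in Hamming distance."""
        distances = np.count_nonzero(corpus_codes != query_code, axis=1)
        return np.argsort(distances)[:top_k], distances

    if __name__ == "__main__":
        rng = np.random.default_rng(3)
        # Hypothetical stand-in for the trained deep encoder's forward pass.
        encode = lambda word_counts: rng.random((word_counts.shape[0], 32))
        corpus_counts = rng.integers(0, 5, size=(1000, 2000))   # toy word counts
        query_counts = rng.integers(0, 5, size=(1, 2000))
        corpus_codes = to_binary_code(encode(corpus_counts))
        query_code = to_binary_code(encode(query_counts))[0]
        nearest, dists = hamming_rank(query_code, corpus_codes)
        print(nearest, dists[nearest])

In the actual semantic hashing setup, encode would be the forward pass of the fine-tuned encoder half of the deep model, and the binary codes can additionally be used directly as memory addresses so that nearby addresses hold semantically similar documents, as described above.
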
Recently, a type of DBM was proposed to extract distributed semantic representations from a large unstructured collection of documents; it overcomes the apparent difficulty of training a DBM by judicious parameter tying [74]. This enables an efficient pre-training algorithm and a state initialization scheme for fast inference, so the model can be trained just as efficiently as a standard RBM. Experiments showed that the model assigns better log probability to unseen data than the Replicated Softmax model. Features extracted from the model outperform latent Dirichlet allocation (LDA), Replicated Softmax, and document neural autoregressive distribution estimator (DocNADE) models on document retrieval and document classification tasks.

5. DISCUSSIONS AND PERSPECTIVES

Deep learning has recently emerged as a promising research field and is widely used as an effective tool in many applications. Deep generative models are a category of deep models characterized by fast training and little need for labeled data. We introduced the architectures and training methods of three popular deep generative models, namely the DBN, the deep autoencoder, and the DBM, and described typical applications of deep generative models in image processing, speech recognition, and information retrieval. Although various deep learning models and applications have been proposed, much work remains for the future. First, improved deep generative models are needed, with architectures closer to the human brain and simpler training theories. Second, after DistBelief was proposed by Google as a distributed large-scale deep network, distributed and parallel training algorithms for deep generative models have become a hot research area, and the map/reduce programming model will be used in these algorithms [75]. Such large-scale deep networks are promising for processing big data. Third, the application of deep generative models to information retrieval is worth developing further: existing deep models are well suited to perception data with rich multi-layer structure, such as image and speech data, but they are still too complex for plain data such as text.

Funding

This work was supported by the National Natural Science Foundation of China [grant number 61372171]; the National Key Technology R&D Program of China [grant number 2012BAH23B03].

REFERENCES

1. T. S. Lee, and D. Mumford, "Hierarchical Bayesian inference in the visual cortex," The Journal of the Optical Society of America A, Vol. 20, no. 7, pp. 1434-48, Jul. 2003.
2. T. Serre, L. Wolf, and S. Bileschi, "Robust object recognition with cortex-like mechanisms," IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 29, no. 3, pp. 411-26, Mar. 2007.
3. T. S. Lee, D. Mumford, and R. Romero, "The role of the primary visual cortex in higher level vision," Vision Research, Vol. 38, no. 15, pp. 2429-54, Aug. 1998.
4. D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, Vol. 323, no. 7, pp. 533-6, Oct. 1986.
5. Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, "Greedy layer-wise training of deep networks," in Advances in Neural Information Processing Systems, Vol. 19, B. Schölkopf, J. C. Platt and T. Hoffman, Eds. Cambridge, MA: MIT Press, 2006, pp. 153-60.
6. H. Larochelle, Y. Bengio, J. Louradour, and P. Lamblin, "Exploring strategies for training deep neural networks," Journal of Machine Learning Research, Vol. 1, pp. 1-40, Jan. 2009.
7. P. J. Werbos, Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. Boston: Harvard University, 1974.
8. R. Hecht-Nielsen, "Replicator neural networks for universal optimal source coding," Science, Vol. 269, pp. 1860-3, Sept. 1995.
9. G. Tesauro, "Practical issues in temporal difference learning," Machine Learning, Vol. 8, no. 3-4, pp. 257-77, May 1992.
10. Y. Bengio, "Learning deep architectures for AI," Foundations and Trends in Machine Learning, Vol. 2, no. 1, pp. 1-127, Jan. 2009.
11. Y. Bengio, and Y. LeCun, "Scaling learning algorithms towards AI," Large-Scale Kernel Machines, Vol. 34, pp. 1-41, Sept. 2007.
12. G. E. Hinton, S. Osindero, and Y. W. Teh, "A learning algorithm for deep belief nets," Neural Computation, Vol. 18, no. 7, pp. 1527-54, Jul. 2006.
13. B. Taskar, P. Abbeel, and D. Koller, "Discriminative probabilistic models for relational data," in Proceedings of Conference on Uncertainty in Artificial Intelligence, Alberta, 2002, pp. 485-92.
14. J. A. Lasserre, C. M. Bishop, and T. P. Minka, "Principled hybrids of generative and discriminative models," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, New York, 2006, pp. 87-94.
15. M. I. Jordan, Learning in Graphical Models. Dordrecht: Kluwer, 1998.
16. P. Dayan, G. E. Hinton, R. Neal, and R. Zemel, "The Helmholtz machine," Neural Computation, Vol. 7, no. 5, pp. 889-904, Sept. 1995.
17. G. E. Hinton, P. Dayan, B. J. Frey, and R. M. Neal, "The "wake-sleep" algorithm for unsupervised neural networks," Science, Vol. 268, no. 5214, pp. 1158-61, May 1995.
18. L. K. Saul, T. Jaakkola, and M. I. Jordan, "Mean field theory for sigmoid belief networks," Journal of Artificial Intelligence Research, Vol. 4, no. 1, pp. 61-76, Jan. 1996.
19. I. Titov, and J. Henderson, "Constituent parsing with incremental sigmoid belief networks," in Proceedings of Meeting of Association for Computational Linguistics, Prague, 2007, pp. 632-9.
20. P. Smolensky, "Information processing in dynamical systems: foundations of harmony theory," Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, pp. 194-281, Feb. 1986.
21. Y. Freund, and D. Haussler, "Unsupervised learning of distributions on binary vectors using two layer networks," in Advances in Neural Information Processing Systems, Vol. 4, J. E. Moody, S. J. Hanson, and R. P. Lippmann, Eds. Denver, CO: Morgan Kaufmann, 1991, pp. 912-9.
22. G. E. Hinton, "Training products of experts by minimizing contrastive divergence," Neural Computation, Vol. 14, no. 8, pp. 1771-800, Aug. 2002.
23. M. Welling, M. Rosen-Zvi, and G. E. Hinton, "Exponential family harmoniums with an application to information retrieval," in Advances in Neural Information Processing Systems, Vol. 17, L. K. Saul, Y. Weiss and L. Bottou, Eds. Cambridge, MA: MIT Press, 2004, pp. 1481-8.
24. R. Salakhutdinov, and G. E. Hinton, "Deep Boltzmann machines," in Proceedings of International Conference on Artificial Intelligence and Statistics, Florida, 2009, pp. 448-55.
25. G. E. Hinton, and R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, Vol. 313, no. 5786, pp. 504-7, May 2006.
26. R. Collobert, and J. Weston, "A unified architecture for natural language processing: Deep neural networks with multitask learning," in Proceedings of International Conference on Machine Learning, Helsinki, 2008, pp. 160-7.
27. M. Ranzato, C. Poultney, S. Chopra, and Y. LeCun, "Efficient learning of sparse representations with an energy-based model," in Advances in Neural Information Processing Systems, Vol. 19, B. Schölkopf, J. C. Platt and T. Hoffman, Eds. Cambridge: MIT Press, 2006, pp. 1137-44.
28. P. Y. Simard, D. Steinkraus, and J. C. Platt, "Best practices for convolutional neural networks applied to visual document analysis," in Proceedings of the 7th International Conference on Document Analysis and Recognition, Washington DC, 2003, pp. 958-63.
29. L. Deng, and D. Yu, "Deep learning for signal and information processing," Microsoft Research Report, Redmond, 2013.
30. K. H. Cho, T. Raiko, and A. Ilin, "Parallel tempering is efficient for learning restricted Boltzmann machines," in Proceedings of the 2010 International Joint Conference on Neural Networks, Thessaloniki, 2010, pp. 1-8.
31. N. Le Roux, and Y. Bengio, "Representational power of restricted Boltzmann machines and deep belief networks," Neural Computation, Vol. 20, no. 6, pp. 1631-49, Jun. 2008.
32. A. Fischer, and C. Igel, "Empirical analysis of the divergence of Gibbs sampling based learning algorithms for restricted Boltzmann machines," in Proceedings of the 20th International Conference on Artificial Neural Networks, Thessaloniki, 2010, pp. 208-17.
33. G. E. Hinton, "Products of experts," in Proceedings of the 9th International Conference on Artificial Neural Networks, London, 1999, pp. 1-6.
34. G. E. Hinton, "Learning multiple layers of representation," Trends in Cognitive Sciences, Vol. 11, no. 10, pp. 428-34, Oct. 2007.
35. T. Tieleman, "Training restricted Boltzmann machines using approximations to the likelihood gradient," in Proceedings of the 25th International Conference on Machine Learning, New York, 2008, pp. 1064-71.
36. T. Tieleman, and G. Hinton, "Using fast weights to improve persistent contrastive divergence," in Proceedings of the 26th Annual International Conference on Machine Learning, New York, 2009, pp. 1033-40.
37. Y. Bengio, and O. Delalleau, "Justifying and generalizing contrastive divergence," Neural Computation, Vol. 21, no. 6, pp. 1601-21, Jun. 2009.
38. A. Fischer, and C. Igel, "Empirical analysis of the divergence of Gibbs sampling based learning algorithms for restricted Boltzmann machines," in Proceedings of the 20th International Conference on Artificial Neural Networks, Thessaloniki, 2010, pp. 208-17.
39. D. J. Earl, and M. W. Deem, "Parallel tempering: theory, applications, and new perspectives," Physical Chemistry Chemical Physics, Vol. 7, pp. 3910-6, Aug. 2005.
40. G. Desjardins, A. Courville, and Y. Bengio, "Parallel tempering for training of restricted Boltzmann machines," in Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, New York, 2010, pp. 145-52.
41. R. M. Neal, "Sampling from multimodal distributions using tempered transitions," Statistics and Computing, Vol. 6, no. 4, pp. 353-66, Dec. 1996.
42. Y. Iba, "Extended ensemble Monte Carlo," International Journal of Modern Physics, Vol. 12, no. 5, pp. 623-56, Jun. 2001.
43. J. Xu, H. Li, and S. Zhou, "Improving mixing rate with tempered transition for learning restricted Boltzmann machines," Neurocomputing, Vol. 139, pp. 328-35, Sept. 2014.
44. D. C. Plaut, and G. E. Hinton, "Learning sets of filters using back-propagation," Computer, Speech and Language, Vol. 2, no. 1, pp. 35-61, Mar. 1987.
45. D. DeMers, and G. Cottrell, "Non-linear dimension reduction," in Advances in Neural Information Processing Systems, Vol. 5, S. J. Hanson, J. D. Cowan and C. L. Giles, Eds. San Mateo, CA: Morgan Kaufmann, 1992, pp. 580-7.
46. R. Hecht-Nielsen, "Replicator neural networks for universal optimal source coding," Science, Vol. 269, no. 5232, pp. 1860-3, Sept. 1995.
47. N. Kambhatla, and T. K. Leen, "Dimension reduction by local principal component analysis," Neural Computation, Vol. 9, no. 7, pp. 1493-516, Oct. 1997.
48. R. Salakhutdinov, and G. E. Hinton, "Deep Boltzmann machines," in Proceedings of the 12th International Conference on Artificial Intelligence and Statistics, Clearwater Beach, 2009, pp. 448-55.
49. R. Salakhutdinov, and G. Hinton, "A better way to pretrain deep Boltzmann machines," in Advances in Neural Information Processing Systems, Vol. 25, F. Pereira, C. J. C. Burges, L. Bottou and K. Q. Weinberger, Eds. Cambridge, MA: MIT Press, 2012, pp. 1-9.
50. R. Salakhutdinov, "Learning deep generative models," Ph.D. Dissertation, Graduate Department of Computer Science, Univ. Toronto, Toronto, 2009.
51. A. Tanveer, J. Taskeed, and C. Ui-Pil, "Facial expression recognition using local transitional pattern on Gabor filtered facial images," IETE Technical Review, Vol. 30, no. 1, pp. 47-52, Jan. 2013.
52. G. E. Hinton, S. Osindero, and Y. W. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, Vol. 18, no. 7, pp. 1527-54, Jul. 2006.
53. J. Luo, and A. Brodsky, "An EM-based multi-step piecewise surface regression learning algorithm," in Proceedings of the 7th International Conference on Data Mining, Las Vegas, 2011, pp. 286-92.
54. J. Luo, A. Brodsky, and Y. Li, "An EM-based ensemble learning algorithm on piecewise surface regression problem," International Journal of Applied Mathematics and Statistics, Vol. 28, no. 4, pp. 59-74, Aug. 2012.
55. V. Nair, and G. Hinton, "3-D object recognition with deep belief nets," in Advances in Neural Information Processing Systems, Vol. 22, Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams and A. Culotta, Eds. Cambridge, MA: MIT Press, 2009, pp. 1339-47.
56. Y. Tang, and C. Eliasmith, "Deep networks for robust visual recognition," in Proceedings of the 27th International Conference on Machine Learning, Haifa, 2010, pp. 1055-62.
57. A. Torralba, R. Fergus, and Y. Weiss, "Small codes and large image databases for recognition," in Proceedings of Computer Vision and Pattern Recognition, Anchorage, 2008, pp. 1-8.
58. J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Ng, "Multimodal deep learning," in Proceedings of the 28th International Conference on Machine Learning, Bellevue, 2011, pp. 689-96.
59. N. Srivastava, and R. Salakhutdinov, "Multimodal learning with deep Boltzmann machines," in Advances in Neural Information Processing Systems, Vol. 25, F. Pereira, C. J. C. Burges, L. Bottou and K. Q. Weinberger, Eds. Montreal, Canada: NIPS, 2012, pp. 2222-30.
60. A. Mohamed, G. Dahl, and G. Hinton, "Deep belief networks for phone recognition," in Proceedings of Neural Information Processing Systems 2009 Workshop on Deep Learning for Speech Recognition and Related Applications, Vancouver, 2009.
61. G. Sivaram, and H. Hermansky, "Sparse multilayer perceptron for phoneme recognition," IEEE Trans. Audio, Speech, & Language Processing, Vol. 20, no. 1, pp. 23-9, Jan. 2012.
62. A. Mohamed, G. Dahl, and G. Hinton, "Acoustic modeling using deep belief networks," IEEE Trans. Audio, Speech, & Language Processing, Vol. 20, no. 1, pp. 14-22, Jan. 2012.
63. A. Mohamed, G. Hinton, and G. Penn, "Understanding how deep belief networks perform acoustic modelling," in Proceedings of the 37th International Conference on Acoustics, Speech, and Signal Processing, Kyoto, 2012, pp. 4273-76.
64. A. Mohamed, D. Yu, and L. Deng, "Investigation of full-sequence training of deep belief networks for speech recognition," in Proceedings of the 11th Annual Conference of the International Speech Communication Association, Makuhari, 2010, pp. 2846-9.
65. D. Yu, F. Seide, G. Li, and L. Deng, "Exploiting sparseness in deep neural networks for large vocabulary speech recognition," in Proceedings of the 37th International Conference on Acoustics, Speech, and Signal Processing, Kyoto, 2012, pp. 4409-12.
66. D. Yu, S. Wang, Z. Karam, and L. Deng, "Language recognition using deep-structured conditional random fields," in Proceedings of the 35th International Conference on Acoustics, Speech and Signal Processing, 2010, pp. 5030-3.
67. F. Seide, G. Li, X. Chen, and D. Yu, "Feature engineering in context-dependent deep neural networks for conversational speech transcription," in Proceedings of the 2011 IEEE Workshop on Automatic Speech Recognition and Understanding, Hawaii, 2011, pp. 24-9.
68. G. Dahl, D. Yu, L. Deng, and A. Acero, "Context-dependent DBN-HMMs in large vocabulary continuous speech recognition," in Proceedings of the 36th International Conference on Acoustics, Speech, and Signal Processing, Prague, 2011, pp. 4688-91.
69. G. Dahl, D. Yu, L. Deng, and A. Acero, "Context-dependent, pre-trained deep neural networks for large vocabulary speech recognition," IEEE Trans. Audio, Speech, & Language Proc., Vol. 20, no. 1, pp. 30-42, Jan. 2012.
70. Y. Kubo, T. Hori, and A. Nakamura, "Integrating deep neural networks into structural classification approach based on weighted finite-state transducers," in Proceedings of the 13th Annual Conference of the International Speech Communication Association, Portland, 2012.
71. L. Deng, J. Li, K. Huang, D. Yao, F. Yu, M. Seide, G. Seltzer, X. Zweig, J. He, Y. Williams, and A. Acero, "Recent advances in deep learning for speech research at Microsoft," in Proceedings of International Conference on Acoustics, Speech and Signal Processing, Vancouver, 2013, pp. 8604-8.
72. G. Hinton, and R. Salakhutdinov, "Discovering binary codes for documents by learning deep generative models," Topics in Cognitive Science, Vol. 3, no. 1, pp. 74-91, Jan. 2011.
73. R. Salakhutdinov, and G. Hinton, "Semantic hashing," International Journal of Approximate Reasoning, Vol. 50, no. 7, pp. 969-78, Jul. 2009.
74. N. Srivastava, R. Salakhutdinov, and G. E. Hinton, "Modeling documents with deep Boltzmann machines," in Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence, Bellevue, 2013, pp. 616-24.
75. W. Fang, W. Pan, and Z. Cui, "View of MapReduce: Programming model, methods, and its applications," IETE Technical Review, Vol. 29, no. 5, pp. 380-7, Sept. 2012.


Authors
Jungang Xu is an associate professor at the School of Computer and Control Engineering, University of Chinese Academy of Sciences. He received the PhD degree in computer applied technology from the Graduate University of Chinese Academy of Sciences in 2003. During 2003-2005, he was a post-doctoral researcher at Tsinghua University. His current research interests include deep learning, parallel computing, big data management, etc.
Email: [email protected].

Hui Li is an MS student at the School of Computer and Control Engineering, University of Chinese Academy of Sciences. She received the BS degree in software engineering from Jilin University in 2011. Her current research interests include deep learning theory and its applications.
Email: [email protected].

Shilong Zhou is an MS student at the School of Computer and Control Engineering, University of Chinese Academy of Sciences. He received the BS degree in software engineering from Northeast University in 2012. His current research interests include deep learning and information retrieval.
Email: [email protected].

DOI: 10.1080/02564602.2014.987328; Copyright © 2014 by the IETE
