Neural Networks For Machine Learning: Lecture 16a Learning A Joint Model of Images and Captions
Lecture 16a
Learning a joint model of images and captions
Geoffrey Hinton
Nitish Srivastava,
Kevin Swersky
Tijmen Tieleman
Abdel-rahman Mohamed
Modeling the joint density of images and captions
(Srivastava and Salakhutdinov, NIPS 2012)
• Goal: To build a joint density model of captions and standard computer vision feature vectors extracted from real photographs.
– This needs a lot more computation than building a joint density model of labels and digit images!
1. Train a multilayer model of images.
2. Train a separate multilayer model of word-count vectors.
3. Then add a new top layer that is connected to the top layers of both individual models.
– Use further joint training of the whole system to allow each modality to improve the earlier layers of the other modality.
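The three-step recipe above can be sketched structurally in Python. This is only a shape-level sketch: the `pretrain_stack` helper is a hypothetical stand-in (a random projection plus a logistic nonlinearity) for real greedy RBM pre-training, and all sizes and data are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def pretrain_stack(data, layer_sizes):
    """Stand-in for greedy layer-wise pre-training: each 'layer' here is
    just a random projection followed by a logistic nonlinearity, so the
    example stays tiny and runnable."""
    weights, h = [], data
    for n_hidden in layer_sizes:
        W = rng.normal(0, 0.1, size=(h.shape[1], n_hidden))
        weights.append(W)
        h = 1.0 / (1.0 + np.exp(-h @ W))
    return weights, h

# Step 1: a multilayer model of image feature vectors.
images = rng.normal(size=(100, 64))
img_weights, img_top = pretrain_stack(images, [32, 16])

# Step 2: a separate multilayer model of word-count vectors.
captions = rng.poisson(1.0, size=(100, 200)).astype(float)
cap_weights, cap_top = pretrain_stack(captions, [64, 16])

# Step 3: a new top layer connected to the top layers of both stacks.
joint_input = np.concatenate([img_top, cap_top], axis=1)
W_joint = rng.normal(0, 0.1, size=(joint_input.shape[1], 24))
joint_top = 1.0 / (1.0 + np.exp(-joint_input @ W_joint))

print(joint_top.shape)  # (100, 24)
```

The further joint training of step 3's combined system is what the slides argue requires a DBM rather than a DBN, since it must be able to refine the lower layers of both modalities.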
Modeling the joint density of images and captions
(Srivastava and Salakhutdinov, NIPS 2012)
• Instead of using a deep belief net, use a deep Boltzmann machine that
has symmetric connections between all pairs of layers.
– Further joint training of the whole DBM allows each modality to
improve the earlier layers of the other modality.
– That’s why they used a DBM.
– They could also have used a DBN and done generative fine-tuning
with contrastive wake-sleep.
• But how did they pre-train the hidden layers of a deep Boltzmann machine?
– Standard pre-training leads to a composite model that is a DBN, not a DBM.
Combining three RBMs to make a DBM
• The top and bottom RBMs must be pre-trained with the weights in one direction twice as big as in the other direction.
– This can be justified!
• The middle layers do geometric model averaging.
[Figure: three RBMs (v–h1, h1–h2, h2–h3) pre-trained with doubled weights 2W1, 2W2, 2W3, then combined into a single DBM with weights W1, W2, W3 between layers v, h1, h2, h3.]
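The scaling argument can be made concrete with a small numpy sketch. All weights and sizes below are random, illustrative stand-ins, not trained RBMs; the point is only that an intermediate layer's total input has roughly the same scale whether it comes from one doubled direction (during pre-training) or from single weights in both directions (in the composite DBM).

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Random stand-ins for the weights of three greedily pre-trained RBMs
# (v--h1, h1--h2, h2--h3); the shapes are illustrative.
W1 = rng.normal(0, 0.1, size=(64, 32))
W2 = rng.normal(0, 0.1, size=(32, 16))
W3 = rng.normal(0, 0.1, size=(16, 8))

v = rng.random((5, 64))

# During pre-training, the bottom RBM drives h1 with doubled weights
# (2*W1), anticipating that in the final DBM h1 will also receive
# top-down input.
h1_pre = sigmoid(v @ (2 * W1))
h2_pre = sigmoid(h1_pre @ (2 * W2))  # middle RBM: doubled weights
h3_pre = sigmoid(h2_pre @ (2 * W3))  # top RBM: doubled top-down weights

# In the composite DBM, each intermediate layer instead sums bottom-up
# and top-down input through the single (undoubled) weight matrices,
# keeping its total input at roughly the pre-training scale.
h1_dbm = sigmoid(v @ W1 + h2_pre @ W2.T)
h2_dbm = sigmoid(h1_pre @ W2 + h3_pre @ W3.T)
```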
Neural Networks for Machine Learning
Lecture 16b
Hierarchical coordinate frames
Geoffrey Hinton
with
Nitish Srivastava
Kevin Swersky
Why convolutional neural networks are doomed
[Figure: a face with pose matrix Tj (i.e. its relationship to the camera) has parts mouth and nose with pose matrices Ti and Th. Tij and Thj are the fixed part–whole relations, so the two parts agree on the face's pose when TiTij ≈ ThThj.]
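The agreement test TiTij ≈ ThThj in the figure can be checked directly with matrices. In this sketch the pose matrices are simple 2-D homogeneous translations (a real pose matrix would also encode rotation and scale), and all the particular values are made up:

```python
import numpy as np

def translate(dx, dy):
    """2-D homogeneous translation matrix; a simple stand-in for a
    full pose matrix that would also include rotation and scale."""
    T = np.eye(3)
    T[0, 2], T[1, 2] = dx, dy
    return T

# Assumed face pose relative to the camera, and fixed part-whole
# relations of mouth and nose within the face's coordinate frame.
Tj  = translate(10.0, 5.0)   # face pose
Tij = translate(0.0, -2.0)   # mouth-to-face relation
Thj = translate(0.0,  1.0)   # nose-to-face relation

# Observed part poses derived consistently from the same face.
Ti = Tj @ np.linalg.inv(Tij)  # mouth pose relative to the camera
Th = Tj @ np.linalg.inv(Thj)  # nose pose relative to the camera

# Each part predicts the face's pose; consistent parts agree.
assert np.allclose(Ti @ Tij, Th @ Thj)
```

When the two predictions coincide, the parts are in the right spatial relationship to form a face, which is the kind of viewpoint-invariant agreement the lecture argues convolutional nets fail to represent explicitly.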
A crucial property of the pose vectors
Lecture 16c
Bayesian optimization of neural network
hyperparameters
Geoffrey Hinton
Nitish Srivastava,
Kevin Swersky
Tijmen Tieleman
Abdel-rahman Mohamed
Let machine learning figure out the hyper-parameters!
(Snoek, Larochelle & Adams, NIPS 2012)
• One of the commonest reasons for not using neural networks is that it requires a lot of skill to set the hyper-parameters.
– Number of layers
– Number of units per layer
– Type of unit
– Weight penalty
– Learning rate
– Momentum etc. etc.
• Naive grid search: Make a list of alternative values for each hyper-parameter and then try all possible combinations.
– Can we do better than this?
• Sampling random combinations: This is much better if some hyper-parameters have no effect.
– It's a big waste to exactly repeat the settings of the other hyper-parameters.
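The contrast between grid search and random sampling can be sketched in a few lines. The hyper-parameter names and values below are purely illustrative:

```python
import itertools
import random

random.seed(0)

# Hypothetical hyper-parameter search space.
space = {
    "learning_rate": [0.001, 0.01, 0.1],
    "num_layers": [2, 3, 4],
    "weight_penalty": [0.0, 1e-4, 1e-2],
}

# Naive grid search: every combination, so 3 * 3 * 3 = 27 trials.
# If one parameter has no effect, a third of this work is wasted on
# exact repeats of the other settings.
grid_trials = list(itertools.product(*space.values()))

# Random sampling: each trial draws every hyper-parameter afresh, so
# the settings of the other hyper-parameters are (almost) never
# repeated exactly, and an irrelevant parameter costs nothing extra.
random_trials = [
    {name: random.choice(values) for name, values in space.items()}
    for _ in range(10)
]

print(len(grid_trials), len(random_trials))  # 27 10
```

Bayesian optimization, the topic of this lecture, goes further: it uses the results of earlier trials to decide which combination to try next.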
Machine learning to the rescue
Lecture 16d
The fog of progress
Geoffrey Hinton
with
Nitish Srivastava
Kevin Swersky
Why we cannot predict the long-term future