Neural Networks For Machine Learning: Lecture 16a Learning A Joint Model of Images and Captions

The document discusses modeling the joint density of images and captions using neural networks. It proposes (1) training separate multilayer models for images and word-count vectors and (2) adding a top layer connecting the two models for further joint training, which allows each modality to improve the other's earlier layers. It also discusses using a deep Boltzmann machine instead of a deep belief net because of its symmetric connections, and justifies the pre-training method. Finally, it covers using Bayesian optimization to set neural network hyper-parameters automatically by modeling previous results, predicting new outcomes, and iteratively testing the most promising configurations.

Neural Networks for Machine Learning

Lecture 16a
Learning a joint model of images and captions

Geoffrey Hinton
Nitish Srivastava,
Kevin Swersky
Tijmen Tieleman
Abdel-rahman Mohamed
Modeling the joint density of images and captions
(Srivastava and Salakhutdinov, NIPS 2012)
•  Goal: To build a joint density model of captions and standard computer vision feature vectors extracted from real photographs.
–  This needs a lot more computation than building a joint density model of labels and digit images!
•  The recipe:
1. Train a multilayer model of images.
2. Train a separate multilayer model of word-count vectors.
3. Then add a new top layer that is connected to the top layers of both individual models (see the sketch below).
–  Use further joint training of the whole system to allow each modality to improve the earlier layers of the other modality.
Modeling the joint density of images and captions
(Srivastava and Salakhutdinov, NIPS 2012)

•  Instead of using a deep belief net, use a deep Boltzmann machine that has symmetric connections between all pairs of layers.
–  Further joint training of the whole DBM allows each modality to improve the earlier layers of the other modality.
–  That's why they used a DBM.
–  They could also have used a DBN and done generative fine-tuning with contrastive wake-sleep.
•  But how did they pre-train the hidden layers of a deep Boltzmann machine?
–  Standard pre-training leads to a composite model that is a DBN, not a DBM.
Combining three RBMs to make a DBM
•  The top and bottom RBMs must be pre-trained with the weights in one direction twice as big as in the other direction.
–  This can be justified!
•  The middle layers do geometric model averaging.

[Figure: three separately pre-trained RBMs — the bottom one (v–h1) with weights 2W1 and W1 in the two directions, the middle one (h1–h2) with 2W2 in both directions, and the top one (h2–h3) with W3 and 2W3 — combined into a single DBM with weights W1, W2, W3.]
Neural Networks for Machine Learning

Lecture 16b
Hierarchical coordinate frames

Geoffrey Hinton
with
Nitish Srivastava
Kevin Swersky
Why convolutional neural networks are doomed

•  Pooling loses the precise spatial relationships between higher-level parts such as a nose and a mouth.
–  The precise spatial relationships are needed for identity recognition.
–  Overlapping the pools helps a bit.
•  Convolutional nets that just use translations cannot extrapolate their understanding of geometric relationships to radically new viewpoints.
–  People are very good at extrapolating. After seeing a new shape once they can recognize it from a different viewpoint.
The hierarchical coordinate frame approach
•  Use a group of neurons to represent the conjunction of the shape of a feature and its pose relative to the retina.
–  The pose relative to the retina is the relationship between the coordinate frame of the retina and the intrinsic coordinate frame of the feature.
•  Recognize larger features by using the consistency of the poses of their parts.

[Figure: a face whose nose and mouth make consistent predictions for the pose of the face, versus a jumbled arrangement whose nose and mouth make inconsistent predictions for the pose of the face.]
Two layers in a hierarchy of parts
•  A higher level visual entity is present if several lower level visual entities can agree on their predictions for its pose (inverse computer graphics!); see the sketch below.

[Figure: a "face" unit with activation p_j and pose matrix T_j receives predictions from a "mouth" unit (p_i, T_i, where T_i is the pose of the mouth, i.e. its relationship to the camera) through the part–whole matrix T_ij, and from a "nose" unit (p_h, T_h) through T_hj; the face is present when T_i T_ij ≈ T_h T_hj.]
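A toy numeric check of the agreement test T_i T_ij ≈ T_h T_hj, using 2-D homogeneous transforms. The pose() helper and all numeric values are illustrative assumptions, not from the lecture.

import numpy as np

def pose(angle, tx, ty):
    """Homogeneous 2-D pose: rotate by `angle` radians, then translate by (tx, ty)."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, tx],
                     [s,  c, ty],
                     [0.0, 0.0, 1.0]])

T_face = pose(0.3, 5.0, 2.0)                 # underlying pose of the face w.r.t. the camera
T_ij = np.linalg.inv(pose(0.0, 0.0, -1.0))   # mouth-to-face relationship (lives in the weights)
T_hj = np.linalg.inv(pose(0.0, 0.0, 0.5))    # nose-to-face relationship (lives in the weights)

# Observed part poses (the activities), both generated from the same face pose.
T_i = T_face @ np.linalg.inv(T_ij)           # pose of the mouth
T_h = T_face @ np.linalg.inv(T_hj)           # pose of the nose

# Each part predicts the pose of the face; the face is "present" if they agree.
prediction_from_mouth = T_i @ T_ij
prediction_from_nose = T_h @ T_hj
print(np.allclose(prediction_from_mouth, prediction_from_nose))   # True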
A crucial property of the pose vectors

•  They allow spatial transformations to be modeled by linear operations.
–  This makes it easy to learn a hierarchy of visual entities.
–  It makes it easy to generalize across viewpoints (see the sketch below).
•  The invariant geometric properties of a shape are in the weights, not in the activities.
–  The activities are equivariant: as the pose of the object varies, the activities all vary.
–  The percept of an object changes as the viewpoint changes.
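A minimal sketch of the equivariance claim: a viewpoint change acts as a linear operation on the pose activities, while the part–whole matrix (the "weights") stays fixed. The pose() helper and all numbers are made up for illustration.

import numpy as np

def pose(angle, tx, ty):
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, tx], [s, c, ty], [0.0, 0.0, 1.0]])

T_part_in_whole = pose(0.1, 0.0, -1.0)                 # invariant: stored in the weights
T_part = pose(0.4, 2.0, 3.0)                           # activity: pose of the part w.r.t. the camera
T_whole = T_part @ np.linalg.inv(T_part_in_whole)      # predicted pose of the whole

V = pose(0.7, -1.0, 4.0)                               # a viewpoint change (camera motion)
T_part_new = V @ T_part                                # the activities all change linearly...
T_whole_new = T_part_new @ np.linalg.inv(T_part_in_whole)
print(np.allclose(T_whole_new, V @ T_whole))           # ...and the prediction moves with them: True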
Evidence that our visual systems impose coordinate frames in
order to represent shapes (after Irvin Rock)

The square and the diamond are very different percepts that make different properties obvious.

[Figure: the outline of a country drawn at an unfamiliar orientation, captioned "What country is this? Hint: Sarah Palin".]
Neural Networks for Machine Learning

Lecture 16c
Bayesian optimization of neural network
hyperparameters

Geoffrey Hinton
Nitish Srivastava,
Kevin Swersky
Tijmen Tieleman
Abdel-rahman Mohamed
Let machine learning figure out the hyper-parameters!
(Snoek, Larochelle & Adams, NIPS 2012)
•  One of the commonest reasons for not using neural networks is that it requires a lot of skill to set the hyper-parameters:
–  Number of layers
–  Number of units per layer
–  Type of unit
–  Weight penalty
–  Learning rate
–  Momentum, etc. etc.
•  Naive grid search: Make a list of alternative values for each hyper-parameter and then try all possible combinations.
–  Can we do better than this?
•  Sampling random combinations: This is much better if some hyper-parameters have no effect.
–  It's a big waste to exactly repeat the settings of the other hyper-parameters (see the sketch below).
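A rough sketch contrasting the two strategies; the hyper-parameter names, ranges, and the train_and_score placeholder are assumptions for illustration, not the setup used in the paper.

import itertools
import random

def train_and_score(config):
    # Placeholder for the expensive part: train a network with `config`
    # and return its validation accuracy.
    return random.random()

grid = {
    "num_layers": [2, 3, 4],
    "units_per_layer": [256, 512, 1024],
    "learning_rate": [0.1, 0.01, 0.001],
    "momentum": [0.5, 0.9],
}

# Naive grid search: try every combination (3 * 3 * 3 * 2 = 54 runs here).
grid_results = [(dict(zip(grid, combo)), train_and_score(dict(zip(grid, combo))))
                for combo in itertools.product(*grid.values())]

# Random combinations: sample each hyper-parameter independently. If one of them
# turns out to have no effect, we still get many distinct settings of the others
# instead of exact repeats.
def sample_config():
    return {
        "num_layers": random.choice([2, 3, 4]),
        "units_per_layer": random.choice([256, 512, 1024]),
        "learning_rate": 10 ** random.uniform(-3, -1),   # log-uniform
        "momentum": random.uniform(0.5, 0.99),
    }

random_results = [(cfg, train_and_score(cfg)) for cfg in (sample_config() for _ in range(20))]
print(max(random_results, key=lambda pair: pair[1])[0])   # best random setting found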
Machine learning to the rescue

•  Instead of using random combinations of values for the hyper-parameters, why not look at the results so far?
–  Predict regions of the hyper-parameter space that might give better results.
–  We need to predict how well a new combination will do and also model the uncertainty of that prediction.
•  We assume that the amount of computation involved in evaluating one setting of the hyper-parameters is huge.
–  Much more than the work involved in building a model that predicts the result from knowing previous results with different settings of the hyper-parameters.
Gaussian Process models
•  These models assume that similar inputs give similar outputs.
–  This is a very weak but very sensible prior for the effects of hyper-parameters.
•  For each input dimension, they learn the appropriate scale for measuring similarity.
–  Is 200 similar to 300?
–  Look to see if they give similar results in the data so far.
•  GP models do more than just predicting a single value.
–  They predict a Gaussian distribution of values.
•  For test cases that are close to several consistent training cases, the predictions are fairly sharp.
•  For test cases far from any training cases, the predictions have high variance (see the sketch below).
A sensible way to decide what to try
•  Keep track of the best setting so far.
•  After each experiment this might stay the same, or it might improve if the latest result is the best.
•  Pick a setting of the hyper-parameters such that the expected improvement in our best setting is big (see the sketch below).
–  Don't worry about the downside (hedge funds!)

[Figure: predicted distributions for three candidate settings A, B and C plotted against the current best value, with one candidate labelled the worst bet and another the best bet.]
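A hedged sketch of one common way to score candidates: the expected-improvement acquisition function for a quantity we want to minimize (e.g. validation error). The numbers for the three candidates are invented and only loosely mirror the A, B, C of the figure.

import numpy as np
from scipy.stats import norm

def expected_improvement(mean, std, best_so_far):
    """Expected improvement over `best_so_far` for a minimization problem,
    given a Gaussian prediction with this mean and std at each candidate."""
    std = np.maximum(std, 1e-12)                  # guard against zero variance
    z = (best_so_far - mean) / std
    return (best_so_far - mean) * norm.cdf(z) + std * norm.pdf(z)

# Predicted validation errors for three candidate settings (made-up values).
means = np.array([0.20, 0.17, 0.25])
stds = np.array([0.01, 0.05, 0.15])
best = 0.18                                       # current best validation error

ei = expected_improvement(means, stds, best)
print(ei)
print(int(np.argmax(ei)))   # the candidate tried next: a big upside can outweigh a worse mean

Note that the score only rewards the chance of beating the current best; how badly a candidate might turn out does not enter it at all, which is the "don't worry about the downside" point.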
How well does Bayesian optimization work?

•  If you have the resources to run a lot of experiments, Bayesian optimization is much better than a person at finding good combinations of hyper-parameters.
–  This is not the kind of task we are good at.
–  We cannot keep in mind the results of 50 different experiments and see what they predict.
•  It’s much less prone to doing a good job for the method we like
and a bad job for the method we are comparing with.
–  People cannot help doing this. They try much harder for their
own method because they know it ought to work better!
Neural Networks for Machine Learning

Lecture 16d
The fog of progress

Geoffrey Hinton
with
Nitish Srivastava
Kevin Swersky
Why we cannot predict the long-term future

•  Consider driving at night. The number of photons you receive from the tail-lights of the car in front falls off as 1/d².
•  Now suppose there is fog.
–  For small distances it's still 1/d².
–  But for big distances it's exp(−d), because fog absorbs a certain fraction of the photons per unit distance.
•  So the car in front becomes completely invisible at a distance at which our short-range 1/d² model predicts it will be very visible.
–  This kills people.
The effect of exponential progress

•  Over the short term, things change slowly and it's easy to predict progress.
–  We can all make quite good guesses about what will be in the iPhone 6.
•  But in the longer run our perception of the future hits a wall, just like fog.
•  So the long-term future of machine learning and neural nets is a total mystery.
–  But over the next five years, it's highly probable that big, deep neural networks will do amazing things.
