
Applications

Lecture slides for Chapter 12 of Deep Learning


www.deeplearningbook.org
Ian Goodfellow
2018-10-25
Disclaimer
• Details of applications change much faster than the underlying conceptual ideas

• A printed book is updated on the scale of years, while state-of-the-art results come out constantly

• These slides are somewhat more up to date

• Applications involve much more specific knowledge; the limitations of my own knowledge will be more apparent in these slides than in the others

(Goodfellow 2018)
Large Scale Deep Learning

[Plot omitted: number of neurons (logarithmic scale) versus year, 1950-2056, with biological reference points ranging from sponge and roundworm up to octopus and human.]

Figure 1.11: Since the introduction of hidden units, artificial neural networks have doubled in size roughly every 2.4 years. Biological neural network sizes from Wikipedia (2015).
(Goodfellow 2018)
Fast Implementations
• CPU

• Exploit fixed point arithmetic in CPU families where this offers a speedup

• Cache-friendly implementations

• GPU

• High memory bandwidth

• No cache

• Warps must be synchronized

• TPU

• Similar to GPU in many respects but faster

• Often requires larger batch size

• Sometimes requires reduced precision

(Goodfellow 2018)
Distributed Implementations
• Distributed

• Multi-GPU

• Multi-machine

• Model parallelism

• Data parallelism

• Trivial at test time

• Synchronous or asynchronous SGD at train time

(Goodfellow 2018)
Synchronous SGD

TensorFlow tutorial (Goodfellow 2018)
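Below is a minimal, illustrative sketch of the synchronous data-parallel SGD pattern: each worker computes gradients on its own shard of the minibatch, the gradients are averaged, and one shared update is applied. The linear model, loss_grad, and sync_sgd_step names are hypothetical stand-ins, not code from the tutorial.

```python
# Minimal sketch of synchronous data-parallel SGD (illustrative only).
# Each "worker" computes gradients on its own shard of the minibatch;
# the gradients are averaged and a single shared update is applied.
import numpy as np

def loss_grad(w, X, y):
    # Gradient of mean squared error for a linear model (stand-in for any model).
    pred = X @ w
    return 2.0 * X.T @ (pred - y) / len(y)

def sync_sgd_step(w, shards, lr=0.01):
    # One synchronous step: all workers finish before the update is applied.
    grads = [loss_grad(w, X, y) for X, y in shards]   # in practice, run in parallel
    avg_grad = np.mean(grads, axis=0)                 # all-reduce (average) step
    return w - lr * avg_grad

# Toy usage: 4 workers, each holding a shard of the data.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(64, 5)), rng.normal(size=64)
shards = [(X[i::4], y[i::4]) for i in range(4)]
w = np.zeros(5)
for _ in range(100):
    w = sync_sgd_step(w, shards)
```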


Example: ImageNet in 18 minutes for $40

Blog post (Goodfellow 2018)


Model Compression
• Large models often have lower test error

• Very large model trained with dropout

• Ensemble of many models

• Want small model for low resource use at test time

• Train a small model to mimic the large one (see the sketch below)

• Obtains better test error than directly training a small model
(Goodfellow 2018)
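A minimal sketch of the mimic idea above, assuming a distillation-style setup in which the small student model is trained to match the large teacher's softened output probabilities. The softmax, distillation_loss, and temperature value are illustrative choices, not the exact recipe from the slides.

```python
# Minimal sketch of model compression by distillation (illustrative only):
# the small "student" model is trained to match the softened class
# probabilities of a large "teacher" model instead of the hard labels.
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Cross-entropy between the teacher's soft targets and the student's
    # predictions, both computed at the same temperature.
    teacher_probs = softmax(teacher_logits, temperature)
    student_log_probs = np.log(softmax(student_logits, temperature) + 1e-12)
    return -np.mean(np.sum(teacher_probs * student_log_probs, axis=-1))

# Toy usage with made-up logits for a batch of 3 examples, 5 classes.
teacher_logits = np.array([[4.0, 1.0, 0.2, 0.1, 0.0]] * 3)
student_logits = np.random.default_rng(0).normal(size=(3, 5))
print(distillation_loss(student_logits, teacher_logits))
```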
Quantization

Important for
mobile deployment

(TensorFlow Lite)
(Goodfellow 2018)
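A hedged sketch of what post-training quantization for mobile deployment might look like with TensorFlow Lite; the SavedModel path is a placeholder and the exact converter options vary by TensorFlow version.

```python
# Hypothetical sketch of post-training quantization with TensorFlow Lite;
# the SavedModel path is a placeholder, not a real model from the slides.
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("path/to/saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enable weight quantization
tflite_model = converter.convert()

with open("model_quantized.tflite", "wb") as f:
    f.write(tflite_model)
```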
Dynamic Structure: Cascades

(Viola and Jones, 2001)

(Goodfellow 2018)
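A minimal sketch of the cascade idea from Viola and Jones (2001): a chain of increasingly expensive classifiers in which cheap early stages reject most negatives, so the costly later stages run only on promising inputs. The stage functions and thresholds here are made up for illustration.

```python
# Minimal sketch of a classifier cascade in the spirit of Viola and Jones (2001):
# a sequence of increasingly expensive classifiers, where each stage may reject
# the input early so most negatives never reach the costly later stages.
def cascade_predict(x, stages, thresholds):
    # stages: list of scoring functions, cheapest first; thresholds: one per stage.
    for stage, threshold in zip(stages, thresholds):
        if stage(x) < threshold:
            return 0        # rejected early: most inputs exit here cheaply
    return 1                # survived every stage: accept as a detection

# Toy usage with made-up scoring functions on a feature vector x.
stages = [lambda x: x[0], lambda x: x[0] + x[1], lambda x: sum(x)]
thresholds = [0.1, 0.5, 1.0]
print(cascade_predict([0.3, 0.4, 0.5], stages, thresholds))
```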
Dynamic Structure

Outrageously Large Neural Networks

(Goodfellow 2018)
Dataset Augmentation for Computer Vision

[Image grid omitted: Affine Distortion, Elastic Deformation, Noise, Horizontal Flip, Random Translation, Hue Shift]

(Goodfellow 2018)
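A minimal sketch of a few of these augmentations using tf.image; affine and elastic distortions typically need additional libraries and are omitted, and the padding and noise amounts are arbitrary illustrative values.

```python
# Illustrative augmentation pipeline: horizontal flip, hue shift,
# additive noise, and random translation via pad-then-crop.
import tensorflow as tf

def augment(image):
    # image: float32 tensor in [0, 1] with shape (H, W, 3)
    image = tf.image.random_flip_left_right(image)                    # horizontal flip
    image = tf.image.random_hue(image, max_delta=0.08)                # hue shift
    image = image + tf.random.normal(tf.shape(image), stddev=0.02)    # additive noise
    padded = tf.pad(image, [[8, 8], [8, 8], [0, 0]], mode="REFLECT")  # pad, then...
    image = tf.image.random_crop(padded, size=tf.shape(image))        # ...random translation
    return tf.clip_by_value(image, 0.0, 1.0)
```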
Generative Modeling: Sample Generation

[Images omitted: Training Data (CelebA); samples from a Generator (Karras et al, 2017)]

Covered in Part III

Underlies many graphics and speech applications

Progressed rapidly after the book was written
(Goodfellow 2018)
Graphics

(Table by Augustus Odena) (Goodfellow 2018)


Video Generation

(Wang et al, 2018)


(Goodfellow 2018)
Everybody Dance Now!

(Chan et al 2018)
(Goodfellow 2018)
Model-Based Optimization

(Killoran et al, 2017)

[Figure caption excerpt: optimization with a learned predictor model. (a) Original experimental data and measured binding scores (horizontal axis); a model fit to this data serves as an oracle for scoring generated sequences; the plot shows scores on held-out data (correlation 0.97). (b) Data is restricted to sequences with oracle scores in the ...]
(Goodfellow 2018)
Designing Physical Objects

(Hwang et al 2018)

(Goodfellow 2018)
Attention Mechanisms

... translations (Cho et al., 2014a) and for generating translated sentences (Sutskever et al., 2014). Jean et al. (2014) scaled these models to larger vocabularies.

12.4.5.1 Using an Attention Mechanism and Aligning Pieces of Data

[Diagram omitted: weights α^(t-1), α^(t), α^(t+1) multiply feature vectors h^(t-1), h^(t), h^(t+1), and the products are summed into a context vector c.]

Figure 12.6: A modern attention mechanism, as introduced by Bahdanau et al. (2015), is essentially a weighted average. A context vector c is formed by taking a weighted average of feature vectors h^(t) with weights α^(t). In some applications, the feature vectors h are hidden units of a neural network, but they may also be raw input to the model. The weights α^(t) are produced by the model itself. They are usually values in the interval ...

Important in many vision, speech, and NLP applications

Improved rapidly after the book was written
(Goodfellow 2018)
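A minimal NumPy sketch of the weighted average described in Figure 12.6: the model produces scores, a softmax turns them into weights α^(t) in (0, 1), and the context vector c is the weighted average of the feature vectors h^(t). The function and variable names are illustrative.

```python
# Weighted-average ("soft") attention as in Figure 12.6.
import numpy as np

def attention_context(h, scores):
    # h: (T, d) feature vectors; scores: (T,) unnormalized relevance scores.
    scores = scores - scores.max()
    alpha = np.exp(scores) / np.exp(scores).sum()   # weights in (0, 1), summing to 1
    c = alpha @ h                                   # context vector: weighted average
    return c, alpha

# Toy usage: 4 time steps, 3-dimensional features.
rng = np.random.default_rng(0)
h = rng.normal(size=(4, 3))
scores = rng.normal(size=4)
c, alpha = attention_context(h, scores)
```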
Attention for Images

Attention mechanism from Wang et al 2018
Image model from Zhang et al 2018
(Goodfellow 2018)
Generating Training Data

(Bousmalis et al, 2017)

(Goodfellow 2018)
Generating Training Data

(Bousmalis et al, 2017)

(Goodfellow 2018)
Natural Language Processing

• An important predecessor to deep NLP is the family of models based on n-grams (a toy count-based sketch follows below)

[Book excerpt shown on slide:]

... natural language. Depending on how the model is designed, a token may be a word, a character, or even a byte. Tokens are always discrete entities. The earliest successful language models were based on models of fixed-length sequences of tokens called n-grams. An n-gram is a sequence of n tokens.

Models based on n-grams define the conditional probability of the n-th token given the preceding n-1 tokens. The model uses products of these conditional distributions to define the probability distribution over longer sequences:

P(x_1, ..., x_τ) = P(x_1, ..., x_{n-1}) ∏_{t=n}^{τ} P(x_t | x_{t-n+1}, ..., x_{t-1})    (12.5)

... simply by looking up two stored probabilities. For this to exactly reproduce inference in P_n, we must omit the final character from each sequence when we train P_{n-1}.

As an example, we demonstrate how a trigram model computes the probability of the sentence "THE DOG RAN AWAY." The first words of the sentence cannot be handled by the default formula based on conditional probability because there is no context at the beginning of the sentence. Instead, we must use the marginal probability over words at the start of the sentence. We thus evaluate P_3(THE DOG RAN). Finally, the last word may be predicted using the typical case of using the conditional distribution P(AWAY | DOG RAN). Putting this together with equation 12.6, we obtain:

P(THE DOG RAN AWAY) = P_3(THE DOG RAN) P_3(DOG RAN AWAY) / P_2(DOG RAN)    (12.7)

A fundamental limitation of maximum likelihood for n-gram models is that P_n as estimated from training set counts is very likely to be zero in many cases, even though the tuple (x_{t-n+1}, ..., x_t) may appear in the test set. This can cause two different kinds of catastrophic outcomes. When P_{n-1} is zero, the ratio is undefined, so the model does not even produce a sensible output. When P_{n-1} is non-zero but P_n is zero, the test log-likelihood is -∞. To avoid such catastrophic outcomes, ...

Improve with:
- Smoothing
- Backoff
- Word categories
(Goodfellow 2018)
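The toy sketch below mirrors equation 12.7: trigram and bigram probabilities are estimated by counting, and the sentence probability is their product and ratio. The tiny corpus is made up, and no smoothing or backoff is applied, so unseen n-grams yield zero probability, which is exactly the failure mode discussed above.

```python
# Minimal sketch of a count-based trigram model scoring "THE DOG RAN AWAY"
# as in equation 12.7. The corpus is made up for illustration only.
from collections import Counter

corpus = "THE DOG RAN AWAY . THE DOG RAN HOME . THE CAT RAN AWAY .".split()

bigrams = Counter(zip(corpus, corpus[1:]))
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))

def p3(w1, w2, w3):
    # Marginal probability of a trigram, estimated from counts.
    return trigrams[(w1, w2, w3)] / sum(trigrams.values())

def p2(w1, w2):
    # Marginal probability of a bigram, estimated from counts.
    return bigrams[(w1, w2)] / sum(bigrams.values())

# P(THE DOG RAN AWAY) = P3(THE DOG RAN) * P3(DOG RAN AWAY) / P2(DOG RAN)
p_sentence = p3("THE", "DOG", "RAN") * p3("DOG", "RAN", "AWAY") / p2("DOG", "RAN")
print(p_sentence)
```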
Word Embeddings in Neural Language Models

... multiple latent variables (Mnih and Hinton, 2007).

[Scatter plots omitted: the left panel shows country and region names (e.g. France, China, Russia, Germany, Canada, Europe, Africa) clustering together; the right panel shows years (1995-2009) clustering together.]

Figure 12.3: Two-dimensional visualizations of word embeddings obtained from a neural machine translation model (Bahdanau et al., 2015), zooming in on specific areas where semantically related words have embedding vectors that are close to each other. Countries ...
(Goodfellow 2018)
High-Dimensional Output
Layers for Large Vocabularies

• Short list

• Hierarchical softmax

• Importance sampling

• Noise contrastive estimation

(Goodfellow 2018)
A Hierarchy of Words and Word Categories

[Tree diagram omitted: a binary tree whose internal nodes are category prefixes (0), (1), (0,0), (0,1), (1,0), (1,1), and whose leaves are the 8 words w0, ..., w7 with paths (0,0,0) through (1,1,1).]

Figure 12.4: Illustration of a simple hierarchy of word categories, with 8 words w0, ..., w7 ...
(Goodfellow 2018)
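A minimal sketch of how a hierarchy like Figure 12.4 is used in a hierarchical softmax: each word's probability is a product of binary left/right decisions along its path from the root, so scoring one word costs O(log V) rather than O(V). The weight layout and sigmoid gating here are one common formulation, assumed for illustration.

```python
# Hierarchical softmax over the 8-word tree of Figure 12.4 (illustrative only):
# P(word | context) is a product of sigmoid "go left / go right" decisions
# along the word's path, so one word costs O(log V) instead of O(V).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hierarchical_word_prob(path_bits, node_weights, context):
    # path_bits: e.g. (1, 0, 1) for word w5 in the 8-word tree.
    # node_weights: dict mapping an internal-node prefix (tuple) to its weight vector.
    # context: hidden representation of the preceding words.
    prob = 1.0
    prefix = ()
    for bit in path_bits:
        p_right = sigmoid(node_weights[prefix] @ context)
        prob *= p_right if bit == 1 else (1.0 - p_right)
        prefix = prefix + (bit,)
    return prob

# Toy usage: 8 words -> 3 binary decisions, 4-dimensional context.
rng = np.random.default_rng(0)
weights = {p: rng.normal(size=4)
           for p in [(), (0,), (1,), (0, 0), (0, 1), (1, 0), (1, 1)]}
context = rng.normal(size=4)
print(hierarchical_word_prob((1, 0, 1), weights, context))  # P(w5 | context)
```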
Neural Machine Translation

[Diagram omitted, Figure 12.5: Source object (French sentence or image) -> Encoder -> Intermediate, semantic representation -> Decoder -> Output object (English sentence)]

Figure 12.5: The encoder-decoder architecture to map back and forth between a surface representation (such as a sequence of words or an image) and a semantic representation. By using the output of an encoder of data from one modality (such as the encoder mapping ...
(Goodfellow 2018)
Google Neural Machine Translation

Wu et al 2016
(Goodfellow 2018)
Speech Recognition
Current speech recognition
is based on seq2seq with
attention

Graphic from
“Listen, Attend, and Spell”
Chan et al 2015

(Goodfellow 2018)
Speech Synthesis

WaveNet
(van den Oord et al, 2016)

(Goodfellow 2018)
Deep RL for Atari game playing

(Mnih et al 2013)

Convolutional network estimates the value function (future rewards) used to guide the game-playing agent (see the sketch below).

(Note: deep RL didn’t really exist when we started the book, became a success while we were writing it, and was an extremely hot topic by the time the book was printed)
(Goodfellow 2018)
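A minimal sketch of the value-function idea referenced above, in the spirit of Q-learning as used by Mnih et al. (2013): the network outputs one estimated return per action, and training targets bootstrap from the best action value in the next state. The q_network stand-in and array shapes are illustrative, not the published architecture.

```python
# Q-learning targets for deep RL on Atari-style inputs (illustrative only).
# `q_network` stands for any function mapping a stack of frames to one
# estimated value per action.
import numpy as np

def q_learning_targets(q_network, rewards, next_states, dones, gamma=0.99):
    # Target = r + gamma * max_a Q(s', a), with no bootstrap on terminal states.
    next_q = q_network(next_states)          # shape: (batch, num_actions)
    max_next_q = next_q.max(axis=1)
    return rewards + gamma * (1.0 - dones) * max_next_q

# Toy usage with a made-up "network" returning random action values.
rng = np.random.default_rng(0)
fake_q_network = lambda states: rng.normal(size=(len(states), 4))
states = np.zeros((8, 84, 84, 4))            # 8 stacks of 4 grayscale frames
targets = q_learning_targets(fake_q_network, rewards=np.ones(8),
                             next_states=states, dones=np.zeros(8))
```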
Superhuman Go Performance
Monte Carlo tree search, with convolutional networks for value
function and policy

(Silver et al, 2016)


(Goodfellow 2018)
Robotics

(Google Brain) (Goodfellow 2018)


Healthcare and Biosciences

(Google Brain) (Goodfellow 2018)


Autonomous Vehicles

(WayMo) (Goodfellow 2018)


Questions

(Goodfellow 2018)
