5th Unit
Large Scale Deep Learning
[Figure 1.11: number of neurons (logarithmic scale) versus year, 1950-2056. Since the introduction of hidden units, artificial neural networks have doubled in size roughly every 2.4 years; biological reference points range from sponge and roundworm up to frog, octopus, and human. Biological neural network sizes from Wikipedia (2015).]
Fast Implementations
• CPU
  • Exploit fixed-point arithmetic in CPU families where this offers a speedup (see the sketch below)
  • Cache-friendly implementations
• GPU
  • No cache
• TPU
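As a toy illustration of the fixed-point idea above (a sketch only, not how any production CPU kernel or TensorFlow Lite actually works), the snippet below quantizes floating-point weights to 8-bit integers with a single symmetric scale factor and converts them back; the function names and the scaling scheme are assumptions made for this example.

```python
import numpy as np

def quantize_int8(w):
    """Map float weights to int8 using one symmetric scale factor."""
    scale = np.abs(w).max() / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 representation."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=1000).astype(np.float32)
q, scale = quantize_int8(w)
print("max abs rounding error:", np.abs(dequantize(q, scale) - w).max())
```

In practice frameworks choose scales per tensor or per channel and carry out the arithmetic directly in integers; this sketch only shows the change of representation.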
Distributed Implementations
• Distributed
• Multi-GPU
• Multi-machine
• Model parallelism
• Data parallelism
Synchronous SGD
Important for mobile deployment (TensorFlow Lite)
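The following is a minimal NumPy sketch of synchronous, data-parallel SGD, written only to illustrate the idea from the two slides above: each simulated worker computes a gradient on its own shard of the minibatch, the gradients are averaged (as an all-reduce would do), and one shared update keeps every worker's parameters identical. The toy regression problem and all names (num_workers, loss_grad, ...) are invented for this sketch.

```python
import numpy as np

# Toy linear-regression problem so the sketch is self-contained.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 10))
true_w = rng.normal(size=10)
y = X @ true_w + 0.01 * rng.normal(size=256)

def loss_grad(w, X_shard, y_shard):
    """Gradient of mean squared error on one worker's shard of data."""
    err = X_shard @ w - y_shard
    return 2.0 * X_shard.T @ err / len(y_shard)

num_workers = 4      # hypothetical number of data-parallel workers
lr = 0.1
w = np.zeros(10)

for step in range(100):
    # Each worker receives a disjoint shard of the current minibatch.
    idx = rng.choice(len(X), size=64, replace=False)
    shards = np.array_split(idx, num_workers)

    # Every worker computes a local gradient at the same parameter values.
    grads = [loss_grad(w, X[s], y[s]) for s in shards]

    # Synchronous step: average the gradients (the all-reduce) and apply
    # one shared update, so all replicas stay in lockstep.
    w -= lr * np.mean(grads, axis=0)

print("parameter error:", np.linalg.norm(w - true_w))
```

Real implementations replace the Python loop over workers with gradient exchange across GPUs or machines; the averaging step is what makes the update synchronous.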
Dynamic Structure: Cascades
Dynamic Structure
Dataset Augmentation for
Computer Vision
Example transformations (see the sketch below):
• Affine distortion
• Elastic deformation
• Noise
• Horizontal flip
• Random translation
• Hue shift
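As a rough illustration of these transformations (a sketch under invented names, not a production augmentation pipeline), the code below applies a horizontal flip, a random translation, and a crude per-channel color jitter standing in for a hue shift, all on an image stored as an H x W x 3 NumPy array.

```python
import numpy as np

def augment(image, rng, max_shift=4):
    """Apply simple label-preserving transformations to an RGB image.

    image: float array of shape (H, W, 3) with values in [0, 1].
    """
    out = image.copy()

    # Horizontal flip with probability 0.5.
    if rng.random() < 0.5:
        out = out[:, ::-1, :]

    # Random translation by up to max_shift pixels in each direction.
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    out = np.roll(out, shift=(dy, dx), axis=(0, 1))
    # (np.roll wraps around the border; proper padding is omitted for brevity.)

    # Crude color jitter standing in for a hue shift: small per-channel offset.
    offset = rng.uniform(-0.05, 0.05, size=3)
    out = np.clip(out + offset, 0.0, 1.0)

    return out

rng = np.random.default_rng(0)
img = rng.random((32, 32, 3))      # stand-in for a real training image
augmented = augment(img, rng)
```

Affine and elastic deformations need interpolation and are usually done with an image library; they are omitted here to keep the sketch short.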
Generative Modeling:
Sample Generation
(Chan et al 2018)
Model-Based Optimization
(Hwang et al 2018)
Attention Mechanisms

… translations (Cho et al., 2014a) and for generating translated sentences (Sutskever et al., 2014). Jean et al. (2014) scaled these models to larger vocabularies.

12.4.5.1 Using an Attention Mechanism and Aligning Pieces of Data
[Figure: an attention mechanism forms a context c as a weighted average of feature vectors h(t), with weights \alpha(t) produced by the model, i.e. c = \sum_t \alpha^{(t)} h^{(t)}.]
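To make the figure concrete, here is a minimal NumPy sketch of that weighted average: scores for each time step come from a dot product with a query vector (one common choice, not the only one), a softmax turns them into attention weights alpha(t), and the context is the weighted sum of the h(t). All variable names are mine, not from the slides.

```python
import numpy as np

def attention_context(H, q):
    """Weighted average of feature vectors H[t] using softmax attention weights.

    H: array of shape (T, d), one feature vector h(t) per time step.
    q: query vector of shape (d,) used to score each time step.
    """
    scores = H @ q                                  # one score per time step
    scores -= scores.max()                          # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()   # softmax -> attention weights
    context = alpha @ H                             # c = sum_t alpha(t) * h(t)
    return context, alpha

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 8))   # five hidden states of dimension 8
q = rng.normal(size=8)
c, alpha = attention_context(H, q)
print(alpha.sum())            # the weights sum to 1
```

Different models compute the scores differently (for example with a small network over the decoder state and each h(t)); the weighted average itself is the common core.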
Generating Training Data
Natural Language Processing

• An important predecessor to deep NLP is the family of models based on n-grams:

… natural language. Depending on how the model is designed, a token may be a word, a character, or even a byte. Tokens are always discrete entities. The earliest successful language models were based on models of fixed-length sequences of tokens called n-grams. An n-gram is a sequence of n tokens. Models based on n-grams define the conditional probability of the n-th token given the preceding n - 1 tokens. The model uses products of these conditional distributions to define the probability distribution over longer sequences:

P(x_1, \ldots, x_\tau) = P(x_1, \ldots, x_{n-1}) \prod_{t=n}^{\tau} P(x_t \mid x_{t-n+1}, \ldots, x_{t-1})  \quad (12.5)

… simply by looking up two stored probabilities. For this to exactly reproduce inference in P_n, we must omit the final character from each sequence when we train P_{n-1}.

As an example, we demonstrate how a trigram model computes the probability of the sentence "THE DOG RAN AWAY." The first words of the sentence cannot be handled by the default formula based on conditional probability because there is no context at the beginning of the sentence. Instead, we must use the marginal probability over words at the start of the sentence. We thus evaluate P_3(THE DOG RAN). Finally, the last word may be predicted using the typical case, of using the conditional distribution P(AWAY | DOG RAN). Putting this together with equation 12.6, we obtain:

P(\text{THE DOG RAN AWAY}) = P_3(\text{THE DOG RAN}) \, P_3(\text{DOG RAN AWAY}) / P_2(\text{DOG RAN})  \quad (12.7)

A fundamental limitation of maximum likelihood for n-gram models is that P_n as estimated from training set counts is very likely to be zero in many cases, even though the tuple (x_{t-n+1}, \ldots, x_t) may appear in the test set. This can cause two different kinds of catastrophic outcomes. When P_{n-1} is zero, the ratio is undefined, so the model does not even produce a sensible output. When P_{n-1} is non-zero but P_n is zero, the test log-likelihood is -\infty. To avoid such catastrophic outcomes, …

• Improve with:
  - Smoothing
  - Backoff
  - Word categories
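A minimal Python sketch of the computation in equation 12.7: trigram and bigram probabilities are estimated from counts on a tiny invented corpus, and a small additive constant stands in for the smoothing listed above. The corpus and names are made up for illustration.

```python
from collections import Counter

# Toy corpus of tokenized sentences (invented for this sketch).
corpus = [
    "THE DOG RAN AWAY".split(),
    "THE DOG RAN HOME".split(),
    "THE CAT RAN AWAY".split(),
    "A DOG RAN AWAY".split(),
]

bigrams, trigrams = Counter(), Counter()
for sent in corpus:
    for i in range(len(sent) - 1):
        bigrams[tuple(sent[i:i + 2])] += 1
    for i in range(len(sent) - 2):
        trigrams[tuple(sent[i:i + 3])] += 1

n_bi, n_tri = sum(bigrams.values()), sum(trigrams.values())
eps = 1e-6   # crude additive smoothing so unseen n-grams are not exactly zero

def P3(w1, w2, w3):
    """Marginal trigram probability estimated from counts."""
    return (trigrams[(w1, w2, w3)] + eps) / (n_tri + eps * len(trigrams))

def P2(w1, w2):
    """Marginal bigram probability estimated from counts."""
    return (bigrams[(w1, w2)] + eps) / (n_bi + eps * len(bigrams))

# Equation 12.7:
# P(THE DOG RAN AWAY) = P3(THE DOG RAN) * P3(DOG RAN AWAY) / P2(DOG RAN)
p = P3("THE", "DOG", "RAN") * P3("DOG", "RAN", "AWAY") / P2("DOG", "RAN")
print(p)
```

With eps = 0 this reduces to the pure maximum-likelihood estimator and exhibits exactly the zero-count problems described above; backoff and word categories are other ways to share statistics across rare contexts.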
Word Embeddings in Neural Language Models

[Figure: two-dimensional visualizations of learned word embeddings, zoomed in on regions where semantically related words lie close together: country and region names (France, China, Germany, Canada, Europe, Africa, Japan, ...) in one panel and years (1995-2009) in the other.]
• Short list
• Hierarchical softmax
• Importance sampling
A Hierarchy of Words and
Word Categories
[Figure 12.4: Illustration of a simple hierarchy of word categories, with 8 words w_0, ..., w_7 at the leaves of a binary tree whose branches are labeled (0) and (1).]
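A minimal sketch (my own illustration, not code from the slides or the book) of how such a hierarchy can be used as a hierarchical softmax: with 8 words at the leaves of a binary tree, each word corresponds to 3 binary decisions, and its probability given a context is the product of the probabilities of those decisions. Each internal node gets a logistic-regression weight vector; all names, dimensions, and the random context are assumptions.

```python
import numpy as np

NUM_WORDS = 8        # leaves w0 ... w7, as in Figure 12.4
DEPTH = 3            # log2(8) binary decisions per word
CONTEXT_DIM = 16

rng = np.random.default_rng(0)
# One logistic-regression weight vector per internal node (NUM_WORDS - 1 of them).
node_weights = rng.normal(size=(NUM_WORDS - 1, CONTEXT_DIM))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def word_probability(word_id, context):
    """P(word | context) as a product of binary decisions down the tree.

    The tree is stored implicitly as a heap: the root is node 0 and the
    children of node i are 2*i + 1 and 2*i + 2. The bits of word_id
    (most significant first) choose the branch at each level.
    """
    prob, node = 1.0, 0
    for level in reversed(range(DEPTH)):
        go_right = (word_id >> level) & 1
        p_right = sigmoid(node_weights[node] @ context)
        prob *= p_right if go_right else (1.0 - p_right)
        node = 2 * node + 1 + go_right
    return prob

context = rng.normal(size=CONTEXT_DIM)
probs = [word_probability(w, context) for w in range(NUM_WORDS)]
print(sum(probs))    # sums to 1: the tree defines a valid distribution over the words
```

Because only DEPTH sigmoid evaluations are needed per word, scoring one word costs O(log V) instead of the O(V) of a flat softmax, which is the point of the hierarchy.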
Neural Machine Translation
[Figure 12.5: The encoder-decoder architecture to map back and forth between a surface representation (such as a sequence of words or an image) and a semantic representation. By using the output of an encoder of data from one modality as the input to a decoder for another modality, we can train systems to translate from one modality to another.]
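As a purely structural sketch of the caption's idea (not the architecture of any real translation system), the code below folds a source token sequence into a single context vector with a tiny recurrent update and then decodes target tokens greedily, conditioned on that context. The weights are random and untrained, so the output is meaningless; vocabularies, dimensions, and all names are invented for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
SRC_VOCAB, TGT_VOCAB, DIM = 20, 20, 32
EOS = 0                                     # hypothetical end-of-sequence token id

src_embed = rng.normal(size=(SRC_VOCAB, DIM))
tgt_embed = rng.normal(size=(TGT_VOCAB, DIM))
W_enc = 0.1 * rng.normal(size=(DIM, DIM))   # recurrent encoder weights
W_dec = 0.1 * rng.normal(size=(DIM, DIM))   # recurrent decoder weights
W_out = rng.normal(size=(TGT_VOCAB, DIM))   # maps decoder state to token logits

def encode(src_ids):
    """Fold the source sentence into one semantic context vector."""
    h = np.zeros(DIM)
    for i in src_ids:
        h = np.tanh(W_enc @ h + src_embed[i])
    return h

def decode(context, max_len=10):
    """Greedily emit target tokens conditioned on the encoder's context."""
    h, prev, out = context, EOS, []
    for _ in range(max_len):
        h = np.tanh(W_dec @ h + tgt_embed[prev])
        prev = int(np.argmax(W_out @ h))    # greedy choice of the next token
        if prev == EOS:
            break
        out.append(prev)
    return out

print(decode(encode([3, 7, 5])))            # meaningless: the weights are untrained
```

In a real system both networks are trained jointly and decoding typically uses beam search and attention; the sketch only shows the encoder-to-decoder division of labor the caption describes.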
Google Neural Machine Translation
Wu et al 2016
Speech Recognition
Current speech recognition is based on seq2seq with attention.
Graphic from "Listen, Attend, and Spell" (Chan et al., 2015).
Speech Synthesis
WaveNet
(van den Oord et al, 2016)
Deep RL for Atari game playing
(Mnih et al 2013)