DEEP LEARNING
Module 5 Part 1
Computer Vision
Applications of computer vision range from reproducing human visual abilities, such as recognizing faces, to creating entirely new categories of visual abilities.
One recent computer vision application is to recognize sound waves from the vibrations they induce in objects visible in a video.
Most deep learning for computer vision is used for object recognition or detection of some form, whether this means reporting which object is present in an image, annotating an image with bounding boxes around each object, transcribing a sequence of symbols from an image, or labeling each pixel in an image with the identity of the object it belongs to.
Deep learning models capable of image synthesis are usually useful
for image restoration, a computer vision task involving repairing
defects in images or removing objects from images.
Preprocessing
The images should be standardized so that their pixels all lie in the
same, reasonable range, like [0,1] or [-1, 1].
Formatting images to have the same scale is the only kind of
preprocessing that is strictly necessary.
Many computer vision architectures require images of a standard
size, so images must be cropped or scaled to fit that size.
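As a rough sketch of these two steps (the [0, 1] / [-1, 1] rescaling and the 224×224 target size are illustrative choices, and Pillow is assumed for the resizing):

```python
import numpy as np
from PIL import Image

def standardize_pixels(img_uint8, target_range=(-1.0, 1.0)):
    """Map 8-bit pixel values into a fixed range such as [0, 1] or [-1, 1]."""
    x = img_uint8.astype(np.float32) / 255.0   # now in [0, 1]
    lo, hi = target_range
    return x * (hi - lo) + lo                  # rescale to [lo, hi]

def resize_to(img_uint8, size=(224, 224)):
    """Scale an image to the fixed input size an architecture expects."""
    return np.asarray(Image.fromarray(img_uint8).resize(size))

# Example: a random array standing in for a real photo.
img = np.random.randint(0, 256, size=(480, 640, 3), dtype=np.uint8)
x = standardize_pixels(resize_to(img), target_range=(-1.0, 1.0))
print(x.shape, x.min(), x.max())
```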
Dataset augmentation may be seen as a way of preprocessing the training set only.
Dataset augmentation is an excellent way to reduce the
generalization error of most computer vision models
A related idea applicable at test time is to show the model many
different versions of the same input (for example, the same image
cropped at slightly different locations) and have the different
instantiations of the model vote to determine the output.
This latter idea can be interpreted as an ensemble approach,
and helps to reduce generalization error.
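A minimal sketch of this multi-crop voting idea, assuming a hypothetical predict_probs(image) function that returns class probabilities for a single crop:

```python
import numpy as np

def multi_crop_predict(image, predict_probs, crop_size=224):
    """Average class probabilities over several crops of the same image
    (four corners plus centre), an ensemble-style test-time trick."""
    h, w, _ = image.shape
    c = crop_size
    offsets = [(0, 0), (0, w - c), (h - c, 0), (h - c, w - c),
               ((h - c) // 2, (w - c) // 2)]
    crops = [image[i:i + c, j:j + c] for i, j in offsets]
    probs = np.stack([predict_probs(crop) for crop in crops])
    return probs.mean(axis=0)   # averaged "vote" over the crops
```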
Other kinds of preprocessing are applied to both the train and
the test set with the goal of putting each example into a more
canonical form in order to reduce the amount of variation that
the model needs to account for.
Reducing this amount of variation can both reduce generalization error and reduce the size of the model needed to fit the training set.
Contrast Normalization
One of the most obvious sources of variation that can be safely
removed for many tasks is the amount of contrast in the
image
Contrast simply refers to the magnitude of the difference
between the bright and the dark pixels in an image.
In the context of deep learning, contrast usually refers to the
standard deviation of the pixels in an image or region of an
image.
Suppose we have an image represented by a tensor X ∈ R^(r×c×3), with Xi,j,1 being the red intensity at row i and column j, Xi,j,2 giving the green intensity and Xi,j,3 giving the blue intensity. Then the contrast of the entire image is given by

sqrt( (1/(3rc)) Σi Σj Σk (Xi,j,k − X̄)² ),

where X̄ = (1/(3rc)) Σi Σj Σk Xi,j,k is the mean intensity of the entire image.
Global contrast normalization (GCN) aims to prevent images
from having varying amounts of contrast by subtracting the
mean from each image, then rescaling it so that the standard
deviation across its pixels is equal to some constant s.
No scaling factor can change the contrast of a zero-contrast image (one whose pixels all have equal intensity), and images with very low but non-zero contrast often have little information content. This motivates introducing a small, positive regularization parameter λ to bias the estimate of the standard deviation. Alternately, one can constrain the denominator to be at least ε. Given an input image X, GCN produces an output image X′ defined such that

X′i,j,k = s (Xi,j,k − X̄) / max{ ε, sqrt( λ + (1/(3rc)) Σi Σj Σk (Xi,j,k − X̄)² ) }.
It is preferable to define GCN in terms of standard deviation rather than L2 norm, because the standard deviation includes division by the number of pixels, so GCN based on standard deviation allows the same s to be used regardless of image size.
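A minimal NumPy sketch of GCN as defined above (the default values of s, λ and ε are illustrative):

```python
import numpy as np

def global_contrast_normalization(X, s=1.0, lam=10.0, eps=1e-8):
    """GCN: subtract the image mean, then rescale so the per-image
    standard deviation (regularized by lam, floored by eps) equals s."""
    X = X.astype(np.float64)
    X_centered = X - X.mean()                            # remove mean intensity
    contrast = np.sqrt(lam + np.mean(X_centered ** 2))   # regularized std. dev.
    return s * X_centered / max(contrast, eps)

image = np.random.rand(32, 32, 3)
print(global_contrast_normalization(image).std())
```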
One can understand GCN as mapping examples to a spherical shell. This can be a useful property because neural networks are often better at responding to directions in space than to exact locations, and GCN avoids problems with representing multiple distances along the same direction by reducing each example to a direction rather than a direction and a distance.
A related operation is sphering, which rescales the principal components of the data to have equal variance, so that the multivariate normal distribution used by PCA has spherical contours. Sphering is more commonly known as whitening.
Global contrast normalization will often fail to highlight
image features we would like to stand out, such as edges and
corners
Local contrast normalization ensures that the contrast is normalized across each small window, rather than over the image as a whole.
In all cases, one modifies each pixel by subtracting a
mean of nearby pixels and dividing by a standard
deviation of nearby pixels.
In some cases, this is literally the mean and standard
deviation of all pixels in a rectangular window centered
on the pixel to be modified (Pinto et al., 2008).
In other cases, this is a weighted mean and weighted
standard deviation using Gaussian weights centered on
the pixel to be modified.
Local contrast normalization is a differentiable
operation and can also be used as a nonlinearity applied to
the hidden layers of a network, as well as a preprocessing
operation applied to the input.
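A rough sketch of the Gaussian-weighted variant for a single-channel image, assuming SciPy is available (the window width sigma is an illustrative choice):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def local_contrast_normalization(image, sigma=4.0, eps=1e-4):
    """Subtract a Gaussian-weighted local mean from each pixel and divide
    by a Gaussian-weighted local standard deviation (single-channel image)."""
    image = image.astype(np.float64)
    local_mean = gaussian_filter(image, sigma)
    centered = image - local_mean
    local_var = gaussian_filter(centered ** 2, sigma)
    local_std = np.sqrt(local_var)
    return centered / np.maximum(local_std, eps)
```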
Dataset Augmentation
It is easy to improve the generalization of a classifier by
increasing the size of the training set by adding extra
copies of the training examples that have been modified
with transformations that do not change the class.
Object recognition is a classification task that is especially
amenable to this form of dataset augmentation because the
class is invariant to so many transformations and the
input can be easily transformed with many geometric
operations.
In specialized computer vision applications, more advanced
transformations are commonly used for dataset
augmentation. These schemes include random perturbation
of the colors in an image (Krizhevsky et al., 2012) and
nonlinear geometric distortions of the input (LeCun et al.,
1998b).
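A small illustrative sketch of such label-preserving transformations (horizontal flip, random crop, and a simple random colour perturbation; the crop size and noise scale are arbitrary choices):

```python
import numpy as np

def augment(image, crop=28, rng=None):
    """Return a randomly transformed copy of `image` with the same class label."""
    if rng is None:
        rng = np.random.default_rng()
    # Random horizontal flip: mirror images usually keep the same label.
    if rng.random() < 0.5:
        image = image[:, ::-1]
    # Random crop: a small translation that does not change the object class.
    h, w, _ = image.shape
    i, j = rng.integers(0, h - crop + 1), rng.integers(0, w - crop + 1)
    image = image[i:i + crop, j:j + crop]
    # Random colour perturbation: jitter each channel slightly.
    return np.clip(image + rng.normal(0, 0.05, size=(1, 1, 3)), 0.0, 1.0)

batch = [augment(np.random.rand(32, 32, 3)) for _ in range(8)]
```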
Speech Recognition
The task of speech recognition is to map an acoustic signal
containing a spoken natural language utterance into the
corresponding sequence of words intended by the speaker
Let X = (x^(1), x^(2), . . . , x^(T)) denote the sequence of acoustic input vectors (traditionally produced by splitting the audio into 20 ms frames).
Most speech recognition systems preprocess the input using specialized hand-designed features, but some deep learning systems learn features from raw input.
Let y = (y1, y2, . . . , yN) denote the target output sequence (usually a sequence of words or characters). The automatic speech recognition (ASR) task consists of creating a function f*_ASR that computes the most probable linguistic sequence y given the acoustic sequence X:

f*_ASR(X) = arg max_y P*(y | X = X),

where P* is the true conditional distribution relating the inputs X to the targets y.
To solve speech recognition tasks, unsupervised
pretraining was used to build deep feedforward networks
whose layers were each initialized by training an RBM.
These networks take spectral acoustic representations in a
fixed-size input window (around a center frame) and predict
the conditional probabilities of HMM states for that center
frame.
Another important push, still ongoing, has been toward end-to-end deep learning speech recognition systems that completely remove the HMM.
The first major breakthrough in this direction came from Graves et al. (2013), who trained a deep LSTM RNN using MAP inference over the frame-to-phoneme alignment, as in LeCun et al. (1998b) and in the CTC framework (Graves et al., 2006; Graves, 2012).
Another contemporary step toward end-to-end deep learning ASR is to let the system learn how to "align" the acoustic-level information with the phonetic-level information.
Natural Language Processing
Natural language processing (NLP) is the use of human
languages, such as English or French, by a computer
Natural language processing includes applications such as
machine translation, in which the learner must read a
sentence in one human language and emit an equivalent
sentence in another human language
Many NLP applications are based on language models that define
a probability distribution over sequences of words, characters
or bytes in a natural language.
To build an efficient model of natural language, we must usually
use techniques that are specialized for processing sequential
data. In many cases, we choose to regard natural language as a
sequence of words, rather than a sequence of individual
characters or bytes
n-grams
A language model defines a probability distribution over
sequences of tokens in a natural language
Depending on how the model is designed, a token may be a word,
a character, or even a byte. Tokens are always discrete entities.
Traditionally, language models were based on models of fixed-length sequences of tokens called n-grams. An n-gram is a sequence of n tokens, and an n-gram model defines the conditional probability of the n-th token given the preceding n − 1 tokens:

P(x1, . . . , xτ) = P(x1, . . . , xn−1) Π_{t=n}^{τ} P(xt | xt−n+1, . . . , xt−1).

Training n-gram models is straightforward because the maximum likelihood estimate can be computed simply by counting how many times each possible n-gram occurs in the training set. In practice, one usually trains an n-gram model Pn together with an (n−1)-gram model Pn−1, so that

P(xt | xt−n+1, . . . , xt−1) = Pn(xt−n+1, . . . , xt) / Pn−1(xt−n+1, . . . , xt−1)

can be computed simply by looking up two stored probabilities. For this to exactly reproduce inference in Pn, we must omit the final character from each sequence when we train Pn−1.
As an example, consider how a trigram model computes the probability of the sentence "THE DOG RAN AWAY." The first words of the sentence cannot be handled by the default formula based on conditional probability because there is no context at the beginning of the sentence. Instead, we must use the marginal probability over words at the start of the sentence; we thus evaluate P3(THE DOG RAN). Finally, the last word may be predicted using the typical case, the conditional distribution P(AWAY | DOG RAN). Putting this together with the conditional-probability formula above, we obtain:

P(THE DOG RAN AWAY) = P3(THE DOG RAN) P3(DOG RAN AWAY) / P2(DOG RAN).
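A minimal count-based sketch of this trigram computation on an invented toy corpus:

```python
from collections import Counter

corpus = "the dog ran away the dog ran home the cat ran away".split()

trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigrams = Counter(zip(corpus, corpus[1:]))

def P3(w1, w2, w3):
    """Maximum-likelihood trigram probability P3(w1 w2 w3)."""
    return trigrams[(w1, w2, w3)] / sum(trigrams.values())

def P2(w1, w2):
    """Maximum-likelihood bigram probability P2(w1 w2)."""
    return bigrams[(w1, w2)] / sum(bigrams.values())

# P(THE DOG RAN AWAY) = P3(THE DOG RAN) * P3(DOG RAN AWAY) / P2(DOG RAN)
p = P3("the", "dog", "ran") * P3("dog", "ran", "away") / P2("dog", "ran")
print(p)
```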
A fundamental limitation of maximum likelihood for n-gram models is that Pn as estimated from training set counts is very likely to be zero in many cases, even though the tuple (xt−n+1, . . . , xt) may appear in the test set.
When Pn−1 is zero, the ratio is undefined, so the model does not
even produce a sensible output. When Pn−1 is non-zero but Pn is
zero, the test log-likelihood is −∞. To avoid such catastrophic
outcomes, most n-gram models employ some form of smoothing.
Smoothing techniques shift probability mass from the observed tuples to unobserved ones that are similar.
One basic technique consists of adding non-zero probability mass to all of the possible next symbol values.
Another very popular idea is to form a mixture model containing higher-order and lower-order n-gram models, with the higher-order models providing more capacity and the lower-order models being more likely to avoid counts of zero.
Back-off methods look up the lower-order n-grams if the frequency of the context xt−1, . . . , xt−n+1 is too small to use the higher-order model.
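A schematic sketch of the mixture (interpolation) idea; the three conditional estimators and the mixture weights are assumed inputs, not part of the text:

```python
def interpolated_prob(w1, w2, w3, p_cond3, p_cond2, p_uni,
                      lambdas=(0.6, 0.3, 0.1)):
    """Mixture of conditional estimates P(w3|w1 w2), P(w3|w2) and P(w3):
    the higher-order terms provide capacity, while the unigram term is
    almost never zero, so the mixture avoids zero-count catastrophes."""
    l3, l2, l1 = lambdas
    return l3 * p_cond3(w1, w2, w3) + l2 * p_cond2(w2, w3) + l1 * p_uni(w3)
```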
Classical n-gram models are particularly vulnerable to the curse of
dimensionality.
One way to view a classical n-gram model is that it is performing
nearest-neighbor lookup
The problem for a language model is even more severe than
usual, because any two different words have the same
distance from each other in one-hot vector space
To overcome these problems, a language model must be able to
share knowledge between one word and other semantically
similar words.
To improve the statistical efficiency of n-gram models, class-based language models introduce the notion of word categories and then share statistical strength between words that are in the same category.
Neural Language Models
Neural language models (NLMs) are designed to overcome the curse of dimensionality by using a distributed representation of words: each word is associated with a learned low-dimensional real-valued vector, called a word embedding, so that semantically similar words map to representations that are close to each other in the embedding space.
Neural networks in other domains also define embeddings. For example, a hidden layer of a convolutional network provides an "image embedding."
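A toy sketch of the embedding idea: each word indexes a row of a matrix of learned vectors, and similarity is measured by distance between those vectors (here the matrix is random, so the neighbours are meaningless; the point is only the lookup and the distance computation):

```python
import numpy as np

vocab = ["cat", "dog", "car", "truck", "ran", "away"]
word_to_id = {w: i for i, w in enumerate(vocab)}

embedding_dim = 8
E = np.random.randn(len(vocab), embedding_dim)   # learned in a real model

def embed(word):
    """Look up the low-dimensional vector for a word."""
    return E[word_to_id[word]]

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# In a trained embedding space, cosine("cat", "dog") would typically be
# larger than cosine("cat", "truck").
print(cosine(embed("cat"), embed("dog")), cosine(embed("cat"), embed("truck")))
```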
High-Dimensional Outputs
In many applications, V contains hundreds of thousands of
words.
The naive approach to representing such a distribution is to
apply an affine transformation from a hidden representation to
the output space, then apply the softmax function
Suppose we have a vocabulary V with size |V|. The weight matrix describing the linear component of this affine transformation is very large, because its output dimension is |V|.
This imposes a high memory cost to represent the matrix, and a
high computational cost to multiply by it.
Because the softmax is normalized across all |V| outputs, the full matrix multiplication must be performed at training time as well as at test time; we cannot compute only the dot product with the weight vector for the correct output. The high computational costs of the output layer thus arise both at training time (to compute the likelihood and its gradient) and at test time (to compute probabilities for all or selected words).
Suppose that h is the top hidden layer used to predict the output probabilities ŷ. If we parametrize the transformation from h to ŷ with learned weights W and learned biases b, then the affine-softmax output layer performs the following computations:

ai = bi + Σj Wi,j hj,  for every i ∈ {1, . . . , |V|},
ŷi = e^(ai) / Σ_{i′} e^(ai′).

If h contains nh elements, then the above operation is O(|V| nh). With nh in the thousands and |V| in the hundreds of thousands, this operation dominates the computation of most neural language models.
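A direct NumPy sketch of this affine-softmax layer; the sizes are illustrative (real vocabularies are larger still), and the full |V| × nh multiplication is exactly the expensive step described above:

```python
import numpy as np

n_h, V = 512, 10_000                 # hidden size and (small, illustrative) vocabulary
h = np.random.randn(n_h)
W = np.random.randn(V, n_h)          # |V| x n_h weight matrix: the expensive part
b = np.zeros(V)

a = b + W @ h                        # O(|V| * n_h) multiply-adds
a -= a.max()                         # numerical stability before exponentiation
y_hat = np.exp(a) / np.exp(a).sum()  # softmax normalized over all |V| outputs
print(y_hat.shape, y_hat.sum())
```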
Use of a Short List
The first neural language models (Bengio et al., 2001, 2003) dealt
with the high cost of using a softmax over a large number of
output words by limiting the vocabulary size to 10,000 or 20,000
words.
Schwenk and Gauvain (2002) and Schwenk (2007) built upon
this approach by splitting the vocabulary V into a shortlist L
of most frequent words (handled by the neural net) and a tail
of more rare words (handled by an n-gram model).
To be able to combine the two predictions, the neural net also has
to predict the probability that a word appearing after context C
belongs to the tail list.
This may be achieved by adding an extra sigmoid output unit to provide an estimate of P(i ∈ T | C). The extra output can then be used to achieve an estimate of the probability distribution over all words in V as follows:

P(y = i | C) = 1_{i∈L} P(y = i | C, i ∈ L) (1 − P(i ∈ T | C)) + 1_{i∈T} P(y = i | C, i ∈ T) P(i ∈ T | C),

where P(y = i | C, i ∈ L) is provided by the neural language model over the shortlist L and P(y = i | C, i ∈ T) is provided by the n-gram model over the tail T.
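A schematic sketch of this combination, assuming the shortlist softmax output, the n-gram tail distribution, and the sigmoid gate P(i ∈ T | C) are already computed (all three are hypothetical inputs here):

```python
import numpy as np

def combine_shortlist_and_tail(p_shortlist, p_tail_ngram, p_in_tail):
    """P(y=i|C): neural-net shortlist probabilities weighted by (1 - P(tail|C)),
    n-gram tail probabilities weighted by P(tail|C)."""
    return np.concatenate([(1.0 - p_in_tail) * p_shortlist,
                           p_in_tail * p_tail_ngram])

p_shortlist = np.array([0.7, 0.2, 0.1])   # from the neural LM, sums to 1
p_tail = np.array([0.5, 0.5])             # from the n-gram model, sums to 1
print(combine_shortlist_and_tail(p_shortlist, p_tail, p_in_tail=0.1).sum())  # 1.0
```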
An obvious disadvantage of the short list approach is that the
potential generalization advantage of the neural language
models is limited to the most frequent words, where,
arguably, it is the least useful.
This disadvantage has stimulated the exploration of alternative methods to deal with high-dimensional outputs, such as the hierarchical softmax described next.
Hierarchical Softmax
A classical approach (Goodman, 2001) to reducing the
computational burden of high-dimensional output layers over
large vocabulary sets V is to decompose probabilities
hierarchically.
Instead of necessitating a number of computations
proportional to |V | (and also proportional to the number of
hidden units, nh), the |V | factor can be reduced to as low
as log |V | .
Bengio (2002) and Morin and Bengio (2005) introduced this
factorized approach to the context of neural language models.
One can think of the hierarchy as building categories of words, then categories of categories of words, then categories of categories of categories of words, and so on.
These nested categories form a tree, with words at the leaves.
In a balanced tree, the tree has depth O(log |V|).
The probability of choosing a word is given by the product of the probabilities of choosing the branch leading to that word at every node on the path from the root of the tree to the leaf containing the word.
Mnih and Hinton (2009) describe how to use multiple paths to identify a single word in order to better model words that have multiple meanings. Computing the probability of a word then involves summation over all of the paths that lead to that word.
Figure 12.4: Illustration of a simple hierarchy of word
categories, with 8 words w0 , . . . , w7 organized into a
three level hierarchy. The leaves of the tree represent
actual specific words. Internal nodes represent groups
of words. Any node can be indexed by the sequence of
binary decisions (0=left, 1=right) to reach the node from
the root.
Super-class (0) contains the classes (0, 0) and (0, 1), which respectively contain the sets of words {w0, w1} and {w2, w3}, and similarly super-class (1) contains the classes (1, 0) and (1, 1), which respectively contain the sets of words {w4, w5} and {w6, w7}.
If the tree is sufficiently balanced, the maximum depth (number of binary decisions) is on the order of the logarithm of the number of words |V|: the choice of one out of |V| words can be obtained by doing O(log |V|) operations (one for each of the nodes on the path from the root).
In this example, computing the probability of a word y can be done by multiplying three probabilities, associated with the binary decisions to move left or right at each node on the path from the root to the leaf containing y.
Let bi(y) be the i-th binary decision on the path to y. The probability of sampling an output y decomposes into a product of conditional probabilities, using the chain rule for conditional probabilities, with each node indexed by the prefix of these bits.
For example, node (1, 0) corresponds to the prefix (b0(w4) = 1, b1(w4) = 0), and the probability of w4 can be decomposed as follows:

P(y = w4) = P(b0 = 1, b1 = 0, b2 = 0)
          = P(b0 = 1) P(b1 = 0 | b0 = 1) P(b2 = 0 | b0 = 1, b1 = 0).
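A small sketch of this path-product computation, assuming each internal node has a learned weight vector and each word's binary code is known (both are illustrative here, following the layout of Figure 12.4):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hierarchical_word_prob(h, code, node_weights):
    """P(word) as the product of per-node branch probabilities along the
    path given by `code` (e.g. w4 -> (1, 0, 0)).  `node_weights[prefix]`
    is the weight vector of the internal node reached by the bit-prefix."""
    prob, prefix = 1.0, ()
    for bit in code:
        p_right = sigmoid(node_weights[prefix] @ h)   # P(choose branch 1 | node)
        prob *= p_right if bit == 1 else (1.0 - p_right)
        prefix = prefix + (bit,)
    return prob

n_h = 16
h = np.random.randn(n_h)
# One weight vector per internal node, indexed by the path prefix.
nodes = {(): np.random.randn(n_h),
         (1,): np.random.randn(n_h),
         (1, 0): np.random.randn(n_h)}
print(hierarchical_word_prob(h, code=(1, 0, 0), node_weights=nodes))
```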
An important advantage of the hierarchical softmax is that it brings computational benefits both at training time and at test time, if at test time we want to compute the probability of specific words.
A disadvantage is that in practice the hierarchical softmax tends to give worse test results; this may be due to a poor choice of word classes.
Common Word Embedding
The goal is to develop effective learning models in situations where labeled data is scarce but wild, unlabeled data is plentiful.
We'll approach this problem by learning embeddings, or low-dimensional representations, in an unsupervised fashion.
Because these unsupervised models allow us to offload all of
the heavy lifting of automated feature selection, we can use
the generated embeddings to solve learning problems using
smaller models that require less data.
Fig:-General architectures for designing encoders and decoders that generate
embeddings by mapping words to their respective contexts (A) or vice versa (B)
Fig:-An example of generating one-hot vector representations for words using a simple document
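A minimal sketch of building such one-hot vectors from a small invented document:

```python
import numpy as np

document = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(document))
word_to_id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """A |V|-dimensional vector with a single 1 at the word's index."""
    v = np.zeros(len(vocab))
    v[word_to_id[word]] = 1.0
    return v

print(one_hot("fox"))
# Every pair of distinct words is equally far apart in this representation,
# which is why we learn dense embeddings instead.
```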
We define a new network architecture that we call the autoencoder. We first take the input and compress it into a low-dimensional vector.
This part of the network is called the encoder, because it is responsible for producing the low-dimensional embedding or code.
The second part of the network, instead of mapping the embedding to an arbitrary label as we would in a feed-forward network, tries to invert the computation of the first half of the network and reconstruct the original input. This piece is known as the decoder. The overall architecture is illustrated in the figure.
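A compact sketch of this encoder/decoder structure (layer sizes and the single-layer design are illustrative; a real autoencoder is trained to minimize the reconstruction error):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_code = 784, 32                        # e.g. flattened 28x28 input, 32-d code

W_enc = rng.normal(0, 0.01, (n_code, n_in))   # encoder weights
W_dec = rng.normal(0, 0.01, (n_in, n_code))   # decoder weights

def encoder(x):
    """Compress the input into a low-dimensional embedding (the 'code')."""
    return np.tanh(W_enc @ x)

def decoder(code):
    """Try to reconstruct the original input from the code."""
    return W_dec @ code

x = rng.random(n_in)
x_hat = decoder(encoder(x))
reconstruction_error = np.mean((x - x_hat) ** 2)   # the training objective
print(reconstruction_error)
```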