MODULE 5 Part 1

Module-5 of the Deep Learning course covers various application areas including computer vision, speech recognition, and natural language processing. It discusses techniques such as dataset augmentation, contrast normalization, and the use of neural language models to improve performance in these fields. Key concepts include the preprocessing of images for computer vision, the mapping of acoustic signals in speech recognition, and the development of language models using n-grams and neural embeddings.


CST414

DEEP LEARNING
Module-5 PART-I

SYLLABUS

Module-5 (Application Areas)


 Applications – computer vision, speech recognition, natural language processing; common word embeddings: continuous Bag-of-Words (CBOW), Word2Vec, Global Vectors for Word Representation (GloVe).
 Research Areas – autoencoders, representation learning, Boltzmann machines, deep belief networks.
Applications
 Deep learning is used to solve applications in computer vision, speech recognition, and natural language processing.

Computer Vision
 Applications of computer vision range from reproducing human visual abilities, such as recognizing faces, to creating entirely new categories of visual abilities.
 One recent computer vision application is to recognize sound waves from the vibrations they induce in objects visible in a video.
 Most deep learning for computer vision is used for object recognition or detection of some form, whether this means reporting which object is present in an image, annotating an image with bounding boxes around each object, transcribing a sequence of symbols from an image, or labeling each pixel in an image with the identity of the object it belongs to.
 Deep learning models capable of image synthesis are usually useful for image restoration, a computer vision task involving repairing defects in images or removing objects from images.
Preprocessing:-

 The images should be standardized so that their pixels all lie in the same reasonable range, such as [0, 1] or [-1, 1].
 Formatting images to have the same scale is the only kind of preprocessing that is strictly necessary.
 Many computer vision architectures require images of a standard size, so images must be cropped or scaled to fit that size.
 Dataset augmentation may be seen as a way of preprocessing the training set only.
 Dataset augmentation is an excellent way to reduce the generalization error of most computer vision models.
 A related idea applicable at test time is to show the model many different versions of the same input (for example, the same image cropped at slightly different locations) and have the different instantiations of the model vote to determine the output; a minimal sketch of this is given below.
 This latter idea can be interpreted as an ensemble approach, and it helps to reduce generalization error.
 Other kinds of preprocessing are applied to both the training and the test set, with the goal of putting each example into a more canonical form in order to reduce the amount of variation that the model needs to account for.
 This both reduces generalization error and reduces the size of the model needed to fit the training set.
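
The test-time voting idea can be made concrete with the minimal NumPy sketch below. The model callable, crop size, and number of classes are illustrative placeholders, not part of the original slides; any trained classifier could stand in for them.

import numpy as np

def predict_with_crops(model, image, crop_size=24, n_classes=10):
    """Average a model's class probabilities over several crops of one image.

    `model` is assumed to be any callable mapping a (crop_size, crop_size, 3)
    array to a vector of n_classes probabilities -- a stand-in for a trained
    classifier.
    """
    h, w, _ = image.shape
    # Five standard crops: four corners plus the center.
    offsets = [(0, 0), (0, w - crop_size), (h - crop_size, 0),
               (h - crop_size, w - crop_size),
               ((h - crop_size) // 2, (w - crop_size) // 2)]
    votes = np.zeros(n_classes)
    for top, left in offsets:
        crop = image[top:top + crop_size, left:left + crop_size, :]
        votes += model(crop)          # accumulate the "vote" of this instantiation
    return votes / len(offsets)       # ensemble average over all crops

# Example with a dummy model that ignores its input:
dummy_model = lambda crop: np.ones(10) / 10.0
image = np.random.rand(32, 32, 3)
print(predict_with_crops(dummy_model, image))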
Contrast Normalization
 One of the most obvious sources of variation that can be safely removed for many tasks is the amount of contrast in the image.
 Contrast simply refers to the magnitude of the difference between the bright and the dark pixels in an image.
 In the context of deep learning, contrast usually refers to the standard deviation of the pixels in an image or region of an image.
 Suppose we have an image represented by a tensor X \in \mathbb{R}^{r \times c \times 3}, with X_{i,j,1} being the red intensity at row i and column j, X_{i,j,2} giving the green intensity, and X_{i,j,3} giving the blue intensity.
 Then the contrast of the entire image is given by

   \sqrt{ \frac{1}{3rc} \sum_{i=1}^{r} \sum_{j=1}^{c} \sum_{k=1}^{3} \left( X_{i,j,k} - \bar{X} \right)^2 }        (12.1)

   where \bar{X} = \frac{1}{3rc} \sum_{i=1}^{r} \sum_{j=1}^{c} \sum_{k=1}^{3} X_{i,j,k}        (12.2)

   is the mean intensity of the entire image.
 Global contrast normalization (GCN) aims to prevent images from having varying amounts of contrast by subtracting the mean from each image, then rescaling it so that the standard deviation across its pixels is equal to some constant s.
 No scaling factor can change the contrast of a zero-contrast image (one whose pixels all have equal intensity), and images with very low but non-zero contrast often have little information content.
 Introducing a small, positive regularization parameter λ biases the estimate of the standard deviation.
 Alternatively, one can constrain the denominator to be at least ε. Given an input image X, GCN produces an output image X', defined such that

   X'_{i,j,k} = s \, \frac{ X_{i,j,k} - \bar{X} }{ \max\left\{ \epsilon,\ \sqrt{ \lambda + \frac{1}{3rc} \sum_{i'=1}^{r} \sum_{j'=1}^{c} \sum_{k'=1}^{3} \left( X_{i',j',k'} - \bar{X} \right)^2 } \right\} }        (12.3)

 Small images cropped randomly are more likely to have nearly constant intensity, making aggressive regularization more useful.
 The scale parameter s can usually be set to 1, as done by Coates et al. (2011), or chosen to make each individual pixel have a standard deviation across examples close to 1.
 The standard deviation in equation 12.3 is just a rescaling of the L2 norm of the image (assuming the mean of the image has already been removed).
 It is preferable to define GCN in terms of standard deviation rather than L2 norm because the standard deviation includes division by the number of pixels, so GCN based on standard deviation allows the same s to be used regardless of image size.
 One can understand GCN as mapping examples onto a spherical shell: it reduces each example to a direction rather than a direction and a distance.
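
Equation 12.3 translates directly into a few lines of NumPy. The sketch below is a minimal version; the default values of s, lam, and eps follow the roles of s, λ, and ε in the formula but are illustrative choices, not values prescribed by the slides.

import numpy as np

def global_contrast_normalization(X, s=1.0, lam=10.0, eps=1e-8):
    """Apply GCN to an image tensor X of shape (r, c, 3).

    Subtract the mean intensity, then rescale so the standard deviation
    across all pixels is approximately the constant s.
    """
    X = X.astype(np.float64)
    X_bar = X.mean()                                     # mean over all 3*r*c values
    contrast = np.sqrt(lam + ((X - X_bar) ** 2).mean())  # regularized standard deviation
    return s * (X - X_bar) / max(eps, contrast)

image = np.random.rand(32, 32, 3) * 255.0
gcn_image = global_contrast_normalization(image)
print(gcn_image.mean(), gcn_image.std())   # mean ~ 0, std ~ s (up to the effect of lam)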

 There is a preprocessing operation known as sphering, and it is not the same operation as GCN.
 Sphering does not refer to making the data lie on a spherical shell, but rather to rescaling the principal components to have equal variance, so that the multivariate normal distribution used by PCA has spherical contours.
 Sphering is more commonly known as whitening.
 Global contrast normalization will often fail to highlight image features we would like to stand out, such as edges and corners.

 Local contrast normalization ensures that the contrast is normalized across each small window, rather than over the image as a whole.
 In all cases, one modifies each pixel by subtracting a mean of nearby pixels and dividing by a standard deviation of nearby pixels.
 In some cases, this is literally the mean and standard deviation of all pixels in a rectangular window centered on the pixel to be modified (Pinto et al., 2008).
 In other cases, this is a weighted mean and weighted standard deviation using Gaussian weights centered on the pixel to be modified.
 Local contrast normalization can usually be implemented efficiently by using separable convolution to compute feature maps of local means and local standard deviations, then using element-wise subtraction and element-wise division on different feature maps.
 Local contrast normalization is a differentiable operation and can also be used as a nonlinearity applied to the hidden layers of a network, as well as a preprocessing operation applied to the input.
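
A minimal sketch of local contrast normalization with Gaussian-weighted local statistics is shown below. It relies on SciPy's separable Gaussian filter for the local mean and local standard deviation; the window width sigma and the floor eps in the denominator are illustrative choices, not values given in the slides.

import numpy as np
from scipy.ndimage import gaussian_filter

def local_contrast_normalization(X, sigma=3.0, eps=1e-4):
    """LCN on a single-channel image X of shape (r, c).

    Each pixel is normalized by the Gaussian-weighted mean and standard
    deviation of its neighborhood; the Gaussian filter is separable, so the
    local statistics are cheap to compute.
    """
    X = X.astype(np.float64)
    local_mean = gaussian_filter(X, sigma)        # feature map of local means
    centered = X - local_mean                     # element-wise subtraction
    local_var = gaussian_filter(centered ** 2, sigma)
    local_std = np.sqrt(local_var)                # feature map of local standard deviations
    return centered / np.maximum(local_std, eps)  # element-wise division

image = np.random.rand(64, 64)
print(local_contrast_normalization(image).shape)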
Dataset Augmentation
 It is easy to improve the generalization of a classifier by increasing the size of the training set, adding extra copies of the training examples that have been modified with transformations that do not change the class.
 Object recognition is a classification task that is especially amenable to this form of dataset augmentation, because the class is invariant to so many transformations and the input can be easily transformed with many geometric operations.
 In specialized computer vision applications, more advanced transformations are commonly used for dataset augmentation. These schemes include random perturbation of the colors in an image (Krizhevsky et al., 2012) and nonlinear geometric distortions of the input (LeCun et al., 1998b).
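
The sketch below illustrates simple label-preserving augmentation (random crops, horizontal flips, and a small color perturbation). The transformation ranges are illustrative only and are not taken from the cited papers.

import numpy as np

rng = np.random.default_rng(0)

def augment(image, crop_size=28):
    """Return a randomly transformed copy of `image` (shape (h, w, 3), values in [0, 1])."""
    h, w, _ = image.shape
    # Random crop.
    top = rng.integers(0, h - crop_size + 1)
    left = rng.integers(0, w - crop_size + 1)
    out = image[top:top + crop_size, left:left + crop_size, :].copy()
    # Random horizontal flip (class-preserving for most object categories).
    if rng.random() < 0.5:
        out = out[:, ::-1, :]
    # Small random perturbation of each color channel.
    out = np.clip(out * (1.0 + 0.1 * rng.normal(size=3)), 0.0, 1.0)
    return out

image = np.random.rand(32, 32, 3)
augmented_batch = np.stack([augment(image) for _ in range(8)])
print(augmented_batch.shape)   # (8, 28, 28, 3): extra modified copies of one example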
Speech Recognition
 The task of speech recognition is to map an acoustic signal containing a spoken natural language utterance into the corresponding sequence of words intended by the speaker.
 Let X = (x^{(1)}, x^{(2)}, . . . , x^{(T)}) denote the sequence of acoustic input vectors (traditionally produced by splitting the audio into 20 ms frames).
 Most speech recognition systems preprocess the input using specialized hand-designed features.
 Let y = (y_1, y_2, . . . , y_N) denote the target output sequence (usually a sequence of words or characters). The automatic speech recognition (ASR) task consists of creating a function f^{*}_{ASR} that computes the most probable linguistic sequence y given the acoustic sequence X:

   f^{*}_{ASR}(X) = \arg\max_{y} P^{*}(y \mid X)

 where P^{*} is the true conditional distribution relating the inputs X to the targets y.
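
As an illustration of how the acoustic sequence X might be formed, the sketch below splits a raw waveform into non-overlapping 20 ms frames. The sample rate and the use of raw frames (rather than hand-designed features such as spectral coefficients) are simplifications for illustration only.

import numpy as np

def frame_audio(waveform, sample_rate=16000, frame_ms=20):
    """Split a 1-D waveform into consecutive non-overlapping frames.

    Returns an array of shape (T, frame_len): the sequence x(1), ..., x(T)
    of acoustic input vectors.
    """
    frame_len = int(sample_rate * frame_ms / 1000)   # 320 samples at 16 kHz
    n_frames = len(waveform) // frame_len
    return waveform[:n_frames * frame_len].reshape(n_frames, frame_len)

waveform = np.random.randn(16000)          # one second of fake audio
X = frame_audio(waveform)
print(X.shape)                              # (50, 320): fifty 20 ms frames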
 To solve speech recognition tasks, unsupervised pretraining was used to build deep feedforward networks whose layers were each initialized by training an RBM.
 These networks take spectral acoustic representations in a fixed-size input window (around a center frame) and predict the conditional probabilities of HMM states for that center frame.
 Another important push, still ongoing, has been towards end-to-end deep learning speech recognition systems that completely remove the HMM.
 The first major breakthrough in this direction came from Graves et al. (2013), who trained a deep LSTM RNN using MAP inference over the frame-to-phoneme alignment, as in LeCun et al. (1998b) and in the CTC framework (Graves et al., 2006; Graves, 2012).
 Another contemporary step toward end-to-end deep learning ASR is to let the system learn how to "align" the acoustic-level information with the phonetic-level information.
Natural Language Processing
 Natural language processing (NLP) is the use of human languages, such as English or French, by a computer.
 Natural language processing includes applications such as machine translation, in which the learner must read a sentence in one human language and emit an equivalent sentence in another human language.
 Many NLP applications are based on language models that define a probability distribution over sequences of words, characters or bytes in a natural language.
 To build an efficient model of natural language, we must usually use techniques that are specialized for processing sequential data. In many cases, we choose to regard natural language as a sequence of words, rather than a sequence of individual characters or bytes.
n-grams
 A language model defines a probability distribution over sequences of tokens in a natural language.
 Depending on how the model is designed, a token may be a word, a character, or even a byte. Tokens are always discrete entities.
 Language models were originally based on models of fixed-length sequences of tokens called n-grams.
 An n-gram is a sequence of n tokens.
 Models based on n-grams define the conditional probability of the n-th token given the preceding n − 1 tokens.
 The model uses products of these conditional distributions to define the probability distribution over longer sequences:

   P(x_1, . . . , x_\tau) = P(x_1, . . . , x_{n-1}) \prod_{t=n}^{\tau} P(x_t \mid x_{t-n+1}, . . . , x_{t-1})        (12.5)

 The probability distribution over the initial sequence P(x_1, . . . , x_{n-1}) may be modeled by a different model with a smaller value of n.
 Training n-gram models is straightforward, because the maximum likelihood estimate can be computed simply by counting how many times each possible n-gram occurs in the training set.
 For small values of n, models have particular names: unigram for n = 1, bigram for n = 2, and trigram for n = 3. These names derive from the Latin prefixes for the corresponding numbers and the Greek suffix "-gram" denoting something that is written.
 Usually we train both an n-gram model and an (n−1)-gram model simultaneously. This makes it easy to compute

   P(x_t \mid x_{t-n+1}, . . . , x_{t-1}) = \frac{ P_n(x_{t-n+1}, . . . , x_t) }{ P_{n-1}(x_{t-n+1}, . . . , x_{t-1}) }        (12.6)

 simply by looking up two stored probabilities. For this to exactly reproduce inference in P_n, we must omit the final character from each sequence when we train P_{n−1}.
 As an example, we demonstrate how a trigram model computes the probability of the sentence "THE DOG RAN AWAY."
 The first words of the sentence cannot be handled by the default formula based on conditional probability, because there is no context at the beginning of the sentence. Instead, we must use the marginal probability over words at the start of the sentence; we thus evaluate P_3(THE DOG RAN).
 Finally, the last word may be predicted using the typical case, the conditional distribution P(AWAY | DOG RAN). Putting this together with equation 12.6, we obtain:

   P(THE DOG RAN AWAY) = P_3(THE DOG RAN) \, P_3(DOG RAN AWAY) \, / \, P_2(DOG RAN)        (12.7)
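
The same computation can be reproduced directly from counts. The sketch below estimates P_3 and P_2 by maximum likelihood on a toy corpus (invented purely for illustration) and evaluates the sentence above; a real model would also need smoothing, discussed next.

from collections import Counter

corpus = "THE DOG RAN AWAY . THE DOG RAN HOME . THE CAT RAN AWAY .".split()

# Maximum likelihood estimates are just normalized counts of n-grams.
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigrams = Counter(zip(corpus, corpus[1:]))

def p3(w1, w2, w3):
    """Marginal trigram probability P3(w1 w2 w3)."""
    return trigrams[(w1, w2, w3)] / sum(trigrams.values())

def p2(w1, w2):
    """Marginal bigram probability P2(w1 w2)."""
    return bigrams[(w1, w2)] / sum(bigrams.values())

# P(THE DOG RAN AWAY) = P3(THE DOG RAN) * P3(DOG RAN AWAY) / P2(DOG RAN)
prob = p3("THE", "DOG", "RAN") * p3("DOG", "RAN", "AWAY") / p2("DOG", "RAN")
print(prob)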
 A fundamental limitation of maximum likelihood for n-gram models is that P_n as estimated from training set counts is very likely to be zero in many cases, even though the tuple (x_{t−n+1}, . . . , x_t) may appear in the test set.
 When P_{n−1} is zero, the ratio is undefined, so the model does not even produce a sensible output. When P_{n−1} is non-zero but P_n is zero, the test log-likelihood is −∞. To avoid such catastrophic outcomes, most n-gram models employ some form of smoothing.
 Smoothing techniques shift probability mass from the observed tuples to unobserved ones that are similar.
 One basic technique consists of adding non-zero probability mass to all of the possible next symbol values.
 Another very popular idea is to form a mixture model containing higher-order and lower-order n-gram models, with the higher-order models providing more capacity and the lower-order models being more likely to avoid counts of zero.
 Back-off methods look up the lower-order n-grams if the frequency of the context x_{t−1}, . . . , x_{t−n+1} is too small to use the higher-order model.
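
The first two ideas above can be sketched as follows: add-one (Laplace) smoothing, which gives every possible next symbol non-zero mass, and a simple interpolated mixture of a bigram and a unigram model. The corpus and the interpolation weight lam are arbitrary illustrative choices.

from collections import Counter

corpus = "THE DOG RAN AWAY THE DOG SAT".split()
vocab = sorted(set(corpus))
V = len(vocab)

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram_laplace(w, prev):
    """Add-one smoothing: every possible next symbol gets non-zero mass."""
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)

def p_interpolated(w, prev, lam=0.7):
    """Mixture of a higher-order (bigram) and lower-order (unigram) model."""
    p_uni = unigrams[w] / len(corpus)
    p_bi = bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] > 0 else 0.0
    return lam * p_bi + (1 - lam) * p_uni

print(p_bigram_laplace("AWAY", "CAT"))   # unseen context, still non-zero
print(p_interpolated("AWAY", "RAN"))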
 Classical n-gram models are particularly vulnerable to the curse of dimensionality.
 One way to view a classical n-gram model is that it is performing nearest-neighbor lookup.
 The problem for a language model is even more severe than usual, because any two different words have the same distance from each other in one-hot vector space.
 To overcome these problems, a language model must be able to share knowledge between one word and other semantically similar words.
 To improve the statistical efficiency of n-gram models, class-based language models introduce the notion of word categories and then share statistical strength between words that are in the same category.
Neural Language Models
 Neural language models (NLMs) are a class of language model designed to overcome the curse of dimensionality for modeling natural language sequences by using a distributed representation of words.
 Neural language models are able to recognize that two words are similar without losing the ability to encode each word as distinct from the other.
 Neural language models share statistical strength between one word (and its context) and other similar words and contexts.
 For example, if the word dog and the word cat map to representations that share many attributes, then sentences that contain the word cat can inform the predictions that will be made by the model for sentences that contain the word dog, and vice versa.
 The curse of dimensionality requires the model to generalize to a number of sentences that is exponential in the sentence length. The model counters this curse by relating each training sentence to an exponential number of similar sentences.
 Word embeddings: we view the raw symbols as points in a space of dimension equal to the vocabulary size. The word representations embed those points in a feature space of lower dimension.
 In the original space, every word is represented by a one-hot vector, so every pair of words is at Euclidean distance √2 from each other.
 In the embedding space, words that frequently appear in similar contexts (or any pair of words sharing some "features" learned by the model) are close to each other. This often results in words with similar meanings being neighbors.
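
The contrast between the one-hot space and the embedding space can be checked numerically with the small sketch below. The 2-D embedding coordinates are made up purely for illustration; a learned embedding would come from training a model.

import numpy as np

vocab = ["cat", "dog", "car"]

# One-hot space: every pair of distinct words is at Euclidean distance sqrt(2).
one_hot = np.eye(len(vocab))
print(np.linalg.norm(one_hot[0] - one_hot[1]))   # cat vs dog: 1.414...
print(np.linalg.norm(one_hot[0] - one_hot[2]))   # cat vs car: 1.414...

# Toy "learned" embedding space: semantically similar words end up close
# together, dissimilar words far apart.
embedding = {"cat": np.array([0.9, 0.8]),
             "dog": np.array([1.0, 0.7]),
             "car": np.array([-0.8, 0.1])}
print(np.linalg.norm(embedding["cat"] - embedding["dog"]))   # small distance
print(np.linalg.norm(embedding["cat"] - embedding["car"]))   # much larger distance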

 Fig:- Zooming in on specific areas of a learned word embedding space shows how semantically similar words map to representations that are close to each other.
 Neural networks in other domains also define embeddings. For example, a hidden layer of a convolutional network provides an "image embedding."
High-Dimensional Outputs
 We often want our models to produce words (rather than characters) as the fundamental unit of the output. For large vocabularies, it can be very computationally expensive to represent an output distribution over the choice of a word, because the vocabulary size is large.
 In many applications, V contains hundreds of thousands of words.
 The naive approach to representing such a distribution is to apply an affine transformation from a hidden representation to the output space, then apply the softmax function.
 Suppose we have a vocabulary V with size |V|. The weight matrix describing the linear component of this affine transformation is very large, because its output dimension is |V|.
 This imposes a high memory cost to represent the matrix, and a high computational cost to multiply by it.
 Because the softmax is normalized across all |V| outputs, the full matrix multiplication must be performed at training time as well as at test time; we cannot compute only the dot product with the weight vector for the correct output.
 The high computational costs of the output layer thus arise both at training time (to compute the likelihood and its gradient) and at test time (to compute probabilities for all or selected words).
 Suppose that h is the top hidden layer used to predict the output probabilities ŷ. If we parametrize the transformation from h to ŷ with learned weights W and learned biases b, then the affine-softmax output layer performs the following computations:

   a_i = b_i + \sum_j W_{i,j} h_j \quad \forall i \in \{1, . . . , |V|\}
   \hat{y}_i = \frac{ e^{a_i} }{ \sum_{i'=1}^{|V|} e^{a_{i'}} }

 If h contains n_h elements, then the above operation is O(|V| n_h). With n_h in the thousands and |V| in the hundreds of thousands, this operation dominates the computation of most neural language models.
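
The affine-softmax output layer and its cost are easy to see in the NumPy sketch below. The sizes n_h and vocab_size are illustrative; real vocabularies are far larger, which is exactly why this layer becomes the bottleneck.

import numpy as np

n_h, vocab_size = 256, 20_000            # hidden size and |V| (illustrative values)

rng = np.random.default_rng(0)
W = rng.standard_normal((vocab_size, n_h)) * 0.01   # |V| x n_h weight matrix
b = np.zeros(vocab_size)
h = rng.standard_normal(n_h)                        # top hidden layer

# a_i = b_i + sum_j W_ij h_j -- an O(|V| * n_h) matrix-vector product.
a = b + W @ h
# Softmax normalized across all |V| outputs (numerically stabilized).
y_hat = np.exp(a - a.max())
y_hat /= y_hat.sum()

print(y_hat.shape, y_hat.sum())          # (20000,) 1.0
print(W.size)                            # 5,120,000 parameters in W alone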
Use of a Short List
 The first neural language models (Bengio et al., 2001, 2003) dealt with the high cost of using a softmax over a large number of output words by limiting the vocabulary size to 10,000 or 20,000 words.
 Schwenk and Gauvain (2002) and Schwenk (2007) built upon this approach by splitting the vocabulary V into a shortlist L of the most frequent words (handled by the neural net) and a tail T of more rare words (handled by an n-gram model).
 To be able to combine the two predictions, the neural net also has to predict the probability that a word appearing after context C belongs to the tail list.
 This may be achieved by adding an extra sigmoid output unit to provide an estimate of P(i ∈ T | C). The extra output can then be used to achieve an estimate of the probability distribution over all words in V as follows:

   P(y = i \mid C) = \mathbf{1}_{i \in L} \, P(y = i \mid C, i \in L)\,\left(1 - P(i \in T \mid C)\right) + \mathbf{1}_{i \in T} \, P(y = i \mid C, i \in T)\, P(i \in T \mid C)

 where P(y = i | C, i ∈ L) is provided by the neural language model and P(y = i | C, i ∈ T) is provided by the n-gram model.
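
A minimal sketch of how the two predictors are combined is shown below. The functions p_neural, p_ngram_tail and p_in_tail are placeholders standing in for the neural net's shortlist softmax, the tail n-gram model, and the extra sigmoid unit; none of these names come from the slides.

def shortlist_mixture_prob(word, context,
                           shortlist, p_neural, p_ngram_tail, p_in_tail):
    """P(y = word | context) combined from a shortlist model and a tail model.

    p_neural(word, context): probability under the neural net, given word is in the shortlist L.
    p_ngram_tail(word, context): probability under the n-gram model, given word is in the tail T.
    p_in_tail(context): estimated probability that the next word belongs to the tail.
    """
    pt = p_in_tail(context)
    if word in shortlist:
        return p_neural(word, context) * (1.0 - pt)
    return p_ngram_tail(word, context) * pt

# Toy usage with made-up component models:
shortlist = {"the", "dog", "ran"}
p = shortlist_mixture_prob(
    "away", ("dog", "ran"), shortlist,
    p_neural=lambda w, c: 0.2,
    p_ngram_tail=lambda w, c: 0.05,
    p_in_tail=lambda c: 0.3)
print(p)   # 0.05 * 0.3 = 0.015 for a tail word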
 An obvious disadvantage of the short list approach is that the potential generalization advantage of the neural language models is limited to the most frequent words, where, arguably, it is the least useful.
 This disadvantage has stimulated the exploration of alternative methods to deal with high-dimensional outputs, such as the hierarchical softmax described next.
Hierarchical Softmax
 A classical approach (Goodman, 2001) to reducing the computational burden of high-dimensional output layers over large vocabulary sets V is to decompose probabilities hierarchically.
 Instead of necessitating a number of computations proportional to |V| (and also proportional to the number of hidden units, n_h), the |V| factor can be reduced to as low as log |V|.
 Bengio (2002) and Morin and Bengio (2005) introduced this factorized approach in the context of neural language models.
 One can think of this hierarchy as building categories of words, then categories of categories of words, then categories of categories of categories of words, and so on.
 These nested categories form a tree, with words at the leaves.
 In a balanced tree, the tree has depth O(log |V|).
 The probability of choosing a word is given by the product of the probabilities of choosing the branch leading to that word at every node on the path from the root of the tree to the leaf containing the word.
 Multiple paths can also be used to identify a single word, in order to better model words that have multiple meanings. Computing the probability of a word then involves summation over all of the paths that lead to that word.
 Figure 12.4: Illustration of a simple hierarchy of word categories, with 8 words w_0, . . . , w_7 organized into a three-level hierarchy. The leaves of the tree represent actual specific words. Internal nodes represent groups of words. Any node can be indexed by the sequence of binary decisions (0 = left, 1 = right) needed to reach the node from the root.
 Super-class (0) contains the classes (0, 0) and (0, 1), which respectively contain the sets of words {w_0, w_1} and {w_2, w_3}; similarly, super-class (1) contains the classes (1, 0) and (1, 1), which respectively contain the words {w_4, w_5} and {w_6, w_7}.
 If the tree is sufficiently balanced, the maximum depth (number of binary decisions) is on the order of the logarithm of the number of words |V|: the choice of one out of |V| words can be obtained by doing O(log |V|) operations (one for each of the nodes on the path from the root).
 In this example, computing the probability of a word y can be done by multiplying three probabilities, associated with the binary decisions to move left or right at each node on the path from the root to the leaf y.
 The probability of sampling an output y decomposes into a product of conditional probabilities, using the chain rule for conditional probabilities, with each node indexed by the prefix of these bits.
 For example, node (1, 0) corresponds to the prefix (b_0(w_4) = 1, b_1(w_4) = 0), and the probability of w_4 can be decomposed as follows:

   P(y = w_4) = P(b_0 = 1, b_1 = 0, b_2 = 0)
              = P(b_0 = 1) \, P(b_1 = 0 \mid b_0 = 1) \, P(b_2 = 0 \mid b_0 = 1, b_1 = 0)
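
The path-product computation can be sketched as follows for the 8-word tree of Figure 12.4. In a real model, the per-node branch probabilities would come from small logistic units conditioned on the context; here they are supplied as a plain dictionary of made-up values purely for illustration.

# Binary codes for the 8-word hierarchy of Figure 12.4 (0 = left, 1 = right).
word_code = {f"w{i}": tuple(int(b) for b in format(i, "03b")) for i in range(8)}

def word_probability(word, branch_prob):
    """P(y = word) as the product of branch probabilities along its path.

    branch_prob maps a prefix of binary decisions (the current internal node)
    to the probability of taking the "1" (right) branch at that node.
    """
    prob, prefix = 1.0, ()
    for bit in word_code[word]:
        p_right = branch_prob[prefix]
        prob *= p_right if bit == 1 else (1.0 - p_right)
        prefix = prefix + (bit,)
    return prob

# Toy branch probabilities: P(b = 1 | node), one entry per internal node.
branch_prob = {(): 0.6, (0,): 0.5, (1,): 0.4,
               (0, 0): 0.5, (0, 1): 0.5, (1, 0): 0.3, (1, 1): 0.5}

# w4 has code (1, 0, 0): P(b0=1) * P(b1=0 | b0=1) * P(b2=0 | b0=1, b1=0)
print(word_probability("w4", branch_prob))                        # 0.6 * 0.6 * 0.7 = 0.252
print(sum(word_probability(w, branch_prob) for w in word_code))   # sums to 1.0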

 An important advantage of the hierarchical softmax is that it brings computational benefits both at training time and at test time, if at test time we want to compute the probability of specific words.
 A disadvantage is that in practice the hierarchical softmax tends to give worse test results. This may be due to a poor choice of word classes.

Common Word Embedding
 The goal is to develop effective learning models in situations where labeled data is scarce but unlabeled data is plentiful.
 We approach this problem by learning embeddings, or low-dimensional representations, in an unsupervised fashion.
 Because these unsupervised models allow us to offload the heavy lifting of automated feature selection, we can use the generated embeddings to solve learning problems using smaller models that require less data.
 Fig:- General architectures for designing encoders and decoders that generate embeddings by mapping words to their respective contexts (A) or vice versa (B).

 Fig:- An example of generating one-hot vector representations for words using a simple document.
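
Along the lines of the second figure, the sketch below builds one-hot vectors for the words of a small document. The document text is invented for illustration.

import numpy as np

document = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(document))               # one index per distinct word
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return the |V|-dimensional one-hot vector for `word`."""
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

print(vocab)
print(one_hot("fox"))        # a single 1 at the position assigned to "fox"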
 We define a new network architecture that we call the autoencoder. We first take the input and compress it into a low-dimensional vector.
 This part of the network is called the encoder, because it is responsible for producing the low-dimensional embedding, or code.
 The second part of the network, instead of mapping the embedding to an arbitrary label as we would in a feed-forward network, tries to invert the computation of the first half of the network and reconstruct the original input. This piece is known as the decoder.
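
A minimal NumPy sketch of this encoder/decoder idea is given below: a single-hidden-layer autoencoder trained to reconstruct its input with plain gradient descent. The layer sizes, learning rate, and the use of a tanh encoder with a linear decoder are illustrative choices, not taken from the slides.

import numpy as np

rng = np.random.default_rng(0)
n_in, n_code = 20, 4                       # input dimension and embedding (code) size
X = rng.standard_normal((256, n_in))       # toy unlabeled data

# Encoder and decoder weights.
W_enc = rng.standard_normal((n_in, n_code)) * 0.1
W_dec = rng.standard_normal((n_code, n_in)) * 0.1
lr = 0.01

for step in range(500):
    code = np.tanh(X @ W_enc)              # encoder: compress input to a low-dimensional code
    X_hat = code @ W_dec                   # decoder: try to reconstruct the original input
    err = X_hat - X                        # reconstruction error
    # Gradients of the mean squared reconstruction loss.
    grad_dec = code.T @ err / len(X)
    grad_enc = X.T @ ((err @ W_dec.T) * (1 - code ** 2)) / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

print(((X_hat - X) ** 2).mean())           # reconstruction error decreases during training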
