Module 5


Applications of Deep Learning: Large-scale deep learning, computer vision,
speech recognition, NLP, other applications.
Introduction to Generative Adversarial Networks (GANs) and their applications
Large Scale Deep Learning
Introduction
• Network sizes have grown exponentially for
the past three decades.
• Because the size of neural networks is of
paramount importance
– deep learning requires high performance
hardware and software infrastructure
Fast Implementations
• CPU
– Exploit fixed point arithmetic in CPU families where this
offers a speedup
– Cache-friendly implementations
• GPU
– High memory bandwidth
– Much less cache per thread than a CPU
– Warps must be synchronized
• TPU
– Similar to GPU in many respects but often faster for deep learning workloads
– Often requires larger batch size
– Sometimes requires reduced precision
Distributed Implementations
• Distributed
– Multi-GPU
– Multi-machine
• Model parallelism
• Data parallelism
– Trivial at test time
– Synchronous or asynchronous SGD at train time
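To make the synchronous case concrete, here is a minimal sketch of synchronous data-parallel SGD, assuming a one-parameter linear model and two "workers" that are simulated one after the other. The function names, the toy data and the learning rate are illustrative assumptions, not part of the original slides.

```python
# Sketch of synchronous data-parallel SGD for a 1-D linear model y = w * x.
# The workers are simulated sequentially; all names here are hypothetical.

def shard_gradient(w, shard):
    """Mean gradient of the squared error 0.5 * (w*x - y)**2 over one shard."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)

def synchronous_step(w, shards, lr):
    """Every worker computes a gradient on its own shard; the average is applied once."""
    grads = [shard_gradient(w, shard) for shard in shards]  # would run in parallel
    avg_grad = sum(grads) / len(grads)
    return w - lr * avg_grad

# Toy data generated from y = 3x, split evenly across two simulated workers.
data = [(x, 3.0 * x) for x in [1.0, 2.0, 3.0, 4.0]]
shards = [data[:2], data[2:]]

w = 0.0
for _ in range(200):
    w = synchronous_step(w, shards, lr=0.05)
```

In the asynchronous variant, each worker would apply its own gradient as soon as it is ready instead of waiting for the average.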
Model Compression
• Large models often have lower test error
– Very large model trained with dropout
– Ensemble of many models
• Want small model for low resource use at test
time
• Train a small model to mimic the large one
– Obtains better test error than directly training a
small model
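The mimic idea can be sketched as follows. This is a toy illustration under stated assumptions: the "large model" is a hypothetical ensemble of three linear predictors, and the small student is a single linear unit trained only on the ensemble's outputs, never on ground-truth labels.

```python
import random

rng = random.Random(0)

# Hypothetical "large" model: an ensemble of three linear predictors (slope, offset).
ensemble = [(1.9, 0.2), (2.1, -0.1), (2.0, 0.05)]

def teacher(x):
    """Average prediction of the ensemble."""
    return sum(a * x + b for a, b in ensemble) / len(ensemble)

# Small student: one linear unit trained to match the teacher's outputs.
w, b = 0.0, 0.0
lr = 0.1
for _ in range(2000):
    x = rng.uniform(-1.0, 1.0)       # unlabeled input; no ground truth is needed
    err = (w * x + b) - teacher(x)   # gradient of the mimic loss 0.5 * err**2
    w -= lr * err * x
    b -= lr * err
```

At test time only `w` and `b` are needed, so the cost of evaluating the whole ensemble is avoided.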
Dynamic Structure
• Accelerating data-processing systems: one essential strategy
for improving their efficiency is to build dynamic structure
into the computation graph that describes the operations
needed to process an input.
• By introducing dynamic structure, data-processing systems
can dynamically determine which subset of neural networks
or other machine learning models should be executed for a
given input.
• The concept of dynamic structure within neural networks is
often termed "conditional computation" and holds
significance in optimizing the overall computational process.
• The underlying idea is to compute features only when they
are needed, potentially leading to significant speed
enhancements in data processing systems.
Cascade of Classifiers
• Cascade of Classifiers
– In scenarios where the goal is to detect rare objects or
events, the cascade strategy offers an effective approach
to accelerate inference. This approach involves a sequence
of classifiers, each with a specific role.
• Efficient Resource Allocation
– The initial classifiers in the sequence have low capacity but
are trained for high recall, ensuring that rare objects are
not falsely rejected. The final classifier, which has high
precision, confirms the presence of the object.
• Reduced Computation
– By using a cascade of classifiers, we efficiently allocate
computation resources and reject inputs as soon as any
one classifier in the sequence rejects them, avoiding the
need for full inference for every example.
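The early-rejection logic of a cascade can be sketched in a few lines. The stages below are hypothetical stand-ins (simple numeric tests rather than trained classifiers), ordered from cheapest to most selective.

```python
def run_cascade(x, stages):
    """Return (accepted, stages_evaluated); stop at the first stage that rejects."""
    for i, stage in enumerate(stages, start=1):
        if not stage(x):
            return False, i
    return True, len(stages)

# Hypothetical stages for detecting "large even numbers", cheapest first.
stages = [
    lambda x: x > 10,                   # cheap, high-recall filter
    lambda x: x % 2 == 0,               # slightly more selective
    lambda x: x > 100 and x % 2 == 0,   # final, high-precision check
]

results = {x: run_cascade(x, stages) for x in [3, 15, 50, 200]}
```

Most inputs are rejected after one or two cheap checks, so the expensive final stage runs only on the few candidates that survive.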
Cascade Strategies
• Two approaches can be taken to achieve high capacity in a
cascade.
• In one approach, each member of the cascade has high
individual capacity, ensuring the system as a whole has high
capacity.
• Alternatively, the cascade can be composed of members with
low capacity, and the overall high capacity results from the
combination of many smaller models.
• Decision Trees as Dynamic Structure
– Decision trees themselves represent dynamic structure because each
node in the tree determines which of its subtrees should be evaluated
for each input.
– To combine deep learning with dynamic structure, one approach is to
train decision trees where each node uses a neural network to make
splitting decisions.
Mixture of Experts
• Mixture of Experts
– A neural network known as the "gater" is employed to select
which of several "expert networks" will compute the output
based on the current input. This concept is known as the
"mixture of experts."
– It can be implemented as a "soft mixture of experts" or a "hard
mixture of experts."
• Accelerating Training and Inference
– The "hard mixture of experts" approach, where a single expert is
chosen for each example, significantly accelerates training and
inference without sacrificing the quality of the approximation.
• Obstacles in Dynamically Structured Systems
– One significant challenge in dynamically structured systems is
the reduced degree of parallelism resulting from different code
branches being followed for various inputs.
– This limitation hinders operations that can be described as
matrix multiplication or batch convolution on a minibatch of
examples.
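The difference between the soft and hard variants can be sketched as below. The experts and the gater are hypothetical scalar functions chosen only for illustration; in practice both would be neural networks.

```python
import math

def softmax(scores):
    """Convert gater scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical experts and gater for scalar inputs.
experts = [lambda x: 2.0 * x, lambda x: x + 10.0, lambda x: -x]

def gater_scores(x):
    return [x, 0.0, -x]  # one score per expert

def soft_moe(x):
    """Soft mixture: every expert runs and the outputs are blended."""
    weights = softmax(gater_scores(x))
    return sum(w * f(x) for w, f in zip(weights, experts))

def hard_moe(x):
    """Hard mixture: only the highest-scoring expert is evaluated."""
    scores = gater_scores(x)
    best = scores.index(max(scores))
    return experts[best](x)
```

The hard variant does less work per example, but different examples in a minibatch may select different experts, which is the source of the parallelism problem noted above.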
Applications of Deep Learning
• Deep learning is used to solve problems in
– computer vision,
– speech recognition,
– natural language processing
Computer Vision
Introduction
• Computer Vision is a field of artificial intelligence where
machines learn to see and understand the visual world.

• Computer vision has been one of the most active research areas
for deep learning applications.
• Most deep learning for computer vision is used for object
recognition, image classification, or optical character
recognition.
• Many application areas require sophisticated preprocessing
because the original input comes in a form that is difficult for
many deep learning architectures to represent.
Preprocessing
1. Standardization of Pixel Values: Images should have pixel values
standardized to a consistent range, such as [0,1] or [-1, 1]. Mixing
images with different pixel value ranges (e.g., [0,1] and [0,255]) can
lead to problems.
2. Formatting Images to Have the Same Scale: It's essential to ensure
that images have the same scale. This is often required for many
computer vision architectures. Images may need to be cropped or
scaled to fit a standard size.
3. Variable-Sized Inputs: Some convolutional models accept variably-
sized inputs and dynamically adjust the size of their pooling regions to
keep the output size constant.
4. Variable-Sized Output: Some convolutional models have variable-
sized output that automatically scales in size with the input. For
example, models that denoise or label each pixel in an image may
adjust their output size based on the input.
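As an illustration of item 1, the following sketch rescales 8-bit pixel values to the two ranges mentioned. The helper names and the assumption of 8-bit input (maximum value 255) are illustrative.

```python
def to_unit_range(pixels, max_value=255):
    """Map integer pixel values in [0, max_value] to floats in [0, 1]."""
    return [p / max_value for p in pixels]

def to_signed_range(pixels, max_value=255):
    """Map integer pixel values in [0, max_value] to floats in [-1, 1]."""
    return [2.0 * p / max_value - 1.0 for p in pixels]

row = [0, 51, 255]
unit = to_unit_range(row)       # approximately [0.0, 0.2, 1.0]
signed = to_signed_range(row)   # approximately [-1.0, -0.6, 1.0]
```

Whichever range is chosen, the same function should be applied to every image in the training and test sets.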
Preprocessing
• Data Augmentation: Dataset augmentation involves creating
variations of the training data, like rotating, flipping, or changing
the brightness of images
• Test-Time Data Variation:
– At test time, a related concept is to show the model different
versions of the same input, such as cropping an image at slightly
different positions.
– The model considers these variations and combines their
predictions to improve accuracy, akin to an ensemble approach.
• Reducing Generalization Error:
– Both dataset augmentation during training and test-time data
variation help reduce the generalization error of computer
vision models. These techniques enhance the model's ability to
perform well on diverse, real-world data by exposing it to
various perspectives and situations.
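Test-time data variation can be sketched as averaging a model's predictions over several crops of one input. Here a 1-D list stands in for an image, and `toy_model` is a hypothetical scoring function rather than a trained network.

```python
def crops(signal, size):
    """All contiguous windows of length `size` (stand-ins for shifted image crops)."""
    return [signal[i:i + size] for i in range(len(signal) - size + 1)]

def toy_model(window):
    """Hypothetical classifier: score = fraction of bright values in the crop."""
    return sum(1 for v in window if v > 0.5) / len(window)

def predict_with_tta(signal, size):
    """Average the model's predictions over every crop of the input."""
    scores = [toy_model(c) for c in crops(signal, size)]
    return sum(scores) / len(scores)

signal = [0.9, 0.8, 0.1, 0.7, 0.6]
score = predict_with_tta(signal, size=3)
```

Averaging over crops plays the same role as averaging over ensemble members: errors made on one view of the input can be corrected by the others.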
Generalization Error
• It refers to the difference in performance between a model on the
training data (the data it was trained on) and its performance on
new, unseen data (the data it was not trained on).
• The goal in machine learning is to create models that not only
perform well on the data they were trained on but also generalize
effectively to new, real-world data.
• If a model has a low generalization error, it means it can make
reliable predictions on unseen data.
• However, if the training error is low but the generalization error is
high, the model is overfitting: it is too focused on the
training data and does not perform well on unseen data.
• Balancing model complexity, training data size, and generalization
error is a fundamental challenge in machine learning
Contrast Normalization
• Contrast normalization, also known as contrast enhancement or contrast
stretching, is a fundamental image processing technique in computer
vision and digital image processing.
• Contrast refers to the magnitude of the difference between the bright and
the dark pixels in an image.
• In deep learning, contrast usually refers to the
standard deviation of the pixels in an image or region of an image.
Global contrast normalization (GCN)
• Global Contrast Normalization (GCN) is a technique used to make images
in a dataset have consistent levels of contrast
• Global contrast normalization (GCN) aims to prevent images from having
varying amounts of contrast by subtracting the mean from each image,
then rescaling it so that the standard deviation across its pixels is equal to
some constant s.
• However, if an image has very little contrast (for example, all its pixels have
nearly the same brightness), its standard deviation is close to zero, and dividing
by it mostly amplifies sensor noise or compression artifacts.
GCN
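• In practice, a regularization term λ is added to the variance estimate and the
denominator is bounded below by a small constant ε.
• Datasets of large images cropped to interesting objects are unlikely to contain
images with nearly constant intensity, so it is safe to set λ = 0 and avoid
division by 0 in extremely rare cases by setting ε to an extremely low
value like 10^−8.
• Small images cropped randomly are more likely to have nearly constant
intensity, making aggressive regularization (a larger λ) more useful.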
• The scale parameter s can usually be set to 1 or chosen to make each
individual pixel have standard deviation across examples close to 1.
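A minimal sketch of GCN on a flattened image, assuming the usual form X' = s * (X - mean(X)) / max(ε, sqrt(λ + mean((X - mean(X))^2))). The function name and the toy pixel values are illustrative; the parameters s, λ (`lam`) and ε (`eps`) follow the bullets above.

```python
import math

def global_contrast_normalize(pixels, s=1.0, lam=0.0, eps=1e-8):
    """Subtract the image mean, then rescale so the pixel standard deviation is about s."""
    mean = sum(pixels) / len(pixels)
    centered = [p - mean for p in pixels]
    variance = sum(c * c for c in centered) / len(centered)
    denom = max(eps, math.sqrt(lam + variance))  # eps guards against division by 0
    return [s * c / denom for c in centered]

image = [10.0, 20.0, 30.0, 40.0]
normalized = global_contrast_normalize(image)

# A constant image has zero contrast; eps keeps the division well defined.
flat = global_contrast_normalize([7.0, 7.0, 7.0])
```

With `lam=0` the output has mean 0 and standard deviation `s`; a larger `lam` shrinks low-contrast images toward zero instead of amplifying their noise.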
Example
• Imagine you have a collection of photos of various outdoor scenes. Some
photos were taken on a bright sunny day with vivid colors, while others
were taken on a cloudy day with dull colors.
• 1. Subtracting the Average Brightness:
– For each photo, you find its average brightness; suppose it is
50 (on a scale of 0 to 100).
– You subtract that average from the brightness of each pixel in the photo,
so the pixel values are centered around a "neutral" level of 0.
– This helps in removing any overall brightness bias.
• 2. Adjusting for Consistent Contrast:
– To ensure that all photos have a consistent contrast, you then
rescale each photo so that the spread (standard deviation) of its
pixel values is the same for every photo.
GCN and LCN
• The GCN equation is defined in terms of standard deviation rather than
L2 norm. The standard deviation includes division by the number
of pixels, so GCN based on standard deviation allows the same s to be used
regardless of image size.
• Counterintuitively, the preprocessing operation known as sphering is not
the same operation as GCN.
• Sphering does not make the data lie on a spherical shell; rather, it rescales
the principal components to have equal variance so that the multivariate
normal distribution used by PCA has spherical contours.
• Sphering is more commonly known as whitening.
• GCN often fails to highlight image features such as edges and corners when a
scene has a large dark area and a large bright area (such as a city square
with half the image in the shadow of a building).
– GCN will ensure there is a large difference between the brightness of the dark area and the
brightness of the light area but fails to ensure that edges within the dark region stand out.
• This motivates local contrast normalization(LCN)
– ensures that the contrast is normalized across each small window, rather than over the
image as a whole.
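The windowed idea behind LCN can be sketched on a single row of pixels. This is a simplified 1-D illustration with a rectangular window; the function name, the `radius` parameter and the toy row are assumptions, and real implementations work on 2-D windows (often with Gaussian weighting).

```python
import math

def local_contrast_normalize(pixels, radius=1, eps=1e-8):
    """Normalize each pixel by the mean and standard deviation of its local window."""
    out = []
    for i in range(len(pixels)):
        window = pixels[max(0, i - radius): i + radius + 1]
        mean = sum(window) / len(window)
        var = sum((p - mean) ** 2 for p in window) / len(window)
        out.append((pixels[i] - mean) / max(eps, math.sqrt(var)))
    return out

# A dark region containing a faint edge (1 -> 2), followed by a bright region.
row = [1.0, 1.0, 2.0, 2.0, 100.0, 100.0]
lcn = local_contrast_normalize(row)
```

In this toy row the faint edge inside the dark region gets a response of the same magnitude as the large dark-to-bright edge, which is the behavior GCN cannot provide.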
Speech Recognition
• Speech recognition is the ability of a machine or
program to identify and understand human speech.
• The task of speech recognition is to map an acoustic signal containing a
spoken natural language utterance into the corresponding sequence of
words intended by the speaker.
• Most speech recognition systems preprocess the input using specialized
hand-designed features, but some deep learning systems learn features
from raw input.
Speech Recognition: Introduction
• Until about 2009–2012, state-of-the-art speech recognition systems
primarily combined hidden Markov models (HMMs) and
Gaussian mixture models (GMMs).
• GMMs modeled the association between acoustic features
and phonemes, whereas HMMs modeled the sequence of
phonemes.
• Hidden Markov Models (HMMs): HMMs have been a
fundamental component of many traditional ASR systems.
They are used to model the temporal sequence of phonemes
or sub-word units in speech. Each phoneme is associated with
an HMM, and the system selects the most likely sequence of
HMMs to represent the spoken words.
The GMM-HMM Model in ASR
• The GMM-HMM model family generates acoustic waveforms
in two steps.
– First, an HMM generates a sequence of phonemes and sub-phonemic
states, including the beginning, middle, and end of each
phoneme.
– Then a GMM is employed to transform these discrete symbols into brief
segments of audio waveform.
• ASR was an early adopter of neural networks in the late 1980s
and early 1990s.
• Neural network-based ASR systems showed performance
comparable to GMM-HMM systems.
• Transition toward using neural networks for ASR occurred in
the late 2000s.
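The two-step generative view of the GMM-HMM family described above can be sketched as follows. Everything here is a toy assumption: a single phoneme with three sub-states, a fixed probability of staying in a state, and scalar "acoustic frames" standing in for brief segments of waveform.

```python
import random

rng = random.Random(0)

# Hypothetical sub-phonemic states for one phoneme, visited left to right.
STATES = ["begin", "middle", "end"]
STAY_PROB = 0.6  # probability of remaining in the current state at each frame

# Per-state Gaussian mixture: list of (weight, mean, std) components.
GMMS = {
    "begin": [(0.5, -1.0, 0.1), (0.5, -0.8, 0.1)],
    "middle": [(1.0, 0.0, 0.1)],
    "end": [(0.7, 1.0, 0.1), (0.3, 1.2, 0.1)],
}

def sample_state_sequence():
    """Step 1: the HMM generates a sequence of discrete states."""
    seq, i = [], 0
    while i < len(STATES):
        seq.append(STATES[i])
        if rng.random() >= STAY_PROB:
            i += 1  # move on to the next sub-state
    return seq

def sample_frame(state):
    """Step 2: the state's GMM turns the discrete symbol into an acoustic value."""
    weights = [w for w, _, _ in GMMS[state]]
    _, mean, std = rng.choices(GMMS[state], weights=weights, k=1)[0]
    return rng.gauss(mean, std)

states = sample_state_sequence()
frames = [sample_frame(s) for s in states]
```

Recognition runs this process in reverse: given the observed frames, the system searches for the most likely state sequence.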
Neural Networks in speech
recognition
• Transition from GMMs to Neural Networks: With the advent of
larger and deeper models and larger datasets, neural networks started
replacing Gaussian Mixture Models (GMMs) in the association of acoustic
features with phonemes or sub-phonemic states.
• Unsupervised pretraining with Restricted Boltzmann
Machines (RBMs):
– Unsupervised pretraining was used to build deep feedforward
networks.
– Each layer of these networks was initialized by training an RBM.
– These networks processed spectral acoustic representations and
predicted the conditional probabilities of Hidden Markov Model
(HMM) states for a central frame.
– This approach significantly improved recognition rates, reducing the
phoneme error rate from about 26% to 20.7% on the TIMIT dataset.
– Speaker-adaptive features contributed to further reducing error rates.
Neural Networks in speech
recognition
• Incorporation of Speaker-Adaptive Features: Further advancements
included the addition of speaker-adaptive features, which contributed to
reducing error rates.
• Transition to Large-Vocabulary Speech Recognition: The architecture
expanded from phoneme recognition to large-vocabulary speech
recognition. This involved recognizing sequences of words from a large
vocabulary.
• Shift to Modern Techniques: Over time, deep networks for speech
recognition evolved, moving away from pretraining and Boltzmann
machines. Techniques such as rectified linear units and dropout were
adopted.
• Collaboration Between Industry and Academia: Major speech research
groups in industry collaborated with academic researchers, resulting in
breakthroughs in deep learning for speech recognition. These
breakthroughs are now integrated into products like mobile phones.
• As datasets grew and deep net methods matured, it became clear that
the unsupervised pretraining phase was either unnecessary or did not
significantly improve performance.
Unprecedented Improvements:
• The introduction of deep learning in speech recognition led to
unprecedented improvements in word error rates (around
30%).
• This shift came after a decade during which traditional GMM-
HMM technology showed limited improvement despite the
growth in training data.
• Rapid adoption of deep neural networks in industrial products
• Ongoing research: the success of deep learning in speech
recognition spurred ongoing research into deep learning
algorithms and architectures for Automatic Speech
Recognition (ASR).
Deep Learning in ASR
• Innovations in Convolutional Networks
– Use of convolutional networks to replicate weights across time
and frequency.
– Treating the input spectrogram as a two-dimensional image;
with one axis representing time and the other representing the
frequency of spectral components.
• Transition to End-to-End Deep Learning
– Elimination of Hidden Markov Models (HMMs).
– Breakthrough by Graves et al. (2013) with deep LSTM RNN.
– Deep RNNs introducing depth due to layer stacking and time
unfolding.
– This work achieved a remarkable phoneme error rate of 17.7%
on the TIMIT dataset.
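Treating the spectrogram as a two-dimensional image means one small kernel is reused at every time and frequency position. A minimal sketch, assuming a tiny hand-written spectrogram and a hypothetical kernel that responds to a rise in energy along the time axis:

```python
def conv2d_valid(image, kernel):
    """Slide one shared kernel over every (row, column) position ("valid" mode)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

# Hypothetical spectrogram: rows are frequency bins, columns are time frames.
spectrogram = [
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 0.0],
]
onset_kernel = [[-1.0, 1.0]]  # the same weights are applied at every position
response = conv2d_valid(spectrogram, onset_kernel)
```

The kernel fires where energy appears in the two upper frequency bins, regardless of where in time or frequency that onset occurs.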
DL Applications: NLP
Generative Adversarial Networks (GANs)
• Generative Adversarial Networks
• Generative Models
– We try to learn the underlying distribution from which
our dataset comes.
– E.g., Variational Autoencoders (VAEs)
• Adversarial Training
– GANs are made up of two competing networks
(adversaries) that try to beat each other
• Networks
– Neural Networks
Introduction
• Generative Adversarial Networks (GANs) are a powerful class
of neural networks that are used for unsupervised learning.
• They were introduced by Ian J. Goodfellow and colleagues in 2014.
• GANs are made up of two neural network models that
compete with each other and are able to analyze, capture
and copy the variations within a dataset.
• GANs can generate new data resembling whatever data we feed
them, through a learn-generate-improve cycle.
Introduction
• GANs are generative models that generate new samples based on
learning the regularities or patterns in input data.
– Note: generative modeling is an unsupervised learning task in machine
learning.
• GANs have a clever way of training a generative model by framing
the problem as a supervised learning problem with two sub-models
or neural networks:
– generator model: is trained to generate new samples
– discriminator model: tries to classify examples as either real (from the
domain) or fake (generated)
– These two networks compete against each other
• Applications of GANs:
– Image Super-Resolution
– Creating Art.
– Image-to-Image Translation
– Data Augmentation
– Music and Voice Generation
– Text to Image generation
Working
• Generator:
– The generator takes random noise as input and generates
data samples.
– These generated samples start off as random noise but
gradually become more like the real data from the training
set as the GAN is trained.
– It learns to map the random noise to data samples in a way
that, ideally, it becomes indistinguishable from real data.
• Discriminator:
– The discriminator acts as a classifier.
– Its purpose is to distinguish between real data samples from
the training set and the fake data generated by the
generator.
– The discriminator is trained on real data and the generated
data and learns to assign high probabilities to real data and
low probabilities to generated data.
Working
• The training process of GANs can be described as a two-player
minimax game
• The generator's objective is to generate data that is
convincing enough to fool the discriminator.
– Its loss function is minimized when the discriminator classifies the
generated data as real.
• The discriminator's objective is to become better at
distinguishing real data from fake data.
– Its loss function is minimized when it correctly classifies real data as
real and generated data as fake.
• During training, the generator and discriminator play this
game in a competitive manner.
• The generator tries to improve its ability to generate realistic
data, while the discriminator aims to improve its ability to
differentiate between real and fake data.
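The two objectives can be written down directly, assuming the standard minimax value function V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))], which the discriminator maximizes and the generator minimizes. The discriminator outputs below are made-up probabilities used only to compare loss values.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Negative of V(D, G): low when D outputs ~1 on real and ~0 on generated data."""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return -(real_term + fake_term)

def generator_loss(d_fake):
    """Minimax generator loss: decreases as D is fooled into scoring fakes as real."""
    return sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)

# Hypothetical discriminator outputs (probability that a sample is real).
confident_d = discriminator_loss(d_real=[0.9, 0.95], d_fake=[0.1, 0.05])
confused_d = discriminator_loss(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
fooled_g = generator_loss(d_fake=[0.9, 0.9])
caught_g = generator_loss(d_fake=[0.1, 0.1])
```

A discriminator that cannot tell real from fake outputs 0.5 everywhere, giving a loss of 2 log 2; the generator's loss is lower when its samples receive high "real" probabilities.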
Steps in training
Steps
• Define GAN architecture based on application
• Train the discriminator to distinguish real from fake data using the current
ability of the generator.
• Train the generator to produce fake data that can fool the
discriminator.
• Continue alternating discriminator and generator training for multiple
epochs, until generated samples are frequently misclassified as real
by the discriminator.
• Save the generator model to create new, realistic fake data.
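The steps above can be sketched as an alternating training loop. This is a deliberately tiny, hypothetical setup and not a practical GAN: the real data are scalars drawn from a Gaussian, the generator is G(z) = a*z + b, the discriminator is D(x) = sigmoid(w*x + c), gradients are written out by hand, and the generator uses the common non-saturating loss -log D(G(z)) as an assumption.

```python
import math
import random

rng = random.Random(0)

def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

REAL_MEAN, REAL_STD = 4.0, 0.5  # hypothetical real-data distribution
a, b = 1.0, 0.0                 # generator G(z) = a*z + b
w, c = 0.0, 0.0                 # discriminator D(x) = sigmoid(w*x + c)
lr, batch = 0.05, 16

for step in range(500):
    # Train the discriminator against the current generator.
    gw = gc = 0.0
    for _ in range(batch):
        x_real = rng.gauss(REAL_MEAN, REAL_STD)
        x_fake = a * rng.gauss(0.0, 1.0) + b
        d_real, d_fake = sigmoid(w * x_real + c), sigmoid(w * x_fake + c)
        # Gradients of -log D(x_real) - log(1 - D(x_fake)) w.r.t. w and c.
        gw += -(1.0 - d_real) * x_real + d_fake * x_fake
        gc += -(1.0 - d_real) + d_fake
    w -= lr * gw / batch
    c -= lr * gc / batch

    # Train the generator to fool the updated discriminator.
    ga = gb = 0.0
    for _ in range(batch):
        z = rng.gauss(0.0, 1.0)
        d_fake = sigmoid(w * (a * z + b) + c)
        # Gradients of -log D(G(z)) w.r.t. a and b.
        ga += -(1.0 - d_fake) * w * z
        gb += -(1.0 - d_fake) * w
    a -= lr * ga / batch
    b -= lr * gb / batch

def generate(n):
    """Use the trained generator on its own to sample new data."""
    return [a * rng.gauss(0.0, 1.0) + b for _ in range(n)]

samples = generate(5)
```

In a setup like this the generator's offset `b` tends to drift toward the real mean, but simple GAN dynamics can oscillate rather than settle, so no particular final value should be expected.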
Why we need GAN
• Most of the mainstream neural nets can be easily fooled into
misclassifying things by adding only a small amount of noise
into the original data.
• Sometimes, after the noise is added, the model has higher
confidence in the wrong prediction than it had in the
correct one.
• One reason for this vulnerability is that most machine learning
models learn from a limited amount of data, which is a major
drawback because it makes them prone to overfitting.
• Also, the mapping between the input and the output is almost
linear and even a small change in a point in the feature space
might lead to misclassification of data.
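The effect of near-linearity can be illustrated with a purely linear classifier. The sketch below perturbs each feature by a small amount in the direction opposite to the sign of its weight, in the spirit of the fast gradient sign method; the weights, the input and the bound `eps` are all hypothetical.

```python
def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def score(w, x):
    """Linear classifier: positive score means class A, negative means class B."""
    return sum(wi * xi for wi, xi in zip(w, x))

n = 100
w = [1.0 if i % 2 == 0 else -1.0 for i in range(n)]  # hypothetical weights
x = [0.05 * wi for wi in w]                           # input classified as positive
eps = 0.1                                             # per-feature perturbation bound

# Move every feature by eps against the sign of its weight.
x_adv = [xi - eps * sign(wi) for xi, wi in zip(x, w)]

clean_score = score(w, x)      # about  5.0
adv_score = score(w, x_adv)    # about -5.0
```

No single feature changes by more than `eps`, yet the many small changes add up across 100 dimensions and flip the decision.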
