The document discusses applications of deep learning for computer vision tasks. It covers preprocessing techniques like standardizing pixel values and image formatting. Data augmentation and test-time variations are used to reduce generalization error. Global contrast normalization aims to make image contrast consistent across datasets.
Topics: deep learning for computer vision, speech recognition, NLP, and other applications; introduction to Generative Adversarial Networks (GANs) and their applications.

Large Scale Deep Learning

Introduction
• Network sizes have grown exponentially over the past three decades.
• Because the size of a neural network is of paramount importance, deep learning requires high-performance hardware and software infrastructure.

Fast Implementations
• CPU
– Exploit fixed-point arithmetic in CPU families where this offers a speedup
– Cache-friendly implementations
• GPU
– High memory bandwidth
– No cache
– Warps must be synchronized
• TPU
– Similar to a GPU in many respects, but faster
– Often requires a larger batch size
– Sometimes requires reduced precision

Distributed Implementations
• Distributed
– Multi-GPU
– Multi-machine
• Model parallelism
• Data parallelism
– Trivial at test time
– Synchronous or asynchronous SGD at train time

Model Compression
• Large models often have lower test error:
– a very large model trained with dropout, or
– an ensemble of many models.
• We want a small model for low resource use at test time.
• Train a small model to mimic the large one; this obtains better test error than directly training a small model.

Dynamic Structure
• Accelerating data-processing systems: one essential strategy for improving their efficiency is to incorporate dynamic structure into the computation graph that outlines the operations needed to process the input.
• By introducing dynamic structure, data-processing systems can dynamically determine which subset of neural networks or other machine learning models should be executed for a given input.
• Dynamic structure within neural networks is often termed "conditional computation" and is significant for optimizing the overall computational process.
• The underlying idea is to compute features only when they are needed, potentially leading to significant speedups in data processing.

Cascade of Classifiers
• Cascade of Classifiers
– When the goal is to detect rare objects or events, the cascade strategy offers an effective way to accelerate inference. It uses a sequence of classifiers, each with a specific role (see the sketch after this section).
• Efficient Resource Allocation
– The initial classifiers in the sequence have low capacity but are trained for high recall, ensuring that rare objects are not falsely rejected. The final classifier has high precision and confirms the presence of the object.
• Reduced Computation
– With a cascade, computation is allocated efficiently: an input is rejected as soon as any classifier in the sequence rejects it, avoiding full inference for every example.

Cascade Strategies
• Two approaches can be taken to achieve high capacity in a cascade.
• In one approach, each member of the cascade individually has high capacity, so the system as a whole has high capacity.
• Alternatively, the cascade can be composed of members with low capacity, and the overall high capacity results from combining many smaller models.
• Decision Trees as Dynamic Structure
– Decision trees are themselves a form of dynamic structure, because each node determines which of its subtrees should be evaluated for each input.
– To combine deep learning with dynamic structure, one approach is to train decision trees in which each node uses a neural network to make its splitting decision.
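As a concrete illustration, here is a minimal sketch of cascaded inference. The stages list, the predict_proba interface, and the per-stage thresholds are hypothetical names, not from the slides; the point is only the early-rejection control flow.

# A minimal sketch of cascaded inference, assuming each stage exposes a
# predict_proba(x) method returning P(object present). Early stages are
# cheap and tuned for high recall; the last stage is costly, high precision.
def cascade_predict(stages, thresholds, x):
    """Return True only if every stage of the cascade accepts the input.

    stages     -- list of classifiers, cheapest first
    thresholds -- per-stage acceptance thresholds on P(object present)
    """
    for stage, threshold in zip(stages, thresholds):
        if stage.predict_proba(x) < threshold:
            return False  # reject early: later, costlier stages never run
    return True  # every stage accepted: the rare object is confirmed

Because most inputs do not contain the rare object, most of them are rejected by the cheap early stages, which is where the speedup comes from.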
Mixture of Experts
• Mixture of Experts
– A neural network known as the "gater" selects which of several "expert networks" will compute the output for the current input. This concept is known as the "mixture of experts."
– It can be implemented as a "soft mixture of experts" or a "hard mixture of experts."
• Accelerating Training and Inference
– The "hard mixture of experts" approach, where a single expert is chosen for each example, significantly accelerates training and inference without sacrificing the quality of the approximation (see the sketch after this section).
• Obstacles in Dynamically Structured Systems
– One significant challenge in dynamically structured systems is the reduced degree of parallelism that results from different inputs following different code branches.
– This limits how much of the computation can be expressed as matrix multiplication or batch convolution over a minibatch of examples.
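A minimal PyTorch sketch of hard gating. The class name HardMoE and the dimensions in_dim, out_dim, n_experts are illustrative, not from the slides.

import torch
import torch.nn as nn

class HardMoE(nn.Module):
    """Hard mixture of experts: one expert runs per example."""
    def __init__(self, in_dim=16, out_dim=8, n_experts=4):
        super().__init__()
        self.gater = nn.Linear(in_dim, n_experts)  # scores each expert
        self.experts = nn.ModuleList(
            nn.Linear(in_dim, out_dim) for _ in range(n_experts)
        )

    def forward(self, x):                          # x: (batch, in_dim)
        # Hard gating: run only the highest-scoring expert per example.
        choice = self.gater(x).argmax(dim=-1)      # (batch,)
        return torch.stack(
            [self.experts[c](xi) for xi, c in zip(x, choice.tolist())]
        )

Note how the per-example Python loop is exactly the parallelism obstacle described above: examples routed to different experts can no longer share one batched matrix multiplication. (Training the gater through a hard argmax also needs special care, since argmax is not differentiable.)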
Applications of Deep Learning
• Deep learning is used to solve applications in
– computer vision,
– speech recognition,
– natural language processing.

Computer Vision

Introduction
• Computer vision is a field of artificial intelligence where machines learn to see and understand the visual world.
• Computer vision has been one of the most active research areas for deep learning applications.
• Most deep learning for computer vision is used for object recognition, image classification, or optical character recognition.
• Many application areas require sophisticated preprocessing, because the original input comes in a form that is difficult for many deep learning architectures to represent.
Preprocessing
1. Standardization of Pixel Values: Images should have pixel values standardized to a consistent range, such as [0, 1] or [-1, 1]. Mixing images with different pixel value ranges (e.g., [0, 1] and [0, 255]) can lead to problems.
2. Formatting Images to Have the Same Scale: It is essential to ensure that images have the same scale; many computer vision architectures require this. Images may need to be cropped or scaled to fit a standard size.
3. Variable-Sized Inputs: Some convolutional models accept variably sized inputs and dynamically adjust the size of their pooling regions to keep the output size constant.
4. Variable-Sized Output: Some convolutional models have variably sized output that automatically scales with the input. For example, models that denoise or label each pixel of an image adjust their output size based on the input.

Preprocessing
• Data Augmentation: Dataset augmentation involves creating variations of the training data, such as rotating, flipping, or changing the brightness of images.
• Test-Time Data Variation:
– At test time, a related concept is to show the model different versions of the same input, such as cropping an image at slightly different positions.
– The model considers these variations and combines their predictions to improve accuracy, akin to an ensemble approach.
• Reducing Generalization Error:
– Both dataset augmentation during training and test-time data variation help reduce the generalization error of computer vision models. These techniques enhance the model's ability to perform well on diverse, real-world data by exposing it to various perspectives and situations.

Generalization Error
• It refers to the difference in performance between a model on the training data (the data it was trained on) and its performance on new, unseen data (the data it was not trained on).
• The goal in machine learning is to create models that not only perform well on the data they were trained on but also generalize effectively to new, real-world data.
• If a model has a low generalization error, it means it can make reliable predictions on unseen data.
• However, if a model is too focused on the training data, it may be overfitting: it performs well on the training data but poorly on unseen data.
• Balancing model complexity, training data size, and generalization error is a fundamental challenge in machine learning.

Contrast Normalization
• Contrast normalization, also known as contrast enhancement or contrast stretching, is a fundamental image processing technique in computer vision and digital image processing.
• Contrast refers to the magnitude of the difference between the bright and the dark pixels in an image.
• In deep learning, contrast usually refers to the standard deviation of the pixels in an image or region of an image.

Global Contrast Normalization (GCN)
• Global contrast normalization (GCN) is a technique used to make the images in a dataset have consistent levels of contrast.
• GCN aims to prevent images from having varying amounts of contrast by subtracting the mean from each image, then rescaling it so that the standard deviation across its pixels equals some constant s (see the equation and sketch below).
• However, if an image has very little contrast, or if all its pixels have nearly the same brightness, GCN can make things worse by amplifying noise.
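The next slide refers to an equation with parameters s, λ, and ε that does not appear in the extracted text. Following the standard definition (Goodfellow et al., Deep Learning, eq. 12.3), GCN maps an r × c color image X to

    X'_{i,j,k} = s \cdot \frac{X_{i,j,k} - \bar{X}}{\max\left\{ \epsilon,\; \sqrt{\lambda + \frac{1}{3rc} \sum_{i=1}^{r} \sum_{j=1}^{c} \sum_{k=1}^{3} (X_{i,j,k} - \bar{X})^2} \right\}}, \qquad \bar{X} = \frac{1}{3rc} \sum_{i,j,k} X_{i,j,k}.

A minimal NumPy sketch of the same operation; the function name and default values are illustrative, not from the slides.

import numpy as np

def global_contrast_normalize(X, s=1.0, lam=0.0, eps=1e-8):
    """X: image array of any shape; returns the GCN-normalized image."""
    X = X.astype(float)
    X = X - X.mean()                           # subtract the image mean
    contrast = np.sqrt(lam + np.mean(X ** 2))  # regularized std. deviation
    return s * X / max(eps, contrast)          # rescale to contrast s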
GCN
• For large images cropped to interesting objects, it is safe to set λ = 0 and avoid division by 0 in the extremely rare cases by setting ε to an extremely small value such as 10^-8.
• Small images cropped randomly are more likely to have nearly constant intensity, making aggressive regularization (a larger λ) more useful.
• The scale parameter s can usually be set to 1, or chosen so that each individual pixel has a standard deviation across examples close to 1.

Example
• Imagine you have a collection of photos of various outdoor scenes. Some photos were taken on a bright sunny day with vivid colors, while others were taken on a cloudy day with dull colors.
• 1. Subtracting the Average Brightness:
– You find the average brightness of all the photos; suppose it is a "neutral" brightness level of 50 (on a scale of 0 to 100).
• 2. Adjusting for Consistent Color Intensity:
– To ensure that all photos have a consistent color intensity, you subtract 50 from the brightness of each pixel in each photo so that they are centered around this "neutral" level.
– This helps remove any overall brightness bias.

GCN and LCN
• It is preferable to define GCN in terms of standard deviation rather than L2 norm: the standard deviation includes division by the number of pixels, so GCN based on standard deviation allows the same s to be used regardless of image size.
• Counterintuitively, there is a preprocessing operation known as sphering.
• It is not the same operation as GCN on a spherical shell; rather, it rescales the principal components to have equal variance, so that the multivariate normal distribution used by PCA has spherical contours.
• Sphering is more commonly known as whitening.
• GCN fails to highlight image features such as edges and corners when the scene contains a large dark area and a large bright area (such as a city square with half the image in the shadow of a building).
– GCN will ensure there is a large difference between the brightness of the dark area and the brightness of the light area, but it fails to ensure that edges within the dark region stand out.
• This motivates local contrast normalization (LCN), which ensures that contrast is normalized across each small window rather than over the image as a whole (see the sketch after this section).
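A minimal sketch of LCN over square windows, assuming SciPy is available; the window size and ε are illustrative choices, not from the slides.

import numpy as np
from scipy.ndimage import uniform_filter

def local_contrast_normalize(img, size=9, eps=1e-8):
    """img: 2-D grayscale array; normalize contrast within each window."""
    img = img.astype(float)
    local_mean = uniform_filter(img, size)       # windowed mean
    centered = img - local_mean                  # remove local brightness
    local_std = np.sqrt(uniform_filter(centered ** 2, size))
    return centered / np.maximum(local_std, eps) # unit contrast per window

Because the mean and standard deviation are computed per window rather than per image, edges inside the dark half of the city-square example are amplified instead of being washed out.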
Speech Recognition
• Speech recognition is the ability of a machine or program to identify and understand human speech.
• The task of speech recognition is to map an acoustic signal containing a spoken natural-language utterance to the corresponding sequence of words intended by the speaker.
• Most speech recognition systems preprocess the input using specialized hand-designed features, but some deep learning systems learn features from raw input.

Speech Recognition: Introduction
• 2009–2012: state-of-the-art speech recognition systems primarily combined hidden Markov models (HMMs) and Gaussian mixture models (GMMs).
• GMMs modeled the association between acoustic features and phonemes, whereas HMMs modeled the sequence of phonemes.
• Hidden Markov Models (HMMs): HMMs have been a fundamental component of many traditional ASR systems. They are used to model the temporal sequence of phonemes or sub-word units in speech. Each phoneme is associated with an HMM, and the system selects the most likely sequence of HMMs to represent the spoken words.

The GMM-HMM Model in ASR
• The GMM-HMM model family generates acoustic waveforms in two steps:
– First, an HMM generates a sequence of phonemes and sub-phonemic states, including the beginning, middle, and end of each phoneme.
– Second, a GMM transforms these discrete symbols into brief segments of audio waveform.
• ASR was an early adopter of neural networks in the late 1980s and early 1990s.
• Neural network-based ASR systems showed performance comparable to GMM-HMM systems.
• The transition toward using neural networks for ASR occurred in the late 2000s.

Neural Networks in Speech Recognition
• Transition from GMMs to Neural Networks: With the advent of larger and deeper models and larger datasets, neural networks started replacing Gaussian mixture models (GMMs) in associating acoustic features with phonemes or sub-phonemic states.
• Unsupervised Pretraining with Restricted Boltzmann Machines (RBMs):
– Unsupervised pretraining was used to build deep feedforward networks.
– Each layer of these networks was initialized by training an RBM.
– These networks processed spectral acoustic representations and predicted the conditional probabilities of hidden Markov model (HMM) states for a central frame.
– This approach significantly improved recognition rates, reducing the phoneme error rate from 26% to 20.7% on datasets like TIMIT.

Neural Networks in Speech Recognition
• Incorporation of Speaker-Adaptive Features: Further advancements included the addition of speaker-adaptive features, which contributed to reducing error rates.
• Transition to Large-Vocabulary Speech Recognition: The architecture expanded from phoneme recognition to large-vocabulary speech recognition, which involves recognizing sequences of words from a large vocabulary.
• Shift to Modern Techniques: Over time, deep networks for speech recognition evolved, moving away from pretraining and Boltzmann machines. Techniques such as rectified linear units and dropout were adopted.
• Collaboration Between Industry and Academia: Major industrial speech research groups collaborated with academic researchers, resulting in breakthroughs in deep learning for speech recognition that are now integrated into products such as mobile phones.
• As datasets grew and deep network methods matured, it became clear that the unsupervised pretraining phase was either unnecessary or did not significantly improve performance.

Unprecedented Improvements
• The introduction of deep learning in speech recognition led to unprecedented improvements in word error rates (around 30%).
• This shift came after a decade during which traditional GMM-HMM technology showed limited improvement despite the growth in training data.
• Deep neural networks were rapidly adopted in industrial products.
• Ongoing research: the success of deep learning in speech recognition spurred ongoing research into deep learning algorithms and architectures for automatic speech recognition (ASR).

Deep Learning in ASR
• Innovations in Convolutional Networks
– Use of convolutional networks that replicate weights across time and frequency.
– Treating the input spectrogram as a two-dimensional image, with one axis representing time and the other representing the frequency of spectral components (see the sketch after this section).
• Transition to End-to-End Deep Learning
– Elimination of hidden Markov models (HMMs).
– Breakthrough by Graves et al. (2013) with deep LSTM RNNs.
– Deep RNNs introduce depth both through layer stacking and through unfolding in time.
– This work achieved a remarkable phoneme error rate of 17.7% on the TIMIT dataset.
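To make the spectrogram-as-image idea concrete, here is a minimal sketch assuming the librosa library is available; the file name and parameter values are illustrative, not from the slides.

import numpy as np
import librosa

# Load an utterance and compute a log-mel spectrogram.
y, sr = librosa.load("utterance.wav", sr=16000)   # hypothetical input file
S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)
log_S = librosa.power_to_db(S, ref=np.max)        # shape: (n_mels, n_frames)
# log_S is now a 2-D array that a convolutional network can treat as an
# image, with one axis representing time and the other frequency.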
DL Applications: GANs
Generative Adversarial Networks
GANs
• Generative Adversarial Networks
• Generative Models
– We try to learn the underlying distribution from which our dataset comes.
– E.g., Variational Autoencoders (VAEs).
• Adversarial Training
– GANs are made up of two competing networks (adversaries) that try to beat each other.
• Networks
– Neural networks.

Introduction
• Generative Adversarial Networks (GANs) are a powerful class of neural networks used for unsupervised learning.
• They were developed and introduced by Ian J. Goodfellow in 2014.
• GANs are made up of a system of two competing neural network models that are able to analyze, capture, and copy the variations within a dataset.
• GANs can create new samples of whatever we feed them, following a learn-generate-improve cycle.

Introduction
• GANs are generative models that generate new samples by learning the regularities or patterns in the input data.
– Note that generative modeling is an unsupervised learning task in machine learning.
• GANs offer a clever way of training a generative model by framing the problem as a supervised learning problem with two sub-models (neural networks):
– the generator model, which is trained to generate new samples;
– the discriminator model, which tries to classify examples as either real (from the domain) or fake (generated).
– These two networks compete against each other.
• Applications of GANs:
– Image super-resolution
– Creating art
– Image-to-image translation
– Data augmentation
– Music and voice generation
– Text-to-image generation

Working
• Generator:
– The generator takes random noise as input and generates data samples.
– These generated samples start off as random noise but gradually become more like the real data from the training set as the GAN is trained.
– It learns to map the random noise to data samples in a way that, ideally, becomes indistinguishable from real data.
• Discriminator:
– The discriminator acts as a classifier.
– Its purpose is to distinguish between real data samples from the training set and the fake data generated by the generator.
– The discriminator is trained on both real and generated data, and learns to assign high probabilities to real data and low probabilities to generated data.

Working
• The training process of GANs can be described as a two-player minimax game (see the value function below).
• The generator's objective is to generate data that is convincing enough to fool the discriminator.
– Its loss is minimized when the discriminator classifies the generated data as real.
• The discriminator's objective is to become better at distinguishing real data from fake data.
– Its loss is minimized when it correctly classifies real data as real and generated data as fake.
• During training, the generator and discriminator play this game competitively.
• The generator tries to improve its ability to generate realistic data, while the discriminator aims to improve its ability to differentiate between real and fake data.
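This minimax game is conventionally written with the value function from Goodfellow et al. (2014):

    \min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]

where D(x) is the discriminator's estimated probability that x is real, and G(z) maps noise z to a generated sample.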
Steps in Training
• Define the GAN architecture based on the application.
• Train the discriminator to distinguish real from fake data, using the current ability of the generator.
• Train the generator to produce fake data that can fool the discriminator.
• Continue discriminator and generator training for multiple epochs, until generated images are (incorrectly) classified as real by the discriminator.
• Save the generator model to create new, realistic fake data. (A minimal training-step sketch follows at the end of this section.)

Why We Need GANs
• Most mainstream neural networks can be easily fooled into misclassifying things by adding only a small amount of noise to the original data.
• Sometimes, after noise is added, the model has higher confidence in the wrong prediction than it had when it predicted correctly.
• The reason for such adversarial vulnerability is that most machine learning models learn from a limited amount of data, which is a huge drawback, as it makes them prone to overfitting.
• Also, the mapping between the input and the output is almost linear, so even a small change at a point in the feature space can lead to misclassification of data.
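A minimal PyTorch sketch of a single training step implementing the steps above. G, D, the optimizers, and noise_dim are placeholders for an actual setup, and D is assumed to output a probability in (0, 1) with shape (batch, 1).

import torch
import torch.nn as nn

bce = nn.BCELoss()

def train_step(G, D, opt_G, opt_D, real, noise_dim=64):
    batch = real.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

    # 1) Train the discriminator: label real data 1, generated data 0.
    fake = G(torch.randn(batch, noise_dim)).detach()  # freeze G here
    loss_D = bce(D(real), ones) + bce(D(fake), zeros)
    opt_D.zero_grad()
    loss_D.backward()
    opt_D.step()

    # 2) Train the generator: make D label its fakes as real.
    fake = G(torch.randn(batch, noise_dim))
    loss_G = bce(D(fake), ones)
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()
    return loss_D.item(), loss_G.item()

In a full training script this step would be called once per minibatch, alternating the discriminator and generator updates for multiple epochs exactly as listed above, after which the generator is saved.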