Unit-2 L2

REGULARIZATION

DATA SET AUGMENTATION


• Regularization: Overview, Parameter Penalties, Norm Penalties as Constrained
Optimization, Regularization and Underconstrained Problems, Data Augmentation, Noise
Robustness, Batch Normalization, Semi-Supervised Learning, Multi-Task Learning, Early
Stopping, Parameter Tying and Parameter Sharing, Sparse Representations, Bagging, Dropout.
Tuning Neural Networks, Hyperparameters
DATA AUGMENTATION

• More data is better:


• The best way to make an ML model generalize better is to
train it on more data
• In practice, the amount of data available is limited
• One way around this is to create synthesized data
• For some ML tasks, it is straightforward to synthesize
data
DATA AUGMENTATION
• Augmentation for classification:
• Data augmentation is easiest for classification
• A classifier takes a high-dimensional input x and
summarizes it with a single category identity y
• The main task of a classifier is to be invariant to a wide
variety of transformations
• New samples (x, y) can be generated just by transforming
the inputs
• This approach is not easily generalized to other problems
• For a density estimation problem,
• it is not possible to generate new data without first
solving the density estimation problem
DATA AUGMENTATION
• Effective for Object Recognition:
• Data set augmentation is very effective for the
classification problem of object recognition
• Images are high-dimensional and include many factors
of variation, which may easily be simulated
• Translating the images a few pixels can greatly
improve performance
• Even when the model is designed to be invariant using
convolution and pooling
• Rotating and scaling the images are also effective
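As a sketch of the idea (not from the source; the function names and the zero-padding choice are my own), translating labeled images by a few pixels yields new (x, y) pairs with the label left unchanged:

```python
import numpy as np

def translate(img, dy, dx):
    """Shift a 2-D image by (dy, dx) pixels, padding vacated pixels with zeros."""
    out = np.zeros_like(img)
    h, w = img.shape
    ys = slice(max(dy, 0), min(h + dy, h))
    xs = slice(max(dx, 0), min(w + dx, w))
    ys_src = slice(max(-dy, 0), min(h - dy, h))
    xs_src = slice(max(-dx, 0), min(w - dx, w))
    out[ys, xs] = img[ys_src, xs_src]
    return out

def augment(images, labels, shifts=((0, 1), (0, -1), (1, 0), (-1, 0))):
    """Expand a labeled image set with small translations; labels stay the same."""
    new_x, new_y = list(images), list(labels)
    for img, y in zip(images, labels):
        for dy, dx in shifts:
            new_x.append(translate(img, dy, dx))
            new_y.append(y)  # translation does not change the class
    return np.array(new_x), np.array(new_y)
```

With four one-pixel shifts, each original example yields four extra training examples.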
DATA AUGMENTATION
• Main data augmentation methods: translation, rotation,
scaling, flipping, and noise injection
DATA AUGMENTATION
• Caution in Data Augmentation:
• Do not apply transformations that would change the class
• OCR example: ‘b’ vs ‘d’ and ‘6’ vs ‘9’
• Horizontal flips and 180-degree rotations are not appropriate
here

• Some transformations are not easy to perform


• Out-of-plane rotation cannot be implemented as a simple
geometric operation on pixels
REGULARIZATION
NOISE ROBUSTNESS
NOISE ROBUSTNESS
• Noise injection
• Noise applied to the inputs is a form of data augmentation
• For some models, the addition of noise with infinitesimal
variance at the input is equivalent to imposing a penalty on
the norm of the weights, e.g., λw^T w
• Noise applied to hidden units
• Noise injection can be much more powerful than simply
shrinking the parameters
• Noise applied to hidden units is so important that it merits its
own separate discussion
• Dropout is the main development of this approach
NOISE ROBUSTNESS
• Adding Noise to Weights
• This technique is primarily used with RNNs
• This can be interpreted as a stochastic implementation
of Bayesian inference over the weights
• Bayesian learning considers the model weights to be uncertain
and representable via a probability distribution p(w) that
reflects this uncertainty
• Adding noise to weights is a practical, stochastic way to
reflect this uncertainty
NOISE ROBUSTNESS
• Adding Noise to Weights
• Noise applied to weights is equivalent to
traditional regularization, encouraging stability
• This can be seen in a regression setting
• Train to map x to a scalar ŷ(x) using least squares between the model
prediction and the true values y:

J = E_{p(x,y)}[ (ŷ(x) − y)² ]

• We perturb the weights with ε_W ~ N(0, ηI), giving the noisy objective

J̃ = E_{p(x,y), ε_W}[ (ŷ_{ε_W}(x) − y)² ]

• For small η, this is equivalent to adding a regularization term
η E_{p(x,y)}[ ‖∇_W ŷ(x)‖² ]
• It encourages the parameters to move to regions where small
perturbations of the weights have little influence on the
output
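As a hypothetical numerical check of this equivalence: for a linear model ŷ = w^T x, the gradient of the output with respect to the weights is x itself, so perturbing the weights with variance-η noise should add approximately η‖x‖² to the expected squared error. A minimal Monte Carlo sketch (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear model y_hat = w^T x; perturb weights with eps ~ N(0, eta * I).
w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 2.0, -1.0])
y = 0.7
eta = 1e-2

# Monte Carlo estimate of E[((w + eps)^T x - y)^2] over the weight noise.
eps = rng.normal(0.0, np.sqrt(eta), size=(200_000, 3))
mc = np.mean(((w + eps) @ x - y) ** 2)

# Analytic prediction: plain squared error plus eta * ||grad_w y_hat||^2,
# and for a linear model grad_w y_hat = x.
analytic = (w @ x - y) ** 2 + eta * (x @ x)
```

The two quantities agree up to Monte Carlo error, matching the claim that weight noise behaves like a gradient-norm penalty.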
NOISE ROBUSTNESS
• Injecting Noise at Output Targets:
• Most datasets have some mistakes in y labels
• It is harmful to maximize log p(y|x) when y is a mistake
• To prevent this, we can explicitly model the noise on the labels
• Ex: we assume training set label y is correct with
probability 1-ε, and otherwise any of the other labels
may be correct
• This can be incorporated into the cost function
• Ex: Label smoothing regularizes a model based on a
softmax with k output values by replacing the hard 0
and 1 classification targets with targets of ε/(k-1) and 1-
ε, respectively
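The ε/(k-1) and 1-ε targets can be built directly; this small helper (hypothetical, not from the source) constructs smoothed targets for a k-way softmax:

```python
import numpy as np

def smoothed_targets(labels, k, eps=0.1):
    """Replace hard 0/1 targets with eps/(k-1) off-class and 1-eps on-class."""
    t = np.full((len(labels), k), eps / (k - 1))
    t[np.arange(len(labels)), labels] = 1.0 - eps
    return t
```

Each row still sums to 1, so the smoothed targets remain a valid distribution over the k classes.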
REGULARIZATION
SEMI-SUPERVISED LEARNING
SEMI-SUPERVISED LEARNING
• Both unlabeled examples from P(x) and labeled
examples from P(x,y) are used to estimate P(y|x) or
predict y from x.
SEMI-SUPERVISED LEARNING
• In the context of deep learning it refers to learning a
representation h = f(x).
• The goal is to learn a representation so that examples
from the same class have similar representations.
• Unsupervised learning can provide useful clues for
how to group examples in representational space.
• Examples that cluster tightly in the input space should
be mapped to similar representations
SEMI-SUPERVISED LEARNING
• A linear classifier in the new space may achieve better
generalization.
• A variant is the application of PCA as a preprocessing step
before applying a classifier to the projected data.
• Instead of using separate unsupervised and supervised components
in the model, one can construct models in which a generative model
of either P(x) or P(x,y) shares parameters with a discriminative
model of P(y|x).
• One can then trade off the supervised criterion –log P(y|x)
with the unsupervised or generative one (such as –log P(x) or
–log P(x,y)).
• The generative criterion then expresses a prior belief about
the solution to the supervised problem
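The PCA-preprocessing variant can be sketched with an SVD; the function names are my own, and in practice the fit step would use all available inputs, labeled and unlabeled alike:

```python
import numpy as np

def pca_fit(x, n_components):
    """Unsupervised step: fit PCA on all inputs (labeled + unlabeled)."""
    mean = x.mean(axis=0)
    _, _, vt = np.linalg.svd(x - mean, full_matrices=False)
    return mean, vt[:n_components]

def pca_transform(x, mean, components):
    """Project inputs into the learned representation h = f(x)."""
    return (x - mean) @ components.T
```

A linear classifier would then be trained on the projected labeled examples, which may generalize better than one trained in the raw input space.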
REGULARIZATION
MULTI-TASK LEARNING
MULTI-TASK LEARNING
• Sharing parameters over tasks:
• Multi-task learning is a way to improve
generalization by pooling the examples out of
several tasks
• Examples can be seen as providing soft constraints on
the parameters
• In the same way that additional training examples
put more pressure on the parameters of the model
towards values that generalize well
• Different supervised tasks predict y(i) given x
• They share the same input x, as well as some
intermediate representation h(shared) capturing a
common pool of factors
MULTI-TASK LEARNING
MULTI-TASK LEARNING
• Common multi-task situation:
• Common input but different target
random variables
• Lower layers (whether the network is feedforward or
includes a generative component with
downward arrows) can be shared across
such tasks.
• Task-specific parameters h(1), h(2) can be
learned on top of the shared
representation h(shared)
• A common pool of factors explains the
variations of the input x, while each task is
associated with a subset of these factors
MULTI-TASK LEARNING
• Model can be divided into two parts
1. Task specific parameters
• Which only benefit from the examples
of their task to achieve good
generalization.
• These are the upper layers of the neural
network
2. Generic parameters
• Shared across all tasks
• Which benefit from the pooled data of all
tasks
• These are the lower layers of the neural
network
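The split into generic (shared, lower) and task-specific (upper) parameters can be sketched as a forward pass; all shapes, the weight scaling, and the ReLU choice are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared lower layer (generic parameters, trained on the pooled data of all
# tasks) and two task-specific heads (upper layers).
w_shared = rng.normal(size=(16, 8)) * 0.1   # input dim 8 -> shared h dim 16
w_task1 = rng.normal(size=(3, 16)) * 0.1    # task 1: 3 outputs
w_task2 = rng.normal(size=(5, 16)) * 0.1    # task 2: 5 outputs

def forward(x):
    h_shared = np.maximum(0.0, w_shared @ x)  # shared representation h(shared)
    return w_task1 @ h_shared, w_task2 @ h_shared

y1, y2 = forward(rng.normal(size=8))
```

Gradients from both task losses would flow into `w_shared`, so each task's examples act as soft constraints on the shared parameters.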