REGULARIZATION
DATA SET AUGMENTATION
• Regularization: Overview, Parameter Norm Penalties, Norm Penalties as Constrained Optimization, Regularization and Underconstrained Problems, Data Augmentation, Noise Robustness, Batch Normalization, Semi-Supervised Learning, Multi-Task Learning, Early Stopping, Parameter Tying and Parameter Sharing, Sparse Representations, Bagging, Dropout. Tuning Neural Networks, Hyperparameters

DATA AUGMENTATION
• More data is better:
• The best way to make an ML model generalize better is to train it on more data
• In practice, the amount of data is limited
• We can get around the problem by creating synthesized data
• For some ML tasks, it is straightforward to synthesize new data

• Augmentation for classification:
• Data augmentation is easiest for classification
• A classifier takes a high-dimensional input x and summarizes it with a single category identity y
• The main task of a classifier is to be invariant to a wide variety of transformations
• We can generate new samples (x, y) just by transforming the inputs
• This approach is not easily generalized to other problems
• For a density estimation problem, it is not possible to generate new data without first solving the density estimation problem

• Effective for object recognition:
• Dataset augmentation is very effective for the classification problem of object recognition
• Images are high-dimensional and include a variety of variations, many of which are easily simulated
• Translating the images a few pixels can greatly improve performance
• Even when the model is designed to be invariant using convolution and pooling
• Rotating and scaling the images are also effective

• Main data augmentation methods:

• Caution in data augmentation:
• Do not apply a transformation that would change the class
• OCR example: 'b' vs 'd' and '6' vs '9'
• Horizontal flips and 180-degree rotations are not appropriate for such tasks
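The generate-new-samples-by-transforming-inputs idea above can be sketched as a label-preserving translation. This is a minimal numpy illustration, not from the slides; the 4x4 array stands in for an image:

```python
import numpy as np

def translate(img, dx, dy):
    """Shift an image dx pixels right and dy pixels down.
    np.roll wraps around, so the wrapped-in border is zeroed instead."""
    out = np.roll(img, (dy, dx), axis=(0, 1))
    if dy > 0:
        out[:dy, :] = 0
    elif dy < 0:
        out[dy:, :] = 0
    if dx > 0:
        out[:, :dx] = 0
    elif dx < 0:
        out[:, dx:] = 0
    return out

def augment(x, y, rng):
    """New labeled sample (x', y): small random shift, label unchanged.
    Translation is assumed label-preserving here; a horizontal flip would
    NOT be safe for OCR classes like 'b' vs 'd'."""
    dx, dy = (int(v) for v in rng.integers(-2, 3, size=2))
    return translate(x, dx, dy), y

rng = np.random.default_rng(0)
img = np.arange(16.0).reshape(4, 4)   # toy 4x4 "image"
x_new, y_new = augment(img, 7, rng)   # same label, shifted pixels
```

Each call yields another training pair with the same category identity, which is exactly the invariance the classifier is being asked to learn.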
• Some transformations are not easy to perform
• Out-of-plane rotation cannot be implemented as a simple geometric operation on the pixels

REGULARIZATION
NOISE ROBUSTNESS

• Noise injection:
• Noise applied to the inputs is a form of data augmentation
• For some models, the addition of noise with infinitesimal variance at the input is equivalent to imposing a penalty on the norm of the weights, e.g., λwᵀw

• Noise applied to hidden units:
• Noise injection can be much more powerful than simply shrinking the parameters
• Noise applied to the hidden units is so important that it merits its own separate discussion
• Dropout is the main development of this approach

• Adding noise to weights:
• This technique is primarily used with RNNs
• It can be interpreted as a stochastic implementation of Bayesian inference over the weights
• Bayesian learning considers model weights to be uncertain and representable via a probability distribution p(w) that reflects that uncertainty
• Adding noise to the weights is a practical, stochastic way to reflect this uncertainty

• Adding noise to weights (regression view):
• Noise applied to the weights is equivalent to a traditional form of regularization, encouraging stability
• This can be seen in a regression setting
• Train ŷ(x) to map x to a scalar using the least-squares cost between the model prediction and the true values y:
  J = E_{p(x,y)}[(ŷ(x) − y)²]
• We perturb the weights with random noise ε_W ∼ N(ε; 0, ηI), so each presentation of an example uses the perturbed model ŷ_{ε_W}(x)
• For small η, this is equivalent to adding a regularization term η E_{p(x,y)}[‖∇_W ŷ(x)‖²]
• It encourages the parameters to go to regions where small perturbations of the weights have a small influence on the output

• Injecting noise at the output targets:
• Most datasets have some mistakes in the y labels
• It is harmful to maximize log p(y|x) when y is a mistake
• To prevent this, we explicitly model the noise on the labels
• Example: assume the training-set label y is correct with probability 1 − ε, and otherwise any of the other labels may be correct
• This assumption can be incorporated into the cost function
• Example: label smoothing regularizes a model based on a softmax with k output values by replacing the hard 0 and 1 classification targets with ε/(k − 1) and 1 − ε, respectively

REGULARIZATION
SEMI-SUPERVISED LEARNING

• Both unlabeled examples from P(x) and labeled examples from P(x, y) are used to estimate P(y|x) or predict y from x
• In the context of deep learning, semi-supervised learning usually refers to learning a representation h = f(x)
• The goal is to learn a representation so that examples from the same class have similar representations
• Unsupervised learning can provide useful clues for how to group examples in representation space
• Examples that cluster tightly in the input space should be mapped to similar representations
• A linear classifier in the new space may then achieve better generalization
• A variant is the application of PCA as a preprocessing step before applying a classifier to the projected data
• Instead of separate unsupervised and supervised components in the model, one can construct models in which a generative model of either P(x) or P(x, y) shares parameters with a discriminative model of P(y|x)
• One can then trade off the supervised criterion −log P(y|x) with the unsupervised or generative one (such as −log P(x) or −log P(x, y))
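The PCA-preprocessing variant of semi-supervised learning mentioned above can be sketched as follows. This is a minimal numpy illustration, not from the slides; the nearest-centroid rule stands in for an arbitrary simple classifier, and the data is synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)

# Plenty of unlabeled data from P(x): two elongated clusters in 2-D.
unlabeled = np.vstack([
    rng.normal([0, 0], [3.0, 0.3], size=(500, 2)),
    rng.normal([8, 0], [3.0, 0.3], size=(500, 2)),
])

# Unsupervised step: fit PCA on the unlabeled data alone.
mean = unlabeled.mean(axis=0)
_, _, vt = np.linalg.svd(unlabeled - mean, full_matrices=False)
components = vt[:1]                      # keep the top principal direction

def project(x):
    """The learned representation h = f(x): a 1-D projection."""
    return (x - mean) @ components.T

# Supervised step: only a handful of labeled examples from P(x, y).
labeled_x = np.array([[0.5, 0.2], [-1.0, -0.1], [7.5, 0.1], [9.0, -0.2]])
labeled_y = np.array([0, 0, 1, 1])

# Nearest-centroid classifier in the projected space.
h = project(labeled_x)
centroids = np.array([h[labeled_y == c].mean(axis=0) for c in (0, 1)])

def predict(x):
    d = np.abs(project(x) - centroids.ravel())  # distance to each centroid
    return int(np.argmin(d))
```

The unlabeled examples shape the representation, so the classifier fit on just four labeled points can still separate the clusters.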
• The generative criterion then expresses a prior belief about the solution to the supervised problem

REGULARIZATION
MULTI-TASK LEARNING

• Sharing parameters over tasks:
• Multi-task learning is a way to improve generalization by pooling the examples of several tasks
• The extra examples can be seen as providing soft constraints on the parameters
• In the same way that additional training examples put more pressure on the parameters of the model towards values that generalize well
• Different supervised tasks predict y(i) given x
• They share the same input x, as well as some intermediate representation h(shared) capturing a common pool of factors

• Common multi-task situation:
• A common input but different target random variables
• The lower layers (whether the network is feedforward or includes a generative component with downward arrows) can be shared across such tasks
• Task-specific parameters h(1), h(2) can be learned on top of those, yielding a shared representation h(shared)
• A common pool of factors explains the variations of the input x, while each task is associated with a subset of these factors

• The model can be divided into two parts:
1. Task-specific parameters
• These only benefit from the examples of their own task to achieve good generalization
• These are the upper layers of the neural network
2. Generic parameters
• These are shared across all tasks
• They benefit from the pooled data of all tasks
• These are the lower layers of the neural network
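The shared-trunk structure above can be sketched as a forward pass. This is a minimal numpy illustration; the layer sizes, task choices, and names are assumptions for the example, not part of the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Generic parameters: shared lower layer, trained on pooled data from all tasks.
W_shared = 0.1 * rng.normal(size=(4, 8))   # input dim 4 -> h(shared) dim 8

# Task-specific parameters: one head per task, trained only on its own examples.
W_task1 = 0.1 * rng.normal(size=(8, 3))    # task 1: 3-way classification scores
W_task2 = 0.1 * rng.normal(size=(8, 1))    # task 2: scalar regression

def shared_representation(x):
    """h(shared): lower layers common to every task (ReLU trunk)."""
    return np.maximum(0.0, x @ W_shared)

def predict_task1(x):
    """h(1): task-specific upper layer for task 1."""
    return shared_representation(x) @ W_task1

def predict_task2(x):
    """h(2): task-specific upper layer for task 2."""
    return shared_representation(x) @ W_task2

x = rng.normal(size=(5, 4))                # a batch of 5 inputs, common to both tasks
y1 = predict_task1(x)                      # shape (5, 3)
y2 = predict_task2(x)                      # shape (5, 1)
```

Gradients from both tasks would flow into W_shared, which is how the pooled examples act as soft constraints on the generic parameters.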