Regularization

The document discusses regularization techniques in machine learning, focusing on bias versus variance trade-offs, early stopping, and L1/L2 regularization methods to prevent overfitting. It highlights symptoms of high variance and high bias, and outlines strategies to address these issues, including adding training data and adjusting model complexity. Additionally, it provides practical examples of implementing these techniques in PyTorch.

Uploaded by

ayaazouz1997
Copyright
© © All Rights Reserved

Regularization

Bias versus variance


A central problem in machine learning is how to make an algorithm that will perform well
not just on the training data, but also on new inputs. Many strategies used in machine
learning are explicitly designed to reduce the test error, possibly at the expense of increased
training error. These strategies are known collectively as regularization. A great many forms
of regularization are available to the deep learning practitioner. In fact, developing more
effective regularization strategies has been one of the major research efforts in the field.
High Variance - the model fits the training data well but performs poorly on new data.

Symptoms:
- Training error is much lower than test error
- Training error is lower than ϵ (the target error level)
- Test error is above ϵ
Actions:
- Add more training data
- Reduce model complexity - complex models are prone to high variance

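High Bias - the model is too simple to capture the underlying pattern, so its predictions are inaccurate even on the training data.
Symptoms:
- Training error is higher than ϵ
Actions:
- Use more complex model
- Add features
- Use more training data (typically less effective for high bias than increasing model capacity)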
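The symptoms above can be turned into a simple check. The sketch below is illustrative only: the function name `diagnose` and the reading of ϵ as a target error level are assumptions, not something the notes define.

```python
def diagnose(train_error: float, test_error: float, epsilon: float) -> str:
    """Classify a model from its errors, following the symptom lists above.

    epsilon is assumed to be the target (acceptable) error level.
    """
    if train_error > epsilon:
        # The model cannot even fit the training data well enough.
        return "high bias"
    if test_error > epsilon:
        # Training error is fine, but the model does not generalize.
        return "high variance"
    return "ok"


print(diagnose(train_error=0.02, test_error=0.20, epsilon=0.05))
print(diagnose(train_error=0.15, test_error=0.18, epsilon=0.05))
```

The first call matches the high-variance symptoms (low training error, test error above ϵ), the second the high-bias symptom (training error above ϵ).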
Early Stopping
When training large models with sufficient representational capacity to overfit the task, we
often observe that training error decreases steadily over time, but validation set error begins
to rise again. See figure below for an example of this behavior, which occurs reliably. This
means we can obtain a model with better validation set error (and thus, hopefully better test
set error) by returning to the parameter setting at the point in time with the lowest validation
set error. Every time the error on the validation set improves, we store a copy of the model
parameters. When the training algorithm terminates, we return these parameters, rather than
the latest parameters. The algorithm terminates when no parameters have improved over the
best recorded validation error for some pre-specified number of iterations. This strategy is
known as early stopping.
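The procedure described above can be sketched in a framework-independent way. The callables `train_step` and `validation_error`, and the parameter name `patience` for the "pre-specified number of iterations", are hypothetical placeholders chosen for this illustration.

```python
import copy


def train_with_early_stopping(params, train_step, validation_error,
                              patience=5, max_steps=1000):
    """Return the parameters with the lowest validation error seen so far.

    train_step(params) -> updated params        (hypothetical callable)
    validation_error(params) -> float           (hypothetical callable)
    """
    best_params = copy.deepcopy(params)
    best_error = validation_error(params)
    steps_without_improvement = 0

    for _ in range(max_steps):
        params = train_step(params)
        error = validation_error(params)
        if error < best_error:
            # Validation error improved: store a copy of the parameters.
            best_error = error
            best_params = copy.deepcopy(params)
            steps_without_improvement = 0
        else:
            steps_without_improvement += 1
            if steps_without_improvement >= patience:
                break  # no improvement for `patience` consecutive steps

    # Return the best parameters rather than the latest ones.
    return best_params, best_error


# Toy usage: "params" is just a step counter, and the validation error is
# U-shaped with its minimum at step 10.
best, err = train_with_early_stopping(
    params=0,
    train_step=lambda step: step + 1,
    validation_error=lambda step: (step - 10) ** 2,
    patience=3,
)
```

In the toy run, training continues for three steps past the minimum before stopping, and the parameters from step 10 are returned.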
It is probably the most commonly used form of regularization in deep learning. Its popularity
is due to both its effectiveness and its simplicity. One way to think of early stopping is as a
very efficient hyperparameter selection algorithm. In this view, the number of training steps
is just another hyperparameter. We can see in the figure that this hyperparameter has a
U-shaped validation set performance curve. Most hyperparameters that control model capacity
have such a U-shaped validation set performance curve.

In the case of early stopping, we are controlling the effective capacity of the model by
determining how many steps it can take to fit the training set. Most hyperparameters must
be chosen using an expensive guess and check process, where we set a hyperparameter at
the start of training, then run training for several steps to see its effect. The “training time”
hyperparameter is unique in that by definition, a single run of training tries out many values
of the hyperparameter. The only significant cost to choosing this hyperparameter
automatically via early stopping is running the validation set evaluation periodically during
training.
Ideally, this is done in parallel to the training process on a separate machine, separate CPU,
or separate GPU from the main training process. If such resources are not available, then
the cost of these periodic evaluations may be reduced by using a validation set that is small
compared to the training set or by evaluating the validation set error less frequently and
obtaining a lower-resolution estimate of the optimal training time.
Batch normalization

Batch normalization: Training
Batch normalization: Test
Comparison of Normalization Layers

torch.nn.BatchNorm1d(num_features, eps=1e-05, momentum=0.1, affine=True,
    track_running_stats=True, device=None, dtype=None)

torch.nn.BatchNorm2d(num_features, eps=1e-05, momentum=0.1, affine=True,
    track_running_stats=True, device=None, dtype=None)
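The difference between training and test behavior can be seen with a small experiment. This is a minimal sketch with made-up input data: in training mode the layer normalizes with the statistics of the current batch and updates its running estimates, while in evaluation mode it normalizes with the stored running estimates.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(num_features=3)
x = torch.randn(8, 3) * 5 + 2   # illustrative batch: 8 samples, 3 features

bn.train()
y_train = bn(x)   # uses the mean/variance of this batch, updates running stats

bn.eval()
y_eval = bn(x)    # uses bn.running_mean and bn.running_var instead

print(y_train.mean(dim=0))   # close to 0 for every feature
print(bn.running_mean)       # moved a fraction `momentum` toward the batch mean
```

With the default `momentum=0.1`, one training step moves the running mean 10% of the way from its initial value of 0 toward the batch mean, so `y_eval` is generally different from `y_train` early in training.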
Decorrelated Batch Normalization
Dropout
torch.nn.Dropout(p=0.5, inplace=False)
During training, randomly zeroes some of the elements of the input tensor with probability
p using samples from a Bernoulli distribution. Each channel will be zeroed out
independently on every forward call.
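A short sketch of how the layer behaves in the two modes, using an all-ones input chosen for illustration. During training the surviving elements are scaled by 1 / (1 - p) so that the expected value of the output matches the input. In evaluation mode the layer is the identity.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()
y_train = drop(x)   # each element is either 0 or 1 / (1 - p) = 2.0

drop.eval()
y_eval = drop(x)    # unchanged: dropout is disabled at evaluation time

print(y_train[:10])
print((y_train == 0).float().mean())   # roughly p
```

Calling `model.eval()` before validation or testing matters for this reason: it switches every dropout layer in the model to the identity behavior.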

Dataset Augmentation
L2 regularization
L2 regularization, also known as weight decay, is an important technique used in deep
learning to prevent overfitting, which occurs when a model becomes too complex and fits
too closely to the training data, leading to poor generalization performance on new, unseen
data.

In L2 regularization, a penalty term is added to the loss function of the model, which
encourages the model to have smaller weights. This penalty term is proportional to the
square of the magnitude of the weights, which means that larger weights are penalized
more heavily than smaller weights. By penalizing larger weights, the model is encouraged
to choose simpler solutions, which can lead to better generalization performance.
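As a worked form of the description above, assume an unregularized loss J(w), a regularization strength λ and a learning rate η (these symbols are introduced here for illustration and do not appear in the notes). The penalized loss, its gradient, and one gradient-descent step can then be written as:

```latex
\tilde{J}(w) = J(w) + \frac{\lambda}{2}\,\lVert w \rVert_2^2,
\qquad
\nabla_w \tilde{J}(w) = \nabla_w J(w) + \lambda w

w \leftarrow w - \eta\,\nabla_w \tilde{J}(w)
  = (1 - \eta\lambda)\,w - \eta\,\nabla_w J(w)
```

The factor (1 - ηλ) shrinks every weight toward zero at each step, which is where the name weight decay comes from.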

L2 regularization can also be seen as a way of implementing Occam's razor, which states
that among competing hypotheses, the one that makes the fewest assumptions should be
selected. In the context of deep learning, this means that simpler models that generalize
well to new data are preferred over more complex models that fit the training data
perfectly but may not generalize well.

Overall, L2 regularization is an important tool in the deep learning toolbox that can help
improve model generalization performance and prevent overfitting.
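In PyTorch, the weight_decay argument of the optimizer sets the strength of the L2 penalty:
import torch.optim as optim

# model is any torch.nn.Module defined elsewhere
optimizer = optim.SGD(model.parameters(), lr=0.01, weight_decay=0.01)
or
optimizer = optim.Adam(model.parameters(), lr=0.01, weight_decay=0.01)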
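To make the link between `weight_decay` and the L2 penalty concrete, the sketch below takes one SGD step in two ways: once with `weight_decay` set on the optimizer, and once with the penalty (weight_decay / 2) * sum of squared parameters added to the loss by hand. The data and model are made up for illustration. For plain SGD the two updates should coincide.

```python
import copy

import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
weight_decay = 0.01
x, y = torch.randn(16, 10), torch.randn(16, 1)   # illustrative data

model_a = nn.Linear(10, 1)
model_b = copy.deepcopy(model_a)   # identical starting weights
criterion = nn.MSELoss()

# (a) L2 regularization through the optimizer
opt_a = optim.SGD(model_a.parameters(), lr=0.01, weight_decay=weight_decay)
opt_a.zero_grad()
criterion(model_a(x), y).backward()
opt_a.step()

# (b) the same penalty written into the loss
opt_b = optim.SGD(model_b.parameters(), lr=0.01)
opt_b.zero_grad()
l2_penalty = sum(p.pow(2).sum() for p in model_b.parameters())
(criterion(model_b(x), y) + 0.5 * weight_decay * l2_penalty).backward()
opt_b.step()

for p_a, p_b in zip(model_a.parameters(), model_b.parameters()):
    print(torch.allclose(p_a, p_b, atol=1e-6))
```

With Adam, `weight_decay` is likewise added to the gradient, but the adaptive scaling then changes its effective size per parameter. `torch.optim.AdamW` applies the decay separately from the gradient and may be worth considering for that reason.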
L1 regularization
L1 and L2 regularization are two common techniques used in machine learning to prevent
overfitting by adding a penalty term to the loss function of a model. While both techniques
aim to reduce the complexity of the model and improve generalization performance, they
differ in how they achieve this goal.

The main difference between L1 and L2 regularization is in the penalty term that is added
to the loss function. In L1 regularization, the penalty term is proportional to the absolute
value of the weights, while in L2 regularization, it is proportional to the square of the
magnitude of the weights.

This difference in penalty term leads to some key differences in the effect of L1 and L2
regularization on the model. One of the main differences is that L1 regularization tends to
result in sparse weight vectors, where many of the weights are exactly zero. This is
because the penalty term in L1 regularization encourages some of the weights to be set to
zero, effectively removing some of the features from the model.

On the other hand, L2 regularization tends to distribute the weight values more evenly,
with no weight being exactly zero, and instead all weights are just small. This makes L2
regularization more suited to models where all features are potentially important, as it will
still consider them all.
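The contrast can be illustrated on a one-dimensional toy problem, which is an assumption made for this sketch rather than something taken from the notes. Suppose the unregularized optimum of a single weight is `a`, and we minimize 0.5 * (w - a)**2 plus either an L1 or an L2 penalty of strength `lam`. Both minimizers have closed forms:

```python
def l1_shrink(a: float, lam: float) -> float:
    """Minimizer of 0.5 * (w - a)**2 + lam * abs(w) (soft-thresholding)."""
    if a > lam:
        return a - lam
    if a < -lam:
        return a + lam
    return 0.0   # small weights are set exactly to zero


def l2_shrink(a: float, lam: float) -> float:
    """Minimizer of 0.5 * (w - a)**2 + 0.5 * lam * w**2."""
    return a / (1.0 + lam)   # scaled down, but never exactly zero unless a is


unregularized = [2.0, 0.3, -0.05, -1.5]
lam = 0.5
l1_weights = [l1_shrink(a, lam) for a in unregularized]
l2_weights = [l2_shrink(a, lam) for a in unregularized]

print(l1_weights)
print(l2_weights)
```

The L1 solution zeroes the two weights whose magnitude is below `lam`, while the L2 solution only rescales every weight by the same factor.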
In summary, L1 regularization tends to produce sparse models, while L2 regularization
produces models with small, non-zero weights. Choosing between the two regularization
techniques depends on the specific problem and the characteristics of the data and the
model.

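import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

model = nn.Linear(10, 1)  # Example linear model

# No weight decay here: the L1 penalty is added to the loss by hand
optimizer = optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()  # Example loss function

l1_lambda = 0.01  # Strength of L1 regularization

# Placeholder random data so the example runs; substitute your own DataLoader
dataset = TensorDataset(torch.randn(64, 10), torch.randn(64, 1))
dataloader = DataLoader(dataset, batch_size=16)

# Training loop
for inputs, labels in dataloader:
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = criterion(outputs, labels)
    # L1 penalty: sum of the absolute values of all parameters.
    # (nn.L1Loss expects an input and a target, so it cannot be applied
    # to a parameter on its own.)
    l1_loss = sum(param.abs().sum() for param in model.parameters())
    loss = loss + l1_lambda * l1_loss
    loss.backward()
    optimizer.step()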
