Unit 4
This penalty discourages the model from becoming too complex or having large
parameter values, which helps in controlling the model’s ability to fit noise in the
training data.
Regularization methods include L1 and L2 regularization, dropout, early stopping,
and more. By applying regularization, models become more robust and better at
making accurate predictions on unseen data.
Techniques to Reduce Underfitting
1. Increase model complexity.
2. Increase the number of features by performing feature engineering.
3. Remove noise from the data.
4. Increase the number of epochs or the duration of training to get better results.
Example: An epoch is one complete pass of the entire training dataset through the learning algorithm. The number of epochs is therefore the number of times the model sees the full training set during training. When the data is processed in batches, one epoch consists of several iterations, one per batch.
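To make the relationship between epochs, batches, and iterations concrete, here is a minimal arithmetic sketch. The function names and the example numbers (1000 examples, batch size 256, 20 epochs) are illustrative assumptions, not values taken from a specific experiment.

```python
import math


def iterations_per_epoch(num_examples: int, batch_size: int) -> int:
    """Number of batches (parameter updates) needed to see every example once."""
    return math.ceil(num_examples / batch_size)


def total_iterations(num_examples: int, batch_size: int, epochs: int) -> int:
    """Total number of parameter updates over the whole training run."""
    return iterations_per_epoch(num_examples, batch_size) * epochs


# Hypothetical numbers: the last batch of each epoch is smaller (232 examples).
print(iterations_per_epoch(1000, 256))  # 4
print(total_iterations(1000, 256, 20))  # 80
```

Increasing the number of epochs, as suggested for underfitting, multiplies the number of updates without changing the data.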
A statistical model is said to be overfitted when it fits the training data closely but does not make accurate predictions on testing data. When a model is trained for too long, or is too flexible for the amount of data available, it starts learning from the noise and inaccurate entries in the data set, and evaluation on test data then shows high variance. The model fails to categorize new data correctly because it has captured too many details and too much noise. Non-parametric and non-linear methods are especially prone to overfitting, because these types of machine learning algorithms have more freedom in building the model from the dataset and can therefore build unrealistic models. Ways to avoid overfitting include using a linear algorithm if we have linear data, or constraining parameters such as the maximal depth if we are using decision trees.
Parameter Norm Penalties are regularization methods that apply a penalty to the
norm of parameters in the objective function of a neural network.
Lasso Regression
where,
• m – Number of Features
• n – Number of Examples
• y_i – Actual Target Value
• y_i(hat) – Predicted Target Value
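The cost function that these symbols describe is missing from the text. Lasso regression uses the L1 regularization technique, so a standard form consistent with the symbols listed above would be the following, where the weights w_j and the regularization strength λ are assumed names that do not appear in the list:

```latex
\text{Cost} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 + \lambda \sum_{j=1}^{m}\left|w_j\right|
```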
Ridge Regression
A regression model that uses the L2 regularization technique is
called Ridge regression. Ridge regression adds the “squared magnitude” of the
coefficient as a penalty term to the loss function(L).
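As a sketch of the loss described here, a common way to write the Ridge cost is shown below. It assumes n examples, m features, actual targets y_i, predictions ŷ_i, weights w_j, and a regularization strength λ; the exact scaling constants vary between textbooks.

```latex
L = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 + \lambda \sum_{j=1}^{m} w_j^{2}
```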
L2 & L1 regularization
L1 and L2 are the most common types of regularization. These update the general
cost function by adding another term known as the regularization term.
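In L2, the regularization term is the sum of the squared weights, scaled by the regularization parameter lambda.
In L1, the regularization term is the sum of the absolute values of the weights, scaled by lambda. Unlike L2, the weights may be reduced exactly to zero here. Hence, it is very useful when we are trying to compress our model. Otherwise, we usually prefer L2 over it.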
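The two regularization terms, and the reason L1 can drive weights exactly to zero while L2 only shrinks them, can be illustrated with a small sketch. The function names are hypothetical, and the soft-thresholding update used for L1 is one standard way to handle the absolute-value term; it is an assumption here, not something the text prescribes.

```python
def l1_penalty(weights, lam):
    """lambda * sum of absolute weight values."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """lambda * sum of squared weight values."""
    return lam * sum(w * w for w in weights)


def regularized_cost(loss, weights, lam, kind="l2"):
    """General cost = data loss + regularization term."""
    penalty = l1_penalty(weights, lam) if kind == "l1" else l2_penalty(weights, lam)
    return loss + penalty


def l1_step(w, lr, lam):
    """Soft-threshold update for the L1 term: small weights become exactly 0."""
    shrink = lr * lam
    if w > shrink:
        return w - shrink
    if w < -shrink:
        return w + shrink
    return 0.0


def l2_step(w, lr, lam):
    """Gradient step on lam * w**2: shrinks proportionally, never exactly to 0."""
    return w * (1 - 2 * lr * lam)


weights = [0.5, -2.0, 0.0]
print(regularized_cost(1.0, weights, 0.01, "l1"))  # 1.0 + 0.01 * 2.5
print(regularized_cost(1.0, weights, 0.01, "l2"))  # 1.0 + 0.01 * 4.25
print(l1_step(0.05, lr=0.1, lam=1.0))  # 0.0
```

The last line shows a small weight being set to zero by the L1 update, which is why L1 is useful for compressing a model.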
In Keras, we can directly apply regularization to any layer using the regularizers module. Below, an L2 regularizer is applied to a dense layer having 500 neurons and a relu activation function.
from keras import regularizers
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

# creating sequential model
model = Sequential()
model.add(Conv2D(filters=16, kernel_size=2, padding="same", activation="relu",
                 input_shape=(50, 50, 3)))
model.add(MaxPooling2D(pool_size=2))
model.add(Conv2D(filters=32, kernel_size=2, padding="same", activation="relu"))
model.add(MaxPooling2D(pool_size=2))
model.add(Conv2D(filters=64, kernel_size=2, padding="same", activation="relu"))
model.add(MaxPooling2D(pool_size=2))
model.add(Flatten())
# l2 regularizer on the dense layer
model.add(Dense(500, kernel_regularizer=regularizers.l2(0.01), activation="relu"))
model.add(Dense(2, activation="softmax"))  # 2 represents the output layer neurons
Note: Here 0.01 is the value of the regularization parameter, i.e., lambda, which is a hyperparameter that we need to tune further.
Likewise, if Ω(θ) < k, the norm does not need to be reduced much, and hence α should be small. This is similar to the parameter-norm-penalty regularized objective function, as both encourage lower values of the norm. Thus, parameter norm penalties naturally impose a constraint. L² regularization, for example, confines the weights to an L² ball.
A larger α implies a smaller constraint region, as it pushes the values lower and hence allows only a small radius, and vice versa. Using explicit constraints rather than penalties is important for several reasons. Large penalties might cause non-convex optimization algorithms to get stuck in local minima corresponding to small values of θ, leading to the formation of so-called dead units, as the weights entering and leaving them are too small to have an impact.
Constraints don’t force the weights to be near zero. They only confine the weights to a constrained region.
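One common way to enforce such a constraint explicitly is to project the weights back into the allowed region after each update. The sketch below assumes Ω(θ) is the L² norm and k is the radius of the ball; the function name is hypothetical.

```python
import math


def project_to_l2_ball(weights, k):
    """Return weights unchanged if their L2 norm is <= k, else rescale them to norm k."""
    norm = math.sqrt(sum(w * w for w in weights))
    if norm <= k:
        return list(weights)
    scale = k / norm
    return [w * scale for w in weights]


# Outside the ball: norm 5 is rescaled down to norm 1.
print(project_to_l2_ball([3.0, 4.0], 1.0))
# Already inside the ball: left untouched, not pushed towards zero.
print(project_to_l2_ball([0.3, 0.4], 1.0))
```

Weights inside the region are not modified at all, which is the difference from a penalty that always pulls them towards zero.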
whenever the data generating distribution truly has no variance in some direction,
or when no variance is observed in some direction because there are fewer
examples (rows of X) than input features (columns of X). In this case, many
forms of regularization correspond to
inverti
Data Augmentation
The simplest way to reduce overfitting is to increase the size of the training data. In many machine learning problems, we are not able to increase the size of the training data because labeled data is too costly.
But now let’s consider that we are dealing with images. In this case, there are a few ways of increasing the size of the training data: rotating the image, flipping, scaling, shifting, etc. In the image below, some transformations have been applied to the handwritten digits dataset.
This technique is known as data augmentation. It usually provides a big leap in the accuracy of the model and can be considered a standard trick for improving our predictions.
# ImageDataGenerator as shipped with Keras 2.x (assumed to be the version in use)
from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    featurewise_center=False,  # set input mean to 0 over the dataset
    samplewise_center=False,  # set each sample mean to 0
    featurewise_std_normalization=False,  # divide inputs by std of the dataset
    samplewise_std_normalization=False,  # divide each input by its std
    zca_whitening=False,  # apply ZCA whitening
    rotation_range=10,  # randomly rotate images in the range (degrees, 0 to 180)
    zoom_range=0.1,  # randomly zoom image
    width_shift_range=0.1,  # randomly shift images horizontally (fraction of total width)
    height_shift_range=0.1,  # randomly shift images vertically (fraction of total height)
    horizontal_flip=False,  # randomly flip images horizontally (disabled here)
    vertical_flip=False)  # randomly flip images vertically (disabled here)
datagen.fit(x_train)  # x_train: the array of training images
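To show what flipping and shifting actually do to the pixels, here is a library-free sketch on a tiny image stored as a list of rows. The helper names and the 2x3 example image are hypothetical, and this is only an illustration of the idea, not how Keras implements its transformations.

```python
def horizontal_flip(image):
    """Mirror each row of a 2-D image (list of rows)."""
    return [row[::-1] for row in image]


def shift_right(image, pixels, fill=0):
    """Shift every row to the right, padding the left edge with `fill`."""
    shifted = []
    for row in image:
        p = max(0, min(pixels, len(row)))
        shifted.append([fill] * p + list(row[:len(row) - p]))
    return shifted


def augment(image):
    """Return the original image plus two transformed copies."""
    return [image, horizontal_flip(image), shift_right(image, 1)]


image = [[1, 2, 3],
         [4, 5, 6]]
for variant in augment(image):
    print(variant)
```

Each transformed copy keeps the label of the original image, so one labeled example becomes three training examples.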
Dropout
This is one of the most interesting types of regularization techniques. It also produces very good results and is consequently one of the most frequently used regularization techniques in the field of deep learning.
To understand dropout, let’s say our neural network structure is akin to the one shown below.
So what does dropout do? At every iteration, it randomly selects some nodes and removes them along with all of their incoming and outgoing connections, as shown below.
So each iteration has a different set of nodes, and this results in a different set of outputs. Dropout can therefore be thought of as an ensemble technique in machine learning.
Ensemble models usually perform better than a single model because combining many varied models averages out their individual errors. Similarly, a network trained with dropout usually generalizes better than the same network trained without it.
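The mechanism can be sketched in a few lines of plain Python. This version uses the "inverted dropout" convention, in which the surviving activations are scaled up by 1 / (1 - rate) during training so that nothing needs to change at test time. The function name and the example values are hypothetical.

```python
import random


def dropout(activations, rate, training=True, rng=random):
    """Zero each activation with probability `rate`; rescale the survivors."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(activations)  # dropout is disabled at test time
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)  # seeded so that runs are repeatable
hidden = [0.2, 1.5, 0.7, 0.9]
print(dropout(hidden, rate=0.5, rng=rng))      # a different mask on every call
print(dropout(hidden, rate=0.5, rng=rng))
print(dropout(hidden, rate=0.5, training=False))  # unchanged at test time
```

Every call during training samples a new mask, which is what makes each iteration behave like a different thinned network.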
In Keras, we can implement dropout using the Dropout layer. Below is the dropout implementation. I have introduced a dropout rate of 0.2 (the probability of dropping a unit) in my neural network architecture after the last convolutional block, which has 64 filters, and after the first dense layer, which has 500 neurons.
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense

# creating sequential model
model = Sequential()
model.add(Conv2D(filters=16, kernel_size=2, padding="same", activation="relu",
                 input_shape=(50, 50, 3)))
model.add(MaxPooling2D(pool_size=2))
model.add(Conv2D(filters=32, kernel_size=2, padding="same", activation="relu"))
model.add(MaxPooling2D(pool_size=2))
model.add(Conv2D(filters=64, kernel_size=2, padding="same", activation="relu"))
model.add(MaxPooling2D(pool_size=2))
# 1st dropout
model.add(Dropout(0.2))
model.add(Flatten())
model.add(Dense(500, activation="relu"))
# 2nd dropout
model.add(Dropout(0.2))
model.add(Dense(2, activation="softmax"))  # 2 represents the output layer neurons
Early stopping
Early stopping is a kind of cross-validation strategy where we keep one part of the training set as the validation set. When we see that the performance on the validation set is getting worse, we stop training the model. This is known as early stopping.
In the above image, we will stop training at the dotted line, since after that our model will start overfitting on the training data.
In Keras, we can apply early stopping using the EarlyStopping callback. Below is the implementation code for it. I have applied early stopping so that training stops if the monitored validation metric has not improved for 3 consecutive epochs.
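from keras.callbacks import EarlyStopping

# 'val_acc' must match the name of the validation metric logged during training
# (recent Keras versions log it as 'val_accuracy' when metrics=['accuracy'])
earlystop = EarlyStopping(monitor='val_acc', patience=3)
epochs = 20
# batch_size = 256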
Here, monitor denotes the quantity that needs to be monitored, and ‘val_acc’ denotes the validation accuracy.
Patience denotes the number of epochs with no further improvement after which the training will be stopped. For better understanding, let’s take a look at the above image again. After the dotted line, each epoch will result in a higher value of validation error.
Therefore, 3 epochs after the dotted line (since our patience is equal to 3), our model will stop because no further improvement is seen.
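The patience rule can be traced by hand with a short sketch. It monitors a validation loss (lower is better) and is meant to mirror the rule described above, not to reproduce the internals of the Keras callback. The function name and the loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training stops, or None if it never stops early."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0  # improvement resets the counter
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Validation loss is lowest at epoch 3 (the "dotted line") and rises afterwards.
losses = [0.90, 0.70, 0.60, 0.62, 0.65, 0.70, 0.80]
print(early_stopping_epoch(losses, patience=3))  # 6
```

With a patience of 3, training stops at epoch 6, three epochs after the best epoch.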
Noise Robustness
Noise applied to inputs acts as a form of data augmentation. For some models, adding noise with extremely small variance at the input is equivalent to imposing a penalty on the norm of the weights.
Noise can also be applied to hidden units. Noise injection can be much more powerful than simply shrinking the parameters. Noise applied to hidden units is so important that it has become a technique in its own right: dropout is the main development of this approach.
Training a neural network with a small dataset can cause the network to memorize
all training examples, in turn leading to overfitting and poor performance on a
holdout dataset. One approach to making the input space smoother and easier to
learn is to add noise to inputs during training.
• Small datasets can make learning challenging for neural nets and the examples
can be memorized.
• Adding noise during training can make the training process more robust and
reduce generalization error.
• Noise is traditionally added to the inputs, but can also be added to weights,
gradients, and even activation functions.
Random noise can also be added to other parts of the network during training. Some examples include:
• Noise on weights: The addition of noise to weights allows the approach to be used throughout the network in a consistent way instead of adding noise to inputs and layer activations. This is particularly useful in recurrent neural networks.
• Noise on gradients: The addition of noise to gradients focuses more on improving the robustness of the optimization process itself rather than the structure of the input domain. The amount of noise can start high at the beginning of training and decrease over time, much like a decaying learning rate. This approach has proven to be an effective method for very deep networks and for a variety of different network types.
Adding noise to the activations, weights, or gradients provides a more generic approach to adding noise that is invariant to the types of input variables provided to the model.
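As a rough illustration of noise injection, the sketch below adds zero-mean Gaussian noise to a list of values (inputs, weights, or gradients) and uses a noise scale that decays over training steps. The function names, the decay formula, and its constants are assumptions chosen for illustration; the text only says the noise can start high and decrease over time.

```python
import random


def add_gaussian_noise(values, std, rng=random):
    """Add independent zero-mean Gaussian noise with standard deviation `std`."""
    return [v + rng.gauss(0.0, std) for v in values]


def annealed_std(step, initial_std=1.0, decay=0.55):
    """Noise scale that starts at `initial_std` and shrinks as `step` grows (assumed schedule)."""
    return initial_std / (1 + step) ** decay


rng = random.Random(42)  # seeded so that runs are repeatable
gradients = [0.10, -0.30, 0.05]
for step in (0, 10, 1000):
    std = annealed_std(step, initial_std=0.1)
    print(step, round(std, 4), add_gaussian_noise(gradients, std, rng))
```

The same `add_gaussian_noise` helper applies to inputs or weights; only the quantity passed in changes.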
Semi-Supervised Learning
Semi-supervised learning is a type of machine learning that falls in between
supervised and unsupervised learning. It is a method that uses a small amount of
labeled data and a large amount of unlabeled data to train a model. The goal of
semi-supervised learning is to learn a function that can accurately predict the
output variable based on the input variables, similar to supervised learning.
However, unlike supervised learning, the algorithm is trained on a dataset that
contains both labeled and unlabeled data.
Semi-supervised learning is particularly useful when there is a large amount of
unlabeled data available, but it’s too expensive or difficult to label all of it.
Multi-Task Learning
Hard Parameter Sharing – A common hidden layer is shared by all tasks, while several task-specific layers are kept towards the end of the model. This technique is very useful because learning a single representation for several tasks in the shared hidden layer reduces the risk of overfitting.
Soft Parameter Sharing – Each task has its own model with its own set of weights and biases, and the distance between these parameters in the different models is regularized so that the parameters become similar and can represent all the tasks.
1. Task relatedness: MTL is most effective when the tasks are related or have
some commonalities, such as natural language processing, computer vision,
and healthcare.
2. Data limitation: MTL can be useful when the data is limited, as it allows the
model to leverage the information shared across tasks to improve the
generalization performance.
3. Shared feature extractor: A common approach in MTL is to use a shared
feature extractor, which is a part of the network that is shared across tasks and
is used to extract features from the input data.
4. Task-specific heads: Task-specific heads are used to make predictions for each
task and are typically connected to the shared feature extractor.
5. Shared decision-making layer: Another approach is to use a decision-making layer that is shared across tasks, with the task-specific layers connected to it.
Parameter Tying
Two models are doing the same classification task (with the same set of classes), but their input distributions are somewhat different. Let W(A) and W(B) be the parameters of the two models, which map the input to two different but related outputs.
Assume the tasks are comparable enough (possibly with similar input and output distributions) that the model parameters should be near to each other. We can take advantage of this information through regularization. We can apply a parameter norm penalty of the following form:
Ω(W(A), W(B)) = ||W(A) − W(B)||₂²
We utilised an L2 penalty here, but there are other options.
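A minimal sketch of such a penalty is shown below, assuming the two models' parameters are flattened into equal-length lists and that a coefficient `alpha` (a hypothetical name) controls the strength of the tying.

```python
def tying_penalty(w_a, w_b):
    """Squared L2 distance between the two parameter vectors."""
    if len(w_a) != len(w_b):
        raise ValueError("parameter vectors must have the same length")
    return sum((a - b) ** 2 for a, b in zip(w_a, w_b))


def tied_loss(loss_a, loss_b, w_a, w_b, alpha):
    """Sum of both task losses plus the penalty that pulls the parameters together."""
    return loss_a + loss_b + alpha * tying_penalty(w_a, w_b)


w_a = [1.0, 2.0, -0.5]
w_b = [0.5, 2.0, 0.5]
print(tying_penalty(w_a, w_b))  # 0.25 + 0.0 + 1.0 = 1.25
print(tied_loss(0.4, 0.6, w_a, w_b, alpha=0.1))
```

The penalty is zero only when the two parameter vectors are identical, so minimizing it keeps the models close without forcing them to be equal.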
Parameter Sharing
This approach was used to regularise the parameters of one model, trained as a classifier in a supervised paradigm, to be close to the parameters of another model, trained in an unsupervised paradigm (to capture the distribution of the observed input data).
The architectures were designed so that many of the parameters in the classifier model could be paired with corresponding parameters in the unsupervised model.
Sparse representation (SR) is used to represent data with as few atoms as possible in a given overcomplete dictionary. By using SR, we can concisely represent the data and easily extract the valuable information from it.
The terms "sparse" and "dense" are commonly used in machine learning to describe the distribution of zero and non-zero members of an array (e.g. a vector or matrix). Sparse matrices are those that primarily consist of zeros, while dense matrices have a large number of nonzero entries.
Machine learning makes use of both sparse and dense representations because they represent data efficiently. While dense representations are useful for capturing intricate interactions between data points, sparse representations can help reduce the storage needed for a dataset.
A sparse matrix can be represented in many ways. The following are two common representations:
1. Array representation
2. Linked list representation
Example -
Let's understand the array representation of sparse matrix with the help of the
example given below -
In the above figure, we can observe a 5x4 sparse matrix containing 7 non-zero elements and 13 zero elements. The above matrix occupies 5x4 = 20 memory locations. Increasing the size of the matrix will increase the wasted space.
The size of the table depends upon the total number of non-zero elements in the given sparse matrix. The above table occupies 8x3 = 24 memory locations, which is more than the space occupied by the sparse matrix. So, what is the benefit of using the triplet representation? Consider the case where the matrix is 8x8 and there are only 8 non-zero elements: the space occupied by the full matrix would be 8x8 = 64, whereas the space occupied by the table of triplets would be 8x3 = 24.
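The 8x8 comparison can be reproduced with a small sketch that stores one (row, column, value) triplet per non-zero element. The helper names and the example matrix (8 non-zero values on the diagonal) are hypothetical.

```python
def to_triplets(matrix):
    """Array (triplet) representation: one (row, col, value) entry per non-zero element."""
    return [(r, c, v)
            for r, row in enumerate(matrix)
            for c, v in enumerate(row)
            if v != 0]


def from_triplets(triplets, rows, cols):
    """Rebuild the full matrix from its triplets."""
    matrix = [[0] * cols for _ in range(rows)]
    for r, c, v in triplets:
        matrix[r][c] = v
    return matrix


# Hypothetical 8x8 matrix with 8 non-zero elements.
matrix = [[i + 1 if i == j else 0 for j in range(8)] for i in range(8)]
triplets = to_triplets(matrix)

print(len(matrix) * len(matrix[0]))  # 64 cells in the full matrix
print(len(triplets) * 3)             # 24 cells in the triplet table
```

The saving grows with the matrix size as long as the number of non-zero elements stays small.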
Example -
Let's understand the linked list representation of sparse matrix with the help of the
example given below -
In the above figure, the sparse matrix is represented in the linked list form. In the
node, the first field represents the index of the row, the second field represents the
index of the column, the third field represents the value, and the fourth field
contains the address of the next node.
In the above figure, the first field of the first node of the linked list contains 0, which means the 0th row, the second field contains 2, which means the 2nd column, and the third field contains 1, which is the non-zero element. So, the first node represents that element 1 is stored at the 0th row and 2nd column in the given sparse matrix. In a similar manner, all of the nodes represent the non-zero elements of the sparse matrix.
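The node layout described above (row, column, value, and the address of the next node) can be sketched as follows. The class and function names are hypothetical, and the small example matrix is made up so that its first non-zero element is the value 1 at row 0, column 2, as in the description.

```python
class Node:
    """One non-zero element: row index, column index, value, link to the next node."""

    def __init__(self, row, col, value):
        self.row = row
        self.col = col
        self.value = value
        self.next = None


def build_linked_list(matrix):
    """Scan the matrix row by row and chain a node for every non-zero element."""
    head = tail = None
    for r, row in enumerate(matrix):
        for c, v in enumerate(row):
            if v != 0:
                node = Node(r, c, v)
                if head is None:
                    head = tail = node
                else:
                    tail.next = node
                    tail = node
    return head


def to_tuples(head):
    """Walk the list and collect (row, col, value) for each node."""
    out = []
    while head is not None:
        out.append((head.row, head.col, head.value))
        head = head.next
    return out


matrix = [[0, 0, 1, 0],
          [2, 0, 0, 0],
          [0, 3, 0, 0]]
print(to_tuples(build_linked_list(matrix)))  # [(0, 2, 1), (1, 0, 2), (2, 1, 3)]
```

Unlike the array form, nodes can be inserted or removed without shifting the remaining entries.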
Sparse coding builds on the more general idea of a neural code. Consider the case when you have binary neurons. So, basically:
• The neural networks will get some inputs and deliver outputs
• Some neurons in the neural network will be frequently activated while
others won’t be activated at all to calculate the outputs
• The average activity ratio refers to the number of activations on some
data, whereas the neural code is the observation of those activations
for a specific input
• Neural coding is the process of instructing your neurons to produce a
reliable neural code
Now that we know what a neural code is, we can speculate on what it may be like.
Then, data will be encoded using a sparse code while taking into
consideration the following scenarios:
These are the methods which are being followed to represent image and its
classifications
• Improve Accuracy
Note: The Random Forest algorithm is one of the most common bagging algorithms.
• Assess the ensemble’s performance on test data and use the aggregated models
for predictions on new data.
• If needed, retrain the ensemble with new data or integrate new models into the
existing ensemble.
The main idea behind ensemble learning is to use multiple algorithms and models together for the same task. While a single model uses only one algorithm to make predictions, bagging and boosting methods combine several models to achieve better predictions with higher consistency than the individual learners.
Image classification
In the above example, it was observed that a specific record was predicted as a
dog by the logistic regression and decision tree models, while a support vector
machine identified it as a cat. As various models have their distinct advantages
and disadvantages for particular records, it is the key idea of ensemble learning to
combine all three models instead of selecting only one approach that showed the
highest accuracy.
The procedure is called aggregation or voting. It combines the predictions of all underlying models to come up with one prediction that is assumed to be more precise than that of any sub-model on its own.
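For the dog/cat example, hard (majority) voting can be sketched as below. The dictionary of model names and the function name are hypothetical; real ensembles may instead average predicted probabilities.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the largest number of models."""
    return Counter(predictions).most_common(1)[0][0]


# Predictions of the three models for one record, as in the example above.
votes = {
    "logistic_regression": "dog",
    "decision_tree": "dog",
    "support_vector_machine": "cat",
}
print(majority_vote(list(votes.values())))  # dog
```

Two of the three models agree, so the ensemble predicts "dog" even though one model disagrees.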
Boosting is an ensemble learning method that involves training homogeneous weak learners sequentially, such that each base model depends on the previously fitted base models. All these base learners are then combined in an adaptive way to obtain an ensemble model.
In boosting, the ensemble model is the weighted sum of all constituent base learners. Common boosting algorithms, which differ in how the base models are trained and aggregated, include:
• Adaptive Boosting (AdaBoost)
• Gradient Boosting
• XGBoost
• Adaptive Learning
• Reduces Bias
• Flexibility
• Bagging is best for high-variance, low-bias models, while boosting is effective when the model must adapt to its errors and is suitable for reducing both bias and variance errors.
• Boosting can overfit if the number of models or iterations is high, whereas bagging is less prone to overfitting.
Tangent propagation does not require explicitly visiting a new input point. Instead, it analytically regularizes the model to resist perturbation in the directions corresponding to the specified transformation. While this analytical approach is intellectually elegant, it has two major drawbacks. First, it only regularizes the model to resist infinitesimal perturbation. Explicit dataset augmentation confers resistance to larger perturbations (that is, larger changes to the inputs). Second, the infinitesimal approach poses difficulties for models based on rectified linear units. These models can only shrink their derivatives by turning units off or shrinking their weights.
They are not able to shrink their derivatives by saturating at a high value with large weights, as sigmoid or tanh units can. Dataset augmentation works well with rectified linear units because different subsets of rectified units can activate for different transformed versions of each original input. Tangent propagation is also related to double backprop (Drucker and LeCun, 1992) and adversarial training.
In the first figure, f(x) is the hypothesis and x1, x2, x3 are the instances, and these instances fit the proper hypothesis. In the second figure we can see the instances classified, and the machine learns to fit the proper hypothesis by
doing necessary modification by using
Notice the first term in this definition of E is the original squared error of the
network versus training values, and the second term is the squared error in the
network versus training derivatives.
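The definition of E referred to here is missing from the text. A form consistent with the description (squared error on values plus squared error on derivatives), following the usual TANGENTPROP formulation, is sketched below. It assumes a target function f, the network's approximation \hat{f}, transformations s_j(α, x) with s_j(0, x) = x, and a constant μ that weights the derivative term. These symbols are assumptions, since the original equation is not shown.

```latex
E = \sum_{i} \left[ \left( f(x_i) - \hat{f}(x_i) \right)^2
  + \mu \sum_{j} \left( \frac{\partial f\big(s_j(\alpha, x_i)\big)}{\partial \alpha}
  - \frac{\partial \hat{f}\big(s_j(\alpha, x_i)\big)}{\partial \alpha} \right)^2_{\alpha = 0} \right]
```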
In the third figure we can see the instances are classified properly and maintaining
accuracy.
An Illustrative Example
Remarks: To summarize, TANGENTPROP uses prior knowledge in the form of
desired derivatives of the target function with respect to transformations of its
inputs.