
Unit 4

Regularization for Deep Learning: Parameter Norm Penalties, Norm Penalties as Constrained Optimization, Regularization and Under-Constrained Problems, Dataset Augmentation, Noise Robustness, Semi-Supervised Learning, Multi-Task Learning, Early Stopping, Parameter Typing and Parameter Sharing, Sparse Representations, Bagging and Other Ensemble Methods, Dropout, Adversarial Training, Tangent Distance, Tangent Prop and Manifold Tangent Classifier

Regularization is a set of techniques that can prevent overfitting in neural networks and thus improve the accuracy of a deep learning model when facing completely new data from the problem domain.

Regularization is a technique used in machine learning and deep learning to prevent overfitting and improve the generalization performance of a model. It involves adding a penalty term to the loss function during training.

This penalty discourages the model from becoming too complex or having large
parameter values, which helps in controlling the model’s ability to fit noise in the
training data.
Regularization methods include L1 and L2 regularization, dropout, early stopping,
and more. By applying regularization, models become more robust and better at
making accurate predictions on unseen data.

Underfitting in Machine Learning


A statistical model or a machine learning algorithm is said to have underfitting when the model is too simple to capture the complexities of the data. It represents the inability of the model to learn the training data effectively, resulting in poor performance on both the training and testing data. In simple terms, an underfit model's predictions are inaccurate, especially when applied to new, unseen examples. Underfitting mainly happens when we use a very simple model with overly simplified assumptions. To address the underfitting problem, we need to use more complex models with enhanced feature representation and less regularization. Note: an underfitting model has high bias and low variance.

Bias and Variance in Machine Learning


• Bias: Bias refers to the error due to overly simplistic assumptions in the
learning algorithm. These assumptions make the model easier to comprehend
and learn but might not capture the underlying complexities of the data. It is
the error due to the model's inability to represent the true relationship between
input and output accurately. When a model performs poorly on both the
training and testing data, this means high bias because of the simple model,
indicating underfitting.
• Variance: Variance, on the other hand, is the error due to the model’s
sensitivity to fluctuations in the training data. It’s the variability of the model’s
predictions for different instances of training data. High variance occurs when
a model learns the training data’s noise and random fluctuations rather than
the underlying pattern. As a result, the model performs well on the training
data but poorly on the testing data, indicating overfitting.

Reasons for Underfitting


1. The model is too simple, so it may not be capable of representing the
complexities in the data.
2. The input features used to train the model are not adequate representations
of the underlying factors influencing the target variable.
3. The size of the training dataset is not large enough.
4. Excessive regularization is used to prevent overfitting, which constrains
the model from capturing the data well.

Techniques to Reduce Underfitting
1. Increase model complexity.
2. Increase the number of features, performing feature engineering.
3. Remove noise from the data.
4. Increase the number of epochs or increase the duration of training to get better results.

Example: An epoch is one complete cycle in which all of the training data is used once to train the machine learning model. Another way to define an epoch is as the number of passes the training dataset takes through the algorithm.

Overfitting in Machine Learning

A statistical model is said to be overfitted when it does not make accurate
predictions on testing data. When a model is trained on so much data that it starts
learning from the noise and inaccurate entries in the data set, testing on test data
results in high variance: the model does not categorize the data correctly because
of too many details and noise. Overfitting is often caused by non-parametric and
non-linear methods, because these types of machine learning algorithms have
more freedom in building the model based on the dataset and can therefore build
unrealistic models. A solution to avoid overfitting is using a linear algorithm if we
have linear data, or using parameters like the maximal depth if we are using
decision trees.

Reasons for Overfitting:


1. High variance and low bias.
2. The model is too complex.
3. The size of the training data.

Techniques to Reduce Overfitting
1. Increase training data.
2. Reduce model complexity.
3. Early stopping during the training phase (keep an eye on the loss over the
training period; as soon as the loss begins to increase, stop training).
4. Ridge regularization and Lasso regularization.
5. Use dropout for neural networks to tackle overfitting.
Good Fit in a Statistical Model
Ideally, a model that makes its predictions with zero error is said to have a good
fit on the data. This situation is achievable at a spot between overfitting and
underfitting. In order to understand it, we will have to look at the performance of
our model with the passage of time, while it is learning from the training dataset.
With the passage of time, our model will keep on learning, and thus the error for
the model on the training and testing data will keep on decreasing. If it learns for
too long, the model will become more prone to overfitting due to the presence of
noise and less useful details, and its performance will decrease. In order to get a
good fit, we stop at a point just before the error starts increasing. At this point, the
model is said to have good skill on the training dataset as well as on our unseen
testing dataset.

Parameter norm Penalties

Parameter Norm Penalties are regularization methods that apply a penalty to the
norm of parameters in the objective function of a neural network.
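
In standard notation (as in Goodfellow et al.), the regularized objective adds the penalty Ω(θ), weighted by a hyperparameter α ≥ 0, to the unregularized loss J:

J̃(θ; X, y) = J(θ; X, y) + α Ω(θ)

Setting α = 0 recovers the original objective, while larger values of α correspond to stronger regularization.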

Different Regularization Techniques in Deep Learning


Now that we have an understanding of how regularization helps in reducing
overfitting, we’ll learn a few different techniques in order to apply regularization
in deep learning.

Lasso Regression

A regression model which uses the L1 regularization technique is called LASSO
(Least Absolute Shrinkage and Selection Operator) regression. Lasso regression
adds the "absolute value of magnitude" of the coefficients as a penalty term to the
loss function (L). Lasso regression also helps us achieve feature selection by
penalizing weights to be approximately equal to zero if the corresponding feature
does not serve any purpose in the model.

The Lasso cost function can be written as:

L = (1/n) Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² + λ Σⱼ₌₁ᵐ |wⱼ|

where,

• m – number of features
• n – number of examples
• yᵢ – actual target value
• ŷᵢ – predicted target value

Ridge Regression
A regression model that uses the L2 regularization technique is called Ridge
regression. Ridge regression adds the "squared magnitude" of the coefficients as a
penalty term to the loss function (L):

L = (1/n) Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² + λ Σⱼ₌₁ᵐ wⱼ²

L2 & L1 regularization

L1 and L2 are the most common types of regularization. These update the general
cost function by adding another term known as the regularization term.

Cost function = Loss (say, binary cross entropy) + Regularization term


Due to the addition of this regularization term, the values of weight matrices
decrease because it assumes that a neural network with smaller weight matrices
leads to simpler models. Therefore, it will also reduce overfitting to quite an
extent.

However, this regularization term differs in L1 and L2.

In L2, we have:

Cost function = Loss + (λ / 2m) × Σ ‖w‖²

Here, lambda is the regularization parameter. It is the hyperparameter whose
value is optimized for better results. L2 regularization is also known as weight
decay, as it forces the weights to decay towards zero (but not exactly zero).

In L1, we have:

Cost function = Loss + (λ / 2m) × Σ ‖w‖

In this, we penalize the absolute value of the weights. Unlike L2, the weights may
be reduced to zero here. Hence, it is very useful when we are trying to
compress our model. Otherwise, we usually prefer L2 over it.

In Keras, we can directly apply regularization to any layer using the regularizers
module. Below, a regularizer is applied to a dense layer having 500 neurons and a
ReLU activation function.

#creating sequential model
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from keras import regularizers

model = Sequential()
model.add(Conv2D(filters=16, kernel_size=2, padding="same", activation="relu", input_shape=(50, 50, 3)))
model.add(MaxPooling2D(pool_size=2))
model.add(Conv2D(filters=32, kernel_size=2, padding="same", activation="relu"))
model.add(MaxPooling2D(pool_size=2))
model.add(Conv2D(filters=64, kernel_size=2, padding="same", activation="relu"))
model.add(MaxPooling2D(pool_size=2))
model.add(Flatten())
# l2 regularizer on the dense layer
model.add(Dense(500, kernel_regularizer=regularizers.l2(0.01), activation="relu"))
model.add(Dense(2, activation="softmax"))  # 2 output layer neurons

Note: Here the value 0.01 is the value of regularization parameter, i.e., lambda,
which we need to optimize further

Similarly, we can also apply L1 regularization.
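
For instance, the same dense layer can use an L1 penalty instead (a minimal variation on the block above; 0.01 is again the tunable lambda):

model.add(Dense(500, kernel_regularizer=regularizers.l1(0.01), activation="relu"))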

2. Norm penalties as constrained optimization

We can construct a generalized Lagrangian function containing the objective
function along with a penalty whose strength can be increased or decreased as
needed. Suppose we want Ω(θ) < k; we can then construct the following
Lagrangian:

L(θ, α) = J(θ) + α (Ω(θ) − k)

We get the optimal θ by solving the Lagrangian: θ* = argmin_θ max_{α ≥ 0} L(θ, α).
If Ω(θ) > k, then the weights need to be penalized heavily, and hence α should be
large to reduce the norm below k.

Likewise, if Ω(θ)<k, then the norm shouldn’t be reduced too much and hence, α
should be small. This is now similar to the parameter norm penalty regularized
objective function as both of them encourage lower values of the norm. Thus,
parameter norm penalties naturally impose a constraint, like the L²-regularization,
defining a constrained L²-ball.

Larger α implies a smaller constrained region as it pushes the values really low,
hence, allowing a small radius and vice versa. The idea of constraints over
penalties is important for several reasons. Large penalties might cause non-convex
optimization algorithms to get stuck in local minima due to small values of θ,
leading to the formation of so-called dead cells, as the weights entering and
leaving them are too small to have an impact.

Constraints don’t enforce the weights to be near zero, rather being confined to a
constrained region.
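
Deep learning libraries also expose explicit constraints directly. As a minimal sketch (assuming Keras; the max-norm radius of 3 is an arbitrary illustration), a layer's weights can be confined to a norm ball instead of being penalized:

from keras.constraints import MaxNorm
from keras.layers import Dense

# After each gradient update, the kernel is re-projected so that ||w|| <= 3
layer = Dense(500, activation="relu", kernel_constraint=MaxNorm(3))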

3. Regularized & Under-constrained problems

Underdetermined problems are those problems that have infinitely many
solutions. For example, a logistic regression problem on linearly separable classes
that has w as a solution will also have 2w as a solution, and so on. In some
machine learning problems, regularization is necessary. For example, many
algorithms require the inversion of XᵀX, which might be singular. In such a case,
we can use a regularized form instead: (XᵀX + αI) is guaranteed to be invertible.

Regularization can also solve underdetermined problems. For example, the
Moore-Penrose pseudoinverse defined earlier can be written as:

X⁺ = lim_{α→0} (XᵀX + αI)⁻¹ Xᵀ

This can be interpreted as performing linear regression with L²-regularization.

Many linear models in machine learning, including linear regression, depend on
inverting the matrix XᵀX. This is not possible when XᵀX is singular, which can
happen whenever the data generating distribution truly has no variance in some
direction, or when no variance is observed in some direction because there are
fewer examples (rows of X) than input features (columns of X). In this case, many
forms of regularization correspond to inverting (XᵀX + αI) instead, and this
regularized matrix is guaranteed to be invertible.

Data Augmentation

The simplest way to reduce overfitting is to increase the size of the training data.
In classical machine learning, however, increasing the size of the training data is
often not feasible because labeled data is too costly.

But, now let’s consider we are dealing with images. In this case, there are a few
ways of increasing the size of the training data – rotating the image, flipping,
scaling, shifting, etc. In the below image, some transformation has been done on
the handwritten digits dataset.
This technique is known as data augmentation. This usually provides a big leap in
improving the accuracy of the model. It can be considered as a mandatory trick in
order to improve our predictions.

Below is the implementation code example:

from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    featurewise_center=False,             # set input mean to 0 over the dataset
    samplewise_center=False,              # set each sample mean to 0
    featurewise_std_normalization=False,  # divide inputs by std of the dataset
    samplewise_std_normalization=False,   # divide each input by its std
    zca_whitening=False,                  # apply ZCA whitening
    rotation_range=10,                    # randomly rotate images in the range (degrees, 0 to 180)
    zoom_range=0.1,                       # randomly zoom images
    width_shift_range=0.1,                # randomly shift images horizontally (fraction of total width)
    height_shift_range=0.1,               # randomly shift images vertically (fraction of total height)
    horizontal_flip=False,                # randomly flip images horizontally
    vertical_flip=False)                  # randomly flip images vertically

datagen.fit(x_train)

Dropout

This is one of the most interesting types of regularization techniques. It also
produces very good results and is consequently the most frequently used
regularization technique in the field of deep learning.

To understand dropout, consider a standard fully connected neural network.
So what does dropout do? At every iteration, it randomly selects some nodes and
removes them, along with all of their incoming and outgoing connections.

So each iteration has a different set of nodes and this results in a different set of
outputs. It can also be thought of as an ensemble technique in machine
learning.

Ensemble models usually perform better than a single model as they capture more
randomness. Similarly, dropout also performs better than a normal neural network
model.

The probability of dropping nodes is the hyperparameter of the dropout function.
Dropout can be applied to both the hidden layers as well as the input layers.
Due to these reasons, dropout is usually preferred when we have a large neural
network structure in order to introduce more randomness.

In Keras, we can implement dropout using the Dropout layer. Below is the
dropout implementation. I have introduced dropout with a probability of 0.2 after
the last convolutional layer (64 filters) and after the first dense layer (500
neurons).

#creating sequential model
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

model = Sequential()
model.add(Conv2D(filters=16, kernel_size=2, padding="same", activation="relu", input_shape=(50, 50, 3)))
model.add(MaxPooling2D(pool_size=2))
model.add(Conv2D(filters=32, kernel_size=2, padding="same", activation="relu"))
model.add(MaxPooling2D(pool_size=2))
model.add(Conv2D(filters=64, kernel_size=2, padding="same", activation="relu"))
model.add(MaxPooling2D(pool_size=2))
# 1st dropout
model.add(Dropout(0.2))
model.add(Flatten())
model.add(Dense(500, activation="relu"))
# 2nd dropout
model.add(Dropout(0.2))
model.add(Dense(2, activation="softmax"))  # 2 output layer neurons

Early stopping

Early stopping is a kind of cross-validation strategy where we keep one part of the
training set as the validation set. When we see that the performance on the
validation set is getting worse, we immediately stop the training on the model.
This is known as early stopping.

In the above image, we will stop training at the dotted line since after that our
model will start overfitting on the training data.

In Keras, we can apply early stopping using the callbacks function. Below is the
implementation code for it. I have applied early stopping so that training will stop
immediately if the validation metric does not improve for 3 epochs.
from keras.callbacks import EarlyStopping

earlystop = EarlyStopping(monitor='val_acc', patience=3)
epochs = 20
# batch_size = 256

Here, monitor denotes the quantity that needs to be monitored; 'val_acc' denotes
the validation accuracy.

Patience denotes the number of epochs with no further improvement after which
the training will be stopped. For better understanding, let’s take a look at the
above image again. After the dotted line, each epoch will result in a higher value
of validation error.
Therefore, 3 epochs after the dotted line (since our patience is equal to 3), our
model will stop because no further improvement is seen.
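
To actually use the callback, pass it to model.fit. A minimal sketch (x_train and y_train are placeholders; note that monitoring 'val_acc' requires the model to be compiled with an accuracy metric):

model.fit(x_train, y_train,
          validation_split=0.2,
          epochs=epochs,
          callbacks=[earlystop])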

Noise Robustness

Noise applied to inputs is a form of data augmentation. For some models, the
addition of noise with extremely small variance at the input is equivalent to
imposing a penalty on the norm of the weights.

Noise injection applied to hidden units can be much more powerful than simply
shrinking the parameters. Noise applied to hidden units is so important that
dropout can be seen as the main development of this approach.

Training a neural network with a small dataset can cause the network to memorize
all training examples, in turn leading to overfitting and poor performance on a
holdout dataset. One approach to making the input space smoother and easier to
learn is to add noise to inputs during training.

• Small datasets can make learning challenging for neural nets and the examples
can be memorized.
• Adding noise during training can make the training process more robust and
reduce generalization error.
• Noise is traditionally added to the inputs, but can also be added to weights,
gradients, and even activation functions.
Random noise can be added to other parts of the network during training. Some
examples include:

• Add noise to activations, i.e. the outputs of each layer.
• Add noise to weights, i.e. an alternative to the inputs.
• Add noise to the gradients, i.e. the direction to update weights.
• Add noise to the outputs, i.e. the labels or target variables.
The addition of noise to the layer activations allows noise to be used at any point
in the network. This can be beneficial for very deep networks. Noise can be added
to the layer outputs themselves, but this is more likely achieved via the use of a
noisy activation function.
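
Keras exposes this idea directly through the GaussianNoise layer, which injects zero-mean noise during training only. A minimal sketch (the layer sizes, input shape, and 0.1 standard deviation are illustrative assumptions):

from keras.models import Sequential
from keras.layers import Dense, GaussianNoise

model = Sequential()
model.add(Dense(64, activation="relu", input_shape=(20,)))
# Zero-mean Gaussian noise added to the previous layer's activations (train time only)
model.add(GaussianNoise(0.1))
model.add(Dense(1, activation="sigmoid"))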

The addition of noise to weights allows the approach to be used throughout the
network in a consistent way instead of adding noise to inputs and layer
activations. This is particularly useful in recurrent neural networks.
The addition of noise to gradients focuses more on improving the robustness of
the optimization process itself rather than the structure of the input domain. The
amount of noise can start high at the beginning of training and decrease over time,
much like a decaying learning rate. This approach has proven to be an effective
method for very deep networks and for a variety of different network types

Adding noise to the activations, weights, or gradients all provide a more generic
approach to adding noise that is invariant to the types of input variables provided
to the model.

If the problem domain is believed or expected to have mislabeled examples, then
the addition of noise to the class labels can improve the model's robustness to this
type of error, although too much label noise can also easily derail the learning
process.

Adding noise to a continuous target variable in the case of regression or time
series forecasting is much like the addition of noise to the input variables, and
may be an even better use case.

Semi-Supervised Learning
Semi-supervised learning is a type of machine learning that falls in between
supervised and unsupervised learning. It is a method that uses a small amount of
labeled data and a large amount of unlabeled data to train a model. The goal of
semi-supervised learning is to learn a function that can accurately predict the
output variable based on the input variables, similar to supervised learning.
However, unlike supervised learning, the algorithm is trained on a dataset that
contains both labeled and unlabeled data.
Semi-supervised learning is particularly useful when there is a large amount of
unlabeled data available, but it’s too expensive or difficult to label all of it.

Examples of Semi-Supervised Learning


• Text classification: In text classification, the goal is to classify a given text
into one or more predefined categories. Semi-supervised learning can be used
to train a text classification model using a small amount of labeled data and a
large amount of unlabeled text data.
• Image classification: In image classification, the goal is to classify a given
image into one or more predefined categories. Semi-supervised learning can
be used to train an image classification model using a small amount of labeled
data and a large amount of unlabeled image data.
• Anomaly detection: In anomaly detection, the goal is to detect patterns or
observations that are unusual or different from the norm

Applications of Semi-Supervised Learning


1. Speech Analysis: Since labeling audio files is a very labor-intensive task, semi-
supervised learning is a very natural approach to solve this problem.
2. Internet Content Classification: Labeling each webpage is an impractical and
unfeasible process, so semi-supervised learning algorithms are used. Even the
Google search algorithm uses a variant of semi-supervised learning to rank the
relevance of a webpage for a given query.
3. Protein Sequence Classification: Since DNA strands are typically very large,
semi-supervised learning has become prominent in this field.
Disadvantages of Semi-Supervised Learning
The most basic disadvantage of any Supervised Learning algorithm is that the
dataset has to be hand-labeled either by a Machine Learning Engineer or a Data
Scientist. This is a very costly process, especially when dealing with large
volumes of data. The most basic disadvantage of any Unsupervised Learning is
that its application spectrum is limited.
To counter these disadvantages, the concept of Semi-Supervised Learning was
introduced. In this type of learning, the algorithm is trained upon a combination of
labeled and unlabelled data. Typically, this combination will contain a very small
amount of labeled data and a very large amount of unlabelled data. The basic
procedure involved is that first, the programmer will cluster similar data using an
unsupervised learning algorithm and then use the existing labeled data to label the
rest of the unlabelled data.
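
As a minimal sketch of this procedure, scikit-learn's SelfTrainingClassifier wraps a base supervised model and iteratively pseudo-labels the unlabeled points, which are marked with the label -1 (the toy data here is purely illustrative):

import numpy as np
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

# Toy data: 100 points, only the first 4 are labeled; -1 marks unlabeled points
X = np.random.rand(100, 2)
y = np.full(100, -1)
y[:2] = 0
y[2:4] = 1

base = SVC(probability=True)  # self-training requires probability estimates
model = SelfTrainingClassifier(base)
model.fit(X, y)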

Multi-Task Learning

Multi-Task Learning (MTL) is a type of machine learning technique where a
model is trained to perform multiple tasks simultaneously. In deep learning, MTL
refers to training a neural network to perform multiple tasks by sharing some of
the network's layers and parameters across tasks.

In MTL, the goal is to improve the generalization performance of the model by
leveraging the information shared across tasks. By sharing some of the network's
parameters, the model can learn a more efficient and compact representation of
the data, which can be beneficial when the tasks are related or have some
commonalities.

Hard Parameter Sharing – A common hidden layer is used for all tasks, but
several task-specific layers are kept intact towards the end of the model. This
technique is very useful because, by learning a representation for various tasks
through a common hidden layer, we reduce the risk of overfitting (a hard-
parameter-sharing sketch in Keras follows below).
Soft Parameter Sharing – Each model has its own set of weights and biases, and
the distance between these parameters in different models is regularized so that
the parameters become similar and can represent all the tasks.
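
Below is a minimal hard-parameter-sharing sketch in Keras (the input shape, layer sizes, and task heads are illustrative assumptions):

from keras.models import Model
from keras.layers import Input, Dense

inputs = Input(shape=(100,))

# Shared hidden layers (hard parameter sharing)
shared = Dense(64, activation="relu")(inputs)
shared = Dense(32, activation="relu")(shared)

# Task-specific heads on top of the shared representation
task_a = Dense(10, activation="softmax", name="task_a")(shared)  # e.g. a classification task
task_b = Dense(1, name="task_b")(shared)                         # e.g. a regression task

model = Model(inputs=inputs, outputs=[task_a, task_b])
model.compile(optimizer="adam",
              loss={"task_a": "categorical_crossentropy", "task_b": "mse"})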

Assumptions and Considerations –


Using MTL to share knowledge among tasks is useful only when the tasks are
very similar; when this assumption is violated, performance can decline
significantly. Applications: MTL techniques have found various uses; some of the
major applications are:
• Object detection and Facial recognition
• Self Driving Cars: Pedestrians, stop signs and other obstacles can be detected
together
• Multi-domain collaborative filtering for web applications
• Stock Prediction
• Language Modelling and other NLP applications
Important observations about Multi-Task Learning (MTL) for deep learning:

1. Task relatedness: MTL is most effective when the tasks are related or have
some commonalities, such as natural language processing, computer vision,
and healthcare.
2. Data limitation: MTL can be useful when the data is limited, as it allows the
model to leverage the information shared across tasks to improve the
generalization performance.
3. Shared feature extractor: A common approach in MTL is to use a shared
feature extractor, which is a part of the network that is shared across tasks and
is used to extract features from the input data.
4. Task-specific heads: Task-specific heads are used to make predictions for each
task and are typically connected to the shared feature extractor.
5. Shared decision-making layer: another approach is to use a shared decision-
making layer, where the decision-making layer is shared across tasks, and the
task-specific layers are connected to the shared decision-making layer.

Parameter Typing
Suppose two models are performing the same classification task (with the same
set of classes), but their input distributions are somewhat different.

• Model A has parameters w(A).
• Model B has parameters w(B).

The two models map the input to two different but related outputs. Assume the
tasks are comparable enough (possibly with similar input and output
distributions) that the model parameters should be near to each other: w(A)
should be close to w(B).

We can take advantage of this information through regularization by applying a
parameter norm penalty of the form:

Ω(w(A), w(B)) = ‖w(A) − w(B)‖₂²

We used an L² penalty here, but other choices are also possible.

Parameter Sharing
In this approach, the parameters of one model, trained as a classifier in a
supervised paradigm, are regularised to be close to the parameters of another
model, trained in an unsupervised paradigm (to capture the distribution of the
observed input data). The architectures can be designed so that many of the
parameters in the classifier model can be paired with corresponding parameters in
the unsupervised model.

While a parameter norm penalty is one technique to require sets of parameters to
be equal, constraints are a more prevalent way to regularise parameters to be close
to one another. Because we view the numerous models or model components as
sharing a unique set of parameters, this form of regularisation is commonly
referred to as parameter sharing. The fact that only a subset of the parameters (the
unique set) needs to be retained in memory is a significant advantage of parameter
sharing over regularising the parameters to be close (through a norm penalty).
This can result in a large reduction in the memory footprint of certain models,
such as the convolutional neural network.

Example: Convolutional neural networks (CNNs) used in computer vision are by
far the most widespread and extensive usage of parameter sharing. Many
statistical features of natural images are translation insensitive. A shot of a cat, for
example, can be translated one pixel to the right and still be a shot of a cat. By
sharing parameters across several image locations, CNNs take this property into
account. Different locations in the input are computed with the same feature (a
hidden unit with the same weights).
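
In Keras, parameter sharing falls out naturally when the same layer object is applied to more than one input: every application uses the one unique set of weights. A minimal sketch (shapes are illustrative):

from keras.models import Model
from keras.layers import Input, Dense

# One layer object = one unique set of parameters
shared_dense = Dense(16, activation="relu")

left = Input(shape=(8,))
right = Input(shape=(8,))

# Applying the same layer to both inputs shares its kernel and bias
left_out = shared_dense(left)
right_out = shared_dense(right)

model = Model(inputs=[left, right], outputs=[left_out, right_out])
model.summary()  # only one kernel and one bias are stored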
Sparse Representations

Sparse representation (SR) is used to represent data with as few atoms as possible
in a given overcomplete dictionary. By using the SR, we can concisely represent
the data and easily extract the valuable information from the data

Sparse representation classification (SRC) is a powerful technique for pixelwise
classification of images, and it is increasingly being used for a wide variety of
image analysis tasks. The method uses sparse representation and learned
redundant dictionaries to classify image pixels.

Sparse representation attracts great attention as it can significantly save computing


resources and find the characteristics of data in a low-dimensional space. Thus, it
can be widely applied in engineering fields such as dictionary learning, signal
reconstruction, image clustering, feature selection, and extraction.
As real-world data becomes more diverse and complex, it becomes hard to
completely reveal the intrinsic structure of data with commonly used approaches.
This has led to the exploration of more practicable representation models and
efficient optimization approaches. New formulations such as deep sparse
representation, graph-based sparse representation, geometry-guided sparse
representation, and group sparse representation have achieved remarkable success

The terms "sparse" and "dense" are commonly used to describe the
distribution of zero and non-zero array members in machine learning (e.g.,
in a vector or matrix). Sparse matrices are those that primarily consist of zeros,
while dense matrices have a large number of non-zero entries.

Machine learning makes use of sparse and dense representations due to their
usefulness in efficient data representation. While dense representations are useful
for capturing intricate interactions between data points, sparse representations can
help minimize the size of a dataset.

• Sparse representations have the potential to be more resilient to noise
and produce more interpretable outcomes. For calculations, dense
representations are typically more effective since they can be processed
more quickly. On top of that, dense representations are useful for tasks
like classification and regression because they can capture intricate
connections between data points.
• Sparse representations are helpful for reducing the dimensionality of the
data in tasks like natural language processing and picture recognition.
Further, sparse representations can be utilized to capture only the most
crucial elements of the data, which can greatly cut down on the time
needed to train a model.

• Because dense representations are able to capture complicated interactions
between data points, they are frequently employed in machine learning
and can be especially helpful for tasks like classification and regression.
Because of their increased computational efficiency, dense
representations can also shorten the time it takes to train a model.
A matrix is a two-dimensional data object made of m rows and n columns,
therefore having total m x n values. If most of the elements of the matrix have 0
value, then it is called a sparse matrix.
Why use a sparse matrix instead of a simple matrix?
• Storage: There are fewer non-zero elements than zeros, so less memory is
needed to store only those elements.
• Computing time: Computing time can be saved by logically designing a data
structure that traverses only the non-zero elements.

Sparse matrix representations can be done in many ways; the following are two
common representations:
1. Array representation
2. Linked list representation

Example -

Let's understand the array representation of sparse matrix with the help of the
example given below -

Consider a 5x4 sparse matrix containing 7 non-zero elements and 13 zero
elements (the original figure is omitted). The matrix occupies 5x4 = 20 memory
spaces. Increasing the size of the matrix will increase the wasted space.

The tabular (triplet) representation of the matrix has the following structure:


In the above structure, first column represents the rows, the second column
represents the columns, and the third column represents the non-zero value. The
first row of the table represents the triplets. The first triplet represents that the
value 4 is stored at 0th row and 1st column. Similarly, the second triplet
represents that the value 5 is stored at the 0th row and 3rd column. In a similar
manner, all triplets represent the stored location of the nonzero elements in the
matrix.

The size of the table depends upon the total number of non-zero elements in the
given sparse matrix. Above table occupies 8x3 = 24 memory space which is more
than the space occupied by the sparse matrix. So, what's the benefit of using the
sparse matrix? Consider the case if the matrix is 8*8 and there are only 8 non-zero
elements in the matrix, then the space occupied by the sparse matrix would be 8*8
= 64, whereas the space occupied by the table represented using triplets would be
8*3 = 24.
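
A small Python sketch of this array (triplet) representation; the example matrix below is an assumed 5x4 matrix with 7 non-zero elements, consistent with the description above:

# Build the (row, column, value) triplets of a sparse matrix
matrix = [
    [0, 4, 0, 5],
    [0, 0, 0, 0],
    [0, 0, 3, 6],
    [2, 0, 0, 0],
    [0, 7, 1, 0],
]

triplets = [(r, c, v)
            for r, row in enumerate(matrix)
            for c, v in enumerate(row)
            if v != 0]

print(triplets)
# [(0, 1, 4), (0, 3, 5), (2, 2, 3), (2, 3, 6), (3, 0, 2), (4, 1, 7), (4, 2, 1)]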

Example -

Let's understand the linked list representation of sparse matrix with the help of the
example given below -

Consider a 4x4 sparse matrix containing 5 non-zero elements and 11 zero
elements (the original figure is omitted). The matrix occupies 4x4 = 16 memory
spaces. Increasing the size of the matrix will increase the wasted space.

In the linked list representation (original figure omitted), each non-zero element
is stored in a node. The first field of a node represents the index of the row, the
second field represents the index of the column, the third field represents the
value, and the fourth field contains the address of the next node.

For example, the first field of the first node of the linked list contains 0, which
means the 0th row; the second field contains 2, which means the 2nd column; and
the third field contains 1, which is the non-zero element. So, the first node
represents that element 1 is stored at the 0th row, 2nd column of the given sparse
matrix. In a similar manner, all of the nodes represent the non-zero elements of
the sparse matrix.
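
A minimal Python sketch of the linked list representation, with node fields mirroring the description above (the matrix argument is assumed, as the original figure is omitted):

class Node:
    # Fields: row index, column index, value, pointer to the next node
    def __init__(self, row, col, value):
        self.row = row
        self.col = col
        self.value = value
        self.next = None

def to_linked_list(matrix):
    head = tail = None
    for r, row in enumerate(matrix):
        for c, value in enumerate(row):
            if value != 0:
                node = Node(r, c, value)
                if head is None:
                    head = tail = node
                else:
                    tail.next = node
                    tail = node
    return head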

Sparse Coding representation in Neural Networks

A sparse code follows the more general, all-encompassing idea of a neural code.
Consider the case when you have binary neurons. So, basically:

• The neural networks will get some inputs and deliver outputs
• Some neurons in the neural network will be frequently activated while
others won’t be activated at all to calculate the outputs
• The average activity ratio refers to the number of activations on some
data, whereas the neural code is the observation of those activations
for a specific input
• Neural coding is the process of instructing your neurons to produce a
reliable neural code

Now that we know what a neural code is, we can speculate on what it may be like.
Then, data will be encoded using a sparse code while taking into
consideration the following scenarios:

• No neurons are even activated


• One neuron alone is activated
• Half of the neurons are active

These are the methods being followed to represent images and their
classifications.

Ensemble Learning Methods: Bagging, Boosting

Ensemble learning is a machine learning technique combining multiple individual
models to create a stronger, more accurate predictive model. By leveraging the
diverse strengths of different models, ensemble learning aims to mitigate errors,
enhance performance, and increase the overall robustness of predictions, leading
to improved results across various tasks in machine learning and neural networks.

Bagging, or Bootstrap Aggregating, is an ensemble learning method that is used to
reduce error by training homogeneous weak learners on different random
samples from the training set, in parallel. The results of these base learners are
then combined through a voting or averaging approach to produce an ensemble
model that is more robust and accurate.

Bagging mainly focuses on obtaining an ensemble model with lower variance
than the individual base models composing it. Hence, bagging techniques help
avoid overfitting of the model.
Benefits of Bagging
• Reduce Overfitting

• Improve Accuracy

• Handles Unstable Models

Note: Random Forest Algorithm is one of the most common Bagging Algorithm.

Steps of the Bagging Technique

• Randomly select multiple bootstrap samples from the training data with
replacement and train a separate model on each sample.
• For classification, combine predictions using majority voting. For regression,
average the predictions.
• Assess the ensemble's performance on test data and use the aggregated models
for predictions on new data.
• If needed, retrain the ensemble with new data or integrate new models into the
existing ensemble.
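
A minimal scikit-learn sketch of these steps, using decision trees as the homogeneous weak learners (all parameter values are illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# 50 trees, each fit on a bootstrap sample; predictions combined by majority voting
bagging = BaggingClassifier(DecisionTreeClassifier(),
                            n_estimators=50,
                            max_samples=0.8,
                            bootstrap=True,
                            random_state=0)
bagging.fit(X, y)
print(bagging.score(X, y))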

Example of Bagging and boosting

The main idea behind ensemble learning is the usage of multiple algorithms and
models together for the same task. While single models use only one algorithm to
create prediction models, bagging and boosting methods aim to combine several
of them to achieve better prediction with higher consistency compared to
individual learners.
Image classification

Suppose a collection of images, each accompanied by a categorical label
corresponding to the kind of animal, is available for the purpose of training a
model. In a traditional modeling approach, we would try several techniques and
calculate the accuracy to choose one over the other. Imagine we used logistic
regression, decision trees, and support vector machines, which perform differently
on the given data set.

In the above example, it was observed that a specific record was predicted as a
dog by the logistic regression and decision tree models, while a support vector
machine identified it as a cat. As various models have their distinct advantages
and disadvantages for particular records, it is the key idea of ensemble learning to
combine all three models instead of selecting only one approach that showed the
highest accuracy.

The procedure is called aggregation or voting, and it combines the predictions of
all underlying models to come up with one prediction that is assumed to be more
precise than any sub-model standing alone.
Boosting is an ensemble learning method that involves training homogeneous
weak learners sequentially, such that each base model depends on the previously
fitted base models. All these base learners are then combined in a very adaptive
way to obtain an ensemble model.
In boosting, the ensemble model is the weighted sum of all constituent base
learners. Several meta-algorithms in boosting differ in how the base models are
aggregated:
• Adaptive Boosting (AdaBoost)
• Gradient Boosting

• XGBoost

Benefits of Boosting Techniques


• High Accuracy

• Adaptive Learning

• Reduces Bias

• Flexibility

How a Boosting Model Is Trained to Make Predictions


• Samples generated from the training set are assigned the same weight to start
with. These samples are used to train a homogeneous weak learner or base
model.
• The prediction error for each sample is calculated; the greater the error, the
more the sample's weight increases. Hence, the sample becomes more
important for training the next base model.
• The individual learner is weighted too: one that does well on its predictions
gets a higher weight assigned to it. So, a model that outputs good predictions
will have a higher say in the final decision.
• The weighted data is then passed on to the following base model, and steps 2
and 3 are repeated until the data is fitted well enough to reduce the error
below a certain threshold.
• When new data is fed into the boosting model, it is passed through all
individual base models, and each model makes its own weighted prediction.
• The weights of these models are used to generate the final prediction. The
predictions are scaled and aggregated to produce a final prediction.
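
A minimal AdaBoost sketch in scikit-learn following these steps; by default it uses shallow decision stumps as the weak learners (parameter values are illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# Each learner is fit on reweighted data emphasizing the previous errors;
# the final prediction is a weighted vote of all learners
boosting = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=0)
boosting.fit(X, y)
print(boosting.score(X, y))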
Key Difference Between Bagging and Boosting
• The bagging technique combines multiple models trained on different subsets
of data, whereas boosting trains models sequentially, focusing on the error
made by the previous model.

• Bagging is best for high variance and low bias models while boosting is
effective when the model must be adaptive to errors, suitable for bias and
variance errors.

• Boosting techniques can be prone to overfitting if the number of models or
iterations is high, whereas the bagging technique is less prone to overfitting.

• Bagging improves accuracy by reducing variance, whereas boosting achieves
accuracy by reducing bias and variance.


About bias and variance used in bagging and boosting

Bias: While making predictions, a difference occurs between the values predicted
by the model and the actual/expected values; this difference is known as bias
error, or error due to bias.
o Low Bias: A low-bias model makes fewer assumptions about the form of
the target function.
o High Bias: A model with high bias makes more assumptions, and the
model becomes unable to capture the important features of our dataset. A
high-bias model also cannot perform well on new data.

Variance: Variance specifies the amount of variation in the prediction if different
training data were used. In simple words, variance tells how much a random
variable differs from its expected value. Ideally, a model should not vary too
much from one training dataset to another, which means the algorithm should be
good at understanding the hidden mapping between input and output variables.
Variance errors are either low variance or high variance.
o Low variance means there is a small variation in the prediction of the
target function with changes in the training data set, while high variance
shows a large variation in the prediction of the target function with
changes in the training dataset.

Tangent Distance, Tangent Prop, and Manifold Tangent Classifier

Tangent propagation is a way of regularizing neural nets. It encourages the
representation to be invariant by penalizing large changes in the representation
when small transformations are applied to the inputs.
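
In the notation of Goodfellow et al., the tangent prop regularizer penalizes the directional derivative of the model output f(x) along each given tangent vector v⁽ⁱ⁾ of the manifold:

Ω(f) = Σᵢ ((∇ₓ f(x))ᵀ v⁽ⁱ⁾)²

so that the output changes as little as possible when the input moves along the specified tangent directions.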

It combines this prior knowledge with observed training data by minimizing an
objective function that measures both the network's error with respect to the
training example values (fitting the data) and its error with respect to the desired
derivatives (fitting the prior knowledge).

Tangent propagation is closely related to dataset augmentation. In both cases, the
user of the algorithm encodes his or her prior knowledge of the task by specifying
a set of transformations that should not alter the output of the network. The
difference is that in the case of dataset augmentation, the network is explicitly
trained to correctly classify distinct inputs that were created by applying more
than an infinitesimal amount of these transformations.

Tangent propagation does not require explicitly visiting a new input point.
Instead, it analytically regularizes the model to resist perturbation in the directions
corresponding to the specified transformations. While this analytical approach is
intellectually elegant, it has two major drawbacks. First, it only regularizes the
model to resist infinitesimal perturbation; explicit dataset augmentation confers
resistance to larger perturbations (i.e., larger changes to the inputs). Second, the
infinitesimal approach poses difficulties for models based on rectified linear units.
These models can only shrink their derivatives by turning units off or shrinking
their weights.
They are not able to shrink their derivatives by saturating at a high value with
large weights, as sigmoid or tanh units can. Dataset augmentation works well with
rectified linear units because different subsets of rectified units can activate for
different transformed versions of each original input. Tangent propagation is also
related to double backprop (Drucker and LeCun, 1992) and adversarial training.

The TANGENTPROP Algorithm

TANGENTPROP (Simard et al. 1992) accommodates domain knowledge
expressed as derivatives of the target function with respect to transformations of
its inputs. Consider a learning task involving an instance space X and target
function f.

The TANGENTPROP algorithm assumes various training derivatives of the
target function are also provided. For example, if each instance xᵢ is described by
a single real value, then each training example may be of the form
(xᵢ, f(xᵢ), ∂f/∂x|ₓ₌ₓᵢ). Here ∂f/∂x|ₓ₌ₓᵢ denotes the derivative of the target function
f with respect to x, evaluated at the point x = xᵢ.

To develop an intuition for the benefits of providing training derivatives as well
as training values during learning, consider the simple learning task depicted in
the figure (omitted). The task is to learn the target function f shown in the
leftmost plot of the figure, based on the three training examples shown:
(x₁, f(x₁)), (x₂, f(x₂)), and (x₃, f(x₃)).

Given these three training examples, the BACKPROPAGATION algorithm can
be expected to hypothesize a smooth function, such as the function g depicted in
the middle plot of the figure. The rightmost plot shows the effect of providing
training derivatives, or slopes, as additional information for each training example
(e.g., (x₁, f(x₁), ∂f/∂x|ₓ₌ₓ₁)). By fitting both the training values f(xᵢ) and these
training derivatives ∂f/∂x|ₓ₌ₓᵢ, the learner has a better chance to correctly
generalize from the sparse training data.

To summarize, the impact of including the training derivatives is to override the
usual syntactic inductive bias of BACKPROPAGATION that favors a smooth
interpolation between points, replacing it with explicit input information about
required derivatives. The resulting hypothesis h shown in the rightmost plot of the
figure provides a much more accurate estimate of the true target function f.

Each transformation must be of the form sⱼ(α, x), where α is a continuous
parameter, sⱼ is differentiable, and sⱼ(0, x) = x (e.g., for rotation of zero degrees
the transformation is the identity function).

In the first plot, f(x) is the hypothesis and x₁, x₂, x₃ are the instances; these
instances are fitted to a proper hypothesis, as shown in the first plot. In the second
plot we can see the instances classified, and the machine learns to fit a proper
hypothesis by making the necessary modifications using the training derivatives.

For each such transformation sⱼ(α, x), TANGENTPROP considers the squared
error between the specified training derivative and the actual derivative of the
learned neural network (written f̂ below). The modified error function is:

E = Σᵢ [ (f(xᵢ) − f̂(xᵢ))² + μ Σⱼ ( ∂f̂(sⱼ(α, xᵢ))/∂α − ∂f(sⱼ(α, xᵢ))/∂α )²|α=0 ]

where μ is a constant provided by the user to determine the relative importance of
fitting training values versus fitting training derivatives.

Notice the first term in this definition of E is the original squared error of the
network versus training values, and the second term is the squared error in the
network versus training derivatives.

In the third plot we can see that the instances are classified properly while
maintaining accuracy.

Remarks

To summarize, TANGENTPROP uses prior knowledge in the form of
desired derivatives of the target function with respect to transformations of its
inputs.

It combines this prior knowledge with observed training data by minimizing an
objective function that measures both the network's error with respect to the
training example values (fitting the data) and its error with respect to the desired
derivatives (fitting the prior knowledge).
