Lecture Notes 3 & 4

This document discusses key concepts in machine learning, including gradient descent, multi-layer perceptrons (MLP), backpropagation, batch normalization, and regularization techniques. Gradient descent is an optimization algorithm that minimizes the cost function by adjusting parameters in the negative gradient direction. Additionally, it covers the importance of momentum in optimization and the differences between L1 and L2 regularization methods for addressing overfitting.


Lecture No. 12

Today's Agenda:
 Detail discussion on gradient descent in Machine Learning.

Gradient Descent is an optimization algorithm used in machine learning to minimize the cost function by iteratively adjusting parameters in the direction of the negative gradient, aiming to find the optimal set of parameters.

The cost function represents the discrepancy between the predicted output of the model and the actual output. The goal of gradient descent is to find the set of parameters that minimizes this discrepancy and improves the model’s performance.

The algorithm operates by calculating the gradient of the cost function, which indicates the direction and magnitude of steepest ascent. However, since the objective is to minimize the cost function, gradient descent moves in the opposite direction of the gradient, known as the negative gradient direction.

By iteratively updating the model’s parameters in the negative gradient direction, gradient descent gradually converges towards the optimal set of parameters that yields the lowest cost.

The learning rate, a hyperparameter, determines the step size taken in each iteration, influencing the speed and stability of convergence. Gradient descent can be applied to various machine learning algorithms, including linear regression, logistic regression, neural networks, and support vector machines. It provides a general framework for optimizing models by iteratively refining their parameters based on the cost function.

Example of Gradient Descent:

Let’s say you are playing a game where the players are at the top of a mountain, and they are asked to reach the lowest point of the mountain. Additionally, they are blindfolded.

Take a moment to think about this before you read on.

The best way is to observe the ground and find where the land descends. From that position, take a step in the descending direction and iterate this process until we reach the lowest point.

Finding the lowest point in a hilly landscape. (Source: Fisseha Berhane)

Gradient descent is an iterative optimization algorithm for finding the local minimum of a function. To find the local minimum of a function using gradient descent, we must take steps proportional to the negative of the gradient (move away from the gradient) of the function at the current point. If we take steps proportional to the positive of the gradient (moving towards the gradient), we will approach a local maximum of the function, and the procedure is called Gradient Ascent.

Gradient descent was originally proposed by Cauchy in 1847. It is also known as steepest descent.

The goal of the gradient descent algorithm is to minimize the given function (say, a cost function). To achieve this goal, it performs two steps iteratively:

1. Compute the gradient (slope), the first-order derivative of the function at the current point.

2. Make a step (move) in the direction opposite to the gradient, i.e. in the direction in which the slope decreases, from the current point by alpha times the gradient at that point.

Alpha is called the learning rate, a tuning parameter in the optimization process. It decides the length of the steps. A minimal sketch of this update rule is shown below.
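The sketch below applies these two steps to the simple function f(x) = x² (the function, starting point, and learning rate are illustrative choices, not from the lecture):

def gradient(x):
   return 2 * x                      # derivative of f(x) = x**2

x = 5.0                              # starting point
alpha = 0.1                          # learning rate (step length)
for step in range(100):
   x = x - alpha * gradient(x)       # step in the direction opposite to the gradient
print(x)                             # converges towards 0, the minimum of f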

Summary
Gradient descent is one of the most fundamental and widely used optimization algorithms in
machine learning. It's all about finding the best set of parameters for your model by continuously
minimizing a loss function. Essentially, it's like rolling a ball down a hill where the bottom of the
hill represents the minimum of the loss function and the ball represents your model's parameters.
Here's a breakdown of the key concepts:
1. Cost Function (Loss Function): This function measures how "bad" your model's predictions are.
We want to minimize this function to train a better model.
2. Gradient: This is the direction and steepness of the slope of the cost function at the current
position of your model's parameters. It tells you how much and in which direction to adjust your
parameters to minimize the cost function.
3. Parameter Update: Based on the gradient, you adjust your model's parameters by taking a small step in the direction of the negative gradient (steepest descent). The size of this step is controlled by the learning rate.
4. Iteration: You repeat steps 2 and 3 iteratively until the cost function reaches a minimum (or
close enough!), and your model has learned the best parameters.
Lecture No. 13

Today's Agenda:
 Detail discussion on MLP in ANN

 Detail discussion on Backpropagation

Multi-layer ANN
A fully connected multi-layer neural network is called a Multilayer Perceptron (MLP).

It has three layers, including one hidden layer. If it has more than one hidden layer, it is called a deep ANN. An MLP is a typical example of a feedforward artificial neural network. The ith activation unit in the lth layer is denoted as ai(l). The number of layers and the number of neurons are referred to as hyperparameters of a neural network, and these need tuning; cross-validation techniques must be used to find suitable values for them. The weight adjustment training is done via backpropagation. Deeper neural networks can represent more complex functions of the data. However, deeper networks can suffer from the vanishing gradient problem, and special algorithms are required to address this issue.

Notations

In the representation below:


 ai(in) refers to the ith value in the input layer

 ai(h) refers to the ith unit in the hidden layer

 ai(out) refers to the ith unit in the output layer

 a0(in) is simply the bias unit and is equal to 1; it will have the corresponding weight w0

 The weight coefficient from layer l to layer l+1 is represented by wk,j(l)

A simplified view of the multilayer network is a fully connected three-layer neural network with 3 input neurons and 3 output neurons, with a bias term added to the input vector.

MLP Learning Procedure

The MLP learning procedure is as follows:

 Starting with the input layer, propagate data forward to the output layer. This step is the forward
propagation.

 Based on the output, calculate the error (the difference between the predicted and known outcome).
The error needs to be minimized.

 Backpropagate the error. Find its derivative with respect to each weight in the network, and update
the model.

Repeat the three steps given above over multiple epochs to learn ideal weights. Finally, the output is
taken via a threshold function to obtain the predicted class labels.

Forward Propagation in MLP

In the first step, calculate the activations a(h) of the hidden layer from the input activations and the weights connecting the input layer to the hidden layer. A minimal sketch of this step is shown below.
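A minimal NumPy sketch of this forward step, assuming a sigmoid activation and illustrative layer sizes (4 inputs, 3 hidden units):

import numpy as np

def sigmoid(z):
   return 1.0 / (1.0 + np.exp(-z))

a_in = np.array([1.0, 0.5, -0.2, 0.7])   # input activations (4 features)
W_h = np.random.randn(4, 3)              # input-to-hidden weights
b_h = np.zeros(3)                        # hidden-layer bias

z_h = a_in @ W_h + b_h                   # net input of the hidden layer
a_h = sigmoid(z_h)                       # activation of the hidden layer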

Backpropagation

Backpropagation is the crucial learning algorithm that powers artificial neural networks (ANNs),
enabling them to learn complex relationships and improve their predictions over time. It's like the
engine in a car, propelling the ANN towards optimal performance. Here's why it's so important:

1. Training ANNs: Without backpropagation, ANNs would be static, unable to adjust their internal
parameters based on new information. Backpropagation provides the mechanism for fine-tuning the
weights and biases in the network, allowing it to learn from its mistakes and improve its predictions on
future data.
2. Minimizing Loss: Backpropagation calculates the gradient of the loss function with respect to each
weight and bias in the network. This gradient tells us how much each parameter contributes to the
overall error in the network's predictions.

3. Updating Weights and Biases: Based on the calculated gradient, backpropagation updates the weights and biases in the network iteratively, moving them in the direction that minimizes the loss function. This process is akin to sculpting the network, gradually shaping its internal structure to better represent the underlying data patterns. (A small numerical sketch of this update is given after this list.)

4. Enabling Non-Linearity: Backpropagation allows ANNs to learn non-linear relationships between features and target variables. This is crucial for tackling real-world problems where simple linear models often fall short. The ability to learn complex relationships makes ANNs versatile and powerful tools for diverse applications.

5. Generalization: Backpropagation helps ANNs generalize from the training data to unseen examples.
By minimizing the loss function on the training data, the network learns to capture the underlying
patterns and relationships, enabling it to make accurate predictions on new data points it hasn't
encountered before.
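To make points 2 and 3 concrete, here is a compact NumPy sketch of one backpropagation update for a single-hidden-layer network with sigmoid activations and squared-error loss (the shapes, data, and variable names are illustrative assumptions, not the lecture's notation):

import numpy as np

def sigmoid(z):
   return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.0, 0.3])               # one training example (3 features)
y = np.array([1.0])                          # target value
W1, W2 = np.random.randn(3, 4), np.random.randn(4, 1)
lr = 0.1                                     # learning rate

h = sigmoid(x @ W1)                          # forward pass: hidden activations
y_hat = sigmoid(h @ W2)                      # forward pass: prediction

delta2 = (y_hat - y) * y_hat * (1 - y_hat)   # error signal at the output layer
delta1 = (delta2 @ W2.T) * h * (1 - h)       # error propagated back to the hidden layer
W2 -= lr * np.outer(h, delta2)               # update output weights
W1 -= lr * np.outer(x, delta1)               # update hidden weights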

Summary
In conclusion, backpropagation is the vital cog in the ANN learning machine. It is what transforms these models from static structures into powerful learning machines, capable of solving complex problems and shaping the future of AI.
Lecture No. 14

Today's Agenda:
 Detail discussion on testing for gradient instability

 Detail discussion on batch normalization

Gradient instability is a major hurdle in machine learning, impacting training speed, convergence,
and overall model performance. Fortunately, various testing and mitigation strategies can be
employed to address this issue. Let's delve deeper into this topic:

Detecting Gradient Instability:

 Monitor Gradient Magnitudes: Track the absolute values of gradients throughout training. Excessively high or low values indicate potential instability.
 Visualize Gradient Distributions: Utilize tools like TensorBoard to visualize the distribution of
gradients over layers and epochs. Look for sharp peaks or highly skewed distributions.
 Analyze Loss Function Landscape: Sudden jumps or plateaus in the loss function could signify
instability.
 Monitor Training Metrics: Track convergence metrics like training accuracy and loss. Stagnation
or erratic fluctuations might indicate gradient issues.

Understanding the Roots of Instability:

 Vanishing Gradients: Gradients decrease exponentially as they are propagated back through deep networks, eventually shrinking to insignificant values and hindering learning in the earlier layers.
 Exploding Gradients: Gradients accumulate and magnify throughout backpropagation, leading to
uncontrolled updates and model divergence.
 Internal Covariate Shift: Activations within a layer can shift distributions across training
batches, destabilizing gradients.

Testing and Mitigation Strategies:

 Batch Normalization (BN): A popular technique that normalizes activations within each mini-
batch, addressing internal covariate shift and stabilizing gradients.
 Weight Initialization: Choosing appropriate initialization schemes like Xavier or He initialization
can influence gradient flow and mitigate instability.
 Clipping Gradients: Enforce hard or soft clipping thresholds on gradient magnitudes to prevent them from exploding (a short sketch follows this list).
 Adaptive Learning Rates: Employ algorithms like Adam or RMSprop that dynamically adjust
learning rates based on gradients, providing more stability.
 Regularization Techniques: Techniques like L1 or L2 regularization can prevent model
overfitting, which can contribute to instability.
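As an illustration of the gradient clipping strategy above, here is a framework-agnostic sketch (the threshold of 1.0 is an arbitrary choice):

import numpy as np

def clip_by_global_norm(grads, max_norm = 1.0):
   # Rescale a list of gradient arrays so that their combined (global) norm
   # does not exceed max_norm, preventing exploding updates.
   global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
   if global_norm > max_norm:
      grads = [g * (max_norm / global_norm) for g in grads]
   return grads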

Advanced Testing Techniques:

 Hessian Analysis: Analyzes the curvature of the loss function around the current
parameters, providing insights into gradient stability and potential second-order issues.
 Spectral Normalization: Normalizes weights based on their singular values, addressing exploding
gradients in certain network architectures.
 Gradient Checkpointing: Trades extra computation for memory by recomputing intermediate activations during the backward pass instead of storing them all, which makes training very deep networks more feasible.

Choosing the Right Approach:

 Consider the specific model architecture, problem domain, and dataset characteristics.
 Experiment with different combinations of testing and mitigation strategies.
 Monitor performance metrics and choose the approach that demonstrably improves training stability
and optimizes model performance.

BN normalizes the activations (outputs) of a layer within each mini-batch during training. This essentially standardizes the distribution of activations around a mean of 0 and a standard deviation of 1. It works in three steps:

1. Calculate Mean and Variance: Within each mini-batch, the mean and variance of the activations across all training samples are calculated for each feature.

2. Normalize Activations: Each activation is then shifted by the mean and scaled by the standard deviation, effectively centering and rescaling its distribution.

3. Scale and Shift: Two learnable parameters, gamma and beta, are applied to the normalized activations to allow the model to recover any representational power lost during normalization.
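A minimal NumPy sketch of these three steps for one mini-batch (illustrative only; eps is a small constant added for numerical stability):

import numpy as np

def batch_norm(x, gamma, beta, eps = 1e-5):
   # x: activations of shape (batch_size, num_features)
   mean = x.mean(axis = 0)                   # step 1: per-feature mean
   var = x.var(axis = 0)                     # step 1: per-feature variance
   x_hat = (x - mean) / np.sqrt(var + eps)   # step 2: normalize
   return gamma * x_hat + beta               # step 3: scale and shift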
Benefits of Batch Normalization:
 Reduced Internal Covariate Shift: BN stabilizes the distribution of activations across mini-
batches, preventing the covariate shift that can occur during training and hinder learning.
 Faster Convergence: Normalized activations lead to smoother gradient flow, allowing the model
to converge to a minimum of the loss function more quickly.
 Improved Generalization: BN can act as a regularizer, reducing overfitting and improving the
model's ability to perform well on unseen data.
 Higher Learning Rates: The stable gradients due to BN allow for using larger learning rates
without risking divergence or instability.

Challenges and Considerations:


 Increased Complexity: BN adds additional learnable parameters and computational overhead
to the model.
 Hyperparameter Tuning: Tuning the learnable scale and shift parameters (gamma and beta)
can be crucial for optimal performance.
 Not a cure-all: While effective in many cases, BN might not solve all instability issues, and
alternative techniques like layer normalization or weight normalization might be needed in
specific cases.

Applications of Batch Normalization:


 Deep Neural Networks: BN is particularly beneficial in deep networks where vanishing or
exploding gradients can be a major problem.
 Computer Vision: BN is widely used in convolutional neural networks for image recognition and
classification tasks.
 Natural Language Processing: BN can improve the performance of recurrent neural networks
used for language modeling and machine translation.

Summary
In conclusion, batch normalization is a valuable tool in the machine learning toolbox, offering
significant improvements in training speed, stability, and model performance. By understanding
its workings, benefits, and considerations, you can leverage its strengths to get the most out of
your machine learning projects.
Lecture No. 15

Today's Agenda:
 Detail discussion on L1 and L2 regularization

When you have a large number of features in your data set, you may wish to create a less complex, more
parsimonious model. Two widely used regularization techniques used to address overfitting and feature
selection are L1 and L2 regularization.

L1 VS. L2 REGULARIZATION METHODS

 L1 Regularization, also called a lasso regression, adds the “absolute value of magnitude” of the
coefficient as a penalty term to the loss function.
 L2 Regularization, also called a ridge regression, adds the “squared magnitude” of the
coefficient as the penalty term to the loss function.

L1 Regularization: Lasso Regression


Lasso is an acronym for least absolute shrinkage and selection operator, and lasso regression adds the “absolute value of magnitude” of the coefficient as a penalty term to the loss function, as shown below.
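A standard formulation of the lasso objective (where wj are the coefficients and lambda controls the strength of the penalty) is:

Loss = Σi (yi − ŷi)² + λ Σj |wj|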

Here, if lambda is zero, then we get back OLS (ordinary least squares), whereas a very large value will make the coefficients zero, which means the model will underfit.

L2 Regularization: Ridge Regression


Ridge regression adds the “squared magnitude” of the coefficient as the penalty term to the loss function. The second term in the formula below is the L2 regularization element.
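A standard formulation of the ridge objective is:

Loss = Σi (yi − ŷi)² + λ Σj wj²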

Here too, if lambda is zero then we get back OLS. However, if lambda is very large then it will penalize the coefficients too heavily and lead to underfitting. Having said that, how we choose lambda is important. This technique works very well to avoid overfitting.
Summary

The key difference between these techniques is that lasso shrinks the coefficients of the less important features to zero, thus removing some features altogether. In other words, L1 regularization works well for feature selection when we have a huge number of features.
Lecture No. 16

Today's Agenda:
 Detail discussion on Momentum,

 Detail discussion on hyperparameter tuning

Momentum is a powerful optimization technique in machine learning, often used alongside gradient
descent, that helps your model navigate the training landscape more efficiently. Imagine rolling a ball
down a hill – momentum accelerates the ball, helping it overcome bumps and reach the bottom
(minimum of the loss function) faster. Here's a deeper look at how it works:

The Problem:
Gradient descent typically takes small steps in the direction of the steepest descent (negative
gradient) to minimize the loss function. While this works, it can be slow, especially when the
landscape is bumpy or has shallow regions.

The Solution: Momentum:


Momentum acts like a rolling ball's inertia. It "remembers" the past gradients and adds them to
the current gradient, creating a larger update vector. This pushes the ball downhill with extra
force, helping it overcome shallow areas and reach the minimum faster.
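In equation form (one common formulation; μ is the momentum coefficient, α the learning rate, and ∇L(θ) the gradient of the loss), the update is:

v ← μ·v − α·∇L(θ)
θ ← θ + v

Each step therefore blends the current gradient with the velocity accumulated over previous steps, which is what smooths and accelerates the descent.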

Benefits of Momentum:
 Faster Convergence: By accumulating past gradients, momentum can significantly accelerate the
learning process.
 Reduced Oscillations: Momentum smooths out the descent path, preventing the ball from
bouncing back and forth in shallow regions.
 Improved Performance: Faster convergence and smoother trajectories often lead to better final
model performance.
Key Parameters:
 Momentum coefficient (μ): This controls the amount of past gradients considered. Higher values
increase momentum, but too much can cause overshooting.
 Initial velocity: This sets the starting direction of the ball's movement.
Challenges and Considerations:
 Choosing the right momentum coefficient: Finding the optimal value depends on the problem
and dataset. Experimentation is key.
 Can overshoot the minimum: High momentum can cause the ball to zoom past the
minimum, requiring careful tuning.
 Not a guaranteed solution: While effective in many cases, momentum may not always be the
best approach.
Variations of Momentum:

 Nesterov momentum: Looks ahead slightly before taking the update, leading to improved stability
and accuracy.
 AdaGrad, RMSProp, Adam: Adaptive momentum-based algorithms adjust the learning rate for
different parameters, often leading to better performance.

Summary
Momentum is a powerful tool that can significantly improve the efficiency and performance of your machine learning models. By understanding its benefits, challenges, and variations, you can leverage its capabilities to move your models down the learning hill faster and achieve optimal results.
NOTES

UNIT –III

Convolutional neural network, flattening, subsampling, padding, stride, convolution layer, pooling layer, loss layer, dense layer, 1x1 convolution, inception network, input channels, transfer learning, one-shot learning, dimension reduction, implementation of CNN with frameworks like TensorFlow, Keras, etc.

Course Outcome: BOCS-605(B).3 - Determine and analyse common Machine Learning algorithms in practice and implement their own.
Bloom’s Level: BL4
Weightage: 20 %
Lecture No. 18

Today's Agenda:
 Detail discussion on Convolutional neural networks,

 Detail discussion on types of layers

Convolutional neural networks are distinguished from other neural networks by their superior
performance with image, speech, or audio signal inputs. They have three main types of layers, which
are:

 Convolutional layer
 Pooling layer
 Fully-connected (FC) layer

The convolutional layer is the first layer of a convolutional network. While convolutional layers can be
followed by additional convolutional layers or pooling layers, the fully-connected layer is the final
layer. With each layer, the CNN increases in its complexity, identifying greater portions of the image.
Earlier layers focus on simple features, such as colors and edges. As the image data progresses through
the layers of the CNN, it starts to recognize larger elements or shapes of the object until it finally
identifies the intended object.
Convolutional layer

The convolutional layer is the core building block of a CNN, and it is where the majority of computation
occurs. It requires a few components, which are input data, a filter, and a feature map. Let’s assume that
the input will be a color image, which is made up of a matrix of pixels in 3D. This means that the input
will have three dimensions—a height, width, and depth—which correspond to RGB in an image. We
also have a feature detector, also known as a kernel or a filter, which will move across the receptive
fields of the image, checking if the feature is present. This process is known as a convolution.

The feature detector is a two-dimensional (2-D) array of weights, which represents part of the image.
While they can vary in size, the filter size is typically a 3x3 matrix; this also determines the size of the
receptive field. The filter is then applied to an area of the image, and a dot product is calculated between
the input pixels and the filter. This dot product is then fed into an output array. Afterwards, the filter
shifts by a stride, repeating the process until the kernel has swept across the entire image. The final
output from the series of dot products from the input and the filter is known as a feature map, activation
map, or a convolved feature.

Note that the weights in the feature detector remain fixed as it moves across the image, which is also
known as parameter sharing. Some parameters, like the weight values, adjust during training through the
process of backpropagation and gradient descent. However, there are three hyperparameters which
affect the volume size of the output that need to be set before the training of the neural network begins.
These include:

1. The number of filters affects the depth of the output. For example, three distinct filters
would yield three different feature maps, creating a depth of three.
Lecture No. 20

Today's Agenda:
 Detail discussion on padding

 Detail discussion on strides

2. Stride is the distance, or number of pixels, that the kernel moves over the input matrix. While stride values of two or greater are rare, a larger stride yields a smaller output.

3. Zero-padding is usually used when the filters do not fit the input image. This sets all elements that
fall outside of the input matrix to zero, producing a larger or equally sized output. There are three types
of padding:

 Valid padding: This is also known as no padding. In this case, the last convolution is dropped if
dimensions do not align.
 Same padding: This padding ensures that the output layer has the same size as the input layer
 Full padding: This type of padding increases the size of the output by adding zeros to the border
of the input.
After each convolution operation, a CNN applies a Rectified Linear Unit (ReLU) transformation to the
feature map, introducing nonlinearity to the model.
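Putting stride and padding together, the spatial size of the output feature map can be computed with the standard formula (W = input size, K = filter size, P = padding, S = stride):

Output size = (W − K + 2P) / S + 1

For example, a 28 x 28 input with a 3 x 3 filter, padding of 1, and stride of 1 gives (28 − 3 + 2)/1 + 1 = 28, so "same" padding preserves the spatial size; with no padding and stride 2, the output shrinks to (28 − 3)/2 + 1 = 13 (taking the floor).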

Additional convolutional layers can follow the initial convolution layer. When this happens, the structure
of the CNN can become hierarchical as the later layers can see the pixels within the receptive fields of
prior layers. As an example, let’s assume that we’re trying to determine if an image contains a bicycle.
You can think of the bicycle as a sum of parts. It is comprised of a frame, handlebars, wheels, pedals, et
cetera. Each individual part of the bicycle makes up a lower-level pattern in the neural net, and the
combination of its parts represents a higher-level pattern, creating a feature hierarchy within the
CNN. Ultimately, the convolutional layer converts the image into numerical values, allowing the neural
network to interpret and extract relevant patterns.

Pooling layer

Pooling layers, also known as downsampling, conduct dimensionality reduction, reducing the number
of parameters in the input. Similar to the convolutional layer, the pooling operation sweeps a filter
across the entire input, but the difference is that this filter does not have any weights. Instead, the kernel
applies an aggregation function to the values within the receptive field, populating the output array.
There are two main types of pooling:

 Max pooling: As the filter moves across the input, it selects the pixel with the maximum value
to send to the output array. As an aside, this approach tends to be used more often compared to
average pooling.
 Average pooling: As the filter moves across the input, it calculates the average value within the
receptive field to send to the output array.
While a lot of information is lost in the pooling layer, it also has a number of benefits to the CNN. They
help to reduce complexity, improve efficiency, and limit risk of overfitting.
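A small NumPy sketch of 2 x 2 max pooling with stride 2 on a 4 x 4 feature map (the values are illustrative):

import numpy as np

def max_pool_2x2(x):
   # x: feature map of shape (H, W) with even H and W
   H, W = x.shape
   return x.reshape(H // 2, 2, W // 2, 2).max(axis = (1, 3))

fmap = np.array([[1, 3, 2, 1],
                 [4, 6, 5, 0],
                 [7, 2, 9, 8],
                 [1, 0, 3, 4]])
print(max_pool_2x2(fmap))   # [[6 5] [7 9]] - each 2 x 2 block reduced to its maximum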
Fully-connected layer

The name of the fully-connected layer aptly describes itself. As mentioned earlier, the pixel values of the
input image are not directly connected to the output layer in partially connected layers. However, in the
fully-connected layer, each node in the output layer connects directly to a node in the previous layer.

This layer performs the task of classification based on the features extracted through the previous layers
and their different filters. While convolutional and pooling layers tend to use ReLU functions, FC layers
usually leverage a softmax activation function to classify inputs appropriately, producing a probability
from 0 to 1.
Lecture No. 22

Today's Agenda:
 Detail discussion on Transfer learning,

 Detail discussion on working of Transfer Learning

Transfer learning is a technique in machine learning where a model trained on one task is used as the
starting point for a model on a second task. This can be useful when the second task is similar to the
first task, or when there is limited data available for the second task. By using the learned features
from the first task as a starting point, the model can learn more quickly and effectively on the second
task. This can also help to prevent overfitting, as the model will have already learned general features
that are likely to be useful in the second task.

Importance of Transfer Learning

Many deep neural networks trained on images have a curious phenomenon in common: in the early layers of the network, the model learns low-level features such as edges, colours, and variations of intensity. These features appear not to be specific to a particular dataset or task; no matter what type of image we are processing, whether to detect a lion or a car, the same low-level features must be detected. They occur regardless of the exact cost function or image dataset. Thus, the features learned in one task, such as detecting lions, can be reused in other tasks, such as detecting humans.
Working of Transfer Learning

 Pre-trained Model: Start with a model that has previously been trained for a certain task using a large set of data. Frequently trained on extensive datasets, this model has identified general features and patterns relevant to numerous related tasks.
 Base Model: The model that has been pre-trained is known as the base model. It is made up of
layers that have utilized the incoming data to learn hierarchical feature representations.
 Transfer Layers: In the pre-trained model, find a set of layers that capture generic information relevant to the new task as well as the previous one. Because they tend to learn low-level information, these layers are frequently the early layers of the network.
 Fine-tuning: Retrain the chosen layers using the dataset from the new task; this procedure is called fine-tuning. The goal is to preserve the knowledge from the pre-training while enabling the model to adjust its parameters to better suit the demands of the new task.
The block diagram is shown below.
(Figure: Transfer Learning block diagram.)
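As a hedged illustration of the procedure above, here is a minimal transfer learning sketch using the Keras API (the choice of base network, input size, and the 10-class output head are assumptions for illustration, not part of the lecture):

import tensorflow as tf

# Reuse a pre-trained base model and train a new task-specific head on top of it.
base_model = tf.keras.applications.MobileNetV2(
   input_shape = (224, 224, 3), include_top = False, weights = 'imagenet')
base_model.trainable = False          # freeze the pre-trained (transfer) layers

model = tf.keras.Sequential([
   base_model,
   tf.keras.layers.GlobalAveragePooling2D(),
   tf.keras.layers.Dense(10, activation = 'softmax')   # new head for the new task
])
model.compile(optimizer = 'adam', loss = 'categorical_crossentropy', metrics = ['accuracy'])
# After training the new head, selected base layers can be unfrozen and fine-tuned
# with a small learning rate.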

Inception Network

The origin of the name ‘Inception Network’ is very interesting since it comes from the famous movie Inception, directed by Christopher Nolan. The movie concerns the idea of dreams embedded into other dreams, an idea that also became a famous internet meme.

Main Blocks

To gain a better understanding of Inception Networks, let’s dive in and explore its individual components one by one.

1 x 1 Convolution
The goal of a 1 x 1 convolution is to reduce the dimensions of the input data by channel-wise pooling. In this way, the depth of the network can increase without running the risk of overfitting. In a 1 x 1 convolution layer (also referred to as the bottleneck layer), we compute the convolution between each pixel of the image and the filter in the channel dimension. As a result, the output will have the same height and width as the input, but the number of output channels will change.

As an example of a 1 x 1 convolution, suppose the input dimension is (64, 64, 192); with a single filter the output dimension is (64, 64, 1), and in general the dimension of the output feature map is (64, 64, number of filters).

We can also observe that 1 x 1 convolutions help the network to learn depth features that span across the channels of the image. A short code sketch is given below.
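As a small illustration (the filter count of 32 is a hypothetical choice; Keras API), a 1 x 1 convolution reducing 192 channels to 32 while preserving height and width:

import tensorflow as tf

inputs = tf.keras.Input(shape = (64, 64, 192))
outputs = tf.keras.layers.Conv2D(filters = 32, kernel_size = 1, activation = 'relu')(inputs)
# outputs has shape (None, 64, 64, 32): same height and width, fewer channels.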
3 x 3 and 5 x 5 Convolutions

The goal of 3 X 3 and 5 X 5 convolutions is to learn spatial features at different scales.


Specifically, by leveraging convolutional filters of different sizes, the network learns spatial patterns at many scales, as the human eye does. As we can easily understand, the 3 x 3 convolution learns features at a small scale while the 5 x 5 convolution learns features at a larger scale.

Overall Architecture

Overall, every inception architecture consists of the above inception blocks that we mentioned,
along with a max-pooling layer that is present in every neural network and a concatenation layer
that joins the features extracted by the inception blocks.
Now, we’ll describe two Inception architectures starting from a naive one and moving on to the original
one, which is an improved version of the first.

Naive Inception

A naive implementation is to apply every inception block in the input separately and then
concatenate the output features in the channel dimension. To do this, we need to ensure that the
extracted features have the same width and height dimensions. So, we apply the same padding in every
convolution.
As an example, if the input has dimensions (28, 28, 192), the output has dimensions (28, 28, channels), where the number of channels is equal to the sum of the channels extracted from each parallel block.

Optimized Inception

The drawback of the naive implementation is the computational cost due to a large number of
parameters in the convolutional layers.
The proposed solution is to take advantage of 1 x 1 convolutions, which reduce the input dimensions and result in a much lower computational cost.

In the improved inception architecture, before each 3 x 3 and 5 x 5 convolution we add a 1 x 1 convolution to reduce the size of the features.
The number of input features, variables, or columns present in a given dataset is known as
dimensionality, and the process to reduce these features is called dimensionality reduction.

A dataset may contain a huge number of input features, which makes the predictive modeling task more complicated. Because it is very difficult to visualize or make predictions for a training dataset with a high number of features, dimensionality reduction techniques must be used in such cases.
Lecture No. 23
Today's Agenda:
 Detail discussion on Dimensionality reduction technique,

 Detail discussion on types of dimensionality reduction

Dimensionality reduction technique can be defined as, "It is a way of converting the higher
dimensions dataset into lesser dimensions dataset ensuring that it provides similar
information." These techniques are widely used in machine learning for obtaining a better fit predictive
model while solving the classification and regression problems.

It is commonly used in the fields that deal with high-dimensional data, such as speech recognition,
signal processing, bioinformatics, etc. It can also be used for data visualization, noise reduction,
cluster analysis, etc.

The Curse of Dimensionality

Handling high-dimensional data is very difficult in practice; this is commonly known as the curse of dimensionality. As the dimensionality of the input dataset increases, any machine learning algorithm and model becomes more complex. As the number of features increases, the number of samples needed to cover the feature space grows rapidly, and the chance of overfitting also increases. If a machine learning model is trained on high-dimensional data, it tends to become overfitted and to perform poorly.

Hence, it is often required to reduce the number of features, which can be done with dimensionality reduction.
Benefits of applying Dimensionality Reduction

Some benefits of applying dimensionality reduction techniques to a given dataset are:
 By reducing the dimensions of the features, the space required to store the dataset is also reduced.
 Less computation and training time is required for reduced dimensions of features.
 Reduced dimensions of features help in visualizing the data quickly.
 It removes redundant features (if present) by taking care of multicollinearity.

Disadvantages of dimensionality Reduction

There are also some disadvantages of applying the dimensionality reduction, which are given below:

 Some data may be lost due to dimensionality reduction.


 In the PCA dimensionality reduction technique, the number of principal components to retain is sometimes not known in advance.

Approaches of Dimension Reduction

There are two ways to apply the dimension reduction technique, which are given below:

Feature Selection

Feature selection is the process of selecting the subset of the relevant features and leaving out the irrelevant
features present in a dataset to build a model of high accuracy. In other words, it is a way of selecting the optimal
features from the input dataset.

Three methods are used for the feature selection:

1. Filters Methods

In this method, the dataset is filtered, and a subset that contains only the relevant features is taken. Some common
techniques of filters method are:

 Correlation
 Chi-Square Test
 ANOVA
 Information Gain, etc.

2. Wrappers Methods

The wrapper method has the same goal as the filter method, but it uses a machine learning model for its evaluation. In this method, some features are fed to the ML model and its performance is evaluated. The performance decides whether to add or remove those features to increase the accuracy of the model. This method is more accurate than the filter method but more computationally expensive. Some common techniques of wrapper methods are:

 Forward Selection
 Backward Selection
 Bi-directional Elimination
3. Embedded Methods: Embedded methods check the different training iterations of the machine learning model
and evaluate the importance of each feature. Some common techniques of Embedded methods are:

 LASSO
 Elastic Net
 Ridge Regression, etc.

Feature Extraction:

Feature extraction is the process of transforming the space containing many dimensions into space with fewer
dimensions. This approach is useful when we want to keep the whole information but use fewer resources while
processing the information.

Some common feature extraction techniques are:

 Principal Component Analysis  Score comparison


 Linear Discriminant Analysis  Missing Value Ratio
 Kernel PCA  Low Variance Filter
 Quadratic Discriminant Analysis  High Correlation Filter
 Common techniques of Dimensionality  Random Forest
Reduction  Factor Analysis
 Principal Component Analysis  Auto-Encoder
 Backward Elimination
 Forward Selection
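As a brief illustration of feature extraction with Principal Component Analysis (a minimal sketch using scikit-learn; the random data and the choice of 10 components are assumptions for illustration):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(200, 50)                  # 200 samples, 50 features (dummy data)
pca = PCA(n_components = 10)                 # keep 10 principal components
X_reduced = pca.fit_transform(X)             # reduced data of shape (200, 10)
print(pca.explained_variance_ratio_.sum())   # fraction of variance retained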
Lecture No. 24,25

Today's Agenda:
 Detail discussion on TensorFlow Implementation of CNN,

TensorFlow Implementation of CNN

In this section, we will learn about the TensorFlow implementation of a CNN. The steps required to execute and properly dimension the entire network are shown below. (This example uses the TensorFlow 1.x API.)

Step 1 − Include the necessary modules for TensorFlow and the data set modules, which are needed to
compute the CNN model.
import tensorflow as tf
import numpy as np
from tensorflow.examples.tutorials.mnist import input_data
Step 2 − Declare a function called run_cnn(), which includes various parameters and optimization
variables with declaration of data placeholders. These optimization variables will declare the training
pattern.
def run_cnn():
   mnist = input_data.read_data_sets("MNIST_data/", one_hot = True)
   learning_rate = 0.0001
   epochs = 10
   batch_size = 50
Step 3 − In this step, we will declare the training data placeholders with input parameters - for 28 x 28 pixels = 784. This is the flattened image data that is drawn from mnist.train.next_batch().

We can reshape the tensor according to our requirements. The first value (-1) tells function to
dynamically shape that dimension based on the amount of data passed to it. The two middle dimensions
are set to the image size (i.e. 28 x 28).

x = tf.placeholder(tf.float32, [None, 784])


x_shaped = tf.reshape(x, [-1, 28, 28, 1])
y = tf.placeholder(tf.float32, [None, 10])
Step 4 − Now it is important to create some convolutional layers −
layer1 = create_new_conv_layer(x_shaped, 1, 32, [5, 5], [2, 2], name = 'layer1')
layer2 = create_new_conv_layer(layer1, 32, 64, [5, 5], [2, 2], name = 'layer2')
Step 5 − Let us flatten the output ready for the fully connected output stage. After two layers of stride-2 pooling, the 28 x 28 spatial dimensions are reduced first to 14 x 14 and then to 7 x 7, but with 64 output channels. To create the fully connected "dense" layer, the new shape needs to be [-1, 7 x 7 x 64]. We can set up some weights and bias values for this layer, then activate with ReLU.
flattened = tf.reshape(layer2, [-1, 7 * 7 * 64])

wd1 = tf.Variable(tf.truncated_normal([7 * 7 * 64, 1000], stddev = 0.03), name = 'wd1')


bd1 = tf.Variable(tf.truncated_normal([1000], stddev = 0.01), name = 'bd1')
dense_layer1 = tf.matmul(flattened, wd1) + bd1
dense_layer1 = tf.nn.relu(dense_layer1)
Step 6 − Define the output layer with a softmax activation, the cross-entropy loss, the required optimizer and the accuracy assessment, and then set up the initialization operator.
wd2 = tf.Variable(tf.truncated_normal([1000, 10], stddev = 0.03), name = 'wd2')
bd2 = tf.Variable(tf.truncated_normal([10], stddev = 0.01), name = 'bd2')

dense_layer2 = tf.matmul(dense_layer1, wd2) + bd2


y_ = tf.nn.softmax(dense_layer2)

cross_entropy = tf.reduce_mean(
tf.nn.softmax_cross_entropy_with_logits(logits = dense_layer2, labels = y))

optimiser = tf.train.AdamOptimizer(learning_rate = learning_rate).minimize(cross_entropy)

correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))


accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

init_op = tf.global_variables_initializer()
Step 7 − We should set up the recording variables and run the training loop in a session. This adds a summary to store the accuracy of the data.
tf.summary.scalar('accuracy', accuracy)
merged = tf.summary.merge_all()
writer = tf.summary.FileWriter(r'E:\TensorFlowProject')

with tf.Session() as sess:
   sess.run(init_op)
   total_batch = int(len(mnist.train.labels) / batch_size)

   for epoch in range(epochs):
      avg_cost = 0
      for i in range(total_batch):
         batch_x, batch_y = mnist.train.next_batch(batch_size = batch_size)
         _, c = sess.run([optimiser, cross_entropy], feed_dict = {x: batch_x, y: batch_y})
         avg_cost += c / total_batch
      test_acc = sess.run(accuracy, feed_dict = {x: mnist.test.images, y: mnist.test.labels})
      summary = sess.run(merged, feed_dict = {x: mnist.test.images, y: mnist.test.labels})
      writer.add_summary(summary, epoch)
      print("Epoch:", (epoch + 1), "cost =", "{:.3f}".format(avg_cost),
         "test accuracy: {:.3f}".format(test_acc))

   print("\nTraining complete!")
   writer.add_graph(sess.graph)
   print(sess.run(accuracy, feed_dict = {x: mnist.test.images, y: mnist.test.labels}))

def create_new_conv_layer(
      input_data, num_input_channels, num_filters, filter_shape, pool_shape, name):
   conv_filt_shape = [
      filter_shape[0], filter_shape[1], num_input_channels, num_filters]
   weights = tf.Variable(
      tf.truncated_normal(conv_filt_shape, stddev = 0.03), name = name+'_W')
   bias = tf.Variable(tf.truncated_normal([num_filters]), name = name+'_b')

   # out_layer defines the output
   out_layer = tf.nn.conv2d(input_data, weights, [1, 1, 1, 1], padding = 'SAME')
   out_layer += bias
   out_layer = tf.nn.relu(out_layer)
   ksize = [1, pool_shape[0], pool_shape[1], 1]
   strides = [1, 2, 2, 1]
   out_layer = tf.nn.max_pool(
      out_layer, ksize = ksize, strides = strides, padding = 'SAME')

   return out_layer

if __name__ == "__main__":
   run_cnn()
Sample output: Epoch: 1 cost = 0.676 test accuracy: 0.940

NOTES

UNIT –IV
Recurrent neural network, Long short-term memory, gated recurrent unit, translation, beam search and beam width, BLEU score, attention model, Reinforcement Learning, RL framework, MDP, Bellman equations, Value Iteration and Policy Iteration, Actor-critic model, Q-learning, SARSA

Course Outcome Bloom’s Level Weightage
BOCS-605(B).5: Formulate different machine learning layers in recurrent neural network.

Lecture No. 27

Today's Agenda:
 Detail discussion on Recurrent neural networks (RNNs),

Recurrent neural networks (RNNs) are the state-of-the-art algorithm for sequential data and are used by Apple’s Siri and Google’s voice search. They are the first algorithms that remember their input, due to an internal memory, which makes them perfectly suited for machine learning problems that involve sequential data. They are among the algorithms behind the scenes of the amazing achievements seen in deep learning over the past few years. RNNs are a powerful and robust type of neural network and belong to the most promising algorithms in use because they have an internal memory.

Because of their internal memory, RNNs can remember important things about the input they received,
which allows them to be very precise in predicting what’s coming next. This is why they’re the
preferred algorithm for sequential data like time series, speech, text, financial data, audio, video,
weather and much more. Recurrent neural networks can form a much deeper understanding of a
sequence and its context compared to other algorithms

Recurrent neural networks (RNNs) are a class of neural network that are helpful in modelling sequence data. Derived from feedforward networks, RNNs exhibit behaviour similar to how human brains function.

Sequential data is basically just ordered data in which related things follow each other. Examples are
financial data or the DNA sequence. The most popular type of sequential data is perhaps time series
data, which is just a series of data points that are listed in time order.

RECURRENT VS. FEED-FORWARD NEURAL NETWORKS

RNNs and feed-forward neural networks get their names from the way they channel information.

In a feed-forward neural network, the information only moves in one direction — from the input layer,
through the hidden layers, to the output layer. The information moves straight through the network.
Feed-forward neural networks have no memory of the input they receive and are bad at predicting
what’s coming next. Because a feed-forward network only considers the current input, it has no notion
of order in time. It simply can’t remember anything about what happened in the past except its training.

In an RNN, the information cycles through a loop. When it makes a decision, it considers the current input and also what it has learned from the inputs it received previously, whereas a feed-forward network passes information straight through without such a loop. A simple form of this recurrence is sketched below.
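This loop can be written as a simple recurrence (one common formulation, with tanh as the activation):

h_t = tanh(W_x · x_t + W_h · h_(t−1) + b)

so each new hidden state mixes the current input x_t with the memory carried forward in the previous hidden state h_(t−1).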
Lecture No. 28

Today's Agenda:
 Detail discussion on TensorFlow Implementation of LSTM

LSTM

The central role of an LSTM model is held by a memory cell known as a ‘cell state’ that maintains its
state over time. The cell state is the horizontal line that runs through the top of the below diagram. It can
be visualized as a conveyor belt through which information just flows, unchanged.

Information can be added to or removed from the cell state in LSTM and is regulated by gates. These
gates optionally let the information flow in and out of the cell. It contains a pointwise multiplication
operation and a sigmoid neural net layer that assist the mechanism.
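The agenda mentions a TensorFlow implementation; as a minimal, hedged sketch using the Keras API (the input shape of 20 timesteps with 8 features, the 64 units, and the scalar regression output are illustrative assumptions):

import tensorflow as tf

model = tf.keras.Sequential([
   tf.keras.layers.LSTM(64, input_shape = (20, 8)),   # one LSTM layer with 64 memory cells
   tf.keras.layers.Dense(1)                           # e.g. a scalar regression output
])
model.compile(optimizer = 'adam', loss = 'mse')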

LSTM Applications

LSTM networks find useful applications in the following areas:


 Language modeling
 Machine translation
 Handwriting recognition
 Image captioning
 Image generation using attention models
 Question answering
 Video-to-text conversion
 Polyphonic music modeling
 Speech synthesis
 Protein secondary structure prediction
Lecture No. 29

Today's Agenda:

 Detail discussion on TensorFlow Implementation of GRU

Gated Recurrent Unit (GRU)


To find the hidden state Ht in a GRU, a two-step process is followed. The first step is to generate what is known as the candidate hidden state, as shown below.

Candidate Hidden State

It takes in the input and the hidden state from the previous timestamp t-1, which is multiplied by the reset gate output rt. This entire information is then passed to the tanh function, and the resultant value is the candidate hidden state.
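In equation form (one common formulation, consistent with the description here, where ⊙ denotes element-wise multiplication and H't is the candidate hidden state):

H't = tanh(Xt · Wx + (rt ⊙ Ht-1) · Wh)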

The most important part of this equation is how the value of the reset gate is used to control how much influence the previous hidden state has on the candidate state. If the value of rt is equal to 1, the entire information from the previous hidden state Ht-1 is being considered. Likewise, if the value of rt is 0, the information from the previous hidden state is completely ignored.

Hidden State

Once we have the candidate state, it is used to generate the current hidden state Ht. This is where the update gate comes into the picture. Instead of using a separate gate as in the LSTM, the GRU uses a single update gate to control both the historical information, which is Ht-1, and the new information, which comes from the candidate state.
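In equation form (same convention as above, with H't the candidate hidden state and ut the update gate output):

Ht = ut ⊙ Ht-1 + (1 − ut) ⊙ H't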

Now assume the value of ut is around 0; then the first term in the equation will vanish, which means the new hidden state will not have much information from the previous hidden state. At the same time, the coefficient of the second term becomes almost one, which essentially means the hidden state at the current timestamp will consist of the information from the candidate state only.
