Unit 5

The document discusses neural networks, focusing on perceptrons, including single-layer and multi-layer perceptrons, their components, and activation functions. It explains the working principles, advantages, and limitations of these models, emphasizing the importance of activation functions in introducing non-linearities for learning. Additionally, it covers types of activation functions such as linear, sigmoid, and tanh, highlighting their characteristics and implications for neural network training.

Uploaded by

sgkv8056
UNIT V
NEURAL NETWORKS
Perceptron - Multilayer perceptron, activation functions, network training –
gradient descent optimization – stochastic gradient descent, error backpropagation,
from shallow networks to deep networks – Unit saturation (aka the vanishing gradient
problem) – ReLU, hyperparameter tuning, batch normalization, regularization, dropout.

1. What is a perceptron? Explain single-layer and multi-layer perceptrons with an
example.
PERCEPTRONS
The perceptron, first proposed by Rosenblatt (1958), is a simple neuron that is
used to classify its input into one of two categories. A perceptron is a single processing
unit of a neural network, which makes it a good learning tool. This model follows the
perceptron training rule and operates well on linearly separable patterns.

Linear separability is the separation of the input space into regions based
on whether the network response is positive or negative.

A perceptron uses a step function that returns +1 if the weighted sum of its inputs (v)
is greater than or equal to 0; otherwise it returns -1.

Working of perceptrons

Representation of a biological neuron

Prepared by Dr.S.Ramesh, AP/CSE



• In biological neurons, the dendrites receive electrical signals from the
axons of other neurons. The signals are modulated by various amounts before
further transmission.
• The signals are transmitted to other neurons only if the modulated signal
exceeds the threshold value. The same principle is applied in the perceptron model.
• In the perceptron, the input received is always represented as numerical values.
These values are multiplied by the weights.
• The total strength of the input is calculated as the weighted sum of the inputs. A
step function (activation function) is applied to determine its output.
• This output is fed to the other perceptrons if it exceeds the threshold value.

Fig. 2.4. Perceptron Model (In this model x0 = 1, which is the bias)

Basic Components of Perceptron


Mr. Frank Rosenblatt invented the perceptron model as a binary classifier which
contains three main components. These are as follows:


Input Nodes or Input Layer:
This is the primary component of the perceptron, which accepts the initial data into the
system for further processing. Each input node contains a real numerical value.
Weight and Bias:
The weight parameter represents the strength of the connection between units and is
one of the most important parameters of the perceptron. The larger the magnitude of a
weight, the more the associated input influences the output.
Further, the bias can be considered as the intercept term in a linear equation.
Activation Function:
This is the final component, which determines whether the
neuron will fire or not. In the basic perceptron, the activation function is a step
function.
Types of Activation functions:
• Sign function
• Step function, and
• Sigmoid function

The data scientist chooses the activation function based on the problem statement and
the desired form of the output. The activation function may be changed (e.g., among
sign, step, and sigmoid) if the learning process is slow or suffers from vanishing or
exploding gradients.
II Year/CSE CS3491-Artificial Intelligence and Machine Learning

How does Perceptron work?
In machine learning, the perceptron is considered a single-layer neural network that
consists of four main parameters: input values (input nodes), weights and bias, net sum,
and an activation function. The perceptron model begins by multiplying all input values
by their weights, then adds these products together to create the weighted sum. This
weighted sum is then passed to the activation function 'f' to obtain the desired output.
In the basic perceptron, this activation function is a step function.

The step function or activation function plays a vital role in ensuring that the output is
mapped to the required values, (0, 1) or (-1, 1). It is important to note that the weight of
an input is indicative of the strength of that input. Similarly, the bias value gives the
ability to shift the activation function curve, which moves the threshold at which the
neuron fires.
Perceptron model works in two important steps as follows:
Step-1
In the first step, multiply all input values by their corresponding weight values and
then add them to determine the weighted sum. Mathematically, we can calculate the
weighted sum as follows:
∑wi*xi = w1*x1 + w2*x2 + … + wn*xn
Add a special term called bias 'b' to this weighted sum to improve the model's
performance.
∑wi*xi + b
Step-2
In the second step, an activation function is applied to the above-mentioned weighted
sum, which gives us an output either in binary form or as a continuous value, as follows:

Y = f(∑wi*xi + b)
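
The two steps above can be sketched in a few lines of Python. This is only an illustrative sketch: the weights, bias, and inputs are hypothetical values chosen for the example, and the step function follows the +1 / -1 convention stated earlier in these notes.

```python
def step(v):
    """Step activation: +1 if the weighted sum v is >= 0, else -1."""
    return 1 if v >= 0 else -1


def perceptron_output(x, w, b):
    """Step 1: weighted sum plus bias. Step 2: apply the activation."""
    v = sum(wi * xi for wi, xi in zip(w, x)) + b
    return step(v)


# Hypothetical weights and bias, chosen only for illustration
w = [0.5, -0.4]
b = -0.1

print(perceptron_output([1.0, 0.5], w, b))  # weighted sum is positive, prints 1
print(perceptron_output([0.0, 1.0], w, b))  # weighted sum is negative, prints -1
```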

Types of Perceptron Models


Based on the layers, Perceptron models are divided into two types. These are as follows:


1. Single-layer Perceptron Model
2. Multi-layer Perceptron Model
Single Layer Perceptron Model
This is one of the simplest types of artificial neural network (ANN). A single-layer
perceptron model consists of a feed-forward network and includes a threshold transfer
function inside the model. The main objective of the single-layer perceptron model is to
classify linearly separable objects with binary outcomes.
In a single-layer perceptron model, the algorithm has no prior knowledge, so it
begins with randomly allocated values for the weight parameters. It then sums all the
weighted inputs. If the total sum is more than a pre-determined threshold value, the
model is activated and shows the output value as +1.
If the outcome matches the desired value, the performance of the model is considered
satisfactory, and the weights do not change. However, discrepancies arise when some
inputs are classified incorrectly. Hence, to obtain the desired output and minimize
errors, the weights must be adjusted.
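
A minimal sketch of this weight-adjustment process is shown below, using the perceptron training rule w ← w + η(t − o)x on the logical AND function, which is linearly separable. The function names, the learning rate, and the choice to start from zero weights (instead of random ones, so that the run is reproducible) are assumptions made for this example.

```python
def predict(x, w, b):
    """Return 1 if w.x + b > 0, otherwise 0 (threshold activation)."""
    v = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if v > 0 else 0


def train_perceptron(data, learning_rate=1.0, max_epochs=100):
    """Perceptron training rule: w <- w + lr * (target - output) * x."""
    # Assumption: weights start at zero so the run is reproducible;
    # the notes describe random initial weights instead.
    w = [0.0] * len(data[0][0])
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for x, target in data:
            error = target - predict(x, w, b)
            if error != 0:
                errors += 1
                w = [wi + learning_rate * error * xi for wi, xi in zip(w, x)]
                b += learning_rate * error
        if errors == 0:  # every example classified correctly, so stop
            break
    return w, b


# Logical AND is linearly separable, so the rule converges
and_data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
w, b = train_perceptron(and_data)
for x, target in and_data:
    print(x, "->", predict(x, w, b), "target:", target)
```

When the output already matches the target, the error term is zero and the weights are left unchanged, which mirrors the description above.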
"Single-layer perceptron can learn only linearly separable patterns."
Multi-Layered Perceptron Model
A multi-layer perceptron model has the same basic structure as a single-layer perceptron
model but adds one or more hidden layers.
The multi-layer perceptron model is trained with the backpropagation algorithm, which
executes in two stages as follows:
Forward Stage: Activations are computed starting from the input layer and ending at
the output layer.
Backward Stage: In the backward stage, weight and bias values are modified as per the
model's requirement. In this stage, the error between the actual output and the desired
output is propagated backward, starting at the output layer and ending at the input layer.
Hence, a multi-layer perceptron can be considered as an artificial neural network with
several layers in which the activation function is non-linear, unlike the linear threshold
unit of a single-layer perceptron. The activation function can be
sigmoid, TanH, ReLU, etc.
A multi-layer perceptron model has greater processing power and can process linear
and non-linear patterns. Further, it can also implement logic gates such as AND, OR,
XOR, NAND, NOT, XNOR, NOR.
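
As an illustration of the last point, XOR cannot be computed by a single perceptron, but it can be computed by a network with one hidden layer. The sketch below uses hand-picked (not learned) weights, which are assumptions chosen for this example: one hidden unit acts as OR, the other as NAND, and the output unit takes the AND of the two.

```python
def step(v):
    """Threshold activation: 1 if v > 0, else 0."""
    return 1 if v > 0 else 0


def neuron(inputs, weights, bias):
    """A single threshold unit: step(w.x + b)."""
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)


def xor_mlp(x1, x2):
    # Hidden layer (hand-picked weights, for illustration only)
    h_or = neuron([x1, x2], [1, 1], -0.5)      # fires if at least one input is 1
    h_nand = neuron([x1, x2], [-1, -1], 1.5)   # fires unless both inputs are 1
    # Output layer: AND of the two hidden units
    return neuron([h_or, h_nand], [1, 1], -1.5)


for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_mlp(a, b))
```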

Prepared by N.GOBINATHAN, AP/CSE



Advantages of Multi-Layer Perceptron:
• A multi-layered perceptron model can be used to solve complex non-linear problems.
• It works well with both small and large input data.
• It helps us to obtain quick predictions after the training.
• It helps to obtain the same accuracy ratio with large as well as small data.
Disadvantages of Multi-Layer Perceptron:
• In a multi-layer perceptron, computations are difficult and time-consuming.
• In a multi-layer perceptron, it is difficult to predict how much each
independent variable affects the dependent variable.
• The model's functioning depends on the quality of the training.
Perceptron Function
The perceptron function f(x) is obtained by multiplying the input 'x' by
the learned weight coefficient 'w' and adding the bias 'b'.
Mathematically, we can express it as follows:
f(x) = 1 if w.x + b > 0; otherwise, f(x) = 0

• 'w' represents the real-valued weight vector
• 'b' represents the bias
• 'x' represents the vector of input values.
Characteristics of Perceptron
The perceptron model has the following characteristics.
1. Perceptron is a machine learning algorithm for supervised learning of
binary classifiers.
2. In Perceptron, the weight coefficient is automatically learned.
3. Initially, the input features are multiplied by the weights, and a decision is
made whether the neuron fires or not.
4. The activation function applies a step rule to check whether the
weighted sum is greater than zero.
5. A linear decision boundary is drawn, enabling the distinction between
the two linearly separable classes +1 and -1.
6. If the weighted sum of all input values is more than the threshold value,
the perceptron produces an output signal; otherwise, no output is produced.


Limitations of Perceptron Model


A perceptron model has limitations as follows:

• The output of a perceptron can only be a binary number (0 or 1) due to
the hard limit transfer function.
• A perceptron can only be used to classify linearly separable sets of
input vectors. If the input vectors are not linearly separable, it cannot classify
them all correctly.

Multi-layer Perceptron

A multi-layer perceptron is a more complex artificial neural network architecture,
formed from multiple layers of perceptrons. TensorFlow is a very popular deep learning
framework that can be used to build such a neural network. To understand what a
multi-layer perceptron is, however, it helps to develop one from scratch using NumPy.

MLP networks are used for supervised learning. A typical learning algorithm for
MLP networks is the backpropagation algorithm.


A multilayer perceptron (MLP) is a feed-forward artificial neural network that generates a
set of outputs from a set of inputs. An MLP is characterized by several layers of
nodes connected as a directed graph between the input and output layers. MLP uses
backpropagation for training the network. MLP is a deep learning method.

2. What are activation functions? Explain the types of activation functions.

Activation Functions:

In artificial neural networks, an activation function is one that outputs a smaller
value for small inputs and a higher value if its inputs are greater than a threshold. An
activation function "fires" if the inputs are big enough; otherwise, nothing happens. An
activation function, then, is a gate that checks whether an incoming value is higher than a
threshold value.

Activation functions are helpful because they introduce non-linearities into neural
networks and enable the networks to learn powerful operations. If the activation
functions were taken out, a feedforward neural network could be refactored into a
straightforward linear function or matrix transformation of its input.
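
The collapse into a single matrix transformation can be checked numerically. The sketch below uses two small hypothetical weight matrices (the values are arbitrary) and shows that applying two layers without activation functions gives the same result as applying one layer whose weight matrix is their product.

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def matmul(A, B):
    """Multiply matrix A by matrix B."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


# Hypothetical weights: layer 1 maps 2 inputs to 3 units, layer 2 maps 3 units to 2 outputs
W1 = [[1, 2], [0, -1], [3, 1]]
W2 = [[1, 0, 2], [-1, 1, 0]]
x = [2, -1]

two_layers = matvec(W2, matvec(W1, x))    # layer by layer, no activation
collapsed = matvec(matmul(W2, W1), x)     # one equivalent linear layer

print(two_layers, collapsed)  # both print [10, 1]
```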

The neuron computes a weighted sum of its inputs and adds a bias to it; the activation
function then determines whether the neuron should be turned on. The activation function
introduces nonlinearity into the neuron's output.

Explanation: As we are aware, neurons in neural networks operate in accordance with
their weights, biases, and corresponding activation functions. Based on the error, the
weights and biases of the neurons inside a neural network are modified. This process is
known as back-propagation. Back-propagation is made possible by activation functions,
since they provide the gradients, together with the error, required to change the biases
and weights.

The Activation Functions can be basically divided into 2 types-

1. Linear Activation Function


2. Non-linear Activation Functions

Linear or Identity Activation Function

The function is a straight line, so its output is not confined to any range.

Fig: Linear Activation Function

Equation : f(x) = x

Range : (-infinity to infinity)

It does not help the network capture the complexity of the data that is usually fed to
neural networks.

Non-Linear Neural Networks Activation Functions

Sigmoid / Logistic Activation Function

This function takes any real value as input and outputs values in the range of 0 to 1.

The larger the input (more positive), the closer the output value will be to 1.0, whereas the
smaller the input (more negative), the closer the output will be to 0.0, as shown below.

Sigmoid/Logistic Activation Function

Mathematically it can be represented as: f(x) = 1 / (1 + e^(-x))

Here's why the sigmoid/logistic activation function is one of the most widely used functions:

• It is commonly used for models where we have to predict a probability as the
output. Since a probability lies only between 0 and 1, sigmoid is
the right choice because of its range.

• The function is differentiable and provides a smooth gradient, i.e., it prevents jumps in
output values. This is represented by the S-shape of the sigmoid activation function.

The limitations of the sigmoid function are discussed below:

• The derivative of the function is f'(x) = sigmoid(x)*(1-sigmoid(x)).


The derivative of the Sigmoid Activation Function


As we can see from the above Figure, the gradient values are only significant for range -3 to 3,
and the graph gets much flatter in other regions.

It implies that for values greater than 3 or less than -3, the function will have very small
gradients. As the gradient value approaches zero, the network ceases to learn and suffers from
the Vanishing gradient problem.

• The output of the logistic function is not symmetric around zero. So the outputs of all
the neurons will be of the same sign. This makes the training of the neural network
more difficult and unstable.
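
The derivative formula and the flat regions outside the range -3 to 3 can be checked with a short sketch. The sample points are arbitrary, and the last lines only illustrate an upper bound: since the sigmoid derivative never exceeds 0.25, the product of n such factors (one per layer, ignoring the weights) is at most 0.25^n.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """f'(x) = sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


# The gradient peaks at x = 0 and becomes very small beyond about +/-3
for x in (-6.0, -3.0, 0.0, 3.0, 6.0):
    print(f"x={x:+.1f}  sigmoid={sigmoid(x):.4f}  gradient={sigmoid_grad(x):.4f}")

# Upper bound on the product of sigmoid derivatives across n layers
for n in (1, 5, 10):
    print(n, "layers: at most", 0.25 ** n)
```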

Tanh Function (Hyperbolic Tangent)

The tanh function is very similar to the sigmoid/logistic activation function, and even has the
same S-shape, with the difference that its output range is -1 to 1. In tanh, the larger the input
(more positive), the closer the output value will be to 1.0, whereas the smaller the input (more
negative), the closer the output will be to -1.0.


Tanh Function (Hyperbolic Tangent)

Mathematically it can be represented as: f(x) = (e^x - e^(-x)) / (e^x + e^(-x))

Advantages of using this activation function are:

• The output of the tanh activation function is zero centered; hence we can easily map
the output values as strongly negative, neutral, or strongly positive.

• It is usually used in hidden layers of a neural network, as its values lie between -1 and 1;
therefore, the mean for the hidden layer comes out to be 0 or very close to it. This helps
in centering the data and makes learning for the next layer much easier.

Have a look at the gradient of the tanh activation function to understand its limitations.


Gradient of the Tanh Activation Function

Note: Although both sigmoid and tanh face the vanishing gradient issue, tanh is zero centered, and
the gradients are not restricted to move in a certain direction. Therefore, in practice, the tanh
nonlinearity is usually preferred to the sigmoid nonlinearity.

As you can see, it also faces the problem of vanishing gradients, similar to the sigmoid
activation function. In addition, the gradient of the tanh function is much steeper than that of
the sigmoid function.
The advantage is that negative inputs are mapped strongly negative and zero
inputs are mapped near zero in the tanh graph.
The function is differentiable.
The function is monotonic while its derivative is not monotonic.
The tanh function is mainly used for classification between two classes.
Both tanh and logistic sigmoid activation functions are used in feed-forward nets.
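
The comparison between the two gradients can be sketched as follows. The derivative of tanh is 1 − tanh(x)^2, which peaks at 1 (against 0.25 for sigmoid) but still shrinks towards zero for large |x|. The last line checks the standard identity tanh(x) = 2·sigmoid(2x) − 1, which shows that tanh is a rescaled, zero-centered sigmoid. The sample points are arbitrary.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)


def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2


for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={x:+.1f}  tanh={math.tanh(x):+.4f}  "
          f"tanh'={tanh_grad(x):.4f}  sigmoid'={sigmoid_grad(x):.4f}")

# tanh is a shifted and rescaled sigmoid: tanh(x) = 2 * sigmoid(2x) - 1
print(abs(math.tanh(0.7) - (2.0 * sigmoid(1.4) - 1.0)) < 1e-12)
```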


Fig: Activation Function


Sigmoid

The sigmoid function comes in two forms, logistic and tangential. The values of the
logistic function range from 0 to 1, and those of the tangential function from -1 to +1.

It is a function that is graphed in an "S" shape.

A = 1/(1 + e^(-x)).
Non-linear in nature. Observe that for X values between -2 and 2, the Y values are very
steep. To put it another way, small changes in X in this region cause significant shifts in
the value of Y. The output spans from 0 to 1.
Uses: The sigmoid function is typically employed in the output nodes of a classification
network, where the result may only be either 0 or 1. Since the value of the sigmoid function
ranges only from 0 to 1, the result can be predicted to be 1 if the value is more than 0.5
and 0 if it is not.
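
The 0.5 decision rule for an output node can be sketched as below. The function name `classify` and the sample inputs are hypothetical; `z` stands for the weighted sum arriving at the output node.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def classify(z, threshold=0.5):
    """Predict class 1 if the sigmoid output is more than the threshold, else 0."""
    return 1 if sigmoid(z) > threshold else 0


for z in (-2.0, 0.0, 2.0):
    print(f"z={z:+.1f}  sigmoid={sigmoid(z):.3f}  class={classify(z)}")
```

Because sigmoid(0) is exactly 0.5, a weighted sum of zero falls in class 0 under the rule as stated above.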
3. Explain in detail about gradient descent and stochastic gradient descent with
an example.
Training rules in Perceptron: Gradient Descent and Delta Rule

• The perceptron rule fails to converge if the examples are not linearly separable.
• The delta rule is designed to overcome this difficulty.
• If the training examples are not linearly separable, the delta rule converges toward
a best-fit approximation to the target concept.
• The key idea behind the delta rule is to use gradient descent to search the
hypothesis space of possible weight vectors to find the weights that best fit the training
examples.
• Gradient descent searches the hypothesis space of possible weight vectors to find the
best one. It can be applied to hypothesis spaces containing many different types of
continuously parameterized hypotheses.

“Gradient descent is an iterative algorithm that starts from a random point
on a function and travels down its slope in steps until it reaches the
lowest point.”

• To find a local minimum of a function using gradient descent, one takes steps
proportional to the negative of the gradient (or of the approximate gradient) of the
function at the current point.

• A linear unit corresponds to the first stage of a perceptron, without the
threshold. Its output is o = w . x.

• The training error of a weight vector w is measured as
E(w) = (1/2) ∑ (td − od)^2, summed over all d in D,
where D is the set of training examples, td is the target output for
training example d, and od (the predicted output) is the output of the linear unit for
training example d.
• E(w) is half the squared difference between the target output and the predicted
output, summed over all training examples. This is the deviation between the predicted
and target outputs. In other words, it is the error in prediction.
• The best machine learning model will try to minimise this error value through
continuous learning.
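
A minimal sketch of batch gradient descent (the delta rule) for a linear unit is given below. It minimises E(w), half the sum of squared differences between target and output, by accumulating η(td − od)xd over all training examples before each weight update. The tiny dataset, the learning rate, and the function names are assumptions for this example; the targets were generated from the weights [2, -1], so the procedure should recover approximately those values.

```python
def linear_output(w, x):
    """Output of a linear unit: o = w . x (no threshold)."""
    return sum(wi * xi for wi, xi in zip(w, x))


def error(w, data):
    """E(w) = 1/2 * sum over examples of (target - output)^2."""
    return 0.5 * sum((t - linear_output(w, x)) ** 2 for x, t in data)


def batch_gradient_descent(data, eta=0.05, epochs=200):
    w = [0.0] * len(data[0][0])
    for _ in range(epochs):
        delta = [0.0] * len(w)
        for x, t in data:                      # sum the gradient over ALL examples
            o = linear_output(w, x)
            for i, xi in enumerate(x):
                delta[i] += eta * (t - o) * xi
        w = [wi + di for wi, di in zip(w, delta)]   # one update per epoch
    return w


# Hypothetical data generated by t = 2*x1 - 1*x2
data = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0), ([2.0, 1.0], 3.0)]
w = batch_gradient_descent(data)
print("weights:", [round(wi, 3) for wi in w])
print("error before:", error([0.0, 0.0], data), "error after:", error(w, data))
```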


"A gradient measures how much the output of a function changes if you change the
inputs a little bit." —Lex Fridman (MIT)

Importance of Learning rate

The learning rate determines how big the steps are that gradient descent takes in the
direction of the local minimum, that is, how fast or slowly we move towards
the optimal weights.

For gradient descent to reach the local minimum we must set the learning rate to an
appropriate value, which is neither too low nor too high. This is important because if the
steps it takes are too big, it may not reach the local minimum because it bounces back and
forth across the convex function (see the left image). If we set the
learning rate to a very small value, gradient descent will eventually reach the local
minimum but that may take a while (see the right image).
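
The effect of the learning rate can be seen on a simple convex function. The sketch below minimises f(w) = w^2 (gradient 2w, minimum at w = 0) with three hypothetical learning rates: a moderate one, one that is too large, and one that is too small. The function, starting point, and rates are chosen only for illustration.

```python
def descend(learning_rate, steps=20, w=5.0):
    """Run gradient descent on f(w) = w^2, whose gradient is 2w."""
    for _ in range(steps):
        w = w - learning_rate * 2.0 * w
    return w


print("lr = 0.1   ->", descend(0.1))     # moves steadily towards the minimum at 0
print("lr = 1.1   ->", descend(1.1))     # overshoots on every step and diverges
print("lr = 0.001 ->", descend(0.001))   # barely moves in 20 steps
```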

Advantages:
• Easy computation.
• Easy to implement.
• Easy to understand.
Disadvantages:
• May get trapped at local minima.
• Weights are changed after calculating the gradient on the whole dataset. So, if the
dataset is too large, it may take a very long time to converge to the minimum.
• Requires large memory to calculate the gradient on the whole dataset.

Stochastic Gradient Descent (SGD)


This is an improvement over gradient descent that reduces the computations.
• SGD, or incremental gradient descent, randomly picks one data point from the
whole data set at each iteration, which reduces the computations enormously.
• SGD approximates the gradient descent search by updating the weights
incrementally, following the calculation of the error for each individual
example.
• Stochastic gradient descent (SGD) is a type of gradient descent that processes one
training example per iteration. Within each training epoch it goes through the examples
in the dataset and updates the parameters after each training example, one at a time.
• As it requires only one training example at a time, it is easier to store in the
allocated memory. However, it loses some computational efficiency in
comparison to batch gradient descent, because its frequent updates add
overhead.


• Further, due to the frequent updates, the gradient is noisy. However, this noise
can sometimes be helpful in escaping a local minimum and finding the global
minimum.
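
The incremental update described in the points above can be sketched as follows: at each iteration one training example is picked at random and the weights of a linear unit are updated from that example alone. The dataset, learning rate, number of steps, and random seed are assumptions for this example; the targets follow t = 2*x1 − x2, so the weights should approach [2, -1].

```python
import random


def linear_output(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def sgd(data, eta=0.05, steps=2000, seed=0):
    """Stochastic gradient descent: one randomly chosen example per update."""
    rng = random.Random(seed)        # fixed seed so the run is reproducible
    w = [0.0] * len(data[0][0])
    for _ in range(steps):
        x, t = rng.choice(data)      # pick ONE training example at random
        o = linear_output(w, x)
        w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
    return w


# Hypothetical data generated by t = 2*x1 - 1*x2
data = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0), ([2.0, 1.0], 3.0)]
w = sgd(data)
print("weights:", [round(wi, 3) for wi in w])
```

Only one example is held in memory per update, which is the memory advantage noted above; the price is that individual updates follow a noisy estimate of the true gradient.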
Advantages of Stochastic Gradient Descent:
• It is easier to fit in the allocated memory.
• Each update is faster to compute than in batch gradient descent.
• It is more efficient for large datasets.
Disadvantages of Stochastic Gradient Descent:
a) SGD requires a number of hyperparameters, such as the
regularization parameter and the number of iterations.
b) SGD is sensitive to feature scaling.
Differences between Gradient Descent and Stochastic Gradient Descent


Gradient Descent | Stochastic Gradient Descent
The error is summed over all the examples before updating the weights. | The weights are updated after each training example.
This requires more computations. | Relatively fewer computations are needed.
Summing over multiple examples requires a larger step size per weight update. | This requires a smaller step size per weight update.
4. Explain in detail about error backpropagation with an example.
Error Backpropagation
Backpropagation is a training method used for multi-layer neural networks. It is
also called the generalized delta rule. It is a gradient descent method which minimizes the
total squared error of the output computed by the net.
The backpropagation algorithm looks for the minimum value of the error function in
weight space using a technique called the delta rule or gradient descent. The weights that
minimize the error function are then considered to be a solution to the learning problem.
Backpropagation is a systematic method for training multiple-layer ANNs. It is a
generalization of the Widrow-Hoff error correction rule. About 80% of ANN applications use
backpropagation.
Consider a simple neuron:
a. The neuron has a summing junction and an activation function.


b. Any non-linear function which is differentiable everywhere and increases everywhere with
the sum can be used as the activation function.

Fig: Back Propagation network

c. Examples: logistic function, arc tangent function, hyperbolic tangent function.
The multilayer network has greater representational power than a single-layer network
only when such a non-linearity is introduced.
Need of hidden layers
1. A network with only two layers (input and output) can only represent the input with
whatever representation already exists in the input data.
2. If the data is discontinuous or non-linearly separable, the innate representation is
inconsistent, and the mapping cannot be learned using two layers (input and output).
3. Therefore, hidden layer(s) are used between the input and output layers. Weights connect
each unit (neuron) in one layer only to those in the next higher layer.
The output of the unit is scaled by the value of the connecting weight, and it is fed forward
to provide a portion of the activation for the units in the next higher layer.
• Backpropagation can be applied to an artificial neural network with any number of
hidden layers. The training objective is to adjust the weights so that the application of a set
of inputs produces the desired outputs.


Training procedure: The network is usually trained with a large number of input-output
pairs.
1. Initialize the weights to small random values (both positive and negative) to
ensure that the network is not saturated by large values of weights.
2. Choose a training pair from the training set.
3. Apply the input vector to the network input.
4. Calculate the network output.
5. Calculate the error, the difference between the network output and the desired output.
6. Adjust the weights of the network in a way that minimizes this error.
7. Repeat steps 2 to 6 for each input-output pair in the training set until the error for the
entire system is acceptably low.
Forward pass and backward pass:
• Backpropagation neural network training involves two passes.
1. In the forward pass, the input signals move forward from the network input to the
output.
2. In the backward pass, the calculated error signals propagate backward through the
network, where they are used to adjust the weights.
3. In the forward pass, the calculation of the output is carried out layer by layer in the
forward direction. The output of one layer is the input to the next layer.
• In the reverse pass:
  • The weights of the output neuron layer are adjusted first, since the target value of
  each output neuron is available to guide the adjustment of the associated weights,
  using the delta rule.
  • Next, we adjust the weights of the middle layers. As the middle layer neurons have
  no target values, this makes the problem more complex.
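
The training procedure and the two passes can be put together in a small sketch. It assumes sigmoid activations, a squared-error loss, one hidden layer, and an update after every training pair; the network size, learning rate, random seed, and the toy dataset (logical OR) are all hypothetical choices for illustration, not a prescribed implementation.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def forward(x, W1, b1, W2, b2):
    """Forward pass: hidden activations h, then the single output y."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    y = sigmoid(sum(w * hi for w, hi in zip(W2, h)) + b2)
    return h, y


def total_error(data, W1, b1, W2, b2):
    return 0.5 * sum((t - forward(x, W1, b1, W2, b2)[1]) ** 2 for x, t in data)


def train(data, hidden=2, eta=0.5, epochs=2000, seed=0):
    rng = random.Random(seed)
    n_in = len(data[0][0])
    # Step 1: small random weights, both positive and negative
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(hidden)]
    b1 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
    W2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
    b2 = rng.uniform(-0.5, 0.5)
    history = [total_error(data, W1, b1, W2, b2)]
    for _ in range(epochs):
        for x, t in data:                               # steps 2 to 4
            h, y = forward(x, W1, b1, W2, b2)
            # Backward pass: the output error term comes first (target is known)
            delta_out = (t - y) * y * (1.0 - y)
            # Hidden units have no targets: their error term is propagated back
            delta_hid = [delta_out * W2[j] * h[j] * (1.0 - h[j]) for j in range(hidden)]
            for j in range(hidden):                     # step 6: adjust the weights
                W2[j] += eta * delta_out * h[j]
                for i in range(n_in):
                    W1[j][i] += eta * delta_hid[j] * x[i]
                b1[j] += eta * delta_hid[j]
            b2 += eta * delta_out
        history.append(total_error(data, W1, b1, W2, b2))
    return (W1, b1, W2, b2), history


# Hypothetical toy dataset: logical OR
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
params, history = train(data)
print("error before training:", round(history[0], 4))
print("error after training: ", round(history[-1], 4))
```

The same loop can be tried on XOR, although that may need more epochs or different initial weights before the error falls.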
Selection of number of hidden units: The number of hidden units depends on the
number of input units.
1. Never choose h (the number of hidden units) to be more than twice the number of input units.
2. You can load p patterns of I elements into log2 p hidden units.
3. Ensure that we must have at least 1/e times as many training examples.

4. Feature extraction requires fewer hidden units than inputs.

5. Learning many examples of disjointed inputs requires more hidden units than inputs.
6. The number of hidden units required for a classification task increases with the number
of classes in the task. Large networks require longer training times.
Factors influencing backpropagation training:
The training time can be reduced by using:
1. Bias: Networks with biases can represent relationships between inputs and outputs more
easily than networks without biases. Adding a bias to each neuron is usually desirable to
offset the origin of the activation function. The weight of the bias is trainable like any
other weight, except that its input is always +1.
2. Momentum: The use of momentum enhances the stability of the training process.
Momentum is used to keep the training process going in the same general direction,
analogous to the way that the momentum of a moving object behaves. In backpropagation with
momentum, the weight change is a combination of the current gradient and the previous
weight change.
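
A minimal sketch of the momentum update is shown below, assuming the common form Δw(t) = −η·gradient + α·Δw(t−1). The test function f(w) = w^2, the learning rate η, and the momentum coefficient α are hypothetical values chosen for illustration.

```python
def gradient_descent_with_momentum(grad, w0, eta=0.1, alpha=0.9, steps=200):
    """Each weight change mixes the current gradient with the previous change."""
    w = w0
    change = 0.0                                   # previous weight change
    for _ in range(steps):
        change = alpha * change - eta * grad(w)    # momentum term + gradient term
        w += change
    return w


# Minimise f(w) = w^2, whose gradient is 2w and whose minimum is at w = 0
w_final = gradient_descent_with_momentum(lambda w: 2.0 * w, w0=5.0)
print(w_final)
```

Setting alpha to 0 recovers plain gradient descent, which makes the role of the momentum term easy to compare.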
Advantages and Disadvantages
Advantages of backpropagation:
1. It is simple, fast and easy to program.
2. It has no parameters to tune apart from the number of inputs.
3. No need to have prior knowledge about the network.
4. It is flexible.
5. It is a standard approach and works efficiently.
6. It does not require the user to learn special functions.
Disadvantages of backpropagation:
• Backpropagation can be sensitive to noisy data and irregularities.
• Its performance is highly reliant on the input data.
• It needs excessive time for training.
• It needs a matrix-based method for backpropagation instead of mini-batch.
5. Explain in detail about shallow and deep networks with an example.
Shallow Networks:
The terms shallow and deep refer to the number of layers in a neural network; a shallow
neural network has a small number of layers, usually
regarded as having a single hidden layer, and a deep neural network has
multiple hidden layers. Each type of network performs certain tasks
better than the other, and selecting the right network depth is important for creating a
successful model.
• In a shallow neural network, the values of the feature vector of the data to be classified
(the input layer) are passed to a hidden layer of nodes (neurons), each of which generates a
response according to some activation function, g, acting on the weighted sum of those
values, z.
• The responses of the units in the hidden layer are then passed to a final output layer
(which may consist of a single unit), whose activation produces the classification
prediction.
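The forward pass of such a shallow network can be sketched as below. This is a minimal example under stated assumptions: the sigmoid is used as the activation g, and the weights, biases and layer sizes (2 inputs, 3 hidden units, 1 output unit) are hypothetical values picked only for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases, g):
    """Apply activation g to the weighted sum z of the inputs, one unit per weight row."""
    return [g(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def shallow_forward(x, W_hidden, b_hidden, W_out, b_out):
    hidden = dense(x, W_hidden, b_hidden, sigmoid)   # hidden-layer responses
    return dense(hidden, W_out, b_out, sigmoid)      # output-layer prediction


# Hypothetical weights: 2 input features, 3 hidden units, 1 output unit.
W_hidden = [[0.5, -0.4], [0.3, 0.8], [-0.6, 0.1]]
b_hidden = [0.0, 0.1, -0.1]
W_out = [[1.0, -1.0, 0.5]]
b_out = [0.0]

prediction = shallow_forward([1.0, 2.0], W_hidden, b_hidden, W_out, b_out)
print(prediction)  # a single value between 0 and 1
```

A deep network would simply chain more calls to `dense` between the input and the output layer.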
Deep Network:
Deep learning is an area of machine learning research that was introduced with the
objective of moving machine learning closer to one of its original goals, artificial
intelligence. Deep learning is about learning multiple levels of representation and
abstraction that help to make sense of data such as images, sound, and text.
'Deep learning' means using a neural network with several layers of nodes between
input and output. It is generally better than other methods on image, speech and certain
other types of data because the series of layers between input and output performs feature
identification and processing in a series of stages, just as our brain seems to.
Deep learning emphasizes the network architecture of today's most successful
machine learning approaches. These methods are based on "deep" multi-layer neural
networks with many hidden layers.
TensorFlow
• TensorFlow is one of the most popular frameworks used to build deep learning
models. The framework is developed by the Google Brain team.
• The framework supports languages such as C++, R and Python for creating
models as well as libraries. It can be used on both desktop and mobile.
• Google's translator is a well-known application of TensorFlow. Its model combines
functionality such as text classification, natural language processing, speech and
handwriting recognition, and image recognition.
• The framework has its own visualization toolkit, TensorBoard, which provides
data visualization of the network along with its performance.
• Another tool, TensorFlow Serving, can be used for quick and easy deployment of
newly developed algorithms without introducing any change in the existing API or
architecture.
• TensorFlow comes with detailed documentation that lets users adopt it quickly
and easily, which has helped make it one of the most preferred frameworks for
modeling deep learning algorithms.
Some of the characteristics of TensorFlow are:
• Multiple GPUs are supported.
• Graphs and queues can be visualized easily using TensorBoard.
• Extensive documentation and strong community support.
Keras
If you are comfortable programming in Python, then learning Keras will not
prove hard. It is the most commonly recommended framework for creating deep learning
models for those with a sound knowledge of Python.
Keras is written in Python and can run on top of TensorFlow. Because of its
complexity and use of low-level libraries, TensorFlow can be harder for new users to
adopt than Keras. Beginners in deep learning who find models difficult to understand
in TensorFlow generally prefer Keras, as it lets complex models be built quickly.
Keras has been developed with the complexities of deep learning models in mind,
so results can be obtained with minimal time and effort.
Convolutional as well as recurrent neural networks are supported in Keras. The
framework runs on both CPU and GPU.


The models in Keras can be classified into 2 categories:

1. Sequential model:
The layers in the deep learning model are defined in a sequential manner, so the
layers are also implemented one after another.

2. Keras functional API:
Deep learning models that have multiple outputs or shared layers, i.e. more
complex models, can be implemented with the Keras functional API.
Difference between Deep Network and Shallow Network
Deep network | Shallow network
A deep network contains many hidden layers. | A shallow network contains only one hidden layer.
A deep network can compactly express highly complex functions over the input space. | A shallow network with one hidden layer cannot compactly express complex functions over the input space.
With suitable techniques, a DN can be trained effectively and local minima are rarely a problem in practice. | A shallow network that must fit the same complex function is more difficult to train with current algorithms.
A DN can fit functions better with fewer parameters than a shallow network. | A shallow network needs more parameters to achieve a comparable fit.

6. Explain in detail about vanishing gradient problem.


Vanishing Gradient Problem
• The vanishing gradient problem arises when neural networks are trained with
gradient-based methods such as backpropagation. It makes it difficult to learn and
tune the parameters of the earlier layers in the network.
• The vanishing gradient problem is essentially a situation in which a deep multilayer
feed-forward network or a Recurrent Neural Network (RNN) does not have the
ability to propagate useful gradient information from the output end of the model
back to the layers near the input end of the model.
• As a result, models with many layers may be unable to learn on a given dataset,
or may converge prematurely to a substandard solution.
• As the backpropagation algorithm moves backward from the output layer to the
input layer, the gradients tend to shrink, becoming smaller and smaller until they
approach zero. This leaves the weights of the initial (lower) layers practically
unchanged, so gradient descent may never converge to the optimum.
• A vanishing gradient does not necessarily mean that the gradient vector is exactly
zero. It means that the gradients are minuscule, which makes learning very slow.
• For recurrent networks, one of the most important solutions to the vanishing gradient
problem is a specific type of neural network called the Long Short-Term Memory
network (LSTM).
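The shrinking of gradients with depth can be illustrated numerically. The sketch below assumes every layer uses a sigmoid activation evaluated at z = 0 and a weight of 1; these values, and the helper name `gradient_scale`, are assumptions made only to show the effect of multiplying many small derivatives together.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # never larger than 0.25


def gradient_scale(num_layers, z=0.0, weight=1.0):
    """Factor by which a gradient is scaled after passing back through num_layers layers."""
    scale = 1.0
    for _ in range(num_layers):
        scale *= weight * sigmoid_grad(z)
    return scale


for depth in (1, 5, 10, 20):
    print(depth, gradient_scale(depth))
```

Because each sigmoid layer contributes a factor of at most 0.25, the gradient reaching the first layers of a 20-layer network is many orders of magnitude smaller than the gradient at the output.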

Indication of vanishing gradient problem:


a) The parameters of the higher layers change to a great extent, while the parameters of
lower layers barely change.
b) The model weights could become 0 during training.
c) The model learns at a particularly slow pace and the training could stagnate at a very
early phase after only a few iterations.
Some methods proposed to overcome the vanishing gradient problem:
a) Residual neural networks (ResNets)
b) Multi-level hierarchy
c) Long short-term memory (LSTM)
d) Faster hardware
e) ReLU
f) Batch normalization.


7. Explain in detail about ReLU function and its usage in hidden layer.
ReLU
• The Rectified Linear Unit (ReLU) helps to solve the vanishing gradient problem. ReLU
is a non-linear (piecewise linear) function that outputs the input directly if it is
positive and outputs zero otherwise.
• It is the most commonly used activation function in neural networks, especially in
Convolutional Neural Networks (CNNs) and multilayer perceptrons.
• Mathematically, it is expressed as
f(x) = max(0, x)
where x is the input to the neuron.

Fig. ReLU function


• The derivative of an activation function is required when updating the weights
during the back-propagation of the error. The slope of ReLU is 1 for positive values
and 0 for negative values. It is non-differentiable when the input x is zero, but the
derivative there can safely be taken to be zero, which causes no problem in practice.
• ReLU is used in the hidden layers instead of sigmoid or tanh. It avoids the
computational cost of the logistic sigmoid and tanh functions, which require
exponentials.
• A ReLU activation unit is less likely to create a vanishing gradient problem because
its derivative is always 1 for positive values of the argument.
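The function and its derivative are short enough to write out directly. The following is a minimal sketch; the names `relu` and `relu_grad` are chosen for this example, and the derivative at x = 0 is set to 0 by the convention mentioned above.

```python
def relu(x):
    return max(0.0, x)


def relu_grad(x):
    # Slope is 1 for positive inputs and 0 otherwise; the value at x == 0 is
    # taken to be 0 by convention.
    return 1.0 if x > 0 else 0.0


values = [-2.0, -0.5, 0.0, 0.5, 2.0]
outputs = [relu(v) for v in values]        # [0.0, 0.0, 0.0, 0.5, 2.0]
slopes = [relu_grad(v) for v in values]    # [0.0, 0.0, 0.0, 1.0, 1.0]
```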


Advantages of ReLU function:


a) ReLU is simple to compute and has a predictable gradient for the backpropagation
of the error.
b) It is easy to implement.
c) It is very fast to calculate because the function is a simple linear relationship for
positive inputs.
d) It can be used for deep network training.
Disadvantages of ReLU function:
a) When the input is negative, ReLU outputs zero and its gradient is also zero. A neuron
that keeps receiving negative inputs therefore stops updating and "dies". This
problem is also known as the Dead Neurons problem.
b) The ReLU function is generally used only within the hidden layers of a neural network
model.

LReLU and EReLU

1. LReLU
• The Leaky ReLU (LReLU) is one of the most well-known activation functions. It is the
same as ReLU for positive numbers, but instead of being 0 for all negative values, it
has a small constant slope (less than 1).
• Leaky ReLU helps to prevent the function from saturating at 0. For negative inputs it
has a small slope, whereas the standard ReLU has a slope of zero.
• Leaky ReLUs are one attempt to fix the "dying ReLU" problem. Fig. shows the LReLU
function.

Fig. LReLU function


• The leak helps to increase the range of the ReLU function: f(x) = x for x > 0 and
f(x) = a·x otherwise, where a is the slope for negative inputs. Usually, the value of
a is 0.01 or so.
• The main motivation for using LReLU instead of ReLU is that constant zero gradients
can also result in slow learning, as when a saturated neuron uses a sigmoid
activation function.
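A minimal sketch of the Leaky ReLU and its derivative follows, using the slope a = 0.01 mentioned above as the default. The function names are chosen for this example only.

```python
def leaky_relu(x, a=0.01):
    # Identity for positive inputs, small slope a for negative inputs.
    return x if x > 0 else a * x


def leaky_relu_grad(x, a=0.01):
    # The gradient never becomes exactly zero, so the unit cannot "die".
    return 1.0 if x > 0 else a


print(leaky_relu(3.0), leaky_relu(-2.0))
print(leaky_relu_grad(3.0), leaky_relu_grad(-2.0))
```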
EReLU:
• An Elastic ReLU (EReLU) considers a slope randomly drawn from a uniform
distribution during training for the positive inputs to control the amount of
non-linearity.
• The EReLU is defined as EReLU(x) = max(Rx, 0), with output range [0, ∞), where R is
a random number.
• At test time, the EReLU becomes the identity function for positive inputs.

7. Write short note on hyperparameter tuning.


Hyperparameter Tuning
• Hyperparameters are parameters whose values control the learning process and
determine the values of the model parameters that a learning algorithm ends up
learning.
• While designing a machine learning model, one always has multiple choices for the
architectural design, and it is not clear in advance which design is optimal. Because
of this, building a good machine learning model always involves trials.
• The parameters that define the model in this way are known as hyperparameters,
and the rigorous search for the values that build an optimized model is known as
hyperparameter tuning.
• Hyperparameters are not model parameters: model parameters can be learned
directly from data, whereas hyperparameters must be set before training. Model
parameters specify how to transform the input into the required output, whereas
hyperparameters define the structure of the model and how it is trained.


Layer Size:
• Layer size is defined by the number of neurons in a given layer. The input and output
layers are relatively easy to size because they correspond directly to how the
modeling problem handles input and output.
• For the input layer, the size matches the number of features in the input vector.
For the output layer, it is either a single output neuron or a number of neurons
matching the number of classes to be predicted.
• For the number of layers, a network with 3 layers often performs better than one with
2 layers, while going beyond 3 usually helps much less in fully connected networks. In
the case of CNNs, increasing the number of layers generally improves the model.
Magnitude: Learning Rate
• The amount by which the weights are updated during training is referred to as the step
size or the learning rate. Specifically, the learning rate is a configurable hyperparameter
used in the training of neural networks that has a small positive value, often in the
range between 0.0 and 1.0.
• For example, if the learning rate is 0.1, then the weights in the network are updated by
0.1 * (estimated weight error), or 10 % of the estimated weight error, each time the
weights are updated. The learning rate controls the rate or speed at which the model
learns.
• Learning rates are tricky because they end up being specific to the dataset and even to
other hyperparameters. This creates a lot of overhead in finding the right setting.
• Large learning rates make the model learn faster, but they may cause it to miss the
minimum of the loss function and only reach its surroundings. When the learning rate is
too large, the optimizer overshoots the minimum and the updates lead to divergent
behavior.
• On the other hand, lower learning rates give a better chance of finding a local minimum,
with the trade-off of needing a larger number of epochs and more time.
• Momentum can accelerate learning on problems where the high-dimensional weight space
being navigated by the optimization process has structures that mislead the gradient
descent algorithm, such as flat regions or steep curvature.
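The effect of the learning rate can be seen on a very small example. The sketch below runs plain gradient descent on the toy loss L(w) = w², which is an assumption chosen because its minimum (w = 0) and gradient (2w) are known; the three learning rates are illustrative values, not recommendations.

```python
def gradient_descent(lr, w0=1.0, steps=20):
    """Minimise the toy loss L(w) = w**2 (gradient 2*w) and return the final w."""
    w = w0
    for _ in range(steps):
        w -= lr * 2 * w
    return w


small = gradient_descent(lr=0.01)   # slow: still far from the minimum at 0
good = gradient_descent(lr=0.1)     # approaches the minimum steadily
large = gradient_descent(lr=1.1)    # overshoots on every step and diverges

print(small, good, large)
```

With the same number of steps, the smallest rate has barely moved, the middle rate is close to the minimum, and the largest rate ends up farther from the minimum than where it started.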


8. Explain in detail about normalization and batch normalization with an example.
Normalization
• Normalization is a data preparation technique that is frequently used in machine learning.
The process of transforming the columns in a dataset to the same scale is referred to as
normalization. Not every dataset needs to be normalized for machine learning.
• Normalization makes the features more consistent with each other, which allows the
model to predict outputs more accurately. The main goal of normalization is to make the
data homogeneous over all records and fields.
• Normalization refers to rescaling real-valued numeric attributes into the range 0 to 1.
Data normalization is used in machine learning to make model training less sensitive to
the scale of features.
• Normalization is important in algorithms such as k-NN, support vector machines, neural
networks and principal component analysis. The type of feature preprocessing and
normalization that is needed can depend on the data.
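Rescaling a column into the 0 to 1 range is commonly done with min-max scaling. The sketch below assumes that technique; the function name `min_max_normalize` and the sample `ages` column are hypothetical.

```python
def min_max_normalize(column):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:                      # constant column: avoid division by zero
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


ages = [20, 30, 40, 60]
scaled = min_max_normalize(ages)    # [0.0, 0.25, 0.5, 1.0]
```

Each feature column would be rescaled separately so that features with large raw values do not dominate the others.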
Batch Normalization
What is Batch Normalization?

Normalization is a data pre-processing tool used to bring numerical data to a common
scale without distorting its shape.

Generally, when we input data to a machine learning or deep learning algorithm, we tend
to change the values to a balanced scale. The reason we normalize is partly to ensure that
our model can generalize appropriately.

Batch normalization is a process that makes neural networks faster and more stable by
adding extra layers to a deep neural network. The new layer performs standardizing and
normalizing operations on the input of a layer coming from a previous layer.

But what is the reason behind the term “Batch” in batch normalization? A typical neural
network is trained using a collected set of input data called a batch. Similarly, the
normalizing process in batch normalization takes place in batches, not on a single input.


Let’s understand this through an example. We have a deep neural network as shown in the
following image.

Initially, our inputs X1, X2, X3, X4 are in normalized form as they come from the
pre-processing stage. When the input passes through the first layer, it is transformed: a
sigmoid function is applied over the dot product of the input X and the weight matrix W.

Similarly, this transformation takes place in the second layer and continues up to the last
layer L, as shown in the following image.


Although our input X was normalized, over time the output will no longer be on the same
scale. As the data goes through multiple layers of the neural network and the activation
functions of the L layers are applied, it leads to an internal covariate shift in the data.

How does Batch Normalization work?

Now that we have a clear idea of why we need batch normalization, let’s understand
how it works. It is a two-step process. First, the input is normalized, and then rescaling
and offsetting are performed.

Normalization of the Input

Normalization is the process of transforming the data to have a mean of zero and a standard
deviation of one. In this step we have our batch input from layer h. First, we need to
calculate the mean of this hidden activation:

μ = (1/m) ∑ h_i

Here, m is the number of hidden activations h_i being averaged; in standard batch
normalization these are the values of one neuron at layer h over the examples in the batch.

Once we have the mean, the next step is to calculate the standard deviation of the
hidden activations:

σ = √( (1/m) ∑ (h_i − μ)² )


Now that we have the mean and the standard deviation, we normalize the hidden
activations using these values. For this, we subtract the mean from each input and divide
the result by the sum of the standard deviation and the smoothing term (ε):

h_i(norm) = (h_i − μ) / (σ + ε)

The smoothing term (ε) ensures numerical stability within the operation by preventing
division by zero.

Rescaling and Offsetting

In the final operation, the rescaling and offsetting of the input take place. Here two
components of the BN algorithm come into the picture, γ (gamma) and β (beta). These
parameters are used for rescaling (γ) and shifting (β) the vector containing values from
the previous operation:

h_i = γ · h_i(norm) + β

These two are learnable parameters; during training, the neural network finds the
optimal values of γ and β. This enables accurate normalization of each batch.
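The two steps, normalization followed by rescaling and offsetting, can be sketched for a list of hidden activations as below. This is an illustrative sketch that follows the description above by dividing by (standard deviation + ε); note that many library implementations instead divide by the square root of (variance + ε). The function name, the default values of `gamma`, `beta` and `eps`, and the sample activations are assumptions.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a list of hidden activations, then rescale (gamma) and shift (beta)."""
    m = len(batch)
    mean = sum(batch) / m
    std = math.sqrt(sum((h - mean) ** 2 for h in batch) / m)
    normalized = [(h - mean) / (std + eps) for h in batch]
    return [gamma * h + beta for h in normalized]


activations = [2.0, 4.0, 6.0, 8.0]
out = batch_norm(activations)                          # mean ~0, std ~1
shifted = batch_norm(activations, gamma=2.0, beta=3.0)  # mean ~3, std ~2
```

In a real network γ and β would be updated by gradient descent along with the weights, rather than passed in as fixed numbers.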
Advantages of Batch Normalization

Now let’s look into the advantages the BN process offers.

Speed Up the Training

By normalizing the hidden layer activations, batch normalization speeds up the
training process.
Handles internal covariate shift

It addresses the problem of internal covariate shift. Through this, we ensure that the
input for every layer is distributed around the same mean and standard deviation. If you
are unaware of what an internal covariate shift is, look at the following example.
Internal covariate shift


Suppose we are training an image classification model that classifies images into Dog
or Not Dog. Let’s say we have images of white dogs only; these images will have a certain
distribution. Using these images, the model will update its parameters.

Later, if we get a new set of images consisting of non-white dogs, these new images will
have a slightly different distribution from the previous images. Now the model will change
its parameters according to these new images. Hence the distribution of the hidden
activations will also change. This change in hidden activations is known as an internal
covariate shift.

However, according to a study by MIT researchers, batch normalization does not solve
the problem of internal covariate shift.


Advantages of Batch Normalization:
a) The model is less sensitive to hyperparameter tuning.
b) It shrinks internal covariate shift.
c) It diminishes the reliance of gradients on the scale of the parameters or their underlying
values.
d) Dropout can be reduced or removed, because batch normalization itself has a
regularizing effect.


9. What is the use of regularization in machine learning? Explain the difference between
L1 and L2 regularization.
Regularization

• From the above figure, we can see that once we try to cover every minute feature of
the input data, there can be irregularities in the extracted features, which can
introduce noise in the output. This is referred to as "Overfitting".
• The opposite can happen when too few features are extracted, as some of the
important details might be missed. This affects the accuracy of the outputs produced
and is referred to as "Underfitting".
• This also shows that the complexity of processing the input elements increases
with overfitting. Also, because neural networks are a complex interconnection of
nodes, the issue of overfitting may arise frequently.
• To reduce this, regularization is used, in which a slight modification to the design
of the neural network yields better outcomes.

Regularization in Machine Learning

• One of the most important factors that affect a machine learning model is
overfitting.
• The machine learning model may perform poorly if it tries to capture even the noise
present in the training dataset, which ultimately results in overfitting. In this
context, noise doesn't mean ambiguous or false data, but those inputs which do not
carry the features required by the machine learning model.
• Analyzing these data inputs may make the model more flexible, but the risk of
overfitting will also increase accordingly.
• One of the ways to avoid this is to cross-validate on the training dataset and decide
accordingly which parameters to include to increase the efficiency and
performance of the model.
Let this be the simple relation for linear regression:
Y ≈ β0 + β1X1 + β2X2 + … + βpXp
Y = Learned relation
β (beta) = Coefficient estimates for the different variables and/or predictors (X)
• Now, we introduce a loss function for the fitting procedure, which is referred to as
the "Residual Sum of Squares" or RSS.
• The coefficients are chosen in such a way that they minimize this loss function.

RSS = ∑_{i=1}^{n} (Y_i − β0 − ∑_{j=1}^{p} β_j X_ij)²
• If noise is present in the training dataset, the estimated coefficients will not generalize
well to future datasets. This is where regularization comes into the picture: it shrinks
these estimated coefficients towards zero.
One of the methods to implement this is ridge regression, also known as L2
regularization. Let's have a quick overview of it.

How does Regularization help reduce Overfitting?


Let’s consider a neural network which is overfitting on the training data as shown in the
image below.

If you have studied the concept of regularization in machine learning, you will have a fair
idea that regularization penalizes the coefficients. In deep learning, it actually penalizes the
weight matrices of the nodes.

Assume that our regularization coefficient is so high that some of the weight matrices are
nearly equal to zero.


This will result in a much simpler linear network and slight underfitting of the training
data.

Such a large value of the regularization coefficient is not that useful. We need to optimize
the value of the regularization coefficient in order to obtain a well-fitted model as shown
in the image below.

Ridge Regression (L2 Regularization)


Ridge regression, also known as L2 regularization, is a regularization technique to avoid
overfitting on the training dataset. It introduces a small bias in the trained model, in
exchange for which one gets better long-term predictions.
In this method, a penalty term is added to the cost function. The amount of bias added to
the cost function in this way is also known as the ridge regression penalty. Hence, the
equation for the cost function is:
Cost = RSS + λ ∑_{j=1}^{p} β_j²


In L2 Regularization, the term we add to the cost function is the following:

λ ∑_{l=1}^{L} ||W^{[l]}||²_F   (often scaled by 1/(2m), where m is the number of training examples)

In this case, the regularization term is the squared norm of the weights of each of the
network’s layers. This matrix norm is called the Frobenius norm and, explicitly, it is
computed as follows:

||W^{[l]}||²_F = ∑_{i=1}^{n^{[l]}} ∑_{j=1}^{n^{[l-1]}} (w_{ij}^{[l]})²

Please note that the weight matrix relative to layer l has n^{[l]} rows and n^{[l-1]} columns.
Finally, the complete cost function under L2 Regularization becomes:

J_L2 = J + λ ∑_{l=1}^{L} ||W^{[l]}||²_F

Again, λ is the regularization coefficient and for λ=0 the effects of L2 Regularization are null.
L2 Regularization pushes the values of the weights towards zero, resulting in a simpler
model.
• It regularizes the coefficients of the model: the ridge regression term reduces the
values of the coefficients, which ultimately helps in reducing the complexity of the
machine learning model.
• From the above equation, we can observe that if the value of λ tends to zero, the last
term on the right-hand side will tend to zero, thus making the above equation a
representation of a simple linear regression model.
• Hence, the lower the value of λ, the more the model tends to plain linear regression.
• This model is important for machine learning because ordinary linear regression
models risk failing when there are dependencies between their variables. Hence,
ridge regression is used in such cases.
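The ridge penalty and its effect on a gradient step can be sketched as follows. This is a minimal example under stated assumptions: the penalty is taken as λ times the sum of squared weights with no extra scaling, and the function names, learning rate `lr` and numeric values are hypothetical.

```python
def l2_penalty(weights, lam):
    """Ridge penalty: lam times the sum of squared weights."""
    return lam * sum(w * w for w in weights)


def ridge_cost(rss, weights, lam):
    """Residual sum of squares plus the ridge penalty."""
    return rss + l2_penalty(weights, lam)


def l2_gradient_step(weights, grads, lam, lr=0.1):
    """Gradient step on loss + lam * sum(w**2); the penalty adds 2*lam*w to each gradient."""
    return [w - lr * (g + 2 * lam * w) for w, g in zip(weights, grads)]


print(ridge_cost(10.0, [1.0, -2.0], lam=0.0))   # 10.0: no regularization
print(ridge_cost(10.0, [1.0, -2.0], lam=0.5))   # 12.5: penalty of 0.5 * (1 + 4)
```

Even when the data gradient is zero, `l2_gradient_step` still moves every weight a little towards zero, which is the shrinking effect described above.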
Lasso Regression (L1 Regularization)
• One more technique to reduce overfitting, and thus the complexity of the model,
is lasso regression.
• Lasso stands for Least Absolute Shrinkage and Selection Operator, and the technique
is also known as L1 regularization.
• The equation for lasso regression is almost the same as that of ridge regression,
except that the penalty term uses the absolute values of the weights instead of
their squares.
• The advantage of taking the absolute values is that a coefficient can shrink exactly
to 0, whereas in ridge regression it only shrinks close to 0.
The following equation gives the cost function defined in lasso regression:
Cost = RSS + λ ∑_{j=1}^{p} |β_j|
In L1 Regularization we add the following term to the cost function J:

λ ∑_{l=1}^{L} ||W^{[l]}||_1

where the matrix norm is the sum of the absolute values of the weights for each layer 1, …, L
of the network:

||W^{[l]}||_1 = ∑_i ∑_j |w_{ij}^{[l]}|

λ is the regularization coefficient. It’s a hyperparameter that must be carefully tuned. λ
directly controls the impact of the regularization: as λ increases, the shrinking effect on
the weights becomes more severe.
The complete cost function under L1 Regularization becomes:

J_L1 = J + λ ∑_{l=1}^{L} ||W^{[l]}||_1

For λ=0, the effects of L1 Regularization are null. Choosing a value of λ that is too big,
on the other hand, will over-simplify the model, probably resulting in an underfitting
network.
L1 Regularization can be considered a sort of neuron selection process because it brings
the weights of some hidden neurons to zero.

• Because the cost function uses absolute values, some of the features of the input dataset
can be ignored completely while evaluating the machine learning model. Lasso therefore
performs feature selection and can reduce overfitting to a large extent.


• On the other hand, ridge regression does not ignore any feature in the model and includes
them all in model evaluation. The complexity of the model is reduced by shrinking the
coefficients in the ridge regression model.
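The difference between the two kinds of shrinkage can be illustrated on single weights. The sketch below uses a simplified one-weight problem as an assumption: minimizing 0.5(w − w0)² + t|w| gives the soft-thresholding rule used for the lasso case, and minimizing 0.5(w − w0)² + (λ/2)w² gives the scaling rule used for the ridge case. The function names and numeric values are hypothetical.

```python
def soft_threshold(w, t):
    """Lasso-style shrinkage: move w toward zero by t and clip at exactly zero."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0


def ridge_shrink(w, lam):
    """Ridge-style shrinkage: scale w toward zero without reaching it."""
    return w / (1.0 + lam)


weights = [3.0, -0.2, 0.05, -1.5]
lasso = [soft_threshold(w, 0.5) for w in weights]   # small weights become exactly 0
ridge = [ridge_shrink(w, 0.5) for w in weights]     # all weights shrink, none is 0

print(lasso)
print(ridge)
```

The weights that lasso sets to exactly zero correspond to features that are dropped from the model, while ridge keeps every feature with a smaller coefficient.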

Dropout
Dropout was introduced by Hinton et al. and this method is now very popular. It consists of
setting to zero the output of each hidden neuron in a chosen layer with some probability, and
it has proven to be very effective in reducing overfitting.
To understand dropout, let’s say our neural network structure is akin to the one shown below:

Fig. Dropout regularization.


So what does dropout do? At every iteration, it randomly selects some nodes and removes
them along with all of their incoming and outgoing connections as shown above.
• To achieve dropout regularization, some neurons in the artificial neural network
are randomly disabled. That prevents them from being too dependent on one another
as they learn the correlations. Thus, the neurons work more independently, and the
artificial neural network learns multiple independent correlations in the data based
on different configurations of the neurons.
• It is used to improve the training of neural networks by omitting hidden units at
random. It also speeds up training.
• Dropout works by randomly dropping a neuron so that it does not contribute to
the forward pass or to back-propagation.
• Dropout is an inexpensive but powerful method of regularizing a broad family of
models.
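Dropout on the outputs of one hidden layer can be sketched as below. This example assumes the "inverted dropout" variant, in which the surviving activations are scaled up during training so that nothing needs to change at test time; the function name, the drop probability `p_drop` and the sample activations are hypothetical.

```python
import random


def dropout(activations, p_drop=0.5, training=True, rng=random):
    """Inverted dropout: zero each activation with probability p_drop during training."""
    if not training or p_drop == 0.0:
        return list(activations)
    keep = 1.0 - p_drop
    # Surviving activations are scaled by 1/keep so the expected value is unchanged.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)                      # fixed seed for a repeatable illustration
hidden = [0.2, 0.9, 0.4, 0.7, 0.1, 0.5]
train_out = dropout(hidden, p_drop=0.5, rng=rng)
test_out = dropout(hidden, training=False)  # no units are dropped at test time

print(train_out)
print(test_out)
```

A different random subset of units is dropped on every call during training, which is what forces the neurons to work more independently.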


DropConnect:
DropConnect, known as the generalized version of Dropout, is a method used for
regularizing deep neural networks.

Fig. DropConnect.
DropConnect has been proposed to add more noise to the network. The primary difference
is that instead of randomly dropping the outputs of the neurons, we randomly drop the
connections between neurons.
In other words, the fully connected layer with DropConnect becomes a sparsely connected
layer in which the connections are chosen at random during the training stage.
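The difference from dropout can be made concrete with a small sketch in which individual weights, rather than whole neuron outputs, are dropped. This is an illustration under assumptions: it computes only the weighted sums of one fully connected layer during training, and the function name, `p_drop` and the sample weights are hypothetical.

```python
import random


def dropconnect_layer(inputs, weights, p_drop=0.5, rng=random):
    """Weighted sums of a fully connected layer with individual connections dropped."""
    outputs = []
    for row in weights:                      # one row of weights per output neuron
        total = 0.0
        for w, x in zip(row, inputs):
            if rng.random() >= p_drop:       # keep this connection
                total += w * x
        outputs.append(total)
    return outputs


inputs = [1.0, 2.0]
weights = [[1.0, 1.0], [0.5, -0.5]]
full = dropconnect_layer(inputs, weights, p_drop=0.0)                    # [3.0, -0.5]
sparse = dropconnect_layer(inputs, weights, p_drop=0.5, rng=random.Random(1))

print(full)
print(sparse)
```

With `p_drop=0.0` the layer is fully connected; with a positive `p_drop` each output neuron sees a random subset of its inputs on every training pass.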
Difference between L1 and L2 Regularization:
S.No | L1 Regularization | L2 Regularization
1. | Penalizes the sum of the absolute values of the weights. | Penalizes the sum of the squared weights.
2. | It has a sparse solution. | It has a non-sparse solution.
3. | It gives multiple solutions. | It has only one solution.
4. | Built-in feature selection. | No feature selection.
5. | Robust to outliers. | Not robust to outliers.
6. | It generates simple and interpretable models. | It gives more accurate predictions when the output variable is a function of all the input variables.
7. | Unable to learn complex data patterns. | Able to learn complex data patterns.
8. | Computationally inefficient over non-sparse conditions. | Computationally efficient because it has an analytical solution.


PART-A

1. What is multilayer perceptron?
The Multilayer Perceptron (MLP) model features multiple layers that are
interconnected in such a way that they form a feed-forward neural network. Each
neuron in one layer has directed connections to the neurons of the next layer. It
consists of three types of layers: the input layer, the hidden layer and the output layer.
2. What is vanishing gradient problem?
When back-propagation is used, the earlier layers will receive very small
updates compared to the later layers. This problem is referred to as the vanishing
gradient problem. The vanishing gradient problem is essentially a situation in
which a deep multilayer feed-forward network or a recurrent neural network
(RNN) does not have the ability to propagate useful gradient information from the
output end of the model back to the layers near the input end of the model.
3. List out the advantages of deep learning.
• No need for feature engineering.
• Deep learning gives more accuracy.
• DL solves the problem on an end-to-end basis.
4. What is backpropagation?
Backpropagation is a training method used for a multi-layer neural network. It
is also called the generalized delta rule. It is a gradient descent method which
minimizes the total squared error of the output computed by the net.
5. What are hyperparameters?
Hyperparameters are parameters whose values control the learning process and
determine the values of model parameters that a learning algorithm ends up
learning.
6. Define ReLU.
The Rectified Linear Unit (ReLU) helps to solve the vanishing gradient problem. ReLU
is a non-linear (piecewise linear) function that outputs the input directly if it is
positive and outputs zero otherwise.
7. Define normalization.
Normalization is a data pre-processing tool used to bring the numerical data to
a common scale without distorting its shape.
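One common way to do this is min-max scaling, which maps every value into the range 0 to 1 while keeping the relative spacing of the values. The notes do not name a specific method, so the choice of min-max scaling here is an assumption.

```python
def min_max_scale(values):
    # Rescale a list of numbers to the range [0, 1].
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; map them all to 0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_scale([10, 20, 30]))  # [0.0, 0.5, 1.0]
```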
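8. What is batch normalization?
Batch normalization is a method of adaptive reparameterization, motivated by
the difficulty of training very deep models. In deep networks, the weights of
every layer are updated during training, so the output of a layer will no longer
be on the same scale as its input. Batch normalization rescales the inputs of
each layer over the current mini-batch to restore a common scale.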
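A minimal sketch of the training-time computation for a single feature, assuming a mini-batch given as a plain list of numbers. The names gamma, beta and eps are the usual learnable scale, learnable shift and numerical-stability constant; their default values here are assumptions.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    # Normalize one feature over a mini-batch to zero mean and unit variance,
    # then apply the scale gamma and the shift beta.
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]

print(batch_norm([1.0, 2.0, 3.0, 4.0]))
```

At prediction time, implementations normally replace the mini-batch mean and variance with running averages collected during training; that part is omitted here.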
9. List the advantages of the ReLU function.
Advantages of the ReLU function:
a) ReLU is simple to compute and has a predictable gradient for the
backpropagation of the error.
b) It is easy to implement and very fast.
c) It can be used for deep network training.


10. Define Ridge regression.
Ridge regression, also known as L2 regularization, is a regularization technique
used to avoid overfitting the training data set. It introduces a small bias into
the trained model, in exchange for which one gets better long-term predictions
on new inputs.
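The effect of the penalty can be seen in the simplest case, assuming a single feature and no intercept. Minimizing sum((y - w*x)^2) + lam * w^2 gives the closed form w = sum(x*y) / (sum(x*x) + lam); the name lam for the regularization strength is an illustrative choice.

```python
def ridge_weight(xs, ys, lam):
    # Closed-form ridge solution for one feature without an intercept.
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
print(ridge_weight(xs, ys, 0.0))   # 2.0 (ordinary least squares)
print(ridge_weight(xs, ys, 14.0))  # 1.0 (weight shrunk towards zero)
```

A larger lam pulls the weight further towards zero, which is the small bias mentioned above.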
11. What is dropout?
Dropout was introduced by Hinton et al. and is now very popular. It consists of
setting to zero the output of each hidden neuron in a chosen layer with some
probability, and it has proven very effective in reducing overfitting.
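A minimal sketch of dropout applied to one layer's outputs during training. It uses the "inverted dropout" variant, in which the surviving outputs are divided by the keep probability so that their expected value is unchanged; that rescaling is an assumption added here, since the notes only describe the zeroing step.

```python
import random

def dropout(activations, p_drop, rng):
    # Zero each activation with probability p_drop and rescale the survivors.
    if not 0.0 <= p_drop < 1.0:
        raise ValueError("p_drop must be in [0, 1)")
    keep = 1.0 - p_drop
    return [a / keep if rng.random() >= p_drop else 0.0 for a in activations]

rng = random.Random(0)  # fixed seed so the run is repeatable
print(dropout([1.0, 1.0, 1.0, 1.0], 0.5, rng))
```

At prediction time dropout is switched off and the layer's outputs are used unchanged.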
12. List the disadvantages of deep learning.
Disadvantages of deep learning:
• DL needs high-performance hardware.
• DL needs much more time to train.
• It is very difficult to assess its performance in real-world applications.
• The trained models are very hard to interpret.
13. Why do we need hidden layers?
• A network with only two layers (input and output) can only represent the
input with whatever representation already exists in the input data.
• If the data is discontinuous or non-linearly separable, the innate
representation is inconsistent, and the mapping cannot be learned using
two layers (input and output).
• Therefore, hidden layer(s) are used between the input and output layers.
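XOR is the standard example of a mapping that is not linearly separable. The sketch below uses a step activation and hand-picked weights (illustrative assumptions, not learned values) to show that two hidden units are enough to represent it.

```python
def step(z):
    # Threshold activation: fires only when the weighted sum is positive.
    return 1 if z > 0 else 0

def xor_network(x1, x2):
    h1 = step(x1 + x2 - 0.5)    # hidden unit 1 behaves like OR
    h2 = step(x1 + x2 - 1.5)    # hidden unit 2 behaves like AND
    return step(h1 - h2 - 0.5)  # output: OR but not AND

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_network(a, b))
```

No single straight line separates the inputs (0,1) and (1,0) from (0,0) and (1,1), so a network without the hidden units h1 and h2 cannot produce this output.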
14. Why do we use activation functions?
An activation function, also known as a transfer function, maps the input of a
node to its output in a certain fashion. It helps in normalizing the output to
between 0 and 1 or between -1 and 1. The activation function is one of the most
important factors in a neural network: it decides whether or not a neuron will
be activated and its output transferred to the next layer.
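The two output ranges mentioned above correspond to the sigmoid (0 to 1) and tanh (-1 to 1) functions. A short sketch, with arbitrarily chosen sample inputs:

```python
import math

def sigmoid(z):
    # Squashes any real input into the open interval (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    # Squashes any real input into the open interval (-1, 1).
    return math.tanh(z)

for z in (-5.0, 0.0, 5.0):
    print(z, sigmoid(z), tanh(z))
```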
15. Difference between L1 and L2 Regularization:
S.No | L1 Regularization | L2 Regularization
1. Penalizes the sum of the absolute values of the weights. | Penalizes the sum of the squared weights.
2. It has a sparse solution. | It has a non-sparse solution.
3. It gives multiple solutions. | It has only one solution.
4. Built-in feature selection. | No feature selection.
5. Robust to outliers. | Not robust to outliers.
6. It generates simple and interpretable models. | It gives more accurate predictions when the output variable is a function of all the input variables.
7. Unable to learn complex data patterns. | Able to learn complex data patterns.
8. Computationally inefficient over non-sparse conditions. | Computationally efficient because of having analytical solutions.
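The sparse versus non-sparse behaviour can be sketched for a single weight, assuming the simplified objectives 0.5*(w - a)^2 + lam*|w| for L1 and 0.5*(w - a)^2 + 0.5*lam*w^2 for L2, where a is the unregularized weight. Both function names are hypothetical.

```python
def l1_shrink(a, lam):
    # Minimizer of 0.5*(w - a)**2 + lam*abs(w): soft-thresholding.
    # Any weight with |a| <= lam is set exactly to zero.
    if a > lam:
        return a - lam
    if a < -lam:
        return a + lam
    return 0.0

def l2_shrink(a, lam):
    # Minimizer of 0.5*(w - a)**2 + 0.5*lam*w**2: proportional shrinkage.
    # The weight gets smaller but never becomes exactly zero unless a is zero.
    return a / (1.0 + lam)

for a in (0.3, 2.0):
    print(a, l1_shrink(a, 0.5), l2_shrink(a, 0.5))
```

Small weights are removed completely by the L1 penalty, which is why it acts as a feature selector, while the L2 penalty keeps every weight at a reduced size.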


16. Difference between Deep Network and Shallow Network
Deep network | Shallow network
A deep network (DN) contains many hidden layers. | A shallow network contains only one hidden layer.
A deep network can compactly express highly complex functions over the input space. | A shallow network with one hidden layer cannot compactly express complex functions over the input space.
Training a DN is feasible with current algorithms, and local minima are rarely a serious issue in practice. | A shallow network is more difficult to train to the same fit with our current algorithms.
A DN can fit functions better with fewer parameters than a shallow network. | A shallow network needs more parameters to achieve a comparable fit.

