Unit V

This document covers neural networks, focusing on perceptrons and multilayer perceptrons, including their structure, activation functions, and training methods such as gradient descent and backpropagation. It highlights the limitations of single-layer perceptrons in handling non-linearly separable problems and introduces various activation functions that enhance network performance. Additionally, it discusses optimization techniques like stochastic gradient descent and the importance of hidden layers in improving representational power.

Uploaded by

P SANTHIYA
Copyright
© © All Rights Reserved

UNIT V

Chapter: 10: Neural Networks


Syllabus

Perceptron - Multilayer perceptron, activation functions, network training - gradient descent
optimization - stochastic gradient descent, error backpropagation, from shallow networks to deep
networks - Unit saturation (aka the vanishing gradient problem) - ReLU, hyperparameter tuning,
batch normalization, regularization, dropout.

Perceptron

• The perceptron is a feed-forward network with one output neuron that learns a separating hyperplane
in a pattern space.

• The "n" linear Fx neurons feed forward to one threshold output Fy neuron. The perceptron
separates a linearly separable set of patterns.

Single Layer Perceptron

• The perceptron is a feed-forward network with one output neuron that learns a separating hyperplane
in a pattern space. The "n" linear Fx neurons feed forward to one threshold output Fy neuron.
The perceptron separates a linearly separable set of patterns.

• SLP is the simplest type of artificial neural network and can only classify linearly separable cases
with a binary target (1, 0).

• We can connect any number of McCulloch-Pitts neurons together in any way we like. An
arrangement of one input layer of McCulloch-Pitts neurons feeding forward to one output layer of
McCulloch-Pitts neurons is known as a Perceptron.

• A single layer feed-forward network consists of one or more output neurons, each of which is
connected with a weighting factor Wij to all of the inputs Xi.

• The Perceptron is a kind of single-layer artificial network with only one neuron. The Perceptron is
a network in which the neuron unit calculates the linear combination of its real-valued or boolean
inputs and passes it through a threshold activation function. Fig. 10.1.1 shows Perceptron.

• The Perceptron is sometimes referred to as a Threshold Logic Unit (TLU) since it discriminates the data
depending on whether the sum is greater than the threshold value.

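• In the simplest case the network has only two inputs and a single output. The output of the neuron
is:

$y = f\left(\sum_{i=1}^{2} W_i X_i + b\right)$

• Suppose that the activation function is a threshold; then, with s denoting the weighted sum,

$f(s) = \begin{cases} 1 & \text{if } s > 0 \\ -1 & \text{if } s \le 0 \end{cases}$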

• The Perceptron can represent most of the primitive boolean functions: AND, OR, NAND and NOR
but cannot represent XOR.

• In single layer perceptron, initial weight values are assigned randomly because it does not have
previous knowledge. It sums all the weighted inputs. If the sum is greater than the threshold value
then it is activated i.e. output = 1.

Output

W1X1 + W2X2 +...+ WnXn > 0 ⇒ 1

W1X1 + W2X2 +...+ WnXn ≤ 0 ⇒ 0

• The input values are presented to the perceptron, and if the predicted output is the same as the
desired output, then the performance is considered satisfactory and no changes to the weights are
made.

• If the output does not match the desired output, then the weights need to be changed to reduce
the error.

• The weight adjustment is done as follows:

$\Delta W = \eta \times d \times x$

Where

x = Input data

d = Difference between the desired output and the predicted output

η = Learning rate

• If the output of the perceptron is correct then we do not take any action. If the output is incorrect
then the weight vector is updated as W → W + ∆W.

• The process of weight adaptation is called learning.

• Perceptron Learning Algorithm:

1. Select random sample from training set as input.

2. If classification is correct, do nothing.

3. If classification is incorrect, modify the weight vector W using

$W_i = W_i + \eta \, d(n) \, X_i(n)$

Repeat this procedure until the entire training set is classified correctly.
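
The learning rule above can be sketched in Python as follows. This is only an illustrative sketch, not the notes' own implementation: the names `predict` and `train_perceptron`, the learning rate 0.5, the zero initial weights (the notes use random values), the fixed visiting order of the samples, and the epoch limit are all assumptions made to keep the example short and repeatable. The bias is trained like a weight whose input is always 1.

```python
import numpy as np

def predict(w, b, x):
    # Threshold unit: output 1 when the weighted sum is positive, else 0.
    return 1 if np.dot(w, x) + b > 0 else 0

def train_perceptron(X, targets, eta=0.5, max_epochs=100):
    # Assumption: weights start at zero so the run is repeatable.
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for x, target in zip(X, targets):
            d = target - predict(w, b, x)   # desired minus predicted output
            if d != 0:
                w += eta * d * x            # W_i <- W_i + eta * d * X_i
                b += eta * d                # bias input is always 1
                errors += 1
        if errors == 0:                     # whole training set is correct
            break
    return w, b

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

# AND is linearly separable, so training stops with every pattern correct.
w_and, b_and = train_perceptron(X, np.array([0, 0, 0, 1]))
and_outputs = [predict(w_and, b_and, x) for x in X]

# XOR is not linearly separable, so some pattern is always misclassified.
w_xor, b_xor = train_perceptron(X, np.array([0, 1, 1, 0]))
xor_outputs = [predict(w_xor, b_xor, x) for x in X]

print("AND:", and_outputs)
print("XOR:", xor_outputs)
```

For AND the loop ends early once an epoch passes without errors; for XOR it simply runs until the epoch limit, which reflects the limitation of the single layer perceptron stated earlier.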

Multilayer Perceptron

• A multi-layer perceptron (MLP) has the same structure of a single layer perceptron with one or
more hidden layers. An MLP is a network of simple neurons called perceptrons.

• A typical multilayer perceptron network consists of a set of source nodes forming the input layer,
one or more hidden layers of computation nodes, and an output layer of nodes.

• It is not possible to find weights which enable single layer perceptrons to deal with non-linearly
separable problems like XOR: See Fig. 10.1.2.

Limitation of Learning in Perceptron: linear separability

• Consider two-input patterns (X1, X2) being classified into two classes as shown in Fig. 10.1.3. Each
point, marked with either the symbol x or o, represents a pattern with a set of values (X1, X2).

• Each pattern is classified into one of two classes. Notice that these classes can be separated with a
single line L. They are known as linearly separable patterns.

• Linear separability refers to the fact that classes of patterns with n-dimensional vector x = (x1, x2, …,
xn) can be separated with a single decision surface. In the case above, the line L represents the
decision surface.

• If two classes of patterns can be separated by a decision boundary represented by the linear
equation $b + \sum_{i=1}^{n} x_i w_i = 0$, then they are said to be linearly separable. The simple network can correctly classify any
such patterns.

• The decision boundary (i.e., W and b, or the threshold θ) of linearly separable classes can be determined either by some
learning procedure or by solving linear equation systems based on representative patterns of each
class.

• If such a decision boundary does not exist, then the two classes are said to be linearly inseparable.

• Linearly inseparable problems cannot be solved by the simple network; a more sophisticated
architecture is needed.

• Examples of linearly separable classes

1. Logical AND function

2. Logical OR function

• Examples of linearly inseparable classes

1. Logical XOR (exclusive OR) function


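No line can separate these two classes, as can be seen from the fact that the following linear
inequality system has no solution. Assuming bipolar inputs and outputs (-1 and +1) and a unit that outputs +1 when $b + w_1 x_1 + w_2 x_2 \ge 0$, the four XOR patterns require:

(1) $b - w_1 - w_2 < 0$ for input (-1, -1)

(2) $b - w_1 + w_2 \ge 0$ for input (-1, +1)

(3) $b + w_1 - w_2 \ge 0$ for input (+1, -1)

(4) $b + w_1 + w_2 < 0$ for input (+1, +1)

The system has no solution because we have b < 0 from (1) + (4), and b >= 0 from (2) + (3), which is a contradiction.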

Activation Functions

• An activation function, also known as a transfer function, is used to map input nodes to output nodes in
a certain fashion.

• The activation function is one of the most important factors in a neural network: it decides whether or
not a neuron will be activated and its output transferred to the next layer.

• Activation functions help in normalizing the output to a range such as 0 to 1 or -1 to 1. Their
differentiability helps in the process of back propagation. During back propagation, the weights get
updated using the gradient of the loss function, and a differentiable activation function lets gradient descent move towards a local minimum.

• In effect, the activation function decides whether the information a neuron receives
is relevant or irrelevant.

• Activation functions give a multilayer network greater representational power
than a single layer network only when they introduce non-linearity.

• The input to the activation function is the sum, which is defined by the following equation.

$\text{Sum} = I_1 W_1 + I_2 W_2 + \dots + I_n W_n + b = \sum_{j=1}^{n} I_j W_j + b$

• Activation Function: Logistic Functions


• The logistic function monotonically increases from a lower limit (0 or -1) to an upper limit (+1) as the sum
increases. In the common form, values vary between 0 and 1, with a value of 0.5 when the sum is zero.

• Activation Function: Arc Tangent

• Activation Function: Hyperbolic Tangent

Identity or Linear Activation Function

• A linear activation is a mathematical equation used for obtaining output vectors with specific
properties.

• It is a simple straight line activation function where our function is directly proportional to
weighted sum of neurons or input.

• Linear activation functions are better at giving a wide range of activations, and a line of positive slope
may increase the firing rate as the input rate increases.

• Fig. 10.2.4 shows identity function.


• The equation for linear activation function is :

f(x) = a.x

When a = 1 then f(x) = x and this is a special case known as identity.

• Properties:

1. Range is - infinity to + infinity

2. Provides a convex error surface so optimisation can be achieved faster.

3. df(x)/dx = a, which is constant, so the gradient carries no information about the input.

• Limitations:

1. Since the derivative is constant, the gradient has no relation with the input.

2. During back propagation the gradient is the same constant for every input, so the weight change does not depend on the change in x.

3. A stack of layers with linear activations collapses into a single linear layer, so a purely linear network gains nothing from depth and is of little use in practice.
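
The practical limitation of a linear activation can be checked numerically: two layers with f(x) = a.x behave exactly like one layer with a combined weight matrix. The sketch below assumes NumPy; the layer sizes, the random seed, and the variable names are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # seed chosen only for reproducibility
W1 = rng.normal(size=(3, 4))     # first layer: 4 inputs -> 3 units
W2 = rng.normal(size=(2, 3))     # second layer: 3 units -> 2 outputs
a = 1.0                          # slope of the linear activation f(x) = a*x
x = rng.normal(size=4)

two_layer = W2 @ (a * (W1 @ x))      # linear activation on the hidden layer
one_layer = (a * (W2 @ W1)) @ x      # a single equivalent weight matrix

print(np.allclose(two_layer, one_layer))
```

Because the two results agree, the hidden layer adds no representational power unless a non-linear activation is used.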

Sigmoid

• A sigmoid function produces a curve with an "S" shape. The example sigmoid function given
below is a special case of the logistic function, which models the growth of some set.

$\text{sig}(t) = \frac{1}{1 + e^{-t}}$

• In general, a sigmoid function is real-valued and differentiable, having a non-negative or
non-positive first derivative and exactly one inflection point.

• The logistic sigmoid function is related to the hyperbolic tangent as follows:

$1 - 2\,\text{sig}(x) = 1 - \frac{2}{1 + e^{-x}} = -\tanh\frac{x}{2}$


• Sigmoid functions are often used in artificial neural networks to introduce nonlinearity in the
model.

• A neural network element computes a linear combination of its input signals, and applies a sigmoid
function to the result.

• A reason for its popularity in neural networks is because the sigmoid function satisfies a property
between the derivative and itself such that it is computationally easy to perform.

d/dt sig (t) = sig(t) (1 - sig (t))

• Derivatives of the sigmoid function are usually employed in learning algorithms.
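
The two sigmoid identities used in this section can be verified numerically. The following is a small sketch assuming NumPy; the function names `sig` and `sig_prime`, the sample points, and the step size `h` are illustrative choices.

```python
import numpy as np

def sig(t):
    # Logistic sigmoid: 1 / (1 + e^(-t))
    return 1.0 / (1.0 + np.exp(-t))

def sig_prime(t):
    # Derivative written in terms of the function itself.
    s = sig(t)
    return s * (1.0 - s)

t = np.linspace(-5.0, 5.0, 11)
h = 1e-6
numeric = (sig(t + h) - sig(t - h)) / (2 * h)   # central difference

derivative_ok = np.allclose(numeric, sig_prime(t), atol=1e-8)
tanh_ok = np.allclose(1 - 2 * sig(t), -np.tanh(t / 2))
print(derivative_ok, tanh_ok)
```

The derivative needs only the already computed value sig(t), which is why it is cheap to evaluate during learning.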

Gradient Descent Optimization

• Gradient descent is an optimization algorithm in machine learning used to minimize a function by
iteratively moving towards the minimum value of the function.

• We essentially use this algorithm when we have to find the smallest possible values of
a given cost function. In machine learning, more often than not we try to minimize loss
functions (like Mean Squared Error). By minimizing the loss function, we can improve our model,
and gradient descent is one of the most popular algorithms used for this purpose.

• The graph above shows how a gradient descent algorithm works.

• We first take a point on the cost function and begin moving in steps in the direction of the
minimum point. The size of that step, or how quickly we converge to the minimum point, is
defined by the learning rate. We can cover more area with a higher learning rate, but at the risk of
overshooting the minimum. On the other hand, small steps (smaller learning rates)
will take a lot of time to reach the lowest point.

• The direction in which the algorithm has to move (towards the minimum) is also important. We
calculate this by using derivatives. You need to be familiar with derivatives from calculus. A
derivative is calculated as the slope of the graph at a specific point. We get that
by finding the tangent line to the graph at that point. A steeper tangent
suggests that more steps are needed to reach the minimum point; a less steep tangent suggests
that fewer steps are required.
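
A minimal sketch of the procedure described above, under stated assumptions: the cost function f(w) = (w - 3)^2, the starting point, the learning rate 0.1, and the name `gradient_descent` are all chosen purely for illustration.

```python
def gradient_descent(grad, w0, learning_rate, steps):
    # Repeatedly step against the slope of the cost function.
    w = w0
    for _ in range(steps):
        w = w - learning_rate * grad(w)
    return w

def grad_f(w):
    # Derivative of the example cost f(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

w_min = gradient_descent(grad_f, w0=0.0, learning_rate=0.1, steps=100)
print(w_min)   # very close to the minimum at w = 3
```

Each step multiplies the remaining distance to the minimum by 0.8 in this example, so the steps shrink automatically as the slope flattens.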

Stochastic Gradient Descent

• The word 'stochastic' means a system or a process that is linked with a random probability. Hence,
in Stochastic Gradient Descent, a few samples are selected randomly instead of the whole data set
for each iteration.

• Stochastic Gradient Descent (SGD) is a type of gradient descent that processes one training example per
iteration. Within each training epoch it visits every example in the dataset and updates the model's
parameters after each example, one at a time.

• As it requires only one training example at a time, it is easier to fit in the allocated memory.
However, it loses some computational efficiency in comparison to batch gradient descent, because
its frequent updates add computational overhead.

• Further, due to the frequent updates, the gradient it follows is noisy. However, this noise can sometimes
be helpful in escaping a local minimum and finding the global minimum.

• Advantages of Stochastic gradient descent:

a) It is easier to fit in the allocated memory.

b) Each update is faster to compute than in batch gradient descent.

c) It is more efficient for large datasets.

• Disadvantages of Stochastic Gradient Descent:

a) SGD requires a number of hyper parameters such as the regularization parameter and the number
of iterations.

b) SGD is sensitive to feature scaling.
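
As an illustration of updating after every single example, the sketch below fits a line with SGD. It assumes NumPy and a toy dataset generated from y = 2x + 1; the learning rate, the number of epochs, the seed, and the variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)     # seed chosen only for reproducibility
x = np.linspace(-1.0, 1.0, 20)
y = 2.0 * x + 1.0                  # toy targets from a known line

w, b, eta = 0.0, 0.0, 0.1
for epoch in range(200):
    for i in rng.permutation(len(x)):      # visit samples in random order
        error = (w * x[i] + b) - y[i]      # prediction error on ONE example
        w -= eta * error * x[i]            # gradient of 0.5*error^2 w.r.t. w
        b -= eta * error                   # gradient of 0.5*error^2 w.r.t. b

print(w, b)   # approaches the slope 2 and intercept 1 used to build the data
```

Only one example is held in the update at any time, which is the memory advantage listed above; the random visiting order is what makes the method stochastic.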

Error Backpropagation

• Backpropagation is a training method used for a multi-layer neural network. It is also called the
generalized delta rule. It is a gradient descent method which minimizes the total squared error of the
output computed by the net.

• The backpropagation algorithm looks for the minimum value of the error function in weight space
using a technique called the delta rule or gradient descent. The weights that minimize the error
function are then considered to be a solution to the learning problem.

• Backpropagation is a systematic method for training multiple layer ANN. It is a generalization of the
Widrow-Hoff error correction rule. About 80 % of ANN applications are said to use backpropagation.

• Fig. 10.4.1 (See on next page) shows backpropagation network.

• Consider a simple neuron:

a. Neuron has a summing junction and activation function.

b. Any non-linear function which is differentiable everywhere and increases everywhere with the sum can
be used as activation function.

c. Examples: Logistic function, Arc tangent function, Hyperbolic tangent activation function.

• These activation functions give the multilayer network greater representational power
than a single layer network only when non-linearity is introduced.

• Need of hidden layers:

1. A network with only two layers (input and output) can only represent the input with whatever
representation already exists in the input data.

2. If the data is discontinuous or non-linearly separable, the innate representation is inconsistent,
and the mapping cannot be learned using two layers (Input and Output).

3. Therefore, hidden layer(s) are used between input and output layers.

• Weights connect a unit (neuron) in one layer only to those in the next higher layer. The output of
the unit is scaled by the value of the connecting weight, and it is fed forward to provide a portion of
the activation for the units in the next higher layer.

• Backpropagation can be applied to an artificial neural network with any number of hidden layers.
The training objective is to adjust the weights so that the application of a set of inputs produces the
desired outputs.

• Training procedure: The network is usually trained with a large number of input-output pairs.

1. Initialize the weights to small random values (both positive and negative) to ensure that the
network is not saturated by large values of weights.

2. Choose a training pair from the training set.

3. Apply the input vector to network input.

4. Calculate the network output.

5. Calculate the error, the difference between the network output and the desired output.

6. Adjust the weights of the network in a way that minimizes this error.

7. Repeat steps 2 - 6 for each pair of input-output in the training set until the error for the entire
system is acceptably low.

Forward pass and backward pass:

• Backpropagation neural network training involves two passes.

1. In the forward pass, the input signals move forward from the network input to the output.

2. In the backward pass, the calculated error signals propagate backward through the network,
where they are used to adjust the weights.

3. In the forward pass, the calculation of the output is carried out, layer by layer, in the forward
direction. The output of one layer is the input to the next layer.

• In the reverse pass,

a. The weights of the output neuron layer are adjusted first since the target value of each output
neuron is available to guide the adjustment of the associated weights, using the delta rule.

b. Next, we adjust the weights of the middle layers. As the middle layer neurons have no target
values, it makes the problem complex.
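
The forward and backward passes can be sketched for a small network with one hidden layer. This is a minimal illustration under assumptions, not a definitive implementation: it assumes NumPy, sigmoid units in both layers, a squared-error loss, a 2-3-1 layout, and hypothetical names such as `forward`, `backward`, and `loss`. The finite-difference check at the end is only there to confirm that the backward pass computes the gradient of the error.

```python
import numpy as np

def sig(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, x):
    W1, b1, W2, b2 = params
    h = sig(W1 @ x + b1)          # hidden layer output
    y = sig(W2 @ h + b2)          # network output
    return h, y

def loss(params, x, target):
    _, y = forward(params, x)
    return 0.5 * np.sum((y - target) ** 2)   # squared error

def backward(params, x, target):
    W1, b1, W2, b2 = params
    h, y = forward(params, x)                        # forward pass
    delta_out = (y - target) * y * (1.0 - y)         # output error signal
    delta_hid = (W2.T @ delta_out) * h * (1.0 - h)   # error sent backward
    # Gradients in the same order as params: W1, b1, W2, b2
    return [np.outer(delta_hid, x), delta_hid,
            np.outer(delta_out, h), delta_out]

rng = np.random.default_rng(0)                       # small random weights
params = [rng.normal(scale=0.5, size=(3, 2)), np.zeros(3),
          rng.normal(scale=0.5, size=(1, 3)), np.zeros(1)]
x, target = np.array([1.0, 0.0]), np.array([1.0])

grads = backward(params, x, target)

# Compare one analytic gradient entry with a finite difference.
eps = 1e-6
W1_plus = params[0].copy()
W1_plus[0, 0] += eps
W1_minus = params[0].copy()
W1_minus[0, 0] -= eps
numeric = (loss([W1_plus] + params[1:], x, target)
           - loss([W1_minus] + params[1:], x, target)) / (2 * eps)
print(numeric, grads[0][0, 0])

# One gradient descent step on all weights (eta is an assumed value).
eta = 0.5
new_params = [p - eta * g for p, g in zip(params, grads)]
```

The output-layer error signal uses the target directly, while the hidden-layer signal is obtained by passing the output error back through the weights, which is how the missing hidden targets are handled.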

• Selection of number of hidden units: The number of hidden units depends on the number of input
units.

1. Never choose h to be more than twice the number of input units.

2. You can load p patterns of I elements into log2 p hidden units.

3. Ensure that we must have at least 1/e times as many training examples.

4. Feature extraction requires fewer hidden units than inputs.

5. Learning many examples of disjointed inputs requires more hidden units than inputs.

6. The number of hidden units required for a classification task increases with the number of classes
in the task. Large networks require longer training times.

Factors influencing Backpropagation training

• The training time can be reduced by using:

1. Bias: Networks with biases can represent relationships between inputs and outputs more easily
than networks without biases. Adding a bias to each neuron is usually desirable to offset the origin of
the activation function. The weight of the bias is trainable similar to weight except that the input is
always +1.

2. Momentum: The use of momentum enhances the stability of the training process. Momentum is
used to keep the training process going in the same general direction analogous to the way that
momentum of a moving object behaves. In backpropagation with momentum, the weight change is a
combination of the current gradient and the previous gradient.
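
One common way to write the momentum update is to add a fraction of the previous weight change to the current gradient step. The sketch below uses that form on a one-dimensional example; the cost f(w) = (w - 3)^2, the learning rate `eta`, the momentum factor `mu`, and the function name are assumptions for illustration.

```python
def momentum_descent(grad, w0, eta=0.05, mu=0.9, steps=300):
    w, delta_w = w0, 0.0
    for _ in range(steps):
        # New change = gradient step + a fraction of the previous change.
        delta_w = -eta * grad(w) + mu * delta_w
        w = w + delta_w
    return w

def grad_f(w):
    # Derivative of the example cost f(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

w_final = momentum_descent(grad_f, w0=0.0)
print(w_final)   # settles near the minimum at w = 3
```

Because the previous change is carried forward, successive steps in the same direction reinforce each other, which is the behaviour compared above to a moving object.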

Advantages and Disadvantages

Advantages of backpropagation:

1. It is simple, fast and easy to program.

2. It has no parameters to tune apart from the number of inputs.

3. No need to have prior knowledge about the network.

4. It is flexible.

5. A standard approach and works efficiently.

6. It does not require the user to learn special functions.

Disadvantages of backpropagation:

1. Backpropagation can be sensitive to noisy data and irregularities.

2. Its performance is highly reliant on the input data.

3. Needs excessive time for training.

4. It needs a matrix-based method for backpropagation instead of mini-batch.

Shallow Networks

• The terms shallow and deep refer to the number of layers in a neural network; a shallow neural
network has a small number of layers, usually regarded as having a
single hidden layer, and a deep neural network has multiple hidden
layers. Each type of network performs certain tasks better than the other, and selecting the right
network depth is important for creating a successful model.

• In a shallow neural network, the values of the feature vector of the data to be classified (the input
layer) are passed to a hidden layer of nodes (neurons) each of which generates a response according
to some activation function, g, acting on the weighted sum of those values, z.

• The responses of the units in the hidden layer are then passed to a final output layer (which may
consist of a single unit), whose activation produces the classification prediction output.

Deep Network

• Deep learning is a new area of machine learning research, which has been introduced with the
objective of moving machine learning closer to one of its original goals. Deep learning is about
learning multiple levels of representation and abstraction that help to make sense of data such as
images, sound, and text.

• 'Deep learning' means using a neural network with several layers of nodes between input and
output. It is generally better than other methods on image, speech and certain other types of data
because the series of layers between input and output do feature identification and processing in a
series of stages, just as our brains seem to.

• Deep Learning emphasizes the network architecture of today's most successful machine learning
approaches. These methods are based on "deep" multi-layer neural networks with many hidden layers.

TensorFlow

• TensorFlow is one of the most popular frameworks used to build deep learning models. The
framework is developed by Google Brain Team.

• Languages like C++, R and Python are supported by the framework to create the models as well as
the libraries. This framework can be accessed from both - desktop and mobile.

• Google Translate is a well-known example of an application of TensorFlow. Such models are created by
combining functionalities such as text classification, natural language processing, speech or handwriting
recognition, image recognition, etc.

• The framework has its own visualization toolkit, named TensorBoard which helps in powerful data
visualization of the network along with its performance.

• One more tool added in TensorFlow, TensorFlow Serving, can be used for quick and easy
deployment of the newly developed algorithms without introducing any change in the existing API or
architecture.

• The TensorFlow framework comes with detailed documentation that helps users adopt it
quickly and easily, making it one of the most preferred frameworks for modelling deep
learning algorithms.

• Some of the characteristics of TensorFlow are:

• Multiple GPU supported

• One can visualize graphs and queues easily using TensorBoard.

• Powerful documentation and larger support from community

Keras

• If you are comfortable programming in Python, then learning Keras will not prove hard.
It is the most recommended framework for creating deep learning models for those with a
sound knowledge of Python.

• Keras is built purely on Python and can run on top of TensorFlow. Due to its complexity and use
of low level libraries, TensorFlow can be comparatively harder for new users to adopt
than Keras. Beginners in deep learning who find models difficult to
understand in TensorFlow generally prefer Keras, as it lets complex models be built quickly.

• Keras has been developed keeping in mind the complexities of deep learning models, and
hence it helps in getting results in minimum time. Convolutional as well as Recurrent
Neural networks are supported in Keras. The framework can run easily on CPU and GPU.

• The models in Keras can be classified into 2 categories:

1. Sequential model:

The layers in the deep learning model are defined in a sequential manner. Hence the implementation
of the layers in this model will also be done sequentially.

2. Keras functional API:

Deep learning models that have multiple outputs or shared layers, i.e. more complex models, can
be implemented with the Keras functional API.

Difference between Deep Network and Shallow Network

Vanishing Gradient Problem

• The vanishing gradient problem is a problem that users face when training neural networks
with gradient-based methods like backpropagation. This problem makes it difficult to learn and
tune the parameters of the earlier layers in the network.

• The vanishing gradient problem is essentially a situation in which a deep multilayer feed-forward
network or a Recurrent Neural Network (RNN) does not have the ability to propagate useful gradient
information from the output end of the model back to the layers near the input end of the model.

• It results in models with many layers being rendered unable to learn on a specific dataset. It could
even cause models with many layers to prematurely converge to a substandard solution.

• When the backpropagation algorithm advances downwards or backward going from the output
layer to the input layer, the gradients tend to shrink, becoming smaller and smaller till they approach
zero. This ends up leaving the weights of the initial or lower layers practically unchanged. In this
situation, the gradient descent does not ever end up converging to the optimum.

• Vanishing gradient does not necessarily imply that the gradient vector is all zero. It implies that the
gradients are minuscule, which would cause the learning to be very slow.
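
The shrinking of gradients with depth can be illustrated with a chain of single sigmoid units, where every layer multiplies the gradient by its weight times the sigmoid derivative (at most 0.25). The sketch assumes NumPy; the chain structure, the unit weights, the input value, and the name `chain_gradient` are simplifying assumptions and not a model of any particular network.

```python
import numpy as np

def sig(t):
    return 1.0 / (1.0 + np.exp(-t))

def chain_gradient(x, weights):
    # d(output)/d(input) for a chain of single sigmoid units:
    # each layer multiplies the gradient by w * sig'(z).
    grad, a = 1.0, x
    for w in weights:
        z = w * a
        a = sig(z)
        grad *= w * a * (1.0 - a)
    return grad

shallow = chain_gradient(0.5, [1.0] * 2)    # 2 layers
deep = chain_gradient(0.5, [1.0] * 20)      # 20 layers
print(shallow, deep)
```

With unit weights the gradient after 20 layers cannot exceed 0.25 to the power 20, so it is tiny but not exactly zero, matching the description above.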

• The most important solution to the vanishing gradient problem is a specific type of neural network
called Long Short-Term Memory Networks (LSTMs).

• Indication of vanishing gradient problem:

a) The parameters of the higher layers change to a great extent, while the parameters of lower layers
barely change.

b) The model weights could become 0 during training.

c) The model learns at a particularly slow pace and the training could stagnate at a very early phase
after only a few iterations.

• Some methods that are proposed to overcome the vanishing gradient problem:

a) Residual neural networks (ResNets)

b) Multi-level hierarchy

c) Long short term memory (LSTM)

d) Faster hardware

e) ReLU

f) Batch normalization

ReLU

• The Rectified Linear Unit (ReLU) helps to mitigate the vanishing gradient problem. ReLU is a non-linear,
piecewise linear function that will output the input directly if it is positive; otherwise, it will output
zero.

• It is the most commonly used activation function in neural networks, especially in Convolutional
Neural Networks (CNNs) and multilayer perceptrons.

• Mathematically, it is expressed as

f(x) = max (0, x)

where x : input to neuron

• Fig. 10.8.1 shows ReLU function


• The derivative of an activation function is required when updating the weights during the
backpropagation of the error. The slope of ReLU is 1 for positive values and 0 for negative values. It
is non-differentiable when the input x is zero, but the derivative there can be safely assumed to be zero, which
causes no problem in practice.

• ReLU is used in the hidden layers instead of Sigmoid or tanh. The ReLU function solves the problem
of computational complexity of the Logistic Sigmoid and Tanh functions.

• A ReLU activation unit is known to be less likely to create a vanishing gradient problem because its
derivative is always 1 for positive values of the argument.
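
ReLU and the slope used for it during backpropagation can be written in a few lines. The sketch assumes NumPy; the function names and the sample inputs are illustrative, and the slope at x = 0 is taken as 0 as described above.

```python
import numpy as np

def relu(x):
    # f(x) = max(0, x), applied element-wise
    return np.maximum(0.0, x)

def relu_grad(x):
    # Slope is 1 for positive inputs and 0 otherwise (0 is used at x = 0).
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))
print(relu_grad(x))
```

Unlike the sigmoid derivative, the slope does not shrink for large positive inputs, which is why the gradient is less likely to vanish.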

• Advantages of ReLU function

a) ReLU is simple to compute and has a predictable gradient for the backpropagation of the error.

b) Easy to implement and very fast.

c) The calculation speed is very fast. The ReLU function has only a linear relationship.

d) It can be used for deep network training.

• Disadvantages of ReLU function

a) When the input is negative, both the output and the gradient of ReLU are zero, so a neuron whose input stays
negative stops learning, i.e. the ReLU "dies". This problem is also known as the Dead Neurons problem.

b) ReLU function is generally used only within hidden layers of a Neural Network Model.

LReLU and EReLU

1. LReLU

• The Leaky ReLU is one of the most well-known activation functions. It is the same as ReLU for
positive numbers. But instead of being 0 for all negative values, it has a constant slope (less than 1).

• Leaky ReLU is a type of activation function that helps to prevent the function from becoming
saturated at 0. For negative inputs it has a small slope instead of the zero slope of the standard ReLU.

• Leaky ReLUs are one attempt to fix the "dying ReLU" problem. Fig. 10.8.2 shows LReLU function.

• The leak helps to increase the range of the ReLU function. Usually, the value of the slope a is 0.01 or so.

• The motivation for using LReLU instead of ReLU is that constant zero gradients can also result in
slow learning, as when a saturated neuron uses a sigmoid activation function.

2. EReLU

• An Elastic ReLU (EReLU) considers a slope randomly drawn from a uniform distribution during the
training for the positive inputs to control the amount of non-linearity.

• The EReLU is defined as: EReLU(x) = max(Rx, 0) in the output range of [0, ∞), where R is a random
number.

• At the test time, the EReLU becomes the identity function for positive inputs.
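
Both variants can be sketched as follows. The Leaky ReLU part follows the description directly with slope a = 0.01. For EReLU the notes only say that R is drawn from a uniform distribution, so the interval [1 - alpha, 1 + alpha], the value alpha = 0.2, drawing a single R per call, and the function names are all assumptions made for this illustration.

```python
import numpy as np

def leaky_relu(x, a=0.01):
    # Same as ReLU for positive inputs, small slope a for negative inputs.
    return np.where(x > 0, x, a * x)

def erelu(x, training, rng=None, alpha=0.2):
    # Assumption: R ~ Uniform(1 - alpha, 1 + alpha), one draw per call.
    if training:
        r = rng.uniform(1.0 - alpha, 1.0 + alpha)
    else:
        r = 1.0          # at test time positive inputs pass unchanged
    return np.maximum(r * x, 0.0)

x = np.array([-2.0, 3.0])
rng = np.random.default_rng(0)     # seed chosen only for reproducibility

leaky_out = leaky_relu(x)
train_out = erelu(x, training=True, rng=rng)
test_out = erelu(x, training=False)
print(leaky_out, train_out, test_out)
```

During training the positive input is scaled by the random slope, while negative inputs still map to zero; at test time the function reduces to the ordinary ReLU.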

Hyperparameter Tuning

• Hyperparameters are parameters whose values control the learning process and determine the
values of model parameters that a learning algorithm ends up learning.

• While designing a machine learning model, one always has multiple choices for the architectural
design for the model. This creates a confusion on which design to choose for the model based on its
optimality. And due to this, there are always trials for defining a perfect machine learning model.

• The parameters that are used to define these machine learning models are known as the
hyperparameters and the rigorous search for these parameters to build an optimized model is known
as hyperparameter tuning.

• Hyperparameters are not model parameters, which can be directly trained from data. Model
parameters usually specify the way to transform the input into the required output, whereas
hyperparameters define the actual structure of the model that gives the required data.

Layer Size

• Layer size is defined by the number of neurons in a given layer. Input and output layers are
relatively easy to figure out because they correspond directly to how our modeling problem handles
input and output.

• For the input layer, this will match up to the number of features in the input vector. For the output
layer, this will either be a single output neuron or a number of neurons matching the number of
classes we are trying to predict.

• A neural network with 3 layers will often give better performance than one with 2 layers, while
increasing the depth beyond 3 frequently helps less in fully connected networks. In the case of CNN, an increasing
number of layers tends to make the model better.

Magnitude: Learning Rate

• The amount that the weights are updated during training is referred to as the step size or the
learning rate. Specifically, the learning rate is a configurable hyper-parameter used in the training of
neural networks that has a small positive value, often in the range between 0.0 and 1.0.

• For example, if learning rate is 0.1, then the weights in the network are updated by 0.1 * (estimated
weight error), or 10% of the estimated weight error, each time the weights are updated. The
learning rate hyper-parameter controls the rate or speed at which the model learns.

• Learning rates are tricky because they end up being specific to the dataset and even to other
hyper-parameters. This creates a lot of overhead for finding the right setting for hyper-parameters.

• Large learning rates make the model learn faster, but at the same time they may cause us to miss the
minimum of the loss function and only reach its surroundings. In cases where the learning rate is too
large, the optimizer overshoots the minimum and the loss updates lead to divergent behaviour.

• On the other hand, choosing lower learning rate values gives a better chance of finding the local
minima with the trade-off of needing larger number of epochs and more time.
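
The effect of the learning rate can be seen on a very simple loss. The sketch below runs gradient descent on f(w) = w^2 with three learning rates; the loss, the three values, the number of steps, and the name `run` are assumptions chosen to make the three behaviours visible.

```python
def run(learning_rate, steps=50, w0=1.0):
    # Gradient descent on f(w) = w^2, whose derivative is 2w.
    w = w0
    for _ in range(steps):
        w = w - learning_rate * 2.0 * w
    return w

w_small = run(0.01)   # slow: still far from the minimum after 50 steps
w_good = run(0.4)     # converges quickly towards the minimum at w = 0
w_large = run(1.1)    # overshoots on every step and diverges
print(w_small, w_good, w_large)
```

For this loss each step multiplies w by (1 - 2 * learning_rate), so any rate above 1 makes the magnitude grow instead of shrink.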

• Momentum can accelerate learning on those problems where the high-dimensional weight space
that is being navigated by the optimization process has structures that mislead the gradient descent
algorithm, such as flat regions or steep curvature.
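
As a rough illustration of the points above, the following minimal Python sketch runs gradient descent on two toy loss functions. The loss functions, starting point and parameter values are assumptions chosen for illustration and do not come from these notes. The momentum update uses the classical "velocity" form, which is only one common formulation.

```python
def gradient_descent(grad_fn, w0, learning_rate, momentum=0.0, steps=20):
    """Run gradient descent on one weight; momentum=0.0 gives plain gradient descent."""
    w, velocity = w0, 0.0
    for _ in range(steps):
        # Velocity is a decaying sum of past steps (classical momentum).
        velocity = momentum * velocity - learning_rate * grad_fn(w)
        w = w + velocity
    return w


def steep_grad(w):
    """Gradient of the toy loss f(w) = w**2, whose minimum is at w = 0."""
    return 2.0 * w


def flat_grad(w):
    """Gradient of the much flatter toy loss f(w) = 0.05 * w**2."""
    return 0.1 * w


# Effect of the learning rate on the steep loss, starting from w = 1.0.
small_lr = gradient_descent(steep_grad, 1.0, learning_rate=0.01)   # slow progress
medium_lr = gradient_descent(steep_grad, 1.0, learning_rate=0.1)   # steady convergence
large_lr = gradient_descent(steep_grad, 1.0, learning_rate=1.1)    # overshoots and diverges

# Effect of momentum on the flat loss, same learning rate and step count.
plain = gradient_descent(flat_grad, 1.0, learning_rate=0.1, steps=100)
with_momentum = gradient_descent(flat_grad, 1.0, learning_rate=0.1, momentum=0.9, steps=100)

print(small_lr, medium_lr, large_lr)
print(plain, with_momentum)
```

With these assumed values, the small learning rate is still far from the minimum after 20 steps, the large one moves further away on every step, and momentum ends much closer to the minimum of the flat loss than plain gradient descent does in the same number of steps.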

Normalization

• Normalization is a data preparation technique that is frequently used in machine learning. The
process of transforming the columns in a dataset to the same scale is referred to as normalization.
Not every dataset needs to be normalized for machine learning.

• Normalization makes the features more consistent with each other, which allows the model to
predict outputs more accurately. The main goal of normalization is to make the data homogenous
over all records and fields.

• Normalization refers to rescaling real-valued numeric attributes into a 0 to 1 range. Data
normalization is used in machine learning to make model training less sensitive to the scale of
features.

• Normalization is important in such algorithms as k-NN, support vector machines, neural networks,
and principal components. The type of feature preprocessing and normalization that's needed can
depend on the data.
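
The rescaling into a 0 to 1 range described above is commonly done with min-max scaling. The sketch below is a minimal illustration; the function name and the sample column are hypothetical, and returning zeros for a constant column is an assumption made here to avoid dividing by zero.

```python
def min_max_normalize(column):
    """Rescale a list of numbers linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Constant column: every value is identical, so map everything to 0.0.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


ages = [18, 30, 42, 66]            # hypothetical feature column
scaled = min_max_normalize(ages)
print(scaled)
```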

Batch Normalization

• It is a method of adaptive reparameterization, motivated by the difficulty of training very deep
models. In deep networks, the weights are updated for each layer, so the output will no longer be on
the same scale as the input.

• When we input data to a machine learning or deep learning algorithm, we tend to rescale the values
to a balanced scale so that the model can generalize appropriately.

• Batch normalization is a technique for standardizing the inputs to layers in a neural network. Batch
normalization was designed to address the problem of internal covariate shift, which arises as a
consequence of updating multiple-layer inputs simultaneously in deep neural networks.

• Batch normalization is applied to individual layers, or optionally, to all of them: In each training
iteration, we first normalize the inputs by subtracting their mean and dividing by their standard
deviation, where both are estimated based on the statistics of the current mini-batch.

• Next, we apply a scale coefficient and an offset to recover the lost degrees of freedom. It is
precisely due to this normalization based on batch statistics that batch normalization derives its
name.

• We take the output a[i-1] from the preceding layer, multiply it by the weights W and add the bias b
of the current layer. The variable i denotes the current layer.

$$Z^{[i]} = W^{[i]} a^{[i-1]} + b^{[i]}$$


• Next, we usually apply the non-linear activation function that results in the output a[i] of the
current layer. When applying batch norm, we correct our data before feeding it to the activation
function.

• To apply batch norm, calculate the mean as well as the variance of the current z over the m
examples in the mini-batch.

$$\mu = \frac{1}{m}\sum_{j=1}^{m} Z_j$$

• When calculating the variance, we add a small constant to the variance to prevent potential
divisions by zero.

$$\sigma^2 = \frac{1}{m}\sum_{j=1}^{m} (Z_j - \mu)^2 + \epsilon$$

• To normalize the data, we subtract the mean and divide the expression by the standard deviation.

$$\hat{Z}^{[i]} = \frac{Z^{[i]} - \mu}{\sqrt{\sigma^2}}$$

• This operation scales the inputs to have a mean of 0 and a standard deviation of 1.

• Advantages of Batch Normalisation:

a) The model is less sensitive to hyperparameter tuning.

b) Reduces internal covariate shift.

c) Diminishes the reliance of gradients on the scale of the parameters or their initial values.

d) Dropout can often be reduced or removed, because batch normalization also has a regularizing effect.
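
The batch normalization steps described in this section (mean, variance with a small constant, normalization, then scale and offset) can be sketched for a single unit as follows. This is a simplified illustration in plain Python, not a library implementation: the names gamma, beta and eps are conventional but assumed here, the input values are hypothetical, and only the training-time computation on one mini-batch is shown.

```python
import math


def batch_norm(z_batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one unit's pre-activations over a mini-batch, then scale and shift."""
    m = len(z_batch)
    mu = sum(z_batch) / m                                  # mini-batch mean
    var = sum((z - mu) ** 2 for z in z_batch) / m + eps    # variance plus small constant
    z_hat = [(z - mu) / math.sqrt(var) for z in z_batch]   # mean 0, std close to 1
    return [gamma * zh + beta for zh in z_hat]             # scale coefficient and offset


z = [2.0, 4.0, 6.0, 8.0]          # hypothetical pre-activations for a batch of 4
normalized = batch_norm(z)
mean_out = sum(normalized) / len(normalized)
std_out = math.sqrt(sum((v - mean_out) ** 2 for v in normalized) / len(normalized))
print(normalized, mean_out, std_out)
```

In a real network, gamma and beta would be learned along with the weights; they are fixed here only to keep the sketch short.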

Regularization

• From the above figure, we can see that once we try to cover every minute feature of the input
data, irregularities can appear in the extracted features, which can introduce noise in the output.
This is referred to as "Overfitting".

• Problems also arise when too few features are extracted, as some of the important details might be
missed. This affects the accuracy of the outputs produced and is referred to as "Underfitting".

• This also shows that the complexity for processing the input elements increases with overfitting.
Also, neural networks being a complex interconnection of nodes, the issue of overfitting may arise
frequently.

• To reduce this, regularization is used: we make a slight modification to the design of the neural
network in order to get better outcomes.

Regularization in Machine Learning

• One of the most important factors that affect the machine learning model is overfitting.

• The machine learning model may perform poorly if it tries to capture even the noise present in the
training dataset, which ultimately results in overfitting. In this context, noise does not mean
ambiguous or false data, but inputs that do not carry the features the model needs.

• Analyzing these data inputs may surely make the model flexible, but the risk of overfitting will also
increase accordingly.

• One of the ways to avoid this is to cross validate the training dataset, and decide accordingly the
parameters to include that can increase the efficiency and performance of the model.

• Let this be the simple relation for linear regression:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p$$

Where

Y = Learned relation

β = Coefficient estimates for the different variables and/or predictors (X)

• Now, we shall introduce a loss function, that implements the fitting procedure, which is referred to
as "Residual Sum of Squares" or RSS.

• The coefficients are chosen so that they minimize this loss function.

Hence,

$$\mathrm{RSS} = \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^2$$

• The above equation adjusts the coefficients based on the training dataset.

• If noise is present in the training dataset, the adjusted coefficients will not generalize well
when future datasets are introduced. At this point, regularization comes into the picture and shrinks
these coefficients towards zero.

• One of the methods to implement this is ridge regression, also known as L2 regularization. Let's
have a quick overview of it.
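
Before turning to ridge regression, the residual sum of squares defined above can be computed directly. The sketch below is a minimal illustration; the function name, the data and the coefficient values are all hypothetical.

```python
def rss(X, y, beta0, betas):
    """Residual sum of squares for a linear model y ~ beta0 + sum_j betas[j] * x[j]."""
    total = 0.0
    for x_row, y_i in zip(X, y):
        prediction = beta0 + sum(b * x for b, x in zip(betas, x_row))
        total += (y_i - prediction) ** 2
    return total


X = [[1.0], [2.0], [3.0]]      # three observations, one predictor each
y = [2.0, 4.0, 7.0]
loss = rss(X, y, beta0=0.0, betas=[2.0])   # residuals are 0, 0 and 1
print(loss)
```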

Ridge Regression (L2 Regularization)

• Ridge regression, also known as L2 regularization, is a regularization technique for avoiding
overfitting to the training dataset. It introduces a small bias into the trained model, in exchange
for which one can get better long-term predictions.

• In this method, a penalty term is added to the cost function. The amount of bias added to the cost
function in this way is known as the ridge regression penalty.

• Hence, the equation for the cost function, after introducing the ridge regression penalty is as
follows:

$$\text{Cost} = \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$

Here, λ is multiplied by the sum of the squared weights of the individual input features. This
term is the ridge regression penalty.

• It regularizes the coefficients of the model: the ridge regression term reduces the values of the
coefficients, which ultimately helps in reducing the complexity of the machine learning model.

• From the above equation, we can observe that if the value of λ tends to zero, the last term on the
right hand side will tend to zero, thus making the above equation a representation of a simple linear
regression model.

• Hence, the lower the value of λ, the closer the model is to plain linear regression.

• This model matters in machine learning, including neural networks, because generalized linear
regression models risk failing when there are dependencies between their variables. Ridge regression
is used in such cases.
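
The ridge cost function, and the way it reduces to the ordinary residual sum of squares as λ goes to zero, can be sketched as follows. The function name and data are hypothetical, and leaving the intercept out of the penalty is an assumption that follows a common convention.

```python
def ridge_cost(X, y, beta0, betas, lam):
    """Residual sum of squares plus lam times the sum of squared coefficients."""
    residual_sum = 0.0
    for x_row, y_i in zip(X, y):
        prediction = beta0 + sum(b * x for b, x in zip(betas, x_row))
        residual_sum += (y_i - prediction) ** 2
    penalty = lam * sum(b * b for b in betas)   # intercept beta0 is not penalized here
    return residual_sum + penalty


X = [[1.0], [2.0], [3.0]]
y = [2.0, 4.0, 7.0]
no_penalty = ridge_cost(X, y, beta0=0.0, betas=[2.0], lam=0.0)    # same as plain RSS
with_penalty = ridge_cost(X, y, beta0=0.0, betas=[2.0], lam=0.5)  # adds 0.5 * 2.0**2
print(no_penalty, with_penalty)
```

Because the penalty grows with the squared size of the coefficients, minimizing this cost favours smaller coefficients than minimizing the residual sum of squares alone.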

Lasso Regression (L1 Regularization)

• One more technique to reduce overfitting, and thus the complexity of the model, is lasso
regression.

• Lasso stands for Least Absolute Shrinkage and Selection Operator. Lasso regression is also known as
L1 regularization.

• The equation for lasso regression is almost the same as that of ridge regression, except that the
penalty term uses the absolute values of the weights instead of their squares.

• The advantage of taking absolute values is that a coefficient (slope) can shrink exactly to 0,
whereas in ridge regression it only shrinks close to 0.

• The following equation gives the cost function defined in the Lasso regression:

$$\text{Cost} = \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$$

• Because the penalty uses absolute values, some of the features of the input dataset can be ignored
completely while evaluating the machine learning model. Lasso therefore performs feature selection
and can reduce overfitting to a large extent.

• On the other hand, ridge regression does not ignore any feature in the model and includes them all
for model evaluation. In ridge regression, the complexity of the model is reduced by shrinking the
coefficients.
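
The difference between the two penalties is easiest to see in a simplified one-coefficient setting. Assume a single coefficient whose unpenalized least-squares estimate is b, so the cost is (b - β)² plus the penalty. Under this assumption, the ridge penalty λβ² gives β = b / (1 + λ), while the lasso penalty λ|β| gives a "soft-thresholded" value that is exactly zero once |b| is at most λ/2. The function names below are hypothetical.

```python
def ridge_shrink(b, lam):
    """Minimizer of (b - beta)**2 + lam * beta**2: scaled down, never exactly zero for b != 0."""
    return b / (1.0 + lam)


def lasso_shrink(b, lam):
    """Minimizer of (b - beta)**2 + lam * abs(beta): soft-thresholding at lam / 2."""
    magnitude = max(abs(b) - lam / 2.0, 0.0)
    return magnitude if b >= 0 else -magnitude


weak_feature = 0.3      # hypothetical small unpenalized coefficient
strong_feature = 2.0    # hypothetical large unpenalized coefficient

print(ridge_shrink(weak_feature, 1.0), lasso_shrink(weak_feature, 1.0))
print(ridge_shrink(strong_feature, 1.0), lasso_shrink(strong_feature, 1.0))
```

In this sketch, the weak feature is removed entirely by the lasso penalty but only halved by the ridge penalty, which matches the feature-selection behaviour described above.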

Dropout

• Dropout was introduced by Hinton et al. and is now very popular. It consists of setting the output
of each hidden neuron in a chosen layer to zero with some probability, and has proven very effective
in reducing overfitting.

• Fig. 10.11.2 shows dropout regularization.


• To achieve dropout regularization, some neurons in the artificial neural network are randomly
disabled. That prevents them from being too dependent on one another as they learn the
correlations. Thus, the neurons work more independently, and the artificial neural network learns
multiple independent correlations in the data based on different configurations of the neurons.

• It is used to improve the training of neural networks by omitting hidden units at random. It also
speeds up training.

• Dropout is driven by randomly dropping a neuron so that it will not contribute to the forward pass
and back-propagation.

• Dropout is an inexpensive but powerful method of regularizing a broad family of models.
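
A minimal sketch of dropout applied to the outputs of one layer is given below. The function name and values are hypothetical. Rescaling the surviving outputs by 1 / (1 - drop_prob), often called inverted dropout, is an assumption added here so that the expected output stays the same; it is not described in these notes.

```python
import random


def dropout(activations, drop_prob, rng, training=True):
    """Zero each activation with probability drop_prob during training."""
    if not training or drop_prob == 0.0:
        return list(activations)          # no units are dropped at inference time
    keep_prob = 1.0 - drop_prob
    return [
        a / keep_prob if rng.random() < keep_prob else 0.0   # keep and rescale, or drop
        for a in activations
    ]


rng = random.Random(0)                     # seeded so the run is repeatable
hidden = [1.0] * 1000                      # hypothetical hidden-layer outputs
dropped = dropout(hidden, drop_prob=0.5, rng=rng)
num_zeroed = sum(1 for v in dropped if v == 0.0)
print(num_zeroed, "of", len(hidden), "units were dropped")
```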

DropConnect

• DropConnect, known as the generalized version of Dropout, is a method used for regularizing deep
neural networks. Fig. 10.11.3 shows DropConnect.

• DropConnect has been proposed to add more noise to the network. The primary difference is that
instead of randomly dropping the output of the neurons, we randomly drop the connection between
neurons.

• In other words, the fully connected layer with DropConnect becomes a sparsely connected layer in
which the connections are chosen at random during the training stage.
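
The difference from dropout can be sketched by masking individual weights rather than neuron outputs. The following is a simplified illustration of a fully connected layer during training; the function name and numbers are hypothetical, and the inference-time treatment of DropConnect is not covered.

```python
import random


def dropconnect_layer(inputs, weights, biases, drop_prob, rng):
    """Fully connected layer in which each connection is dropped with probability drop_prob."""
    keep_prob = 1.0 - drop_prob
    outputs = []
    for weight_row, bias in zip(weights, biases):
        total = bias
        for w, x in zip(weight_row, inputs):
            if rng.random() < keep_prob:      # this single connection survives
                total += w * x
        outputs.append(total)
    return outputs


rng = random.Random(0)
inputs = [1.0, 1.0]
weights = [[1.0, 2.0], [3.0, 4.0]]            # one row of weights per output neuron
biases = [0.5, -0.5]

dense = dropconnect_layer(inputs, weights, biases, drop_prob=0.0, rng=rng)
sparse = dropconnect_layer(inputs, weights, biases, drop_prob=0.5, rng=rng)
print(dense, sparse)
```

With drop_prob set to 0.0 the layer behaves as an ordinary dense layer; with a larger value, a random subset of connections is removed on each call.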

Difference between L1 and L2 Regularization


Two Marks Questions with Answers

Q.1 Explain multilayer perceptron.

Ans.: The Multilayer Perceptron (MLP) model features multiple layers that are interconnected in such
a way that they form a feed-forward neural network. Each neuron in one layer has directed
connections to the neurons of the subsequent layer. It consists of three types of layers: the input
layer, hidden layer and output layer.

Q.2 What is the vanishing gradient problem?

Ans.: When back-propagation is used, the earlier layers will receive very small updates compared to
the later layers. This problem is referred to as the vanishing gradient problem. The vanishing gradient
problem is essentially a situation in which a deep multilayer feed-forward network or a recurrent
neural network (RNN) does not have the ability to propagate useful gradient information from the
output end of the model back to the layers near the input end of the model.
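
A rough numerical illustration: the derivative of the sigmoid function is at most 0.25, so a gradient passed back through many sigmoid layers is multiplied by a small factor at every layer. The sketch below deliberately ignores the weights and looks only at the product of activation derivatives, which is a simplifying assumption; the function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid; its maximum value is 0.25, reached at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)


def chained_gradient(num_layers, pre_activation=0.0):
    """Product of sigmoid derivatives across num_layers layers (weights ignored)."""
    factor = 1.0
    for _ in range(num_layers):
        factor *= sigmoid_grad(pre_activation)
    return factor


for depth in (1, 5, 10):
    print(depth, chained_gradient(depth))
```

Even in this best case for the sigmoid, the factor shrinks geometrically with depth, which is why the layers near the input receive very small updates.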

Q.3 Explain the advantages of deep learning.

Ans.: Advantages of deep learning:

• No need for feature engineering

• DL solves the problem on an end-to-end basis.

• Deep learning can give higher accuracy.

Q.4 Explain back propagation.

Ans.: Backpropagation is a training method used for a multi-layer neural network. It is also called the
generalized delta rule. It is a gradient descent method which minimizes the total squared error of the
output computed by the net.
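
The simple delta rule, which backpropagation generalizes to multiple layers, can be sketched for a single linear neuron as follows. This is not full backpropagation: there is no hidden layer, and the data, learning rate and epoch count are assumptions chosen for illustration.

```python
def train_linear_neuron(samples, learning_rate=0.1, epochs=200):
    """Fit output = w * x + b by gradient descent on the squared error (delta rule)."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, target in samples:
            output = w * x + b
            error = target - output
            # Move each parameter against the gradient of 0.5 * error**2.
            w += learning_rate * error * x
            b += learning_rate * error
    return w, b


samples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]   # hypothetical data from y = 2x + 1
w, b = train_linear_neuron(samples)
print(w, b)
```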

Q.5 What are hyperparameters?

Ans.: Hyperparameters are parameters whose values control the learning process and determine the
values of model parameters that a learning algorithm ends up learning.

Q.6 Define ReLU.

Ans.: The Rectified Linear Unit (ReLU) helps to mitigate the vanishing gradient problem. ReLU is a
nonlinear, piecewise linear function that outputs the input directly if it is positive and outputs
zero otherwise.
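
This definition translates almost directly into code. In the minimal sketch below, treating the gradient at exactly zero as 0 is a common convention and an assumption made here.

```python
def relu(x):
    """Return x if it is positive, otherwise zero."""
    return x if x > 0 else 0.0


def relu_grad(x):
    """Gradient of ReLU: 1 for positive inputs, 0 otherwise (0 chosen at x == 0)."""
    return 1.0 if x > 0 else 0.0


inputs = [-2.0, 0.0, 3.0]
outputs = [relu(v) for v in inputs]
grads = [relu_grad(v) for v in inputs]
print(outputs, grads)
```

Because the gradient is exactly 1 for every positive input, it does not shrink when passed back through many active ReLU units.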

Q.7 What is the vanishing gradient problem?

Ans.: The vanishing gradient problem is a problem that we face when training neural networks with
gradient-based methods like backpropagation. It makes it difficult to learn and tune the parameters
of the earlier layers in the network.

Q.8 Define normalization.

Ans.: Normalization is a data pre-processing tool used to bring the numerical data to a common scale
without distorting its shape.

Q.9 What is batch normalization?

Ans.: It is a method of adaptive reparameterization, motivated by the difficulty of training very deep
models. In deep networks, the weights are updated for each layer, so the output will no longer be on
the same scale as the input.

Q.10 Explain the advantages of the ReLU function.

Ans.: Advantages of ReLU function:

a) ReLU is simple to compute and has a predictable gradient for the backpropagation of the error.

b) Easy to implement and very fast.

c) It can be used for deep network training

Q.11 Explain Ridge regression.

Ans.: Ridge regression, also known as L2 regularization, is a regularization technique for avoiding
overfitting to the training dataset. It introduces a small bias into the trained model, in exchange
for which one can get better long-term predictions.

Q.12 Explain dropout.

Ans.: Dropout was introduced by Hinton et al. and is now very popular. It consists of setting the
output of each hidden neuron in a chosen layer to zero with some probability, and has proven very
effective in reducing overfitting.

Q.13 Explain the disadvantages of deep learning.

Ans.: Disadvantages of deep learning:

• DL needs high-performance hardware.

• DL needs much more time to train.

• It is very difficult to assess its performance in real-world applications.

• It is very hard to understand.

Q.14 Explain need of hidden layers.

Ans.:

1. A network with only two layers (input and output) can only represent the input with whatever
representation already exists in the input data.

2. If the data is discontinuous or non-linearly separable, the innate representation is inconsistent,
and the mapping cannot be learned using two layers (input and output).

3. Therefore, hidden layer(s) are used between input and output layers.

Q.15 Explain activation functions.

Ans.: An activation function, also known as a transfer function, maps the inputs of a node to its
output in a certain fashion. It helps in normalizing the output to a range such as 0 to 1 or -1 to 1.
The activation function is one of the most important factors in a neural network: it decides whether
or not a neuron will be activated and its output transferred to the next layer.
