
NEURAL NETWORKS

PERCEPTRON MODEL:

A perceptron, in Machine Learning, is a supervised learning algorithm for binary classifiers. The perceptron model is a single artificial neuron: it takes a set of input values and classifies them into one of two classes. Modeled on a biological neuron in the human brain, the perceptron acts as an artificial neuron that performs a simplified version of what brain cells do. As a linear ML algorithm, the perceptron performs binary (two-class) classification and lets the neuron learn from the information contained in its inputs.

This model uses a hyperplane to separate inputs into the two classes the machine has learned, which is why the perceptron model is a linear classification model.
There are 4 constituents of a perceptron model. They are as follows-
1. Input values
2. Weights and bias
3. Net sum
4. Activation function

There are 2 types of perceptron models-


1. Single Layer Perceptron- The single layer perceptron can only classify linearly separable inputs. This kind of model uses a single hyperplane and classifies the inputs according to the weights learned beforehand.
2. Multi-Layer Perceptron- The multi-layer perceptron is defined by its ability to use several layers of neurons while classifying inputs. It is a more powerful model that allows machines to classify inputs using more than one layer at the same time.

In Machine Learning, the perceptron is considered a single-layer neural network that consists of four main parameters: input values (input nodes), weights and bias, net sum, and an activation function. The perceptron model begins by multiplying all input values by their weights and then adding these products together to create the weighted sum. This weighted sum is then passed through the activation function 'f' to obtain the desired output. This activation function is also known as the step function and is represented by 'f'.
This step function or Activation function plays a vital role in ensuring that
output is mapped between required values (0,1) or (-1,1). It is important to note
that the weight of input is indicative of the strength of a node. Similarly, an
input's bias value gives the ability to shift the activation function curve up or
down.

The perceptron model works in two important steps, as follows:


Step-1
In the first step, multiply each input value by its corresponding weight and then add the products to determine the weighted sum. Mathematically, we can calculate the weighted sum as follows:
∑wi*xi = x1*w1 + x2*w2 + … + xn*wn
Add a special term called the bias 'b' to this weighted sum to improve the model's performance:
∑wi*xi + b
Step-2
In the second step, an activation function is applied to the above-mentioned weighted sum, which gives us an output either in binary form or as a continuous value, as follows:
Y = f(∑wi*xi + b)
The activation function can be a step function or it can also be a signum
function. After adding the activation function, the perceptron model can be used
to complete the binary classification task. The step and the sign functions are
discontinuous at z = 0, so the gradient descent algorithm cannot be
used to optimize the parameters. In order to enable the perceptron model to
automatically learn from the data, Frank Rosenblatt proposed a perceptron
learning algorithm, as shown in Algorithm.

Algorithm: Perceptron Training Algorithm:

Initialize w = 0,b = 0
repeat
Randomly select a sample (xi, yi) from training set
Calculate the output a = sign(wᵀxi + b)
If a ≠ yi:
w′ ← w + η ∙ yi ∙ xi
b′ ← b + η ∙ yi
until you reach the required number of steps

Output: parameters w and b

NOTE: Here, η is the learning rate.
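As an illustration of this training loop, here is a minimal NumPy sketch; the function name train_perceptron, the learning rate, and the number of steps are illustrative assumptions, and the labels yi are assumed to be in {-1, +1}:

import numpy as np

def train_perceptron(X, y, lr=0.1, steps=1000, seed=0):
    # X has shape [n, d]; y contains labels in {-1, +1}
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)                      # initialize w = 0
    b = 0.0                              # initialize b = 0
    for _ in range(steps):               # until the required number of steps
        i = rng.integers(n)              # randomly select a sample (xi, yi)
        a = np.sign(w @ X[i] + b)        # a = sign(w^T xi + b)
        if a != y[i]:                    # misclassified sample: update parameters
            w = w + lr * y[i] * X[i]     # w' = w + eta * yi * xi
            b = b + lr * y[i]            # b' = b + eta * yi
    return w, b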

Advantages of the Perceptron Model over the McCulloch-Pitts Model:

● The MP neuron model only accepts Boolean inputs, whereas the perceptron model can process any real-valued input.
● Inputs are not weighted in the MP neuron model, which makes it less flexible. The perceptron model, on the other hand, assigns a weight to each input provided.

MULTILAYER PERCEPTRON:
It is a neural network where the mapping between inputs and output is non-
linear.

A Multilayer Perceptron has input and output layers, and one or more hidden layers with many neurons stacked together. And while in the perceptron the neuron must use an activation function that imposes a threshold (a step function), neurons in a Multilayer Perceptron can use any arbitrary activation function, such as ReLU or sigmoid.

Using the perceptron model, machines can learn weight coefficients that help
them classify inputs. This linear binary classifier is highly effective in arranging
and categorizing input data into different classes, allowing probability-based
predictions and classifying items into multiple categories. Multilayer
Perceptrons have the advantage of learning non-linear models and the ability to
train models in real-time (online learning).

FULLY CONNECTED LAYER


The non-differentiable nature of the perceptron model's step activation severely constrains its potential, making it only capable of solving extremely simple tasks.
We replace the activation function of the perceptron model and stack multiple
neurons in parallel to achieve a multi-input and multi-output network layer
structure.
As shown in Figure, two neurons are stacked in parallel, that is, two perceptrons
with replaced activation functions, forming a network layer of three input nodes
and two output nodes.
The first output node is:
o1 = σ(w11*x1 + w21*x2 + w31*x3 + b1), where σ denotes the replaced activation function.

The shape of the input matrix X is defined as [b, din], where b is the number of samples and din is the number of input nodes.
The shape of the weight matrix W is defined as [din, dout], where dout is the number of output nodes, and the shape of the bias vector b is [dout].
The output matrix O contains the outputs of the b samples and has shape [b, dout]; it is computed as O = X@W + b.
Since each output node is connected to all input nodes, this network layer is called a fully connected layer, or a dense layer, with W as the weight matrix and b as the bias vector.
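For concreteness, a small NumPy sketch of this matrix form (the batch size and node counts below are illustrative):

import numpy as np

batch, din, dout = 32, 3, 2        # number of samples, input nodes, output nodes
X = np.random.randn(batch, din)    # input matrix,  shape [b, din]
W = np.random.randn(din, dout)     # weight matrix, shape [din, dout]
bias = np.zeros(dout)              # bias vector,   shape [dout]

O = X @ W + bias                   # output matrix, shape [b, dout]
print(O.shape)                     # (32, 2)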
NEURAL NETWORK
 By stacking the fully connected layers in the above figure and ensuring that the number of output nodes of the previous layer matches the number of input nodes of the current layer, a network with any number of layers can be created, which is known as a neural network.
 By stacking four fully connected layers, a neural network with four layers
can be obtained. Since each layer is a fully connected layer, it is called a
fully connected network.
 Among them, the first to third fully connected layers are called hidden
layers, and the output of the last fully connected layer is called the
output layer of the network.
 When designing a fully connected network, hyperparameters such as the configuration of the network can be set fairly freely according to rules of thumb, and only a few constraints need to be followed.
 For example, the number of input nodes in the first hidden layer needs to match the actual feature length of the data, the number of input nodes in each subsequent layer must match the number of output nodes of the previous layer, and the activation function and number of nodes in the output layer need to be set according to the specific requirements of the desired output.
 In general, the design of neural network models has a large degree of freedom.
Layer Model Implementation
For conventional network layers, it is more concise and efficient to implement them through the layer API. First, create the network layer objects and specify the activation function type of each layer:
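A minimal sketch of what such layer objects might look like, assuming TensorFlow's Keras API (which the terms "layer" and "Sequential container" in this text appear to refer to); the layer sizes and input shape below are illustrative:

import tensorflow as tf
from tensorflow.keras import layers

# Create each fully connected layer and specify its activation function
fc1 = layers.Dense(256, activation='relu')   # hidden layer 1, 256 output nodes
fc2 = layers.Dense(128, activation='relu')   # hidden layer 2, 128 output nodes
fc3 = layers.Dense(64, activation='relu')    # hidden layer 3, 64 output nodes
fc4 = layers.Dense(10, activation=None)      # output layer, 10 output nodes

x = tf.random.normal([4, 784])               # a batch of 4 flattened samples
h1 = fc1(x)                                  # data forwards layer by layer
h2 = fc2(h1)
h3 = fc3(h2)
out = fc4(h3)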

For such a network, where data flows forward layer by layer, the layers can also be encapsulated into a single network object through the Sequential container, and that object can then be called once to complete the forward calculation of all layers. It is more convenient to use and is implemented as follows:
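A corresponding sketch, again assuming TensorFlow's Keras Sequential container; the layer sizes remain illustrative:

import tensorflow as tf
from tensorflow.keras import layers, Sequential

# Wrap the layers in a Sequential container so they run in order
model = Sequential([
    layers.Dense(256, activation='relu'),
    layers.Dense(128, activation='relu'),
    layers.Dense(64, activation='relu'),
    layers.Dense(10, activation=None),
])
model.build(input_shape=(None, 784))   # create the weights for 784-dimensional inputs
model.summary()                        # optional: inspect layer shapes and parameters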

In the forward calculation, you only need to call the network object once to complete the sequential calculation of all layers:
out = model(x)

CASE STUDY:
A. You design a fully connected neural network architecture where all
activations are sigmoids. You initialize the weights with large positive
numbers. Is this a good idea? Explain your answer.
B. You are doing full batch gradient descent using the entire training set (not
stochastic gradient descent). Is it necessary to shuffle the training data?
Explain your answer.
C. You would like to train a dog/cat image classifier using mini-batch
gradient descent. You have already split your dataset into train, dev and
test sets. The classes are balanced. You realize that within the training set,
the images are ordered in such a way that all the dog images come first
and all the cat images come after. A friend tells you: “you absolutely need
to shuffle your training set before the training procedure.” Is your friend
right? Explain.

SOLU:
Fully connected neural networks (FCNNs) are a type of artificial neural
network where the architecture is such that all the nodes, or neurons, in one
layer are connected to the neurons in the next layer.
Each individual unit is a neuron (or a perceptron). In fully connected layers, each neuron applies a linear transformation to the input vector through a weights matrix, followed by the activation f:

yj = f(∑k wjk*xk + wj0)

Where:
xk -> components of the input vector
wjk -> weights in the matrix
wj0 -> bias term
Why are fully connected layers required?
We can divide the whole neural network (for classification) into two parts:
 Feature extraction: In the conventional classification algorithms, like
SVMs, we used to extract features from the data to make the
classification work. The convolutional layers are serving the same
purpose of feature extraction. CNNs capture better representation of
data and hence we don’t need to do feature engineering.
 Classification: After feature extraction we need to classify the data into
various classes, this can be done using a fully connected (FC) neural
network. In place of fully connected layers, we can also use a
conventional classifier like SVM. But we generally end up adding FC
layers to make the model end-to-end trainable. The fully connected layers
learn a (possibly non-linear) function of the high-level features given as
output by the convolutional layers.

Visualization:
If we take as an example a layer in an FC neural network with an input size of 9 and an output size of 4, the operation can be visualized as follows:

The activation function f wraps the dot product between the input of the layer
and the weights matrix of that layer.
The input is a 1x9 vector, the weights matrix is a 9x4 matrix. By taking the dot
product and applying the non-linear transformation with the activation function
we get the output vector (1x4).
A fully connected neural network consists of a series of fully connected layers.
A fully connected layer is a function from ℝ^m to ℝ^n. Each output dimension depends on each input dimension. Pictorially, a fully connected layer is represented as in the figure below.

The image above shows why we call these kinds of layers “Fully Connected” or
sometimes “densely connected”.
All possible connections layer to layer are present, meaning every input of the
input vector influences every output of the output vector.
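As a small numerical sketch of this 9-input, 4-output layer (NumPy, with a sigmoid chosen arbitrarily as the activation):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.random.randn(1, 9)     # input: a 1x9 vector
W = np.random.randn(9, 4)     # weights: a 9x4 matrix
b = np.zeros(4)               # bias vector

out = sigmoid(x @ W + b)      # the activation wraps the dot product; output: 1x4
print(out.shape)              # (1, 4)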
A)
Zero initialization causes every neuron to learn essentially the same function in each iteration; random initialization is a better choice because it breaks this symmetry. However, initializing the weights with very large (or very small) values results in slower optimization. With large positive weights, the sigmoid activations saturate close to 1, where the gradient is nearly zero, so this is not a good idea: it leads to vanishing gradients and very slow learning. Weights should be small, but not so small that the gradients vanish toward 0.
B)
No, it is not necessary. In full batch gradient descent, every update is computed from the gradient summed over the entire training set, so the order of the samples does not change any update. Shuffling training data, both before training and between epochs, matters when updates are computed from batches or individual samples (mini-batch or stochastic gradient descent): it makes batches more representative of the entire dataset and keeps gradient updates independent of the sample ordering, and high-quality per-epoch shuffling then gives better model accuracy after a set number of epochs.
C)
Yes, your friend is right. The training set is sorted by class: all dog images come first and all cat images come after. With mini-batch gradient descent on unshuffled data, each mini-batch would contain only one class, so the gradient updates would be biased toward one class at a time and the training process would fail to converge properly. To prevent this kind of problem, a simple solution is to shuffle the training data before training (and ideally between epochs) so that every mini-batch contains a mix of both classes.

ACTIVATION FUNCTION - TYPES:


An activation function is a deceptively small mathematical expression which decides whether a neuron fires or not. This means that the activation function suppresses the neurons whose inputs are of no significance to the overall application of the neural network. This is why neural networks require such functions: they provide a significant improvement in performance.
There are different types of activation functions:
1) Linear Transfer Function
2) Heaviside step function or binary classifier
3) Softmax function
4) Rectified linear unit(ReLU)
5) Leaky ReLU
6) Hyperbolic tangent function (tanh)
7) Sigmoid/logistic function

Linear Transfer Function:

A linear function is also known as a straight-line function, where the activation is proportional to the input (i.e., the weighted sum from the neurons). It is a simple function with the equation:
f(x) = x
Sigmoid/Logistic Function:
The Sigmoid function is also called the logistic function, which is defined as:
Sigmoid(x) = 1/(1 + e^-x)
It squashes all values into the range between 0 and 1, which reduces the effect of extreme values or outliers. It is used to classify data into two classes.

Plot:

 One of its excellent features is the ability to “compress” the input x ∈ R to the interval (0, 1). Values in this interval are commonly used in machine learning to express the following meanings:
 Probability distribution: The interval (0, 1) matches the range of a probability, so the output of the sigmoid function can be translated into a probability.
 Signal strength: Usually, 0~1 can be understood as the strength of a certain signal, such as the colour intensity of a pixel: 1 represents the strongest colour of the current channel, and 0 represents the current channel without colour. It can also be used to represent the current gate status, that is, 1 means open and 0 indicates closed.
 The Sigmoid function is continuously differentiable, as shown in the figure above, so the gradient descent algorithm can be used directly to optimize the network parameters.
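A small NumPy sketch of this squashing behaviour (the sample inputs are arbitrary):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(z))   # approx [0.000045, 0.269, 0.5, 0.731, 0.999955]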

ReLU function:
The ReLU function activates a node only if the input is above zero. The ReLU function is defined as: ReLU(x) = max(0, x)
The function curve is shown in the figure. It can be seen that ReLU suppresses all values less than 0 to 0, while positive values are output directly.

Leaky ReLU:
The derivative of the ReLU function is always 0 when x < 0, which can also cause gradient dispersion (vanishing gradients). To overcome this problem, the LeakyReLU function is proposed:

LeakyReLU(x) = x    if x >= 0
LeakyReLU(x) = p*x  if x < 0

where p is a small value set by the user, such as 0.02.
When p = 0, the LeakyReLU function degenerates into the ReLU function.
When p ≠ 0, a small non-zero derivative is obtained for x < 0.
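A quick NumPy sketch comparing the two functions (p = 0.02 as in the example above):

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, p=0.02):
    return np.where(x >= 0, x, p * x)

z = np.array([-5.0, -1.0, 0.0, 2.0])
print(relu(z))         # [ 0.    0.    0.    2. ]
print(leaky_relu(z))   # [-0.1  -0.02  0.    2. ]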

Tanh function:
The Tanh function can “compress” the input x ∈ R to the interval (−1, 1). It is defined as:
tanh(x) = (e^x − e^-x)/(e^x + e^-x) = 2*Sigmoid(2x) − 1
It can be seen that the Tanh activation function can be realized by scaling and translating the Sigmoid function, as shown in the figure.
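A quick numerical check (NumPy) of this relation between tanh and the sigmoid:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-3, 3, 7)
print(np.allclose(np.tanh(x), 2 * sigmoid(2 * x) - 1))   # True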
Disadvantages of the ReLU Function:
1) Exploding gradient:
This occurs when gradients accumulate and become very large, which causes large differences in the subsequent weight updates. As a result, convergence to the global minimum becomes unstable, and the learning process becomes unstable as well.
2) Dying ReLU:
The problem of "dead neurons" occurs when a neuron gets stuck on the negative side and constantly outputs zero. Because the gradient of 0 is also 0, it is unlikely that the neuron will ever recover. This happens when the learning rate is too high or the negative bias is quite large.

ELU Function:

ELU is an activation function based on ReLU that has an extra alpha constant (α), which defines the smoothness of the function when the inputs are negative.
The ELU output for positive input is the input itself. If the input is negative, the output curve smoothly saturates towards −α: the higher the alpha constant, the more negative the output for negative inputs gets.
Advantages of ELU:
 Tend to converge faster than ReLU (because mean ELU activations are
closer to zero)
 Better generalization performance than ReLU
 Fully continuous
 Fully differentiable
 Does not have the vanishing gradient problem
 Does not have the exploding gradient problem
 Does not have the dead ReLU problem
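A minimal NumPy sketch of ELU (α = 1.0 is an arbitrary illustrative choice):

import numpy as np

def elu(x, alpha=1.0):
    # x for x >= 0; alpha * (exp(x) - 1), which saturates at -alpha, for x < 0
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1.0))

z = np.array([-5.0, -1.0, 0.0, 2.0])
print(elu(z))   # approx [-0.993, -0.632, 0., 2.]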

NOTE: THE NEED FOR A NON-LINEAR MODEL IN LAYERS:


A linear model is one of the simplest models in machine learning. It has only a
few parameters and can only express linear relationships. The perception and
decision-making of complex brains are far more complex than a linear model.
Therefore, the linear model is clearly not enough.
Complexity here refers to a model's ability to approximate complex distributions. A one-layer neural network composed of a small number of neurons, compared with the interconnection structure of roughly 100 billion neurons in the human brain, obviously has much weaker generalization ability.
Since a linear model is not feasible, we can embed a nonlinear function in the
linear model and convert it to a nonlinear model. We call this nonlinear function
the activation function, which is represented by
O = α(Wx + b)
Here α represents a specific nonlinear activation function, such as the Sigmoid
function.
We choose the ReLU function for the α position most of the time, because the ReLU function only retains the positive part of the function y = x and sets the negative part to zero. It has a unilateral suppression characteristic. Although simple, the ReLU function has excellent nonlinear characteristics, an easy gradient calculation, and a stable training process. It is one of the most widely used activation functions in deep learning models. Here we convert the model to a nonlinear model by embedding the ReLU function:
O = ReLU(Wx + b)
Hence, the layers in deep learning models are nonlinear.
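A small NumPy sketch of why this matters: stacking two purely linear layers collapses into a single linear map, while inserting ReLU between them does not (the layer sizes below are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 4))
W1, b1 = rng.standard_normal((4, 8)), rng.standard_normal(8)
W2, b2 = rng.standard_normal((8, 3)), rng.standard_normal(3)

# Two linear layers with no activation are equivalent to one linear layer
linear_stack = (x @ W1 + b1) @ W2 + b2
collapsed = x @ (W1 @ W2) + (b1 @ W2 + b2)
print(np.allclose(linear_stack, collapsed))   # True: still a linear model

# Embedding ReLU between the layers makes the mapping nonlinear
nonlinear = np.maximum(0, x @ W1 + b1) @ W2 + b2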

GRADIENT OF SIGMOID & TANH ACTIVATION FUNCTIONS:


Sigmoid Activation Function :
The Sigmoid function is also called the logistic function, which is defined as:
Sigmoid(x) = 1/(1 + e^-x)
One of its excellent features is the ability to “compress” the input x ∈ R to an
interval x ∈ (0, 1). The value of this interval is commonly used in machine
learning to express the following meanings:
● Probability distribution: The output of the interval (0, 1) matches the distribution range of a probability. The output can be translated into a probability by the Sigmoid function.
● Signal strength: Usually, 0~1 can be understood as the strength of a certain signal, such as the color intensity of a pixel: 1 represents the strongest color of the current channel, and 0 represents the current channel without color. It can also be used to represent the current gate status, that is, 1 means open and 0 indicates closed.
● Let’s take a look at the graph of the sigmoid function.

Okay, so let’s start deriving the sigmoid function!

Gradient for sigmoid: σ(x) ∗ (1 − σ(x))
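As a brief sketch of where this comes from, differentiating the definition σ(x) = 1/(1 + e^-x):

$$\frac{d\sigma}{dx} = \frac{e^{-x}}{\left(1+e^{-x}\right)^{2}} = \frac{1}{1+e^{-x}}\cdot\frac{e^{-x}}{1+e^{-x}} = \sigma(x)\,\bigl(1-\sigma(x)\bigr)$$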


Tanh Activation Function :
The Tanh function can “compress” the input x ∈ R to the interval (−1, 1), defined as:
tanh(x) = (e^x − e^-x)/(e^x + e^-x) = 2*Sigmoid(2x) − 1
 tanh is also like logistic sigmoid but better. The range of the tanh
function is from (-1 to 1). tanh is also sigmoidal (s - shaped).
 The advantage is that the negative inputs will be mapped strongly
negative and the zero inputs will be mapped near zero in the tanh
graph.
 The function is differentiable.
 The function is monotonic while its derivative is not monotonic.
 The tanh function is mainly used for classification between two
classes.

Okay, so let’s start deriving the tanh function!

Gradient for tanh: 1 − tanh²(z)
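A brief sketch of this derivative, using the quotient rule on the definition of tanh:

$$\frac{d}{dz}\tanh(z) = \frac{d}{dz}\,\frac{e^{z}-e^{-z}}{e^{z}+e^{-z}} = \frac{\left(e^{z}+e^{-z}\right)^{2}-\left(e^{z}-e^{-z}\right)^{2}}{\left(e^{z}+e^{-z}\right)^{2}} = 1-\tanh^{2}(z)$$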


CASE STUDY : Assume that before training your neural network the setting is:
(1) The data is zero centered.
(2) All weights are initialized independently with mean 0 and
variance 0.001.
(3) The biases are all initialized to 0.
(4) Learning rate is small and cannot be tuned.

Explain which activation function between tanh and sigmoid is likely to lead to a higher gradient during the first update.

SOLU: We know the gradients of the sigmoid and tanh activation functions, which are:
● Gradient for sigmoid: σ(z) ∗ (1 − σ(z))
● Gradient for tanh: 1 − tanh²(z)

Since the data is zero centered, the weights are initialized with mean 0, and the biases are 0, the expected value of z during initialization is 0. We just have to substitute z = 0 into the respective gradient expressions:

Derivative of σ w.r.t. z evaluated at zero = 0.5 * 0.5 = 0.25
Derivative of tanh w.r.t. z evaluated at zero = 1
Hence, tanh has the higher gradient magnitude during the first update.
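A quick numerical check of these two values at z = 0 (NumPy):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = 0.0
print(sigmoid(z) * (1 - sigmoid(z)))   # 0.25
print(1 - np.tanh(z) ** 2)             # 1.0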
