1 Neural Networks
PERCEPTRON MODEL:
This model uses a hyperplane (a line in two dimensions) to separate inputs into the
two classes that the machine learns, which implies that the perceptron model is a
linear classification model.
There are 4 constituents of a perceptron model. They are as follows-
1. Input values
2. Weights and bias
3. Net sum
4. Activation function
Initialize w = 0, b = 0
repeat
    Randomly select a sample (xi, yi) from the training set
    Calculate the output a = sign(wTxi + b)
    If a ≠ yi:
        w ← w + η ∙ yi ∙ xi
        b ← b + η ∙ yi
until the required number of steps is reached
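This training loop can be sketched in NumPy as follows; the function name train_perceptron and the toy data are illustrative, not taken from the text:

import numpy as np

def train_perceptron(X, y, lr=1.0, steps=1000, seed=0):
    """Perceptron training loop: y must contain labels in {-1, +1}."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        i = rng.integers(len(X))           # randomly select a sample (xi, yi)
        a = np.sign(w @ X[i] + b)          # a = sign(w^T xi + b)
        if a != y[i]:                      # update only on misclassification
            w += lr * y[i] * X[i]
            b += lr * y[i]
    return w, b

# Toy linearly separable data
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = train_perceptron(X, y)
print(w, b)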
MULTILAYER PERCEPTRON:
It is a neural network in which the mapping between inputs and outputs is non-
linear.
A Multilayer Perceptron has input and output layers, and one or more hidden
layers with many neurons stacked together. While in the Perceptron the neuron
uses an activation function that imposes a hard threshold (a step function),
neurons in a Multilayer Perceptron can use any arbitrary activation function,
such as ReLU or sigmoid.
Using the perceptron model, machines can learn weight coefficients that help
them classify inputs. On its own, the perceptron is a linear binary classifier that
arranges and categorizes input data into two classes; stacked into a Multilayer
Perceptron, it can also produce probability-based predictions and classify items
into multiple categories. Multilayer Perceptrons have the additional advantages of
learning non-linear models and of supporting real-time training (online learning).
The shape of the input matrix X is defined as [b, din], where b is the number of
samples and din is the number of input nodes.
The shape of the weight matrix W is defined as [din, dout], where dout is the number
of output nodes, and the shape of the bias vector b is [dout].
The output matrix O contains the outputs of the b samples and has shape [b, dout];
it is computed as O = X W + b.
Since each output node is connected to all input nodes, this network layer is
called a fully connected layer, or a dense layer, with W as the weight matrix and b
as the bias vector.
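A minimal NumPy sketch of these shapes (the particular sizes b = 4, din = 3, dout = 2 are illustrative):

import numpy as np

b, din, dout = 4, 3, 2              # batch size, input nodes, output nodes
X = np.random.randn(b, din)         # input matrix, shape [b, din]
W = np.random.randn(din, dout)      # weight matrix, shape [din, dout]
bias = np.random.randn(dout)        # bias vector, shape [dout]

O = X @ W + bias                    # output matrix, shape [b, dout]
print(O.shape)                      # (4, 2)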
NEURAL NETWORK
By stacking the fully connected layers in the above figure and ensuring
that the number of output nodes of the previous layer matches the number
of input nodes of the current layer, a network of any number of layers can
be created, which is known as a neural network.
By stacking four fully connected layers, a neural network with four layers
can be obtained. Since each layer is a fully connected layer, it is called a
fully connected network.
Among them, the first to third fully connected layers are called hidden
layers, and the output of the last fully connected layer is called the
output layer of the network.
When designing a fully connected network, hyperparameters such as the
configuration of the network can be set freely according to rules of thumb,
and only a few constraints need to be followed.
For example, the number of input nodes in the first hidden layer needs to
match the actual feature length of the data, the number of input nodes in
each layer must match the number of output nodes of the previous layer, and
the activation function and number of nodes in the output layer need to be
set according to the specific form of the required output.
In general, the design of neural network models allows a great degree of
freedom.
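A minimal NumPy sketch of stacking three hidden layers plus an output layer, showing that the output size of each layer must match the input size of the next (the layer sizes 784-256-128-64-10 are illustrative):

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

x = np.random.randn(2, 784)                      # batch of 2 samples, 784 features each
sizes = [784, 256, 128, 64, 10]                  # input, three hidden layers, output
params = [(np.random.randn(m, n) * 0.01, np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]

h = x
for i, (W, bvec) in enumerate(params):
    h = h @ W + bvec                             # output nodes of layer i = input nodes of layer i+1
    if i < len(params) - 1:                      # hidden layers use ReLU; output is left linear here
        h = relu(h)
print(h.shape)                                   # (2, 10)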
Layer Model Implementation
For the conventional network layer, it is more concise and efficient to
implement it using the layer approach. First, create the network layer objects
and specify the activation function type of each layer:
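The code referred to here is not reproduced in this section; a minimal sketch of what it might look like, assuming the TensorFlow/Keras layers API (layers.Dense), with illustrative layer sizes:

import tensorflow as tf
from tensorflow.keras import layers

# One Dense layer object per network layer, each with its activation function
fc1 = layers.Dense(256, activation=tf.nn.relu)   # hidden layer 1
fc2 = layers.Dense(128, activation=tf.nn.relu)   # hidden layer 2
fc3 = layers.Dense(64,  activation=tf.nn.relu)   # hidden layer 3
fc4 = layers.Dense(10,  activation=None)         # output layer

x = tf.random.normal([4, 784])                   # a batch of 4 samples with 784 features
h1 = fc1(x)                                      # forward through each layer in turn
h2 = fc2(h1)
h3 = fc3(h2)
out = fc4(h3)                                    # shape [4, 10]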
For such a network, where data flows forward through the layers in turn, the layers
can also be encapsulated into a single network object through the Sequential
container, so that the forward computation of all layers is completed with a single
call to that object. It is more convenient to use and is implemented as
follows:
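Again, the original listing is not shown here; a minimal sketch assuming the tf.keras Sequential container, with the same illustrative layer sizes:

from tensorflow.keras import Sequential, layers

model = Sequential([
    layers.Dense(256, activation='relu'),
    layers.Dense(128, activation='relu'),
    layers.Dense(64,  activation='relu'),
    layers.Dense(10),                            # output layer, no activation
])
model.build(input_shape=(None, 784))             # declare the input feature length
model.summary()                                  # print the layer configuration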
In the forward calculation, you only need to call the network object once to
complete the sequential calculation of all layers:
out = model(x)
CASE STUDY:
A. You design a fully connected neural network architecture where all
activations are sigmoids. You initialize the weights with large positive
numbers. Is this a good idea? Explain your answer.
B. You are doing full batch gradient descent using the entire training set (not
stochastic gradient descent). Is it necessary to shuffle the training data?
Explain your answer.
C. You would like to train a dog/cat image classifier using mini-batch
gradient descent. You have already split your dataset into train, dev and
test sets. The classes are balanced. You realize that within the training set,
the images are ordered in such a way that all the dog images come first
and all the cat images come after. A friend tells you: “you absolutely need
to shuffle your training set before the training procedure.” Is your friend
right? Explain.
SOLU:
Fully connected neural networks (FCNNs) are a type of artificial neural
network where the architecture is such that all the nodes, or neurons, in one
layer are connected to the neurons in the next layer.
Each individual function consists of a neuron (or a perceptron). In fully
connected layers, the neuron applies a linear transformation to the input vector
through a weights matrix, followed by a non-linear activation function f:
yj = f( Σi wji ∙ xi + wj0 )
Where:
xi -> components of the input vector
wji -> weights in the matrix
wj0 -> initial bias
Why are fully connected layers required?
We can divide the whole neural network (for classification) into two parts:
Feature extraction: In conventional classification algorithms, like
SVMs, we used to extract features from the data to make the
classification work. The convolutional layers serve the same
purpose of feature extraction. CNNs capture better representations of
the data, and hence we don't need to do feature engineering.
Classification: After feature extraction we need to classify the data into
various classes, and this can be done using a fully connected (FC) neural
network. In place of fully connected layers, we can also use a
conventional classifier like SVM. But we generally end up adding FC
layers to make the model end-to-end trainable. The fully connected layers
learn a (possibly non-linear) function of the high-level features output
by the convolutional layers.
Visualization:
If we take as an example a layer in a FC Neural Network with an input
size of 9 and an output size of 4, the operation can be visualised as follows:
The activation function f wraps the dot product between the input of the layer
and the weights matrix of that layer.
The input is a 1x9 vector, the weights matrix is a 9x4 matrix. By taking the dot
product and applying the non-linear transformation with the activation function
we get the output vector (1x4).
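A minimal NumPy sketch of this 9-to-4 layer; the choice of sigmoid as the activation function f is illustrative:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.random.randn(1, 9)          # input: 1x9 vector
W = np.random.randn(9, 4)          # weights matrix: 9x4
y = sigmoid(x @ W)                 # activation wraps the dot product -> 1x4 output
print(y.shape)                     # (1, 4)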
A fully connected neural network consists of a series of fully connected layers.
A fully connected layer is a function from ℝ^m to ℝ^n. Each output dimension
depends on each input dimension. Pictorially, a fully connected layer is
represented as follows in the figure below.
The image above shows why we call these kinds of layers “Fully Connected” or
sometimes “densely connected”.
All possible connections layer to layer are present, meaning every input of the
input vector influences every output of the output vector.
A)
No, this is not a good idea. Zero initialization causes every neuron to learn almost
the same function in each iteration, so random initialization is a better choice to
break the symmetry. However, initializing the weights with very high or very low
values results in slower optimization: with sigmoid activations, large positive
weights push the neurons into the saturated region of the sigmoid, where the
gradients are close to zero.
Weights should be small, but not too small, as that causes problems like the
vanishing gradient problem (gradients vanish to 0).
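A small NumPy sketch illustrating the saturation point above; the weight scale of 10 and the layer sizes are illustrative assumptions:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.random.randn(100, 20)                     # 100 samples, 20 features
W_large = 10.0 * np.abs(np.random.randn(20, 5))  # large positive weights
a = sigmoid(x @ W_large)
grad = a * (1 - a)                               # sigmoid gradient at the pre-activation
print(a.mean(), grad.mean())                     # activations pile up near 0 or 1, gradients near 0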
B)
With full batch gradient descent, every update uses the gradient computed over the
entire training set, and that sum does not depend on the order of the samples, so
shuffling is not strictly necessary. Shuffling training data, both before training and
between epochs, matters when updates are computed from batches or individual
samples: it ensures that batches are more representative of the entire dataset and
that gradient updates on individual samples are independent of the sample ordering
(within mini-batches or in stochastic gradient descent). The end result of
high-quality per-epoch shuffling is better model accuracy after a set number of
epochs.
C)
Yes, your friend is right. Suppose the data is sorted in a specified order, for
example a dataset sorted by class. If you select data for training, validation, and
test without considering this, each split can end up containing only some of the
classes, and the process will fail. The same issue appears within the training set
here: with mini-batch gradient descent on the unshuffled data, each mini-batch
contains only dog images or only cat images, so every gradient update is biased
towards a single class.
Hence, to prevent this kind of problem, a simple solution is to shuffle the data used
to build the training, validation, and test sets, and in particular to shuffle the
training set before training.
Plot:
ReLU function:
The ReLU function activates a node only if the input is above zero. The ReLU
function is defined as: ReLU(x) = max(0, x)
The function curve is shown in the figure. It can be seen that ReLU suppresses all
values less than 0 to 0; positive values are passed through directly.
Leaky ReLU:
The derivative of the ReLU function is always 0 when x < 0, which may also
cause gradient dispersion (vanishing gradients). To overcome this problem, the
LeakyReLU function was proposed:
LeakyReLU(x) = x for x ≥ 0, and p ∙ x for x < 0, where p is a small slope such as 0.01.
Tanh function:
The Tanh function "compresses" the input x ∈ R into the interval (−1, 1). It is
defined as:
tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))
It can be seen that the Tanh activation function can be obtained by scaling and
translating the Sigmoid function, as shown in the figure.
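A minimal NumPy sketch of the three activation functions just defined (the leak slope p = 0.01 is an illustrative choice):

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, p=0.01):
    return np.where(x >= 0, x, p * x)    # small slope p for negative inputs

def tanh(x):
    return np.tanh(x)                    # equivalently 2*sigmoid(2x) - 1

x = np.linspace(-3, 3, 7)
print(relu(x))
print(leaky_relu(x))
print(tanh(x))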
Disadvantages Of ReLU Function:
1) Exploding Gradient:
This occurs when gradients accumulate, causing large differences in the
subsequent weight updates. As a result, convergence to the global minimum
becomes unstable, and so does the learning itself.
2) Dying ReLU:
The problem of "dead neurons" occurs when a neuron gets stuck on the
negative side and constantly outputs zero. Because the gradient in that region is
also 0, it is unlikely that the neuron will ever recover. This happens when the
learning rate is too high or the negative bias is quite large.
ELU Function:
ELU is an activation function based on ReLU that has an extra alpha constant
(α) that defines the smoothness of the function when inputs are negative:
ELU(x) = x for x > 0, and α(e^x − 1) for x ≤ 0.
The ELU output for a positive input is the input itself. If the input is negative, the
output curve is smoothly saturated towards −α; the higher the alpha constant, the
more negative the output for negative inputs can get.
Advantages of ELU:
Tend to converge faster than ReLU (because mean ELU activations are
closer to zero)
Better generalization performance than ReLU
Fully continuous
Fully differentiable
Does not have a vanishing gradient problem
Does not have an exploding gradient problem
Does not have a dead ReLU problem
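A minimal NumPy sketch of ELU as described above (α = 1.0 as an illustrative default):

import numpy as np

def elu(x, alpha=1.0):
    # x for positive inputs; alpha * (exp(x) - 1) saturates towards -alpha for negative inputs
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.linspace(-5, 2, 8)
print(elu(x))
print(elu(x, alpha=2.0))    # larger alpha -> more negative outputs for negative inputs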
tanh(x) = 2 ∙ Sigmoid(2x) − 1
tanh is also like the logistic sigmoid, but better. The range of the tanh
function is (−1, 1). tanh is also sigmoidal (s-shaped).
The advantage is that the negative inputs will be mapped strongly
negative and the zero inputs will be mapped near zero in the tanh
graph.
The function is differentiable.
The function is monotonic while its derivative is not monotonic.
The tanh function is mainly used for classification between two
classes.
SOLU: We know the gradients of the sigmoid and tanh activation functions, which are:
● Gradient for sigmoid: σ(z) ∗ (1 − σ(z))
● Gradient for tanh: 1 − tanh²(z)
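A small NumPy sketch that checks these two gradient formulas against central finite differences (the step size 1e-5 is an illustrative choice):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-3, 3, 7)
eps = 1e-5

# Analytical gradients
grad_sigmoid = sigmoid(z) * (1 - sigmoid(z))
grad_tanh = 1 - np.tanh(z) ** 2

# Finite-difference approximations
num_sigmoid = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
num_tanh = (np.tanh(z + eps) - np.tanh(z - eps)) / (2 * eps)

print(np.max(np.abs(grad_sigmoid - num_sigmoid)))   # both errors are tiny (machine-precision level)
print(np.max(np.abs(grad_tanh - num_tanh)))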