NNDL Unit 3
Artificial neurons, despite their striking resemblance to biological neurons, do not behave in the
same way.
Biological and artificial NNs differ fundamentally in the following ways:
Key Concepts
Each neuron has a value at any given time, analogous to the electrical potential of a biological neuron.
The value of a neuron can change according to its mathematical model; for example, if a neuron gets
a spike from an upstream neuron, its value may rise or fall.
If a neuron’s value surpasses a certain threshold, the neuron sends a single impulse to each downstream neuron connected to it, and its value immediately drops below its average.
As a result, the neuron will go through a refractory period similar to that of a biological neuron. The
neuron’s value will gradually return to its average over time.
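As an illustration, here is a minimal Python sketch of the dynamics just described, assuming a simple leaky integrate-and-fire style model; the class name, constants, and input currents are illustrative assumptions, not part of any specific published SNN formulation.

import numpy as np

# Minimal sketch of the neuron dynamics described above (illustrative
# constants; a simplified leaky integrate-and-fire style model).
class SpikingNeuron:
    def __init__(self, rest=0.0, threshold=1.0, reset=-0.2, leak=0.9):
        self.rest = rest            # average/resting value the neuron decays toward
        self.threshold = threshold  # firing threshold
        self.reset = reset          # value after a spike (below the average)
        self.leak = leak            # per-step decay factor toward the resting value
        self.value = rest           # current "electrical potential"

    def step(self, input_current):
        # Incoming spikes from upstream neurons raise (or lower) the value.
        self.value += input_current
        if self.value >= self.threshold:
            # Fire a single impulse and drop below the average (refractory-like reset).
            self.value = self.reset
            return 1
        # Gradually return toward the average over time.
        self.value = self.rest + self.leak * (self.value - self.rest)
        return 0

neuron = SpikingNeuron()
spikes = [neuron.step(i) for i in np.random.uniform(0.0, 0.5, size=50)]
print("spike train:", spikes)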
SNN ARCHITECTURE
An SNN architecture consists of spiking neurons and interconnecting synapses that are modelled by adjustable scalar weights.
The first stage in building an SNN is to encode the analogue input data into spike trains using a rate-based technique, some form of temporal coding, or population coding.
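For illustration, the sketch below shows one common rate-based encoding, assuming a Poisson-style scheme in which each analogue value in [0, 1] becomes a spike train whose firing rate is proportional to the value; the function name, window length, and maximum rate are assumptions for the example.

import numpy as np

# Hedged sketch of rate coding: each analogue value becomes a Poisson
# spike train whose firing rate is proportional to the value.
def rate_encode(values, n_steps=100, max_rate=0.5, seed=0):
    rng = np.random.default_rng(seed)
    values = np.asarray(values)            # shape (n_inputs,)
    p_spike = values * max_rate            # per-step spike probability
    # One row per input value, one column per time step; 1 = spike.
    return (rng.random((values.size, n_steps)) < p_spike[:, None]).astype(int)

trains = rate_encode([0.1, 0.5, 0.9])
print("mean firing rates:", trains.mean(axis=1))  # roughly 0.05, 0.25, 0.45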
The network dynamics of artificial SNNs are much simplified as compared to actual biological
networks.
It is useful in this context to suppose that the modelled spiking neurons have pure threshold dynamics
(as opposed to refractoriness, hysteresis, resonance dynamics, or post-inhibitory rebound features).
The activity of presynaptic neurons affects the membrane potential of postsynaptic neurons; when the membrane potential of a postsynaptic neuron reaches a threshold, it produces an action potential, or spike.
ADVANTAGES OF SNN
An SNN is a dynamic system. As a result, it excels at dynamic tasks such as speech recognition and dynamic image recognition.
An SNN can continue to train even while it is already operating.
To train an SNN, you simply need to train the output neurons.
SNNs typically require fewer neurons than traditional ANNs to perform the same task.
Because the neurons send impulses rather than a continuous value, SNNs can work incredibly
quickly.
Because they leverage the temporal presentation of information, SNNs offer increased information-processing throughput and noise immunity.
DISADVANTAGES
DEEP LEARNING
Deep learning has become one of the most popular and visible areas of machine learning, due to its success in a variety of applications such as computer vision, natural language processing, and reinforcement learning.
Deep learning can be used for
i. Supervised
ii. Unsupervised
iii. Reinforcement machine learning
i. Supervised Machine Learning:
Supervised machine learning is the technique in which a neural network learns to make predictions or to classify data based on labeled datasets. Here we supply both the input features and the target variables. The neural network learns by reducing the cost, or error, that comes from the difference between the predicted and the actual target; this process is known as backpropagation. Deep learning algorithms such as convolutional neural networks and recurrent neural networks are used for many supervised tasks like image classification and recognition, sentiment analysis, and language translation.
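As a minimal illustration of this loop, the sketch below trains a single sigmoid neuron on a toy labeled dataset by repeatedly propagating the prediction error back to the weights; the dataset, learning rate, and epoch count are assumptions for the example.

import numpy as np

# Minimal sketch of supervised learning: labeled data, a prediction,
# an error, and gradient updates driven by that error.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                      # input features
y = (X[:, 0] + X[:, 1] > 0).astype(float)          # labeled targets

w, b = rng.normal(size=2), 0.0
lr = 0.1
for epoch in range(100):
    pred = 1.0 / (1.0 + np.exp(-(X @ w + b)))      # predicted probabilities
    error = pred - y                               # difference between predicted and actual
    # Gradients of the cost w.r.t. the parameters (the backpropagation step).
    w -= lr * X.T @ error / len(y)
    b -= lr * error.mean()

accuracy = ((1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5) == y).mean()
print(f"training accuracy: {accuracy:.2f}")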
ii. Unsupervised Machine Learning:
Unsupervised machine learning is the technique in which a neural network learns to discover patterns or to cluster a dataset based on unlabeled data. Here there are no target variables; the machine has to determine the hidden patterns or relationships within the dataset on its own. Deep learning algorithms such as autoencoders and generative models are used for unsupervised tasks like clustering, dimensionality reduction, and anomaly detection.
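A minimal sketch of this idea follows, assuming a linear autoencoder trained by gradient descent on toy unlabeled data; the sizes, learning rate, and epoch count are illustrative.

import numpy as np

# Minimal sketch of unsupervised learning with an autoencoder: the network
# reconstructs unlabeled inputs through a low-dimensional bottleneck.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))                # unlabeled dataset, 8 features

W_enc = rng.normal(scale=0.1, size=(8, 2))   # encoder: 8 -> 2 (bottleneck)
W_dec = rng.normal(scale=0.1, size=(2, 8))   # decoder: 2 -> 8
lr = 0.01
for epoch in range(500):
    Z = X @ W_enc                            # hidden codes (discovered structure)
    X_hat = Z @ W_dec                        # reconstruction of the input
    err = X_hat - X                          # reconstruction error; no labels needed
    W_dec -= lr * Z.T @ err / len(X)
    W_enc -= lr * X.T @ (err @ W_dec.T) / len(X)

print("reconstruction MSE:", float(((X @ W_enc @ W_dec - X) ** 2).mean()))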
MACHINE LEARNING vs DEEP LEARNING
Machine learning applies statistical algorithms to learn the hidden patterns and relationships in the dataset, whereas deep learning uses artificial neural network architectures to learn those hidden patterns and relationships.
Machine learning takes less time to train a model; deep learning takes more time.
Deep learning is applied in areas such as:
i. Computer vision
ii. Natural language processing (NLP)
iii. Reinforcement learning.
i. Computer vision
In computer vision, deep learning models enable machines to identify and understand visual data. Some of the main applications of deep learning in computer vision include:
Object detection and recognition: deep learning models can be used to identify and locate objects within images and videos, enabling tasks such as self-driving cars, surveillance, and robotics.
Image classification: deep learning models can be used to classify images into categories such as animals, plants, and buildings. This is used in applications such as medical imaging, quality control, and image retrieval.
EXTREME LEARNING MACHINE (ELM) ARCHITECTURE
i. Input layer
In this representation, each X[i] corresponds to a specific feature or attribute of the data. N is
the total number of features.
The Input Layer is responsible for passing the data to the Hidden Layer for further processing.
ii. Hidden layer-single hidden layer
The hidden layer of ELM is where random weights and biases are assigned. Let’s denote the number of hidden neurons as L.
The weights connecting the input features to the hidden neurons are represented by a weight
matrix W of size (number of features, L).
The value of L is a hyperparameter that needs to be set before training the neural network.
The more hidden neurons there are, the more complex the neural network will be and the more accurately it can model complex functions. However, having too many neurons will lead to overfitting.
Each column in the weight matrix corresponds to the weights of a hidden neuron. The biases
for the hidden neurons are represented by a bias vector b of size (L, 1).
The second dimension of 1 is used to ensure that the bias vector is a column vector. This is because the dot product of the weight matrix W and the input feature vector X results in a column vector, and adding the bias to that result requires the bias vector to be a column vector as well.
The purpose of the bias term is to shift the activation function to the left or right, allowing it to model more complex functions.
The output of the hidden layer, often denoted as H, is calculated by applying the activation function g element-wise to the dot product of the input features and the weights, plus the bias:
H = g(W * X + b)
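A small sketch of this hidden-layer computation follows, using the shapes given above (W of size N x L, b of size L x 1, X a column of N features); the sizes and the tanh activation are assumptions, and with W stored as (N, L) the dot product that yields the column vector described in the text is W.T @ X.

import numpy as np

# Sketch of H = g(W * X + b) for a single sample (illustrative sizes).
N, L = 4, 6
rng = np.random.default_rng(0)
W = rng.normal(size=(N, L))        # random input-to-hidden weights, (N, L)
b = rng.normal(size=(L, 1))        # column bias vector, (L, 1)
X = rng.normal(size=(N, 1))        # one input sample as a column vector

g = np.tanh                        # element-wise activation function
H = g(W.T @ X + b)                 # hidden-layer output, shape (L, 1)
print(H.shape)                     # (6, 1)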
To make predictions, we multiply the hidden-layer output H by the output weights beta:
f(x) = H * beta
Each row in f(x) represents the predictions for a corresponding data point. The output predictions f(x) form a matrix of size J x K, where J is the number of data points and K is the number of output variables.
H is a matrix of size J x L, where L is the number of hidden neurons; it contains the transformed input data after applying the random weights and biases of the hidden layer. Each row corresponds to a data point and each column to a hidden neuron.
The output weight matrix beta has size L x K and links the hidden-layer output to the output predictions. Each row corresponds to a hidden neuron, and each column to an output variable.
HOW ELM IS TRAINED:
ELM is trained on the input training data in the step-by-step procedure listed below (a code sketch follows the steps).
1. Input Training Data: The first step is to gather the training data, which includes the input features and target variables, to feed into the ELM algorithm.
2. Random Initialization: Next, the weights and biases of the hidden layer are randomly initialized; this eliminates iterative weight adjustment.
3. Feature Mapping: The input data is transformed into a high-dimensional feature space using the randomly assigned weights. This process is known as feature mapping and helps capture complex relationships between the input features.
4. Hidden Layer Processing: Then the transformed input data is processed by the hidden layer,
resulting in an output matrix. This output is a key component in ELM’s unique single-step learning
process.
5. Output Weight Calculation: In the fifth step, ELM leverages the output matrix by applying the Moore-Penrose generalized inverse to calculate the output weights. This mathematical technique ensures robustness, even in the presence of noise or missing data.
6. Model Evaluation: Once the output weights are determined by Moore-Penrose generalized
inverse, the model’s training output is generated. This output is evaluated by specific metrics
based on the problem use cases.
7. Fine-Tuning and Hyperparameter Adjustment: Depending on the assessment results, fine-tuning or hyperparameter adjustments can be applied to improve model performance.
8. Deployment: Finally, once the training process is complete, the ELM model is ready for deployment in real-world applications, where it can make predictions and decisions based on the learned patterns.
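Putting the steps together, here is a hedged end-to-end sketch of ELM training on toy regression data; the sizes, the tanh activation, and the synthetic targets are all assumptions, and the batch form H = g(X * W + b) is used so that H has shape J x L, as described earlier.

import numpy as np

# End-to-end sketch of the ELM training steps above (toy data).
rng = np.random.default_rng(0)
J, N, L, K = 200, 5, 50, 1                 # samples, features, hidden neurons, outputs

# Step 1: input training data (features X and targets T).
X = rng.normal(size=(J, N))
T = np.sin(X.sum(axis=1, keepdims=True))   # target variable, shape (J, K)

# Step 2: random initialization of hidden weights and biases (never updated).
W = rng.normal(size=(N, L))
b = rng.normal(size=(1, L))

# Steps 3-4: feature mapping and hidden-layer processing; H is (J, L).
H = np.tanh(X @ W + b)

# Step 5: output weights via the Moore-Penrose generalized inverse, (L, K).
beta = np.linalg.pinv(H) @ T

# Step 6: model evaluation on the training output, f(x) = H * beta.
predictions = H @ beta                     # shape (J, K)
print("training MSE:", float(((predictions - T) ** 2).mean()))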
CONVOLUTIONAL NEURAL NETWORK (CNN)
Convolutional Neural Network (CNN) is an extended version of the artificial neural network (ANN) that is predominantly used to extract features from grid-like matrix datasets, for example visual datasets such as images or videos, where spatial data patterns play an extensive role.
CNN ARCHITECTURE:
Convolutional Neural Network consists of multiple layers like the input layer, Convolutional layer,
Pooling layer, and fully connected layers.
The convolutional layer applies filters to the input image to extract features, the pooling layer downsamples the image to reduce computation, and the fully connected layer makes the final prediction. The network learns the optimal filters through backpropagation and gradient descent.
Now imagine taking a small patch of this image and running a small neural network, called a filter
or kernel on it, with say, K outputs and representing them vertically.
Now slide that neural network across the whole image; as a result, we will get another image with different width, height, and depth. Instead of just the R, G, and B channels, we now have more channels but smaller width and height. This operation is called convolution.
If the patch size is the same as that of the image it will be a regular neural network. Because of
this small patch, we have fewer weights.
Now let’s talk about a bit of mathematics that is involved in the whole convolution process.
Convolution layers consist of a set of learnable filters (or kernels) having small widths and heights
and the same depth as that of input volume (3 if the input layer is image input).
Typical kernels include ones for applying Gaussian blur (to smooth the image before processing), sharpening (to enhance the depth of edges), and edge detection.
The shape of a kernel depends heavily on the input shape of the image and the architecture of the entire network; mostly, the size of a kernel is (M x M), i.e. a square matrix. The movement of a kernel is always from left to right and top to bottom.
Stride defines the step by which the kernel moves: for example, a stride of 1 makes the kernel slide by one row/column at a time, and a stride of 2 moves the kernel by 2 rows/columns.
For input images with 3 or more channels, such as RGB, a filter is applied instead of a single kernel.
Filters are one dimension higher than kernels and can be seen as multiple kernels stacked on each
other where every kernel is for a particular channel.
Therefore for an RGB image of (32x32) we have a filter of the shape say (5x5x3).
Now let’s see how a kernel operates on sample matrix
Here the input matrix has shape 4x4x1 and the kernel is of size 3x3. Since the input is larger than the kernel, we can implement a sliding-window protocol and apply the kernel over the entire input. The first entry in the convolved result is calculated as:
45*0 + 12*(-1) + 5*0 + 22*(-1) + 10*5 + 35*(-1) + 88*0 + 26*(-1) + 51*0 = -45
Sliding window protocol: the kernel visits each valid position of the input in turn, producing one output value per position.
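The sketch below reproduces this computation in Python; only the first 3x3 patch of the 4x4 input is given in the text, so the remaining entries are assumed values, and the first output entry comes out to -45 as above.

import numpy as np

# Sliding-window convolution over the worked example above.
inp = np.array([[45, 12,  5, 17],
                [22, 10, 35,  6],
                [88, 26, 51, 19],
                [ 9, 77, 42,  3]])          # 4th row/column values are assumed
kernel = np.array([[ 0, -1,  0],
                   [-1,  5, -1],
                   [ 0, -1,  0]])

out_h = inp.shape[0] - kernel.shape[0] + 1  # 4 - 3 + 1 = 2
out_w = inp.shape[1] - kernel.shape[1] + 1
out = np.zeros((out_h, out_w))
for i in range(out_h):                      # slide the window top to bottom
    for j in range(out_w):                  # and left to right
        patch = inp[i:i + 3, j:j + 3]
        out[i, j] = (patch * kernel).sum()  # element-wise multiply and sum

print(out[0, 0])                            # -45.0, matching the text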
POOLING:
Similar to the Convolutional Layer, the Pooling layer is responsible for reducing the spatial
size of the Convolved Feature.
This is to decrease the computational power required to process the data by reducing the
dimensions.
There are two types of pooling: average pooling and max pooling. Max pooling is the more commonly used of the two.
The pooling operation involves sliding a two-dimensional filter over each channel of feature map
and summarizing the features lying within the region covered by the filter.
For a feature map having dimensions nh x nw x nc, the dimensions of the output obtained after a pooling layer with filter size f and stride s are
((nh - f)/s + 1) x ((nw - f)/s + 1) x nc
Where,
-> nh - height of feature map
-> nw - width of feature map
-> nc - number of channels in the feature map
-> f - size of the filter
-> s - stride length
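A small sketch of max pooling follows, with an assumed 4x4 single-channel feature map, f = 2, and s = 2, matching the output-size formula above.

import numpy as np

# Max pooling with a 2x2 filter (f = 2) and stride 2 (s = 2).
fmap = np.array([[1, 3, 2, 4],
                 [5, 6, 1, 2],
                 [7, 2, 9, 1],
                 [3, 4, 6, 8]], dtype=float)     # nh = nw = 4, nc = 1

f, s = 2, 2
out_h = (fmap.shape[0] - f) // s + 1             # (4 - 2)/2 + 1 = 2
out_w = (fmap.shape[1] - f) // s + 1
pooled = np.zeros((out_h, out_w))
for i in range(out_h):
    for j in range(out_w):
        window = fmap[i*s:i*s + f, j*s:j*s + f]  # region covered by the filter
        pooled[i, j] = window.max()              # max pooling summarizes the region

print(pooled)   # [[6. 4.] [7. 9.]]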
In practical implementations of the convolution operation, certain modifications are made which
deviate from standard discrete convolution operation:
In general, a convolution layer consists of the application of several different kernels to the input, since convolution with a single kernel can extract only one kind of feature.
The input is generally not real-valued but instead vector-valued.
Multi-channel convolutions are guaranteed to be commutative only if the number of output and input channels is the same.
Effect of Strides
Stride is the number of pixels by which the kernel shifts over the input matrix.
Strided convolutions can be used to allow the calculation of features at a coarser level.
The effect of strided convolution is the same as that of a convolution followed by a downsampling stage.
Strides can be used to reduce the representation size.
Below is a sketch representing 2-D convolution with a (3 x 3) kernel and a stride of 2 units.
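This sketch uses assumed input data, and it also checks the claim above that a strided convolution equals a full (stride 1) convolution followed by downsampling; the input, kernel, and sizes are illustrative.

import numpy as np

# 2-D convolution with a (3x3) kernel and stride 2 on a 5x5 input.
inp = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.ones((3, 3)) / 9.0                         # 3x3 averaging kernel

s = 2
out_size = (inp.shape[0] - kernel.shape[0]) // s + 1   # (5 - 3)/2 + 1 = 2
out = np.zeros((out_size, out_size))
for i in range(out_size):
    for j in range(out_size):
        patch = inp[i*s:i*s + 3, j*s:j*s + 3]          # the window jumps s pixels
        out[i, j] = (patch * kernel).sum()

# Equivalent view: full (stride 1) convolution followed by downsampling.
full = np.array([[(inp[i:i+3, j:j+3] * kernel).sum() for j in range(3)]
                 for i in range(3)])
assert np.allclose(out, full[::s, ::s])
print(out)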
Effect of Zero Padding
Convolutional networks can implicitly zero pad the input V to make it wider.
Without zero padding, the width of the representation shrinks by one pixel less than the kernel width at each layer.
Zero padding the input allows us to control the kernel width and the size of the output independently.
Zero Padding Strategies
Three common zero padding strategies are:
Valid convolution: no zero padding is used; the kernel is applied only where it lies entirely inside the image, so the output shrinks at every layer.
Same convolution: just enough zero padding is added to keep the size of the output equal to the size of the input.
Full convolution: enough zeroes are added for every pixel to be visited k times in each direction, where k is the kernel width, giving an output wider than the input.
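These three strategies correspond to the valid, same, and full modes of SciPy's convolve2d, as the sketch below shows; the 6x6 input and 3x3 kernel are assumptions for the example.

import numpy as np
from scipy.signal import convolve2d

# The three zero-padding strategies via SciPy's mode names.
inp = np.random.default_rng(0).normal(size=(6, 6))
kernel = np.ones((3, 3))

for mode in ("valid", "same", "full"):
    out = convolve2d(inp, kernel, mode=mode)
    print(mode, out.shape)
# valid (4, 4): no padding; the output shrinks by kernel width minus one
# same  (6, 6): just enough padding to keep the output size equal to the input
# full  (8, 8): enough padding that every pixel is visited k times per direction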
Convolution Type: Unshared Convolution
Properties:
1. No parameter sharing.
2. Each output unit performs a linear operation on its neighborhood, but parameters are not shared across output units.
3. Captures local connectivity while allowing different features to be computed at different spatial locations.
Advantages (relative to full connectivity):
1. Reduces memory consumption.
2. Increases statistical efficiency.
3. Reduces the amount of computation needed to perform forward and back-propagation.
Disadvantages:
1. Requires many more parameters than the convolution operation.
Figure: comparison of 1) unshared convolution, 2) tiled convolution, and 3) traditional convolution.
Convolutional networks can be trained to output high-dimensional structured output rather than
just a classification score.
To produce an output map of the same size as the input map, only same-padded convolutions can be stacked.
The output of the first labelling stage can be refined successively by another convolutional model.
If the models use tied parameters, this gives rise to a type of recursive model.
The data used with a convolutional network usually consist of several channels, each channel being the
observation of a different quantity at some point in space or time.
The Fourier transform is a tool that breaks a waveform (a function or signal) into an alternate representation characterized by sines and cosines.
When a d-dimensional kernel can be expressed as the outer product of d vectors, one vector per dimension, the kernel is called separable.
A separable kernel also takes fewer parameters to represent, since it is stored as vectors.
Kernel Type          Runtime complexity for a d-dimensional kernel w elements wide
Traditional kernel   O(w^d)
Separable kernel     O(w x d)
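A sketch of the d = 2 case on assumed data follows: a kernel formed as the outer product of two 1-D vectors gives the same result whether applied directly as one 2-D convolution or as two cheap 1-D convolutions; the vectors and image are illustrative.

import numpy as np
from scipy.signal import convolve2d, convolve

# Separable kernel: outer product of two 1-D vectors (d = 2, w = 3).
v = np.array([1.0, 2.0, 1.0])
K = np.outer(v, v)                          # separable 3x3 kernel (w^d = 9 weights)

img = np.random.default_rng(0).normal(size=(8, 8))

direct = convolve2d(img, K, mode="valid")   # one w x w convolution: O(w^d) per output
rows = np.apply_along_axis(lambda r: convolve(r, v, mode="valid"), 1, img)
both = np.apply_along_axis(lambda c: convolve(c, v, mode="valid"), 0, rows)

print(np.allclose(direct, both))            # True: two 1-D passes, O(w * d) per output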
Hubel and Wiesel studied the activity of neurons in a cat’s brain in response to visual stimuli.
Their work characterized many aspects of brain function.
In a simplified view, we have:
The light entering the eye stimulates the retina. The image then passes through the optic nerve to a region of the brain called the LGN (lateral geniculate nucleus).
V1 (primary visual cortex): The image produced on the retina is transported to the V1 with
minimal processing.
The properties of V1 that have been replicated in CNNs are:
The V1 response is localized spatially, i.e. an image in the upper visual field stimulates the cells in the corresponding upper region of V1 [localized kernel].
V1 has simple cells whose activity is a linear function of the input in a small
neighborhood [convolution].
V1 has complex cells whose activity is invariant to shifts in the position of the feature [pooling] as
well as some changes in lighting which cannot be captured by spatial pooling [cross-channel
pooling].