Report 2022

The document provides an overview of neural networks, detailing their definition, historical development, characteristics, and applications across various fields. It explains the similarities between human and artificial neurons, the architecture of artificial neural networks, and the functioning of different types of networks. The document concludes by emphasizing the extensive investment and interest in neural network technology and its capabilities in processing complex data.

Uploaded by

sideedasayeall
Copyright
© All Rights Reserved

1. Introduction to Neural Network


The recent rise of interest in neural networks has its roots in the recognition that the brain
performs computations in a different manner than do conventional digital computers.
Computers are extremely fast and precise at executing sequences of instructions that have
been formulated for them. A human information processing system is composed of neurons
switching at speeds about a million times slower than computer gates. Yet, humans are more
efficient than computers at computationally complex tasks such as speech understanding.
Moreover, not only humans but even animals can process visual information better than
the fastest computers. Artificial neural systems, or neural networks (NN), are physical
cellular systems that can acquire, store, and utilize experiential knowledge. The knowledge
is in the form of stable states or mappings embedded in networks that can be recalled in
response to the presentation of cues. Neural network processing typically involves dealing with
large-scale problems in terms of dimensionality, amount of data handled, and the volume of
simulation or neural hardware processing. This large-scale approach is both essential and
typical for real-life applications. With all of this in view, the research community has
worked to design and implement various neural network models for
different applications.

Definition: A neural network is a computing system made up of a number of simple, highly
interconnected nodes or processing elements, which process information by their dynamic state
response to external inputs.

1.1 Historical Background


1943:
McCulloch and Pitts proposed the first formal model of a synthetic neuron, known as the
McCulloch-Pitts (MP) neuron, resembling a binary logic device.
1949:
Hebb introduced a learning mechanism, suggesting that the brain's connectivity changes
as it learns, forming the basis for ANN learning algorithms.
1958:
Rosenblatt introduced the perceptron, incorporating the learning mechanism into ANNs.

1960:
Widrow and Hoff developed the ADALINE (Adaptive Linear Neuron) model using the
least mean squares (LMS) algorithm for quick and accurate learning, with applications in
pattern recognition, weather forecasting, and adaptive controls.
1969:
Minsky and Papert highlighted the limitations of single-layer neural networks in their
book on perceptrons, causing a setback in ANN research.
Post-1969:
Despite the setback, researchers like Kohonen, Grossberg, Anderson, and Hopfield
continued their work, and multi-layer perceptron networks were found to solve nonlinear
problems.
1970s-1980s:
Research on threshold elements and neural network theory continued, with Kunihiko
Fukushima developing the neocognitron in 1980.
Late 1980s:
Demonstrations of ANN capabilities emerged, including text-to-speech conversion,
handwritten character recognition, and image compression, primarily using the
backpropagation algorithm. Backpropagation, developed independently by Werbos,
by Parker, and by Rumelhart, Hinton, and Williams, enabled the training of multi-layer
networks, overcoming the limitations identified by Minsky and Papert.

1.2 Characteristics of Neural Networks


Artificial neural networks are biologically inspired; that is, they are composed of
elements that perform in a manner that is analogous to the most elementary functions of
the biological neuron. The important characteristics of artificial neural networks are that
they learn from experience, generalize from previous examples to new ones, and abstract
essential characteristics from inputs containing irrelevant data.

• Learning
The NNs learn by example. Thus, NN architectures can be ‘trained’ with known
examples of a problem before they are tested for their ‘inference’ capability on
unknown instances of the problem. They can, therefore, identify new objects
they were not previously trained on. ANNs can modify their behaviour in response to their
environment.

• Parallel operation
The NNs can process information in parallel, at high speed, and in a distributed
manner.

• Mapping
The NNs exhibit mapping capabilities, that is, they can map input patterns to
their associated output patterns.

• Generalization
The NNs possess the capability to generalize. Thus, they can predict new
outcomes from past trends. Once trained, a network’s response can be, to a
degree, insensitive to minor variations in its input. This ability to see through
noise and distortion to the pattern that lies within is vital to pattern recognition in
a real-world environment.

• Robustness
The NNs are robust systems and are fault tolerant. They can, therefore, recall full
patterns from incomplete, partial or noisy patterns.

• Abstraction
Some ANNs are capable of abstracting the essence of a set of inputs, i.e., they
can extract features of the given set of data. For example, convolutional neural
networks are used to extract different features from images, such as edges, dark
spots and shapes. Such networks are trained for feature patterns based on which
they can classify or cluster the given input set.

• Applicability
ANNs are not a panacea. They are clearly unsuited to such tasks as calculating
the payroll. They are preferred for a large class of pattern-recognition tasks that
conventional computers do poorly, if at all.

1.3 Why use Neural Networks?


Neural networks, with their remarkable ability to derive meaning from complicated or
imprecise data, can be used to extract patterns and detect trends that are too complex to
be noticed by either humans or other computer techniques. A trained neural network can
be thought of as an "expert" in the category of information it has been given to analyse.
This expert can then be used to provide projections given new situations of interest and
answer "what if" questions.
Other advantages include:
1. Adaptive Learning:
ANNs can learn to perform tasks based on the data they are trained on or from initial
experiences. This means they can adjust and improve their performance over time as they
receive more data.
2. Self-Organization:
ANNs can create their own internal organization or representation of the information
they receive during the learning process. This allows them to identify patterns and
relationships in the data without explicit programming.
3. Real-Time Operation:
ANN computations can be performed in parallel, allowing for fast processing.
Specialized hardware is being developed to take advantage of this capability, further
enhancing their speed and efficiency.
4. Fault Tolerance via Redundant Information Coding:
Even if parts of the network are damaged or destroyed, ANNs can still maintain some
functionality. This is due to the distributed nature of information storage within the
network, where redundancy in coding allows for graceful degradation of performance
rather than catastrophic failure.

1.4 Applications of Neural Networks

1. Aerospace

High performance aircraft autopilots, flight path simulations, aircraft control systems,
autopilot enhancements, aircraft component simulations, aircraft component fault detectors.

2. Automotive

Automobile automatic guidance systems, fuel injector control, automatic braking systems,
misfire detection, virtual emission sensors, warranty activity analyzers.

3. Banking

Check and other document readers, credit application evaluators, cash forecasting, firm
classification, exchange rate forecasting, predicting loan recovery rates, measuring credit risk.

4. Defense

Weapon steering, target tracking, object discrimination, facial recognition, new kinds of
sensors, sonar, radar and image signal processing including data compression, feature
extraction and noise suppression, signal/image identification.

5. Electronics

Code sequence prediction, integrated circuit chip layout, process control, chip failure
analysis, machine vision, voice synthesis, nonlinear modelling.

6. Entertainment

Animation, special effects, market forecasting.

7. Medical

Breast cancer cell analysis, EEG and ECG analysis, prosthesis design, optimization of
transplant times, hospital expense reduction, hospital quality improvement, emergency room
test advisement.

8. Robotics

Trajectory control, forklift robot, manipulator controllers, vision systems, autonomous
vehicles.

9. Speech

Speech recognition, speech compression, vowel classification, text-to-speech synthesis.

10. Telecommunications

Image and data compression, automated information services, real-time translation of spoken
language, customer payment processing systems.

Conclusion

The number of neural network applications, the money that has been invested in neural
network software and hardware, and the depth and breadth of interest in these devices are
enormous.

2. Human and Artificial Neurons - Investigating the Similarities

2.1 How the Human Brain Learns?

Much is still unknown about how the brain trains itself to process information, so theories
abound.

The brain consists of a large number (approximately 10^11) of highly connected elements
(approximately 10^4 connections per element) called neurons. For our purposes these
neurons have three principal components: the dendrites, the cell body and the axon.

In the human brain, a typical neuron collects signals from others through a host of fine
structures called dendrites. The neuron sends out spikes of electrical activity through a long,
thin strand known as an axon, which splits into thousands of branches. At the end of each
branch, a structure called a synapse converts the activity from the axon into electrical effects
that inhibit or excite activity in the connected neurons. When a neuron receives excitatory input
that is sufficiently large compared with its inhibitory input, it sends a spike of electrical
activity down its axon. Learning occurs by changing the effectiveness of the synapses so that
the influence of one neuron on another changes.

Fig: a) Components of a Neuron b) The Synapse

2.2 From Human Neurons to Artificial Neuron

The artificial neuron is designed to mimic the first-order characteristics of the biological
neuron. Like the biological neuron, the artificial neuron receives many inputs
representing the outputs of other neurons. Each input is multiplied by a corresponding weight,
analogous to the synaptic strength. All of these weighted inputs are then summed and passed
through an activation function to determine the neuron output.
Fig: An Artificial Neuron
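
The weighted-sum-then-activation behaviour described above can be sketched in a few lines of Python. This is only an illustration: the input values, the weights, the optional bias term and the choice of a sigmoid activation are assumptions made for the example, not values taken from the report.

```python
import math

def sigmoid(net):
    """Unipolar sigmoid activation (an assumed choice for this sketch)."""
    return 1.0 / (1.0 + math.exp(-net))

def neuron_output(inputs, weights, bias=0.0, activation=sigmoid):
    """Multiply each input by its weight, sum the products, apply the activation."""
    net = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(net)

# Hypothetical inputs and synaptic weights, for illustration only.
x = [1.0, 0.5, -1.0]
w = [0.4, -0.2, 0.1]
y = neuron_output(x, w)   # net = 0.4 - 0.1 - 0.1 = 0.2, then sigmoid(0.2)
```

Swapping the `activation` argument for another function changes the neuron's behaviour without touching the weighted sum.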

We construct these neural networks by first trying to deduce the essential features of neurons
and their interconnections. We then typically program a computer to simulate these features.
However, because our knowledge of neurons is incomplete and our computing power is
limited, our models are necessarily gross idealisations of real networks of neurons.

2.3 Biological Neuron vs Artificial Neural Network

a. Biological Neuron

b. Artificial Neuron

3. Architecture of Artificial Neural Network

Fig: Artificial Neural Network


3.1 Components of Artificial Neural Network Architecture

Input Layer

This is where the network receives its input data. Each input neuron in the layer corresponds
to a feature in the input data.

Hidden Layers

These layers perform most of the computational heavy lifting. A neural network can have one
or multiple hidden layers. Each layer consists of units (neurons) that transform the inputs into
something that the output layer can use.

Output Layer

The final layer produces the output of the model. The format of these outputs varies
depending on the specific task (e.g., classification, regression).

Neurons

Neurons, also known as artificial neurons or nodes, are fundamental units in artificial neural
networks. Each neuron connects to other neurons within the network, allowing for the
transmission and processing of information. Each neuron has an associated weight and
threshold. Weights determine the strength of the connection between neurons, while the
threshold dictates the minimum input required for a neuron to activate.

Weights

These are numerical values assigned to the connections between neurons in different layers of
the network. They determine the strength or importance of each connection. During training,
the weights are adjusted to minimize the difference between the network's predictions and the
actual target values.

Biases

These are additional numerical values associated with each neuron. They act as an offset,
allowing the neuron to activate even when the input is zero. Like weights, biases are also
adjusted during training to improve the network's accuracy.

Activation Function

An activation function is a mathematical function applied to the weighted sum of a neuron's
inputs to produce its output. It introduces non-linearity, allowing the network to learn
complex patterns and make better predictions.
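
To show how the components listed in this section fit together, the following minimal sketch passes one input vector through an input layer, one hidden layer and an output layer. The 3-2-1 layer sizes, all weight and bias values, and the use of tanh in the hidden layer are hypothetical choices made only for this illustration.

```python
import math

def dense(inputs, weights, biases, activation):
    """One layer: each row of `weights` holds the incoming weights of one neuron."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def identity(z):
    """Linear activation, used here for the output layer."""
    return z

# Hypothetical 3-2-1 network: 3 input features, 2 hidden neurons, 1 output neuron.
hidden_w = [[0.5, -0.5, 0.0],
            [0.0, 1.0, 1.0]]
hidden_b = [0.0, -1.0]
output_w = [[1.0, -1.0]]
output_b = [0.5]

def forward(x):
    hidden = dense(x, hidden_w, hidden_b, math.tanh)     # hidden layer
    return dense(hidden, output_w, output_b, identity)   # output layer

y = forward([1.0, 1.0, 0.0])
```

With these particular numbers both hidden neurons receive a net input of zero, so the output equals the output bias. This illustrates the role of the bias as an offset that lets a neuron produce a non-zero value even when its weighted input is zero.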

The behaviour of the artificial neuron depends on both the synaptic weights and the
activation function. Sigmoid functions are commonly used activation functions in
multilayer feedforward neural networks. Neurons with sigmoid functions bear a greater
resemblance to biological neurons than neurons with other activation functions. Another feature
of the sigmoid function is that it is differentiable and gives a continuous-valued output. Some of
the popular activation functions are described below along with their other characteristics.

1. Sigmoid function (Unipolar sigmoid)

The characteristics of this function are shown in the Fig., and its mathematical description
is

$$ y = f(x) = \frac{1}{1 + e^{-x}} $$

and its range of signal is 0 < y < 1.

The derivative of the above function is written as

$$ f'(x) = f(x)\,[1 - f(x)] $$

Moreover, sigmoid functions are continuous and monotonic, and remain finite even as x
approaches positive or negative infinity. Because they are monotonic, they
also provide for more efficient network training.
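
As a quick check of these properties, the sketch below assumes the standard unipolar sigmoid y = 1 / (1 + e^(-x)) with no slope parameter, whose derivative is y(1 - y), and compares that analytic derivative against a finite-difference estimate. The helper name `numerical_derivative` is introduced only for this example.

```python
import math

def sigmoid(x):
    """Unipolar sigmoid; the output lies strictly between 0 and 1 for moderate x."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    """Analytic derivative expressed through the function value itself."""
    y = sigmoid(x)
    return y * (1.0 - y)

def numerical_derivative(f, x, h=1e-6):
    """Central finite difference, used only to sanity-check the analytic form."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

for point in (-3.0, 0.0, 2.0):
    gap = abs(sigmoid_derivative(point) - numerical_derivative(sigmoid, point))
    print(point, sigmoid(point), sigmoid_derivative(point), gap)
```

Because the derivative can be written in terms of the output alone, a network that has already computed y in the forward pass does not need to evaluate the exponential again during training.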

Selection of activation function
The selection of an activation function depends upon the application for which the
neural network is used and also on the level (layer) of the neuron. The activation functions
that are mainly used are the sigmoid (unipolar sigmoid), the hyperbolic tangent (bipolar
sigmoid), the radial basis function, the hard limiter and linear functions. The sigmoid and
hyperbolic tangent functions perform well for prediction and process-forecasting
problems. However, they do not perform as well for classification networks.
Instead, the radial basis function proves more effective for those networks, and is a highly
recommended function for problems involving fault diagnosis and feature
categorization. The hard limiter is well suited to classification problems. The linear function
may be used at the output layer in feedforward networks.

4. Artificial Neural Networks

The simplest network is a group of neurons arranged in a layer. This configuration is
known as a single layer neural network. This type of network comprises two layers,
namely the input layer and the output layer. The input layer neurons receive the input
signals and the output layer neurons produce the output signals. The synaptic links
carrying the weights connect every input neuron to the output neurons but not vice versa.
Such a network is said to be feedforward in type or acyclic in nature. Despite the two
layers, the network is termed single layer, since it is the output layer alone which
performs computation. The input layer merely transmits the signals to the output layer.
Hence the name single layer feedforward network. There are two types of single layer
networks, namely feedforward and feedback networks.

There are several basic types of artificial neural network, each designed for specific
tasks and applications.

4.1 Feed forward single layer neural network

The simplest type of network, in which data flows in one direction from input to output, is
the feedforward neural network. It is often used for tasks like classification and regression.

Consider m neurons arranged in a layer structure, each neuron
receiving n inputs, as shown in Fig. The output and input vectors are, respectively,

$$ O = [\, o_1 \;\; o_2 \;\; \ldots \;\; o_m \,]^T, \qquad X = [\, x_1 \;\; x_2 \;\; \ldots \;\; x_n \,]^T $$

Weight $w_{ji}$ connects the jth neuron with the ith input. The activation value for the jth
neuron is then

$$ net_j = \sum_{i=1}^{n} w_{ji}\, x_i, \qquad j = 1, 2, \ldots, m $$

The following nonlinear transformation involving the activation function f(net_j), for
j = 1, 2, ..., m, completes the processing of X. The transformation is performed by each of
the m neurons in the network:

$$ o_j = f(W_j^T X), \qquad j = 1, 2, \ldots, m $$

where the weight vector $W_j$ contains the weights leading toward the jth output node and is defined
as

$$ W_j = [\, w_{j1} \;\; w_{j2} \;\; \ldots \;\; w_{jn} \,]^T $$

Introducing the nonlinear matrix operator F, the mapping of input space X to output
space O implemented by the network can be written as

O = F (W X)

where W is the weight matrix, also known as the connection matrix, and is represented as

$$ W = \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1n} \\ w_{21} & w_{22} & \cdots & w_{2n} \\ \vdots & \vdots & & \vdots \\ w_{m1} & w_{m2} & \cdots & w_{mn} \end{bmatrix} $$

The weight matrix is first initialized and is then finalized through an appropriate
training method.

The nonlinear activation function f(.) on the diagonal of the matrix operator F(.) operates
component-wise on the activation values net of each neuron. Each activation value is, in
turn, a scalar product of an input with the respective weight vector. X is called the input
vector and O is called the output vector. The mapping of an input to an output is of the
feedforward and instantaneous type, since it involves no delay between the input and the
output. Therefore the relation may be written in terms of time t as

O (t) = F (W X(t))

Networks of this type can be connected in cascade to create a multilayer network.

Though there is no feedback in the feedforward network while mapping from input X(t)
to output O(t), the output values are compared with the “teacher’s” information, which
provides the desired output values. The error signal is used for adapting the network
weights.

Example: To illustrate the computation of the output O(t) of the single layer feedforward
network, consider an input vector X(t) and a network weight matrix W (say, the initialized
weights), given below. Assume the neurons use the hard limiter as their activation
function.

The output vector may be obtained as

O = F (W X) = [ sgn(-1 - 1) sgn(+1 + 2) sgn(1) sgn(-1 + 3) ]

= [ -1 1 1 1]

The output vector of the above single layer feedforward network is therefore [ -1 1 1 1].
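
The mapping O = F(W X) with a hard limiter can be sketched as follows. The weight matrix and input vector of the example above are not reproduced in this text, so the `W` and `X` below are hypothetical values chosen only so that the sketch arrives at the same output [ -1 1 1 1]; treating a net input of exactly zero as +1 is also an assumption.

```python
def sgn(net):
    """Hard limiter: +1 for net >= 0, -1 otherwise (the zero case is an assumed convention)."""
    return 1 if net >= 0 else -1

def single_layer_forward(W, X, f=sgn):
    """O = F(W X): row j of W is the weight vector of output neuron j."""
    return [f(sum(w_ji * x_i for w_ji, x_i in zip(row, X))) for row in W]

# Hypothetical 4 x 2 weight matrix and 2-element input, not the report's own values.
W = [[-1, -1],
     [ 1,  2],
     [ 1,  0],
     [-1,  3]]
X = [1, 1]

O = single_layer_forward(W, X)   # net values are -2, 3, 1 and 2
```

Each output neuron sees the same input vector but applies its own row of weights, which is why the four outputs can differ.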

4.2 Convolutional Neural Network

Convolutional neural network (CNN), a class of artificial neural networks that has become
dominant in various computer vision tasks, is attracting interest across a variety of domains,
including radiology.

CNN is a type of deep learning model for processing data that has a grid pattern, such as
images, which is inspired by the organization of animal visual cortex [13, 14] and designed to
automatically and adaptively learn spatial hierarchies of features, from low- to high-level
patterns.
“A simple CNN is a sequence of layers, and every layer of a CNN transforms one volume of
activations to another through a differentiable function.” In other words, each
layer converts the information available from the previous
layers into more complex information and passes it on to the next layers for further
generalization.

Defining the layers

• Convolutional Layer: The convolutional layer is responsible for feature
extraction.
• Filters: A filter, or ‘kernel’, is itself a small image that depicts a particular feature.
For example, let us take the picture of this curve. We take this as a sample
feature that we will try to recognize, i.e., determine whether it is present in an
image.

Fig: An example of a simple filter depicting a curved line. a. Pixel representation of the filter b. Visualization of a curve detector filter

Convolution: It is a special operation applied on a particular matrix (usually the image
matrix) using another matrix (usually the filter matrix).

The operation involves multiplying the value of each cell of the image matrix, at a particular
row and column, by the value of the corresponding cell in the filter matrix. We
do this for all the cells within the span of the filter matrix and add the products together
to form an output. For example, here part of the image matrix and part of the filter matrix are
convolved.

Perform element-wise multiplication and add the products:

(1×1) + (0×2) + (1×3) + (0×4) + (1×5) + (1×6) + (1×7) + (0×8) + (1×9) = 31

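
The same multiply-and-add step can be written out in Python. The text does not say which factor in each product comes from the image and which from the filter, so the sketch assumes the 0/1 values form the image patch and the values 1 to 9 form the filter; `multiply_and_sum` is a helper name introduced here.

```python
# Assumed layout: binary image patch, filter holding the values 1..9 row by row.
image_patch = [[1, 0, 1],
               [0, 1, 1],
               [1, 0, 1]]
kernel = [[1, 2, 3],
          [4, 5, 6],
          [7, 8, 9]]

def multiply_and_sum(patch, kernel):
    """Element-wise product of two equally sized matrices, summed to one number."""
    return sum(p * k
               for patch_row, kernel_row in zip(patch, kernel)
               for p, k in zip(patch_row, kernel_row))

result = multiply_and_sum(image_patch, kernel)
print(result)   # 31, matching the hand calculation
```

Repeating this step at every position the filter visits produces one feature map.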
The weights in the feature detector remain fixed as it moves across the image, which is also
known as parameter sharing. Some parameters, such as the weight values, are adjusted during
training through the process of backpropagation and gradient descent. However, there are
three hyperparameters which affect the volume size of the output and which need to be set before
the training of the neural network begins. These include:

1. The number of filters affects the depth of the output. For example, three distinct filters
would yield three different feature maps, creating a depth of three.

2. Stride is the distance, or number of pixels, that the kernel moves over the input matrix.
While stride values of two or greater are rare, a larger stride yields a smaller output.

3. Zero-padding is usually used when the filters do not fit the input image. This sets all
elements that fall outside of the input matrix to zero, producing a larger or equally sized
output. There are three types of padding: valid padding (no padding), same padding (the
output has the same size as the input), and full padding (the output is larger than the input).
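
The effect of stride and zero-padding on the output size can be sketched as below. The size formula floor((n + 2p - f) / s) + 1 is the usual one for an n-wide input, an f-wide filter, padding p and stride s. The function names, the square-filter restriction and the all-ones test data are assumptions of this sketch, and the filter is applied without flipping, as is common in CNN implementations.

```python
def output_size(n, f, stride=1, padding=0):
    """Number of filter positions along one dimension."""
    return (n + 2 * padding - f) // stride + 1

def conv2d(image, kernel, stride=1, padding=0):
    """Slide a square kernel over a zero-padded image, one multiply-and-add per position."""
    rows, cols, f = len(image), len(image[0]), len(kernel)
    # Build the zero-padded copy of the image.
    padded = [[0] * (cols + 2 * padding) for _ in range(rows + 2 * padding)]
    for r in range(rows):
        for c in range(cols):
            padded[r + padding][c + padding] = image[r][c]
    out = []
    for i in range(output_size(rows, f, stride, padding)):
        out_row = []
        for j in range(output_size(cols, f, stride, padding)):
            total = 0
            for a in range(f):
                for b in range(f):
                    total += padded[i * stride + a][j * stride + b] * kernel[a][b]
            out_row.append(total)
        out.append(out_row)
    return out

ones = [[1] * 5 for _ in range(5)]   # hypothetical 5 x 5 input
k = [[1] * 3 for _ in range(3)]      # hypothetical 3 x 3 filter

no_padding = conv2d(ones, k)               # 3 x 3 output
strided = conv2d(ones, k, stride=2)        # 2 x 2 output
same_size = conv2d(ones, k, padding=1)     # 5 x 5 output
```

With padding of one pixel the output keeps the 5 x 5 size of the input, and the corner values drop from 9 to 4 because five of the nine cells under the filter are padded zeros.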

Pooling layer

A pooling layer provides a typical downsampling operation which reduces the in-plane
dimensionality of the feature maps in order to introduce a translation invariance to small
shifts and distortions, and decrease the number of subsequent learnable parameters. It is of
note that there is no learnable parameter in any of the pooling layers, whereas filter size,
stride, and padding are hyperparameters in pooling operations, similar to convolution
operations.

Max pooling

The most popular form of pooling operation is max pooling, which extracts patches from the
input feature maps, outputs the maximum value in each patch, and discards all the other
values. A max pooling with a filter of size 2 × 2 with a stride of 2 is commonly used in
practice. This downsamples the in-plane dimension of feature maps by a factor of 2. Unlike
height and width, the depth dimension of feature maps remains unchanged.
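
A 2 × 2 max pooling with a stride of 2 can be sketched on a single feature map as follows. The 4 × 4 values are made up for the illustration and the function name is not taken from any library.

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Keep the largest value of each size x size patch of one feature map."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            max(feature_map[i + a][j + b] for a in range(size) for b in range(size))
            for j in range(0, cols - size + 1, stride)
        ]
        for i in range(0, rows - size + 1, stride)
    ]

# Hypothetical 4 x 4 feature map.
fmap = [[1, 3, 2, 0],
        [4, 2, 1, 5],
        [0, 1, 7, 2],
        [3, 2, 4, 6]]

pooled = max_pool2d(fmap)
print(pooled)   # [[4, 5], [3, 7]]
```

The function contains no weights, which reflects the point made above that pooling layers have no learnable parameters. Applying it to each feature map in turn leaves the depth unchanged.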

Global average pooling

Another pooling operation worth noting is a global average pooling. A global average pooling
performs an extreme type of downsampling, where a feature map with size of height × width
is downsampled into a 1 × 1 array by simply taking the average of all the elements in each
feature map, whereas the depth of feature maps is retained. This operation is typically applied
only once before the fully connected layers. The advantages of applying global average
pooling are that it reduces the number of learnable parameters and enables the CNN to
accept inputs of variable size.
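
The sketch below illustrates why global average pooling tolerates inputs of variable size: each feature map collapses to a single number, so the output length depends only on the depth. The feature-map values and the helper name are hypothetical.

```python
def global_average_pool(feature_maps):
    """Reduce each height x width feature map to its mean; one value per map."""
    return [
        sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
        for fmap in feature_maps
    ]

# Two hypothetical stacks of feature maps with the same depth (2) but different sizes.
small = [[[1, 2], [3, 4]],
         [[1, 1], [1, 1]]]
large = [[[2, 2, 2], [2, 2, 2], [2, 2, 2]],
         [[0, 3, 0], [3, 0, 3], [0, 3, 0]]]

pooled_small = global_average_pool(small)   # [2.5, 1.0]
pooled_large = global_average_pool(large)   # also two values
```

Both results have length 2, so the same fully connected layer could follow either input.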

Fully connected layer

The output feature maps of the final convolution or pooling layer are typically flattened, i.e.,
transformed into a one-dimensional (1D) array of numbers (or vector), and connected to one
or more fully connected layers, also known as dense layers, in which every input is connected
to every output by a learnable weight. Once the features extracted by the convolution layers
and downsampled by the pooling layers are created, they are mapped by a subset of fully
connected layers to the final outputs of the network, such as the probabilities for each class in
classification tasks. The final fully connected layer typically has the same number of output
nodes as the number of classes. Each fully connected layer is followed by a nonlinear
function, such as ReLU.
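
A minimal sketch of the flatten-then-fully-connected step is given below. The feature-map values, the three-class weight matrix and the use of a softmax to turn the final outputs into class probabilities are all assumptions made for the illustration; the text above only states that the final outputs can be class probabilities.

```python
import math

def flatten(feature_maps):
    """Turn a stack of 2D feature maps into one 1D list of numbers."""
    return [value for fmap in feature_maps for row in fmap for value in row]

def fully_connected(x, weights, biases):
    """Every input is connected to every output by its own learnable weight."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]

def softmax(z):
    """Convert raw scores to probabilities that sum to 1 (assumed output function)."""
    largest = max(z)
    exps = [math.exp(v - largest) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical input: two 2 x 2 feature maps, giving 8 flattened features.
feature_maps = [[[1.0, 0.0], [0.0, 1.0]],
                [[0.5, 0.5], [0.5, 0.5]]]
x = flatten(feature_maps)

# Hypothetical final layer with one output node per class (3 classes).
weights = [[0.1] * 8, [-0.1] * 8, [0.0] * 8]
biases = [0.0, 0.0, 0.0]

probs = softmax(fully_connected(x, weights, biases))
predicted_class = probs.index(max(probs))
```

The number of rows in `weights` equals the number of classes, matching the statement that the final fully connected layer has one output node per class.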

5. Training Methods of Artificial Neural Networks

Introduction

The dynamics of a neuron consist of two parts. One is the dynamics of the activation state and
the other is the dynamics of the synaptic weights.

The Short Term Memory (STM) in neural networks is modelled by the activation state of the
network, and the Long Term Memory (LTM) is the information encoded in the synaptic weights
through learning. The main property of an artificial neural network is its ability to learn
from its environment and history.

The network learns about its environment and history through an interactive process of
adjustments applied to its synaptic weights and bias levels.

Generally, the network becomes more knowledgeable about its environment and history
after each iteration of the learning process. It is important to distinguish between
representation and learning. Representation refers to the ability of a perceptron (or other
network) to simulate a specified function. Learning requires the existence of a systematic
procedure for adjusting the network weights to produce that function.

5.1 Definition of learning

Many activities are associated with the notion of learning. In the context of neural
networks, we define learning as follows:
“Learning is a process by which the free parameters of a neural network are adapted
through a process of stimulation by the environment in which the network is embedded.
The type of learning is determined by the manner in which the parameter changes take
place.”

Based on the above definition, the learning process of an ANN can be divided into the
following sequence of steps:

1. The ANN is stimulated by an environment.
2. The ANN undergoes changes in its free parameters as a result of the above
stimulation.
3. The ANN responds in a new way to the environment because of the changes that
have occurred in its internal structure.

5.2 Types of Learning Methods / Learning Strategies

A set of well-defined rules for the solution of a learning problem is called a learning
algorithm. There are different approaches to training an ANN. Most of the methods fall into
one of two classes, namely supervised learning and unsupervised learning.

Supervised learning

An external signal, known as the teacher, controls learning and supplies information about
the desired output.

Supervised training requires the pairing of each input vector with a target vector
representing the desired output; together these are called a training pair. Usually a
network is trained over a number of such training pairs. An input vector is applied, the
output of the network is calculated and compared to the corresponding target vector, and
the difference (error) is fed back through the network and the weights are changed according
to an algorithm that tends to minimize the error. The vectors of the training set are
applied sequentially, and errors are calculated and weights adjusted for each vector, until
the error for the entire training set is at an acceptably low value.
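
The supervised procedure described above can be sketched with a single linear neuron trained by the least mean squares (LMS) rule mentioned in Section 1.1. This is one simple instance of supervised training, not the only one; the training pairs, learning rate and stopping tolerance below are hypothetical.

```python
def train_lms(pairs, n_inputs, rate=0.1, tolerance=1e-6, max_epochs=10000):
    """Apply the training pairs sequentially, adjusting weights after each one."""
    weights = [0.0] * n_inputs
    bias = 0.0
    for epoch in range(max_epochs):
        total_sq_error = 0.0
        for x, target in pairs:
            output = sum(w * xi for w, xi in zip(weights, x)) + bias
            error = target - output                      # compare with the teacher
            weights = [w + rate * error * xi for w, xi in zip(weights, x)]
            bias += rate * error
            total_sq_error += error ** 2
        if total_sq_error < tolerance:                   # acceptably low error
            break
    return weights, bias, epoch + 1

# Hypothetical training pairs generated from target = 2*x1 - x2 + 0.5.
pairs = [([0.0, 0.0], 0.5),
         ([1.0, 0.0], 2.5),
         ([0.0, 1.0], -0.5),
         ([1.0, 1.0], 1.5)]

weights, bias, epochs = train_lms(pairs, n_inputs=2)
print(weights, bias, epochs)
```

After training, the weights and bias should be close to the values used to generate the targets (2, -1 and 0.5). The loop stops as soon as the summed squared error over one pass through the training set falls below the tolerance.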

Unsupervised learning

No external signal (teacher) is used in the learning process. The neural network relies upon
both internal and local information.

Unsupervised training is a far more plausible model of learning in the biological system.
Developed by Kohonen (1984) and many others, it requires no target vector for the outputs,
and hence, no comparisons to predetermined ideal responses. The training set consists solely
of input vectors. The training algorithm modifies network weights to produce output vectors
that are consistent; i.e., both application of one of the training vectors and application of a
vector that is sufficiently similar to it will produce the same pattern of outputs.

• The training process, therefore, extracts the statistical properties of the training set and
groups similar vectors into classes.

• Applying a vector from a given class as an input will produce a specific output vector, but
there is no way to determine prior to training which specific output pattern will be produced
by a given input vector class. Hence, the outputs of such a network must generally be
transformed into a comprehensible form subsequent to the training process.
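
One simple way to see these ideas in code is winner-take-all competitive learning, sketched below. It is used here as an illustrative unsupervised rule and is not claimed to be the specific algorithm the report has in mind; the input vectors, initial weights, learning rate and number of epochs are hypothetical.

```python
def nearest(prototypes, x):
    """Index of the output unit whose weight vector is closest to x."""
    distances = [sum((p - xi) ** 2 for p, xi in zip(proto, x)) for proto in prototypes]
    return distances.index(min(distances))

def train_competitive(inputs, prototypes, rate=0.5, epochs=10):
    """Winner-take-all: only the winning unit moves its weights toward the input."""
    prototypes = [list(p) for p in prototypes]   # work on a copy
    for _ in range(epochs):
        for x in inputs:
            winner = nearest(prototypes, x)
            prototypes[winner] = [p + rate * (xi - p)
                                  for p, xi in zip(prototypes[winner], x)]
    return prototypes

# Hypothetical training set: input vectors only, no target vectors.
inputs = [[0.0, 0.1], [1.0, 0.9], [0.1, 0.0], [0.9, 1.0], [0.1, 0.1], [0.9, 0.9]]
initial = [[0.4, 0.5], [0.6, 0.5]]

trained = train_competitive(inputs, initial)

class_a = nearest(trained, [0.05, 0.05])   # unseen vector similar to the first group
class_b = nearest(trained, [0.95, 0.95])   # unseen vector similar to the second group
```

Similar vectors end up activating the same output unit, but which unit index stands for which group depends on the initial weights. This is the reason the outputs have to be interpreted after training.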

5.3 Types of basic learning mechanisms


A feedforward neural net is an instance of supervised machine learning in which we
know the correct output y for each observation x. What the system produces is ŷ, the
system’s estimate of the true y. The goal of the training procedure is to learn
parameters W[i] and b[i] for each layer i that make ŷ for each training observation
as close as possible to the true y.

First, we’ll need a loss function that models the distance between the system output
and the gold output, and it’s common to use the loss function used for logistic
regression, the cross-entropy loss.

Second, to find the parameters that minimize this loss function, we’ll use the
gradient descent optimization algorithm.

Third, gradient descent requires knowing the gradient of the loss function, the vector
that contains the partial derivative of the loss function with respect to each of the
parameters. In logistic regression, for each observation we could directly compute
the derivative of the loss function with respect to an individual w or b. But for neural
networks, with millions of parameters in many layers, it’s much harder to see how to
compute the partial derivative of some weight in layer 1 when the loss is attached to
some much later layer. How do we partial out the loss over all those intermediate
layers?

The answer is the algorithm called error backpropagation or backward
differentiation.

Loss function

The cross-entropy loss that is used in neural networks is the same one as for logistic
regression. If the neural network is being used as a binary classifier, with the sigmoid at the
final layer, the loss function is the same as the logistic regression loss:

$$ L_{CE}(\hat{y}, y) = -\log p(y \mid x) = -\left[\, y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \,\right] $$

If we are using the network to classify into 3 or more classes, the loss function is exactly the
same as the loss for multinomial regression.

First, when we have more than 2 classes we’ll need to represent both y and ŷ as vectors.
Let’s assume we’re doing hard classification, where only one class is the correct one. The true
label y is then a vector with K elements, each corresponding to a class, with y_c = 1 if the
correct class is c, with all other elements of y being 0. Recall that a vector like this, with one
value equal to 1 and the rest 0, is called a one-hot vector. And our classifier will produce an
estimate vector ŷ with K elements, each element ŷ_k of which represents the estimated
probability p(y_k = 1|x). The loss function for a single example x is the negative sum of the
logs of the K output classes, each weighted by their probabilities:

$$ L_{CE}(\hat{y}, y) = -\sum_{k=1}^{K} y_k \log \hat{y}_k $$
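
Both forms of the cross-entropy loss can be sketched in a few lines. The function names and the example probabilities are illustrative only, and the multi-class version assumes a one-hot label vector as described above.

```python
import math

def binary_cross_entropy(y_hat, y):
    """Loss for one example with true label y in {0, 1} and estimate 0 < y_hat < 1."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def categorical_cross_entropy(y_hat, y):
    """Negative sum of y_k * log(y_hat_k); with a one-hot y only the true class contributes."""
    return -sum(y_k * math.log(p_k) for y_k, p_k in zip(y, y_hat) if y_k > 0)

# Hypothetical classifier outputs.
binary_loss = binary_cross_entropy(0.9, 1)                            # equals -log(0.9)
multi_loss = categorical_cross_entropy([0.7, 0.2, 0.1], [1, 0, 0])    # equals -log(0.7)
print(binary_loss, multi_loss)
```

With two classes the two functions agree, since a binary estimate ŷ corresponds to the probability vector [ŷ, 1 - ŷ].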

5.4 Computing the Gradient

Computing the gradient requires the partial derivative of the loss function with respect to
each parameter. For a network with one weight layer and sigmoid output (which is what
logistic regression is), we could simply use the derivative of the loss that we used for logistic
regression.

Or for a network with one weight layer and softmax output (that is, multinomial logistic
regression), we could use the derivative of the softmax loss shown for a particular weight $w_k$
and input $x_i$.
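
For the single-layer sigmoid case, the standard logistic regression result is that the derivative of the loss with respect to weight $w_j$ is $(\hat{y} - y)\,x_j$. The sketch below, with made-up weights and inputs, compares that analytic gradient against a finite-difference estimate; all names and values are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    """Binary cross-entropy of a one-layer sigmoid network on one example."""
    y_hat = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def gradient(w, b, x, y):
    """Analytic gradient: dL/dw_j = (y_hat - y) * x_j and dL/db = (y_hat - y)."""
    y_hat = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return [(y_hat - y) * xi for xi in x], (y_hat - y)

w, b, x, y = [0.5, -0.3], 0.1, [1.0, 2.0], 1     # hypothetical parameters and example
grad_w, grad_b = gradient(w, b, x, y)

# Central finite difference for dL/dw_0, used only as a sanity check
eps = 1e-6
numeric = (loss([w[0] + eps, w[1]], b, x, y)
           - loss([w[0] - eps, w[1]], b, x, y)) / (2 * eps)
print(grad_w[0], numeric)
```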

But these derivatives only give correct updates for one weight layer: the last one. For deep
networks, computing the gradients for each weight is much more complex, since we are
computing the derivative with respect to weight parameters that appear all the way back in
the very early layers of the network, even though the loss is computed only at the very end of
the network. The solution to computing this gradient is an algorithm called error
backpropagation or backprop (Rumelhart et al., 1986). While backprop was invented
specially for neural networks, it turns out to be the same as a more general procedure called
backward differentiation, which depends on the notion of computation graphs. Let’s see how
that works in the next subsection.
6. Backpropagation

In order to train a neural network to perform some task, we must adjust the weights of
each unit in such a way that the error between the desired output and the actual output is
reduced. This process requires that the neural network compute the error derivative of the
weights (EW). In other words, it must calculate how the error changes as each weight is
increased or decreased slightly. The back propagation algorithm is the most widely used
method for determining the EW. The backpropagation algorithm is easiest to understand
if all the units in the network are linear. The algorithm computes each EW by first
computing the EA, the rate at which the error changes as the activity level of a unit is
changed. For output units, the EA is simply the difference between the actual and the
desired output. To compute the EA for a hidden unit in the layer just before the output
layer, we first identify all the weights between that hidden unit and the output units to
which it is connected. We then multiply those weights by the EAs of those output units
and add the products. This sum equals the EA for the chosen hidden unit. After
calculating all the EAs in the hidden layer just before the output layer, we can compute in
like fashion the EAs for other layers, moving from layer to layer in a direction opposite
to the way activities propagate through the network. This is what gives backpropagation
its name. Once the EA has been computed for a unit, it is straightforward to compute the
EW for each incoming connection of the unit. The EW is the product of the EA and the
activity through the incoming connection.
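
The procedure described above can be sketched for a tiny all-linear network. The example assumes the error is half the sum of squared differences between actual and desired outputs (which makes the EA of an output unit exactly that difference); the layer sizes, weights, and variable names are made up for illustration.

```python
def forward(x, W1, W2):
    """Linear units: each activity is just the weighted sum of its inputs."""
    h = [sum(x[k] * W1[k][i] for k in range(len(x))) for i in range(len(W1[0]))]
    y = [sum(h[i] * W2[i][j] for i in range(len(h))) for j in range(len(W2[0]))]
    return h, y

def error(y, d):
    """Assumed error measure: half the sum of squared differences."""
    return 0.5 * sum((yj - dj) ** 2 for yj, dj in zip(y, d))

x, d = [1.0, 2.0], [0.5]             # hypothetical input and desired output
W1 = [[0.1, -0.2], [0.4, 0.3]]       # W1[k][i]: input k -> hidden unit i
W2 = [[0.7], [-0.5]]                 # W2[i][j]: hidden unit i -> output unit j

h, y = forward(x, W1, W2)

# EA of each output unit: actual minus desired output
EA_out = [yj - dj for yj, dj in zip(y, d)]
# EA of each hidden unit: outgoing weights times the EAs of the output units, summed
EA_hid = [sum(W2[i][j] * EA_out[j] for j in range(len(EA_out))) for i in range(len(h))]
# EW of each connection: EA of the receiving unit times the activity on the connection
EW2 = [[h[i] * EA_out[j] for j in range(len(EA_out))] for i in range(len(h))]
EW1 = [[x[k] * EA_hid[i] for i in range(len(h))] for k in range(len(x))]

# Finite-difference check on one first-layer weight
eps = 1e-6
W1_plus = [row[:] for row in W1]
W1_plus[0][1] += eps
W1_minus = [row[:] for row in W1]
W1_minus[0][1] -= eps
numeric = (error(forward(x, W1_plus, W2)[1], d)
           - error(forward(x, W1_minus, W2)[1], d)) / (2 * eps)
print(EW1[0][1], numeric)
```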

Rojas [2005] described the BP algorithm as four main steps. After the weights of the
network are chosen randomly, the backpropagation algorithm is used to compute the
necessary corrections. The algorithm can be decomposed into the following four steps:

i. Feed-forward computation
ii. Backpropagation to the output layer
iii. Backpropagation to the hidden layer
iv. Weight updates

The algorithm is stopped when the value of the error function has become
sufficiently small. This is a very rough, basic outline of the BP algorithm. Other
researchers have proposed variations, but Rojas's formulation is accurate and easy
to follow. The last step, the weight update, is repeated throughout the run of the
algorithm.

Units are connected to one another. Connections correspond to the edges of the
underlying directed graph. There is a real number associated with each
connection, which is called the weight of the connection. We denote by $w_{ij}$ the
weight of the connection from unit $u_i$ to unit $u_j$. It is then convenient to represent
the pattern of connectivity in the network by a weight matrix W whose elements
are the weights $w_{ij}$. Two types of connection are usually distinguished: excitatory
and inhibitory. A positive weight represents an excitatory connection, whereas a
negative weight represents an inhibitory connection. The pattern of connectivity
characterises the architecture of the network.

A unit in the output layer determines its activity by following a two step procedure.

First, it computes the total weighted input $x_j$, using the formula:

$$x_j = \sum_i y_i W_{ij}$$

where $y_i$ is the activity level of the $i$th unit in the previous layer and $W_{ij}$ is the weight of the
connection between the $i$th and the $j$th unit.

Second, the unit computes its activity $y_j$ by applying its activation function to $x_j$; for a
linear unit, $y_j = x_j$.

Once the activities of all output units have been determined, the network computes the error
E, which is defined by the expression:

$$E = \frac{1}{2} \sum_j (y_j - d_j)^2$$

where $y_j$ is the activity level of the $j$th unit in the top layer and $d_j$ is the desired output of the $j$th
unit.

The back-propagation algorithm consists of four steps:

1. Compute how fast the error changes as the activity of an output unit is changed. This error
derivative (EA) is the difference between the actual and the desired activity.

2. Compute how fast the error changes as the total input received by an output unit is
changed. This quantity (EI) is the answer from step 1 multiplied by the rate at which the
output of a unit changes as its total input is changed.

3. Compute how fast the error changes as a weight on the connection into an output unit is
changed. This quantity (EW) is the answer from step 2 multiplied by the activity level of the
unit from which the connection emanates.

4. Compute how fast the error changes as the activity of a unit in the previous layer
is changed. This crucial step allows backpropagation to be applied to multilayer
networks. When the activity of a unit in the previous layer changes, it affects the
activities of all the output units to which it is connected. So to compute the
overall effect on the error, we add together all these separate effects on output
units. But each effect is simple to calculate. It is the answer in step 2 multiplied by
the weight on the connection to that output unit.

By using steps 2 and 4, we can convert the EAs of one layer of units into EAs for the
previous layer. This procedure can be repeated to get the EAs for as many previous layers as
desired. Once we know the EA of a unit, we can use steps 2 and 3 to compute the EWs on its
incoming connections.
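
The four quantities can be computed in a few lines. The sketch below assumes sigmoid output units (so the rate at which a unit's output changes with its total input is y(1 - y)) and an error of half the sum of squared differences; the activities, weights, and targets are made-up values.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(y_prev, W):
    """Activities of the output units given the previous layer's activities."""
    n_out = len(W[0])
    x_out = [sum(y_prev[i] * W[i][j] for i in range(len(y_prev))) for j in range(n_out)]
    return [sigmoid(x) for x in x_out]

def error(y_out, d):
    """Assumed error measure: half the sum of squared differences."""
    return 0.5 * sum((yj - dj) ** 2 for yj, dj in zip(y_out, d))

y_prev = [0.2, 0.9]                  # hypothetical activities of the previous layer
W = [[0.5, -0.4], [0.3, 0.8]]        # W[i][j]: previous unit i -> output unit j
d = [1.0, 0.0]                       # desired outputs

y_out = forward(y_prev, W)
n_prev, n_out = len(y_prev), len(y_out)

# Step 1: EA = actual activity minus desired activity
EA = [y_out[j] - d[j] for j in range(n_out)]
# Step 2: EI = EA times the slope of the sigmoid, y * (1 - y)
EI = [EA[j] * y_out[j] * (1.0 - y_out[j]) for j in range(n_out)]
# Step 3: EW = EI times the activity of the unit the connection comes from
EW = [[EI[j] * y_prev[i] for j in range(n_out)] for i in range(n_prev)]
# Step 4: EA of a previous-layer unit = sum over output units of EI times the weight
EA_prev = [sum(EI[j] * W[i][j] for j in range(n_out)) for i in range(n_prev)]

# Finite-difference check of step 4 for the first previous-layer unit
eps = 1e-6
numeric_EA_prev0 = (error(forward([y_prev[0] + eps, y_prev[1]], W), d)
                    - error(forward([y_prev[0] - eps, y_prev[1]], W), d)) / (2 * eps)
print(EA_prev[0], numeric_EA_prev0)
```

Repeating steps 2 and 4 with `EA_prev` in place of `EA` would carry the computation back one more layer.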

7. Perceptron
The most influential work on neural nets in the 1960s went under the heading of
'perceptron', a term coined by Frank Rosenblatt. The perceptron turns out to be an
MCP model (a neuron with weighted inputs) with some additional, fixed
pre-processing.
Units labelled A1, A2, Aj, Ap are called association units, and their task is to extract
specific, localised features from the input images. Perceptrons mimic the basic idea
behind the mammalian visual system. They were mainly used in pattern recognition,
even though their capabilities extended much further.

In 1969 Minsky and Papert wrote a book in which they described the limitations of
single-layer perceptrons. The book had a tremendous impact and caused many
neural network researchers to lose interest. It was very well written and showed
mathematically that single-layer perceptrons could not perform some basic pattern
recognition operations, such as determining the parity of a shape or determining
whether a shape is connected. What was not realised until the 1980s is that, given
the appropriate training, multilevel perceptrons can perform these operations.
a. Perceptron Model
In the 1960s, perceptrons created a great deal of interest and optimism. Rosenblatt
(1962) proved a remarkable theorem about perceptron learning. Widrow (Widrow
1961, 1963, Widrow and Angell 1962, Widrow and Hoff 1960) made a number of
convincing demonstrations of perceptron-like systems. Perceptron learning is of the
supervised type. A perceptron is trained by presenting a set of patterns to its input,
one at a time, and adjusting the weights until the desired output occurs for each of
them. Its synaptic weights are denoted by w1, w2, . . ., wn. The inputs applied to the
perceptron are denoted by x1, x2, . . ., xn. The externally applied bias is denoted by
b.

The net input to the activation function of the neuron is written as

$$net = \sum_{i=1}^{n} w_i x_i + b$$

The output of the perceptron is written as

$$o = f(net)$$

where f(.) is the activation function of the perceptron. Depending upon the type of
activation function, the perceptron may be classified into two types:
(i) Discrete perceptron, in which the activation function is the hard limiter or
sgn(.) function
(ii) Continuous perceptron, in which the activation function is the sigmoid
function, which is differentiable.

The input-output relation may be rearranged by considering w0 = b and a fixed
input x0 = 1.0. Then

$$o = f(net) = f(WX)$$

where W = [w0, w1, w2, . . ., wn] and X = [x0, x1, x2, . . ., xn]^T
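
To make the two forms concrete, the sketch below computes the net input once with an explicit bias and once with the augmented vectors, then applies both kinds of activation. The weights and inputs are made-up values, and sgn is taken to return -1 at zero, matching the convention used later that WX = 0 is assigned to the second class.

```python
import math

def net_input(w, x, b):
    """Weighted sum of the inputs plus the bias."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def sgn(v):
    """Hard limiter: +1 for a positive net input, -1 otherwise."""
    return 1 if v > 0 else -1

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

w, b, x = [0.4, -0.6], 0.1, [1.0, 0.5]   # hypothetical weights, bias, and input
net = net_input(w, x, b)

# Augmented form: fold the bias into the weight vector with w0 = b and x0 = 1
W = [b] + w
X = [1.0] + x
net_aug = sum(Wi * Xi for Wi, Xi in zip(W, X))

o_discrete = sgn(net)          # output of a discrete perceptron
o_continuous = sigmoid(net)    # output of a continuous perceptron
print(net, net_aug, o_discrete, o_continuous)
```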

Single Layer Discrete Perceptron

For the discrete perceptron the activation function is the hard limiter or sgn() function. The
most common application of the discrete perceptron is pattern classification. To develop insight into
the behaviour of a pattern classifier, it is necessary to plot a map of the decision regions in the
n-dimensional space spanned by the n input variables. The two decision regions are separated by a
hyperplane defined by

$$\sum_{i=1}^{n} w_i x_i + b = 0$$

This is illustrated in the figure for two input variables x1 and x2, for which the decision
boundary takes the form of a straight line.

Fig: Illustration of the hyperplane (in this example, a straight line) as the decision boundary for
a two-dimensional, two-class pattern classification problem.
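
For two inputs, the boundary equation can be solved for x2 to obtain the straight line. A minimal sketch, with hypothetical weights and helper names chosen only for illustration:

```python
def boundary_x2(w1, w2, b, x1):
    """Solve w1*x1 + w2*x2 + b = 0 for x2 (requires w2 != 0)."""
    return -(w1 * x1 + b) / w2

def classify(w1, w2, b, x1, x2):
    """Points with a positive net input fall on the C1 side, all others on the C2 side."""
    return "C1" if w1 * x1 + w2 * x2 + b > 0 else "C2"

w1, w2, b = 1.0, 1.0, -1.5               # hypothetical weights and bias
on_line = boundary_x2(w1, w2, b, 0.5)    # x2 value of the boundary at x1 = 0.5, here 1.0
print(on_line, classify(w1, w2, b, 1, 1), classify(w1, w2, b, 0, 0))
```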

For the perceptron to function properly, the two classes C1 and C2 must be linearly separable.
This, in turn, means that the patterns to be classified must be sufficiently separated from each
other to ensure that the decision surface consists of a hyperplane. This is illustrated in the figure.

Fig: (a) A pair of linearly separable patterns. (b) A pair of nonlinearly separable patterns.

In Figure (a), the two classes C1 and C2 are sufficiently separated from each other to draw a
hyperplane (in this case a straight line) as the decision boundary. If, however, the two classes
C1 and C2 are allowed to move too close to each other, as in Figure (b), they become nonlinearly
separable, a situation that is beyond the computing capability of the perceptron.

Suppose then that the input variables of the perceptron originate from two linearly separable
classes. Let æ1 be the subset of training vectors X1(1), X1(2), . . . that belong to class
C1, and let æ2 be the subset of training vectors X2(1), X2(2), . . . that belong to class C2. The
union of æ1 and æ2 is the complete training set æ.

Given the sets of vectors æ1 and æ2 to train the classifier, the training process involves
adjusting the weight vector W in such a way that the two classes C1 and C2 are linearly separated.
That is, there exists a weight vector W such that we may write

WX > 0 for every input vector X belonging to class C1

WX <= 0 for every input vector X belonging to class C2

In the second condition, it is arbitrarily chosen to say that the input vector X belongs to class
C2 if WX = 0.

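The algorithm for updating the weights may be formulated as follows:

1. If the kth member of the training set, $X_k$, is correctly classified by the weight vector $W_k$
computed at the kth iteration of the algorithm, no correction is made to the weight vector
of the perceptron, in accordance with the rule

$$W_{k+1} = W_k \quad \text{if } W_k X_k > 0 \text{ and } X_k \text{ belongs to class } C_1$$

$$W_{k+1} = W_k \quad \text{if } W_k X_k \le 0 \text{ and } X_k \text{ belongs to class } C_2$$

2. Otherwise, the weight vector of the perceptron is updated in accordance with the rule

$$W_{k+1}^T = W_k^T - \eta X_k \quad \text{if } W_k X_k > 0 \text{ and } X_k \text{ belongs to class } C_2$$

$$W_{k+1}^T = W_k^T + \eta X_k \quad \text{if } W_k X_k \le 0 \text{ and } X_k \text{ belongs to class } C_1$$

where the learning-rate parameter $\eta$ controls the adjustment applied to the weight vector.
These equations may be written generally as

$$W_{k+1}^T = W_k^T + \frac{\eta}{2}\,(d_k - o_k)\,X_k$$

where $d_k$ is the desired output and $o_k$ is the actual output for $X_k$, each taking the value +1
(class C1) or -1 (class C2), so that the correction vanishes when the pattern is correctly classified.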
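
The general update rule can be tried on a small linearly separable problem. The sketch below assumes bipolar targets (+1 and -1), augmented input vectors with x0 = 1, zero initial weights, and a learning rate of 1; the AND-gate data set and the function names are illustrative choices, not taken from the text above.

```python
def sgn(v):
    """Hard limiter: +1 for a positive net input, -1 otherwise."""
    return 1 if v > 0 else -1

def update(W, X, d, eta):
    """One step of W(k+1) = W(k) + (eta / 2) * (d - o) * X, with d and o in {+1, -1}."""
    o = sgn(sum(wi * xi for wi, xi in zip(W, X)))
    return [wi + 0.5 * eta * (d - o) * xi for wi, xi in zip(W, X)]

def train(samples, eta=1.0, max_epochs=100):
    """Sweep the training set until a full pass makes no correction."""
    W = [0.0] * len(samples[0][0])
    for _ in range(max_epochs):
        changed = False
        for X, d in samples:
            W_new = update(W, X, d, eta)
            if W_new != W:
                changed = True
            W = W_new
        if not changed:
            break
    return W

# AND gate with bipolar targets; each input vector is augmented with x0 = 1
samples = [([1, 0, 0], -1), ([1, 0, 1], -1), ([1, 1, 0], -1), ([1, 1, 1], 1)]
W = train(samples)
print(W)
```

When the pattern is classified correctly, d and o are equal and the weights are left unchanged, which reproduces the first case of the rule.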

Limitations of perceptron

There are, however, limitations to the capabilities of the perceptron. It will learn a solution if
there is a solution to be found.

First, the output of a perceptron can take on only one of two values (True or False).

Second, a perceptron can only classify linearly separable sets of vectors. If a straight line or
plane can be drawn to separate the input vectors into their correct categories, the input vectors
are linearly separable and the perceptron will find the solution. If the vectors are not linearly
separable, learning will never reach a point where all vectors are classified properly.

The most famous example of the perceptron's inability to solve problems with linearly
non-separable vectors is the Boolean exclusive-OR problem.

Consider the case of the exclusive-OR (XOR) problem. The XOR logic function has two
inputs and one output, as shown below.

x   y   output
0   0   0
0   1   1
1   0   1
1   1   0

It produces an output only if either one or the other of the inputs is on, but not if both are off
or both are on, as shown in the table above.

We can consider this as a problem that we want the perceptron to learn to solve: output a 1
if x is on and y is off, or if y is on and x is off; otherwise output a 0. It appears to be a
simple enough problem. We can draw it in pattern space.

The x-axis represents the value of x, and the y-axis represents the value of y. One set of
circles marks the inputs that produce an output of 1, whilst the other set marks the
inputs that produce an output of 0. Considering the two sets of circles as
separate classes, we find that we cannot draw a straight line to separate them. Such
patterns are known as linearly inseparable, since no straight line can divide them up
successfully. Since we cannot divide them with a single straight line, the perceptron will not
be able to find any such line either, and so cannot solve such a problem. In fact, a single-layer
perceptron cannot solve any problem that is linearly inseparable.
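
By contrast, adding one hidden layer is enough to compute XOR. The sketch below uses hand-chosen weights rather than learned ones: one hidden unit acts as OR, the other as NAND, and the output unit takes the AND of the two. The specific thresholds are one illustrative choice among many.

```python
def step(v):
    """Threshold unit: 1 if the net input is non-negative, else 0."""
    return 1 if v >= 0 else 0

def xor_net(x, y):
    h_or = step(x + y - 0.5)             # fires if at least one input is on
    h_nand = step(-x - y + 1.5)          # fires unless both inputs are on
    return step(h_or + h_nand - 1.5)     # fires only if both hidden units fire

outputs = [xor_net(x, y) for x, y in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(outputs)  # [0, 1, 1, 0]
```

Each hidden unit draws one straight line in pattern space, and the output unit responds only to the region between the two lines, which a single line cannot isolate.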
8. Application of Perceptron

Logic Gate Implementation - (AND, OR, NAND, NOR)

import numpy as np
import matplotlib.pyplot as plt

# Step function (activation function)
def step_function(x):
    return np.where(x >= 0, 1, 0)

# Perceptron training function
def perceptron_train(X, y, epochs=10, lr=0.1):
    weights = np.random.rand(3)  # weights[0] is the bias weight
    for _ in range(epochs):
        for i in range(len(X)):
            x_i = np.insert(X[i], 0, 1)  # Prepend the constant bias input 1
            y_pred = step_function(np.dot(weights, x_i))
            error = y[i] - y_pred        # 0 when the prediction is correct
            weights += lr * error * x_i
    return weights

# Function to plot the decision boundary
def plot_decision_boundary(X, y, weights, gate_name):
    x1 = np.linspace(-0.1, 1.1, 100)
    # Solve weights[1]*x1 + weights[2]*x2 + weights[0] = 0 for x2 (assumes weights[2] != 0)
    x2 = -(weights[1] * x1 + weights[0]) / weights[2]
    plt.figure(figsize=(5, 5))
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap='bwr', edgecolors='k')
    plt.plot(x1, x2, 'g-', label='Decision Boundary')
    plt.xlabel("Input 1")
    plt.ylabel("Input 2")
    plt.title(f"Perceptron Decision Boundary for {gate_name} Gate")
    plt.legend()
    plt.show()

# Logic gate inputs and outputs
inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
logic_gates = {
    "AND": (inputs, np.array([0, 0, 0, 1])),
    "OR": (inputs, np.array([0, 1, 1, 1])),
    "NAND": (inputs, np.array([1, 1, 1, 0])),
    "NOR": (inputs, np.array([1, 0, 0, 0])),
}

# Train a perceptron and plot the boundary for each gate
for gate, (X, y) in logic_gates.items():
    # 10 epochs may be too few to converge from a random start, so train for longer
    weights = perceptron_train(X, y, epochs=100)
    print(f"{gate} Gate Weights: {weights}")
    equation = (f"Decision Boundary Equation: {weights[1]}*x1 + {weights[2]}*x2 + "
                f"{weights[0]} = 0")
    print(equation)
    plot_decision_boundary(X, y, weights, gate)



XOR problem using a single-layer Perceptron

import numpy as np
import matplotlib.pyplot as plt

# XOR dataset
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])  # XOR output

# Initialize perceptron weights and bias
weights = np.random.randn(2)
bias = np.random.randn()
learning_rate = 0.1
epochs = 10

# Activation function (step function)
def step_function(z):
    return 1 if z >= 0 else 0

# Perceptron training
errors = []
for epoch in range(epochs):
    total_error = 0
    for i in range(len(X)):
        z = np.dot(X[i], weights) + bias
        y_pred = step_function(z)
        error = y[i] - y_pred
        total_error += abs(error)
        # Update rule
        weights += learning_rate * error * X[i]
        bias += learning_rate * error
    errors.append(total_error)

    # Plot the decision boundary reached at the end of this epoch
    plt.figure()
    x_vals = np.linspace(-0.5, 1.5, 100)
    if weights[1] != 0:
        y_vals = -(weights[0] * x_vals + bias) / weights[1]
        plt.plot(x_vals, y_vals, '--', label=f'Epoch {epoch + 1}')
    else:
        # Vertical boundary at x1 = -bias / weights[0] (assumes weights[0] != 0)
        plt.axvline(-bias / weights[0], linestyle='--', label=f'Epoch {epoch + 1}')

    # Plot data points, drawing each one as its class label
    for j in range(len(X)):
        marker = '$0$' if y[j] == 0 else '$1$'
        plt.scatter(X[j][0], X[j][1], marker=marker, s=200)
    plt.xlim(-0.5, 1.5)
    plt.ylim(-0.5, 1.5)
    plt.xlabel('X1')
    plt.ylabel('X2')
    plt.title(f'Perceptron Decision Boundary at Epoch {epoch + 1}')
    plt.legend()
    plt.grid()
    plt.show()

# Error plot
plt.figure()
plt.plot(range(1, epochs + 1), errors, marker='o')
plt.xlabel('Epochs')
plt.ylabel('Total Errors')
plt.title('Perceptron Training Error on XOR')
plt.grid()
plt.show()

print("Final Weights:", weights)
print("Final Bias:", bias)
print("Perceptron fails to separate XOR correctly!")



Classifying Hand-written digits using a simple perceptron model

import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.datasets import mnist

# Load MNIST dataset
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Flatten each 28x28 image to a 784-vector and scale pixel values from 0-255 to 0-1
x_train = x_train.reshape(x_train.shape[0], -1) / 255.0
x_test = x_test.reshape(x_test.shape[0], -1) / 255.0

# Multiclass perceptron: one weight vector and one bias per digit class
class Perceptron:
    def __init__(self, input_size, num_classes, lr=0.01, epochs=10):
        self.lr = lr
        self.epochs = epochs
        self.weights = np.zeros((num_classes, input_size))
        self.bias = np.zeros(num_classes)

    def train(self, x_train, y_train):
        for epoch in range(self.epochs):
            for i in range(len(x_train)):
                xi = x_train[i]
                yi = y_train[i]
                scores = np.dot(self.weights, xi) + self.bias
                y_pred = np.argmax(scores)
                if y_pred != yi:
                    # Move the true class towards the example and the
                    # wrongly predicted class away from it
                    self.weights[yi] += self.lr * xi
                    self.bias[yi] += self.lr
                    self.weights[y_pred] -= self.lr * xi
                    self.bias[y_pred] -= self.lr

    def predict(self, x):
        scores = np.dot(self.weights, x.T) + self.bias[:, np.newaxis]
        return np.argmax(scores, axis=0)

# Initialize and train perceptron
num_classes = 10
input_size = x_train.shape[1]
perceptron = Perceptron(input_size, num_classes, lr=0.01, epochs=10)
perceptron.train(x_train, y_train)

# Make predictions
y_pred = perceptron.predict(x_test)

# Calculate accuracy
accuracy = np.mean(y_pred == y_test) * 100
print(f"Model Accuracy: {accuracy:.2f}%")

# Display some test images with true and predicted labels
fig, axes = plt.subplots(3, 3, figsize=(8, 8))
for i, ax in enumerate(axes.flat):
    ax.imshow(x_test[i].reshape(28, 28), cmap='gray')
    ax.set_title(f"True: {y_test[i]} | Pred: {y_pred[i]}")
    ax.axis('off')
plt.show()

Binary classification of linearly separable data using Perceptron

import numpy as np
import matplotlib.pyplot as plt

# Generate synthetic linearly separable data
def generate_data(n):
    np.random.seed(0)
    X = np.random.randn(n, 2)
    y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)  # Linear boundary x + y = 0
    return X, y

# Perceptron algorithm (labels are +1 / -1)
def perceptron(X, y, epochs=10, lr=0.1):
    w = np.zeros(X.shape[1])  # Initialize weights
    b = 0.0                   # Initialize bias
    losses = []
    for epoch in range(epochs):
        loss = 0
        for i in range(len(y)):
            # A point is misclassified when its label and its score disagree in sign
            if y[i] * (np.dot(w, X[i]) + b) <= 0:
                w += lr * y[i] * X[i]
                b += lr * y[i]
                loss += 1
        losses.append(loss)
        # Plot decision boundary for each epoch
        plot_decision_boundary(X, y, w, b, epoch)
        print(f"Epoch {epoch+1}: Decision boundary equation: "
              f"{w[0]:.2f}x + {w[1]:.2f}y + {b:.2f} = 0")
    return w, b, losses

# Plot decision boundary
def plot_decision_boundary(X, y, w, b, epoch):
    plt.figure()
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap='bwr', edgecolors='k')
    x_vals = np.linspace(min(X[:, 0]), max(X[:, 0]), 100)
    y_vals = -(w[0] * x_vals + b) / w[1] if w[1] != 0 else np.zeros_like(x_vals)
    plt.plot(x_vals, y_vals, 'g-', label=f'Epoch {epoch+1}')
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.legend()
    plt.title(f'Decision Boundary at Epoch {epoch+1}')
    plt.show()

# Main function
def main():
    X, y = generate_data(100)
    w, b, losses = perceptron(X, y, epochs=10, lr=0.1)

    # Plot loss over epochs
    plt.figure()
    plt.plot(range(1, len(losses) + 1), losses, marker='o', linestyle='-')
    plt.xlabel('Epoch')
    plt.ylabel('Misclassification Count')
    plt.title('Loss Over Epochs')
    plt.show()

# Run the experiment when the file is executed as a script
if __name__ == "__main__":
    main()
