
Unit 1

The document provides an overview of deep learning, its relationship with machine learning and artificial intelligence, and the historical development of neural networks. It discusses key concepts such as the McCulloch-Pitts neuron, artificial neural networks, and the differences between machine learning and deep learning. Additionally, it highlights the evolution of deep learning techniques and their applications in various fields, along with current trends and future directions.

UNIT-I

Topics:
Introduction: History of Deep Learning, McCulloch Pitts Neuron, Multilayer
Perceptrons (MLPs), Sigmoid Neurons, Feed Forward Neural Networks, Back
Propagation.

Introduction:
Relation between Deep Learning and Machine Learning:

From the figure above, we can say that Deep Learning is a subset of
Machine Learning and in turn Machine Learning is a subset of Artificial
Intelligence.
You can think of them as a series of nested concentric circles, with
AI occupying the largest, followed by machine learning, then deep learning. In
other words, all deep learning is AI, but not all AI is deep learning.
What is Artificial Intelligence?
At its most basic level, the field of artificial intelligence uses computer
science and data to enable problem solving in machines.
While we don’t yet have human-like robots trying to take over the world,
we do have examples of AI all around us. These could be as simple as a
computer program that can play chess, or as complex as an algorithm that can
predict the RNA structure of a virus to help develop vaccines.
For a machine or program to improve on its own without further input
from human programmers, we need machine learning.
What is Machine Learning?
Machine learning refers to the study of computer systems that learn and
adapt automatically from experience without being explicitly programmed.
With simple AI, a programmer can tell a machine how to respond to
various sets of instructions by hand-coding each “decision.” With machine
learning models, computer scientists can “train” a machine by feeding it large
amounts of data. The machine follows a set of rules—called an algorithm—to
analyze and draw inferences from the data. The more data the machine parses,
the better it can become at performing a task or making a decision.
Here’s one example you may be familiar with: Music streaming service
Spotify learns your music preferences to offer you new suggestions. Each time
you indicate that you like a song by listening through to the end or adding it to
your library, the service updates its algorithms to feed you more accurate
recommendations. Netflix and Amazon use similar machine learning algorithms
to offer personalized recommendations.
What is Deep Learning?
Where machine learning algorithms generally need human correction
when they get something wrong, deep learning algorithms can improve their
outcomes through repetition, without human intervention.
A machine learning algorithm can learn from relatively small sets of data,
but a deep learning algorithm requires big data sets that might include diverse
and unstructured data.
Think of deep learning as an evolution of machine learning. Deep
learning is a machine learning technique that layers algorithms and
computing units—or neurons—into what is called an artificial neural
network.
These deep neural networks take inspiration from the structure of the
human brain. Data passes through this web of interconnected algorithms in a
non-linear fashion, much like how our brains process information.

Machine Learning vs Deep Learning:


Machine Learning | Deep Learning
Applies statistical algorithms to learn the hidden patterns and relationships in the dataset. | Uses an artificial neural network architecture to learn the hidden patterns and relationships in the dataset.
Can work with a smaller amount of data. | Requires a larger volume of data than machine learning.
Better for low-label tasks. | Better for complex tasks such as image processing and natural language processing.
Takes less time to train the model. | Takes more time to train the model.
The model is built from relevant features that are manually extracted from images to detect an object in the image. | Relevant features are automatically extracted from images. It is an end-to-end learning process.
Less complex, and the result is easy to interpret. | More complex; it works like a black box, so the result is not easy to interpret.
Can run on a CPU and requires less computing power than deep learning. | Requires a high-performance computer with a GPU.


History of Deep Learning:


The history of deep learning is a fascinating journey that spans several decades,
marked by significant milestones, technological advances, and shifts in research
paradigms. Here's a detailed overview:
Early Foundations (1940s-1980s)
1. 1940s-1950s: Theoretical Beginnings
o McCulloch and Pitts (1943): Proposed the first mathematical
model of a neuron, laying the groundwork for neural networks.
o Hebbian Learning (1949): Donald Hebb introduced the idea that
neural pathways strengthen with repeated activation, summarized
as "cells that fire together, wire together."
2. 1950s-1960s: Perceptrons and Initial Models
o Perceptron (1958): Frank Rosenblatt developed the perceptron, an
early neural network capable of binary classification. It could learn
through supervised learning but was limited to linearly separable
data.
o Marvin Minsky and Seymour Papert (1969): Their book
"Perceptrons" highlighted the limitations of single-layer networks,
particularly their inability to solve problems like the XOR function,
leading to a decline in interest.
The AI Winter (1970s-1980s)
3. 1970s-1980s: Reduced Interest
o Funding and research in neural networks dwindled during this
period, known as the "AI winter." Interest shifted towards symbolic
AI and expert systems, which seemed more promising.
Resurgence and Theoretical Advances (1980s-2000s)
4. 1980s: Backpropagation
o Backpropagation Algorithm (1986): David Rumelhart, Geoffrey
Hinton, and Ronald Williams revived interest in neural networks
by introducing backpropagation, a method for training multi-layer
networks.
o This allowed for the training of deeper networks, making it feasible
to tackle more complex problems.
5. Late 1980s-1990s: Theoretical Developments
o Researchers explored various architectures, including recurrent
neural networks (RNNs) and convolutional neural networks
(CNNs), laying the groundwork for later advancements.
The Deep Learning Revolution (2000s-2010s)
6. 2006: The Term "Deep Learning"
o Geoffrey Hinton and his team published a paper introducing the
concept of "deep belief networks," which helped popularize
the term "deep learning."
o This period saw the resurgence of interest in neural networks due
to increased computational power and the availability of large
datasets.
7. 2012: Breakthroughs in Computer Vision
o AlexNet: Hinton's team won the ImageNet competition with a deep
convolutional neural network, significantly outperforming
traditional methods and highlighting the effectiveness of deep
learning for image classification.
o This victory ignited widespread interest in deep learning across
various fields.
Expanding Applications (2010s-Present)
8. 2014: Advancements in Generative Models
o Generative Adversarial Networks (GANs): Ian Goodfellow and
colleagues introduced GANs, allowing for the generation of realistic
images and other data types, further broadening the scope of deep
learning applications.
9. 2015-2018: Transformers and Natural Language Processing
o The introduction of the Transformer architecture by Vaswani et al.
(2017) revolutionized natural language processing (NLP), leading
to models like BERT and GPT, which could understand and
generate human-like text.
o These models leveraged attention mechanisms, enabling them to
handle long-range dependencies in sequences effectively.
10. 2020s: Continued Growth and Integration
o Deep learning continued to permeate various domains, including
healthcare, finance, and autonomous systems.
o Advances in hardware (like GPUs and TPUs) and frameworks (like
TensorFlow and PyTorch) have made deep learning more
accessible to researchers and practitioners.
Current Trends and Future Directions
• Ethics and Fairness: As deep learning becomes more integrated into
society, issues of bias, fairness, and ethical implications are at the
forefront of discussions.
• Self-supervised Learning: Techniques that leverage unlabelled data are
gaining traction, reducing reliance on labelled datasets.
• Multimodal Learning: Integrating multiple types of data (text, images,
audio) is becoming increasingly important for building more robust AI
systems.

McCulloch-Pitts Neuron:


The McCulloch-Pitts neuron, introduced by Warren McCulloch and Walter Pitts
in 1943, is a foundational concept in neural networks and artificial intelligence.
It was based on the functionality of a biological neuron.
Biological Neuron:

Dendrite: Receives signals from other neurons.


Soma: Processes the information.
Axon: Transmits the output of this neuron.
Synapse: Point of connection to other neurons.
Basically, a neuron takes an input signal (dendrite), processes it like a
CPU (soma), and passes the output through a cable-like structure (axon) to
other connected neurons (via a synapse to another neuron’s dendrite).
Now, this might be biologically inaccurate as there is a lot more going on
out there, but at a higher level this is what a neuron in our
brain does: it takes an input, processes it, and throws out an output.
Our sense organs interact with the outer world and send the visual and
sound information to the neurons.
Let's say you are watching Friends. Now the information your brain
receives is taken in by the “laugh or not” set of neurons that will help you make
a decision on whether to laugh or not. Each neuron gets fired/activated only
when its respective criterion is met, as shown below.
Of course, this is not entirely true. In reality, it is not just a couple of
neurons which would do the decision making. There is a massively parallel
interconnected network of 10¹¹ neurons (100 billion) in our brain and their
connections are not as simple as shown in the above figure. It might look
something like this:

Now the sense organs pass the information to the first/lowest layer of
neurons to process it. And the output of the processes is passed on to the next
layers in a hierarchical manner, some of the neurons will fire and some won’t
and this process goes on until it results in a final response — in this case,
laughter.
This massively parallel network also ensures that there is a division of
work. Each neuron only fires when its intended criterion is met, i.e., a neuron may
respond in a certain way to a certain stimulus, as shown below.

It is believed that neurons are arranged in a hierarchical fashion and each
layer has its own role and responsibility. To detect a face, the brain could be
relying on the entire network and not on a single layer.
Now that we have established how a biological neuron works, let’s look at what
McCulloch and Pitts had to offer.
McCulloch-Pitts Neuron:

It may be divided into 2 parts.


1) The first part, g takes an input (like dendrite), performs an
aggregation.
2) Based on the aggregated value, the second part f makes a
decision.
Let’s suppose that I want to predict my own decision, whether to watch a
random football game or not on TV. The inputs are all Boolean i.e., {0,1}
and my output variable is also Boolean {1: Will watch it, 0: Won’t watch
it}.
• So, x_1 could be isPremierLeagueOn (I like Premier League
more)
• x_2 could be isItAFriendlyGame (I tend to care less about the
friendlies)
• x_3 could be isNotHome (Can’t watch it when I’m running
errands. Can I?)
• x_4 could be isManUnitedPlaying (I am a big Man United fan.
GGMU!) and so on.
These inputs can either be Excitatory or Inhibitory.
Inhibitory inputs are those that have maximum effect on the
decision making irrespective of other inputs i.e., if x_3 is 1 (not home)
then my output will always be 0 i.e., the neuron will never fire, so x_3 is
an inhibitory input.
Excitatory inputs are NOT the ones that will make the neuron fire
on their own but they might fire it when combined together.
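Formally, this is what is going on:

\[ g(x_1, x_2, \dots, x_n) = \sum_{i=1}^{n} x_i \]

\[ y = f(g(x)) = \begin{cases} 1 & \text{if } g(x) \geq \theta \\ 0 & \text{if } g(x) < \theta \end{cases} \]
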
We can see that g(x) is just doing a sum of the inputs — a simple
aggregation. And ‘theta’ here is called thresholding parameter. For
example, if I always watch the game when the sum turns out to be 2 or
more, the ‘theta’ is 2 here. This is called the Thresholding Logic.
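The thresholding logic above can be sketched in a few lines of Python. This is only an illustration: the function name `mp_neuron`, the ordering of the inputs, and the choice to pass inhibitory inputs as a list of indices are assumptions made for this example, not part of the original model description.

```python
def mp_neuron(inputs, inhibitory, theta):
    """Sketch of a McCulloch-Pitts neuron with Boolean inputs.

    inputs     -- list of 0/1 values (x_1 ... x_n)
    inhibitory -- indices of the inputs treated as inhibitory (assumed encoding)
    theta      -- thresholding parameter
    """
    # An active inhibitory input forces the output to 0, whatever the rest say.
    if any(inputs[i] == 1 for i in inhibitory):
        return 0
    # g: aggregate the inputs with a plain sum.
    g = sum(inputs)
    # f: fire only when the aggregated value reaches the threshold.
    return 1 if g >= theta else 0


# Input order: [isPremierLeagueOn, isItAFriendlyGame, isNotHome, isManUnitedPlaying]
watch = mp_neuron([1, 0, 0, 1], inhibitory=[2], theta=2)      # at home, two reasons to watch
away = mp_neuron([1, 0, 1, 1], inhibitory=[2], theta=2)       # not home, so never fires
too_few = mp_neuron([1, 0, 0, 0], inhibitory=[2], theta=2)    # sum is below theta
print(watch, away, too_few)
```

With theta set to 2, the neuron fires only when at least two excitatory inputs are on and the inhibitory input x_3 is off.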
Properties of MP Neuron:
1. Binary Nature: Both inputs and outputs are binary, simplifying
the processing model.
2. Logical Operations: The McCulloch-Pitts neuron can
implement basic logical operations (AND, OR, NOT)
depending on the choice of threshold and on which inputs are inhibitory.
3. Computational Power: Despite its simplicity, a network of
McCulloch-Pitts neurons can represent any Boolean function,
making it a powerful model of computation.
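As a rough illustration of the logical operations mentioned in the properties above, the sketch below builds AND, OR and NOT from a single thresholded unit. The helper names (`fires`, `mp_and`, `mp_or`, `mp_not`) are made up for this example, and modelling NOT with one inhibitory input and a threshold of 0 is one possible construction among several.

```python
def fires(inputs, theta, inhibitory=()):
    """Thresholded unit: 0 if any inhibitory input is on, else 1 when sum >= theta."""
    if any(inputs[i] == 1 for i in inhibitory):
        return 0
    return 1 if sum(inputs) >= theta else 0


def mp_and(a, b):
    # Both inputs must be on, so the threshold equals the number of inputs.
    return fires([a, b], theta=2)


def mp_or(a, b):
    # A single active input is enough.
    return fires([a, b], theta=1)


def mp_not(a):
    # The only input is inhibitory; with theta = 0 the unit fires unless inhibited.
    return fires([a], theta=0, inhibitory=(0,))


and_table = [mp_and(a, b) for a in (0, 1) for b in (0, 1)]
or_table = [mp_or(a, b) for a in (0, 1) for b in (0, 1)]
not_table = [mp_not(a) for a in (0, 1)]
print(and_table, or_table, not_table)
```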
Applications of MP Neuron:
• Neural Networks: Serves as a basic building block for more
complex artificial neural networks.
• Theoretical Neuroscience: Provides insights into the
functioning of biological neural networks.
Limitations of MP Neuron:
• Non-Continuity: The binary output limits its ability to model
real-valued inputs and outputs, which are common in biological
systems.
• Static Weights: In the original model, weights do not change,
limiting learning capabilities.

Artificial Neural Network:


An Artificial Neural Network (ANN) is a machine learning model
inspired by the structure and function of the human brain's interconnected
network of neurons.
It consists of interconnected nodes called artificial neurons, organized
into layers. Information flows through the network, with each neuron processing
input signals and producing an output signal that influences other neurons in the
network.
A Multi-Layer Perceptron (MLP) is a type of artificial neural network
consisting of multiple layers of neurons. The neurons in the MLP typically use
nonlinear activation functions, allowing the network to learn complex patterns
in data.
MLPs are significant because they can learn nonlinear relationships in
data, making them powerful models for tasks such as classification, regression,
and pattern recognition.
Basics of Neural Networks:
Neural networks or artificial neural networks are fundamental tools in
machine learning, powering many state-of-the-art algorithms and applications
across various domains, including computer vision, natural language
processing, robotics, and more.
A neural network consists of interconnected nodes, called neurons,
organized into layers. Each neuron receives input signals, performs a
computation on them using an activation function, and produces an output
signal that may be passed to other neurons in the network.
An activation function determines the output of a neuron given its input.
These functions introduce nonlinearity into the network, enabling it to learn
complex patterns in data.
The network is typically organized into layers, starting with the input
layer, where data is introduced, followed by hidden layers, where computations
are performed, and finally the output layer, where predictions or decisions are
made.
Neurons in adjacent layers are connected by weighted connections, which
transmit signals from one layer to the next. The strength of these connections,
represented by weights, determines how much influence one neuron's output has
on another neuron's input.
During the training process, the network learns to adjust its weights based
on examples provided in a training dataset. Additionally, each neuron typically
has an associated bias, which allows the neuron to adjust its output threshold.
Neural networks are trained using techniques called feedforward
propagation and backpropagation. During feedforward propagation, input
data is passed through the network layer by layer, with each layer performing a
computation based on the inputs it receives and passing the result to the next
layer.
Backpropagation is an algorithm used to train neural networks by
iteratively adjusting the network's weights and biases in order to minimize the
loss function.
A loss function (also known as a cost function or objective function) is a
measure of how well the model's predictions match the true target values in the
training data. The loss function quantifies the difference between the predicted
output of the model and the actual output, providing a signal that guides the
optimization process during training.
The goal of training a neural network is to minimize this loss function by
adjusting the weights and biases. The adjustments are guided by an optimization
algorithm, such as gradient descent.
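To make the idea of minimizing a loss with gradient descent concrete, here is a minimal sketch. It assumes a deliberately tiny model, y = w * x with a single weight and no bias, a mean squared error loss, and made-up data; none of these specifics come from the text above.

```python
# Toy data generated from y = 2x, so the best weight is w = 2 (illustrative only).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]


def mse(w):
    """Mean squared error between predictions w*x and targets y."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def mse_grad(w):
    """Derivative of the mean squared error with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


w = 0.0                # starting guess
learning_rate = 0.05   # step size, chosen by hand for this data
history = [mse(w)]
for _ in range(50):
    # Move the weight a small step against the gradient of the loss.
    w -= learning_rate * mse_grad(w)
    history.append(mse(w))

print(w, history[0], history[-1])
```

The loss recorded in `history` shrinks at every step, and `w` approaches 2, the value that generated the data.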
Types of Neural Networks:

The ANN depicted on the right of the image is a simple neural network
called ‘perceptron’. It consists of a single layer, which is the input layer, with
multiple neurons with their own weights; there are no hidden layers. The
perceptron algorithm learns the weights for the input signals in order to draw a
linear decision boundary.
However, to solve more complicated, non-linear problems related to
image processing, computer vision, and natural language processing tasks, we
work with deep neural networks.
There are several types of ANN, each designed for specific tasks and
architectural requirements.
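Before moving on to the individual architectures, the perceptron described above can be sketched as follows. The update rule used here is the classic perceptron learning rule; the AND dataset, the zero initialisation and the number of epochs are assumptions picked for illustration.

```python
def predict(w, b, x):
    """Step-function output: 1 if the weighted sum plus bias is non-negative."""
    s = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if s >= 0 else 0


def train_perceptron(data, epochs=20, lr=1.0):
    """Learn weights for a linear decision boundary with the perceptron rule."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for x, target in data:
            error = target - predict(w, b, x)   # -1, 0 or +1
            # Nudge the boundary toward misclassified points only.
            w = [wi + lr * error * xi for wi, xi in zip(w, x)]
            b += lr * error
    return w, b


# AND is linearly separable, so a single perceptron can learn it.
and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(and_data)
predictions = [predict(w, b, x) for x, _ in and_data]
print(w, b, predictions)
```

The same code cannot fit XOR, because no single straight line separates its classes; that limitation is what motivates the deeper networks listed next.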
Feedforward Neural Networks (FNN)
These are the simplest form of ANNs, where information flows in one
direction, from input to output. There are no cycles or loops in the network
architecture. Multilayer perceptrons (MLP) are a type of feedforward neural
network.
Recurrent Neural Networks (RNN)
In RNNs, connections between nodes form directed cycles, allowing
information to persist over time. This makes them suitable for tasks involving
sequential data, such as time series prediction, natural language processing, and
speech recognition.
Convolutional Neural Networks (CNN)
CNNs are designed to effectively process grid-like data, such as images.
They consist of layers of convolutional filters that learn hierarchical
representations of features within the input data. CNNs are widely used in tasks
like image classification, object detection, and image segmentation.
Long Short-Term Memory Networks (LSTM) and Gated Recurrent
Units (GRU)
These are specialized types of recurrent neural networks designed to
address the vanishing gradient problem in traditional RNN.
LSTMs and GRUs incorporate gated mechanisms to better capture long-
range dependencies in sequential data, making them particularly effective for
tasks like speech recognition, machine translation, and sentiment analysis.
Autoencoder
It is designed for unsupervised learning and consists of an encoder
network that compresses the input data into a lower-dimensional latent space,
and a decoder network that reconstructs the original input from the latent
representation.
Autoencoders are often used for dimensionality reduction, data
denoising, and generative modelling.
Generative Adversarial Networks (GAN)
GANs consist of two neural networks, a generator and a discriminator,
trained simultaneously in a competitive setting.
The generator learns to generate synthetic data samples that are
indistinguishable from real data, while the discriminator learns to distinguish
between real and fake samples.
GANs have been widely used for generating realistic images, videos, and
other types of data.

Multilayer Perceptron:
A multilayer perceptron is a type of feedforward neural network
consisting of fully connected neurons with a nonlinear kind of activation
function. It is widely used to distinguish data that is not linearly separable.
MLPs have been widely used in various fields, including image
recognition, natural language processing, and speech recognition, among others.
Their flexibility in architecture and ability to approximate any function
under certain conditions make them a fundamental building block in deep
learning and neural network research.
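One way to see why an extra layer helps with data that is not linearly separable is the XOR function. The sketch below wires a two-layer network by hand; the weights are picked manually rather than learned, and a step activation is used instead of a smooth one purely to keep the arithmetic easy to follow.

```python
def step(z):
    """Threshold activation: 1 when z is non-negative, else 0."""
    return 1 if z >= 0 else 0


def xor_network(a, b):
    # Hidden neuron 1 behaves like OR: fires when at least one input is on.
    h1 = step(1.0 * a + 1.0 * b - 0.5)
    # Hidden neuron 2 behaves like NAND: fires unless both inputs are on.
    h2 = step(-1.0 * a - 1.0 * b + 1.5)
    # Output neuron behaves like AND over the two hidden neurons.
    return step(1.0 * h1 + 1.0 * h2 - 1.5)


xor_table = [xor_network(a, b) for a in (0, 1) for b in (0, 1)]
print(xor_table)
```

Each hidden neuron draws one straight line through the input space, and the output neuron combines them, which a single-layer perceptron cannot do.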

Key Concepts:
Input layer:
The input layer consists of nodes or neurons that receive the initial
input data. Each neuron represents a feature or dimension of the input
data. The number of neurons in the input layer is determined by the
dimensionality of the input data.
Hidden layer:
Between the input and output layers, there can be one or more
layers of neurons. Each neuron in a hidden layer receives inputs from all
neurons in the previous layer (either the input layer or another hidden
layer) and produces an output that is passed to the next layer.
The number of hidden layers and the number of neurons in each
hidden layer are hyperparameters that need to be determined during the
model design phase.
Output layer:
This layer consists of neurons that produce the final output of the
network. The number of neurons in the output layer depends on the
nature of the task.
In binary classification, there may be either one neuron (for example,
with a sigmoid activation giving the probability of belonging to one class)
or two neurons (one per class), depending on the activation function;
in multi-class classification tasks, there are multiple neurons in the
output layer, typically one per class.
Weights:
Neurons in adjacent layers are fully connected to each other. Each
connection has an associated weight, which determines the strength of the
connection. These weights are learned during the training process.
Bias neurons:
In addition to the input and hidden neurons, each layer (except the
input layer) usually includes a bias neuron that provides a constant input
to the neurons in the next layer. Bias neurons have their own weight
associated with each connection, which is also learned during training.
The bias neuron effectively shifts the activation function of the
neurons in the subsequent layer, allowing the network to learn an offset or
bias in the decision boundary.
By adjusting the weights connected to the bias neuron, the MLP
can learn to control the threshold for activation and better fit the training
data.
Note: It is important to note that in the context of MLPs, bias can refer to
two related but distinct concepts: bias as a general term in machine learning and
the bias neuron (defined above).
In general machine learning, bias refers to the error introduced by
approximating a real-world problem with a simplified model. Bias measures
how well the model can capture the underlying patterns in the data.
A high bias indicates that the model is too simplistic and may underfit
the data, while a low bias suggests that the model is capturing the underlying
patterns well.
Activation function:
Typically, each neuron in the hidden layers and the output layer
applies an activation function to its weighted sum of inputs.
Common activation functions include sigmoid, tanh, ReLU
(Rectified Linear Unit), and Softmax. These functions introduce
nonlinearity into the network, allowing it to learn complex patterns in the
data.
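For reference, the activation functions named above can be written out directly. This is a plain-Python sketch using only the standard `math` module; subtracting the maximum inside `softmax` is a common numerical-stability trick and is an addition of this example.

```python
import math


def sigmoid(z):
    """Squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def tanh(z):
    """Squashes any real number into the range (-1, 1)."""
    return math.tanh(z)


def relu(z):
    """Passes positive values through and clips negatives to 0."""
    return max(0.0, z)


def softmax(zs):
    """Turns a list of scores into probabilities that sum to 1."""
    m = max(zs)                          # shift scores for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([1.0, 2.0, 3.0])
print(sigmoid(0.0), tanh(0.0), relu(-3.0), relu(2.5), probs)
```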
Training with backpropagation:
MLPs are trained using the backpropagation algorithm, which
computes gradients of a loss function with respect to the model's
parameters and updates the parameters iteratively to minimize the loss.
Working of MultiLayer Perceptron: Layer by Layer
In a multilayer perceptron, neurons process information in a step-by-step
manner, performing computations that involve weighted sums and nonlinear
transformations. Let's walk through it layer by layer to see what goes on inside.
Input layer
• The input layer of an MLP receives input data, which could be
features extracted from the input samples in a dataset. Each neuron in
the input layer represents one feature.
• Neurons in the input layer do not perform any computations; they
simply pass the input values to the neurons in the first hidden layer.
Hidden layers
• The hidden layers of an MLP consist of interconnected neurons that
perform computations on the input data.
• Each neuron in a hidden layer receives input from all neurons in the
previous layer. The inputs are multiplied by corresponding weights,
denoted as w. The weights determine how much influence the input
from one neuron has on the output of another.
• In addition to weights, each neuron in the hidden layer has an
associated bias, denoted as b. The bias provides an additional input to
the neuron, allowing it to adjust its output threshold. Like weights,
biases are learned during training.
• For each neuron in a hidden layer or the output layer, the weighted
sum of its inputs is computed. This involves multiplying each input by
its corresponding weight, summing up these products, and adding the
bias:

\[ z = \sum_{i=1}^{n} w_i x_i + b \]

where n is the total number of input connections, w_i is the weight
for the i-th input, and x_i is the i-th input value.
• The weighted sum is then passed through an activation function,
denoted as f. The activation function determines the output range of
the neuron and its behavior in response to different input values. The
choice of activation function depends on the nature of the task and the
desired properties of the network.
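The hidden-layer computation described above (weighted sum, plus bias, then activation) might look like this in code. The function names, the example weights and the choice of sigmoid as f are assumptions made for the illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus the bias, passed through the activation f."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


def layer_output(inputs, weight_rows, biases, activation):
    """Every neuron in the layer sees all inputs but has its own weights and bias."""
    return [
        neuron_output(inputs, weights, bias, activation)
        for weights, bias in zip(weight_rows, biases)
    ]


x = [1.0, 2.0]                      # two input features
W = [[0.5, -0.25], [0.0, 1.0]]      # one row of weights per hidden neuron
b = [0.0, -2.0]                     # one bias per hidden neuron
h = layer_output(x, W, b, sigmoid)
print(h)
```

For these particular numbers both weighted sums come out to exactly 0, so both neurons output sigmoid(0) = 0.5.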
Output Layer:
• The output layer of an MLP produces the final predictions or outputs
of the network.
• The number of neurons in the output layer depends on the task being
performed (e.g., binary classification, multi-class classification,
regression).
• Each neuron in the output layer receives input from the neurons in the
last hidden layer and applies an activation function.
• This activation function is usually different from those used in the
hidden layers and produces the final output value or prediction.
During the training process, the network learns to adjust the weights
associated with each neuron's inputs to minimize the discrepancy between the
predicted outputs and the true target values in the training data.
By adjusting the weights and biases,
the network learns to approximate complex patterns and relationships in the
data, enabling it to make accurate predictions on new, unseen samples.
This adjustment is guided by an optimization algorithm, such as stochastic
gradient descent (SGD), which computes the gradients of a loss function with
respect to the weights and updates the weights iteratively.

Sigmoid Neurons:
The sigmoid neuron is a building block of deep neural networks. It is
similar to the perceptron, but slightly modified so that its output is much
smoother than the step-function output of the perceptron.
Why Sigmoid Neuron?
The perceptron model takes several real-valued inputs and gives a single
binary output. In the perceptron model, every input x_i has a weight w_i associated
with it. The weights indicate the importance of the input in the decision-making
process.
The model output is decided by a threshold W₀: if the weighted sum of the
inputs is greater than the threshold W₀, the output will be 1; otherwise the
output will be 0. In other words, the model will fire if the weighted sum is
greater than the threshold.

The image on the left is the perceptron and the image on the right indicates
the mathematical representation of the perceptron.
From the mathematical representation, we can say that the thresholding
logic used by the perceptron is very harsh. Let’s see the harsh thresholding logic
of the perceptron with an example.
Consider the decision-making process of a person deciding whether or not
to purchase a car based on only one input X₁, the salary (in thousands), with
the threshold set through b (W₀) = -10 and the weight W₁ = 0.2. With these
values the output flips at a salary of 50K, since 0.2 × 50 - 10 = 0. The output
from the perceptron model will look like the figure shown below.

Fig: Data (left), Graphical Representation of Output (Right)


Red points indicate that a person would not buy a car and green points
indicate that a person would buy a car. Isn’t it a bit odd that a person with
a salary of 50.1K will buy a car but someone with 49.9K will not?
The small change in the input to a perceptron can sometimes cause the
output to completely flip, say from 0 to 1. This behavior is not a characteristic
of the specific problem we choose or the specific weight and the threshold we
choose. It is a characteristic of the perceptron neuron itself which behaves like a
step function.
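The abrupt flip can be reproduced with the numbers from the car example (W₁ = 0.2, b = -10). This is a small sketch; the function name and the convention of firing when the weighted sum plus b is at least 0 are assumptions of this example.

```python
def perceptron_buys_car(salary_k, w1=0.2, b=-10.0):
    """Step-function perceptron: 1 (buys) if w1 * salary + b >= 0, else 0."""
    return 1 if w1 * salary_k + b >= 0 else 0


low = perceptron_buys_car(49.9)    # weighted sum is about -0.02
high = perceptron_buys_car(50.1)   # weighted sum is about +0.02
print(low, high)
```

A difference of 0.2K in salary moves the output all the way from 0 to 1, which is the harsh behaviour described above.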
We can overcome this problem by introducing a new type of artificial
neuron called a sigmoid neuron.
Sigmoid Neuron:
The output function of the sigmoid neuron is much smoother than the
step function. In the sigmoid neuron, a small change in the input only causes a
small change in the output as opposed to the stepped output.
There are many functions with the characteristic of an “S” shaped curve
known as sigmoid functions. The most commonly used function is the logistic
function.

Fig: Sigmoid Neuron Representation (logistic function)


We no longer see a sharp transition at the threshold b. The output from
the sigmoid neuron is not 0 or 1. Instead, it is a real value between 0 and 1,
which can be interpreted as a probability.
The inputs to the sigmoid neuron can be real numbers, unlike the Boolean
inputs of the MP neuron, and the output will also be a real number between 0
and 1.
In the sigmoid neuron, we are trying to regress the relationship
between X and Y in terms of probability. Even though the output is a value
between 0 and 1, we can still use the sigmoid function for binary
classification tasks by choosing some threshold.
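For comparison with the perceptron, the sketch below feeds the same car-purchase numbers through the logistic function, y = 1 / (1 + e^-(w·x + b)). Reusing W₁ = 0.2 and b = -10 from the earlier example and using 0.5 as the classification threshold are assumptions made for illustration.

```python
import math


def sigmoid_neuron(salary_k, w1=0.2, b=-10.0):
    """Logistic output: a smooth value between 0 and 1 instead of a hard step."""
    return 1.0 / (1.0 + math.exp(-(w1 * salary_k + b)))


p_low = sigmoid_neuron(49.9)
p_high = sigmoid_neuron(50.1)
print(round(p_low, 3), round(p_high, 3))

# A threshold turns the probability back into a class label when one is needed.
label_low = 1 if p_low >= 0.5 else 0
label_high = 1 if p_high >= 0.5 else 0
```

The two salaries now produce outputs that differ only slightly (both close to 0.5), rather than landing on opposite extremes.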
Learning Algorithm:
We will discuss an algorithm for learning the parameters w and b of
the sigmoid neuron model by using the gradient descent algorithm.
The objective of the learning algorithm is to determine the best
possible values for the parameters, such that the overall loss (squared error loss)
of the model is minimized as much as possible.

The learning algorithm is as shown in the figure below.

• We initialise w and b randomly.
• We then iterate over all the observations; for each observation, we find
the corresponding predicted outcome using the sigmoid function and
compute the squared error loss.
• Based on the loss value, we update the parameters w and b such that the
overall loss of the model at the new parameters is less than the current
loss of the model.
• We keep repeating the update operation until we are satisfied. "Satisfied"
could mean any of the following:
o The overall loss of the model becomes zero.
o The overall loss of the model becomes a very small value close to
zero.
o A fixed number of passes, chosen based on the available computational
capacity, has been completed.
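The steps above can be sketched as a short training loop. This is a minimal illustration, not a reference implementation: the two data points, the learning rate, the fixed seed and the use of full-batch updates are all assumptions, and the gradient expressions follow from applying the chain rule to the squared error of a sigmoid output.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def squared_error(w, b, data):
    """Overall loss: half the sum of squared differences between output and target."""
    return 0.5 * sum((sigmoid(w * x + b) - y) ** 2 for x, y in data)


def train(data, epochs=1000, lr=0.5, seed=0):
    rng = random.Random(seed)
    # Step 1: initialise w and b randomly.
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    losses = [squared_error(w, b, data)]
    for _ in range(epochs):                 # stop after a fixed number of passes
        dw = db = 0.0
        # Step 2: iterate over all observations and accumulate the gradient.
        for x, y in data:
            y_hat = sigmoid(w * x + b)
            delta = (y_hat - y) * y_hat * (1.0 - y_hat)
            dw += delta * x
            db += delta
        # Step 3: update the parameters so that the loss goes down.
        w -= lr * dw
        b -= lr * db
        losses.append(squared_error(w, b, data))
    return w, b, losses


data = [(0.5, 0.2), (2.5, 0.9)]   # made-up (input, target) pairs
w, b, losses = train(data)
print(w, b, losses[0], losses[-1])
```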
Applications
• Binary Classification: Sigmoid neurons are commonly used in the
output layer of binary classification neural networks.
• Logistic Regression: The sigmoid function is central to logistic
regression, linking input features to the probability of a class.
Limitations
• Vanishing Gradient: The sigmoid function saturates for very high or very low
input values, where its gradient approaches zero, making training deep
networks challenging.
• Not Zero-Centered: The output of the sigmoid function is always
positive, so the inputs passed to the next layer all have the same sign,
which can make gradient descent updates zig-zag and slow down learning.
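Both limitations can be checked numerically. The sketch below uses the standard identity g'(z) = g(z)(1 - g(z)) for the sigmoid; the helper name `sigmoid_grad` and the sample input values are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid: g'(z) = g(z) * (1 - g(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])   # illustrative inputs
outputs = sigmoid(z)
grads = sigmoid_grad(z)

# The gradient peaks at z = 0 and shrinks towards zero as |z| grows.
print(grads)
# Every output is positive, so the outputs are not centred around zero.
print(outputs)
```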

Back Propagation:
Backpropagation is a key algorithm used in training artificial neural
networks. It efficiently computes the gradient of the loss function with respect
to the weights of the network, enabling optimization methods like gradient
descent. Here’s a detailed breakdown of the backpropagation process:
Overview
1. Feedforward Phase: Input data is passed through the network to produce
an output.
2. Loss Calculation: The output is compared to the target (true) values
using a loss function to compute the error.
3. Backpropagation Phase: The error is propagated backward through the
network to compute the gradients, which are then used to update the weights.
Let us consider a sample neural network as shown below:

Fig: Simple Neural Network.


In the above network, we have 3 input features, 2 hidden layers with 4
neurons each and 3 neurons in the output layer.
The bias units are represented as b1 and b2. The input neurons are not
connected to the bias units.
Generally, if we consider a node in one layer, let us call it $a_i$ (‘a’ stands
for activation; the activation is the result of two operations, i.e., summation
followed by non-linearity).

Fig: Layer to Layer connection example.


If we look at $a_i$, which sits at level ‘l’, and $a_j$, which sits at
level ‘l+1’, the two are connected by a single line. The line is denoted by
$w_{ij}$. $w_{ij}$ itself is a scalar, i.e., a single value. The line tells us that $a_j^{(l+1)}$ gets
some contribution from $a_i^{(l)}$, and that contribution is scaled
by $w_{ij}^{(l)}$.
So, $w_{ij}^{(l)}$ is the weight at the l-th layer connecting the i-th neuron of level ‘l’
to the j-th neuron of level ‘l+1’.
The input vector in the figure ‘Simple Neural Network’ is nothing but the ‘a’
vector at level 1. It can be represented as:
$$\vec{x} = \vec{a}^{(1)}$$

The output vector is nothing but the ‘a’ vector at level 4. It can be represented
as:
$$\hat{y} = \vec{a}^{(4)}$$

In the example, we have two hidden layers, and they can be represented
in terms of activation (a) as $\vec{a}^{(2)}$ and $\vec{a}^{(3)}$.
In the feed forward phase, the input data is passed through the network to
produce an output. This is shown in the figure below.

Throughout this phase, the weights are guessed. At the end of this phase,
we get the value of the cost function ‘J’. Ideally, the cost should be 0. But that is
not going to be the case, because the guessed weights are typically not
very good.
So, ‘J’ is a function of y, the ground truth, and the output $\hat{y}$. It can be
represented as:
$$J(y, \hat{y})$$
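The feed forward phase for the 3-4-4-3 network described above can be sketched as follows. Several details are assumptions made only for this illustration: the sigmoid as the non-linearity, one bias vector per non-input layer, random "guessed" parameters, the input and target values, and a squared error cost standing in for J.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Return the activations [a1, a2, a3, a4] for the input vector x."""
    activations = [x]                        # a(1) is the input itself
    for W, b in zip(weights, biases):
        z = W.T @ activations[-1] + b        # summation
        activations.append(sigmoid(z))       # non-linearity
    return activations

rng = np.random.default_rng(0)
sizes = [3, 4, 4, 3]                         # input, two hidden layers, output
# W[i, j] connects neuron i of one level to neuron j of the next level.
weights = [rng.normal(size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=n) for n in sizes[1:]]

x = np.array([0.2, -0.4, 0.9])               # illustrative input
y = np.array([1.0, 0.0, 0.0])                # illustrative ground truth

acts = forward(x, weights, biases)
y_hat = acts[-1]                             # a(4), the network output
J = 0.5 * np.sum((y - y_hat) ** 2)           # assumed cost, nonzero for guessed weights
print(y_hat, J)
```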
The next step is to figure out which of these weights was responsible for
this high J.
So, the task is essentially to redistribute this ‘J’ among all the weights. This
is shown in the figure below.

The procedure of updating the weights (w) using $\partial J / \partial w$ is called
Gradient Descent, whereas just calculating $\partial J / \partial w$ is called Back Propagation.
Back Propagation, i.e., the calculation of $\partial J / \partial w$, can be done by 2 methods:
1) Finite Difference method.
2) Chain rule.
Finite Difference method:
The steps followed in this method are as follows:
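o Guess $\vec{w}$ and do a forward pass. A forward pass is nothing but, for a
given input x, calculating $\hat{y}$ using w. Calculate $J(\vec{w})$.
o Perturb the guess to $\vec{w} + \Delta\vec{w}$ and do a forward pass; we get the output
$\hat{y} + \Delta\hat{y}$. Calculate $J(\vec{w} + \Delta\vec{w})$.
o Then,
$$\frac{\partial J}{\partial w} \approx \frac{J(\vec{w} + \Delta\vec{w}) - J(\vec{w})}{\Delta\vec{w}}$$

As we cannot divide by a vector, we perturb one weight at a time. If we want to
calculate $\partial J / \partial w_1$, then,
$$\frac{\partial J}{\partial w_1} \approx \frac{J(w_1 + \Delta w_1) - J(w_1)}{\Delta w_1}$$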

Even though this method is simple to code, it is expensive. It is
expensive because, for each gradient descent pass
that you have to do, you have to calculate many of these $\partial J / \partial w$ terms.

$\partial J / \partial w$ is calculated for every parameter, and there could be millions of
parameters, so these are millions of calculations. For each such
calculation, there are 2 cost evaluations involved, i.e., J(w) and J(w + Δw), and
each evaluation needs a forward pass.

So, this turns out to be very expensive.
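A minimal sketch of the finite difference method follows. The helper name `finite_difference_gradient`, the step size and the toy cost function `J` are assumptions chosen so that the exact gradient is known and can be compared; in a real network, each call to `J` would be a full forward pass.

```python
import numpy as np

def finite_difference_gradient(J, w, delta=1e-6):
    """Approximate dJ/dw_k for every k by perturbing one weight at a time."""
    grad = np.zeros_like(w)
    J0 = J(w)                            # cost at the current guess
    for k in range(w.size):              # one extra cost evaluation per parameter
        w_step = w.copy()
        w_step[k] += delta               # perturb only the k-th weight
        grad[k] = (J(w_step) - J0) / delta
    return grad

def J(w):
    """Toy cost (assumed): exact gradient is 2*w, plus 3 in the first component."""
    return np.sum(w ** 2) + 3.0 * w[0]

w = np.array([1.0, -2.0, 0.5])
approx = finite_difference_gradient(J, w)
exact = 2.0 * w + np.array([3.0, 0.0, 0.0])
print(approx, exact)
```

The loop makes the cost visible: the number of cost evaluations grows with the number of parameters, which is why this approach does not scale to networks with millions of weights.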


Chain Rule Method:
• The chain rule derivation is very similar to the gradient derivation for
logistic regression.
• Programming the chain rule is harder than programming finite differences,
but it turns out to be computationally very cheap.
• TensorFlow, MATLAB and other deep learning packages
provide a back propagation routine based on the chain rule.
Simplified derivation for Back Propagation using Chain rule:
Let us consider the below figure for the derivation:
To keep the derivation simple, all quantities are treated as scalars and the
bias term is ignored. Even with scalars, the expressions derived
are very close to the final expressions.
For ease of comprehension, the above figure is redrawn as
shown below.

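Looking at the above figure, the following expressions can be deduced.

$$a^{(1)} = x$$
$$z^{(2)} = w^{(1)} a^{(1)}, \qquad a^{(2)} = g(z^{(2)})$$
$$z^{(3)} = w^{(2)} a^{(2)}, \qquad a^{(3)} = g(z^{(3)})$$
$$z^{(4)} = w^{(3)} a^{(3)}, \qquad a^{(4)} = g(z^{(4)})$$
$$\hat{y} = a^{(4)}$$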

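In the example above, back propagation is nothing but calculating
$\partial J / \partial w^{(1)}$, $\partial J / \partial w^{(2)}$ and $\partial J / \partial w^{(3)}$.
Now, it is actually easiest to find $\partial J / \partial w^{(3)}$, because $w^{(3)}$ is the weight
closest to J and so most directly responsible for it.
So, let’s find this term first.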

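The assumption here is that ‘J’ is the binary cross-entropy cost function:
$$J = -\{\, y \ln \hat{y} + (1 - y) \ln (1 - \hat{y}) \,\}$$
and that g is the sigmoid function.
So, by the chain rule,
$$\frac{\partial J}{\partial w^{(3)}} = \frac{\partial J}{\partial a^{(4)}} \cdot \frac{\partial a^{(4)}}{\partial z^{(4)}} \cdot \frac{\partial z^{(4)}}{\partial w^{(3)}}$$

For the sigmoid, $\partial a^{(4)} / \partial z^{(4)} = a^{(4)} (1 - a^{(4)})$, and from
$z^{(4)} = w^{(3)} a^{(3)}$ we get $\partial z^{(4)} / \partial w^{(3)} = a^{(3)}$.
The product of the first two factors simplifies to $-\{y - a^{(4)}\}$. So,
$$\frac{\partial J}{\partial w^{(3)}} = -\{y - a^{(4)}\}\, a^{(3)}$$

The factor $-\{y - a^{(4)}\}$ is the error in the output activation.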


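Now, using a notation, we can say that,
$$\frac{\partial J}{\partial z^{(4)}} = \delta^{(4)}$$

The generalised notation for the above equation (at layer ‘l’) can be written as:
$$\frac{\partial J}{\partial z^{(l)}} \equiv \delta^{(l)}$$

Here, $\partial J / \partial z^{(4)}$ denotes the error in the output activation.

Therefore,
$$\frac{\partial J}{\partial w^{(3)}} = \delta^{(4)} a^{(3)}$$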

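Suppose we have to calculate $\partial J / \partial w^{(2)}$. It can be represented using
the chain rule as:
$$\frac{\partial J}{\partial w^{(2)}} = \frac{\partial J}{\partial a^{(4)}} \cdot \frac{\partial a^{(4)}}{\partial z^{(4)}} \cdot \frac{\partial z^{(4)}}{\partial a^{(3)}} \cdot \frac{\partial a^{(3)}}{\partial z^{(3)}} \cdot \frac{\partial z^{(3)}}{\partial w^{(2)}}$$

$$\frac{\partial J}{\partial w^{(2)}} = \delta^{(3)} a^{(2)}$$

So, we can say,
$$\frac{\partial J}{\partial w^{(l)}} = \delta^{(l+1)} a^{(l)}$$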

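In the above equation, $a^{(l)}$ is known from the forward pass.
However, $\delta^{(l+1)}$ is not known.
The next task is to find $\delta^{(l)}$, i.e., $\delta^{(4)}$, $\delta^{(3)}$ and $\delta^{(2)}$, where ‘l’ is the
level number.
$\delta^{(4)}$ is known already, as it is the error in the output activation. We need to find the
other deltas.
$$\delta^{(4)} = \frac{\partial J}{\partial a^{(4)}} \cdot \frac{\partial a^{(4)}}{\partial z^{(4)}}$$
$$\delta^{(3)} = \frac{\partial J}{\partial a^{(4)}} \cdot \frac{\partial a^{(4)}}{\partial z^{(4)}} \cdot \frac{\partial z^{(4)}}{\partial a^{(3)}} \cdot \frac{\partial a^{(3)}}{\partial z^{(3)}}$$

The first two factors are $\delta^{(4)}$. Since $z^{(4)} = w^{(3)} a^{(3)}$, we have
$\partial z^{(4)} / \partial a^{(3)} = w^{(3)}$, and $\partial a^{(3)} / \partial z^{(3)}$ can be written
as $g'(z^{(3)})$, i.e., g prime of $z^{(3)}$.

Therefore,
$$\delta^{(3)} = \delta^{(4)} w^{(3)} g'(z^{(3)})$$
Similarly,
$$\delta^{(l)} = \delta^{(l+1)} w^{(l)} g'(z^{(l)})$$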

The other expression we had was:
$$\frac{\partial J}{\partial w^{(l)}} = \delta^{(l+1)} a^{(l)}$$

These two expressions, combined, form the back propagation algorithm.
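The scalar derivation can be checked with a short sketch. It follows the simplified setting of the derivation (scalar chain, no bias, sigmoid g, binary cross-entropy J); the function names and the numeric values of x, y and the weights are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, w1, w2, w3):
    """Scalar chain: a1 = x, a2 = g(w1*a1), a3 = g(w2*a2), a4 = g(w3*a3)."""
    a1 = x
    a2 = sigmoid(w1 * a1)
    a3 = sigmoid(w2 * a2)
    a4 = sigmoid(w3 * a3)
    return a1, a2, a3, a4

def cost(y, y_hat):
    """Binary cross-entropy cost J."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def backprop(x, y, w1, w2, w3):
    """Return (dJ/dw1, dJ/dw2, dJ/dw3) using the two recursions."""
    a1, a2, a3, a4 = forward(x, w1, w2, w3)
    d4 = a4 - y                        # delta(4) = -(y - a(4))
    d3 = d4 * w3 * a3 * (1 - a3)       # delta(3) = delta(4) * w(3) * g'(z(3))
    d2 = d3 * w2 * a2 * (1 - a2)       # delta(2) = delta(3) * w(2) * g'(z(2))
    return d2 * a1, d3 * a2, d4 * a3   # dJ/dw(l) = delta(l+1) * a(l)

# Illustrative values (assumed).
x, y = 0.7, 1.0
w1, w2, w3 = 0.5, -1.3, 0.8
g1, g2, g3 = backprop(x, y, w1, w2, w3)
print(g1, g2, g3)
```

Here g'(z) is computed as a(1 - a), which holds for the sigmoid. One forward pass and one backward pass give all three gradients, instead of one extra forward pass per weight.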


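General vectorized expressions for the above equations are as follows:

$$\frac{\partial J}{\partial w_{ij}^{(l)}} = \delta_j^{(l+1)} a_i^{(l)}$$

$$\vec{\delta}^{(l)} = \left[ w^{(l)} \vec{\delta}^{(l+1)} \right] \odot \left[ g'\left( \vec{z}^{(l)} \right) \right]$$

Here, $\odot$ denotes the element-wise product of the two bracketed vectors.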
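The vectorized expressions can be sketched for the 3-4-4-3 example network as follows. The conventions here are assumptions made for the illustration: each weight matrix stores $w_{ij}$ at position `[i, j]` (shape: neurons at level l by neurons at level l+1), biases are ignored as in the derivation, g is the sigmoid, and J is the binary cross-entropy summed over the outputs.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights):
    """Return the activation vectors of all levels, starting with the input."""
    acts = [x]
    for W in weights:                        # W[i, j] = w_ij
        acts.append(sigmoid(W.T @ acts[-1]))
    return acts

def cost(y, y_hat):
    """Binary cross-entropy summed over the output neurons (assumed cost)."""
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def backprop(x, y, weights):
    """Return a list of gradient matrices, one per weight matrix."""
    acts = forward(x, weights)
    delta = acts[-1] - y                     # delta at the output level
    grads = [None] * len(weights)
    for l in reversed(range(len(weights))):
        grads[l] = np.outer(acts[l], delta)  # dJ/dw_ij = a_i * delta_j
        if l > 0:
            a = acts[l]
            # delta(l) = [w(l) delta(l+1)] * g'(z(l)), with g'(z) = a * (1 - a)
            delta = (weights[l] @ delta) * a * (1 - a)
    return grads

rng = np.random.default_rng(1)
sizes = [3, 4, 4, 3]
weights = [rng.normal(size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
x = np.array([0.2, -0.4, 0.9])               # illustrative input
y = np.array([1.0, 0.0, 0.0])                # illustrative target

grads = backprop(x, y, weights)
print([g.shape for g in grads])
```

Each gradient matrix has the same shape as the weight matrix it belongs to, so it can be used directly in a gradient descent update.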

In summary: given some input ‘x’ and some guessed
weights ‘w’, we get the output $\hat{y}$. The weights ‘w’ are then improved using $\partial J / \partial w$.

The main computation in neural networks is calculating this $\partial J / \partial w$ for a
given guess ‘w’. This is done using Back Propagation.
It is called back propagation because the error at the
last layer is calculated first and is then propagated backward, layer by layer, to the first layer.

Advantages:
• Efficiency: Enables neural networks to learn from vast amounts of data.
• Flexibility: Can be applied to a wide range of neural networks.
• Accuracy: By iteratively adjusting weights, it can achieve high accuracy in
complex tasks.
Applications:
• Image Recognition
o Identifies objects in images.
o Enables applications like self-driving cars and medical
diagnosis.
• Natural Language Processing
o Powers machine translation, text summarization, and
chatbot systems.
• Robotics
o Enables robots to learn from experience and adapt to
changing environments.
• Game AI
o Creates intelligent opponents and enhances game
realism.
