DL notes
(R18)
Deep Learning and Neural Networks
Lecture Notes
Prepared by
Dr.D.Shanthi
( Professor&HOD-CSM)
Dept. CSE(AIML)
Course Outcomes:
Ability to understand the concepts of Neural Networks
Ability to select the Learning Networks in modeling real world systems
Ability to use an efficient algorithm for Deep Models
Ability to apply optimization strategies for large scale applications
UNIT-I
Artificial Neural Networks Introduction, Basic models of ANN, important terminologies, Supervised
Learning Networks, Perceptron Networks, Adaptive Linear Neuron, Back-propagation Network.
Associative Memory Networks. Training Algorithms for pattern association, BAM and Hopfield
Networks.
UNIT-II
Unsupervised Learning Network- Introduction, Fixed Weight Competitive Nets, Maxnet, Hamming
Network, Kohonen Self-Organizing Feature Maps, Learning Vector Quantization, Counter Propagation
Networks, Adaptive Resonance Theory Networks. Special Networks-Introduction to various networks.
UNIT - III
Introduction to Deep Learning, Historical Trends in Deep learning, Deep Feed - forward networks,
Gradient-Based learning, Hidden Units, Architecture Design, Back-Propagation and Other
Differentiation Algorithms
UNIT - IV
Regularization for Deep Learning: Parameter norm Penalties, Norm Penalties as Constrained
Optimization, Regularization and Under-Constrained Problems, Dataset Augmentation, Noise
Robustness, Semi-Supervised learning, Multi-task learning, Early Stopping, Parameter Tying and
Parameter Sharing, Sparse Representations, Bagging and other Ensemble Methods, Dropout,
Adversarial Training, Tangent Distance, Tangent Prop and Manifold Tangent Classifier
UNIT - V
Optimization for Training Deep Models: Challenges in Neural Network Optimization, Basic Algorithms,
Parameter Initialization Strategies, Algorithms with Adaptive Learning Rates, Approximate Second-
Order Methods, Optimization Strategies and Meta-Algorithms
Applications: Large-Scale Deep Learning, Computer Vision, Speech Recognition, Natural Language
Processing
TEXT BOOKS:
The term "Artificial Neural Network" is derived from the biological neural networks that make up the structure of the
human brain. Just as the human brain has neurons interconnected with one another, artificial neural
networks have neurons that are interconnected with one another across the layers of the network. These
neurons are known as nodes.
The given figure illustrates the typical diagram of Biological Neural Network.
The typical Artificial Neural Network looks something like the given figure.
An Artificial Neural Network, in the field of Artificial Intelligence, attempts to mimic the network of neurons
that makes up a human brain, so that computers can understand things and make decisions in a
human-like manner. The artificial neural network is designed by programming computers to behave like
interconnected brain cells.
There are roughly 86 billion neurons in the human brain. Each neuron has somewhere between
1,000 and 100,000 connection points. In the human brain, data is stored in a distributed manner, and we
can retrieve more than one piece of this data from memory in parallel when necessary. We can say that the
human brain is made up of remarkably powerful parallel processors.
A neuron, or nerve cell, is an electrically excitable cell that communicates with other cells via
specialized connections called synapses. It is the main component of nervous tissue. Neurons are
typically classified into three types based on their function. Sensory neurons respond to stimuli such as
touch, sound, or light that affect the cells of the sensory organs, and they send signals to the spinal
cord or brain. Motor neurons receive signals from the brain and spinal cord to control everything from
muscle contractions to glandular output. Interneurons connect neurons to other neurons within the
same region of the brain or spinal cord. A group of connected neurons is called a neural circuit.
A typical neuron consists of a cell body (referred to as the soma), dendrites, and a single axon. The soma is
usually compact. The axon and dendrites are filaments that extrude from it. Dendrites typically branch
profusely and extend a few hundred micrometers from the soma. The axon leaves the soma at a
swelling called the axon hillock, and travels for as far as 1 meter in humans or more in other species. It
branches but usually maintains a constant diameter. At the farthest tip of the axon's branches are axon
terminals, where the neuron can transmit a signal across the synapse to another cell. Neurons may lack
dendrites or have no axon. The term neurite is used to describe either a dendrite or an axon,
particularly when the cell is undifferentiated.
The soma is the body of the neuron. As it contains the nucleus, most protein synthesis occurs here.
The dendrites of a neuron are cellular extensions with many branches. This overall shape and structure
is referred to metaphorically as a dendritic tree. This is where most of the input to the neuron occurs
via the dendritic spine.
The axon is a finer, cable-like projection that can extend tens, hundreds, or even tens of thousands of
times the diameter of the soma in length. The axon primarily carries nerve signals away from the soma,
and carries some types of information back to it. Many neurons have only one axon, but this axon may,
and usually will, undergo extensive branching, enabling communication with many target cells. The part of the
axon where it emerges from the soma is called the axon hillock. Besides being an anatomical structure,
the axon hillock also has the greatest density of voltage-dependent sodium channels. This makes it the
most easily excited part of the neuron and the spike initiation zone for the axon. In electrophysiological
terms, it has the most negative threshold potential.
While the axon and axon hillock are generally involved in information outflow, this region can also
receive input from other neurons.
The axon terminal is found at the end of the axon farthest from the soma and contains synapses.
Synaptic boutons are specialized structures where neurotransmitter chemicals are released to communicate
with target neurons. In addition to synaptic boutons at the axon terminal, a neuron may have en passant
boutons, which are located along the length of the axon.
Most neurons receive signals via the dendrites and soma and send out signals down the axon. At most
synapses, signals cross from the axon of one neuron to a dendrite of another. However, synapses can
connect an axon to another axon or a dendrite to another dendrite. The signaling process is partly
electrical and partly chemical. Neurons are electrically excitable, due to maintenance of voltage
gradients across their membranes. If the voltage changes by a large amount over a short interval, the
neuron generates an all-or-nothing electrochemical pulse called an action potential.
This potential travels rapidly along the axon and activates synaptic connections as it reaches them.
Synaptic signals may be excitatory or inhibitory, increasing or reducing the net voltage that reaches the
soma.
In most cases, neurons are generated by neural stem cells during brain development and childhood.
Neurogenesis largely ceases during adulthood in most areas of the brain.
Neurons communicate with each other by sending chemical signals, called neurotransmitters, across a narrow space,
called a synapse, between the axon of the sender neuron and the dendrites of the receiver neuron.
1. Network Topology
2. Adjustments of Weights or Learning
3. Activation Functions
1. Network Topology: A network topology is the arrangement of a network along with its nodes and
connecting lines. According to the topology, ANNs can be classified into the following kinds:
A. Feed forward Network: It is a non-recurrent network having processing units/nodes in layers, where
all the nodes in a layer are connected with the nodes of the previous layer. The connections carry
different weights. There is no feedback loop, which means the signal can only flow in one
direction, from input to output. It may be divided into the following two types:
Single layer feed forward network: A feed forward ANN having only one weighted layer. In other
words, the input layer is fully connected to the output layer.
Multilayer feed forward network: A feed forward ANN having more than
one weighted layer. As this network has one or more layers between the input and the output
layer, these are called hidden layers.
B. Recurrent networks: They are feedback networks with closed loops. Following are the two types
of recurrent networks.
Fully recurrent network: It is the simplest neural network architecture because all nodes are
connected to all other nodes and each node works as both input and output.
Jordan network: It is a closed loop network in which the output goes to the input again as
feedback, as shown in the following diagram.
The purpose of an artificial neural network is to mimic how the human brain works with the hope
that we can build a machine that behaves like a human. An artificial neuron is the core building
block of an artificial neural network.
The structure of an artificial neuron is very similar to that of a biological neuron. It consists of 3 main parts:
the weight and bias, which play the role of the dendrites and are denoted by w and b respectively;
the output, which plays the role of the axon and is denoted by y; and the activation function, which
plays the role of the cell body (nucleus) and is denoted by f(x). Here x is the input signal
received by the dendrites.
Faculty Name : Dr.D.Shanthi Subject Name :DL
In artificial neurons, the input and weight are represented as vectors while the bias is represented as a
scalar. An artificial neuron processes input signals by taking the dot product of the input
vector and the weight vector, adding the bias, applying an activation function, and finally
propagating the result to other neurons.
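The computation described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `neuron_output` and the example values for `x`, `w` and `b` are assumptions made for the illustration and do not come from the notes.

```python
import math

def neuron_output(x, w, b, activation):
    """Weighted sum of the inputs plus the bias, passed through an activation."""
    net = sum(xi * wi for xi, wi in zip(x, w)) + b  # dot product + bias
    return activation(net)

# Hypothetical example values (not taken from the notes).
x = [1.0, 2.0, 3.0]    # input vector
w = [0.5, -0.25, 0.1]  # weight vector
b = 0.2                # scalar bias

y = neuron_output(x, w, b, math.tanh)  # tanh is used here as one possible activation
print(y)
```

Any function of one number can be passed as `activation`, which keeps the weighted sum and the non-linearity separate.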
Activation Function
Activation functions are an essential part that should not be underestimated. An activation function is the function
an artificial neuron uses to compute its output; it is also known as a transfer function. The
result of the dot product between weight and input plus bias lies in the range of -inf to +inf;
the activation function maps this result into a certain range that depends on the function.
There are many activation functions, and one of the most widely used is the sigmoid activation
function. It is often used as the activation in the output layer for binary classification tasks. Sigmoid
bounds the result to the range 0 to 1, which can be read as the probability that x belongs
to class 1. A decision is then made by thresholding the result: if the result is ≥ 0.5 then x is
classified as 1; otherwise, x is classified as 0.
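As a minimal sketch of the sigmoid and the 0.5 threshold described above (the function names `sigmoid` and `classify` are illustrative choices, not from the notes):

```python
import math

def sigmoid(z):
    """Map any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(z, threshold=0.5):
    """Return class 1 if sigmoid(z) reaches the threshold, otherwise class 0."""
    return 1 if sigmoid(z) >= threshold else 0

for z in (-2.0, 0.0, 3.0):
    print(z, sigmoid(z), classify(z))
```

Since sigmoid(0) is exactly 0.5, thresholding at 0.5 is the same as checking whether the weighted sum plus bias is non-negative.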
An artificial neural network is a collection of artificial neurons interconnected with each other. Artificial
neural networks (ANNs) learn to solve problems in a way loosely modeled on the human brain. They process information
by filtering it through densely connected artificial neurons. Each connection between neurons
can transmit a signal from one neuron to another.
An artificial neural network learns by nudging its weights and biases so that it makes better
predictions. The most popular algorithm is gradient descent, an iterative process that aims
to find the weights and biases that minimize the error.
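The idea of nudging a weight and a bias can be illustrated with gradient descent on a single linear neuron y = w*x + b. This is a sketch under stated assumptions: the mean squared error loss, the toy data generated from y = 2x + 1, the learning rate and the step count are all choices made for this illustration.

```python
def train_linear_neuron(xs, ts, lr=0.05, steps=2000):
    """Fit y = w*x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction error for every training example.
        errors = [(w * x + b) - t for x, t in zip(xs, ts)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        # Move a small step against the gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical toy data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ts = [2.0 * x + 1.0 for x in xs]
w, b = train_linear_neuron(xs, ts)
print(w, b)
```

Each step changes the parameters in proportion to how much they contributed to the error, which is why a learning rate that is too large can overshoot.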
1. Adaptive learning: An ability to learn how to do tasks based on the data given
for training or initial experience.
2. Self-Organisation: An ANN can create its own organisation or representation
of the information it receives during learning time.
3. Real Time Operation: ANN computations may be carried out in parallel, and special
hardware devices are being designed and manufactured which take advantage of this
capability.
4. Pattern recognition: A powerful technique for harnessing the information in the
data and generalizing about it. Neural nets learn to recognize the patterns which exist
in the data set.
5. The system is developed through learning rather than programming. Neural nets
teach themselves the patterns in the data, freeing the analyst for more interesting
work.
LIMITATIONS OF ANN
Every technology has merits and demerits. In other words, every system has
limitations, and ANN technology is weak in some respects.
The various limitations of ANN are:
Inputs: Source data fed into the neural network, with the goal of making a decision
or prediction about the data. Inputs to a neural network are typically a set of real
values; each value is fed into one of the neurons in the input layer.
Training Set: A set of inputs for which the correct outputs are known, used to train the
neural network.
Outputs: Neural networks generate their predictions in the form of a set of real
values or boolean decisions. Each output value is generated by one of the neurons in
the output layer.
Neuron/perceptron: The basic unit of the neural network. Accepts an input and
generates a prediction.
Each neuron accepts part of the input and passes it through the activation function.
Common activation functions are sigmoid, tanh and ReLU. Activation functions help
generate output values within an acceptable range, and their non-linear form is
crucial for training the network.
Weight Space: Each neuron is given a numeric weight. The weights, together with
the activation function, define each neuron’s output. Neural networks are trained
by fine-tuning weights, to discover the optimal set of weights that generates the
most accurate prediction.
Forward Pass: The forward pass takes the inputs, passes them through the network
and allows each neuron to react to a fraction of the input. Neurons generate their
outputs and pass them on to the next layer, until eventually the network generates
an output.
Error Function: Defines how far the actual output of the current model is from the
correct output. When training the model, the objective is to minimize the error
function and bring output as close as possible to the correct value.
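The forward pass and the error function can be sketched together as follows. The layer sizes, the weight values in `layers`, the sigmoid activation and the use of mean squared error are assumptions made for this illustration; the notes do not fix any of them.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(inputs, weights, biases):
    """One layer: each neuron takes a weighted sum of the inputs, then applies sigmoid."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def forward_pass(inputs, layers):
    """Pass the inputs through every layer in turn and return the final outputs."""
    activations = list(inputs)
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations

def mean_squared_error(outputs, targets):
    """Error function: average squared distance between outputs and targets."""
    return sum((o - t) ** 2 for o, t in zip(outputs, targets)) / len(outputs)

# Hypothetical network: 2 inputs -> 2 hidden neurons -> 1 output neuron.
layers = [
    ([[0.5, -0.4], [0.3, 0.8]], [0.1, -0.1]),  # hidden layer weights and biases
    ([[1.0, -1.0]], [0.0]),                    # output layer weights and biases
]
outputs = forward_pass([1.0, 0.0], layers)
print(outputs, mean_squared_error(outputs, [1.0]))
```

Training then consists of changing the numbers in `layers` so that the value returned by `mean_squared_error` becomes smaller.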
If the shape of the object is rounded with a depression at the top, and it is red in color, then it
will be labeled as Apple.
If the shape of the object is a long curving cylinder with a green-yellow color, then it will be
labeled as Banana.
Now suppose that, after training, you are given a new fruit from the basket, say a banana,
and asked to identify it.
Since the machine has already learned from the previous data, this time it has to apply that knowledge.
It will first classify the fruit by its shape and color, confirm the fruit name as
BANANA, and put it in the Banana category. Thus the machine learns from the training
data (the basket containing fruits) and then applies the knowledge to test data (the new fruit).
Supervised learning is classified into two categories of algorithms:
Classification: A classification problem is when the output variable is a category, such as
“Red” or “blue” , “disease” or “no disease”.
Regression: A regression problem is when the output variable is a real value, such as
“dollars” or “weight”.
Supervised learning deals with or learns with “labeled” data. This implies that some data is
already tagged with the correct answer.
Types:-
Regression
Logistic Regression
Classification
Naive Bayes Classifiers
K-NN (k nearest neighbors)
Decision Trees
Support Vector Machine
Advantages:-
Supervised learning allows collecting data and produces data output from previous
experiences.
Helps to optimize performance criteria with the help of experience.
Supervised machine learning helps to solve various types of real-world computation
problems.
It performs classification and regression tasks.
It allows estimating or mapping the result to a new sample.
We have complete control over choosing the number of classes we want in the training data.
Disadvantages:-
Classifying big data can be challenging.
Training for supervised learning needs a lot of computation time.
Supervised learning cannot handle all complex tasks in Machine Learning.
It requires a labelled data set.
It requires a training process.
Unsupervised learning
Unsupervised learning is the training of a machine using information that is neither classified
nor labeled and allowing the algorithm to act on that information without guidance. Here the
task of the machine is to group unsorted information according to similarities, patterns, and
differences without any prior training of data.
Unlike supervised learning, no teacher is provided, which means no training is given to the
machine. Therefore the machine must find the hidden structure in unlabeled data by
itself.
For instance, suppose it is given an image having both dogs and cats which it has never seen.
Parameter                  Supervised learning                 Unsupervised learning
Computational Complexity   Simpler method                      Computationally complex
Training data              Use training data to infer model.   No training data is used.
Model                      We can test our model.              We can not test our model.
PERCEPTRON
Figure: A perceptron
Perceptrons can represent all of the primitive Boolean functions AND, OR, NAND (~ AND),
and NOR (~OR)
Example: Representation of AND functions
The learning problem is to determine a weight vector that causes the perceptron to
produce the correct + 1 or - 1 output for each of the given training examples.
Weights w1 = 0.6, w2 = 0.6, Threshold = 1 and Learning Rate n = 0.5 are given
This is not greater than the threshold of 1, so the output = 0. Here the target is same as
calculated output.
This is not greater than the threshold of 1, so the output = 0. Here the target does not match
with calculated output.
Weights w1 = 0.6, w2 = 1.1, Threshold = 1 and Learning Rate n = 0.5 are given
This is not greater than the threshold of 1, so the output = 0. Here the target is same as
calculated output.
This is greater than the threshold of 1, so the output = 1. Here the target is same as calculated
output.
This is not greater than the threshold of 1, so the output = 0. Here the target does not match
with calculated output.
Now,
Weights w1 = 1.1, w2 = 1.1, Threshold = 1 and Learning Rate n = 0.5 are given
This is not greater than the threshold of 1, so the output = 0. Here the target is same as calculated
output.
This is greater than the threshold of 1, so the output = 1. Here the target is same as calculated
output.
This is greater than the threshold of 1, so the output = 1. Here the target is same as calculated
output.
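The weight sequence in the worked example (0.6, 0.6 to 0.6, 1.1 to 1.1, 1.1) can be reproduced with a short sketch. Two assumptions are made here, because the truth table and the update formula are not shown in the surviving text: the targets are taken to be those of the OR function, which is consistent with the mismatches reported above, and the update is the usual perceptron rule w_i = w_i + n * (target - output) * x_i. The sketch also sweeps through all four inputs in each pass instead of restarting after every weight change.

```python
def step(net, threshold):
    """Output 1 only when the weighted sum is greater than the threshold."""
    return 1 if net > threshold else 0

def train_perceptron(samples, weights, threshold=1.0, lr=0.5, max_epochs=20):
    w = list(weights)
    for _ in range(max_epochs):
        mistakes = 0
        for x, target in samples:
            net = sum(wi * xi for wi, xi in zip(w, x))
            output = step(net, threshold)
            if output != target:
                mistakes += 1
                # Perceptron rule: move each weight by lr * error * input.
                w = [wi + lr * (target - output) * xi for wi, xi in zip(w, x)]
        if mistakes == 0:  # every example is classified correctly
            break
    return w

# Assumed truth table (OR); the starting weights are the ones given in the notes.
OR_SAMPLES = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
final_weights = train_perceptron(OR_SAMPLES, [0.6, 0.6])
print(final_weights)
```

With these settings the sketch ends at weights of about 1.1 and 1.1, matching the last set of weights in the example.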
ADALINE (Adaptive Linear Neuron or later Adaptive Linear Element) is an early single-
layer artificial neural network and the name of the physical device that implemented
this network. The network uses memistors. It was developed by Professor Bernard
Widrow and his graduate student Ted Hoff at Stanford University in 1960. It is based
on the McCulloch–Pitts neuron. It consists of a weight, a bias and a summation
function. The difference between Adaline and the standard (McCulloch–Pitts)
perceptron is that in the learning phase, the weights are adjusted according to the
weighted sum of the inputs (the net). In the standard perceptron, the net is passed to
the activation (transfer) function and the function's output is used for adjusting the
weights. Some important points about Adaline are as follows:
It uses bipolar activation function.
It uses delta rule for training to minimize the Mean-Squared Error (MSE)
between the actual output and the desired/target output.
The weights and the bias are adjustable.
The actual output is compared with the desired/target output. After comparison, based on the training algorithm, the
weights and bias will be updated.
For easy calculation and simplicity, weights and bias can be set equal to 0 and the
learning rate can be set equal to 1.
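Activate each input unit as follows:
$x_i = s_i \quad (i = 1 \text{ to } n)$
Step 5 − Obtain the net input with the following relation:
$y_{in} = b + \sum_{i=1}^{n} x_i w_i$
Here ‘b’ is bias and ‘n’ is the total number of input neurons.
Step 6 − Apply the following activation function to obtain the final output:
$f(y_{in}) = \begin{cases} 1, & \text{if } y_{in} \geq 0 \\ -1, & \text{if } y_{in} < 0 \end{cases}$
Step 7 − Adjust the weights and bias as follows:
Case 1 − if y ≠ t then,
$w_i(new) = w_i(old) + \alpha (t - y_{in}) x_i$
$b(new) = b(old) + \alpha (t - y_{in})$
Case 2 − if y = t then,
$w_i(new) = w_i(old)$
$b(new) = b(old)$
Here ‘y’ is the actual output and ‘t’ is the desired/target output. $(t - y_{in})$ is the computed
error.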
Step 8 − Test for the stopping condition, which will happen when there is no change
in weight or the highest weight change that occurred during training is smaller than the
specified tolerance.
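The Adaline training steps can be sketched as below. Several things here are assumptions made for the illustration: the bipolar AND data, the learning rate of 0.05 (smaller than the value of 1 mentioned for hand calculation), and the choice to update only when the thresholded output differs from the target. A common alternative applies the delta rule update on every example.

```python
def bipolar_step(net):
    """Bipolar activation: +1 for a non-negative net input, otherwise -1."""
    return 1 if net >= 0 else -1

def train_adaline(samples, lr=0.05, max_epochs=200, tolerance=1e-4):
    n = len(samples[0][0])
    w = [0.0] * n  # weights start at 0
    b = 0.0        # bias starts at 0
    for _ in range(max_epochs):
        largest_change = 0.0
        for x, t in samples:
            y_in = b + sum(xi * wi for xi, wi in zip(x, w))  # net input
            if bipolar_step(y_in) != t:
                error = t - y_in  # the error uses the net input, not the output
                for i in range(n):
                    w[i] += lr * error * x[i]
                    largest_change = max(largest_change, abs(lr * error * x[i]))
                b += lr * error
                largest_change = max(largest_change, abs(lr * error))
        # Stopping condition: the largest weight change is below the tolerance.
        if largest_change < tolerance:
            break
    return w, b

# Hypothetical training set: the AND function with bipolar inputs and targets.
AND_SAMPLES = [((1, 1), 1), ((1, -1), -1), ((-1, 1), -1), ((-1, -1), -1)]
weights, bias = train_adaline(AND_SAMPLES)
for x, t in AND_SAMPLES:
    net = bias + sum(xi * wi for xi, wi in zip(x, weights))
    print(x, t, bipolar_step(net))
```

The point that separates Adaline from the perceptron is visible in the `error` line: the correction is proportional to the distance between the target and the net input.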
Madaline, which stands for Multiple Adaptive Linear Neuron, is a network which
consists of many Adalines in parallel. It has a single output unit. Its training
algorithm is based on a principle called "minimal disturbance". It proceeds by looping
over training examples, and for each example, it:
finds the hidden layer unit (ADALINE classifier) with the lowest
confidence in its prediction,
tentatively flips the sign of the unit,
accepts or rejects the change based on whether the network's error is reduced,
stops when the error is zero.
By now we know that only the weights and bias between the input and the Adaline
layer are to be adjusted, and the weights and bias between the Adaline and the
Madaline layer are fixed.
Step 5 − Obtain the net input at each hidden layer, i.e. the Adaline layer, with the
following relation:
$Q_{inj} = b_j + \sum_{i=1}^{n} x_i w_{ij}$
Step 6 − Apply the following activation function to obtain the final output at the
Adaline and the Madaline Layer:
$f(x) = \begin{cases} 1, & \text{if } x \geq 0 \\ -1, & \text{if } x < 0 \end{cases}$
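Step 7 − Calculate the error and adjust the weights as follows:
Case 1 − if y ≠ t and t = 1 then,
$w_{ij}(new) = w_{ij}(old) + \alpha (1 - Q_{inj}) x_i$
$b_j(new) = b_j(old) + \alpha (1 - Q_{inj})$
In this case, the weights would be updated on Qj where the net input is close to 0
because t = 1.
Case 2 − if y ≠ t and t = −1 then,
$w_{ik}(new) = w_{ik}(old) + \alpha (-1 - Q_{ink}) x_i$
$b_k(new) = b_k(old) + \alpha (-1 - Q_{ink})$
In this case, the weights would be updated on Qk where the net input is
positive because t = −1.
Here ‘y’ is the actual output and ‘t’ is the desired/target output.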
Step 8 − Test for the stopping condition, which will happen when there is no change
in weight or the highest weight change that occurred during training is smaller
than the specified tolerance.
Backpropagation:
Backpropagation Algorithm:
Step 5: From the output layer, go back to the hidden layer to adjust the weights to reduce the
error.
Step 6: Repeat the process until the desired output is achieved.
Backpropagation is short for “backpropagation of errors” and is very useful for training neural networks.
It is fast, simple, and easy to implement. Apart from the number of inputs, it has few parameters
to set, although a learning rate must still be chosen. Backpropagation is a flexible method because no prior
knowledge of the network is required.
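The idea of sending the error back from the output layer to the hidden layer can be sketched for a very small network. Everything concrete below is an assumption made for this illustration: one hidden layer of 3 sigmoid neurons, a single sigmoid output, a squared error loss, a learning rate of 0.5, and the XOR data. The dictionary keys and function names are also illustrative.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_params(n_in, n_hidden, seed=0):
    rng = random.Random(seed)
    return {
        "w_h": [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)],
        "b_h": [0.0] * n_hidden,
        "w_o": [rng.uniform(-1, 1) for _ in range(n_hidden)],
        "b_o": 0.0,
    }

def forward(p, x):
    """Forward pass: hidden activations h, then the network output y."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(p["w_h"], p["b_h"])]
    y = sigmoid(sum(w * hj for w, hj in zip(p["w_o"], h)) + p["b_o"])
    return h, y

def loss(p, x, t):
    return 0.5 * (forward(p, x)[1] - t) ** 2

def backward(p, x, t):
    """Backward pass: gradients of the loss for every weight and bias."""
    h, y = forward(p, x)
    delta_o = (y - t) * y * (1.0 - y)  # error signal at the output
    # The output error is sent back through the output weights to each hidden unit.
    delta_h = [delta_o * w * hj * (1.0 - hj) for w, hj in zip(p["w_o"], h)]
    return {
        "w_o": [delta_o * hj for hj in h],
        "b_o": delta_o,
        "w_h": [[dj * xi for xi in x] for dj in delta_h],
        "b_h": delta_h,
    }

def sgd_step(p, g, lr):
    """Adjust every parameter a small step against its gradient."""
    p["w_o"] = [w - lr * gw for w, gw in zip(p["w_o"], g["w_o"])]
    p["b_o"] -= lr * g["b_o"]
    p["w_h"] = [[w - lr * gw for w, gw in zip(row, grow)]
                for row, grow in zip(p["w_h"], g["w_h"])]
    p["b_h"] = [b - lr * gb for b, gb in zip(p["b_h"], g["b_h"])]

XOR = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0), ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0)]
params = init_params(2, 3)
for _ in range(2000):  # repeat the forward and backward passes
    for x, t in XOR:
        sgd_step(params, backward(params, x, t), lr=0.5)
print([round(forward(params, x)[1], 3) for x, _ in XOR])
```

How close the printed outputs get to the XOR targets depends on the random starting weights, the learning rate and the number of repetitions, so the sketch makes no claim about the final values.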
Advantages:
Disadvantages:
It is sensitive to noisy data and irregularities. Noisy data can lead to inaccurate results.
Performance is highly dependent on input data.
Training can take a long time.
The figure given below illustrates a memory containing the names of various
people. If the given memory is content addressable, an incorrect string such as
"Albert Einstien" as a key is sufficient to recover the correct name "Albert
Einstein."
This type of memory is robust and fault-tolerant because it has
some form of error-correction capability.
Hetero-associative memory:
If the memory is presented with an input pattern, say α, the associated
pattern ω is recovered automatically.
1. Hebbian Learning Rule: The Hebbian rule was the first learning rule. In 1949 Donald
Hebb developed it as a learning algorithm for unsupervised neural networks. We can use it
to identify how to improve the weights of nodes of a network. The Hebb learning rule
assumes that if two neighboring neurons are activated and deactivated at the same time, then
the weight connecting these neurons should increase. At the start, values of all weights are
set to zero. This learning rule can be used for both soft- and hard-activation functions.
Since desired responses of neurons are not used in the learning procedure, this is an
unsupervised learning rule. The absolute values of the weights are usually proportional
to the learning time, which is undesirable.
Hebb Rule for Pattern Association: The Hebb rule is the simplest and most common method of
determining the weights for an associative memory neural net.
We denote our training vector pairs (input training vector and target output vector) as
s:t.
We then denote our testing input vector as x, which may or may not be the same as one of the
training input vectors.
In the training algorithm of the Hebb rule, the weights are initially set to 0 and then updated using
the following formula:
$w_{ij}(new) = w_{ij}(old) + s_i t_j$
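A minimal sketch of Hebb rule training for pattern association is given here. It assumes the standard update in which each weight w_ij grows by s_i * t_j for every training pair s:t, and it uses two hypothetical bipolar pairs chosen so that the input vectors are orthogonal. The recall function and its sign convention are also assumptions of the sketch.

```python
def hebb_train(pairs):
    """Start with all weights at 0, then add s_i * t_j for every training pair s:t."""
    n, m = len(pairs[0][0]), len(pairs[0][1])
    w = [[0] * m for _ in range(n)]
    for s, t in pairs:
        for i in range(n):
            for j in range(m):
                w[i][j] += s[i] * t[j]
    return w

def recall(w, x):
    """Present a test vector x and threshold each net input to a bipolar output."""
    nets = [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]
    return [1 if net >= 0 else -1 for net in nets]

# Hypothetical training pairs s:t with bipolar values.
pairs = [
    ([1, -1, 1, -1], [1, -1]),
    ([1, 1, -1, -1], [-1, 1]),
]
w = hebb_train(pairs)
print(w)
print(recall(w, [1, -1, 1, -1]))
```

Because the two input vectors are orthogonal, presenting either one recalls its own target without interference from the other pair.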
2. Delta Learning Rule: Developed by Widrow and Hoff, the delta rule is one of the most
common learning rules. It depends on supervised learning. This rule states that the modification
in the synaptic weight of a node is equal to the multiplication of the error and the input. In
mathematical form the delta rule is as follows:
$\Delta w_i = \alpha (t - y_{in}) x_i$
For a given input vector, compare the output vector with the correct answer. If the difference
is zero, no learning takes place; otherwise, the network adjusts its weights to reduce this difference.
Bidirectional Associative Memory (BAM) is a recurrent neural network (RNN) of a special type, initially
proposed by Bart Kosko in the late 1980s.
BAM is a supervised learning model in Artificial Neural Networks.
It is a hetero-associative memory: for an input pattern, it returns another pattern which is potentially
of a different size. This behavior is very similar to the human brain. Human memory is necessarily
associative. It uses a chain of mental associations to recover a lost memory, such as associations of faces
with names, or of exam questions with answers.
In such memory associations for one type of object with another, a Recurrent Neural Network (RNN) is
needed to receive a pattern of one set of neurons as an input and generate a related, but different,
output pattern of another set of neurons.
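The two-way recall of a BAM can be sketched as follows. This assumes the usual construction in which the weight matrix is the sum of the outer products of the stored pairs, with recall by thresholding in either direction. The bipolar pattern pairs and all function names are hypothetical choices for the illustration.

```python
def sign(v):
    """Bipolar threshold; a net input of exactly 0 is mapped to +1 in this sketch."""
    return 1 if v >= 0 else -1

def bam_weights(pairs):
    """Weight matrix: sum over the stored pairs of the outer product of x and y."""
    n, m = len(pairs[0][0]), len(pairs[0][1])
    return [[sum(x[i] * y[j] for x, y in pairs) for j in range(m)] for i in range(n)]

def recall_forward(w, x):
    """From a pattern on the first layer, recall the pattern on the second layer."""
    return [sign(sum(x[i] * w[i][j] for i in range(len(x)))) for j in range(len(w[0]))]

def recall_backward(w, y):
    """From a pattern on the second layer, recall the pattern on the first layer."""
    return [sign(sum(w[i][j] * y[j] for j in range(len(y)))) for i in range(len(w))]

# Hypothetical pairs: 6-element inputs associated with 4-element outputs.
x1, y1 = [1, -1, 1, -1, 1, -1], [1, 1, -1, -1]
x2, y2 = [1, 1, 1, -1, -1, -1], [1, -1, 1, -1]
w = bam_weights([(x1, y1), (x2, y2)])

print(recall_forward(w, x1))
print(recall_backward(w, y2))
noisy_x1 = [-1, -1, 1, -1, 1, -1]  # x1 with its first element flipped
print(recall_forward(w, noisy_x1))
```

The input and output patterns have different lengths, which is the hetero-associative property described above, and the same weight matrix serves both directions.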
BAM Architecture:
The BAM models are rather efficient when deployed as part of an AI-based decision-making process,
inferring the solution to a specific data analysis problem from various associations among many interrelated data.
Generally, the BAM model can be used for a wide range of applications, such as:
The BAM models are very useful whenever the varieties of knowledge acquired by the ANNs are not
enough for processing the data introduced to an AI for analysis.
For example, the prediction of words missed out from incomplete texts with ANN, basically requires that
the associations of words-to-sentences are stored in the ANN’s memory. However, this would incur an
incorrect prediction, because the same missing words might occur in more than one incomplete
Hopfield Network:
A Hopfield network is a special kind of neural network whose response is different from other neural
networks. Its response is calculated by a converging iterative process. It has just one layer of neurons, matching the
size of the input and output, which must be the same. When such a network is used to recognize, for example,
digits, we present a list of correctly rendered digits to the network. Subsequently, the network can
transform a noisy input into the corresponding correct output.
In 1982, John Hopfield introduced an artificial neural network to store and retrieve memory like the
human brain. Here, a neuron is either in an on or an off state. The state of a neuron (on +1 or off 0) will
be updated depending on the input it receives from the other neurons. A Hopfield network is first
trained to store various patterns or memories. Afterward, it is able to recognize any of the learned
patterns when given partial or even corrupted data about that pattern, i.e., it eventually settles
down and restores the closest pattern. Thus, similar to the human brain, the Hopfield model has stability
in pattern recognition.
A Hopfield network is a particular type of single-layered neural network. Dr. John J. Hopfield invented it
in 1982. These networks were introduced to store and retrieve memory and to store various patterns.
Also, auto-association and optimization tasks can be done using these networks. In this network,
each node is fully connected (recurrent) to the other nodes. These nodes exist only in two states: ON
(1) or OFF (0). These states can be updated based on the input received from other nodes. Unlike other
neural networks, the output of the Hopfield network is finite. Also, the input and output sizes must be
the same in these networks.
The Hopfield network consists of associative memory. This memory allows the system to retrieve the
memory using an incomplete portion. The network can restore the closest pattern using the data
captured in associative memory. This feature of Hopfield networks makes it a good candidate for
pattern recognition.
Associative memory is a content addressable memory that establishes a relation between the input
vector and the output target vector. It enables the recall of data stored in the memory based on
its similarity with the input vector.
Discrete Networks
These networks give one of two discrete outputs. Based on the output, there are two further types:
1. Binary: In this type, the output is either 0 or 1.
2. Bipolar: In bipolar networks, the output is either -1 (When output < 0) or 1 (When output > 0)
Continuous Networks
Instead of receiving binary or bipolar output, the output value lies between 0 and 1.
The Architecture of Hopfield Network
The architecture of the Hopfield network consists of the following elements:
Individual nodes preserve their states until an update is required.
The node to be updated is selected randomly.
Each node is connected to all other nodes except itself.
The state of each node is either 0/1 or 1/-1.
The Hopfield network structure is symmetric, i.e., Wij = Wji for all i's and j's.
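The architectural properties listed above (symmetric weights, no self-connection, nodes updated in random order) can be sketched for a small discrete Hopfield network. The sketch assumes bipolar states (1/-1), Hebbian storage by summing outer products, and two hypothetical stored patterns of length 8. The function names and the fixed random seed are illustrative choices.

```python
import random

def train_hopfield(patterns):
    """Symmetric weights from the stored patterns; no node connects to itself."""
    n = len(patterns[0])
    w = [[0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += p[i] * p[j]
    return w

def recall(w, state, max_sweeps=10, seed=0):
    """Update nodes one at a time in random order until no state changes."""
    rng = random.Random(seed)
    s = list(state)
    order = list(range(len(s)))
    for _ in range(max_sweeps):
        changed = False
        rng.shuffle(order)
        for i in order:
            net = sum(w[i][j] * s[j] for j in range(len(s)))
            if net > 0:
                new_state = 1
            elif net < 0:
                new_state = -1
            else:
                new_state = s[i]  # a node keeps its state when the net input is 0
            if new_state != s[i]:
                s[i] = new_state
                changed = True
        if not changed:
            break
    return s

# Hypothetical stored patterns (bipolar, mutually orthogonal).
p1 = [1, -1, 1, -1, 1, -1, 1, -1]
p2 = [1, 1, 1, 1, -1, -1, -1, -1]
w = train_hopfield([p1, p2])

corrupted = [-1, -1, 1, -1, 1, -1, 1, -1]  # p1 with its first element flipped
print(recall(w, corrupted))
```

Starting from the corrupted input, the network settles on the closest stored pattern, which is the restoring behaviour described in this section.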
The representation of the sample architecture of the Hopfield network having three nodes is as follows:
UNIT-II Unsupervised Learning Network- Introduction, Fixed Weight Competitive Nets, Maxnet,
Hamming Network, Kohonen Self-Organizing Feature Maps, Learning Vector Quantization, Counter
Propagation Networks, Adaptive Resonance Theory Networks. Special Networks-Introduction to various
networks.
Unsupervised learning
Unsupervised learning is the training of a machine using information that is neither classified
nor labeled and allowing the algorithm to act on that information without guidance. Here the
task of the machine is to group unsorted information according to similarities, patterns, and
differences without any prior training of data.
Unlike supervised learning, no teacher is provided, which means no training is given to the
machine. Therefore the machine must find the hidden structure in unlabeled data by
itself.
For instance, suppose it is given an image having both dogs and cats which it has never seen.
The machine has no idea about the features of dogs and cats, so it cannot label them as
‘dogs’ and ‘cats’. But it can categorize them according to their similarities, patterns, and
differences, i.e., it can split the picture into two parts. The first may contain
all pictures having dogs in them and the second may contain all pictures having cats in them. Here
the machine has not learned anything beforehand, which means there are no training data or examples.
It allows the model to work on its own to discover patterns and information that was previously
undetected. It mainly deals with unlabeled data.
Unsupervised learning is classified into two categories of algorithms, clustering and association. Clustering approaches include:
1. Exclusive (partitioning)
2. Agglomerative
3. Overlapping
4. Probabilistic
Clustering Types: -
1. Hierarchical clustering
2. K-means clustering
This learning process is independent. During the training of ANN under unsupervised
learning, the input vectors of similar type are combined to form clusters. When a new
input pattern is applied, then the neural network gives an output response indicating the
class to which input pattern belongs. In this, there would be no feedback from the
environment as to what should be the desired output and whether it is correct or
incorrect. Hence, in this type of learning the network itself must discover the patterns,
features from the input data and the relation for the input data over the output.
Max Net
This is also a fixed weight network, which serves as a subnet for selecting the node having the
highest input. All the nodes are fully interconnected and there exist symmetrical weights in all
these weighted interconnections.
When a net is trained to classify the input signal into one of the output
categories, A, B, C, D, E, J, or K, the net sometimes responds that the signal is both
a C and a K, or both an E and a K, or both a J and a K, due to similarities in these
character pairs. In this case it will be better to include additional structure in the net to force
it to make a definitive decision. The mechanism by which this can be accomplished is
called competition.
The most extreme form of competition among a group of neurons is called Winner-
Take- All, where only one neuron (the winner) in the group will have a nonzero output
signal when the competition is completed. An example of that is the MAXNET.
Architecture
It uses an iterative process in which each node receives inhibitory inputs
from all other nodes through connections. The single node whose value is maximum becomes the
active node (the winner), and the activations of all other nodes become inactive.
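The iterative competition described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the function name maxnet, the inhibition weight epsilon, and its default value 1/(2m) are assumptions chosen for the example (any value with 0 < epsilon < 1/m is commonly used for m nodes).

```python
import numpy as np

def maxnet(activations, epsilon=None, max_iters=1000):
    """Run the MAXNET competition and return (winner index, final activations)."""
    a = np.asarray(activations, dtype=float).copy()
    m = a.size
    if epsilon is None:
        epsilon = 1.0 / (2 * m)  # assumed default; should satisfy 0 < epsilon < 1/m
    for _ in range(max_iters):
        if np.count_nonzero(a) <= 1:
            break  # only the winner is still active
        total = a.sum()
        # Each node keeps its own activation and is inhibited by all other nodes.
        a = np.maximum(0.0, a - epsilon * (total - a))
    return int(np.argmax(a)), a

winner, final = maxnet([0.2, 0.4, 0.6, 0.8])
print(winner, final)
```

If two nodes start with exactly the same maximal activation, the competition cannot separate them, so the sketch simply stops after max_iters iterations.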
Another network of this kind is the Hamming network, which clusters each given input vector
into one of several groups. Following are some important features of Hamming networks:
Lippmann started working on Hamming networks in 1987.
It is a single layer network.
The inputs can be either binary {0, 1} or bipolar {-1, 1}.
The weights of the net are calculated from the exemplar vectors.
It is a fixed weight network, which means the weights remain the same even during
training.
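As an illustration of weights that are calculated from the exemplar vectors, the sketch below uses one common textbook formulation for bipolar inputs: the weights are half the exemplar vectors and the bias is n/2, so that each output equals n minus the Hamming distance to that exemplar. The function name and the sample vectors are assumptions made for the example.

```python
import numpy as np

def hamming_net_scores(x, exemplars):
    """Return, for each exemplar, n minus its Hamming distance to x (bipolar vectors)."""
    e = np.asarray(exemplars, dtype=float)
    x = np.asarray(x, dtype=float)
    n = x.size
    weights = e / 2.0   # fixed weights taken directly from the exemplars
    bias = n / 2.0
    return weights @ x + bias

exemplars = [[1, -1, -1, -1], [-1, -1, -1, 1]]  # hypothetical stored patterns
x = [1, 1, -1, -1]                              # hypothetical input
scores = hamming_net_scores(x, exemplars)
best = int(np.argmax(scores))                   # closest exemplar
print(scores, best)
```

The scores can then be passed to a MAXNET-style competition to pick the single best-matching exemplar.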
Hamming Distance
Kohonen Self-Organizing feature map (SOM) refers to a neural network which is trained using competitive learning. Basic
competitive learning implies that the competition process takes place before the cycle of learning. The competition process
means that some criterion selects a winning processing element.
The self-organizing map makes topologically ordered mappings between input data and processing elements of the map.
Topologically ordered implies that if two inputs have similar characteristics, the most active processing elements responding
to those inputs are located close to each other on the map. The weight vectors of the processing elements are
organized in ascending or descending order: W_i < W_(i+1) for all values of i, or W_(i+1) < W_i for all values of i (this definition is valid for
a one-dimensional self-organizing map only).
The self-organizing map is typically represented as a two-dimensional sheet of processing elements described in the figure
given below. Each processing element has its own weight vector, and learning of SOM (self-organizing map) depends on the
adaptation of these vectors. The processing elements of the network are made competitive in a self-organizing process,
and a specific criterion picks the winning processing element whose weights are updated. Generally, the criterion is to
minimize the Euclidean distance between the input vector and the weight vector. SOM (self-organizing map) differs from basic
competitive learning in that, instead of adjusting only the weight vector of the winning processing element, the weight
vectors of neighboring processing elements are also adjusted.
It was introduced by the Finnish professor and researcher Dr. Teuvo Kohonen in 1982. The self-organizing map is an
unsupervised learning model proposed for applications in which maintaining a topology between input and output spaces is important.
The entire learning process occurs without supervision because the nodes are self-organizing. They are also known as
feature maps, as they essentially retain the features of the input data and simply group themselves as indicated
by the similarity between each other. It has practical value for visualizing complex or large quantities of high-dimensional
data and showing the relationships between them in a low, usually two-dimensional, field to check whether the given
unlabeled data have any structure.
A Self-Organizing Map uses competitive learning instead of error-correction learning to modify its weights. This means that
only a single node is activated at each cycle in which the features of an instance of the input vector are presented
to the neural network, as all nodes compete for the privilege to respond to the input.
The selected node, the Best Matching Unit (BMU), is chosen according to the similarity between the current input values
and all the nodes in the network. The node with the smallest Euclidean distance from the input vector is selected, and it and
its neighboring nodes within a specific radius have their positions slightly adjusted to
match the input vector. By passing over all the nodes present on the grid, the whole grid eventually matches the entire
input dataset, with similar nodes gathered towards one area and dissimilar ones kept apart.
Algorithm:
Step:1
Step:2
Step:3
Step:4
Step:5
Step:6
Calculate the overall Best Matching Unit (BMU), that is, the node with the smallest distance among all calculated
ones.
Step:7
Determine the topological neighborhood β_ij(t) and its radius σ(t) around the BMU in the Kohonen map.
Step:8
Repeat for all nodes in the BMU neighborhood: update the weight vector w_ij of each node in the neighborhood
of the BMU by adding a fraction of the difference between the input vector x(t) and the weight w(t) of the neuron.
w_ij(new) = w_ij(old) + alpha [x_i - w_ij(old)]
Step:9
Repeat the complete iteration until reaching the selected iteration limit t = n.
Here, step 1 represents the initialization phase, while steps 2 to 9 represent the training phase.
Where:
t = the current iteration.
w = the weight vector.
x = the input vector.
β_ij = the neighborhood function, which decreases with the distance of node (i, j) from the BMU.
σ(t) = the radius of the neighborhood function, which determines how far neighbor nodes are examined in the 2D
grid when updating vectors.
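The training phase (find the BMU, determine its neighborhood, update the weights, repeat) can be sketched as below. This is only a sketch under stated assumptions: the Gaussian form of the neighborhood function, the exponential decay of alpha and sigma, the grid size, and the function names train_som and best_matching_unit are all choices made for illustration and are not specified in these notes.

```python
import numpy as np

def train_som(data, grid_shape=(5, 5), n_iters=500, alpha0=0.5, sigma0=2.0, seed=0):
    """Train a 2D self-organizing map; returns weights of shape (rows, cols, n_features)."""
    rng = np.random.default_rng(seed)
    rows, cols = grid_shape
    data = np.asarray(data, dtype=float)
    weights = rng.random((rows, cols, data.shape[1]))    # initialization phase
    # Grid coordinates of every node, used to measure map distance from the BMU.
    coords = np.stack(
        np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1
    )
    for t in range(n_iters):
        x = data[rng.integers(len(data))]                # random input vector
        dists = np.linalg.norm(weights - x, axis=-1)     # distance of every node to x
        bmu = np.unravel_index(np.argmin(dists), dists.shape)
        decay = np.exp(-t / n_iters)                     # assumed decay schedule
        alpha, sigma = alpha0 * decay, sigma0 * decay
        grid_d2 = np.sum((coords - np.array(bmu)) ** 2, axis=-1)
        beta = np.exp(-grid_d2 / (2 * sigma ** 2))       # assumed Gaussian neighborhood
        # w(new) = w(old) + alpha * beta * (x - w(old)) for every node
        weights += alpha * beta[..., None] * (x - weights)
    return weights

def best_matching_unit(weights, x):
    """Return the (row, col) of the node whose weight vector is closest to x."""
    d = np.linalg.norm(weights - np.asarray(x, dtype=float), axis=-1)
    return np.unravel_index(np.argmin(d), d.shape)

# Hypothetical toy data: two small groups of 2D points.
data = np.array([[0.1, 0.1], [0.15, 0.1], [0.1, 0.15],
                 [0.9, 0.9], [0.85, 0.9], [0.9, 0.85]])
som = train_som(data)
bmu_a = best_matching_unit(som, [0.1, 0.1])
bmu_b = best_matching_unit(som, [0.9, 0.9])
print(bmu_a, bmu_b)
```

Here the neighborhood value beta scales the update of every node, so nodes far from the BMU on the grid barely move while the BMU itself moves the most.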
In Learning Vector Quantization (LVQ), suppose the input data has size ( m, n ), where m is the number of training examples and n is the number of
features in each example, together with a label vector of size ( m, 1 ). First, the weights of size ( n, c ) are initialized from the
first c training samples with different labels, and these samples are discarded from the training set. Here, c is
the number of classes. Then iterate over the remaining input data: for each training example, update the
winning vector (the weight vector with the shortest distance, e.g. Euclidean distance, from the training example).
The weight update rule is given by:
if correctly_classified:
    w_ij(new) = w_ij(old) + alpha(t) * (x_ik - w_ij(old))
else:
    w_ij(new) = w_ij(old) - alpha(t) * (x_ik - w_ij(old))
where alpha(t) is the learning rate at time t, j denotes the winning vector, i denotes the i-th feature of the training example,
and k denotes the k-th training example from the input data. After training the LVQ network, the trained weights are
used for classifying new examples. A new example is labelled with the class of the winning vector.
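The procedure above can be sketched in a few lines. This is a minimal illustration under assumptions: the function names train_lvq and predict_lvq, the linear decay of the learning rate, and the toy data are hypothetical, and the weights are stored with one row per class (shape (c, n)) rather than (n, c) for convenience.

```python
import numpy as np

def train_lvq(X, y, alpha0=0.1, epochs=20):
    """Train one LVQ weight vector per class; returns (weights, class labels)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    # Initialize each weight vector from the first sample of its class,
    # then drop those samples from the training set.
    first_idx = np.array([np.flatnonzero(y == c)[0] for c in classes])
    W = X[first_idx].copy()
    keep = np.ones(len(X), dtype=bool)
    keep[first_idx] = False
    X_rest, y_rest = X[keep], y[keep]
    for epoch in range(epochs):
        alpha = alpha0 * (1 - epoch / epochs)          # assumed linear decay
        for xk, label in zip(X_rest, y_rest):
            j = np.argmin(np.linalg.norm(W - xk, axis=1))   # winning vector
            if classes[j] == label:
                W[j] += alpha * (xk - W[j])            # move towards the example
            else:
                W[j] -= alpha * (xk - W[j])            # move away from the example
    return W, classes

def predict_lvq(W, classes, X):
    """Label each example with the class of its winning vector."""
    X = np.asarray(X, dtype=float)
    d = np.linalg.norm(X[:, None, :] - W[None, :, :], axis=2)
    return classes[np.argmin(d, axis=1)]

# Hypothetical toy data with two classes.
X = [[0, 0], [0.1, 0.2], [0.2, 0.1], [1, 1], [0.9, 0.8], [0.8, 0.9]]
y = [0, 0, 0, 1, 1, 1]
W, classes = train_lvq(X, y)
pred = predict_lvq(W, classes, [[0.05, 0.05], [0.95, 0.95]])
print(pred)
```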
Algorithm LVQ :
Counter propagation networks (CPN) were proposed by Hecht-Nielsen in 1987. They are
multilayer networks based on combinations of the input, output, and clustering layers. The
applications of counter propagation nets are data compression, function approximation and pattern
association. The counter propagation network is basically constructed from an instar-outstar
model. This model is a three-layer neural network that performs input-output data mapping,
producing an output vector y in response to an input vector x, on the basis of competitive learning.
The three layers in an instar-outstar model are the input layer, the hidden (competitive) layer and
the output layer.
There are two stages involved in the training process of a counter propagation net. The input
vectors are clustered in the first stage. In the second stage of training, the weights from the cluster
layer units to the output units are tuned to obtain the desired response.
There are two types of counter propagation network:
Full CPN
• The Full CPN can produce a correct output even when it is given an input vector that is
partially incomplete or incorrect.
• In the first phase, the training vector pairs are used to form clusters using either the dot product or
the Euclidean distance.
• If the dot product is used, normalization is a must.
• During the second phase, the weights are adjusted between the cluster units and the output units.
• The architecture of CPN resembles an instar and outstar model.
• The model which connects the input layers to the hidden layer is called the instar model and the
model which connects the hidden layer to the output layer is called the outstar model.
• The weights are updated in both the instar model (in the first phase) and the outstar model (in the second phase).
• The network is a fully interconnected network.
[Figure: Full CPN architecture, showing the input units X1 ... Xn and Y1 ... Yk connected to the units Z1 ... Zp of the hidden (cluster) layer]
First phase of Full CPN
• This phase of training is called instar-modeled training.
• The active units here are the units in the x-input, z-cluster and y-input layers.
• The winning unit uses the standard Kohonen learning rule for its weight update.
• The rule is:
$v_{ij}(\text{new}) = v_{ij}(\text{old}) + \alpha\,[x_i - v_{ij}(\text{old})] = (1-\alpha)\,v_{ij}(\text{old}) + \alpha\,x_i$, where $i = 1$ to $n$
$w_{kj}(\text{new}) = w_{kj}(\text{old}) + \beta\,[y_k - w_{kj}(\text{old})] = (1-\beta)\,w_{kj}(\text{old}) + \beta\,y_k$, where $k = 1$ to $n$
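One first-phase training step can be sketched as follows: pick the winning cluster unit by Euclidean distance and apply the Kohonen rule to its weights from both the x-input and y-input layers. The function name, the array layout (one column per cluster unit), and the numbers are assumptions made for this illustration.

```python
import numpy as np

def full_cpn_phase1_step(x, y, V, W, alpha=0.3, beta=0.3):
    """Update the winning cluster unit in place and return its index.

    V holds weights from x-inputs to cluster units (one column per cluster unit);
    W holds weights from y-inputs to cluster units (one column per cluster unit).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Winner: cluster unit with the smallest squared Euclidean distance to (x, y).
    d = np.sum((V - x[:, None]) ** 2, axis=0) + np.sum((W - y[:, None]) ** 2, axis=0)
    J = int(np.argmin(d))
    V[:, J] += alpha * (x - V[:, J])   # v(new) = v(old) + alpha * (x - v(old))
    W[:, J] += beta * (y - W[:, J])    # w(new) = w(old) + beta * (y - w(old))
    return J

# Hypothetical weights for two cluster units.
V = np.array([[0.8, 0.2],
              [0.2, 0.8]])
W = np.array([[0.6, 0.4]])
J = full_cpn_phase1_step([1, 0], [1], V, W)
print(J, V[:, J], W[:, J])
```

For the winning unit, the first x-weight moves from 0.8 to 0.8 + 0.3(1 - 0.8) = 0.86, which is the same as (1 - 0.3)(0.8) + 0.3(1), matching the two equivalent forms of the rule.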
The forward-only CPN consists of three layers: input layer, cluster layer and output layer. Its architecture resembles the back-propagation network, but
in CPN there exist interconnections between the units in the cluster layer.
[Figure: CPN architecture with input units X1 ... Xn, cluster units Z1 ... Zp and output units Y]
Adaptive Resonance Theory (ART)
Adaptive resonance theory is a type of neural network
technique developed by Stephen Grossberg and Gail Carpenter in 1987. The basic ART uses
an unsupervised learning technique. The terms “adaptive” and “resonance” suggest
that these networks are open to new learning (adaptive) without discarding the previous or old
information (resonance). ART networks are known to solve the stability-plasticity
dilemma: stability refers to their ability to retain what has been learned, and plasticity refers to
the fact that they are flexible enough to gain new information. Because of this, ART networks are
able to learn new input patterns without forgetting past ones. ART networks implement a
clustering algorithm. An input is presented to the network and the algorithm checks whether it fits
into one of the already stored clusters. If it fits, the input is added to the cluster that matches
it best; otherwise a new cluster is formed.
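The "fit an existing cluster or form a new one" behaviour can be sketched for binary inputs as below. This is a simplified illustration in the spirit of ART1, not the full algorithm: the vigilance threshold, ranking candidates by raw overlap (instead of bottom-up weights), and the function name art1_cluster are assumptions made for the example.

```python
import numpy as np

def art1_cluster(patterns, vigilance=0.6):
    """Cluster binary patterns; returns (cluster label per pattern, cluster prototypes)."""
    prototypes = []   # one binary prototype per stored cluster
    labels = []
    for p in patterns:
        x = np.asarray(p, dtype=bool)
        # Try stored clusters in order of decreasing overlap with the input.
        order = sorted(range(len(prototypes)),
                       key=lambda j: -np.sum(prototypes[j] & x))
        for j in order:
            match = np.sum(prototypes[j] & x) / max(np.sum(x), 1)
            if match >= vigilance:                 # the input fits this cluster
                prototypes[j] = prototypes[j] & x  # fast learning: keep shared features
                labels.append(j)
                break
        else:                                      # no stored cluster fits
            prototypes.append(x.copy())
            labels.append(len(prototypes) - 1)
    return labels, prototypes

# Hypothetical binary input patterns.
patterns = [[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1], [0, 0, 0, 1]]
labels, prototypes = art1_cluster(patterns)
print(labels)
```

Raising the vigilance value makes the match test stricter, so more and smaller clusters are formed.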
Types of Adaptive Resonance Theory (ART)
Carpenter and Grossberg developed different
ART architectures as a result of 20 years of research. The ARTs can be classified as follows:
ART1 – It is the simplest and most basic ART architecture. It is capable of clustering binary
input values.
ART2 – It is an extension of ART1 that is capable of clustering continuous-valued input data.
Fuzzy ART – It is the combination of fuzzy logic and ART.
ARTMAP – It is a supervised form of ART learning where one ART module learns based on the
previous ART module. It is also known as predictive ART.
FARTMAP – This is a supervised ART architecture with fuzzy logic included.
Basics of Adaptive Resonance Theory (ART) Architecture
Adaptive resonance theory is a
type of neural network that is self-organizing and competitive. It can be of both types,
unsupervised (ART1, ART2, ART3, etc.) or supervised (ARTMAP). Generally, the
supervised algorithms are named with the suffix “MAP”. The basic ART model is
unsupervised in nature and consists of:
F1 layer or the comparison field (where the inputs are processed)
F2 layer or the recognition field (which consists of the clustering units)
The Reset Module (that acts as a control mechanism)
The F1 layer accepts the inputs, performs some processing and transfers the result to the F2 layer unit
that best matches the classification factor. There exist two sets of weighted
interconnections for controlling the degree of similarity between the units in the F1 and the F2
layer. The F2 layer is a competitive layer. The cluster unit with the largest net input becomes the
candidate to learn the input pattern first and the rest of the F2 units are ignored. The reset unit
decides whether or not the cluster unit is allowed to learn the input pattern depending on
how similar its top-down weight vector is to the input vector.
Generally, two types of learning exist: slow learning and fast learning. In fast learning, the weight
update during resonance occurs rapidly. It is used in ART1. In slow learning, the weight change
occurs slowly relative to the duration of the learning trial. It is used in ART2.
Application of ART:
ART stands for Adaptive Resonance Theory. ART neural networks, used for fast, stable learning
and prediction, have been applied in different areas. The applications include target
recognition, face recognition, medical diagnosis, signature verification, and mobile robot control.
Target recognition:
Medical diagnosis:
Signature verification:
Automatic signature verification is a well-known and active area of research with various
applications such as bank check confirmation, ATM access, etc. The training of the network is
done using ART1, which uses global features as the input vector, and the verification and recognition
phase uses a two-step process. In the first step, the input vector is matched with the stored
reference vector, which was used as a training set, and in the second step, cluster formation takes
place.
Nowadays, we see a wide range of robotic devices. Their
software part, called artificial intelligence, is still a field of research. The human brain is an interesting subject as a model
for such an intelligent system. Inspired by the structure of the human brain, the artificial neural
network emerges. Like the brain, the artificial neural network contains numerous simple computational
units, neurons, that are mutually interconnected to allow the transfer of signals from
neuron to neuron. Artificial neural networks are used to solve different problems with good
outcomes compared to other decision algorithms.
Limitations of Adaptive Resonance Theory
Some ART networks are inconsistent (like
Fuzzy ART and ART1), as their results depend upon the order of the training data or upon the learning
rate.
Special Networks
Interconnections
Activation functions
Learning rules
Interconnections:
Interconnection can be defined as the way processing elements (neurons) in an ANN are connected
to each other. Hence, the arrangement of these processing elements and the geometry of
interconnections are essential in an ANN.
These arrangements always have two layers that are common to all network architectures, the
input layer and the output layer, where the input layer buffers the input signal and the output layer
generates the output of the network. The third layer is the hidden layer, whose neurons are
kept neither in the input layer nor in the output layer. These neurons are hidden from the people
In this type of network, we have only two layers, the input layer and the output layer, but the input
layer does not count because no computation is performed in this layer. The output layer is
formed when different weights are applied to the input nodes and the cumulative effect per node is
taken. After this, the neurons of the output layer collectively compute the output signals.
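The computation just described, weights applied to the input nodes and the cumulative effect taken per output node, can be sketched as below. The binary step activation, the function name, and the particular weights (which happen to realise a logical AND) are assumptions chosen for the illustration.

```python
import numpy as np

def single_layer_forward(x, W, b):
    """Compute the outputs of a single-layer feed-forward network."""
    net = W @ np.asarray(x, dtype=float) + b   # weighted sum per output node
    return (net >= 0).astype(int)              # assumed binary step activation

# Hypothetical weights: one output node that fires only when both inputs are 1.
W = np.array([[1.0, 1.0]])
b = np.array([-1.5])
outputs = [int(single_layer_forward(x, W, b)[0])
           for x in ([0, 0], [0, 1], [1, 0], [1, 1])]
print(outputs)
```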
2. Multilayer feed-forward network
When outputs can be directed back as inputs to the same layer or to preceding layer nodes, the
result is a feedback network. Recurrent networks are feedback networks with closed loops. The
above figure shows a single recurrent network having a single neuron with feedback to itself.
4. Single-layer recurrent network
In this type of network, a processing element's output can be directed to processing elements in
the same layer and in the preceding layer, forming a multilayer recurrent network. They perform
the same task for every element of a sequence, with the output being dependent on the previous
computations. Inputs are not needed at each time step. The main feature of a Recurrent Neural
Network is its hidden state, which captures some information about a sequence.
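How a hidden state carries information from earlier elements of a sequence can be sketched with a minimal recurrent update. The tanh activation, the weight names W_xh and W_hh, and the scalar example values are assumptions made for this illustration.

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, b_h):
    """Apply the same update to every sequence element; returns all hidden states."""
    h = np.zeros(W_hh.shape[0])          # initial hidden state
    states = []
    for x in xs:
        # The new state depends on the current input and on the previous state.
        h = np.tanh(W_xh @ np.asarray(x, dtype=float) + W_hh @ h + b_h)
        states.append(h)
    return np.array(states)

# Hypothetical scalar weights.
W_xh = np.array([[1.0]])
W_hh = np.array([[0.5]])
b_h = np.array([0.0])
states_a = rnn_forward([[1.0], [0.0]], W_xh, W_hh, b_h)
states_b = rnn_forward([[0.0], [0.0]], W_xh, W_hh, b_h)
print(states_a.ravel(), states_b.ravel())
```

Both sequences have the same second input (zero), yet the second hidden state differs, because in the first sequence the earlier input is remembered through the hidden state.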
Deep learning is a subfield of machine learning that deals with algorithms inspired by the structure
and function of the brain. Machine learning is in turn a part of artificial
intelligence (AI).
Artificial intelligence is the ability of a machine to imitate intelligent human behavior. Machine
learning allows a system to learn and improve from experience automatically. Deep learning is an
application of machine learning that uses complex algorithms and deep neural nets to train a model
Machine learning works only with sets of structured and semi-structured data, while deep learning
works with both structured and unstructured data.
Deep learning algorithms can perform complex operations efficiently, while classical machine learning
algorithms often cannot.
Machine learning algorithms use labeled sample data to extract patterns, while deep learning
accepts large volumes of data as input and analyzes the input data to extract features of
an object.
The performance of classical machine learning algorithms tends to level off as the amount of data increases, so to
keep improving the performance of the model, we need deep learning.
Virtual Assistants are cloud-based applications that understand natural language voice commands and
complete tasks for the user. Amazon Alexa, Cortana, Siri, and Google Assistant are typical examples
of virtual assistants. They need internet-connected devices to work with their full capabilities. Each
time a command is fed to the assistant, it tends to provide a better user experience based on past
experiences using Deep Learning algorithms.
2. Chatbots
Chatbots can solve customer problems in seconds. A chatbot is an AI application to chat online via text
or text-to-speech. It is capable of communicating and performing actions similar to a human.
Chatbots are used a lot in customer interaction, marketing on social network sites, and instant
messaging the client. It delivers automated responses to user inputs. It uses machine learning and
deep learning algorithms to generate different types of reactions.
3. Healthcare
Deep Learning has found its application in the Healthcare sector. Computer-aided disease detection and
computer-aided diagnosis have been possible using Deep Learning. It is widely used for medical
research, drug discovery, and diagnosis of life-threatening diseases such as cancer and diabetic
retinopathy through the process of medical imaging.
4. Entertainment
News Aggregation is another important deep learning application.
Deep Learning allows you to customize news depending on the readers’ persona. You can aggregate
and filter out news information as per social, geographical, and economic parameters and the
individual preferences of a reader. Neural Networks help develop classifiers that can detect fake and
biased news and remove it from your feed. They also warn you of possible privacy breaches.
6. Composing Music
A machine can learn the notes, structures, and patterns of music and start producing music
independently. Deep Learning-based generative models such as WaveNet can be used to generate raw
audio. Long Short-Term Memory (LSTM) networks help to generate music automatically. The Music21 Python
toolkit is used for computer-aided musicology. It allows us to train a system to develop music by
teaching music theory fundamentals, generating music samples, and studying music.
7. Image Coloring
8. Robotics
Deep Learning is heavily used for building robots to perform human-like tasks. Robots powered by
Deep Learning use real-time updates to sense obstacles in their path and plan their route
instantly. They can be used to carry goods in hospitals, factories, and warehouses, and for inventory management,
manufacturing products, etc.
9. Image Captioning
Image Captioning is the method of generating a textual description of an image. It uses computer
vision to understand the image's content and a language model to turn the understanding of the
image into words in the right order. A recurrent neural network such as an LSTM is used to turn the
labels into a coherent sentence. Microsoft has built its caption bot where you can upload an image or
the URL of any image, and it will display the textual description of the image. Another application
that suggests a perfect caption and best hashtags for a picture is Caption AI.
10. Advertising
In Advertising, Deep Learning allows optimizing a user's experience. Deep Learning helps publishers
and advertisers to increase the relevance of the ads and boost their advertising campaigns.
Deep Learning is the driving force behind autonomous, self-driving automobiles.
Deep Learning technologies are actually "learning machines" that learn how to act and respond using
millions of data sets and training. To diversify its business infrastructure, Uber Artificial
The main problem that the designers of self-driving vehicles are addressing is
exposing self-driving cars to a variety of scenarios to ensure safe driving. The cars have operational
sensors for detecting adjacent objects. Furthermore, they manoeuvre through traffic using data
from their cameras, sensors, geo-mapping, and sophisticated models. Tesla is one popular example.
Another important field where Deep Learning is showing promising results is NLP, or Natural
Language Processing. It is the process of enabling machines to study and comprehend human
language.
However, keep in mind that human language is extremely difficult for machines to understand.
Machines are hindered from correctly comprehending or generating human language not only
because of the alphabet and words, but also because of context, accents, handwriting, and other
factors.
Many of the challenges associated with comprehending human language are being addressed by
Deep Learning-based NLP, which trains computers (using autoencoders and distributed representations) to
provide suitable responses to linguistic inputs.
Suppose you are going through your old memories or photographs. You may choose to print some
of these. In the absence of metadata, the only way to do this was manual work. The
most you could do was order them by date, but downloaded photographs occasionally lack that
metadata. Deep Learning, on the other hand, has made the job easier. Images may be sorted using it
based on places recognised in pictures, faces, combinations of individuals, events, dates, and so on. To detect
these aspects when searching for a certain photo in a library, state-of-the-art visual recognition algorithms
with several levels, from basic to advanced, are required.
Another attractive application for deep learning is fraud protection and detection; major companies
in the payment system sector are already experimenting with it. PayPal, for example, uses predictive
analytics technology to detect and prevent fraudulent activity. The business claimed that examining
sequences of user behavior using neural networks' long short-term memory architecture increased
anomaly identification by up to 10%. Sustainable fraud detection techniques are essential for every
fintech firm, banking app, or insurance platform, as well as any organization that gathers and uses
sensitive data. Deep learning has the ability to make fraud more predictable and hence avoidable.
15. Personalisations
Every platform is now attempting to leverage chatbots to create tailored experiences with a human
touch for its users. Deep Learning is assisting e-commerce giants such as Amazon, eBay, and
Alibaba in providing smooth tailored experiences such as product suggestions, customised packages
and discounts, and in spotting large revenue opportunities during the holiday season. Even in newer markets,
customers are reached by offering goods, offers, or plans that are more likely to appeal to
human psychology and contribute to growth in micro markets. Online self-service solutions are on
the increase, and dependable workflows are bringing services to the internet that were previously
only physically available.
Early diagnosis of developmental impairments in children is critical since early intervention improves
children's prognoses. Meanwhile, a growing body of research suggests a link between developmental
impairment and motor competence, so motor skill is taken into account in the early diagnosis
of developmental disability. However, because of the lack of professionals and time restrictions,
testing motor skills in the diagnosis of a developmental problem is typically done through informal
questionnaires or surveys given to parents. Automated assessment is progressively becoming achievable with deep learning
technologies. Researchers at MIT's Computer Science and Artificial Intelligence Laboratory and the
Institute of Health Professions at Massachusetts General Hospital have created a computer system
that can detect language and speech impairments even before kindergarten.
Image colourization is the technique of taking grayscale photos as input and producing colourized images as
output that represent the semantic colours and tones of the input.
Given the intricacy of the work, this was traditionally done by hand.
However, with today's Deep Learning technology, the objects and their context
inside the photo are used to colour the image, in the same way that a human operator would. To
reproduce the picture with the addition of colour, high-quality convolutional neural networks are
trained in a supervised manner.
Deep learning has changed several disciplines in recent years. In response to these advancements,
the field of Machine Translation has switched to deep-learning neural-based methods,
which have supplanted older approaches such as rule-based systems or statistical phrase-based
methods. Neural MT (NMT) models can now access all information available anywhere in the source
sentence and automatically learn which piece is important at which step of generating the output
text, thanks to massive quantities of training data and unparalleled processing power. The
elimination of previous independence assumptions is the primary cause of the remarkable
improvement in translation quality. As a result, neural translation has narrowed the quality gap
between human and machine translation.
This Deep Learning application involves generating new handwriting for a given corpus of
words or phrases. The handwriting is effectively provided as a series of coordinates used by a pen
to make the samples. The link between pen movement and letter formation is learned, and
new examples are generated.
A corpus of text is learned here, and new text is generated word by word or character by character.
Using deep learning algorithms, it is possible to learn how to spell, punctuate, and even capture the
style of the text in the corpus. Large recurrent neural networks are typically employed to learn
text generation from the items in sequences of input strings. However, LSTM recurrent neural networks
have lately shown remarkable success on this task by employing a character-based model that
generates one character at a time.
Machine translation is receiving a lot of attention from technology businesses. This investment, along
with recent advances in deep learning, has resulted in significant increases in translation quality.
According to Google, transitioning to deep learning resulted in a 60% boost in translation accuracy
over the prior phrase-based strategy employed in Google Translate. Google and Microsoft can now
translate over 100 different languages with near-human accuracy in several of them.
It was impossible to zoom into videos beyond their actual resolution until Deep Learning came along.
Researchers at Google Brain created a Deep Learning network in 2017 to take very low-resolution photos
of faces and predict the person's face from them. Known as Pixel Recursive Super Resolution, this
approach dramatically improves photo resolution,
highlighting salient characteristics just enough to identify the person.
Gebru et al. used 50 million Google Street View pictures to see what a Deep Learning network might
accomplish with them. The computer learned to detect and
pinpoint automobiles and their specifications. It was able to identify approximately 22 million automobiles, as
well as their make, model, body style, and year. The explorations did not end there. The
algorithm was also shown to be capable of
estimating the demographics of each location based only on the mix of automobiles found there.
DeepDream is an experiment that visualises the patterns learned by a neural network. Like a
toddler watching clouds and attempting to decipher random shapes, DeepDream over-interprets and intensifies the
patterns it finds in a picture.
It accomplishes this by passing an image through the network and then calculating the gradient of the
activations of a chosen layer with respect to the image. The image is then altered to amplify these
activations, strengthening the patterns perceived by the network and producing a dream-like visual. This
method was named "Inceptionism" (a reference to InceptionNet and the movie Inception).
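The idea of altering the image to amplify the activations of a layer can be illustrated with a deliberately tiny stand-in for a network. In this sketch the "layer" is a single linear map followed by ReLU, the "image" is a three-element vector, and the objective is half the sum of squared activations; all of these, together with the function names and the step size, are assumptions made for illustration and are far simpler than the real DeepDream setup.

```python
import numpy as np

def layer_activations(W, image):
    """Toy 'layer': a linear map followed by ReLU."""
    return np.maximum(0.0, W @ image)

def dream_step(W, image, step_size=0.1):
    """Move the image a small step in the direction that amplifies the activations."""
    a = layer_activations(W, image)
    grad = W.T @ a                      # gradient of 0.5 * sum(a**2) w.r.t. the image
    norm = np.linalg.norm(grad)
    if norm == 0:
        return image                    # nothing is active, so nothing to amplify
    return image + step_size * grad / norm

# Hypothetical weights and starting "image".
W = np.array([[1.0, -1.0, 0.0],
              [0.5, 0.5, 1.0]])
image = np.array([0.2, 0.1, 0.3])
before = 0.5 * np.sum(layer_activations(W, image) ** 2)
for _ in range(20):
    image = dream_step(W, image)
after = 0.5 * np.sum(layer_activations(W, image) ** 2)
print(before, after)
```

In a real network the gradient would be obtained by automatic differentiation through many layers, but the loop is the same: compute the gradient with respect to the image, then nudge the image along it.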
Today, artificial intelligence (AI) is a thriving field with many practical applications
and active research topics. We look to intelligent software to automate routine labor,
understand speech or images, make diagnoses in medicine and support basic scientific
research.
In the early days of artificial intelligence, the field rapidly tackled and solved
problems that are intellectually difficult for human beings but relatively
straightforward for computers: problems that can be described by a list of formal,
mathematical rules. The true challenge to artificial intelligence proved to be solving
the tasks that are easy for people to perform but hard for people to describe formally:
problems that we solve intuitively, that feel automatic, like recognizing spoken
words or faces in images.
The solution is to allow computers to learn from experience and understand the
world in terms of a hierarchy of concepts, with each concept defined in terms of its
relation to simpler concepts. By gathering knowledge from experience, this approach
avoids the need for human operators to formally specify all the knowledge that the
computer needs. The hierarchy of concepts allows the computer to learn complicated
concepts by building them out of simpler ones. If we draw a graph showing how
these concepts are built on top of each other, the graph is deep, with many layers. For
this reason, we call this approach to AI deep learning.
Many of the early successes of AI took place in relatively sterile and formal
environments and did not require computers to have much knowledge about the
world. For example, IBM’s Deep Blue chess-playing system defeated world champion
Garry Kasparov in 1997 (Hsu, 2002). Chess is of course a very simple world, containing
only sixty-four locations and thirty-two pieces that can move in only rigidly
circumscribed ways. Devising a successful chess strategy is a tremendous
accomplishment, but the challenge is not due to the difficulty of describing the set of
chess pieces and allowable moves to the computer. Chess can be completely described
by a very brief list of completely formal rules, easily provided ahead of time by the
programmer.
Ironically, abstract and formal tasks that are among the most difficult mental
undertakings for a human being are among the easiest for a computer. Computers
have long been able to defeat even the best human chess player, but are only recently
matching some of the abilities of average human beings to recognize objects or speech. A
person’s everyday life requires an immense amount of knowledge about the world.
Much of this knowledge is subjective and intuitive, and therefore difficult to articulate
in a formal way. Computers need to capture this same knowledge to behave in an
intelligent way. One of the key challenges in artificial intelligence is how to get this
informal knowledge into a computer.
Several artificial intelligence projects have sought to hard-code knowledge about the
world in formal languages. A computer can reason about statements in these formal
languages automatically using logical inference rules. This is known as the knowledge
base approach to artificial intelligence.
The difficulties faced by systems relying on hard-coded knowledge suggest that AI
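systems need the ability to acquire their own knowledge, by extracting patterns from
raw data. This capability is known as machine learning.
The performance of these simple machine learning algorithms depends heavily on the
representation of the data they are given. For example, when logistic regression is used to
recommend cesarean delivery, the AI system does not examine the patient directly.
Instead, the doctor tells the system several pieces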
of relevant information, such as the presence or absence of a uterine scar. Each piece of
information included in the representation of the patient is known as a feature. Logistic
regression learns how each of these features of the patient correlates with various
outcomes. However, it cannot influence the way that the features are defined in any
way. If logistic regression was given an MRI scan of the patient, rather than the doctor’s
formalized report, it would not be able to make useful predictions. Individual pixels in an
MRI scan have negligible correlation with any complications that might occur during
delivery.
One solution to this problem is to use machine learning to discover not only the
mapping from representation to output but also the representation itself. This
approach is known as representation learning. Learned representations often result in
much better performance than can be obtained with hand-designed representations.
They also allow AI systems to rapidly adapt to new tasks, with minimal human
intervention. A representation learning algorithm can discover a good set of features
for a simple task in minutes, or a complex task in hours to months. Manually designing
features for a complex task requires a great deal of human time and effort; it can take
decades for an entire community of researchers.
Deep learning solves this central problem in representation learning by introducing
representations that are expressed in terms of other, simpler representations. Deep
learning allows the computer to build complex concepts out of simpler concepts. Fig.
1.2 shows how a deep learning system can represent the concept of an image of a
person by combining simpler concepts, such as corners and contours, which are in turn
defined in terms of edges.
The quintessential example of a deep learning model is the feedforward deep
network or multilayer perceptron (MLP). A multilayer perceptron is just a mathematical
function mapping some set of input values to output values. The function is
formed by composing many simpler functions. We can think of each application of a
different mathematical function as providing a new representation of the input.
The idea of learning the right representation for the data provides one perspective
on deep learning. Another perspective on deep learning is that depth allows the
computer to learn a multi-step computer program. Each layer of the representation can
be thought of as the state of the computer’s memory after executing another set of
instructions in parallel. Networks with greater depth can execute more instructions in
sequence. Sequential instructions offer great power because later instructions can
refer back to the results of earlier instructions. According to this
view of deep learning, not all of the information in a layer’s activations necessarily encodes
factors of variation that explain the input. The representation also stores state
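information that helps to execute a program that can make sense of the input.
There are two main ways to measure the depth of a model. One counts the number of
sequential instructions in its computational graph. The other, used by deep probabilistic
models, counts the depth of the graph describing how concepts are related to each other.
In the latter case, the graph of computations needed to compute the representation of
each concept may be much deeper than the graph of the concepts themselves.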
This is because the system’s understanding of the simpler concepts can be
refined given information about the more complex concepts. For example, an AI system
observing an image of a face with one eye in shadow may initially only see one eye. After
detecting that a face is present, it can then infer that a second eye is probably present
as well. In this case, the graph of concepts only includes two layers—a layer for eyes
and a layer for faces—but the graph of computations includes 2n layers if we refine our
estimate of each concept given the other n times.
Because it is not always clear which of these two views—the depth of the
computational graph, or the depth of the probabilistic modeling graph—is most
relevant, and because different people choose different sets of smallest elements
from which to construct their graphs, there is no single correct value for the
depth of an architecture, just as there is no single correct value for the length of a
computer program. Nor is there a consensus about how much depth a model requires
to qualify as “deep.” However, deep learning can safely be regarded as the study of
models that either involve a greater amount of composition of learned functions or
learned concepts than traditional machine learning does.
• Deep learning has had a long and rich history, but has gone by many names
reflecting different philosophical viewpoints, and has waxed and waned in
popularity.
• Deep learning has become more useful as the amount of available training data
has increased.
• Deep learning models have grown in size over time as computer hardware and
software infrastructure for deep learning has improved.
• Deep learning has solved increasingly complicated applications with increasing
accuracy over time.
The Many Names and Changing Fortunes of Neural Networks
1 Deep learning dates back to the 1940s. Deep learning only appears to be
new, because it was relatively unpopular for several years preceding its
current popularity, and because it has gone through many different names,
and has only recently become called “deep learning.” The field has been
rebranded many times, reflecting the influence of different researchers and
different perspectives.
2 However, some basic context is useful for understanding deep learning. Broadly
speaking, there have been three waves of development of deep learning: deep
learning known as cybernetics in the 1940s–1960s, deep learning known as
connectionism in the 1980s–1990s, and the current resurgence under the name
deep learning beginning in 2006. This is quantitatively illustrated in Fig. 1.7.
3 Some of the earliest learning algorithms we recognize today were intended
to be computational models of biological learning, i.e. models of how learning
happens or could happen in the brain. As a result, one of the names that
deep learning has gone by is artificial neural networks (ANNs). The
corresponding perspective on deep learning models is that they are
engineered systems inspired by the biological brain (whether the human
brain or the brain of another animal).
4 While the kinds of neural networks used for machine learning have
sometimes been used to understand brain function (Hinton and Shallice,
1991), they are generally not designed to be realistic models of biological
function. The neural perspective on deep learning is motivated by two main
ideas. One idea is that the brain provides a proof by example that intelligent
behavior is possible, and a conceptually straightforward path to building
intelligence is to reverse engineer the computational principles behind the
brain and duplicate its functionality. Another perspective is that it would be
deeply interesting to understand the brain and the principles that underlie
human intelligence, so machine learning models that shed light on these basic
scientific questions are useful apart from their ability to solve engineering
applications.
5 The modern term “deep learning” goes beyond the neuroscientific
perspective on the current breed of machine learning models. It appeals to a
more general principle of learning multiple levels of composition, which can
be applied in machine learning frameworks that are not necessarily neurally
inspired.
Figure 1.7: The figure shows two of the three historical waves of artificial neural nets
research, as measured by the frequency of the phrases “cybernetics” and “connectionism”
or “neural networks” according to Google Books (the third wave is too recent to appear).
The first wave started with cybernetics in the 1940s–1960s, with the development of
theories of biological learning (McCulloch and Pitts, 1943; Hebb, 1949) and
implementations of the first models such as the perceptron (Rosenblatt, 1958) allowing the
training of a single neuron. The second wave started with the connectionist approach of the
1980–1995 period, with back-propagation (Rumelhart et al., 1986a) to train a neural
network with one or two hidden layers. The current and third wave, deep learning,
started around 2006 (Hinton et al., 2006; Bengio et al., 2007; Ranzato et al., 2007a), and is
just now appearing in book form as of 2016. The other two waves similarly appeared in
book form much later than the corresponding scientific activity occurred.
The earliest predecessors of modern deep learning were simple linear models
motivated from a neuroscientific perspective. These models were designed to take a
set of n input values x1, . . . , xn and associate them with an output y. These models
would learn a set of weights w1, . . . , wn and compute their output f(x, w) = x1w1 + · · · +
xnwn . This first wave of neural networks research was known as cybernetics, as
illustrated in Fig. 1.7.
The McCulloch-Pitts Neuron (McCulloch and Pitts, 1943) was an early model of
brain function. This linear model could recognize two different categories of inputs by
testing whether f (x, w) is positive or negative. Of course, for the model to correspond
to the desired definition of the categories, the weights needed to be set correctly.
These weights could be set by the human operator. In the 1950s, the perceptron
(Rosenblatt, 1958, 1962) became the first model that could learn the weights defining
the categories given examples of inputs from each category. The adaptive linear
element (ADALINE), which dates from about the same time, simply returned the value
of f (x) itself to predict a real number (Widrow and Hoff, 1960), and could also learn to
predict these numbers from data.
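The linear models described above can be sketched in a few lines of Python. The sketch below computes f(x, w) = x1w1 + · · · + xnwn, classifies an input by the sign of f, and learns the weights with a perceptron-style update. The function names, the learning rate, the toy data and the use of a constant input as a bias are assumptions made for illustration, not details taken from the original papers.

```python
def f(x, w):
    """Linear model f(x, w) = x1*w1 + ... + xn*wn."""
    return sum(xi * wi for xi, wi in zip(x, w))


def classify(x, w):
    """Assign one of two categories by testing the sign of f(x, w)."""
    return 1 if f(x, w) > 0 else -1


def train_perceptron(examples, n_inputs, lr=1.0, max_epochs=100):
    """Perceptron-style learning: adjust w only on misclassified examples."""
    w = [0.0] * n_inputs
    for _ in range(max_epochs):
        mistakes = 0
        for x, target in examples:
            if classify(x, w) != target:
                # Move the weights toward x (target = +1) or away from x (target = -1).
                w = [wi + lr * target * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:  # every example is classified correctly
            break
    return w


# Toy, linearly separable data (assumed); the last input is a constant 1 acting as a bias.
examples = [
    ([2.0, 1.0, 1.0], 1),
    ([1.0, 3.0, 1.0], 1),
    ([-1.0, -2.0, 1.0], -1),
    ([-2.0, -1.0, 1.0], -1),
]
weights = train_perceptron(examples, n_inputs=3)
predictions = [classify(x, weights) for x, _ in examples]
```

An ADALINE-style unit would return f(x, w) itself instead of its sign, which is the case treated later under the delta rule.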
It is worth noting that the effort to understand how the brain works on an
algorithmic level is alive and well. This endeavor is primarily known as “computational
neuroscience” and is a separate field of study from deep learning. It is common for
researchers to move back and forth between both fields. The field of deep learning is
primarily concerned with how to build computer systems that are able to successfully
solve tasks requiring intelligence, while the field of computational neuroscience is
primarily concerned with building more accurate models of how the brain actually
works.
In the 1980s, the second wave of neural network research emerged in great part via a
movement called connectionism or parallel distributed processing (Rumelhart
One may wonder why deep learning has only recently become recognized as a crucial
technology though the first experiments with artificial neural networks were conducted in
the 1950s. Deep learning has been successfully used in commercial applications since
the 1990s, but was often regarded as being more of an art than a technology and
something that only an expert could use, until recently. It is true that some skill is
required to get good performance from a deep learning algorithm. Fortunately, the
amount of skill required reduces as the amount of training data increases. The learning
algorithms reaching human performance on complex tasks today are nearly identical
to the learning algorithms that struggled to solve toy problems in the 1980s, though
the models we train with these algorithms have undergone changes that simplify the
training of very deep architectures. The most important new development is that
today we can provide these algorithms with the resources they need to succeed. The
age of “Big Data” has made machine learning much easier because the key burden of
statistical estimation—generalizing well to new data after observing only a small
amount of data—has been considerably lightened. As of 2016, a rough rule of thumb is
that a supervised deep learning algorithm will generally achieve acceptable
performance with around 5,000 labeled examples per category, and will match or
exceed human performance when trained with a dataset containing at least 10 million
labeled examples. Working successfully with datasets smaller than this is an important
research area, focusing on how we can take advantage of large quantities of unlabeled
examples, with unsupervised or semi-supervised learning.
Figure 1.8: Dataset sizes have increased greatly over time. In the early 1900s, statisticians
studied datasets using hundreds or thousands of manually compiled measurements (Garson,
1900; Gosset, 1908; Anderson, 1935; Fisher, 1936). In the 1950s through 1980s, the pioneers
of biologically inspired machine learning often worked with small, synthetic datasets, such as
low-resolution bitmaps of letters, that were designed to incur low computational cost and
demonstrate that neural networks were able to learn specific kinds of functions (Widrow and
Hoff, 1960; Rumelhart et al., 1986b). In the 1980s and 1990s, machine learning became more
statistical in nature and began to leverage larger datasets containing tens of thousands of
examples such as the MNIST dataset (shown in Fig. 1.9) of scans of handwritten numbers
(LeCun et al., 1998b). In the first decade of the 2000s, more sophisticated datasets of this
same size, such as the CIFAR-10 dataset (Krizhevsky and Hinton, 2009) continued to be
produced.
Toward the end of that decade and throughout the first half of the 2010s, significantly larger
datasets, containing hundreds of thousands to tens of millions of examples, completely
changed what was possible with deep learning. These datasets included the public Street
View House Numbers dataset (Netzer et al., 2011), various versions of the ImageNet dataset
(Deng et al., 2009, 2010a; Russakovsky et al., 2014a), and the Sports-1M dataset (Karpathy et
al., 2014). At the top of the graph, we see that datasets of translated sentences, such as IBM’s
dataset constructed from the Canadian Hansard (Brown et al., 1990) and the WMT 2014
English to French dataset (Schwenk, 2014) are typically far ahead of other dataset sizes.
In its most basic form, a feed-forward neural network is a single-layer perceptron. A sequence
of inputs enters the layer, and each input is multiplied by its weight. The weighted input
values are then summed to form a total. If the sum is greater than a predetermined threshold,
which is normally set at zero, the output value is usually 1; if the sum is less than the
threshold, the output value is usually -1.
The single-layer perceptron is a popular feed-forward neural network model that is frequently
used for classification. Single-layer perceptrons can also learn from data.
The neural network can compare the outputs of its nodes with the desired values
using a rule known as the delta rule, allowing the network to alter its weights
through training to create more accurate output values. This training procedure
is a form of gradient descent. The technique of updating weights in multi-layered
perceptrons is virtually the same; however, the process is referred to as
back-propagation.
Feed-forward neural networks are artificial neural networks in which the connections between
nodes do not form loops. With one or more hidden layers, this type of network is also known as
a multi-layer perceptron. In all cases, information is only passed forward.
During data flow, input nodes receive data, which travels through the hidden layers and exits
through the output nodes. No links exist in the network that could send information back from
the output nodes.
This model multiplies inputs by weights as they enter the layer. Afterward, the weighted
input values are added together to get the sum. If the sum rises above a certain
threshold, usually set at zero, the output value is usually 1, while if it falls below the
threshold, it is usually -1.
As a feed-forward neural network model, the single-layer perceptron is often used for
classification. Single-layer perceptrons can also be trained: using the delta rule, the
network compares its outputs with the intended values and adjusts its weights accordingly.
Deep feedforward networks, also often called feedforward neural networks, or multi-
layer perceptrons (MLPs), are the quintessential deep learning models. The goal
of a feedforward network is to approximate some function f∗. For example, for
a classifier, y = f ∗(x) maps an input x to a category y. A feedforward network
defines a mapping y = f (x; θ) and learns the value of the parameters θ that result
in the best function approximation.
These models are called feedforward because information flows through the function
being evaluated from x, through the intermediate computations used to define f , and
finally to the output y. There are no feedback connections in which outputs of the model
are fed back into itself. When feedforward neural networks are extended to include
feedback connections, they are called recurrent neural networks.
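To make the mapping y = f(x; θ) concrete, here is a minimal Python sketch of a feedforward computation with one hidden layer. The layer sizes, the parameter values in `theta`, and the choice of ReLU and sigmoid activations are assumptions chosen only for illustration.

```python
import math


def dense(x, weights, biases, activation):
    """One layer: weighted sums of the inputs plus biases, then an activation."""
    return [
        activation(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(weights, biases)
    ]


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def feedforward(x, theta):
    """y = f(x; theta): information flows from x through a hidden layer to y."""
    h = dense(x, theta["W1"], theta["b1"], relu)      # intermediate computation
    y = dense(h, theta["W2"], theta["b2"], sigmoid)   # output layer
    return y


# Hypothetical parameters theta for 2 inputs, 2 hidden units and 1 output.
theta = {
    "W1": [[1.0, -1.0], [0.5, 0.5]],
    "b1": [0.0, 0.0],
    "W2": [[1.0, -2.0]],
    "b2": [0.0],
}
y = feedforward([2.0, 1.0], theta)
```

Nothing computed in `feedforward` is ever fed back into an earlier step, which is what distinguishes this structure from a recurrent network. Learning would consist of adjusting the values in `theta`.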
Input layer:
The neurons of this layer receive the input and pass it on to the other layers of
the network. The number of neurons in the input layer must match the number of
features or attributes in the dataset.
Output layer:
This layer produces the predicted value. Its form depends on the type of model
being built.
Hidden layer:
Hidden layers contain neurons that transform the input before
passing it to the next layer. During training, the weights of the network are
updated continually to improve its predictions.
Neuron weights:
Neurons:
Feed-forward networks use artificial neurons, which are adapted
from biological neurons. A neural network consists of artificial neurons.
Neurons function in two steps: first, they compute a weighted sum of their inputs, and
second, they apply an activation function to that sum to normalize it.
Activation Function:
Here’s how the neural network computes the data, step by step:
1. Multiplication of weights and inputs: Each input is multiplied by its assigned weight value.
2. Adding the biases: In the next step, each product found in the previous step is added to its
respective bias. The modified inputs are then summed up to a single value.
(x1 * w1) + b1 = 0 + 1
(x2 * w2) + b2 = 12 + 0
(x3 * w3) + b3 = 11 + 0
3. Activation: An activation function maps the summed weighted input to the output of the
neuron. It is called an activation or transfer function because it governs the threshold at which the
neuron is activated and the strength of the output signal.
4. Output signal: Finally, the weighted sum is turned into an output signal by feeding it
into the activation function. Since the weighted sum in our example (1 + 12 + 11 = 24) is
greater than the threshold of 20, the perceptron predicts a rainy day.
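The four steps can be checked with a short Python sketch. It uses the products and biases from the worked example and the threshold of 20 from step 4. The step-function form of the activation and the label "not rainy" for the other outcome are assumptions, since the notes give only the rainy case.

```python
# Products x_i * w_i and biases b_i taken from the worked example above.
products = [0, 12, 11]
biases = [1, 0, 0]
THRESHOLD = 20  # decision threshold used in the example


def step(z, threshold):
    """Threshold activation: 1 if the weighted sum exceeds the threshold, else 0."""
    return 1 if z > threshold else 0


# (0 + 1) + (12 + 0) + (11 + 0)
weighted_sum = sum(p + b for p, b in zip(products, biases))
output = step(weighted_sum, THRESHOLD)
prediction = "rainy" if output == 1 else "not rainy"
print(weighted_sum, prediction)
```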
Calculating the Loss
Mathematically:
Gradient Descent
To understand the delta training rule, consider the task of training an unthresholded
perceptron, that is, a linear unit for which the output o is given by
\[ o(\vec{x}) = \vec{w} \cdot \vec{x} \]
To derive a weight learning rule for linear units, specify a measure for the training error
of a hypothesis (weight vector), relative to the training examples:
\[ E(\vec{w}) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2 \]
Where,
D is the set of training examples,
t_d is the target output for training example d,
o_d is the output of the linear unit for training example d.
E(w) is simply half the squared difference between the target output t_d and
the linear unit output o_d, summed over all training examples.
Gradient descent search determines a weight vector that minimizes E by starting with an
arbitrary initial weight vector, then repeatedly modifying it in small steps.
At each step, the weight vector is altered in the direction that produces the steepest descent
along the error surface depicted in the figure above. This process continues until the global
minimum error is reached. For a linear unit this is possible because the error surface has a
single global minimum.
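The search described above can be sketched in plain Python. The code assumes a linear unit with output o = w · x and the error E(w) defined as half the summed squared difference between targets and outputs. The learning rate `eta`, the number of steps and the toy training set are assumptions chosen so that the example converges.

```python
def output(w, x):
    """Linear unit: o = w . x"""
    return sum(wi * xi for wi, xi in zip(w, x))


def error(w, D):
    """E(w) = 1/2 * sum over d in D of (t_d - o_d)^2"""
    return 0.5 * sum((t - output(w, x)) ** 2 for x, t in D)


def gradient_descent(D, n_weights, eta=0.05, steps=500):
    w = [0.0] * n_weights  # arbitrary initial weight vector
    for _ in range(steps):
        # dE/dw_i = -sum_d (t_d - o_d) * x_id
        grad = [0.0] * n_weights
        for x, t in D:
            residual = t - output(w, x)
            for i, xi in enumerate(x):
                grad[i] -= residual * xi
        # Take a small step in the direction of steepest descent.
        w = [wi - eta * gi for wi, gi in zip(w, grad)]
    return w


# Toy training set (assumed), generated by t = 2*x1 - 1*x2.
D = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0), ([2.0, 1.0], 3.0)]
w = gradient_descent(D, n_weights=2)
```

With these settings the weights approach (2, -1), the values used to generate the targets, and the error approaches zero. A learning rate that is too large would make the steps overshoot and diverge.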
Backpropagation
Let us take a look at how back propagation works. The example network has four layers: an
input layer, two hidden layers, and a final output layer. These fall into three types:
1. Input layer
2. Hidden layer
3. Output layer
Each type of layer plays its own role in producing the desired results. Let us discuss the
other details needed to summarize this algorithm.
p=x+y
The above computational graph has an addition node (node with "+" sign)
with two input variables x and y and one output p.
Let us take another example, slightly more complex. We have the following
equation.
g=(x+y)∗z
1. Forward Pass
Forward pass is the procedure for evaluating the value of the mathematical
expression represented by computational graphs. Doing forward pass means
we are passing the value from variables in forward direction from the left
(input) to the right where the output is.
x=1,y=3,z=−3
By giving these values to the inputs, we can perform forward pass and get
the following values for the outputs on each node.
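The forward pass for g = (x + y) ∗ z can be traced in Python, together with the backward pass that back-propagation adds on top of it. The intermediate name `p` for the addition node and the function names are choices made for this sketch.

```python
def forward(x, y, z):
    """Forward pass: evaluate each node from the inputs to the output."""
    p = x + y  # addition node
    g = p * z  # multiplication node
    return p, g


def backward(x, y, z):
    """Backward pass: apply the chain rule from the output back to the inputs."""
    p, _ = forward(x, y, z)
    dg_dp = z          # because g = p * z
    dg_dz = p
    dg_dx = dg_dp * 1  # because p = x + y, so dp/dx = 1
    dg_dy = dg_dp * 1  # and dp/dy = 1
    return dg_dx, dg_dy, dg_dz


p, g = forward(1, 3, -3)
grads = backward(1, 3, -3)
print(p, g, grads)
```

For x = 1, y = 3, z = −3 the addition node evaluates to 4 and the output to −12. The gradients with respect to x, y and z are −3, −3 and 4.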
Regularization is a set of techniques that can prevent overfitting in neural networks and thus
improve the accuracy of a Deep Learning model when facing completely new data from the
problem domain.
Many regularization methods add a penalty term to the loss function. This penalty
discourages the model from becoming too complex or having large parameter
values, which helps in controlling the model’s ability to fit noise in the training data.
Regularization methods include L1 and L2 regularization, dropout, early stopping, and more.
By applying regularization, models become more robust and better at making accurate
predictions on new data.
Example: An epoch is one complete pass of the entire training dataset through the learning
algorithm. The number of epochs is therefore the number of times the algorithm sees the
whole training dataset during training. Each epoch consists of one or more iterations,
one per batch of training data.
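The relationship between epochs, batches and iterations can be made concrete with a small calculation. The dataset size, batch size and number of epochs below are assumed values, not figures from the notes.

```python
import math

n_examples = 10_000  # assumed size of the training set
batch_size = 256     # assumed mini-batch size
n_epochs = 20        # assumed number of passes over the data

# One iteration processes one batch; one epoch is one pass over every example.
iterations_per_epoch = math.ceil(n_examples / batch_size)
total_iterations = iterations_per_epoch * n_epochs
print(iterations_per_epoch, total_iterations)
```

The last batch of each epoch is smaller than the others whenever the dataset size is not a multiple of the batch size, which is why the division is rounded up.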
A statistical model is said to be overfitted when it fits the training data closely but does
not make accurate predictions on testing data. When a model is trained too long or has too
much capacity, it starts learning from the noise and inaccurate data entries in our data set.
When tested with test data, such a model performs poorly because it has captured
too many details and noise. Non-parametric and non-linear methods are especially prone to
overfitting because these types of machine learning algorithms have more freedom in
building the model based on the dataset and can therefore build unrealistic
models. One way to avoid overfitting is to use a linear algorithm if we have linear data, or
to limit parameters such as the maximal depth if we are using decision trees.
Parameter Norm Penalties are regularization methods that apply a penalty to the norm of
parameters in the objective function of a neural network.
Lasso Regression
where,
m – Number of Features
n – Number of Examples
y_i – Actual Target Value
y_i(hat) – Predicted Target Value
Ridge Regression
L2 & L1 regularization
L1 and L2 are the most common types of regularization. These update the general cost
function by adding another term known as the regularization term.
In L2, we have:
In L1, we have:
In this, we penalize the absolute value of the weights. Unlike L2, the weights may be
reduced to zero here. Hence, it is very useful when we are trying to compress our model.
Otherwise, we usually prefer L2 over it.
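As a rough illustration of how the two regularization terms change the cost, the sketch below adds an L1 (lasso-style) and an L2 (ridge-style) penalty to a mean squared error. The coefficient `lam`, the toy numbers and the exact scaling of the penalty are assumptions. References differ on constant factors such as dividing by the number of examples.

```python
def mse(y_true, y_pred):
    """Mean squared error over the n examples."""
    n = len(y_true)
    return sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / n


def l1_penalty(weights, lam):
    """Lasso-style term: lam * sum of |w_j|."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """Ridge-style term: lam * sum of w_j^2."""
    return lam * sum(w ** 2 for w in weights)


# Assumed toy values for illustration.
y_true = [1.0, 2.0, 3.0]
y_pred = [1.5, 2.0, 2.5]
weights = [0.5, -2.0, 0.0]
lam = 0.1

base = mse(y_true, y_pred)
cost_l1 = base + l1_penalty(weights, lam)
cost_l2 = base + l2_penalty(weights, lam)
```

The L2 term grows with the square of each weight, so the single large weight dominates it. The L1 term grows only linearly, and its gradient does not shrink as a weight approaches zero, which is why L1 can drive weights exactly to zero.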
In Keras, we can directly apply regularization to any layer using the regularizers module.
Below I have applied a regularizer to a dense layer having 500 neurons and the ReLU
activation function.
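from keras import regularizers
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

# creating sequential model
model = Sequential()
model.add(Conv2D(filters=16, kernel_size=2, padding="same", activation="relu", input_shape=(50, 50, 3)))
model.add(MaxPooling2D(pool_size=2))
model.add(Conv2D(filters=32, kernel_size=2, padding="same", activation="relu"))
model.add(Flatten())
# dense layer with 500 neurons and a regularizer (assumption: L2 penalty with coefficient 0.01)
model.add(Dense(500, kernel_regularizer=regularizers.l2(0.01), activation="relu"))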
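We can construct a generalized Lagrangian function containing the objective function along
with penalties whose strength can be increased or decreased. Suppose we wanted Ω(θ) < k; then
we could construct the following Lagrangian:
\[ \mathcal{L}(\theta, \alpha; X, y) = J(\theta; X, y) + \alpha (\Omega(\theta) - k) \]
where J(θ; X, y) is the unregularized objective function and α ≥ 0.
If Ω(θ) > k, the norm must be reduced and hence α should be large.
Likewise, if Ω(θ) < k, then the norm shouldn’t be reduced too much and hence, α should be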
small. This is now similar to the parameter norm penalty regularized objective function as
both of them encourage lower values of the norm. Thus, parameter norm penalties
naturally impose a constraint, like the L²-regularization, defining a constrained L²-ball.
Larger α implies a smaller constrained region as it pushes the values really low, hence,
allowing a small radius and vice versa. The idea of constraints over penalties is important for
several reasons. Large penalties might cause non-convex optimization algorithms to get
stuck in local minima corresponding to small values of θ, leading to the formation of
so-called dead units, as the weights entering and leaving them are too small to have an impact.
Explicit constraints do not force the weights toward zero; they only confine them to a
constrained region.
Underdetermined problems are those problems that have infinitely many solutions. A logistic
regression problem having linearly separable classes with w as a solution will always
have 2w as a solution, and so on. In some machine learning problems, regularization is
necessary. For example, many algorithms require the inversion of XᵀX, which might be singular.
In such a case, we can use a regularized form instead: XᵀX + αI is guaranteed to be
invertible for α > 0.
Many linear models in machine learning, including linear regression, depend on inverting the
matrix XᵀX. This is not possible when XᵀX is singular, which can happen
whenever the data generating distribution truly has no variance in some direction, or when
no variance is observed in some direction because there are fewer examples (rows of X)
than input features (columns of X). In this case, many forms of regularization correspond to
inverting XᵀX + αI instead.
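A tiny numerical case shows why the regularized matrix helps. The sketch below uses one example with two features, so there are fewer rows than columns, and compares the determinant of XᵀX with that of XᵀX + αI. The data values and α = 0.1 are assumptions, and the helper functions are written only for this 2×2 illustration.

```python
def gram(X):
    """Return X^T X for a matrix X given as a list of rows."""
    n_cols = len(X[0])
    return [
        [sum(row[i] * row[j] for row in X) for j in range(n_cols)]
        for i in range(n_cols)
    ]


def det2(M):
    """Determinant of a 2x2 matrix."""
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]


X = [[1.0, 2.0]]  # one example (row), two features (columns)
alpha = 0.1       # assumed regularization strength

G = gram(X)  # [[1, 2], [2, 4]]
G_reg = [
    [G[0][0] + alpha, G[0][1]],
    [G[1][0], G[1][1] + alpha],
]  # X^T X + alpha * I

print(det2(G))      # 0.0: X^T X is singular and cannot be inverted
print(det2(G_reg))  # positive: the regularized matrix is invertible
```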
Data Augmentation
The simplest way to reduce overfitting is to increase the size of the training data. In
practice, we are often unable to do so because labeled data is too
costly.
But now let’s consider that we are dealing with images. In this case, there are a few ways of
increasing the size of the training data: rotating the image, flipping, scaling, shifting, etc. In
the image below, some transformations have been applied to the handwritten digits dataset.
This technique is known as data augmentation. It usually gives a big improvement in
the accuracy of the model. It can be considered a mandatory trick for improving our
predictions.
from keras.preprocessing.image import ImageDataGenerator  # Keras 2.x import path

# x_train is assumed to be an array of training images defined earlier
datagen = ImageDataGenerator(
    featurewise_center=False,  # do not set the input mean to 0 over the dataset
    samplewise_center=False,  # do not set each sample mean to 0
    featurewise_std_normalization=False,  # do not divide inputs by the std of the dataset
    samplewise_std_normalization=False,  # do not divide each input by its std
    zca_whitening=False,  # do not apply ZCA whitening
    rotation_range=10,  # randomly rotate images by up to 10 degrees
    zoom_range=0.1,  # randomly zoom images by up to 10%
    width_shift_range=0.1,  # randomly shift images horizontally (fraction of total width)
    height_shift_range=0.1,  # randomly shift images vertically (fraction of total height)
    horizontal_flip=False,  # do not flip images horizontally
    vertical_flip=False)  # do not flip images vertically
datagen.fit(x_train)
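To show what such transformations do to the pixel data, here is a framework-free sketch of two of them, flipping and shifting, applied to a small image stored as a list of rows. The function names and the toy image are assumptions. This is not how `ImageDataGenerator` is implemented, only an illustration of the idea.

```python
def horizontal_flip(image):
    """Mirror each row of a 2D image given as a list of rows."""
    return [row[::-1] for row in image]


def shift_right(image, pixels, fill=0):
    """Shift the image to the right, padding the left edge with `fill`."""
    shifted = []
    for row in image:
        k = max(0, min(pixels, len(row)))  # clamp the shift to the image width
        shifted.append([fill] * k + row[:len(row) - k])
    return shifted


def augment(image):
    """Return the original image plus two transformed copies."""
    return [image, horizontal_flip(image), shift_right(image, 1)]


image = [
    [0, 1, 2],
    [3, 4, 5],
]
augmented = augment(image)
```

Each transformed copy keeps the label of the original image, so one labeled example becomes three. Whether a transformation preserves the label depends on the task: mirroring a handwritten digit, for instance, can make it unreadable, which may be why the flips are switched off in the generator above.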
Dropout
This is one of the most interesting types of regularization techniques. It also produces
very good results and is consequently the most frequently used regularization technique in
the field of deep learning.
So what does dropout do? At every iteration, it randomly selects some nodes and removes
them along with all of their incoming and outgoing connections as shown below.
So each iteration has a different set of nodes and this results in a different set of outputs. It
can also be thought of as an ensemble technique in machine learning.
Ensemble models usually perform better than a single model as they capture more
randomness. Similarly, a network trained with dropout usually generalizes better than the
same network trained without it.
The probability with which each node is dropped is the hyperparameter of
the dropout function. As seen in the image above, dropout can be applied to both the
hidden layers as well as the input layers.
In Keras, we can implement dropout using the Dropout layer. Below is the dropout
implementation. I have introduced a dropout probability of 0.2 in my
neural network architecture, after the last convolutional layer having 64 filters and after
the first dense layer having 500 neurons.
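from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense

# creating sequential model
model = Sequential()
model.add(Conv2D(filters=16, kernel_size=2, padding="same", activation="relu",
                 input_shape=(50, 50, 3)))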
model.add(MaxPooling2D(pool_size=2))
model.add(Conv2D(filters=32,kernel_size=2,padding="same",activation="relu"))
model.add(MaxPooling2D(pool_size=2))
model.add(Conv2D(filters=64,kernel_size=2,padding="same",activation="relu"))
model.add(MaxPooling2D(pool_size=2))
# 1st dropout
model.add(Dropout(0.2))
model.add(Flatten())
model.add(Dense(500,activation="relu"))
# 2nd dropout
model.add(Dropout(0.2))
model.add(Dense(2,activation="softmax"))#2 represent output layer neurons
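What a dropout layer does to a vector of activations can be sketched without any framework. The version below is the common "inverted dropout" variant, in which surviving units are scaled up during training so that nothing needs to change at test time. The function name, the seed and the toy activations are assumptions made for illustration.

```python
import random


def dropout(activations, rate, training=True, rng=random):
    """Inverted dropout: zero each unit with probability `rate` during training."""
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    # Surviving units are scaled by 1/keep so the expected activation is unchanged.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
hidden = [1.0, 2.0, 3.0, 4.0]
dropped = dropout(hidden, rate=0.2, rng=rng)
unchanged = dropout(hidden, rate=0.2, training=False)
```

Because a fresh random mask is drawn on every call, each training iteration effectively trains a different thinned network, which is the ensemble interpretation described above.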
Early stopping is a kind of cross-validation strategy where we keep one part of the training
set as the validation set. When we see that the performance on the validation set is getting
worse, we immediately stop the training on the model. This is known as early stopping.
In the above image, we will stop training at the dotted line since after that our model will
start overfitting the training data.
In Keras, we can apply early stopping using the callbacks function. Below is the
implementation code for it. I have applied early stopping so that training stops if the
monitored validation metric does not improve for 3 epochs.
from keras.callbacks import EarlyStopping
earlystop= EarlyStopping(monitor='val_acc', patience=3)
epochs = 20  # maximum number of training epochs
batch_size = 256
Here, monitor denotes the quantity that needs to be monitored, and ‘val_acc’ denotes the
validation accuracy. The name must match a metric recorded during training.
Patience denotes the number of epochs with no further improvement after which the
training will be stopped. For better understanding, let’s take a look at the above image
again. After the dotted line, each epoch will result in a higher value of validation error.
Therefore, 3 epochs after the dotted line (since our patience is equal to 3), our model will
stop because no further improvement is seen.
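The role of patience can be traced with a small framework-free sketch. It assumes that the monitored quantity is a validation error to be minimized and that training stops once `patience` consecutive epochs bring no improvement on the best value so far. The function name and the list of errors are hypothetical, and the Keras callback has further options that are not modeled here.

```python
def early_stopping_epoch(val_errors, patience):
    """Return the index of the epoch at which training stops.

    Training stops once `patience` consecutive epochs pass without the
    validation error improving on the best value seen so far.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, err in enumerate(val_errors):
        if err < best:
            best = err
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_errors) - 1  # never triggered: trained for all epochs


# Assumed validation errors: the minimum (the "dotted line") is at epoch 3.
val_errors = [0.90, 0.70, 0.60, 0.55, 0.58, 0.61, 0.65, 0.70]
stop = early_stopping_epoch(val_errors, patience=3)
```

With these numbers training stops at epoch 6, three epochs after the minimum at epoch 3. The weights at that point are worse than those at the minimum, so in practice the best weights are usually saved and restored.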
Noise applied to inputs is a form of data augmentation. For some models, the addition of
noise with extremely small variance at the input is equivalent to imposing a penalty on the
norm of the weights.
When noise is applied to hidden units, noise injection can be much more powerful than simply
shrinking the parameters. Noise applied to hidden units is so important that dropout is the
main development of this approach.
Training a neural network with a small dataset can cause the network to memorize all
training examples, in turn leading to overfitting and poor performance on a holdout dataset.
One approach to making the input space smoother and easier to learn is to add noise to
inputs during training.
Small datasets can make learning challenging for neural nets and the examples can be
memorized.
Adding noise during training can make the training process more robust and reduce
generalization error.
Noise is traditionally added to the inputs, but can also be added to weights, gradients, and
even activation functions.
Random noise can also be added to other parts of the network during training. Some examples
include:
The addition of noise to weights allows the approach to be used throughout the network in
a consistent way instead of adding noise to inputs and layer activations. This is particularly
useful in recurrent neural networks.
The addition of noise to gradients focuses more on improving the robustness of the
optimization process itself rather than the structure of the input domain. The amount of
noise can start high at the beginning of training and decrease over time, much like a
decaying learning rate. This approach has proven to be an effective method for very deep
networks.
Adding noise to the activations, weights, or gradients all provide a more generic approach to
adding noise that is invariant to the types of input variables provided to the model.
If the problem domain is believed or expected to have mislabeled examples, then the
addition of noise to the class label can improve the model’s robustness to this type of error.
However, too much label noise can easily derail the learning process.
Adding noise to a continuous target variable in the case of regression or time series
forecasting is much like the addition of noise to the input variables and may be a better use
case.
Semi-Supervised Learning
Semi-supervised learning is a type of machine learning that falls in between supervised
and unsupervised learning. It is a method that uses a small amount of labeled data and a
large amount of unlabeled data to train a model. The goal of semi-supervised learning is to
learn a function that can accurately predict the output variable based on the input
variables, similar to supervised learning. However, unlike supervised learning, the
algorithm is trained on a dataset that contains both labeled and unlabeled data.
Semi-supervised learning is particularly useful when there is a large amount of unlabeled
data available, but it’s too expensive or difficult to label all of it.
Multi-Task Learning
Hard Parameter Sharing – A common hidden layer is used for all tasks but several task
specific layers are kept intact towards the end of the model. This technique is very useful
as by learning a representation for various tasks by a common hidden layer, we reduce the
risk of overfitting.
Soft Parameter Sharing – Each model has its own set of weights and biases, and
the distance between these parameters in different models is regularized so that
the parameters become similar and can represent all the tasks.
Parameter Tying
Two models are doing the same classification task (with the same set of classes), but their
input distributions are somewhat different.
Model A has the parameters $w^{(A)}$ and model B has the parameters $w^{(B)}$. The two
models map the input to two different but related outputs.
Faculty Name : Dr.D.Shanthi Subject Name :DL
Assume the tasks are similar enough (possibly with similar input and output
distributions) that the model parameters should be close to each other.
We can exploit this information through regularisation by applying a parameter norm
penalty of the form $\Omega(w^{(A)}, w^{(B)}) = \lVert w^{(A)} - w^{(B)} \rVert_2^2$.
An L2 penalty is used here, but there are other options.
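The L2 tying penalty between the two parameter sets can be sketched as the squared distance
between them. The function names and the two small weight vectors below are hypothetical; in
training, the penalty would be added to the task losses with a weighting coefficient.

```python
import numpy as np


def tying_penalty(w_a, w_b):
    """Squared L2 distance between the parameters of model A and model B."""
    diff = w_a - w_b
    return float(np.sum(diff ** 2))


def tying_penalty_grad(w_a, w_b):
    """Gradient of the penalty with respect to w_a (the gradient for w_b is its negative)."""
    return 2.0 * (w_a - w_b)


w_a = np.array([1.0, 2.0, 3.0])   # hypothetical parameters of model A
w_b = np.array([1.0, 1.0, 5.0])   # hypothetical parameters of model B
penalty = tying_penalty(w_a, w_b)          # 0 + 1 + 4
grad_a = tying_penalty_grad(w_a, w_b)      # pulls w_a towards w_b
print(penalty, grad_a)
```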
Parameter Sharing
This method was used to regularise the parameters of one model, trained as a classifier in a
supervised paradigm, to be close to the parameters of another model, trained in an
unsupervised paradigm (to capture the distribution of the observed input data).
The architectures were constructed so that many of the parameters in the classifier model
could be paired with corresponding parameters in the unsupervised model.
Example: Convolutional neural networks (CNNs) used in computer vision are by far the
most widespread and extensive use of parameter sharing. Many statistical features of
natural images are invariant to translation. A photo of a cat, for example, can be translated
one pixel to the right and still be a photo of a cat. By sharing parameters across several
image locations, CNNs take this property into account: the same feature (a hidden unit with
the same weights) is computed at different locations in the input.
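A small one-dimensional sketch of this idea: the same two weights are applied at every
position of the input, so shifting the input simply shifts the feature map. The helper
conv1d_valid, the kernel and the signal are hypothetical and chosen only for illustration.

```python
import numpy as np


def conv1d_valid(x, kernel):
    """Slide one shared kernel over x (no padding) and return the feature map."""
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel) for i in range(len(x) - k + 1)])


kernel = np.array([1.0, -1.0])                  # hypothetical shared weights
x = np.array([0.0, 0.0, 1.0, 1.0, 0.0, 0.0])   # toy signal with a "bump"
x_shifted = np.roll(x, 1)                       # the same bump moved one step right

y = conv1d_valid(x, kernel)
y_shifted = conv1d_valid(x_shifted, kernel)

# The response to the shifted input is the shifted response to the original input.
print(y, y_shifted)

# Parameter count: 2 shared weights, versus 6 * 5 = 30 for a dense layer
# mapping the same 6 inputs to the same 5 outputs.
shared_params = kernel.size
dense_params = x.size * y.size
```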
Sparse representation (SR) is used to represent data with as few atoms as possible in a given
overcomplete dictionary. By using the SR, we can concisely represent the data and easily
extract the valuable information from the data.
Sparse representations classification (SRC) is a powerful technique for pixelwise
classification of images and it is increasingly being used for a wide variety of image analysis
tasks. The method uses sparse representation and learned redundant dictionaries to classify
image pixels.
The terms "sparse" and "dense" are commonly used in machine learning to describe the
distribution of zero and non-zero members of an array (e.g. a vector or matrix). Sparse
matrices are those that primarily consist of zeros, while dense matrices have a large number
of nonzero entries.
Machine learning makes use of both sparse and dense representations because each can
represent data efficiently. While dense representations are useful for capturing intricate
interactions between data points, sparse representations can help reduce the size of
a dataset.
Sparse representations are helpful for reducing the dimensionality of the data in
tasks like natural language processing and image recognition. Further, sparse
representations can be used to capture only the most crucial elements of the
data, which can greatly cut down on the time needed to train a model.
Because dense representations are able to capture complicated interactions between data
points, they are frequently employed in machine learning and can be especially
helpful for tasks like classification and regression. Operations on dense representations
can also be computationally efficient, which can shorten the time it takes to train
a model.
A matrix is a two-dimensional data object made of m rows and n columns, therefore
having total m x n values. If most of the elements of the matrix have 0 value, then it is
called a sparse matrix.
Why use a sparse matrix instead of a simple matrix?
Storage: There are fewer non-zero elements than zeros, so less memory is needed
when only the non-zero elements are stored.
Computing time: Computing time can be saved by designing a data structure that
traverses only the non-zero elements.
A sparse matrix can be represented in many ways; the following are two common
representations:
1. Array representation
2. Linked list representation
Example -
Let's understand the array representation of sparse matrix with the help of the example
given below -
In the above figure, we can observe a 5x4 sparse matrix containing 7 non-zero elements and
13 zero elements. The above matrix occupies 5x4 = 20 memory locations. Increasing the size
of the matrix increases the wasted space.
In the above structure, first column represents the rows, the second column represents the
columns, and the third column represents the non-zero value. The first row of the table
represents the triplets. The first triplet represents that the value 4 is stored at 0th row and
1st column. Similarly, the second triplet represents that the value 5 is stored at the 0th row
and 3rd column. In a similar manner, all triplets represent the stored location of the non-
zero elements in the matrix.
The size of the table depends upon the total number of non-zero elements in the given
sparse matrix. The above table occupies 8x3 = 24 memory locations, which is more than the
20 locations occupied by the sparse matrix itself. So, what is the benefit of the triplet
representation? Consider an 8x8 matrix with only 8 non-zero elements: the full matrix
occupies 8x8 = 64 locations, whereas the table of triplets occupies only 8x3 = 24.
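The triplet (array) representation can be sketched in a few lines of Python. The 5x4 matrix
below is a hypothetical stand-in for the one in the missing figure; it is chosen so that the
first two triplets match the description (4 at row 0, column 1 and 5 at row 0, column 3).
No header row is stored here, so its 7 triplets take 21 cells rather than the 24 counted above.

```python
def to_triplets(matrix):
    """Return (row, column, value) triplets for the non-zero entries of a 2-D list."""
    return [(r, c, v)
            for r, row in enumerate(matrix)
            for c, v in enumerate(row)
            if v != 0]


# Hypothetical 5x4 sparse matrix with 7 non-zero elements.
matrix = [
    [0, 4, 0, 5],
    [0, 0, 3, 6],
    [0, 0, 2, 0],
    [2, 0, 0, 0],
    [1, 0, 0, 0],
]
triplets = to_triplets(matrix)
dense_cells = len(matrix) * len(matrix[0])   # cells in the full matrix
triplet_cells = 3 * len(triplets)            # cells in the triplet table

# The 8x8 case from the text: 8 non-zero elements on the diagonal.
identity8 = [[1 if r == c else 0 for c in range(8)] for r in range(8)]
big_dense_cells = 8 * 8
big_triplet_cells = 3 * len(to_triplets(identity8))

print(triplets[:2], dense_cells, triplet_cells, big_dense_cells, big_triplet_cells)
```

The saving only appears once the matrix is large relative to its number of non-zero entries,
which is the point of the 8x8 example.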
Example -
Let's understand the linked list representation of sparse matrix with the help of the example
given below -
In the above figure, the sparse matrix is represented in the linked list form. In the node, the
first field represents the index of the row, the second field represents the index of the
column, the third field represents the value, and the fourth field contains the address of the
next node.
In the above figure, the first field of the first node of the linked list contains 0, which means
0th row, the second field contains 2, which means 2nd column, and the third field contains 1
that is the non-zero element. So, the first node represents that element 1 is stored at the
0th row-2nd column in the given sparse matrix. In a similar manner, all of the nodes represent
the non-zero elements of the sparse matrix.
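A minimal sketch of the linked-list representation follows. The Node class mirrors the four
fields described above (row, column, value, next); the class name, helper names and the small
matrix are hypothetical, with the first non-zero element placed at row 0, column 2 as in the
description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    row: int
    col: int
    value: float
    next: Optional["Node"] = None   # address of the next node, None at the end


def to_linked_list(matrix):
    """Build a singly linked list holding the non-zero entries in row-major order."""
    head = None
    tail = None
    for r, row in enumerate(matrix):
        for c, v in enumerate(row):
            if v != 0:
                node = Node(r, c, v)
                if head is None:
                    head = node
                else:
                    tail.next = node
                tail = node
    return head


def traverse(head):
    """Yield (row, col, value) for every node, visiting only non-zero elements."""
    node = head
    while node is not None:
        yield (node.row, node.col, node.value)
        node = node.next


# Hypothetical sparse matrix whose first non-zero element (1) sits at row 0, column 2.
matrix = [
    [0, 0, 1, 0],
    [2, 0, 0, 0],
    [0, 3, 0, 0],
]
nodes = list(traverse(to_linked_list(matrix)))
print(nodes)
```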
Sparse coding is a special case of the more general idea of a neural code. Consider the case
when you have binary neurons. So, basically:
The neural networks will get some inputs and deliver outputs
Some neurons in the neural network will be frequently activated while others won’t
be activated at all to calculate the outputs
The average activity ratio refers to the number of activations on some data, whereas
the neural code is the observation of those activations for a specific input
Neural coding is the process of instructing your neurons to produce a reliable neural
code
Now that we know what a neural code is, we can speculate on what it may be like. Then,
data will be encoded using a sparse code while taking into consideration the following
scenarios:
Bagging or Bootstrap Aggregating is an ensemble learning method that is used to reduce the
error by training homogeneous weak learners on different random samples from the
training set, in parallel. The results of these base learners are then combined through voting
or averaging approach to produce an ensemble model that is more robust and accurate.
Bagging mainly focuses on obtaining an ensemble model with lower variance than the individual
base models composing it. Hence, bagging techniques help avoid the overfitting of the
model.
Benefits of Bagging
Reduce Overfitting
Improve Accuracy
Handles Unstable Models
Note: The Random Forest algorithm is one of the most common bagging algorithms.
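Bagging can be sketched with NumPy alone: draw bootstrap samples, fit one weak learner per
sample, and combine the predictions by majority vote. The weak learner here is a hypothetical
one-dimensional threshold rule ("stump"), and the data set is synthetic; the function names
are assumptions made for illustration.

```python
import numpy as np


def fit_stump(x, y):
    """Pick the threshold t (predict 1 if x > t) with the lowest training error."""
    best_t, best_err = x[0], np.inf
    for t in np.unique(x):
        err = np.mean((x > t).astype(int) != y)
        if err < best_err:
            best_t, best_err = t, err
    return best_t


def bagging_fit(x, y, n_models=25, rng=None):
    """Fit one stump per bootstrap sample (sampling with replacement)."""
    rng = np.random.default_rng(0) if rng is None else rng
    thresholds = []
    for _ in range(n_models):
        idx = rng.integers(0, len(x), size=len(x))
        thresholds.append(fit_stump(x[idx], y[idx]))
    return thresholds


def bagging_predict(thresholds, x):
    """Majority vote over the predictions of all stumps."""
    votes = np.array([(x > t).astype(int) for t in thresholds])
    return (votes.mean(axis=0) > 0.5).astype(int)


# Synthetic data: the label is 1 when the feature exceeds 0.5.
x_train = np.linspace(0.0, 1.0, 40)
y_train = (x_train > 0.5).astype(int)

models = bagging_fit(x_train, y_train, n_models=25)
predictions = bagging_predict(models, np.array([0.1, 0.9]))
print(predictions)
```

An odd number of models avoids ties in the vote; for regression the vote would be replaced
by an average.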
Image classification
In the above example, it was observed that a specific record was predicted as a dog by the
logistic regression and decision tree models, while a support vector machine identified it as
a cat. As various models have their distinct advantages and disadvantages for particular
The procedure is called aggregation or voting. It combines the predictions of all underlying
models into one prediction that is assumed to be more precise than that of any single
sub-model on its own.
Boosting is an ensemble learning method that involves training homogenous weak
learners sequentially such that a base model depends on the previously fitted base models.
All these base learners are then combined in a very adaptive way to obtain an ensemble
model.
In boosting, the ensemble model is the weighted sum of all constituent base learners. The
following boosting algorithms differ in how the base models are built and aggregated
(XGBoost is an optimized implementation of gradient boosting):
Adaptive Boosting (AdaBoost)
Gradient Boosting
XGBoost
Bias: While making predictions, a difference occurs between the values predicted by the
model and the actual (expected) values. This difference is known as bias error, or
error due to bias.
o Low Bias: A low bias model will make fewer assumptions about the form of the
target function.
o High Bias: A model with a high bias makes more assumptions, and the model
becomes unable to capture the important features of our dataset. A high bias model
also cannot perform well on new data.
o Variance: Low variance means there is a small variation in the prediction of the target
function with changes in the training dataset, whereas high variance means a large
variation in the prediction of the target function with changes in the training dataset.
Tangent propagation (TangentProp) combines prior knowledge, expressed as desired derivatives
of the target function, with observed training data by minimizing an objective function that
measures both the network's error with respect to the training example values (fitting the
data) and its error with respect to the desired derivatives (fitting the prior knowledge).
Tangent propagation is closely related to dataset augmentation. In both cases, the user of the
algorithm encodes his or her prior knowledge of the task by specifying a set of
transformations that should not alter the output of the network.
The difference is that in the case of dataset augmentation, the network is explicitly trained to
correctly classify distinct inputs that were created by applying more than an infinitesimal
amount of these transformations.
Tangent propagation does not require explicitly visiting a new input point. Instead, it
analytically regularizes the model to resist perturbation in the directions corresponding to
the specified transformation. While this analytical approach is intellectually elegant,
it has two major drawbacks. First, it only regularizes the model to resist infinitesimal
perturbation, whereas explicit dataset augmentation confers resistance to larger
perturbations (that is, larger changes to the inputs). Second, the infinitesimal approach
poses difficulties for models based on rectified linear units. These models can only shrink
their derivatives by turning units off or shrinking their weights.
The TANGENTPROP algorithm assumes various training derivatives of the target function are
also provided. For example, if each instance $x_i$ is described by a single real value, then
each training example may be of the form
$\left(x_i, f(x_i), \frac{\partial f(x)}{\partial x}\big|_{x_i}\right)$. Here
$\frac{\partial f(x)}{\partial x}\big|_{x_i}$ denotes the derivative of the target function f
with respect to x, evaluated at the point $x = x_i$.
To develop an intuition for the benefits of providing training derivatives as well as training
values during learning, consider the simple learning task depicted in the figure.
The task is to learn the target function f shown in the leftmost plot of the figure, based on
the three training examples shown: $(x_1, f(x_1))$, $(x_2, f(x_2))$, and $(x_3, f(x_3))$.
Given these three training examples, the BACKPROPAGATION algorithm can be expected to
hypothesize a smooth function, such as the function g depicted in the middle plot of the
figure. The rightmost plot shows the effect of providing training derivatives, or slopes, as
additional information for each training example (e.g.,
$\left(x_1, f(x_1), \frac{\partial f(x)}{\partial x}\big|_{x_1}\right)$). By fitting both the
training values $f(x_i)$ and these training derivatives
$\frac{\partial f(x)}{\partial x}\big|_{x_i}$, the learner has a better chance to correctly
generalize from the sparse training data.
To summarize, the impact of including the training derivatives is to override the usual
syntactic inductive bias of BACKPROPAGATION that favors a smooth interpolation between
points, replacing it by explicit input information about required derivatives. The resulting
hypothesis h shown in the rightmost plot of the figure provides a much more accurate
estimate of the true target function f.
In the first plot of the figure, f(x) is the target function and x1, x2, x3 are the training
instances. The second plot shows the hypothesis obtained by fitting the instance values
alone, and the learner then moves towards the proper hypothesis by making the necessary
modifications using the training derivatives.
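TANGENTPROP considers the squared error between the specified training derivative and the
actual derivative of the learned neural network. For the single-input case described above,
the modified error function can be written as
$$E = \sum_i \left[ \left(f(x_i) - \hat{f}(x_i)\right)^2 + \mu \left( \frac{\partial f(x)}{\partial x}\Big|_{x_i} - \frac{\partial \hat{f}(x)}{\partial x}\Big|_{x_i} \right)^2 \right]$$
where $\hat{f}$ is the function computed by the network and $\mu$ is a constant provided by
the user to determine the relative importance of fitting training values versus fitting
training derivatives.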
Notice the first term in this definition of E is the original squared error of the network versus
training values, and the second term is the squared error in the network versus training
derivatives.
In the third plot of the figure, the hypothesis fitted with the training derivatives passes
through the instances and follows the target function more accurately.
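The two-term error can be sketched numerically for the single-input case: one term for the
training values and one, weighted by a user-chosen constant, for the training derivatives.
The function name tangentprop_error, the weight mu = 0.5, the target f(x) = x^2 and the
candidate hypothesis g(x) = 2x are all assumptions made for illustration.

```python
import numpy as np


def tangentprop_error(f_true, df_true, f_hat, df_hat, mu=1.0):
    """Squared error on values plus mu times squared error on derivatives."""
    value_term = np.sum((f_true - f_hat) ** 2)
    derivative_term = np.sum((df_true - df_hat) ** 2)
    return float(value_term + mu * derivative_term)


x = np.array([0.0, 1.0, 2.0])       # three training instances

# Hypothetical target f(x) = x^2 with derivative 2x.
f_true = x ** 2
df_true = 2 * x

# Hypothetical candidate g(x) = 2x with derivative 2 everywhere.
g_vals = 2 * x
g_derivs = np.full_like(x, 2.0)

error_values_only = tangentprop_error(f_true, df_true, g_vals, g_derivs, mu=0.0)
error_with_slopes = tangentprop_error(f_true, df_true, g_vals, g_derivs, mu=0.5)
print(error_values_only, error_with_slopes)
```

The candidate g looks good when only values are compared, but the derivative term exposes
that its slopes disagree with the target at two of the three points.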
Optimization for Train Deep Models: Challenges in Neural Network Optimization, Basic
Algorithms, Parameter Initialization Strategies, Algorithms with Adaptive Learning Rates,
Approximate Second Order Methods, Optimization Strategies and Meta-Algorithms
There are several types of optimization in deep learning algorithms but the most interesting
The core of deep learning optimization relies on trying to minimize the cost function of a
model without affecting its training performance. That type of optimization problem
contrasts with the general optimization problem in which the objective is to simply
minimize a specific indicator without being constrained by the performance of other
elements (e.g., training).
Most optimization algorithms in deep learning are based on gradient estimations. In that
context, optimization algorithms use the gradient of a specific cost function, evaluated
against the training dataset, to reduce its value. There are different categories of
optimization algorithms depending on the way they interact with the training dataset. For
instance, algorithms that use the entire training set at once are called deterministic (or
batch) algorithms. Techniques that use one training example at a time have come to be known
as online algorithms. Similarly, algorithms that use more than one but fewer than all of the
training examples during the optimization process are known as minibatch stochastic or
simply stochastic.
There are plenty of challenges in deep learning optimization but most of them are related to
the nature of the gradient of the model. Below, I’ve listed some of the most common
challenges in deep learning optimization that you are likely to run into:
a) Local Minima: Local minima are a permanent challenge in the optimization of any deep
learning algorithm. The problem arises when the cost function has many local minima that
differ from, and are not correlated with, its global minimum, so that following the gradient
can end in one of them.
b) Saddle Points: Saddle points are another reason for gradients to vanish. A saddle point is
any location where all gradients of a function vanish but which is neither a global nor a
local minimum.
c) Inexact Gradients: There are many deep learning models in which the cost function is
intractable, which forces an inexact estimation of the gradient. In these cases, the
inexact gradients introduce a second layer of uncertainty in the model.
d) Local vs. Global Structures: Another very common challenge in the optimization of
deep learning models is that local regions of the cost function don’t correspond with its
global structure, producing a misleading gradient.
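The saddle-point item in the list above can be illustrated with the textbook function
f(x, y) = x^2 - y^2, chosen here as an assumption for illustration. Its gradient vanishes at
the origin, yet the Hessian has one positive and one negative eigenvalue, so the origin is
neither a minimum nor a maximum.

```python
import numpy as np


def f(p):
    x, y = p
    return x ** 2 - y ** 2


def grad_f(p):
    x, y = p
    return np.array([2.0 * x, -2.0 * y])


# The Hessian of f is constant: second derivatives are 2 along x and -2 along y.
hessian = np.array([[2.0, 0.0],
                    [0.0, -2.0]])

origin = np.array([0.0, 0.0])
gradient_at_origin = grad_f(origin)
eigenvalues = np.linalg.eigvalsh(hessian)   # returned in ascending order

print(gradient_at_origin, eigenvalues)
# f rises along x and falls along y when moving away from the origin.
print(f(np.array([0.1, 0.0])), f(np.array([0.0, 0.1])))
```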
Solution: Gradient clipping, advanced weight initialization, and skip connections help keep
gradients in a usable range, so that the network learns accurately and consistently.
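Of the remedies named above, gradient clipping is the simplest to sketch. The version below
rescales a gradient vector whenever its L2 norm exceeds a threshold; the function name
clip_by_norm and the threshold value are assumptions made for illustration.

```python
import numpy as np


def clip_by_norm(grad, max_norm):
    """Rescale grad so that its L2 norm is at most max_norm; direction is preserved."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad


large_grad = np.array([3.0, 4.0])             # norm 5
clipped = clip_by_norm(large_grad, max_norm=1.0)

small_grad = np.array([0.3, 0.4])             # norm 0.5, left untouched
unchanged = clip_by_norm(small_grad, max_norm=1.0)
print(clipped, unchanged)
```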
Catastrophic Forgetting
When a model forgets previously learned knowledge after training on new data, it
encounters the issue of catastrophic forgetting.
Solution: Implement techniques like elastic weight consolidation (EWC) or knowledge
distillation to retain old knowledge during continual learning.
Hardware and Deployment Constraints
Deploying trained models on devices with little computing power can be hard.
Solution: Model compression techniques such as quantization, pruning, and knowledge
distillation make models run better on devices with limited resources.
Data Privacy and Security
When training computers to do complex tasks, it is essential to keep data private and
ensure the computers are secure.
Solution: Employ federated learning, secure aggregation, or differential privacy techniques
to protect data and model privacy.
Long Training Times
Training deep neural networks is like doing a challenging puzzle. It takes a lot of time to
assemble the puzzle, especially if it is vast and has a lot of pieces.
Solution: Hardware accelerators such as GPUs or TPUs can help train models faster. Training
can also be distributed across several machines to make it even quicker.
Exploding Memory Usage
Some models are too big and need a lot of space, so they are hard to use on regular
computers.
Solution: Explore memory-efficient architectures, use gradient checkpointing, or consider
model parallelism for training.
Learning Rate Scheduling
Setting an appropriate learning rate schedule can be challenging, affecting model
convergence and performance.
Solution: Learning rate schedules such as step decay, exponential decay, or warm-up followed
by decay can make training faster and more stable.
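Two common schedules can be sketched as plain functions of the epoch number. The function
names and the default constants (halving every 10 epochs, decay rate 0.1) are assumptions
made for illustration, not recommended settings.

```python
import math


def step_decay(initial_lr, epoch, drop=0.5, epochs_per_drop=10):
    """Multiply the learning rate by `drop` once every `epochs_per_drop` epochs."""
    return initial_lr * drop ** (epoch // epochs_per_drop)


def exponential_decay(initial_lr, epoch, k=0.1):
    """Shrink the learning rate smoothly as exp(-k * epoch)."""
    return initial_lr * math.exp(-k * epoch)


step_rates = [step_decay(0.1, e) for e in (0, 9, 10, 25)]
exp_rates = [exponential_decay(0.1, e) for e in (0, 10, 25)]
print(step_rates, exp_rates)
```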
Avoiding Local Minima
Deep neural networks can get stuck in local minima during training, impacting the model's
final performance.
Solution: Using unique strategies like simulated annealing, momentum-based optimization,
and evolutionary algorithms can help us escape difficult spots.
Unstable Loss Surfaces
Finding good parameters is hard when the loss surface is high-dimensional, non-convex, and
bumpy, because small changes in the parameters can produce large changes in the loss.
Solution: Utilize weight noise injection, curvature-based optimization, or geometric
methods to stabilize loss surfaces.
Ill-Conditioned Matrix
In a neural network, the weight adjustments and hidden-layer computations are carried out in
matrix form, and the condition number of a matrix characterises how well further
computations with it will behave. Formally, it is a measure of how much the output value of
a function can change for a small change in the input argument.
A matrix is said to be ill-conditioned if its condition number is very high, so that a small
change in the input produces a large change in the output. In optimization the matrix of
interest is the Hessian, the square matrix of second-order partial derivatives of a scalar
function.
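The effect of a high condition number can be checked with NumPy. The two diagonal matrices
below are hypothetical stand-ins for a well-conditioned and an ill-conditioned Hessian; the
sketch solves a linear system with each right-hand side and shows how a tiny change in the
input moves the solution.

```python
import numpy as np

well_conditioned = np.array([[2.0, 0.0],
                             [0.0, 1.0]])
ill_conditioned = np.array([[1.0, 0.0],
                            [0.0, 1e-6]])

cond_well = np.linalg.cond(well_conditioned)   # ratio of largest to smallest singular value
cond_ill = np.linalg.cond(ill_conditioned)

# Solve A x = b for the ill-conditioned matrix, then perturb b by only 1e-6.
b = np.array([1.0, 1e-6])
b_perturbed = np.array([1.0, 2e-6])
x = np.linalg.solve(ill_conditioned, b)
x_perturbed = np.linalg.solve(ill_conditioned, b_perturbed)

print(cond_well, cond_ill)
print(x, x_perturbed)   # a 1e-6 change in the input moves the solution by about 1
```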
Basic Algorithms
Gradient Descent is an iterative optimization process that searches
for an objective function’s optimum value (Minimum/Maximum). It is
one of the most used methods for changing a model’s parameters
to reduce a cost function in machine learning projects.
The primary goal of gradient descent is to identify the model
parameters that provide the maximum accuracy on both training
and test datasets.
In SGD, instead of using the entire dataset for each iteration, only a
single random training example (or a small batch) is selected to
calculate the gradient and update the model parameters. This
random selection introduces randomness into the optimization
process, hence the term “stochastic” in Stochastic Gradient
Descent.
The advantage of using SGD is its computational efficiency,
especially when dealing with large datasets. By using a single
example or a small batch, the computational cost per iteration is
significantly reduced compared to traditional Gradient Descent
methods that require processing the entire dataset.
Stochastic Gradient Descent Algorithm
Initialization: Randomly initialize the parameters of the model.
Set Parameters: Determine the number of iterations and the
learning rate (alpha) for updating the parameters.
Stochastic Gradient Descent Loop: Repeat the following steps until
the model converges or reaches the maximum number of
iterations:
a. Shuffle the training dataset to introduce randomness.
b. Iterate over each training example (or a small
batch) in the shuffled order.
c. Compute the gradient of the cost function with
respect to the model parameters using the current training
example (or batch).
d. Update the model parameters by taking a step in the
direction of the negative gradient, scaled by the learning rate.
e. Evaluate the convergence criteria, such as the
difference in the cost function between iterations or the norm of the gradient.
Return Optimized Parameters: Once the convergence criteria are
met or the maximum number of iterations is reached, return the
optimized model parameters.
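The steps above can be sketched for a linear regression model with a mean squared error cost.
This is a minimal illustration under stated assumptions: the function name
sgd_linear_regression, the learning rate, the tolerance and the noise-free synthetic data
are all hypothetical, and the comments map each line to the lettered steps.

```python
import numpy as np


def sgd_linear_regression(X, y, lr=0.05, n_epochs=200, batch_size=1, tol=1e-8, seed=0):
    """Fit weights w minimising mean((X @ w - y)^2) with stochastic gradient descent."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.01, size=X.shape[1])       # initialization
    prev_cost = np.inf
    for _ in range(n_epochs):                         # at most n_epochs passes
        order = rng.permutation(len(X))               # a. shuffle the training set
        for start in range(0, len(X), batch_size):    # b. iterate over examples/batches
            idx = order[start:start + batch_size]
            err = X[idx] @ w - y[idx]
            grad = 2.0 * X[idx].T @ err / len(idx)    # c. gradient of the cost
            w -= lr * grad                            # d. step against the gradient
        cost = np.mean((X @ w - y) ** 2)              # e. convergence check
        if abs(prev_cost - cost) < tol:
            break
        prev_cost = cost
    return w                                          # optimized parameters


# Synthetic data generated from y = 1 + 2x; the first column of X is the bias term.
x = np.linspace(-1.0, 1.0, 20)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x

w_fit = sgd_linear_regression(X, y)
print(w_fit)   # close to the generating coefficients [1, 2]
```

Setting batch_size to the full data set size turns the same loop into ordinary (batch)
gradient descent.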
The momentum algorithm introduces a variable v that plays the role of velocity: it is the
direction and speed at which the parameters move through parameter space. The velocity is set
to an exponentially decaying average of the negative gradient. The name momentum derives from
a physical analogy, in which the negative gradient is a force moving a particle through
parameter space, according to Newton’s laws of motion. Momentum in physics is mass times
velocity. In the momentum learning algorithm, we assume unit mass, so the velocity vector v
may also be regarded as the momentum of the particle. A hyperparameter $\alpha \in [0, 1)$
determines how quickly the contributions of previous gradients decay, and the velocity is
added to the parameters at each step as
$$v \leftarrow \alpha v - \epsilon \nabla_\theta J(\theta), \qquad \theta \leftarrow \theta + v$$
where $\epsilon$ is the learning rate and $J(\theta)$ is the cost function.
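A minimal sketch of the standard momentum update: the velocity is a decaying sum of past
negative gradients (decay factor alpha, learning rate lr), and the parameters move by the
velocity at every step. The function name momentum_step, the hyperparameter values and the
toy cost J(theta) = theta^2 are assumptions made for illustration.

```python
def momentum_step(theta, v, grad, lr=0.1, alpha=0.9):
    """One momentum update: refresh the velocity, then move the parameter by it."""
    v = alpha * v - lr * grad
    theta = theta + v
    return theta, v


# Toy cost J(theta) = theta^2, whose gradient is 2 * theta.
theta, v = 1.0, 0.0
theta, v = momentum_step(theta, v, grad=2 * theta)
first_theta, first_v = theta, v          # state after the first update

for _ in range(199):
    theta, v = momentum_step(theta, v, grad=2 * theta)

print(first_theta, first_v, theta)       # theta ends very close to the minimum at 0
```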
Training algorithms for deep learning models are iterative in nature and
require the specification of an initial point. This is extremely crucial as it
often decides whether the algorithm converges and if it does, then does
the algorithm converge to a point with high cost or low cost.
We have limited understanding of neural network optimization, but the
one property that we know with complete certainty is that the
initial parameters need to break symmetry between different units.
Biases are often chosen heuristically (zero mostly) and only the weights
are randomly initialized, almost always from a Gaussian or uniform
distribution. The scale of the distribution is of utmost concern. Large
weights might have better symmetry-breaking effect but might lead to
chaos (extreme sensitivity to small perturbations in the input) and
exploding values during forward and back propagation. As an example of
how large weights might lead to chaos, suppose a slight noise ϵ is added
to the input. If we applied just a simple linear transformation like
W * x, the noise would add a term W * ϵ to the output. If the weights are
large, this term makes a significant contribution to the output. SGD and
its variants tend to halt in areas near the initial values, thereby
expressing a prior that the path to the final parameters from the initial
values is discoverable by steepest descent algorithms.
Various suggestions have been made for appropriate initialization of the
parameters. The most commonly used ones include sampling the weights
of each fully-connected layer having m inputs and n outputs uniformly
from the following distributions:
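Two widely used uniform initializations for a fully-connected layer with m inputs and n
outputs are U(-1/sqrt(m), 1/sqrt(m)) and the normalized (Glorot) initialization
U(-sqrt(6/(m+n)), sqrt(6/(m+n))). They are assumed here to be the distributions intended;
the function names and layer sizes in the sketch are hypothetical.

```python
import numpy as np


def uniform_init(m, n, rng):
    """Sample an (m, n) weight matrix from U(-1/sqrt(m), 1/sqrt(m))."""
    limit = 1.0 / np.sqrt(m)
    return rng.uniform(-limit, limit, size=(m, n))


def glorot_uniform_init(m, n, rng):
    """Sample an (m, n) weight matrix from U(-sqrt(6/(m+n)), sqrt(6/(m+n)))."""
    limit = np.sqrt(6.0 / (m + n))
    return rng.uniform(-limit, limit, size=(m, n))


rng = np.random.default_rng(0)
m, n = 100, 50                          # hypothetical layer: 100 inputs, 50 outputs
W_simple = uniform_init(m, n, rng)      # limit is 1 / sqrt(100) = 0.1
W_glorot = glorot_uniform_init(m, n, rng)   # limit is sqrt(6 / 150) = 0.2
biases = np.zeros(n)                    # biases set heuristically to zero
print(np.abs(W_simple).max(), np.abs(W_glorot).max())
```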
parameters has reduced to 100 but that of the other parameter is still
around 750. However, because of the accumulation at each update, the
accumulated gradient would still have almost the same value. For example, let
the accumulated gradients over the steps be
1000 + 900 + 700 + 400 + 100 = 3100 for Parameter 1, giving 1/3100 ≈ 0.0003,
and 1000 + 900 + 850 + 800 + 750 = 4300 for Parameter 2, giving
1/4300 ≈ 0.0002. This would lead to a similar decrease in the
learning rates for both parameters, even though the parameter
with the lower gradient might have its learning rate reduced too much,
leading to slower learning.
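The arithmetic above can be reproduced in a few lines and contrasted with an exponentially
decaying average, which weights recent gradients more heavily. To stay close to the example,
the sketch accumulates the gradient magnitudes themselves; the actual AdaGrad and RMSProp
updates accumulate squared gradients and divide by the square root of the accumulator. The
decay rate rho = 0.5 and the function names are assumptions made for illustration.

```python
grads_p1 = [1000, 900, 700, 400, 100]    # Parameter 1: gradient falls quickly
grads_p2 = [1000, 900, 850, 800, 750]    # Parameter 2: gradient stays high


def accumulated_sum(grads):
    """AdaGrad-style accumulation: every past gradient counts equally."""
    return sum(grads)


def decaying_average(grads, rho=0.5):
    """RMSProp-style accumulation: older gradients are forgotten geometrically."""
    avg = 0.0
    for g in grads:
        avg = rho * avg + (1 - rho) * g
    return avg


sum_p1, sum_p2 = accumulated_sum(grads_p1), accumulated_sum(grads_p2)
avg_p1, avg_p2 = decaying_average(grads_p1), decaying_average(grads_p2)

print(sum_p1, sum_p2, round(1 / sum_p1, 4), round(1 / sum_p2, 4))
print(avg_p1, avg_p2)
```

With plain accumulation the two parameters end up with similar scaling factors, while the
decaying average separates them clearly because it tracks the most recent gradients.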
This allows the algorithm to converge rapidly after finding a convex bowl, as
if it were an instance of AdaGrad initialized within that bowl. Consider
the figure below. The region represented
by 1 indicates usual RMSProp parameter updates as given by the update
equation, which is nothing but exponentially averaged AdaGrad updates.
Once the optimization process lands on A, it essentially lands at the top
of a convex bowl. At this point, intuitively, all the updates before A can
be seen to be forgotten due to the exponential averaging and it can be
seen as if (exponentially averaged) AdaGrad updates start from point A
onwards.
For quadratic surfaces (i.e. where cost function is quadratic), this directly
gives the optimal result in one step whereas gradient descent would still
need to iterate. However, for surfaces that are not quadratic, as long as
the Hessian remains positive definite, we can obtain the optimal point
through a 2-step iterative process — 1) Get the inverse of the Hessian
and 2) update the parameters.
Saddle points are problematic for Newton’s method. If not all eigenvalues
of the Hessian are positive, Newton’s method might cause the updates to move in
the wrong direction. A way to avoid this is to add regularization:
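Both points can be checked numerically. The sketch below implements a Newton step with an
optional regularization constant alpha added to the diagonal of the Hessian, a common form
of the regularized update that is assumed here. The quadratic cost, the saddle function
x^2 - y^2, the value alpha = 3 and the function name newton_step are hypothetical choices
for illustration.

```python
import numpy as np


def newton_step(theta, grad, hessian, alpha=0.0):
    """Return theta - (H + alpha * I)^(-1) grad, solving rather than inverting."""
    regularized = hessian + alpha * np.eye(len(theta))
    return theta - np.linalg.solve(regularized, grad)


# 1) Quadratic cost 0.5 * theta^T A theta - b^T theta: one step reaches the optimum.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])            # positive definite Hessian
b = np.array([1.0, 1.0])
theta0 = np.array([5.0, -3.0])
theta1 = newton_step(theta0, A @ theta0 - b, A)

# 2) Saddle of f(x, y) = x^2 - y^2: the plain step jumps onto the saddle point,
#    while the regularized step moves away from it along y.
H_saddle = np.array([[2.0, 0.0],
                     [0.0, -2.0]])
start = np.array([1.0, 1.0])
grad_start = np.array([2.0, -2.0])
plain = newton_step(start, grad_start, H_saddle)
regularized = newton_step(start, grad_start, H_saddle, alpha=3.0)

print(theta1, plain, regularized)
```

The regularization constant has to exceed the magnitude of the most negative eigenvalue for
the regularized matrix to be positive definite, which is why alpha = 3 is used against an
eigenvalue of -2.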
Now, the previous search direction contributes towards finding the next search
direction.
This cost function describes the learning problem called sparse coding.
Here, H refers to the sparse representation of X and W is
Coordinate descent may fail terribly when one variable influences the
optimal value of another variable.
Applications: Large-Scale Deep Learning : Computer Vision, Speech Recognition, Natural Language Processing
Deep learning is used in many fields, and its potential keeps growing. Let’s look at a
few widespread applications of deep learning.
Recommendation Systems
Autonomous Vehicles
Conclusion