
INTRODUCTION:

BIOLOGICAL NEURON
AND MEMORY
UNDERSTANDING THE BASICS OF NEURAL FUNCTION AND MEMORY
MECHANISMS
BIOLOGICAL NEURON:

• Definition:
• A neuron is a specialized cell transmitting nerve impulses; a nerve cell.

• Structure:
• Dendrites: Receive signals from other neurons.
• Cell Body (Soma): Contains the nucleus and processes incoming signals.
• Axon: Transmits signals away from the cell body.
• Synapse: The junction between two neurons where information is transmitted.
STRUCTURE OF A BIOLOGICAL NEURON:
FEATURES OF BIOLOGICAL NEURAL NETWORK

• Biological neural networks consist of interconnected neurons in the brain and nervous system.
• They process information through electrical and chemical signals, enabling complex behaviors and
cognitive functions.
• Robustness and fault tolerance: The decay of nerve cells does not seem to affect the performance
significantly.
• Flexibility: The network automatically adjusts to a new environment without using any
preprogrammed instructions.
• Ability to deal with a variety of data situations: The network can deal with information that is
fuzzy, probabilistic, noisy and inconsistent.
• Collective computation: The network routinely performs many operations in parallel and also performs a given task in a distributed manner.
• Neural networks are biologically inspired algorithms that attempt to mimic the functions of neurons in the brain.
• Each neuron acts as a computational unit, accepting input from the dendrites and outputting a signal through the axon terminals.
• Actions are triggered when a specific combination of neurons is activated. The human brain is made up of about 100 billion neurons.
• Neurons receive electrical signals at the dendrites and send them along the axon.
KEY FEATURES:

Neurons:
• Basic building blocks of the nervous system.
• Comprise dendrites, cell body (soma), axon, and synapses.

Dendrites:
• Branch-like structures that receive signals from other neurons.
• Conduct electrical impulses toward the cell body.

Cell Body (Soma):
• Contains the nucleus and other organelles.
• Integrates incoming signals and generates outgoing signals.

Axon:
• Long, slender projection that conducts electrical impulses away from the cell body.
• Transmits signals to other neurons, muscles, or glands.

Synapses:
• Junctions between neurons where communication occurs.
• Consist of the presynaptic ending, synaptic cleft, and postsynaptic membrane.
KEY FEATURES:

Neuron Function:
• Electrical Signaling: Neurons communicate via electrical impulses called action potentials.
• Chemical Signaling: Neurotransmitters released at synapses facilitate signal transmission between neurons.

Memory:
• Definition: Memory is the process by which information is encoded, stored, and retrieved.
• Types of Memory:
  • Short-term Memory: Holds information temporarily.
  • Long-term Memory: Stores information indefinitely.

Mechanisms of Memory Formation:
• Synaptic Plasticity: Changes in the strength of synapses are crucial for learning and memory.
• Long-term Potentiation (LTP): A long-lasting increase in synaptic strength following high-frequency stimulation.

Role of Neurons in Memory:
• Neurons form networks that encode and store memories.
• Hippocampus: A critical brain region for memory formation.
THANKS

STRUCTURE AND FUNCTION OF
SINGLE NEURON - ARTIFICIAL
NEURAL NETWORK (ANN)
OVERVIEW:

• Artificial Neural Networks (ANNs) are computational models inspired by the biological
neural networks in the brain.
• A single neuron in an ANN is also known as a perceptron.
COMPONENTS OF THE BASIC ARTIFICIAL NEURON

• Inputs: Inputs are the set of values for which we need to predict an output value. They can be viewed as features or attributes in a dataset.
• Weights: Weights are real values attached to each input/feature; they convey the importance of that feature in predicting the final output.
COMPONENTS OF THE BASIC ARTIFICIAL NEURON

• Bias: Bias shifts the activation function towards the left or right; you can compare it to the y-intercept in the equation of a line.
• Summation Function: The summation function binds the weights and inputs together and calculates their sum.
• Activation Function: It is used to introduce non-linearity into the model.

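A minimal sketch in Python of how these components fit together (the feature values, weights, and bias below are made-up illustrative numbers):

```python
import math

def sigmoid(z):
    # Activation function: squashes the net input into the (0, 1) range.
    return 1.0 / (1.0 + math.exp(-z))

def artificial_neuron(inputs, weights, bias):
    # Summation function: weighted sum of the inputs plus the bias.
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    # The activation function then introduces non-linearity.
    return sigmoid(z)

# Example: two input features with hypothetical weights and bias.
print(artificial_neuron([0.5, 0.2], weights=[0.4, -0.7], bias=0.1))
```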
WHAT IS THE ROLE OF THE ACTIVATION
FUNCTIONS IN NEURAL NETWORKS?
• The idea behind the activation function is to introduce nonlinearity into the neural network
so that it can learn more complex functions.
• Without an activation function, the neural network behaves as a linear classifier, learning a function which is a linear combination of its input data.
• The activation function converts a neuron's inputs into its output.
• The activation function is responsible for deciding whether a neuron should be activated, i.e., fired, or not.
• To make the decision, the neuron first calculates the weighted sum of its inputs and then adds the bias to it.
• So, the basic purpose of the activation function is to introduce non-linearity into the output of a neuron.

ACTIVATION FUNCTIONS

• Activation functions are applied to the weighted sum of a neuron's inputs and bias; the result is used to decide whether the neuron is activated or not.
• Activation functions play an integral role in neural networks by introducing nonlinearity.
• This nonlinearity allows neural networks to develop complex representations and
functions based on the inputs that would not be possible with a simple linear regression
model.

TYPES OF ACTIVATION FUNCTIONS

• Linear Activation Functions


• Sigmoid
• ReLU
• Leaky ReLU
• Tanh
• Step Function
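A compact sketch of these functions in Python (the sample inputs are arbitrary):

```python
import math

def linear(x):              # identity: f(x) = x
    return x

def step(x, theta=0.0):     # binary step with threshold theta
    return 1.0 if x > theta else 0.0

def sigmoid(x):             # f(x) = 1 / (1 + e^(-x)), output in (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):                # S-shaped like sigmoid, but output in (-1, 1)
    return math.tanh(x)

def relu(x):                # f(x) = max(0, x)
    return max(0.0, x)

def leaky_relu(x, a=0.01):  # small slope a on the negative side
    return x if x > 0 else a * x

for f in (linear, step, sigmoid, tanh, relu, leaky_relu):
    print(f.__name__, [round(f(v), 3) for v in (-2.0, 0.0, 2.0)])
```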

LINEAR ACTIVATION FUNCTIONS

• The linear activation function, also known as "no activation" or the "identity function" (the input multiplied by 1.0), is one where the activation is proportional to the input.
• The function doesn't do anything to the weighted sum of the input; it simply returns the value it was given.
• Mathematically, it can be represented as f(x) = x.
LIMITATIONS OF LINEAR ACTIVATION FUNCTION

• It’s not possible to use backpropagation as the derivative of the function is a constant and
has no relation to the input x.
• All layers of the neural network will collapse into one if a linear activation function is
used. No matter the number of layers in the neural network, the last layer will still be a
linear function of the first layer. So, essentially, a linear activation function turns the neural
network into just one layer.

BINARY STEP FUNCTION

• The binary step function depends on a threshold value that decides whether a neuron should be activated or not.
• The input fed to the activation function is compared to a certain threshold; if the input is greater than it, the neuron is activated, else it is deactivated, meaning that its output is not passed on to the next hidden layer.
• Mathematically it can be represented as: f(x) = 1 if x ≥ θ, else 0.
LIMITATIONS OF BINARY STEP FUNCTION

• It cannot provide multi-value outputs; for example, it cannot be used for multi-class classification problems.
• The gradient of the step function is zero, which causes a hindrance in the backpropagation process.
SIGMOID / LOGISTIC ACTIVATION FUNCTION

• This function takes any real value as input and outputs values in the range of 0 to 1.
• The larger the input (more positive), the closer the output value will be to 1.0, whereas the smaller the input (more negative), the closer the output will be to 0.0.
• Mathematically, it can be represented as f(x) = 1 / (1 + e^(-x)).
LIMITATIONS OF SIGMOID ACTIVATION FUNCTION

• The derivative of the function is f'(x) = sigmoid(x) * (1 - sigmoid(x)).
• The gradient values are only significant in the range -3 to 3; the graph gets much flatter in other regions.
• This implies that for values greater than 3 or less than -3, the function will have very small gradients. As the gradient value approaches zero, the network ceases to learn and suffers from the vanishing gradient problem.
• The output of the logistic function is not symmetric around zero, so the outputs of all the neurons will be of the same sign. This makes the training of the neural network more difficult and unstable.
TANH FUNCTION (HYPERBOLIC TANGENT)

• The tanh function is very similar to the sigmoid/logistic activation function, and even has the same S-shape, with the difference that its output range is -1 to 1. In tanh, the larger the input (more positive), the closer the output value will be to 1.0, whereas the smaller the input (more negative), the closer the output will be to -1.0.
• Mathematically, it can be represented as f(x) = (e^x - e^(-x)) / (e^x + e^(-x)).
ADVANTAGE OF TANH FUNCTION

• The output of the tanh activation function is zero-centered; hence we can easily map the output values as strongly negative, neutral, or strongly positive.
• It is usually used in hidden layers of a neural network, as its values lie between -1 and 1; therefore, the mean for the hidden layer comes out to be 0 or very close to it. This helps in centering the data and makes learning for the next layer much easier.
LIMITATIONS OF TANH FUNCTION

• It also faces the problem of vanishing gradients, similar to the sigmoid activation function; in addition, the gradient of the tanh function is much steeper than that of the sigmoid.
• Note: Although both sigmoid and tanh face the vanishing gradient issue, tanh is zero-centered, and the gradients are not restricted to move in a certain direction. Therefore, in practice, tanh nonlinearity is usually preferred to sigmoid nonlinearity.
RELU FUNCTION

• ReLU stands for Rectified Linear Unit.
• Although it gives the impression of a linear function, ReLU has a derivative and allows for backpropagation, while simultaneously being computationally efficient.
• The ReLU function is a simple max(0, x) function, which can also be thought of as a piecewise function with all inputs less than 0 mapping to 0 and all inputs greater than or equal to 0 mapping back to themselves (i.e., the identity function).
• Mathematically it can be represented as f(x) = max(0, x).
ADVANTAGES OF RELU FUNCTION

• Since only a certain number of neurons are activated, the ReLU function is far more
computationally efficient when compared to the sigmoid and tanh functions.
• ReLU accelerates the convergence of gradient descent towards the global minimum of
the loss function due to its linear, non-saturating property.

LIMITATIONS OF RELU FUNCTION

• The Dying ReLU problem:
  • The negative side of the graph makes the gradient value zero. For this reason, during the backpropagation process, the weights and biases of some neurons are not updated. This can create dead neurons which never get activated.
  • All the negative input values become zero immediately, which decreases the model's ability to fit or train from the data properly.
LEAKY RELU FUNCTION

• Leaky ReLU is an improved version of the ReLU function designed to solve the Dying ReLU problem, as it has a small positive slope in the negative area.
• Mathematically it can be represented as: f(x) = x for x ≥ 0, and f(x) = a·x for x < 0, where a is a small constant (commonly 0.01).
ADVANTAGES OF LEAKY RELU FUNCTION

• The advantages of Leaky ReLU are the same as those of ReLU, in addition to the fact that it does enable backpropagation even for negative input values.
• By making this minor modification for negative input values, the gradient on the left side of the graph comes out to be a non-zero value. Therefore, we no longer encounter dead neurons in that region.
LIMITATIONS OF LEAKY RELU FUNCTION

• The predictions may not be consistent for negative input values.
• The gradient for negative values is a small value, which makes the learning of model parameters time-consuming.
THANKS

COMPARISON AND
CHARACTERISTICS OF BIOLOGICAL
AND ARTIFICIAL NEURAL NETWORK

RESEMBLANCE OF BIOLOGICAL AND ARTIFICIAL
NEURON

COMPARISON BETWEEN THE ANN AND BNN
THANKS

MCCULLOCH AND PITTS MODEL
McCulloch-Pitts Neuron Model
• McCulloch (a neuroscientist) and Pitts (a logician) proposed a highly simplified computational model of the neuron (1943), abbreviated as the MP Neuron. It is the fundamental building block of artificial neural networks and can mainly be used for classification problems.
• g aggregates the inputs x1, x2, …, xn ∈ {0, 1} and the function f takes a decision y ∈ {0, 1} based on this aggregation.
• The inputs can be excitatory or inhibitory.
• y = 0 if any xi is inhibitory; otherwise
  y = f(g(x)) = 1 if g(x) ≥ θ
              = 0 if g(x) < θ
• θ is called the thresholding parameter.
• This is called Thresholding Logic.
• On taking various inputs, the function aggregates them and takes a decision based on the aggregation.
• Aggregation simply means the sum of these binary inputs. If the aggregated value exceeds the threshold, the output is 1, else it is 0.

• For more details about the McCulloch-Pitts model, click on the link below:
• https://towardsdatascience.com/mcculloch-pitts-model-5fdf65ac5dd1

Case Study: McCulloch and Pitts' model
• Consider an input signal:
  • X1: is it raining?
  • X2: is it sunny?
• The value of each input is either 1 or 0.
• Assumption: use a weight of 1 for both X1 and X2.
• Keep the threshold at 1.
• Draw the simple McCulloch-Pitts model for this scenario.
• Write the truth table for this case study (a worked version is sketched below).
• Write the functions for Ysum and Yout.

• Conclusion:
  - In situations where Yout is 1, John needs to bring an umbrella.
  - In scenarios 2, 3 and 4, John has to bring an umbrella.
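A small sketch of the truth table in Python, assuming (as in the slides) weights of 1 for both inputs, a threshold of 1, and scenarios numbered 1-4 in the order (0,0), (0,1), (1,0), (1,1):

```python
# Ysum = 1*X1 + 1*X2; Yout = 1 if Ysum >= threshold, else 0.
THRESHOLD = 1
print("scenario X1 X2 | Ysum Yout")
for i, (x1, x2) in enumerate([(0, 0), (0, 1), (1, 0), (1, 1)], start=1):
    y_sum = 1 * x1 + 1 * x2
    y_out = 1 if y_sum >= THRESHOLD else 0
    print(f"{i}        {x1}  {x2} |  {y_sum}    {y_out}")
```

Running this gives Yout = 1 for scenarios 2, 3 and 4, matching the conclusion above.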
Implement AND function using McCulloch-Pitts model
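The worked slides for this step are images; as a hedged sketch, the AND gate can be realised with two excitatory inputs and a threshold of θ = 2, so the unit fires only when both inputs are active:

```python
def mp_neuron(inputs, theta):
    # MP neuron: g aggregates (sums) the binary inputs; f thresholds at theta.
    g = sum(inputs)
    return 1 if g >= theta else 0

# AND: fires only when both excitatory inputs are 1 (theta = 2).
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", mp_neuron([x1, x2], theta=2))
```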
Implementation of XOR function using MP neuron model
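The XOR slides are likewise images. A single MP neuron cannot realise XOR (it is not linearly separable), but a small layered combination can; the sketch below assumes the classic decomposition XOR(x1, x2) = OR(x1 AND NOT x2, x2 AND NOT x1), with absolute inhibition as defined earlier:

```python
def mp_neuron(excitatory, inhibitory, theta):
    # Absolute inhibition: any active inhibitory input forces the output to 0.
    if any(inhibitory):
        return 0
    return 1 if sum(excitatory) >= theta else 0

def xor(x1, x2):
    # Hidden units detect "x1 AND NOT x2" and "x2 AND NOT x1".
    h1 = mp_neuron([x1], [x2], theta=1)
    h2 = mp_neuron([x2], [x1], theta=1)
    # The output unit ORs the two hidden units (theta = 1).
    return mp_neuron([h1, h2], [], theta=1)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", xor(x1, x2))
```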
The McCulloch-Pitts model and the perceptron model are both early computational models of artificial neurons, but they have some key differences:
1. Inventors and Time Period:
- McCulloch-Pitts Model: Proposed by Warren McCulloch and Walter Pitts in 1943.
- Perceptron Model: Developed by Frank Rosenblatt in 1957.

2. Architecture:
- McCulloch-Pitts Model: It is a simplified model of a biological neuron, consisting of binary inputs (0 or 1), weighted connections, and a threshold function to produce binary
outputs.
- Perceptron Model: The perceptron is an extension of the McCulloch-Pitts model that incorporates real-valued weights and a summation function followed by a threshold
activation function.

3. Activation Function:
- McCulloch-Pitts Model: Uses a binary threshold activation function. If the weighted sum of inputs exceeds a certain threshold, the neuron fires (output is 1); otherwise, it doesn't
(output is 0).
- Perceptron Model: Also uses a threshold activation function, but it can handle real-valued inputs and weights. The output is 1 if the weighted sum exceeds the threshold, and 0
otherwise.

4. Learning:
- McCulloch-Pitts Model: It does not incorporate a learning mechanism. The weights and thresholds are typically set manually.
- Perceptron Model: Introduced the concept of learning. Rosenblatt's perceptron learning algorithm updates the weights based on errors, allowing it to learn from data.

5. Capabilities:
- McCulloch-Pitts Model: Limited in its ability to learn complex patterns due to fixed, manually set weights.
- Perceptron Model: Can learn linearly separable patterns and is capable of performing binary classification tasks.

THANKS

ARTIFICIAL NEURON
MODEL
SESSION - 5
CONTENT

• Rosenblatt's Perceptron Model
• Case study
• Minsky and Papert Model
• Summary
MCCULLOCH-PITTS MODEL

• The first computational model of a neuron was proposed by Warren McCulloch and Walter Pitts in 1943.
• The simplest binary classification can be achieved in the following way:
LIMITATIONS

• It cannot process non-Boolean inputs.
• It gives equal weights to each input.
• The threshold θ must be chosen manually.
PERCEPTRON MODEL

• Introduced in 1957 by Frank Rosenblatt.
• Core of deep learning concepts.
• A perceptron is a single-layer neural network.
• A perceptron can simply be seen as a set of inputs that are weighted and to which we apply an activation function.
• This produces a sort of weighted sum of inputs, resulting in an output.
• This is typically used for classification problems, but can also be used for regression problems.
PERCEPTRON MODEL

• We attach to each input a weight (wi) and notice how we add an input of value 1 with a weight of −θ. This is called the bias.
• The inputs can be seen as neurons and will be called the input layer. Altogether, these neurons and the function form a perceptron.
• The binary classification function of the perceptron network is represented as: y = 1 if Σ wi·xi − θ ≥ 0, else 0.
PERCEPTRON MODEL

• Schematic Representation
PERCEPTRON MODEL

• Comes under single-layer feedforward networks.
• Supervised learning approach; mostly used in classification tasks, but also works for regression approaches.
• Two types: single-layer perceptron and multi-layer perceptron.
PERCEPTRON MODEL

• The output of the perceptron network is
• yin = x1w1 + x2w2 + x3w3 + … + xnwn
• yout = f(yin)
• y = f(yin) = 1 if yin > θ
             = 0 if −θ ≤ yin ≤ θ
             = −1 if yin < −θ
• f(yin) is the activation function (a step function)
PERCEPTRON MODEL

• Compare yout with the target output.
• Weight Updation:
  • If y ≠ t, then
    w(new) = w(old) + α·t·x
    where α is the learning rate, t the target, and x the input.
• Similarly, bias updation:
    b(new) = b(old) + α·t
TRAINING ALGORITHM

• Step 0: Initialize weights and bias; set α = 1.
• Step 1: Perform steps 2 to 5 until the stopping condition is reached.
• Step 2: Perform steps 3 and 4 for each training pair.
• Step 3: Calculate the output:

  yin = Σi=1..n wi·xi + b

  y = f(yin) = 1 if yin > θ
             = 0 if −θ ≤ yin ≤ θ
             = −1 if yin < −θ
TRAINING ALGORITHM

• Step 4: Weight and bias adjustment:
  • If y ≠ t, then
    w(new) = w(old) + α·t·x
    b(new) = b(old) + α·t
  • Else:
    w(new) = w(old)
    b(new) = b(old)
• Step 5: Train the network until the stopping condition is reached (no change in weights for any training pair).
CASE STUDY

• Implementation of a two-input AND gate for bipolar inputs using Rosenblatt's perceptron model.
MINSKY AND PAPERT
MODEL
INTRODUCTION

• Marvin Minsky and Seymour Papert are two influential figures in the field of artificial
intelligence and neural networks.
• Their book, "Perceptrons: An Introduction to Computational Geometry," published in
1969, is a seminal work that critically analyzed the capabilities and limitations of the
Perceptron model, a simple type of artificial neural network.
• Their analysis highlighted significant challenges in the field and spurred the development
of more complex neural network architectures.
KEY CONCEPTS

THE PERCEPTRON MODEL

• A single-layer neural network used for binary classification tasks.
• Composed of input units, weights, a bias term, and an activation function (typically a step function).
• Functions by computing a weighted sum of inputs and passing the result through the activation function to produce a binary output.
LINEAR SEPARABILITY

• A fundamental concept in understanding the limitations of the Perceptron.
• Refers to the ability of a model to classify data points using a linear decision boundary.
• Problems that can be separated by a straight line (or a hyperplane in higher dimensions) are linearly separable.
MINSKY AND PAPERT'S ANALYSIS

LIMITATIONS OF THE PERCEPTRON

• Inability to Solve Non-Linearly Separable Problems: Minsky and Papert demonstrated that the Perceptron cannot solve problems where data points are not linearly separable. The XOR problem is a classic example.
  • XOR Problem: The XOR (exclusive OR) problem involves classifying pairs of binary inputs into two classes. The classes cannot be separated by a single straight line, making the problem unsolvable by a single-layer Perceptron.
• Limited Computational Power: They argued that the Perceptron's computational power is limited and insufficient for complex tasks.
MATHEMATICAL PROOFS

• Minsky and Papert provided mathematical proofs showing the Perceptron's limitations.
They rigorously analyzed the types of problems that could and could not be solved by
the Perceptron.
• Their proofs highlighted that any problem requiring a non-linear decision boundary
cannot be solved by a single-layer Perceptron.
IMPLICATIONS FOR AI RESEARCH

• Criticism and Impact: Their work initially led to a decline in interest and funding for
neural network research during the 1970s, a period often referred to as the "AI Winter."
• Reevaluation and Revival: Despite the initial setback, their criticisms were crucial for
the eventual resurgence of neural networks. Researchers recognized the need for multi-
layer networks, leading to the development of more advanced models like multi-layer
perceptrons (MLPs) and deep neural networks.
ADVANCES FOLLOWING MINSKY AND
PAPERT'S WORK

MULTI-LAYER PERCEPTRONS (MLPS):

• Introduction of additional layers (hidden layers) between the input and output layers.
• Use of non-linear activation functions (e.g., sigmoid, tanh, ReLU) in hidden layers.
• Ability to solve non-linearly separable problems and learn complex patterns.

BACKPROPAGATION ALGORITHM:

• Developed in the 1980s, backpropagation is a supervised learning algorithm used to train multi-layer neural networks.
• It involves adjusting weights through gradient descent to minimize the error between predicted and actual outputs.
• It enabled the training of deep neural networks and addressed the limitations highlighted by Minsky and Papert.
NEURAL NETWORK ARCHITECTURES:

• Emergence of various neural network architectures, such as Convolutional Neural Networks (CNNs) for image processing and Recurrent Neural Networks (RNNs) for sequential data.
• Significant improvements in computational power and the availability of large datasets facilitated advancements in neural networks.
THANKS
ANN TEAM

HEBBIAN LEARNING
RULE
SESSION - 7
BEGINNINGS OF ARTIFICIAL NEURON

• 1943 Artificial Neuron Model (McCulloch Pitt)

• 1949 Hebb’s Rule of Learning Weights (Hebb)

• 1958 Perceptron (Rosenblatt)

Unsupervised Learning

HEBBIAN LEARNING RULE

• The Hebbian Learning Rule, also known as the Hebb Learning Rule, was proposed by Donald O. Hebb in 1949.
• It is one of the first and also the simplest learning rules for neural networks.
• According to Hebb's rule, the weights increase proportionately to the product of input and output.
• This means that in a Hebb network, if two neurons are interconnected, then the weight associated with these neurons is increased by changes in the synaptic gap.
• This network is suitable for bipolar data.
• The Hebbian learning rule is generally applied to logic gates.
HEBBIAN LEARNING RULE

• The weights are updated as:

  w(new) = w(old) + x·y

  where x is the input and y is the target output.
TRAINING ALGORITHM

• Step 1: Initially the weights are set to zero, and the bias is set to zero: wi = 0 for all inputs i = 1 to n, where n is the total number of inputs; b = 0.
• Step 2: For each training pair s : t, the activation for the input units is generally set as an identity function, xi = si.
• Step 3: The activation for the output unit is set to y = t.
• Step 4: Weights and bias are adjusted using the formulas:
  w(new) = w(old) + x·y
  b(new) = b(old) + y
• Step 5: Steps 2 to 4 are repeated for each input vector and its target output.
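A minimal sketch of this algorithm in Python for the bipolar AND gate (the training pairs follow the bipolar AND truth table; the printout traces the weight updates):

```python
# Bipolar AND: inputs and targets in {-1, +1}.
samples = [((1, 1), 1), ((1, -1), -1), ((-1, 1), -1), ((-1, -1), -1)]

w = [0.0, 0.0]  # Step 1: weights start at zero...
b = 0.0         # ...and so does the bias

for (x1, x2), y in samples:   # one pass over the training pairs
    w[0] += x1 * y            # w(new) = w(old) + x*y
    w[1] += x2 * y
    b += y                    # b(new) = b(old) + y
    print(f"x=({x1:+d},{x2:+d}) y={y:+d} -> w={w}, b={b}")
```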
AND GATE
IMPLEMENTATION
Hebbian Learning Rule

Source: https://www.youtube.com/watch?v=xqMCEFPk2cY
Prof. Preethi J, IIT Bombay
HEBBIAN LEARNING RULE: AND GATE IMPLEMENTATION

w(new) = w(old) + x·y
b(new) = b(old) + y

[The slides show the worked iterations and the decision boundary before learning and after Hebbian learning.]
PROBLEM SOLVING

• Implement a 2-input OR gate using Hebbian learning for bipolar inputs, and draw the Hebb network for the OR gate with updated weights.
HEBBIAN LEARNING RULE: OR GATE IMPLEMENTATION

w(new) = w(old) + x·y
b(new) = b(old) + y
BEGINNINGS OF ARTIFICIAL NEURON

• 1943 Artificial Neuron Model (McCulloch Pitt)

• 1949 Hebb’s Rule of Learning Weights (Hebb)

• 1958 Perceptron (Rosenblatt)

Supervised Learning

PERCEPTRON
LEARNING
SESSION - 8
PERCEPTRON MODEL

• Introduced in 1957 by Frank Rosenblatt.
• Core of deep learning concepts.
• A perceptron is a single-layer neural network.
• A perceptron can simply be seen as a set of inputs that are weighted and to which we apply an activation function.
• This produces a sort of weighted sum of inputs, resulting in an output.
• This is typically used for classification problems, but can also be used for regression problems.
PERCEPTRON MODEL

• We attach to each input a weight (wi); we add an input of value 1 with a weight of −θ. This is called the bias.
• The inputs can be seen as nodes and will be called the input layer. Altogether, these nodes and the function form a perceptron.
• The binary classification function of the perceptron network is represented as: y = 1 if Σ wi·xi − θ ≥ 0, else 0.
PERCEPTRON MODEL

• Schematic Representation
PERCEPTRON MODEL

• It is a type of feedforward network.
• Supervised learning approach; mostly used in classification tasks, but also works for regression approaches.
• Two types:
  • single-layer perceptron
  • multi-layer perceptron
SINGLE LAYER PERCEPTRON MODEL

• The output of the perceptron network (yout) is a function of the weighted sum of inputs:
• yin = x1w1 + x2w2 + x3w3 + … + xnwn
• yout = f(yin)
• y = f(yin) = 1 if yin > θ
             = 0 if −θ ≤ yin ≤ θ
             = −1 if yin < −θ
• f(yin) is the activation function (a step function)
PERCEPTRON MODEL: LEARNING

• Compare yout with the target output (t).
• Weight Updation:
  • If y ≠ t, then
    w(new) = w(old) + α·t·x
    where α is the learning rate, t the target, and x the input.
• Similarly, bias updation:
    b(new) = b(old) + α·t
TRAINING ALGORITHM

• Step 0: Initialize weights and bias; set α = 1.
• Step 1: Perform steps 2 to 5 until the stopping condition is reached.
• Step 2: Perform steps 3 and 4 for each training pair.
• Step 3: Calculate the output:

  yin = Σi=1..n wi·xi + b

  y = f(yin) = 1 if yin > θ
             = 0 if −θ ≤ yin ≤ θ
             = −1 if yin < −θ
TRAINING ALGORITHM

• Step 4: Weight and bias adjustment:
  • If y ≠ t, then
    w(new) = w(old) + α·t·x
    b(new) = b(old) + α·t
  • Else:
    w(new) = w(old)
    b(new) = b(old)
• Step 5: Train the network until the stopping condition is reached (no change in weights for any training pair).
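A minimal sketch of this training loop in Python for the bipolar AND gate (α = 1 and θ = 0 are assumed, as in the worked example that follows):

```python
def step(y_in, theta=0.0):
    # Bipolar step activation with threshold theta.
    if y_in > theta:
        return 1
    if y_in < -theta:
        return -1
    return 0

# Bipolar AND gate training pairs.
samples = [((1, 1), 1), ((1, -1), -1), ((-1, 1), -1), ((-1, -1), -1)]
w, b, alpha, theta = [0.0, 0.0], 0.0, 1.0, 0.0

changed = True
while changed:                      # stop when a full pass makes no changes
    changed = False
    for (x1, x2), t in samples:
        y = step(w[0] * x1 + w[1] * x2 + b, theta)
        if y != t:                  # update only on error
            w[0] += alpha * t * x1  # w(new) = w(old) + alpha*t*x
            w[1] += alpha * t * x2
            b += alpha * t          # b(new) = b(old) + alpha*t
            changed = True

print("weights:", w, "bias:", b)    # converges to w = [1, 1], b = -1
```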
CASE STUDY

• Implementation of a two-input AND gate for bipolar inputs using Rosenblatt's perceptron model.

Source: https://www.youtube.com/watch?v=CvbYumf_wSI
Credit: Mahesh Huddar
Perceptron Learning Rule: AND function (2 bipolar inputs and output) using Perceptron

[The slides step through the worked training iterations; whenever y ≠ t, the weights are updated with w(new) = w(old) + α·t·x and b(new) = b(old) + α·t.]

Source: https://www.youtube.com/watch?v=CvbYumf_wSI
KEY SLIDES OF PERCEPTRON TRAINING: WORKED-OUT EXAMPLE

[The slides show the AND-function decision boundary before learning and after perceptron learning.]

Source: https://www.youtube.com/watch?v=CvbYumf_wSI
HEBB’S RULE VS PERCEPTRON LEARNING
RULE
• 1943 Artificial Neuron Model (McCulloch Pitt)

• 1949 Hebb’s Rule of Learning Weights (Hebb)

• 1958 Perceptron (Rosenblatt)

THANKS
ANN TEAM

DELTA LEARNING
RULE
SESSION 9 & 10
HEBB’S RULE VS
PERCEPTRON LEARNING RULE
• 1943 Artificial Neuron Model (McCulloch Pitt)

• 1949 Hebb’s Rule of Learning Weights (Hebb)

• 1958 Perceptron (Rosenblatt)

POPULAR LEARNING RULES IN ANN
DELTA LEARNING RULE

• Developed by Bernard Widrow and Marcian Hoff.
• It depends on supervised learning and uses a continuous activation function.
• It is also known as the Least Mean Square (LMS) method.
• It minimizes error over all the training patterns.
• It is based on a gradient descent approach.
• It states that the modification in the weight of a node is equal to the product of the error and the input, where the error is the difference between the desired and actual output: Δw = α·(t − y)·x.
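A minimal sketch of the delta (LMS) rule in Python, fitting a linear unit to the bipolar AND data (the learning rate and epoch count are arbitrary choices):

```python
# Delta rule: delta_w = alpha * (t - y) * x, with a linear unit y = w.x + b.
samples = [((1.0, 1.0), 1.0), ((1.0, -1.0), -1.0),
           ((-1.0, 1.0), -1.0), ((-1.0, -1.0), -1.0)]
w, b, alpha = [0.0, 0.0], 0.0, 0.1

for epoch in range(50):                # gradient-descent sweeps over the data
    for (x1, x2), t in samples:
        y = w[0] * x1 + w[1] * x2 + b  # continuous (identity) activation
        err = t - y                    # desired minus actual output
        w[0] += alpha * err * x1       # weight change = rate * error * input
        w[1] += alpha * err * x2
        b += alpha * err

print("weights:", [round(v, 3) for v in w], "bias:", round(b, 3))
```

Unlike the perceptron rule, the update here is proportional to a continuous error, so the weights settle at the least-squares solution rather than stopping at the first separating boundary.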
LIMITATIONS OF PERCEPTRON LEARNING

DELTA LEARNING RULE

GRADIENT DESCENT ALGORITHM

[Derivation and illustration slides for the delta learning rule and the gradient descent algorithm.]

Image source: https://databasecamp.de/en/ml/gradient-descent
REFERENCES

• https://www.youtube.com/watch?v=ktGm0WCoQOg
• https://www.youtube.com/watch?v=MUoEv1Hv0KM

THANKS
ANN TEAM
Department of IR&D

COURSE NAME: ANN


COURSE CODE: 22AIP3204R

Session - 11
AIM OF THE SESSION

To model complex functions by propagating input data through multiple layers to produce an output.

INSTRUCTIONAL OBJECTIVES

This session is designed to: define the architecture and function of feed-forward neural networks, describe how weights and biases are adjusted during training, and demonstrate how inputs propagate through the layers to produce an output.

LEARNING OUTCOMES

At the end of this session, you should be able to: Design and implement a neural network model to
solve real-world problems.
FEED-FORWARD NEURAL NETWORK (FFNN)

• Feed-forward neural networks (FFNNs) are a type of artificial neural network that can be
used for analyzing pattern association, pattern classification, and pattern mapping. In
these tasks, the network is trained to recognize patterns in input data and map them to a
corresponding output.
• Pattern Classification is the process of assigning input data to one of several pre-defined
categories or classes. For example, a FFNN could be trained to classify images of animals into
categories such as "cat," "dog," or "bird." During training, the network is presented with a set of
input images and their corresponding categories, and it learns to map each image to its correct
category. Once trained, the network can be used to classify new images into these categories.

CONT…

The process of training a feed-forward neural network involves the following steps:
1. Initialization: The weights and biases of the network are randomly initialized to small values.
2. Forward Propagation: The input data is fed through the network, and each neuron in each layer
calculates a weighted sum of its inputs, applies an activation function to this sum, and passes the
result on to the next layer.
3. Error Calculation: The difference between the actual output and the desired output is calculated,
and this error is used to adjust the weights and biases in the network.
4. Backward Propagation: The error is propagated backwards through the network, and the weights
and biases of each neuron are adjusted to minimize the error.
5. Repeat: Steps 2-4 are repeated many times until the network reaches a state where the error is
minimized, and the network is able to accurately map inputs to outputs.

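A minimal numpy sketch of steps 1-3 for a hypothetical 3-4-2 network (the layer sizes, sample input, and desired output are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: Initialization - small random weights and zero biases (3-4-2 network).
W1, b1 = rng.normal(0, 0.1, (4, 3)), np.zeros(4)
W2, b2 = rng.normal(0, 0.1, (2, 4)), np.zeros(2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x):
    # Step 2: Forward propagation - each layer computes a weighted sum of its
    # inputs, applies the activation, and passes the result to the next layer.
    h = sigmoid(W1 @ x + b1)
    return sigmoid(W2 @ h + b2)

x = np.array([0.2, 0.7, 0.1])        # example input
d = np.array([1.0, 0.0])             # desired output
y = forward(x)
error = 0.5 * np.sum((d - y) ** 2)   # Step 3: error calculation
print("output:", y, "error:", error)
```

Steps 4 and 5 (backward propagation and repetition) are covered in detail in the back-propagation session below.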
Feed-Forward Neural Network

[Architecture and worked-example diagrams of a feed-forward neural network are shown on the slides.]
REFERENCES FOR FURTHER LEARNING OF THE
SESSION

Reference Books:
1. "Deep Learning" by Ian Goodfellow, Yoshua Bengio, and Aaron Courville
2. "Introduction to Artificial Neural Systems" by Jacek M. Zurada
3. "Artificial Intelligence: A Modern Approach" by Stuart Russell and Peter Norvig.
Sites and Web links:
1. http://www.deeplearningbook.org/
2. https://www.wiley.com/en-us/Introduction+to+Artificial+Neural+Systems%2C+Second+Edition-p-9780471551616
3. http://aima.cs.berkeley.edu/
THANK YOU

Team – Artificial Neural Network


ARTIFICIAL NEURAL NETWORKS
22AIP3204
CO-3
SESSION-14
OVERVIEW
Perceptron
Multi Layer Perceptron
Back-propagation algorithm

Perceptron
Image source: https://towardsdatascience.com/what-the-hell-is-perceptron-626217814f53
AND gate perceptron

[Diagram: a perceptron implementing the AND gate, with inputs x1 and x2 (each with weight 1) and a bias input b with weight −1, feeding an AND unit that produces the output.]
Multi layer Perceptron
Image source: https://www.researchgate.net/figure/A-hypothetical-example-of-Multilayer-Perceptron-Network_fig4_303875065
MLP
• The output y is calculated by:

  yj(n) = φj(vj(n)) = φj( Σi=0..m wji(n) yi(n) )

  where wj0(n) is the bias.

• The function φj(·) is an activation/transfer function.
Derivatives of some activation functions

Multi-layer Perceptron
• The target for a hidden unit is not known.
• There are multiple output units.

Image source: http://www.cs.us.es/~fsancho/?e=135
Back Propagation: Solved example
https://www.youtube.com/watch?v=tUeGI--71q8&list=PL2sEPpvG8-TPUjB10H1AQpcTmumWlpODe

Derivation of Back Propagation Algorithm by Mahesh Huddar
https://www.youtube.com/watch?v=XN5IRqtFhOY
Backpropagation algorithm
• Assume that a set of examples T = {x(n), d(n)}, n = 1, …, N is given, where x(n) is the input vector of dimension m0 and d(n) is the desired response vector of dimension M.
• Thus the error signal for output neuron j will be:

  ej(n) = dj(n) − yj(n)

• We can derive a learning algorithm for an MLP by assuming an optimisation approach based on the steepest descent direction, i.e. Δw(n) = −η·g(n), where g(n) is the gradient vector of the cost function and η is the learning rate.
Backpropagation algorithm
• The algorithm derived from the steepest descent direction is called back-propagation.
• Define an instantaneous cost function as follows:

  E(n) = (1/2) Σj∈C ej²(n)

  where C is the set of all output neurons.
• If we assume that there are N examples in the set T, then the average squared error is:

  Eav = (1/N) Σn=1..N E(n)

• We need to calculate the gradient with respect to Eav or with respect to E(n). In the first case we calculate the gradient per epoch (i.e. over all N patterns), while in the second the gradient is calculated per pattern.
• In the case of Eav we have the Batch mode of the algorithm. In the case of E(n) we have the Online or Stochastic mode of the algorithm.
• Assume that we use the online mode for the rest of the calculation. The gradient is defined as:

  g(n) = ∂E(n)/∂wji(n)
• Using the chain rule of calculus we can write:

  ∂E(n)/∂wji(n) = [∂E(n)/∂ej(n)] · [∂ej(n)/∂yj(n)] · [∂yj(n)/∂vj(n)] · [∂vj(n)/∂wji(n)]

• We calculate the different partial derivatives as follows:

  ∂E(n)/∂ej(n) = ej(n)

  ∂ej(n)/∂yj(n) = −1

• And,

  ∂yj(n)/∂vj(n) = φj'(vj(n))

  ∂vj(n)/∂wji(n) = yi(n)

• Combining all the previous equations we finally get:

  Δwji(n) = −η ∂E(n)/∂wji(n) = η ej(n) φj'(vj(n)) yi(n)
• The equation regarding the weight corrections can be written as:

  Δwji(n) = η δj(n) yi(n)

  where δj(n) is defined as the local gradient and is given by:

  δj(n) = −∂E(n)/∂vj(n) = −[∂E(n)/∂ej(n)] · [∂ej(n)/∂yj(n)] · [∂yj(n)/∂vj(n)] = ej(n) φj'(vj(n))

• We need to distinguish two cases:
  • j is an output neuron
  • j is a hidden neuron
• Thus the back-propagation algorithm is an error-correction algorithm for supervised learning.

• If j is an output neuron, we already have a definition of ej(n), so δj(n) is defined (after substitution) as:

  δj(n) = (dj(n) − yj(n)) φj'(vj(n))

• If j is a hidden neuron, then δj(n) is defined as:

  δj(n) = −[∂E(n)/∂yj(n)] · [∂yj(n)/∂vj(n)] = −[∂E(n)/∂yj(n)] φj'(vj(n))
• To calculate the partial derivative of E(n) with respect to yj(n), we recall the definition of E(n) and change the index for the output neurons to k, i.e.

  E(n) = (1/2) Σk∈C ek²(n)

• Then we have:

  ∂E(n)/∂yj(n) = Σk∈C ek(n) ∂ek(n)/∂yj(n)

• We use the chain rule of differentiation again to get the partial derivative of ek(n) with respect to yj(n):

  ∂E(n)/∂yj(n) = Σk∈C ek(n) [∂ek(n)/∂vk(n)] [∂vk(n)/∂yj(n)]

• Remembering the definition of ek(n) we have:

  ek(n) = dk(n) − yk(n) = dk(n) − φk(vk(n))

• Hence:

  ∂ek(n)/∂vk(n) = −φk'(vk(n))

• The local field vk(n) is defined as:

  vk(n) = Σj=0..m wkj(n) yj(n)

  where m is the number of neurons (from the previous layer) which connect to neuron k. Thus we get:

  ∂vk(n)/∂yj(n) = wkj(n)

• Hence:

  ∂E(n)/∂yj(n) = −Σk∈C ek(n) φk'(vk(n)) wkj(n) = −Σk∈C δk(n) wkj(n)
• Putting it all together, we find for the local gradient of a hidden neuron j the following formula:

  δj(n) = φj'(vj(n)) Σk∈C δk(n) wkj(n)

• It is useful to remember the special form of the derivatives for the logistic and hyperbolic tangent sigmoids:
  • φj'(vj(n)) = yj(n)[1 − yj(n)]  (Logistic)
  • φj'(vj(n)) = [1 − yj(n)][1 + yj(n)]  (Hyp. Tangent)
Summary of BP Algorithm

1. Initialisation: Assuming that no prior information is available, pick the synaptic weights and thresholds from a uniform distribution whose mean is zero and whose variance is chosen to make the standard deviation of the local fields of the neurons lie at the transition between the linear and saturated parts of the sigmoid function.
2. Presentation of training examples: Present the network with an epoch of training examples. For each example in the set, perform the sequence of forward and backward computations described in points 3 & 4 below.
3. Forward Computation:
  • Let the training example in the epoch be denoted by (x(n), d(n)), where x is the input vector and d is the desired vector.
  • Compute the local fields by proceeding forward through the network layer by layer. The local field for neuron j at layer l is defined as:

    vj(l)(n) = Σi=0..m wji(l)(n) yi(l−1)(n)

    where m is the number of neurons which connect to j, and yi(l−1)(n) is the activation of neuron i at layer (l−1). wji(l)(n) is the weight which connects the neurons j and i.
  • For i = 0, we have y0(l−1)(n) = +1, and wj0(l)(n) = bj(l)(n) is the bias of neuron j.
  • Assuming a sigmoid function, the output signal of neuron j is:

    yj(l)(n) = φj(vj(l)(n))

  • If j is in the input layer we simply set:

    yj(0)(n) = xj(n)

    where xj(n) is the jth component of the input vector x.
  • If j is in the output layer we have:

    yj(L)(n) = oj(n)

    where oj(n) is the jth component of the output vector o, and L is the total number of layers in the network.
  • Compute the error signal:

    ej(n) = dj(n) − oj(n)

    where dj(n) is the desired response for the jth element.
4. Backward Computation:
  • Compute the δs of the network, defined by:

    δj(L)(n) = ej(L)(n) φj'(vj(L)(n))   for neuron j in output layer L

    δj(l)(n) = φj'(vj(l)(n)) Σk δk(l+1)(n) wkj(l+1)(n)   for neuron j in hidden layer l

    where φj'(·) is the derivative of the function φj with respect to its argument.
  • Adjust the weights using the generalised delta rule:

    Δwji(l)(n) = α Δwji(l)(n−1) + η δj(l)(n) yi(l−1)(n)

    where α is the momentum constant and η the learning rate.
5. Iteration: Iterate the forward and backward computations of steps 3 & 4 by presenting new epochs of training examples until the stopping criterion is met.

  • The order of presentation of examples should be randomised from epoch to epoch.
  • The momentum and learning rate parameters typically change (usually decrease) as the number of training iterations increases.
Stopping Criteria

• The BP algorithm is considered to have converged when the Euclidean norm of the gradient vector reaches a sufficiently small gradient threshold.
• The BP algorithm is considered to have converged when the absolute value of the change in the average squared error per epoch is sufficiently small.
XOR Example

• The XOR problem is defined by the following truth table:

  x1  x2 | XOR
   0   0 |  0
   0   1 |  1
   1   0 |  1
   1   1 |  0

• The following network solves the problem; the perceptron could not do this. (We use the sgn function.)
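As a hedged sketch of the whole algorithm in action, the numpy script below trains a 2-2-1 MLP with logistic units on XOR using the online deltas derived above (the seed, learning rate, and epoch count are arbitrary; a different seed or more hidden units may be needed if training lands in a local minimum):

```python
import numpy as np

rng = np.random.default_rng(42)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
D = np.array([[0], [1], [1], [0]], dtype=float)      # XOR targets

W1, b1 = rng.normal(0, 1, (2, 2)), np.zeros((1, 2))  # hidden layer (2 units)
W2, b2 = rng.normal(0, 1, (2, 1)), np.zeros((1, 1))  # output layer
eta = 0.5                                            # learning rate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(10000):
    # Forward computation
    y1 = sigmoid(X @ W1 + b1)
    y2 = sigmoid(y1 @ W2 + b2)
    # Backward computation: phi'(v) = y(1 - y) for the logistic function
    delta2 = (D - y2) * y2 * (1 - y2)                # output-layer local gradients
    delta1 = (delta2 @ W2.T) * y1 * (1 - y1)         # hidden-layer local gradients
    # Weight adjustments (delta rule without the momentum term, for brevity)
    W2 += eta * y1.T @ delta2
    b2 += eta * delta2.sum(axis=0)
    W1 += eta * X.T @ delta1
    b1 += eta * delta1.sum(axis=0)

print(np.round(y2, 2))  # outputs should approach [[0], [1], [1], [0]]
```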
ARTIFICIAL NEURAL NETWORKS

• Session no: 17
• Topic: Radial Basis Function Networks
INTRODUCTION

[Slides contrast a linear perceptron with an RBFN.]

Source: https://www.cs.cmu.edu/afs/cs/academic/class/15883-f19/slides/rbf.pdf
INTRODUCTION: RBFN

• Radial Basis Function Networks (RBFNs) are artificial neural networks that use radial basis functions (for example, the Gaussian function) as activation functions in the hidden layer.
• They are well-suited for function approximation, classification, and clustering tasks.
• RBFNs involve center selection and weight calculation for training.
BASIC ARCHITECTURE OF RBFN: 3 LAYERS

• Input layer
  • Source nodes that connect the network to its environment.

• Hidden layer
  • Hidden units provide a set of basis functions.
  • High dimensionality.

• Output layer
  • Linear combination of the hidden functions.
Source: https://www.cs.cmu.edu/afs/cs/academic/class/15883-f19/slides/rbf.pdf
RBF NETWORK

• The hidden layer provides a non-linear transformation of the input space


to the hidden space, which is assumed usually of high enough dimension.
• The output layer combines in a linear way the activations of the hidden
layer.

Note: Support Vector Machine also does non-linear transformation of the


input space

Source: https://www.cs.cmu.edu/afs/cs/academic/class/15883-f19/slides/rbf.pdf
BASIC ARCHITECTURE OF RBFN:
MULTIPLE OUTPUTS FOR CLASSIFICATION

RADIAL BASIS FUNCTION

LEARNING PROCESS

• Training an RBFN involves two main steps:

• center selection and

• weight calculation.

• Center Selection: The centers of the RBF neurons can be determined using
clustering techniques like k-means or through other strategies based on the
dataset.

• Weight Calculation: The weights connecting the hidden RBF layer to the
output layer are typically calculated using linear regression or other techniques
such as the Moore-Penrose pseudo-inverse.
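A minimal numpy sketch of these two steps on the XOR data (the four inputs themselves serve as centers, a simple center-selection choice, and the spread σ is an arbitrary value):

```python
import numpy as np

def gaussian_layer(X, centers, sigma):
    # Hidden layer: Gaussian activation around each center.
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return np.exp(-d ** 2 / (2 * sigma ** 2))

# XOR data; centers chosen as the inputs themselves.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)
sigma = 0.5

G = gaussian_layer(X, X, sigma)
W = np.linalg.pinv(G) @ T   # weight calculation via the pseudo-inverse
print(np.round(G @ W, 2))   # reproduces the XOR targets
```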

LEARNING RULE FOR RADIAL BASIS NETWORK

G = Gaussian function

Source: https://www.csd.uoc.gr/~hy476/lectures/WK4%20-%20Radial%20Basis%20Function%20Networks.ppt
LEARNING RULE FOR RADIAL BASIS NETWORK

We need to find the free parameters wi, ti and Σi−1 so as to minimise E. Ci is a norm weighting matrix, i.e.:

  ||x||C² = (Cx)T(Cx) = xT CTC x

We use a weighted norm matrix when the individual elements of x belong to different classes. To calculate the update equations, we use gradient descent on the instantaneous error function E. We get the following update rules for the free parameters:

Source: https://www.csd.uoc.gr/~hy476/lectures/WK4%20-%20Radial%20Basis%20Function%20Networks.ppt
LEARNING RULE FOR RADIAL BASIS NETWORK

[The update-rule equations are shown on the slides.]

Source: https://www.csd.uoc.gr/~hy476/lectures/WK4%20-%20Radial%20Basis%20Function%20Networks.ppt
Illustrative Example - XOR Problem

MLP VS RBFN

MLP                              | RBFN
---------------------------------|----------------------------------
Global hyperplane                | Local receptive field
EBP                              | LMS
Local minima                     | Serious local minima
Smaller number of hidden neurons | Larger number of hidden neurons
Shorter computation time         | Longer computation time
Longer learning time             | Shorter learning time
RBFs for Classification

APPLICATIONS OF RBFN

i. Function Approximation:

• RBFNs are often used for function approximation tasks. Given input data and corresponding target
values, the network learns to approximate the underlying function.

• The RBF neurons capture the input-output relationship based on the chosen radial basis functions.

ii. Classification:

• RBFNs can be used for classification tasks by assigning classes based on the output of the network.

• Typically, a softmax function or other appropriate activation function is used in the output layer for
classification.

Other applications include time series prediction and system control.


Source: https://en.wikipedia.org/wiki/Radial_basis_function_network
ADVANTAGES AND LIMITATIONS OF RBFN
i. Advantages:

• RBFNs are suitable for approximating complex, nonlinear functions.

• They have the ability to generalize well to unseen data if properly trained.

• RBFNs can be interpretable, as the center locations of RBF neurons often correspond to meaningful features.

ii. Limitations:

• The number of RBF neurons and their spread parameters need to be carefully chosen, which can require
domain knowledge or hyperparameter tuning.

• RBFNs may suffer from overfitting if not regularized properly.

• Training RBFNs can be computationally expensive due to the need for center selection and weight calculation.

SELF ASSESSMENT QUESTIONS

1.What is a key characteristic of Radial Basis Function (RBF) networks compared to traditional feedforward neural
networks?
•A) They have fewer layers.
•B) They use linear activation functions.
•C) They use radial basis functions in the hidden layer.
•D) They require more training data.

2.Which of the following functions is commonly used as a radial basis function in RBF networks?
•A) Sigmoid
•B) ReLU
•C) Gaussian
•D) Softmax

3. Which training algorithm is commonly used for adjusting the parameters of an RBF network?
•A) Gradient Descent
•B) Backpropagation
•C) Evolutionary Algorithms
•D) K-means Clustering

TERMINAL QUESTIONS

1. What role do Gaussian radial basis functions play in an RBF network?
2. Describe the typical architecture of an RBF network.
3. How are the parameters of an RBF network trained?
4. What are the advantages and limitations of using RBF networks?
5. In what types of applications are RBF networks particularly useful?
RESOURCES

Online Resources:
1.Tutorial on RBF Networks by Giuseppe Boccignone: A detailed tutorial covering the theory and practical aspects
of RBF networks, available on arXiv.
2.RBF Networks on Scholarpedia: A comprehensive entry providing an overview of RBF networks, their
mathematical foundations, and applications.
3.Wikipedia: Radial basis function - Wikipedia's entry on RBFs covers their mathematical formulation, applications,
and variations.
4.Stanford CS229 Lecture Notes: Radial Basis Function Networks - Lecture notes from Stanford's CS229 course
covering RBF networks and their applications.

GitHub Repositories and Implementations:

1. Scikit-learn Documentation: Provides examples and documentation on implementing RBF networks using Python's Scikit-learn library.
2. Keras RBF Layer: Implementations and examples of RBF networks using Keras, a popular deep learning library.
ARTIFICIAL NEURAL NETWORKS

• Session no: 18
• Topic: Unsupervised Learning – Hamming network
Types of learning

• Supervised.
• Unsupervised.

Source: https://cs.wmich.edu/~elise/courses/cs6800/DC.pptx

2
Types of learning

Supervised learning:                 | Unsupervised learning:

• Under control.                     | • No control.
• Adapt the weights.                 | • Learn by itself.
• Reduce the error.                  | • Pick out the structure from the input.
• Labeled training data.             | • Example: Hamming Network
Source: https://cs.wmich.edu/~elise/courses/cs6800/DC.pptx

3
Hamming Distance
Consider two bipolar vectors (i.e., each co-ordinate can take either of two values, +1 or −1):

Source: https://www.youtube.com/watch?v=qR8-Ix_1Z3E

4
Hamming Distance
Consider two bipolar vectors (i.e., each co-ordinate can take either of two values, +1 or −1). For two such n-dimensional vectors X and Y:

X · Y = No. of similarities − No. of dissimilarities = n − 2 · HD(X, Y)

where HD(X, Y), the Hamming distance, is the number of co-ordinates in which X and Y differ.

Source: https://www.youtube.com/watch?v=qR8-Ix_1Z3E

5
Hamming Network
• Hamming network is used to classify an input vector to one of
the pre-stored vectors/patterns (error detection and correction).
It can also be used for clustering patterns into a pre-defined
number of clusters.

• Hamming network has an input layer and an output layer.

• All the nodes in the input layer are connected to all the nodes in
the output layer.

Source: https://www.youtube.com/watch?v=qR8-Ix_1Z3E

6
Hamming Network
• Hamming network is used to classify an input vector to one of
the pre-stored vectors/patterns.
• Hamming network has an input layer and an output layer.
• All the nodes in the input layer are connected to all the nodes in
the output layer.

• The number of nodes in the input layer is the dimensionality (n)


of the vectors. The number of nodes in the output layer is the
number (p) of pre-stored vectors.

• In the network diagram, shown on right, there are


o n=4 input nodes, and
o p=3 output nodes.
o 12 connections with associated 12 weights

• Each node in the output layer stores one vector.


So, the network on the right stores 3 vectors in the output layer.

Source: https://www.youtube.com/watch?v=qR8-Ix_1Z3E

7
Hamming Network
• In the network diagram, shown on right, there are
o n=4 input nodes, and
o p=3 output nodes.
o 12 connections with associated 12 weights
• Each node in the output layer stores one vector.
So, the network on the right stores 3 vectors in the output layer.

• Let {Xi : i = 1, 2, 3, …, p} denote the set of bipolar ‘stored’ vectors (or patterns) of dimensionality ‘n’.

In our case, p=3 and n=4.


So, three 4-dimensional vectors are stored in the 3 nodes of the
output layer.

Also, the co-ordinates can take either +1 or -1 value.

Source: https://www.youtube.com/watch?v=qR8-Ix_1Z3E

8
Hamming Network
In our case, p=3 and n=4. Let us denote the 3 stored vectors as

X1 = { x11 , x12, x13 , x14 }


X2 = { x21 , x22, x23 , x24 }
X3 = { x31 , x32, x33 , x34 }

In case of Hamming network, we want to set the weights


such that the network calculates the (negative) Hamming
distance between an input vector and the patterns stored in
the output nodes.

This will permit us to associate an input vector to that


pattern in the output layer which is most similar to the input
vector (=== classification / restoration of noisy patterns)
Source: https://www.youtube.com/watch?v=qR8-Ix_1Z3E

9
Hamming Network
Let X = (x1, x2, x3, x4) denote a 4-dimensional input vector.
Earlier, we had

10
Hamming Network
Let X = (x1, x2, x3, x4) denote a 4-dimensional input vector. Then, we have

This is in the form of O = W X + Ө

11
Hamming Network
Let X = (x1, x2, x3, x4) denote a 4-dimensional input vector.

Let wij denote the weight of the connection


to “i”th node in the output layer
from “j”th node in the input layer.

Then, assign wij = xij / 2, and bias = n/2

Now, this Hamming network computes the (negative) Hamming distance between the input (X) and the stored patterns.
This will permit us to associate an input vector with the pattern in the output layer which is most similar to
the input vector (classification), or to restore noisy patterns.
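To see why this weight choice works, consider the i-th output node (a short derivation; HD denotes Hamming distance):

Oᵢ = Σⱼ (xᵢⱼ / 2) xⱼ + n/2
   = (No. of similarities − No. of dissimilarities)/2 + n/2
   = (n − 2 · HD(X, Xᵢ))/2 + n/2
   = n − HD(X, Xᵢ)

So each output falls by exactly 1 for every co-ordinate in which the input differs from the stored pattern, and the largest output marks the most similar stored pattern.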

12
Hamming algorithm
• Step (1) : Specify the stored examples e(1), …, e(p).
• Step (2) : Fix the weight matrix as W = ½ [e(1); e(2); … ; e(p)], i.e., each row is half of one stored example.

• Step (3) : Find Ѳ as Ѳ = n / 2 (for every output node).

• Step (4) : Specify the input vector V.

• Step (5) : Find the output as O = W V + Ѳ.

Source: https://cs.wmich.edu/~elise/courses/cs6800/DC.pptx

13
Hamming net: solved problem
The Hamming net has three examples :
e(1) = [1 1 -1 -1], e(2) = [-1 -1 -1 1], e(3) = [1 -1 -1 1]

We are given the following vector, and we have to classify it to the closest example :
V = [1 -1 1 1]

The weight matrix W = ½ [e(1); e(2); e(3)] = [0.5 0.5 -0.5 -0.5 ; -0.5 -0.5 -0.5 0.5 ; 0.5 -0.5 -0.5 0.5], Ѳ = [2 2 2]

Source: https://cs.wmich.edu/~elise/courses/cs6800/DC.pptx

14
Hamming net: solved problem
We consider here the Hamming net has three examples :
e(1) = [1 1 -1 -1], e(2) = [-1 -1 -1 1], e(3) = [1 -1 -1 1]
We are given the following vector and we have to classify it to the closest example :
V = [1 -1 1 1]

The weight matrix W = ½ [e(1); e(2); e(3)] = [0.5 0.5 -0.5 -0.5 ; -0.5 -0.5 -0.5 0.5 ; 0.5 -0.5 -0.5 0.5], Ѳ = [2 2 2]

The output vector of this Hamming net is:

Y = W V + Ѳ
  = [-1 0 1] + [2 2 2]
  = [1 2 3]

We see that e(3) has the highest output (smallest Hamming distance from V, i.e., highest similarity).
So, V is classified to e(3)
Source: https://cs.wmich.edu/~elise/courses/cs6800/DC.pptx
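A quick NumPy check of this solved problem (a sketch; the array names are ours):

import numpy as np

E = np.array([[ 1,  1, -1, -1],     # e(1)
              [-1, -1, -1,  1],     # e(2)
              [ 1, -1, -1,  1]])    # e(3)
V = np.array([1, -1, 1, 1])

n = E.shape[1]
W = E / 2.0                          # w_ij = x_ij / 2
theta = np.full(len(E), n / 2.0)     # bias = n/2 per output node

Y = W @ V + theta                    # equals n - Hamming distance to each example
print(Y)                             # [1. 2. 3.]
print("classified to e(%d)" % (np.argmax(Y) + 1))   # e(3)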

15
Hamming Net: Digit Recognition Task
Suppose the task is to recognize 1 of 8 signs recorded in the network “memory”. The picture
below shows 8 used signs:

The matrices that represent these signs are two-dimensional (3×5 pixels), but we write them as single-row vectors of 15 dimensions. Black squares are ones, white squares are zeros.

Now, we can follow the Hamming net algorithm, shown in the previous slide, to classify any digit
to one of these 8 classes.

Source: https://home.agh.edu.pl/~vlsi/AI/hamming_en/

16
SELF ASSESSMENT QUESTIONS

1. What type of learning does a Hamming network typically employ?


A) Supervised learning
B) Reinforcement learning
C) Unsupervised learning
D) Semi-supervised learning
Answer: C) Unsupervised learning

2. What is the primary objective of a Hamming network in unsupervised learning?


A) Forgetting
B) Regression
C) Clustering
D) Feature extraction
Answer: C) Clustering

17
TERMINAL QUESTIONS

1. What is the primary objective of the Hamming network?


2. Describe the architecture of a Hamming network.
3. What is the relationship between the number of nodes and the number of patterns and the
dimensionality of patterns?
4. What are some applications of Hamming networks in unsupervised learning?

18
References

• Ahmed Hashmi, Chemoy Das. (2012). Neural Networks and its Application. www.slideshare.net

• K Ming Leung. (2007). Fixed Weight Competitive Nets : Hamming Net. Polytechnic University.

• Nilmani Singh. (2010). Neural Network. www.slideshare.net

• Youtube video: https://www.youtube.com/watch?v=czlaus1nGaU


19
ARTIFICIAL NEURAL NETWORKS

• Session no: 19
• Topic: Maxnet Architecture
RECAP: Types of learning

• Supervised.
• Unsupervised.

Source: https://cs.wmich.edu/~elise/courses/cs6800/DC.pptx

2
RECAP: Hamming Network
Let X = (x1, x2, x3, x4) denote a 4-dimensional input vector. Then, we have

This is in the form of O = W X + Ө

3
RECAP: Hamming Network
Let X = (x1, x2, x3, x4) denote a 4-dimensional input vector.

Let wij denote the weight of the connection


to “i”th node in the output layer
from “j”th node in the input layer.

Then, assign wij = xij / 2, and bias = n/2

Now, this Hamming network computes (-ve) Hamming distance between


input (X) and stored patterns.

This permitted us to associate an input vector with the pattern in the output layer which is most similar to the input vector (classification), or to restore noisy patterns.

4
MAXNET
• MAXNET vs Hamming net:
  MAXNET – one-layer, recurrent; Hamming net – two-layer, feed-forward.
• MAXNET is a competitive network.
• It conducts a competition to determine which node
has the highest initial value.
• It depends on the Winner-Take-All (WTA) policy
(only one nonzero output).

Source: https://www.youtube.com/watch?v=T-b0HG9dEpM&t=4s

5
MAXNET
It is a subnet with ‘n’ nodes, which are all completely interconnected.
In other words, from one node, there will be connections to all ‘n’ nodes, including itself.

So, all n² values of the weight matrix could be non-zero.

In the network, shown on right, n=5.


So, there are 5*5 = 25 weight values.

Source: https://www.youtube.com/watch?v=T-b0HG9dEpM&t=4s

6
MAXNET
• Every node in MAXNET receives inhibitory
inputs from all other nodes via ‘lateral’ [intra-
layer] connections.
These connections will have negative weights (-ε).

• There are node self-excitation weights from a


node to itself. These weights (θ) are positive.

• Self excitation weights, θ ~= 1


• Mutual inhibition weights, ε ≤ 1/n
In the network (on right), there will be 20 negative weights and 5 positive weights.
Source: https://www.youtube.com/watch?v=T-b0HG9dEpM&t=4s

7
MAXNET
• Self excitation weights, θ ~= 1
• Mutual inhibition weights, ε ≤ 1/n

• Such weights are chosen so that a single node


whose value is initially maximum, can eventually
prevail as the only active or “winner” node while
the activation of all other nodes subsides to zero.
• Activation function used is Rectified Linear Unit
function,
f(net) = max(0, net)

Source: https://www.youtube.com/watch?v=T-b0HG9dEpM&t=4s
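A minimal sketch of this winner-take-all iteration (assuming θ = 1 for self-excitation and ε = 1/n for mutual inhibition, as above; illustrative code, not taken from the cited video):

import numpy as np

def maxnet(a, eps=None, max_iters=100):
    # a: initial node values, e.g. the outputs of a Hamming net
    a = np.asarray(a, dtype=float)
    eps = eps if eps is not None else 1.0 / len(a)
    for _ in range(max_iters):
        # each node keeps its own value (theta ~ 1) and is inhibited by all others
        a_new = np.maximum(0.0, a - eps * (a.sum() - a))   # f(net) = max(0, net)
        if np.count_nonzero(a_new) <= 1:                   # a single winner remains
            return a_new
        a = a_new
    return a

print(maxnet([1.0, 2.0, 3.0]))   # only node 3 (the initial maximum) stays nonzero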

8
Hamming-Maxnet network

Input → Hamming Neural Network → Maxnet → Output

Maxnet is used as an upper subnet of a Hamming-Maxnet network.


Source: https://www.youtube.com/watch?v=T-b0HG9dEpM&t=4s

9
Hamming-Maxnet network

Input → Hamming Neural Network → Maxnet → Output

• Maxnet is used as an upper subnet of a Hamming-Maxnet network.

• The Hamming net determines, via Hamming distance, which stored pattern is closest to the input vector.
• The output of the Hamming subnet is fed as input to the Maxnet.
• A few iterations (of value updating) happen in the recurrent Maxnet.
• Finally, only the node that started with the maximum value remains active, while the values of all other
nodes decay to (nearly) zero.
• The ReLU activation function suppresses the losing nodes, so the winner's node label is read off as the output.
Source: https://www.youtube.com/watch?v=T-b0HG9dEpM&t=4s

10
Hamming-Maxnet network: solved problem

Source: https://cs.wmich.edu/~elise/courses/cs6800/DC.pptx

11
Hamming-Maxnet network: TB test

Source: https://cs.wmich.edu/~elise/courses/cs6800/DC.pptx

12
Hamming-Maxnet network:
another solved problem

Source: https://www.youtube.com/watch?v=n-J_mpgPf8k

19
Hamming-Maxnet network: conclusions

• High speed in classification of random inputs.

• Only the node with the shortest Hamming distance from a stored example will have a nonzero output at the end (the winner).
• The Hamming-MAXNET network always gives as output the stored example closest to the input.

Source: https://cs.wmich.edu/~elise/courses/cs6800/DC.pptx

20
SELF ASSESSMENT QUESTIONS

1. Which of the following best describes the learning rule used in Maxnet architecture?
A) Hebbian learning
B) Competitive learning
C) Backpropagation
D) None of the above
Answer: B) Competitive learning

2. What role does the winner-takes-all mechanism play in Maxnet architecture?


A) All neurons update their weights simultaneously
B) The neuron with the highest activation wins and inhibits others
C) Neurons share activation equally
D) None of the above
Answer: B) The neuron with the highest activation wins and inhibits others

21
TERMINAL QUESTIONS

1. What is Maxnet architecture primarily used for?


2. What role does the winner-takes-all mechanism play in Maxnet architecture?

3. What does the term "competitive learning" refer to in Maxnet architecture?

22
References

• Ahmed Hashmi, Chemoy Das. (2012). Neural Networks and its Application. www.slideshare.net

• K Ming Leung. (2007). Fixed Weight Competitive Nets : Hamming Net. Polytechnic University.

• Karam Hatim, Mohammed Hamdi. (2009). Detection of Tuberculosis by using Artificial Neural Networks.
University of Mosul

• Nilmani Singh. (2010). Neural Network. www.slideshare.net

• Youtube video: https://www.youtube.com/watch?v=T-b0HG9dEpM&t=4s

• Youtube video: https://www.youtube.com/watch?v=n-J_mpgPf8k&t=3s


23
ARTIFICIAL NEURAL NETWORKS
22AIP3204
CO-3
SESSION-15
https://www.nobelprize.org/prizes/physics/2024/popular-information/
Associative memory
Imagine that you are trying to remember a fairly unusual word that you rarely use, such
as one for that sloping floor often found in cinemas and lecture halls. You search your
memory. It’s something like ramp… perhaps rad…ial? No, not that. Rake, that’s it!

This process of searching through similar words to find the right one is reminiscent of
the associative memory that the physicist John Hopfield discovered in 1982.
The Hopfield network can store patterns and has a method for recreating them. When
the network is given an incomplete or slightly distorted pattern, the method can find
the stored pattern that is most similar.
https://www.nobelprize.org/prizes/physics/2024/popular-information/
The Hopfield network can be used to recreate data that contains noise or which has been partially erased

https://www.nobelprize.org/prizes/physics/2024/popular-information/
Hopfield Network
OVERVIEW

• Introduction

• Analysis of Linear Auto-Associative Networks

• Analysis Of Pattern Storage Networks

• The Hopfield Model

• Summary
Feedback Neural Networks - Introduction
➢Presents a detailed analysis of the pattern recognition tasks which can
be performed by feedback neural networks (FNN)

➢Most general form → For a set of neuron units, output of each unit is
fed as input to all other units and itself

➢General feedback network → absence of structure

➢By appropriate choice of parameters several pattern recognition tasks


become possible

➢Autoassociation task → linear processing units (neurons)


Feedback Neural Networks – Introduction-Contd …

➢Linear autoassociative network → produces input as output

➢If input is noisy in a linear autoassociative network → produces


noisy output in recall – even with optimal weight settings

➢No practical use of a linear autoassociative network

➢By altering the processing units to become non-linear, the


network can be used for pattern-storage
Analysis of Linear Auto-
Associative Networks

Figure shows a linear autoassociative feedforward network.

Input vector: a_l = [a_l1, a_l2, …, a_lM]^T
Output vector: b_l = [b_l1, b_l2, …, b_lN]^T
Activation of input layer: x = [x1, x2, …, xM]^T
Activation of output layer: y = [y1, y2, …, yN]^T
Input matrix: A = [a_1, a_2, …, a_L] (M×L matrix)
Output matrix: B = [b_1, b_2, …, b_L] (N×L matrix)
Weight matrix: W = [w_1, w_2, …, w_N]^T (N×M matrix), where each w_j is a vector
Weight vector for the j-th unit of the output layer: w_j = [w_j1, w_j2, …, w_jM]^T
Analysis of Linear Auto-Associative Networks
➢ Objective → Associate a given pattern with itself during
training

➢ Recall the associated pattern when a noisy input pattern


is presented to it

➢ The associated output pattern 𝑏𝑙 is the same as the input


pattern 𝑎𝑙 𝑓𝑜𝑟 𝑡ℎ𝑒 𝑙𝑡ℎ 𝑝𝑎𝑡𝑡𝑒𝑟𝑛

➢ So in autoassociation → 𝑏𝑙 = 𝑎𝑙 , 𝑙 = 1,2,3, … 𝐿

➢ Recall → The desired output is b_l for an approximate input of a_l + ε, where ε is the noise

➢ Zero error in recall due to linear dependency in input


patterns (input comes out as output)
Analysis of Linear Auto-Associative Networks

➢ When noise is added to the input →
c_l = a_l + ε, for l = 1, 2, …, L (the noise vector ε is uncorrelated with the input vector a_l)

Let σ² → average power (variance) of the noise vector ε

Then the error term in recall is non-zero and mainly due to the noise

➢ The difference from the true output is only due to the noise
Analysis of Linear Auto-Associative Networks
➢ The linear auto-association task can be realized by a single-layer network using linear processing units

➢ Condition for autoassociation:
➢ W a_l = a_l is satisfied if W = I (identity matrix)

➢ Such a choice of W is realized if the input vectors are linearly independent, so that W = A A⁻¹ = I

➢ For this choice of W, the output for a noisy input a_l + ε is given by:
W(a_l + ε) = a_l + ε → which is the noisy input itself

➢ Lack of accretive behavior during recall → therefore not useful for storing information

➢ Replacing the linear units with non-linear units makes pattern storage possible for these networks
Analysis Of Pattern Storage Networks

➢ Objective is to store a given set of patterns so that any of them can be recalled
exactly when an approximate pattern is presented for input to the network

➢ Pattern recall should happen despite disturbances to features and their spatial
relations occurring due to:
→ Noise and distortion or
→ Natural variation of the pattern generation process

➢ Outputs of the processing units (neurons) at any instant of time define the
output state of the network at that instant

➢ The state of the network at successive time instants is determined by the


activation dynamics model used for the network
Analysis Of Pattern Storage Networks

➢ Recall process of a stored pattern:


→ Starting at an initial state of the network depending on the input
→ Applying the activation dynamics (AD)
→ Keep applying the AD until equilibrium state is reached

→ Final equilibrium state corresponds to the stored pattern that is generated by


the network as output corresponding to the given input

→ Each output state has an associated energy which depends on network


parameters like weights, bias and the state of the network
→ Energy landscape: the energy of an output state depends on the network weights, the bias, and the state of the network
The Hopfield network can be used to recreate data that contains noise or which has been partially erased

https://www.nobelprize.org/prizes/physics/2024/popular-information/
Analysis Of Pattern Storage Networks
➢ Feedback in the units and non-linear processing of the units → creates basins of
attraction

➢ Small deviations from the stable state can be measured using Hamming distance

➢ The equilibrium may be a fixed point, periodic, or chaotic

➢ They are exploited for pattern storage

Figure alongside showing:
a) Energy landscape with basins of attraction (deviation from stable state)
b) Energy landscape without basins of attraction
Analysis Of Pattern Storage Networks
➢ It is not normally possible to accurately determine the number of basins of attraction, or their relative spacings and depths, in the state space of a given network

➢ No. of patterns that can be stored → capacity of the network

➢ Capacity is of the order of N for a fully connected network [N = number of processing units]

➢ There are 2^N different states of a network with N binary units, but only N energy minima – so it can store only N binary patterns

Analysis Of Pattern Storage Networks
➢ There are 2^N different states of a network with N binary units, but only N energy minima – so it can store only N binary patterns

➢ If the number of patterns > the number of basins of attraction → hard storage problem – the patterns cannot be stored in the given network

➢ If the number of patterns < the number of basins of attraction → false wells or minima → the state of the network may settle into a false well → error in recall

The Hopfield Model
➢ Hopfield model of a feedback network for pattern storage

➢ Fully connected feedback network with symmetric weights

➢ Two types
→ Continuous
→ Discrete

→ Continuous model :
• State update determined by activation dynamics
• Units have continuous non-linear output functions

→ Discrete model :
• State update is asynchronous
• Units have binary/bipolar output functions
Hopfield Network
The Hopfield Model
CONSIDER THE McCULLOCH-PITTS MODEL
The Hopfield Model
• A Hopfield net is composed of binary threshold units with
recurrent connections between them

• Recurrent networks with non-linear units are generally


extremely hard to analyze

• They can behave in many different ways


→ Settle to a stable state
→ Oscillate
→ Follow chaotic trajectories that cannot be predicted too far
into the future

→ JOHN HOPFIELD realized that if connections are symmetric,


there is a global energy function
The Hopfield Model

• JOHN HOPFIELD realized that if connections are


symmetric, there is a global energy function

• → Each binary “configuration” of the entire


network has an energy

• → The binary threshold decision rule causes the


network to settle to a minimum of this energy
function
Summary

Alteration of the units from linear to non-linear enables pattern storage – memory

The Hopfield model is an energy-based model to store and retrieve memory, just like the human brain

It is a single-layered recurrent network in which the neurons are entirely connected
ARTIFICIAL NEURAL NETWORKS
22AIP3204
CO-3
SESSION-16
Overview

• Introduction
• The Hopfield Model
• The Energy Function
• Storing memories in a Hopfield Net
• Types of Hopfield Nets (Discrete, Continuous)
• Storage Capacity of a Hopfield Net
• Solved Problem Example
• Summary
Introduction
• In 1982, John Hopfield introduced an artificial neural network to collect and retrieve memory like the human brain. Here, a neuron is either in an ‘on’ or an ‘off’ state.

• The state of a neuron (on: +1 or off: 0) is updated depending on the input it receives from the other neurons.

• A Hopfield network is at first prepared to store various patterns or memories.

• Afterward, it is ready to recognize any of the learned patterns by uncovering


partial or even some corrupted data about that pattern, i.e., it eventually settles
down and restores the closest pattern.

• Thus, similar to the human brain, the Hopfield model has stability in pattern
recognition.
The Hopfield Model

• JOHN HOPFIELD realized that if connections are


symmetric, there is a global energy function

• → Each binary “configuration” of the entire


network has an energy

• → The binary threshold decision rule causes the


network to settle to a minimum of this energy
function
The Hopfield network can be used to recreate data that contains noise or which has been partially erased

https://www.nobelprize.org/prizes/physics/2024/popular-information/
Energy Analysis of the Hopfield Network
• Discrete Hopfield Model
→ Associated with each state of the network, Hopfield proposed
an energy function whose value always either decreases or
remains the same as the state of the network changes:

V = −(1/2) Σᵢ Σⱼ wᵢⱼ sᵢ sⱼ + Σᵢ sᵢ θᵢ

Where:
θᵢ is the threshold value of unit ‘i’
sᵢ is the state of the ‘i’-th unit
sⱼ is the state of the ‘j’-th unit
wᵢⱼ are the symmetric weights
V is the total energy of the network
Energy Analysis of the Hopfield Network
• Discrete Hopfield Network
→ The energy profile (landscape) of the network is determined only by the
network architecture – no of units, output functions, threshold values,
connections between units and strength of the connections

→ Hopfield discovered that for symmetric weights (wij = wji)


with no self-feedback, (wii = 0) – and with binary {0,1} or bipolar {1,-1}
output functions :
1. The dynamics of the network with asynchronous update always leads
towards energy minima at equilibrium
2. The states corresponding to these energy minima are stable states
3. Small perturbations around these stable states lead to unstable states
and the dynamics of the network takes it back to a stable state again
4. The existence of these stable states enables the storage of patterns
(input data)
Energy Analysis of the Hopfield Network
• Continuous Model:
→ Fully connected feedback network with a continuous non-linear output
function in each unit

→ Output function is typically a sigmoid function:

f(λx) = (1 − e^(−λx)) / (1 + e^(−λx))
Figure (a):
Sigmoid function for different
values of the gain parameter

Figure (b):
The inverse function
The storage capacity of the Hopfield Net

➢ Using Hopfield’s storage rule, the capacity of a totally connected


net with N units is only about 0.15N memories

➢ At N bits per memory, this is:

→ 0.15N memories × (N bits per memory) = 0.15N² bits

→ As can be seen, this does not make efficient use of the bits required to store the weights
Noisy networks discover better energy minima
• A Hopfield net always tends to make decisions that reduce the energy
• → This can trap the network in poor local minima from which it cannot escape

• Random noise can be introduced to escape from poor minima

• → Start with a lot of noise, making it simpler to cross energy barriers
• → Gradually reduce the noise so that the system ends up in a deep energy
minimum – known as Simulated Annealing (Kirkpatrick et al., 1983)
Variation of Transition Probabilities with Temperature
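A standard way to set these transition probabilities (the usual stochastic-unit formulation; this formula is our addition, not taken from the slide) is

p(sᵢ = 1) = 1 / (1 + e^(−ΔEᵢ / T))

where ΔEᵢ is the energy gap of unit i and T is the temperature: at high T transitions are nearly random (easy barrier crossing), while as T → 0 the rule reduces to the deterministic binary threshold decision.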
Problem Example
➢ Problem:

Consider a 3-unit feedback network with symmetric weights as shown in the figure on this slide. The units have threshold values θᵢ, i = 1, 2, 3 and a binary {0,1} output function. A binary {0,1} output function is assumed for convenience, although the bipolar case {−1, +1} is also valid. Analyze the energies and state transitions based on your computations.

Home Task – Repeat the same problem for the bipolar output function {−1, 1}
Problem Example – Continued …
➢ Solution: Each state of the network is given by: s = [s₁, s₂, s₃]
The energy at any state (s₁, s₂, s₃) of the network is given by:

v(s₁, s₂, s₃) = −(1/2) Σᵢ Σⱼ wᵢⱼ sᵢ sⱼ + Σᵢ sᵢ θᵢ

There are a total of 8 (= 2³) different states for the 3-unit network, since each sᵢ is 0 or 1 → (2^N states for N different units of the network)

Assume wᵢⱼ = wⱼᵢ → weights are symmetric:
w₃₁ = w₁₃ = 0.5    θ₁ = −0.1
w₁₂ = w₂₁ = −0.5   θ₂ = −0.2
w₂₃ = w₃₂ = 0.4    θ₃ = 0.7

The 8 states (base-10 value → s₁ s₂ s₃): 0 → 000, 1 → 001, 2 → 010, 3 → 011, 4 → 100, 5 → 101, 6 → 110, 7 → 111
➢ Expanding the energy expression for the 3-unit network:

v(s₁, s₂, s₃) = −(1/2) [w₁₁s₁s₁ + w₁₂s₁s₂ + w₁₃s₁s₃ + w₂₁s₂s₁ + w₂₂s₂s₂ + w₂₃s₂s₃ + w₃₁s₃s₁ + w₃₂s₃s₂ + w₃₃s₃s₃] + [s₁θ₁ + s₂θ₂ + s₃θ₃]

With no self-feedback (wᵢᵢ = 0) and symmetric weights, each pair appears twice in the double sum:

= −(1/2)(2) [w₁₂s₁s₂ + w₁₃s₁s₃ + w₂₃s₂s₃] + [s₁θ₁ + s₂θ₂ + s₃θ₃]

So the above energy equation reduces to:

v(s₁, s₂, s₃) = −[w₁₂s₁s₂ + w₁₃s₁s₃ + w₂₃s₂s₃] + [s₁θ₁ + s₂θ₂ + s₃θ₃]

(θ₁ = −0.1, θ₂ = −0.2, θ₃ = 0.7)
So we get the following values for each of the 8 energy states:

v(0, 0, 0) = −[w₁₂(0) + w₁₃(0) + w₂₃(0)] + [(0)θ₁ + (0)θ₂ + (0)θ₃] = 0

v(0, 0, 1) = −[0 + 0 + 0] + [(0)θ₁ + (0)θ₂ + (1)θ₃] = θ₃ = 0.7

v(0, 1, 0) = −[0 + 0 + 0] + [(0)θ₁ + (1)θ₂ + (0)θ₃] = θ₂ = −0.2

v(0, 1, 1) = −[0 + 0 + w₂₃] + [θ₂ + θ₃] = −0.4 + (−0.2) + 0.7 = 0.1
Problem Example – Continued …

v(1, 0, 0) = −[0 + 0 + 0] + [s₁θ₁] = θ₁ = −0.1

v(1, 0, 1) = −[w₁₃ s₁ s₃] + [s₁θ₁ + s₃θ₃] = −0.5(1)(1) + [(−0.1) + 0.7] = −0.5 + 0.6 = 0.1
Problem Example – Continued … (θ₁ = −0.1, θ₂ = −0.2, θ₃ = 0.7)

v(1, 1, 0) = −[w₁₂ s₁ s₂] + [s₁θ₁ + s₂θ₂] = −(−0.5) + [(−0.1) + (−0.2)] = 0.5 − 0.3 = 0.2

v(1, 1, 1) = −[w₁₂ + w₁₃ + w₂₃] + [θ₁ + θ₂ + θ₃] = −[(−0.5) + 0.5 + 0.4] + [(−0.1) + (−0.2) + 0.7] = −0.4 + 0.4 = 0
Problem Example – Continued …

BASE-
10 𝒔𝟏 𝒔𝟐 𝒔𝟑 𝑣 𝑠1 , 𝑠2 , 𝑠3
VALUE

0 0 0 0 0
1 0 0 1 0.7
2 0 1 0 - 0.2
3 0 1 1 0.1
4 1 0 0 - 0.1
5 1 0 1 0.1
6 1 1 0 0.2
7 1 1 1 0
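
These eight energies, and the two minima at states 010 and 100, can be verified with a short NumPy sketch:

import numpy as np
from itertools import product

W = np.array([[ 0.0, -0.5,  0.5],
              [-0.5,  0.0,  0.4],
              [ 0.5,  0.4,  0.0]])        # symmetric weights, zero diagonal
theta = np.array([-0.1, -0.2, 0.7])

for s in product([0, 1], repeat=3):
    s = np.array(s)
    v = -0.5 * s @ W @ s + s @ theta      # v(s1, s2, s3)
    print(s, round(float(v), 2))
# lowest energies: state (0,1,0) -> -0.2 and state (1,0,0) -> -0.1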
The Hopfield Net - Conclusions and Summary

➢ To make a network useful for pattern storage, the output functions of the units are made non-linear

➢ For analysis of the energy landscape, as well as for recall of information from the stable states, we have imposed the Hopfield conditions of symmetric weights and asynchronous update
• https://www.nobelprize.org/prizes/physics/2024/popular-information/
END
HOPFIELD NETWORKS
AIM OF THE SESSION
To familiarize students with the concepts of Hopfield networks,
To make students apply Hopfield networks on a real-world problem

INSTRUCTIONAL OBJECTIVES

This unit is designed to:


1. Demonstrate Hopfield networks and its concepts
2. Describe the nature and features of Hopfield networks
3. List out the techniques of training for Hopfield networks
4. Demonstrate the process of applying Hopfield networks

LEARNING OUTCOMES

At the end of this unit, you should be able to:


1. Define the functions of Hopfield networks
2. Summarize the techniques used for building Hopfield networks
3. Describe ways to build the Hopfield networks
INTRODUCTION

• The Hopfield network is a starting point for the implementation of associative (content-addressable) memory by using a special structure of recurrent neural networks.
• The associative memory is able to recognize newly presented (noisy or incomplete) patterns using an already stored “complete” version of that pattern.
• The new pattern is “attracted” to the stable pattern already stored in the network memories.
ASSOCIATIVE MEMORY

• Associative memory, or content-addressable memory, is a system


in which a memory recall is initiated by the associability of an
input pattern to a memorized one. It allows for the retrieval and
completion of a memory using only an incomplete or noisy
portion of it
INTRODUCTION HOPFIELD

• Hopfield’s model is a flat graph, where nodes represent magnetic dipole moments in a regular repeating arrangement. Each node can occupy one of two states (i.e., spin up or spin down, +1 or −1).
• Only one unit updates its activation at a time; also, each unit continuously receives an external signal along with the signals it receives from the other units in the net.
• The first updated input forces the first updated output, which in turn acts as the second updated input through the feedback interconnections and results in the second updated output. This transition process continues until no new, updated responses are produced and the network reaches its equilibrium.
COMPONENTS OF HN

• connections to other neurons — weights, stored inside a matrix of weights. Information, or memories, are stored as the weights
• activation — theta; the activation takes on a single scalar value.
• bipolar state — this is the output of the neuron, analogous to a neuron’s ‘firing state.’ In this case, −1 and +1.
ARCHITECTURE OF HN

• neuron-a will contribute to the activation of neuron-b, and vice


versa. Connections are symmetric
HN AS DYNAMICAL SYSTEM

• the initial state configuration can be thought of as an input, and its


final state configuration as the output
• Updating is analogous to a dynamical system which, with each update generation, takes its past output as its current input. The parameters of this function, the weights, determine the way the network evolves over time.

A dynamical system is a
system whose state evolves
with time over a state space
according to a fixed rule.
ACTIVATION OF NEURON

• Any neuron’s instantaneous activation is given by:

aᵢ = Σⱼ wᵢⱼ yⱼ

• yᵢ – the output of neuron-i
• yⱼ – the respective outputs of all neurons inputting to neuron-i
• wᵢⱼ – the symmetric weight of the connection between neurons i and j.
ACTIVATION & UPDATE

• The activation of a neuron is used to determine the state, or output, of the neuron according to a thresholding function:

sᵢ = +1 if Σⱼ wᵢⱼ sⱼ ≥ 0, and sᵢ = −1 otherwise

• where sᵢ represents neuron-i’s given state and output.

• If the sign of that field differs from the sign of the neuron’s current output, its state will flip to align itself.

• If the sign of the field of input matches the sign of the neuron’s current output, it will stay the same.
CALCULATION OF PARAMETERS

• No. of neurons: number of variables (n)
• No. of synapses: (n × n) − n
• e.g., for n = 8: (8 × 8) − 8 = 64 − 8 = 56
HOPFIELD EVOLUTION

Each neuron knows only its own state and incoming inputs, and yet
a distributed pattern emerges from the network’s collective activity.
ACTIVATION FOR HOPFIELD

Hopfield used an energy function for the network, E = −(1/2) Σᵢ Σⱼ wᵢⱼ sᵢ sⱼ + Σᵢ θᵢ sᵢ, and the energy shows monotonically decreasing behavior as the network is updated
HEBB RULE

• The learning algorithm for the Hopfield network is based on the so-called Hebbian learning rule – the earliest procedure
• When two units are simultaneously activated, their interconnection weight increase becomes proportional to the product of their two activities.

• Also given as the outer-product rule of storage, as applied to a set of q presented patterns pₖ (k = 1, …, q), each with dimension n:

W = Σₖ pₖ pₖᵀ (with the diagonal set to zero)
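
A minimal sketch of outer-product storage and asynchronous recall for bipolar patterns (our own illustrative code, not the textbook's):

import numpy as np

def train_hebb(patterns):
    # Outer-product (Hebbian) rule: W = sum_k p_k p_k^T, with zero self-connections
    W = patterns.T @ patterns
    np.fill_diagonal(W, 0)
    return W

def recall(W, x, steps=60, seed=0):
    s, rng = x.copy(), np.random.default_rng(seed)
    for _ in range(steps):
        i = rng.integers(len(s))               # asynchronous: one unit at a time
        s[i] = 1 if W[i] @ s >= 0 else -1      # binary threshold decision
    return s

P = np.array([[ 1, -1,  1, -1,  1, -1],
              [ 1,  1, -1, -1,  1,  1]])
W = train_hebb(P)
noisy = np.array([1, -1, 1, -1, -1, -1])       # first pattern with one bit flipped
print(recall(W, noisy))                        # should settle back to the first stored pattern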
HOPFIELD ALGORITHM
WEIGHT UPDATE
• The update here is carried out at random, but it should be noted that each unit is updated at the same average rate
• Under asynchronous operation of the network, each output node is updated separately, taking into account the most recent values that have already been updated.
• The net has the capacity to recognize a known vector by producing a pattern of activations on the units of the net that is the same as the vector stored in the net.
• If the input vector is an unknown vector, the activation vectors produced during iteration may converge to an activation vector that is not one of the stored patterns; such a pattern is called a spurious stable state
PROBLEM TO FIND WEIGHT
SUMMARY

• In 1982, John J. Hopfield created a model that reflected the


asynchronous nature of real neurons.
• Auto associative, single-layer, fully linked feedback network
• Additionally, the network has symmetrical weights.
• Its architecture, which consists of a single-layer feedback network,
may be used for both discrete and continuous data
Self-Assessment Questions

1. A Hopfield network is

(a) Auto-associative
(b) Set-associative
(c) Non-memory
(d) Unsupervised

2. Cover's Theorem is the motivation for using

(a) Linear kernel


(b) Non linear kernel
(c) Gaussian kernel
(d) Deep methods
TERMINAL QUESTIONS

1. Describe the Hopfield network

2. List out the layers of Hopfield networks

3. Analyze the utilization of Hopfield networks for time series data

4. Summarize Cover's theorem


THANK YOU
Self-Organizing Maps (SOM) – Kohonen Maps

• SOM is a type of ANN which is also inspired by biological models of neural systems.
• It follows an unsupervised learning approach (trained through competition).
• SOM is used for clustering and mapping (dimensionality reduction).
• SOM has two layers: one is the input layer and the other one is the output layer.
• Vector quantization, which is a compression technique, is one of the properties of SOM: a self-organizing map provides a way to represent multi-dimensional data in a lower-dimensional space, typically one or two dimensions.

Architecture:
• Output layer: number of clusters, m = 2 → m = # of nodes in the output layer
• Input layer: dimensionality of the input vector, n = 4 → n = # of nodes in the input layer
n = # of nodes in the input layer
Algorithm

Step 0: Initialize the weights wᵢⱼ (random values may be assumed),
        i = 1, 2, …, n (dimensionality of the input vector)
        j = 1, 2, …, m (# of clusters [nodes in the output layer])
        Set the topological neighbourhood parameters.
        Initialize the learning rate α.

Step 1: Calculate the square of the Euclidean distance for each cluster unit j = 1 to m:
        D(j) = Σᵢ (xᵢ − wᵢⱼ)²

Step 2: Find the winning unit index J such that D(J) is minimum.

Step 3: For all units j within a specific neighbourhood of J, and for all i, calculate the new weights:
        wᵢⱼ(new) = wᵢⱼ(old) + α [xᵢ − wᵢⱼ(old)]

Step 4: Repeat steps 1, 2 and 3 for all input vectors.

Step 5: Update the learning rate α using the formula α(t + 1) = 0.5 α(t).
        The current epoch of training is over.

Repeat steps 1 to 5 for as many epochs as needed (i.e., till the termination criterion is satisfied).
Problem

Construct a KSOFM to cluster the four given vectors [0 0 1 1], [1 0 0 0], [0 1 1 0] and [0 0 0 1]. The number of clusters to be formed is 2. Assume an initial learning rate of 0.5.

No. of input features (vector dimension): n = 4
No. of clusters: m = 2
Initial learning rate: α = 0.5

Initialize the weights randomly (values between 0 and 1):

W = | 0.2  0.9 |
    | 0.4  0.7 |
    | 0.6  0.5 |
    | 0.8  0.3 |

(column j holds the weight vector of cluster unit j)
Case I: First input vector X = [0 0 1 1]

Calculate the squared Euclidean distances:

D(1) = (0.2 − 0)² + (0.4 − 0)² + (0.6 − 1)² + (0.8 − 1)² = 0.04 + 0.16 + 0.16 + 0.04 = 0.4

D(2) = (0.9 − 0)² + (0.7 − 0)² + (0.5 − 1)² + (0.3 − 1)² = 0.81 + 0.49 + 0.25 + 0.49 = 2.04

D(1) < D(2), therefore the winning cluster unit is J = 1.

Update the weights on the winning cluster unit:
wᵢJ(new) = wᵢJ(old) + α [xᵢ − wᵢJ(old)]

Since J = 1:
w₁₁(new) = 0.2 + 0.5 (0 − 0.2) = 0.1
w₂₁(new) = 0.4 + 0.5 (0 − 0.4) = 0.2
w₃₁(new) = 0.6 + 0.5 (1 − 0.6) = 0.8
w₄₁(new) = 0.8 + 0.5 (1 − 0.8) = 0.9

Updated weight matrix:

W = | 0.1  0.9 |
    | 0.2  0.7 |
    | 0.8  0.5 |
    | 0.9  0.3 |
Case II: Second input vector X = [1 0 0 0]

D(1) = (0.1 − 1)² + (0.2 − 0)² + (0.8 − 0)² + (0.9 − 0)² = 0.81 + 0.04 + 0.64 + 0.81 = 2.3

D(2) = (0.9 − 1)² + (0.7 − 0)² + (0.5 − 0)² + (0.3 − 0)² = 0.01 + 0.49 + 0.25 + 0.09 = 0.84

D(2) < D(1), therefore the winning cluster unit is J = 2.

Update the weights on cluster unit 2:
w₁₂(new) = 0.9 + 0.5 (1 − 0.9) = 0.95
w₂₂(new) = 0.7 + 0.5 (0 − 0.7) = 0.35
w₃₂(new) = 0.5 + 0.5 (0 − 0.5) = 0.25
w₄₂(new) = 0.3 + 0.5 (0 − 0.3) = 0.15

Updated weight matrix:

W = | 0.1  0.95 |
    | 0.2  0.35 |
    | 0.8  0.25 |
    | 0.9  0.15 |
Case III: Third input vector X = [0 1 1 0]

D(1) = (0.1 − 0)² + (0.2 − 1)² + (0.8 − 1)² + (0.9 − 0)² = 0.01 + 0.64 + 0.04 + 0.81 = 1.5

D(2) = (0.95 − 0)² + (0.35 − 1)² + (0.25 − 1)² + (0.15 − 0)² = 0.9025 + 0.4225 + 0.5625 + 0.0225 = 1.91

D(1) < D(2), therefore the winning cluster unit is J = 1.

w₁₁(new) = 0.1 + 0.5 (0 − 0.1) = 0.05
w₂₁(new) = 0.2 + 0.5 (1 − 0.2) = 0.6
w₃₁(new) = 0.8 + 0.5 (1 − 0.8) = 0.9
w₄₁(new) = 0.9 + 0.5 (0 − 0.9) = 0.45

New updated weight matrix:

W = | 0.05  0.95 |
    | 0.6   0.35 |
    | 0.9   0.25 |
    | 0.45  0.15 |

Case IV: Fourth input vector X = [0 0 0 1]

D(1) = (0.05 − 0)² + (0.6 − 0)² + (0.9 − 0)² + (0.45 − 1)² = 0.0025 + 0.36 + 0.81 + 0.3025 = 1.475

D(2) = (0.95 − 0)² + (0.35 − 0)² + (0.25 − 0)² + (0.15 − 1)² = 0.9025 + 0.1225 + 0.0625 + 0.7225 = 1.81

D(1) < D(2), therefore the winning cluster unit is J = 1.

w₁₁(new) = 0.05 + 0.5 (0 − 0.05) = 0.025
w₂₁(new) = 0.6 + 0.5 (0 − 0.6) = 0.3
w₃₁(new) = 0.9 + 0.5 (0 − 0.9) = 0.45
w₄₁(new) = 0.45 + 0.5 (1 − 0.45) = 0.725

New updated weight matrix (end of epoch 1):

W = | 0.025  0.95 |
    | 0.3    0.35 |
    | 0.45   0.25 |
    | 0.725  0.15 |
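
The four iterations above can be reproduced with a short NumPy sketch (one epoch, winner-take-all neighbourhood, α = 0.5):

import numpy as np

X = np.array([[0, 0, 1, 1],
              [1, 0, 0, 0],
              [0, 1, 1, 0],
              [0, 0, 0, 1]], dtype=float)      # the four input vectors
W = np.array([[0.2, 0.9],
              [0.4, 0.7],
              [0.6, 0.5],
              [0.8, 0.3]])                     # initial weights: column j = cluster unit j
alpha = 0.5

for x in X:
    D = ((x[:, None] - W) ** 2).sum(axis=0)    # squared Euclidean distance per cluster
    J = np.argmin(D)                           # winning cluster unit
    W[:, J] += alpha * (x - W[:, J])           # update only the winner
    print("D =", np.round(D, 3), "-> winner J =", J + 1)

print(np.round(W, 3))
# [[0.025 0.95 ] [0.3 0.35 ] [0.45 0.25 ] [0.725 0.15 ]]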
Optimization in Neural
Network
Session 25, 26
• Backpropagation is the most common method for optimization.
• Other methods like the genetic algorithm, Tabu search, and simulated annealing can also be used.
• When we talk about ANN optimization, the objective function is the mean square error function (loss/cost function).
• We have to find optimal values of the weights of the neural network that minimize the objective function.
• Although gradient-based search techniques such as back-propagation are currently the most widely used optimization techniques for training neural networks, it has been shown that these gradient techniques are severely limited in their ability to find global solutions.
• Global search techniques have been identified as a potential solution to this problem.
• Two well-known global search techniques, Simulated Annealing and
the Genetic Algorithm can be used.
• Because of its ease of use, an overwhelming majority of these applications have used some variation of the gradient technique backpropagation (BP) for optimizing the networks.
• Although backpropagation has unquestionably been a major factor in the success of past neural network applications, it is plagued with inconsistent and unpredictable performance.
• For a variety of complex functions, the genetic algorithm was able to achieve superior solutions for neural network optimization compared with backpropagation.
• Another global search heuristic, Tabu Search (TS), was also able to systematically achieve superior solutions for optimizing the neural network than those achieved by backpropagation.
• In addition to GA and TS, another well-known global search heuristic is simulated annealing.
• Simulated annealing has been shown to perform well for optimizing a
wide variety of complex problems.
• Although the most frequently used algorithm for optimizing neural networks is backpropagation, it is likely to obtain local solutions.
• Simulated annealing, a global search algorithm, performs better than backpropagation, but it also uses a point-to-point search.
• Both BP and SA have multiple user-determined parameters which may significantly impact the solution.
• Since there are no established rules for selecting these parameters, the solution outcome is based on chance.
• The genetic algorithm appears to be able to systematically obtain superior
solutions to simulated annealing for optimizing neural networks.
• The genetic algorithm’s process of moving from one population of points to another enables it to discard potential local solutions and to achieve superior solutions in a computationally more efficient manner.
Genetic Algorithm
• Heuristic search algorithm inspired by Charles Darwin’s theory of
natural evolution.
• Genetic algorithms are based on the ideas of natural selection and
genetics.
• These are intelligent exploitation of random searches provided with
historical data to direct the search into the region of better
performance in solution space.
• They are commonly used to generate high-quality solutions for
optimization problems and search problems.
• Genetic algorithms simulate the process of natural selection which
means those species that can adapt to changes in their environment
can survive and reproduce and go to the next generation.
• In simple words, they simulate “survival of the fittest” among
individuals of consecutive generations to solve a problem.
• Each generation consists of a population of individuals and each
individual represents a point in search space and possible solution.
• Each individual is represented as a string of
character/integer/float/bits.
• This string is analogous to the Chromosome.
• Genetic algorithms are based on an analogy with the genetic
structure and behavior of chromosomes of the population.
• Following is the foundation of GAs based on this analogy –
1.Individuals in the population compete for resources and mate
2.Those individuals who are successful (fittest) then mate to create
more offspring than others
3.Genes from the “fittest” parent propagate throughout the
generation, that is sometimes parents create offspring which is better
than either parent.
4.Thus each successive generation is more suited for their environment.
Search space

• The population of individuals are maintained within search space.


• Each individual represents a solution in search space for given
problem.
• Each individual is coded as a finite length vector (analogous to
chromosome) of components.
• These variable components are analogous to Genes.
• Thus a chromosome (individual) is composed of several genes
(variable components).
Fitness Score

• A Fitness Score is given to each individual which shows the ability of an


individual to “compete”.
• The individual having optimal fitness score (or near optimal) are sought.
• The GAs maintains the population of n individuals (chromosome/solutions)
along with their fitness scores.
• The individuals having better fitness scores are given more chance to
reproduce than others.
• The individuals with better fitness scores are selected who mate and
produce better offspring by combining chromosomes of parents.
• The population size is static, so room has to be created for new arrivals.
• So, some individuals die and get replaced by new arrivals, eventually creating a new generation when all the mating opportunities of the old population are exhausted.
• It is hoped that over successive generations better solutions will arrive
while least fit die.
• Each new generation has on average more “better genes” than the
individual (solution) of previous generations.
• Thus each new generations have better “partial solutions” than previous
generations.
• Once the offspring produced show no significant difference from the offspring produced by previous populations, the population has converged.
• The algorithm is then said to have converged to a set of solutions for the problem.
Genetic Algorithm Operations
• Selection operation
• Crossover operation
• Mutation operation
• Once the initial generation is created, the algorithm evolves the
generation using following operators:
1) Selection Operator:
• The idea is to give preference to the individuals with good fitness
scores and allow them to pass their genes to successive generations.
2) Crossover Operator: This represents mating between individuals.
• Two individuals are selected using selection operator and crossover
sites are chosen randomly.
• Then the genes at these crossover sites are exchanged thus creating a
completely new individual (offspring). For example –
3) Mutation Operator: The key idea is to insert random genes in
offspring to maintain the diversity in the population to avoid premature
convergence. For example –
Genetic Algorithm steps
• Randomly initialize population as p.
• Determine fitness of population.
• Until convergence, repeat:
1. select parents from population
2. crossover and generate new population
3. perform mutation on new population
4. calculate fitness for new population
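
A compact sketch of these steps on a toy problem (maximizing the number of 1-bits; population size, tournament size, and mutation rate are illustrative choices of ours):

import random

def fitness(ind):                      # toy objective: count the 1-bits
    return sum(ind)

def select(pop):                       # tournament selection of a parent
    return max(random.sample(pop, 3), key=fitness)

def crossover(p1, p2):                 # single-point crossover
    cut = random.randrange(1, len(p1))
    return p1[:cut] + p2[cut:]

def mutate(ind, rate=0.02):            # random gene flips keep diversity
    return [1 - g if random.random() < rate else g for g in ind]

random.seed(0)
pop = [[random.randint(0, 1) for _ in range(20)] for _ in range(30)]
for generation in range(50):           # fixed budget in place of a convergence test
    pop = [mutate(crossover(select(pop), select(pop))) for _ in pop]

print(max(fitness(ind) for ind in pop))   # best fitness found (max possible: 20)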
LEARNING VECTOR
QUANTIZATION
22AIP3204 ARTIFICIAL NEURAL NETWORKS
LVQ

• In Learning Vector Quantization (LVQ), a reference (or codebook) vector


represents a prototype or exemplar for each class in the dataset.
• Each reference vector is associated with a class label, and these vectors are used to
classify new data points based on their proximity to the reference vectors.
• Purpose: Reference vectors are intended to approximate the centroids or characteristic
points of each class in the feature space.
• Each reference vector represents a cluster of data points that share similar features and
belong to the same class.
LVQ

• LVQ is a supervised learning algorithm. Although it shares some similarities with


unsupervised clustering algorithms (e.g., it uses distance metrics and prototype vectors),
it requires labeled data to train.
• The labels guide the adjustment of reference vectors to ensure they represent the
correct classes, making LVQ suitable for classification tasks rather than pure clustering.
LVQ

• The architecture of Learning Vector Quantization, with the number of classes in the input data and n input features per sample, is given below:
LVQ

• Learning Vector Quantization (or LVQ) is a type of Artificial Neural Network which is also inspired by biological models of neural systems.
• It is based on prototype supervised learning classification algorithm and trained its
network through a competitive learning algorithm similar to Self Organizing Map.
• It can also deal with the multiclass classification problem.
• LVQ has two layers, one is the Input layer and the other one is the Output layer.
LVQ1
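A minimal sketch of the standard LVQ1 update rule (our own illustrative NumPy code, under the usual formulation): the nearest reference vector is attracted toward a sample of its own class and repelled from a sample of a different class.

import numpy as np

def lvq1_step(prototypes, proto_labels, x, y, alpha=0.1):
    # find the closest reference (codebook) vector to the sample x
    j = np.argmin(((prototypes - x) ** 2).sum(axis=1))
    if proto_labels[j] == y:
        prototypes[j] += alpha * (x - prototypes[j])   # correct class: attract
    else:
        prototypes[j] -= alpha * (x - prototypes[j])   # wrong class: repel
    return prototypes

protos = np.array([[0.0, 0.0], [1.0, 1.0]])   # one reference vector per class
labels = np.array([0, 1])
protos = lvq1_step(protos, labels, x=np.array([0.9, 0.8]), y=1)
print(protos)   # the class-1 prototype moved toward the sample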
LVQ2.1
THANKS
ANN TEAM

8
DYNAMICALLY DRIVEN
RECURRENT
NETWORKS (RNN)
22AIP3204 ARTIFICIAL NEURAL NETWORKS
RNN

• Dynamically Driven Recurrent Networks (DD-RNNs) are a variation of traditional


Recurrent Neural Networks (RNNs) that aim to incorporate more dynamic and flexible
mechanisms for processing sequential data. These networks typically introduce additional
components or changes to the architecture of standard RNNs, allowing them to adapt
better to complex, time-varying patterns or dependencies in the input data.
KEY CONCEPTS IN DD-RNNS:

1. Dynamic Computation:

1. DD-RNNs can incorporate dynamic decision-making processes at each time step, allowing the network to adjust how it processes inputs
based on context or historical information.
2. This dynamic aspect could come from using additional attention mechanisms, gating mechanisms, or self-organization techniques that modify
the flow of information.

2. Improved Memory Mechanisms:

1. Like traditional RNNs, DD-RNNs aim to maintain memory over time, but they often include mechanisms that help the network forget or
retain information more effectively based on the context of the input sequence.
2. Techniques such as long short-term memory (LSTM) or gated recurrent units (GRUs) are commonly used in DD-RNNs to handle vanishing gradient problems and improve memory retention (a minimal gating sketch follows below).
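
As an illustration of the gating idea only (a minimal NumPy GRU cell of our own, not a DD-RNN architecture):

import numpy as np

def gru_cell(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    # Gates decide what to keep (update gate z) and what to reset (reset gate r)
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = sigmoid(Wz @ x + Uz @ h)                 # update gate
    r = sigmoid(Wr @ x + Ur @ h)                 # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))     # candidate state
    return (1 - z) * h + z * h_tilde             # blend old memory with new content

rng = np.random.default_rng(0)
d_in, d_h = 3, 4
params = [rng.normal(0, 0.1, (d_h, d)) for d in (d_in, d_h) * 3]   # Wz,Uz,Wr,Ur,Wh,Uh
h = np.zeros(d_h)
for t in range(5):                               # run the cell over a short sequence
    h = gru_cell(rng.normal(size=d_in), h, *params)
print(np.round(h, 3))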
KEY CONCEPTS IN DD-RNNS:

1. Adaptive and Flexible Architectures:

1. DD-RNNs can adjust the depth, width, or complexity of the network dynamically, allowing them to learn more efficient representations of sequential data. This adaptability may allow the network to process time series data more efficiently.

2. Learning and Adaptation:

1. These networks can learn to modify their structure or the way information flows over time. The key idea is that the
network is not fixed, but can adapt during training to better handle the temporal relationships and dynamics of the
data.
APPLICATIONS OF DD-RNNS:

1. Time Series Forecasting: DD-RNNs can be particularly useful for tasks where sequences
have complex temporal patterns, like financial forecasting, weather prediction, or
demand forecasting.
2. Speech Recognition and Natural Language Processing (NLP): DD-RNNs can process
dynamic sequences of text or audio more efficiently, allowing for better performance in
speech-to-text systems or language modeling.
3. Robotics and Control Systems: In robotics, DD-RNNs can be used to model dynamic
environments or predict sequences of actions to improve decision-making in real-time
systems.
COMPARISON WITH TRADITIONAL RNNS:

1. Traditional RNNs are simple in design and have difficulty handling long-term
dependencies due to the vanishing gradient problem. They use a fixed structure to
propagate information through time.
2. DD-RNNs, on the other hand, introduce more complex and adaptable components to
handle dynamic and complex temporal dependencies. They can adjust their architecture
or processing strategy to better fit the nature of the input sequence.
CONCLUSION

In essence, DD-RNNs are more flexible and capable of handling complex, dynamic systems
compared to traditional RNNs. They are useful in situations where the relationships within
sequential data evolve over time or require more sophisticated memory and attention
mechanisms.
