
UNIT-1

Outline

• Introduction: A Neural Network
• Human Brain, Models of a Neuron
• Neural Networks Viewed as Directed Graphs
• Network Architectures
• Knowledge Representation, Artificial Intelligence and Neural Networks
• Learning Process: Error-Correction Learning
• Memory-Based Learning, Hebbian Learning
• Competitive Learning, Boltzmann Learning
• Credit Assignment Problem, Memory, Adaptation
• Statistical Nature of the Learning Process

UNIT 1

A Neural Network

A neural network is a type of machine learning model inspired by the structure and function of
the human brain. It consists of interconnected nodes, or artificial neurons, organized in layers to
process and analyze data. Each neuron receives inputs, applies a mathematical operation to them,
and produces an output that is passed on to other neurons in the network.

Neural networks can be used for a wide range of tasks, such as image and speech recognition,
natural language processing, and predictive modeling. They are particularly effective for tasks
that involve large amounts of complex data and patterns that are difficult to discern with
traditional algorithms.

Training a neural network involves adjusting the weights and biases of the connections between
neurons to minimize the difference between the predicted output and the actual output. This
process is typically done using an optimization algorithm, such as gradient descent, and a loss
function, which measures the difference between the predicted output and the actual output.

The architecture of a neural network can vary depending on the task at hand. Some common types of neural networks include feedforward neural networks, recurrent neural networks, and convolutional neural networks.

Types of Neural Networks

There are several types of neural networks, each designed for specific tasks and data structures.
Some of the most common types of neural networks are:

Feedforward Neural Networks: These are the simplest and most common type of neural network. They consist of input, hidden, and output layers, with data flowing in one direction only, from input to output. They are often used for tasks such as classification and regression.

Recurrent Neural Networks: These neural networks have connections between nodes that form
a directed cycle. They are often used for tasks such as natural language processing and speech
recognition, where the sequence of input data is important.

Convolutional Neural Networks: These neural networks are designed for processing images and other multi-dimensional data. They use convolutional layers to extract features from the input data and pooling layers to reduce the size of the data.

Autoencoders: These neural networks are designed for unsupervised learning, where the goal is to learn a compressed representation of the input data. They consist of an encoder and a decoder, which learn to map the input data to a lower-dimensional space and back again.

Generative Adversarial Networks: These neural networks consist of a generator and a
discriminator, which are trained together in a game-like setup. The generator learns to generate
new data that is similar to the training data, while the discriminator learns to distinguish between
real and generated data.

Long Short-Term Memory Networks: These neural networks are a type of recurrent neural
network that is designed to handle long-term dependencies. They use memory cells to remember
information from earlier in the sequence and avoid the vanishing gradient problem.

There are also many other types of neural networks, including radial basis function networks,
self-organizing maps, and deep belief networks, each designed for specific tasks and data
structures.

Advantages and Disadvantages of Neural Networks

Advantages of Neural Networks


Neural networks can work continuously and are more efficient than humans or simpler analytical models. Neural networks can also be programmed to learn from prior outputs to determine future outcomes based on the similarity to prior inputs.

Neural networks that leverage cloud or online services also have the benefit of risk mitigation compared to systems that rely on local technology hardware. In addition, neural networks can often perform multiple tasks simultaneously (or at least distribute tasks to be performed by modular networks at the same time).

Last, neural networks are continually being expanded into new applications. While early, theoretical neural networks were very limited in their applicability to different fields, neural networks today are leveraged in medicine, science, finance, agriculture, and security.

Disadvantages of Neural Networks


Though neural networks may rely on online platforms, there is still a hardware component that is required to create the neural network. This creates a physical risk for a network that relies on complex systems, set-up requirements, and potential physical maintenance.

Though the complexity of neural networks is a strength, this may mean it takes months (if not
longer) to develop a specific algorithm for a specific task. In addition, it may be difficult to spot
any errors or deficiencies in the process, especially if the results are estimates or theoretical
ranges.

Neural networks may also be difficult to audit. Some neural network processes may feel "like a black box" where input is entered, networks perform complicated processes, and output is reported. It may also be difficult for individuals to analyze weaknesses within the calculation or learning process of the network if the network lacks transparency about how the model learns from prior activity.

Neural Networks
Pros

• Can often work more efficiently and for longer than humans
• Can be programmed to learn from prior outcomes to strive to make smarter future
calculations
• Often leverage online services that reduce (but do not eliminate) systematic risk
• Are continually being expanded in new fields with more difficult problems

Cons

• Still rely on hardware that may require labor and expertise to maintain
• May take long periods of time to develop the code and algorithms
• May be difficult to assess errors or adaptations to the assumptions if the system is self-learning but lacks transparency
• Usually report an estimated range or estimated amount that may not actualize

What Are the Components of a Neural Network?

There are three main components: an input layer, a processing layer, and an output layer. The inputs may be weighted based on various criteria. Within the processing layer, which is hidden from view, there are nodes and connections between these nodes, meant to be analogous to the neurons and synapses in an animal brain.

What Is a Deep Neural Network?

Also known as a deep learning network, a deep neural network, at its most basic, is one that involves two or more processing layers. Deep neural networks rely on machine learning networks that continually evolve by comparing estimated outcomes to actual results, then modifying future projections.

What Are the 3 Components of a Neural Network?

All neural networks have three main components. First, the input is the data entered into the
network that is to be analyzed. Second, the processing layer utilizes the data (and prior
knowledge of similar data sets) to formulate an expected outcome. That outcome is the third
component, and this third component is the desired end product from the analysis.

The Bottom Line

Neural networks are complex, integrated systems that can perform analytics much deeper and
faster than human capability. There are different types of neural networks, often best suited for
different purposes and target outputs. In finance, neural networks are used to analyze
transaction history, understand asset movement, and predict financial market outcomes.

The term "Artificial Neural Network" is derived from Biological neural networks that develop
the structure of a human brain. Similar to the human brain that has neurons interconnected to one
another; artificial neural networks also have neurons that are interconnected to one another in
various layers of the networks. These neurons are known as nodes.

The given figure illustrates the typical diagram of Biological Neural Network.

Dendrites from Biological Neural Network represent inputs in Artificial Neural Networks, cell
nucleus represents Nodes, synapse represents Weights, and Axon represents Output.

Relationship between Biological neural network and artificial neural network:

Biological Neural Network        Artificial Neural Network

Dendrites                        Inputs
Cell nucleus                     Nodes
Synapse                          Weights
Axon                             Output

An Artificial Neural Network is a model in the field of Artificial Intelligence that attempts to mimic the network of neurons that makes up the human brain, so that computers have an option to understand things and make decisions in a human-like manner. The artificial neural network is designed by programming computers to behave simply like interconnected brain cells.

There are around 1000 billion neurons in the human brain. Each neuron has somewhere in the range of 1,000 to 100,000 association points. In the human brain, data is stored in a distributed manner, and we can extract more than one piece of this data from memory in parallel when necessary. We can say that the human brain is made up of incredibly amazing parallel processors.

We can understand the artificial neural network with an example. Consider a digital logic gate that takes an input and gives an output, such as an "OR" gate, which takes two inputs. If one or both inputs are "On," the output is "On." If both inputs are "Off," the output is "Off." Here the output depends only on the input. Our brain does not perform the same task: the relationship of outputs to inputs keeps changing, because the neurons in our brain are "learning."

The architecture of an artificial neural network:

To understand the concept of the architecture of an artificial neural network, we have to understand what a neural network consists of. A neural network consists of a large number of artificial neurons, termed units, arranged in a sequence of layers. Let us look at the various types of layers available in an artificial neural network.

Artificial Neural Network primarily consists of three layers:

Input Layer:

As the name suggests, it accepts inputs in several different formats provided by the programmer.

Hidden Layer:

The hidden layer sits between the input and output layers. It performs all the calculations needed to find hidden features and patterns.

Output Layer:

The input goes through a series of transformations using the hidden layer, which finally results in
output that is conveyed using this layer.

The artificial neural network takes input and computes the weighted sum of the inputs and
includes a bias. This computation is represented in the form of a transfer function.

The weighted total is then passed as an input to an activation function to produce the output. Activation functions decide whether a node should fire or not. Only the nodes that fire make it to the output layer. There are distinctive activation functions available that can be applied depending on the sort of task we are performing.
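As a minimal sketch of the computation just described (the function and variable names here are illustrative, not taken from the text), a single artificial neuron can be written in Python as:

import math

def neuron_output(inputs, weights, bias):
    # Weighted sum of the inputs plus the bias (the "transfer function" input)
    weighted_total = sum(x * w for x, w in zip(inputs, weights)) + bias
    # Sigmoid activation decides how strongly the node "fires"
    return 1.0 / (1.0 + math.exp(-weighted_total))

# Example: two inputs with arbitrary weights and bias
print(neuron_output([0.5, 0.8], [0.4, -0.2], bias=0.1))  # ≈ 0.53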

Advantages of Artificial Neural Network (ANN)

Parallel processing capability:

Artificial neural networks can perform more than one task simultaneously.

Storing data on the entire network:

Unlike traditional programming, where data is stored in a database, here the data is stored on the whole network. The disappearance of a couple of pieces of data in one place does not prevent the network from working.

Capability to work with incomplete knowledge:

After training, an ANN may produce output even with incomplete data. The loss of performance here depends upon the importance of the missing data.

Having a memory distribution:

For an ANN to be able to adapt, it is important to determine the examples and to train the network according to the desired output by demonstrating these examples to the network. The success of the network is directly proportional to the chosen instances, and if the event cannot be shown to the network in all its aspects, the network can produce false output.

Having fault tolerance:

Corruption of one or more cells of an ANN does not prevent it from generating output, and this feature makes the network fault-tolerant.

Disadvantages of Artificial Neural Network:

Assurance of proper network structure:

There is no particular guideline for determining the structure of artificial neural networks. The
appropriate network structure is accomplished through experience, trial, and error.

Unrecognized behavior of the network:

It is the most significant issue of ANN. When ANN produces a testing solution, it does not
provide insight concerning why and how. It decreases trust in the network.

Hardware dependence:

Artificial neural networks need processors with parallel processing power, in accordance with their structure. Therefore, the realization of the network depends on suitable equipment.

Difficulty of showing the issue to the network:

ANNs can work with numerical data. Problems must be converted into numerical values before
being introduced to ANN. The presentation mechanism to be resolved here will directly impact
the performance of the network. It relies on the user's abilities.

Artificial Neural Networks


The Artificial Neural Network is biologically inspired by the neural network that constitutes the human brain. Neural networks are modeled in accordance with the human brain so as to imitate its functionality. Just as the human brain can be defined as a neural network made up of several neurons, the Artificial Neural Network is made of numerous perceptrons.

A neural network comprises three main layers, which are as follows:

o Input layer: The input layer accepts all the inputs provided by the programmer.
o Hidden layer: Between the input and output layers there is a set of hidden layers, in which computations are performed that further result in the output.
o Output layer: After the input undergoes a series of transformations while passing through the hidden layers, the result is delivered by the output layer.

Motivation behind Neural Network

Basically, the neural network is based on the neurons, which are nothing but the brain cells. A
biological neuron receives input from other sources, combines them in some way, followed by
performing a nonlinear operation on the result, and the output is the final result.

The dendrites will act as a receiver that receives signals from other neurons, which are then
passed on to the cell body. The cell body will perform some operations, which can be a summation, multiplication, etc. After the operations are performed on the set of inputs, they are transferred to the next neuron via the axon, which is the transmitter of the signal for the neuron.

Importance of Neural Network:


o Without Neural Network: Let's have a look at the example given below. Here we have a machine that we have trained with four types of cats, as you can see in the image below. Once we are done with the training, we provide a random image of a cat to that machine. Since this cat is not similar to the cats on which we trained our system, without a neural network our machine would not identify the cat in the picture. Basically, the machine will get confused in figuring out where the cat is.

o With Neural Network: However, in the case with a neural network, even if we have not trained our machine on that particular cat, it can still identify certain features of a cat that we have trained on, match those features with the cat in that particular image, and identify the cat. With the help of this example, you can clearly see the importance of the concept of a neural network.

Working of Artificial Neural Networks

Instead of directly getting into the working of Artificial Neural Networks, let's break down and try to understand the Neural Network's basic unit, which is called a Perceptron.

A perceptron can be defined as a neural network with a single layer that classifies linear data. It further constitutes four major components, which are as follows:

1. Inputs
2. Weights and Bias
3. Summation Function
4. Activation or transformation function

The main logic behind the concept of the Perceptron is as follows:

The inputs (x) are fed into the input layer and multiplied by the allotted weights (w); the products are then added to form the weighted sum. This weighted sum is then passed to the pertinent activation function.
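A minimal sketch of this forward pass, using a step activation with a threshold (the weights and bias below are illustrative assumptions, and happen to realize the OR gate discussed earlier):

def step(z, threshold=0.0):
    # Step activation: fire (1) if the weighted sum exceeds the threshold
    return 1 if z > threshold else 0

def perceptron(inputs, weights, bias):
    # Weighted sum of inputs plus bias, then the activation decides the class
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return step(weighted_sum)

print(perceptron([1, 0], weights=[0.6, 0.6], bias=-0.5))  # 1
print(perceptron([0, 0], weights=[0.6, 0.6], bias=-0.5))  # 0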

Weights and Bias

As and when the input variable is fed into the network, a random value is given as a weight of
that particular input, such that each individual weight represents the importance of that input in
order to make correct predictions of the result.

However, bias helps in the adjustment of the curve of activation function so as to accomplish a
precise output.

Summation Function

After the weights are assigned to the inputs, the network computes the product of each input and its weight. The weighted sum is then calculated by the summation function, in which all of the products are added.

Activation Function

The main objective of the activation function is to map the weighted sum onto the output. The transformation function comprises activation functions such as tanh, ReLU, sigmoid, etc.

The activation function is categorized into two main parts:

1. Linear Activation Function
2. Non-Linear Activation Function

Linear Activation Function

In the linear activation function, the output of the function is not restricted to any range; its range is from -infinity to infinity. For each individual neuron, the inputs are multiplied by the weight of that neuron, which leads to the creation of an output signal proportional to the input. If all the layers are linear in nature, then the final activation of the last layer will actually be a linear function of the initial layer's input.

Non-Linear Activation Function

These are the most widely used activation functions. They help the model generalize and adapt to any sort of data in order to correctly differentiate between outputs. They solve the following problems faced by linear activation functions:

o Since non-linear functions have usable derivative functions, the problems related to backpropagation are successfully solved.
o They permit the stacking of several layers of neurons, enabling the creation of deep neural networks.

The non-linear activation function is further divided into the following parts:

1. Sigmoid or Logistic Activation Function


It provides a smooth gradient by preventing sudden jumps in the output values. It has an output value range between 0 and 1, which helps in the normalization of each neuron's output. For X values between -2 and 2, the curve is very steep; in simple language, this means that even a small change in X can bring a lot of change in Y. Because its value ranges between 0 and 1, it is highly preferred for binary classification, whose result is either 0 or 1.

2. Tanh or Hyperbolic Tangent Activation Function


The tanh activation function works much better than the sigmoid function; we can say it is an advanced version of the sigmoid activation function. Since it has a value range between -1 and 1, it is utilized by the hidden layers in the neural network, and for this reason it makes the process of learning much easier.

3. ReLU (Rectified Linear Unit) Activation Function


ReLU is one of the most widely used activation functions for the hidden layers of a neural network. Its value ranges from 0 to infinity. It helps with the vanishing-gradient problem encountered during backpropagation. It is computationally cheaper than the sigmoid and tanh activation functions, and it allows only a few neurons to be activated at a particular instant, which leads to effective as well as easier computations.

4. Softmax Function

It is a kind of sigmoid function used for solving classification problems. It is mainly used to handle multiple classes, for which it squeezes the output of each class between 0 and 1 and then divides by the sum of the outputs. This kind of function is specially used by the classifier in the output layer.
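For illustration, the four activation functions above can be written compactly as follows; this is a minimal NumPy sketch, not code from the text:

import numpy as np

def sigmoid(z):
    # Squashes any real value into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # Squashes any real value into the range (-1, 1)
    return np.tanh(z)

def relu(z):
    # Passes positive values through, zeroes out negative ones
    return np.maximum(0.0, z)

def softmax(z):
    # Exponentiate (shifted for numerical stability), then normalize to sum to 1
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z), relu(z), softmax(z))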

Gradient Descent Algorithm

Gradient descent is an optimization algorithm that is utilized to minimize the cost function used
in various machine learning algorithms so as to update the parameters of the learning model. In
linear regression, these parameters are coefficients, whereas, in the neural network, they are
weights.

Procedure:

It all starts with an initial value for the coefficient of the function, which may be either 0.0 or a small arbitrary value.

Coefficient = 0.0

To estimate the cost of the coefficient, it is plugged into the function and evaluated:

Cost = f (coefficient)
or, cost = evaluate(f(coefficient))

Next, the derivative is calculated; the derivative is a concept from calculus that gives the function's slope at a given point. We need to calculate the slope to know the direction in which to move the coefficient value so as to achieve a lower cost in the next iteration.

Delta = derivative (cost)

Now that we have found the downhill direction, it will help in updating the values of the coefficients. Next, we need to specify alpha, the learning-rate parameter, as it controls the size of the change made to the coefficients on each update.

Coefficient = coefficient - (alpha * delta)

The whole process is repeated until the cost of the coefficient reaches 0.0 or close enough to it.

It can be concluded that gradient descent is a very simple as well as straightforward concept. It
just requires you to know about the gradient of the cost function or simply the function that you
are willing to optimize.
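Putting the steps above together, here is a minimal sketch of the procedure for a one-dimensional cost function; the quadratic cost and starting values are illustrative assumptions:

def gradient_descent(derivative, coefficient=0.0, alpha=0.1, iterations=50):
    # Repeat: step the coefficient against the slope of the cost function
    for _ in range(iterations):
        delta = derivative(coefficient)
        coefficient = coefficient - alpha * delta
    return coefficient

# Example: minimize cost f(c) = (c - 3)^2, whose derivative is 2*(c - 3)
best = gradient_descent(lambda c: 2.0 * (c - 3.0))
print(best)  # ≈ 3.0, the minimizer of the cost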

Algorithm for Batch Gradient Descent

Let m be the number of training examples and n be the number of features.

Now assume that hƟ represents the hypothesis for linear regression and ∑ computes the sum of
all training examples from i=1 to m. Then the cost of function will be computed by:

Jtrain(θ) = (1/2m) ∑i=1..m (hθ(x(i)) − y(i))²

Repeat {
    θj = θj − (learning rate/m) * ∑i=1..m (hθ(x(i)) − y(i)) xj(i)
    for every j = 0...n
}

Here xj(i) indicates the jth feature of the ith training example. If m is very large, each update requires a full pass over all training examples, so every step of the descent becomes slow.
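A minimal NumPy sketch of batch gradient descent for linear regression under these definitions (the synthetic data and learning rate are illustrative assumptions):

import numpy as np

def batch_gradient_descent(X, y, alpha=0.1, iterations=1000):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iterations):
        # One update uses the summed gradient over ALL m examples
        errors = X @ theta - y                  # h_theta(x(i)) - y(i)
        theta -= (alpha / m) * (X.T @ errors)   # update every theta_j at once
    return theta

# Synthetic data: y = 4 + 3x, with an intercept column (the j = 0 feature is 1)
rng = np.random.default_rng(0)
x = rng.uniform(0, 2, size=100)
X = np.column_stack([np.ones_like(x), x])
y = 4 + 3 * x
print(batch_gradient_descent(X, y))  # ≈ [4.0, 3.0]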

Stochastic Gradient Descent

At a single iteration, stochastic gradient descent processes only one training example, which means that all the parameters are updated after each single training example is processed. It tends to be much faster than batch gradient descent, but when we have a huge number of training examples it still processes one example at a time, so the system may undergo a large number of iterations. To train the parameters evenly across each type of data, properly shuffle the dataset.

Algorithm for Stochastic Gradient Descent

Suppose that (x(i), y(i)) is a training example. Then:

Cost(θ, (x(i), y(i))) = (1/2) (hθ(x(i)) − y(i))²

Jtrain(θ) = (1/m) ∑i=1..m Cost(θ, (x(i), y(i)))

Repeat {
    For i = 1 to m {
        θj = θj − (learning rate) * (hθ(x(i)) − y(i)) xj(i)
        for every j = 0...n
    }
}
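The same linear-regression setting as before, now with the stochastic update (parameters change after each single example); the shuffling follows the advice above, and the data is again an illustrative assumption:

import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01, epochs=50):
    m, n = X.shape
    theta = np.zeros(n)
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        # Shuffle so parameters are trained evenly across the data
        order = rng.permutation(m)
        for i in order:
            error = X[i] @ theta - y[i]      # single-example error
            theta -= alpha * error * X[i]    # update after EACH example
    return theta

x = np.linspace(0, 2, 100)
X = np.column_stack([np.ones_like(x), x])
y = 4 + 3 * x
print(stochastic_gradient_descent(X, y))  # fluctuates near [4.0, 3.0]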

Convergence trends in different variants of Gradient Descent

The Batch Gradient Descent algorithm follows a straight-line path towards the minimum. The
algorithm converges towards the global minimum, in case the cost function is convex, else
towards the local minimum, if the cost function is not convex. Here the learning rate is typically
constant.

However, in the case of Stochastic Gradient Descent, the algorithm fluctuates around the global minimum rather than settling exactly on it. The learning rate is decreased slowly so that it can converge. Since it processes only one example per iteration, it tends to be noisy.

Backpropagation

A backpropagation network consists of an input layer of neurons, an output layer, and at least one hidden layer. The neurons perform a weighted sum over the input layer, which is then passed through an activation function, typically the sigmoid activation function. Backpropagation uses supervised learning to teach the network: it repeatedly updates the weights of the network until the network produces the desired output. The following factors are responsible for the training and performance of the network:

o Random (initial) values of weights.
o The number of training cycles.
o The number of hidden neurons.
o The training set.
o Teaching parameter values such as learning rate and momentum.

Working of Backpropagation

Consider the diagram given below.

1. The interconnected paths transfer the inputs X.
2. The weights W are randomly selected and used to model the input.
3. The output is then calculated for every individual neuron as the signal passes from the input layer to the hidden layer and on to the output layer.
4. The errors in the outputs are evaluated: Error = Actual Output − Desired Output.
5. The errors are sent back from the output layer to the hidden layer to adjust the weights and lessen the error.
6. All of these steps are iterated until the desired result is achieved.
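Here is a minimal sketch of these six steps for a tiny network with one hidden layer and sigmoid activations; the XOR data, layer sizes, and learning rate are illustrative assumptions, not prescribed by the text:

import numpy as np

rng = np.random.default_rng(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # step 1: inputs X
d = np.array([[0], [1], [1], [0]], dtype=float)              # desired outputs (XOR)

# Step 2: weights (and biases) are randomly selected
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)   # input -> hidden
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)   # hidden -> output
eta = 0.5                                       # learning rate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(10000):                   # step 6: iterate until good enough
    h = sigmoid(X @ W1 + b1)             # step 3: forward pass, layer by layer
    y = sigmoid(h @ W2 + b2)
    error = d - y                        # step 4: error signal (desired minus actual)
    # Step 5: send errors back and adjust the weights
    delta_out = error * y * (1 - y)                  # sigmoid derivative at output
    delta_hid = (delta_out @ W2.T) * h * (1 - h)     # error propagated to hidden layer
    W2 += eta * h.T @ delta_out
    b2 += eta * delta_out.sum(axis=0)
    W1 += eta * X.T @ delta_hid
    b1 += eta * delta_hid.sum(axis=0)

print(np.round(y.ravel(), 2))  # should approach [0, 1, 1, 0]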

Need for Backpropagation

o Since it is fast as well as simple, it is very easy to implement.
o Apart from the number of inputs, it has no other parameters to tune.
o As it does not necessitate any prior knowledge of the network, it tends to be flexible.
o It is a standard method that generally works well.

Building an ANN

Before starting with building an ANN model, we will require a dataset on which our model is
going to work. The dataset is the collection of data for a particular problem, which is in the form
of a CSV file.

CSV stands for Comma-Separated Values, a format that saves data in tabular form. We are using a fictional dataset of a bank. The bank dataset contains the details of 10,000 customers. This exercise is undertaken because the bank is seeing unusual churn rates (customers leaving at an unusually high rate), and it wants to know the reason so that it can assess and address the problem.

So, we will start by installing the Keras library, the TensorFlow library, and the Theano library on the Anaconda Prompt; for that, you need to open it as administrator and run the installation commands one after another.
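Once the libraries are installed, a sketch of such a model might look as follows. This assumes the CSV has already been preprocessed into a feature matrix X_train (here, 11 numeric features per customer, a hypothetical count) and labels y_train (1 if the customer left); the layer sizes are illustrative:

from tensorflow import keras
from tensorflow.keras import layers

# A small feedforward (Sequential) network for binary churn classification.
# Assumes X_train has 11 preprocessed numeric features per customer.
model = keras.Sequential([
    layers.Input(shape=(11,)),
    layers.Dense(6, activation="relu"),     # hidden layer 1
    layers.Dense(6, activation="relu"),     # hidden layer 2
    layers.Dense(1, activation="sigmoid"),  # output: probability of churn
])

model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])

# model.fit(X_train, y_train, batch_size=32, epochs=100)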

Network Architecture

Computer Network Architecture is defined as the physical and logical design of the software, hardware, protocols, and media used for the transmission of data. Simply put, it describes how computers are organized and how tasks are allocated among them.

Two types of network architecture are used:

o Peer-To-Peer network
o Client/Server network

Peer-To-Peer network

o Peer-To-Peer network is a network in which all the computers are linked together with
equal privilege and responsibilities for processing the data.
o Peer-To-Peer network is useful for small environments, usually up to 10 computers.
o Peer-To-Peer network has no dedicated server.
o Special permissions are assigned to each computer for sharing the resources, but this can
lead to a problem if the computer with the resource is down.

Advantages of Peer-To-Peer Network:
o It is less costly as it does not contain any dedicated server.
o If one computer stops working, the other computers will not stop working.
o It is easy to set up and maintain as each computer manages itself.

Disadvantages of Peer-To-Peer Network:


o A Peer-To-Peer network does not contain a centralized system; therefore, it cannot back up the data, as the data sits in different locations.
o It has a security issue, as each device manages itself.

Client/Server Network

o Client/Server network is a network model designed for the end users called clients, to
access the resources such as songs, video, etc. from a central computer known as Server.
o The central controller is known as a server while all other computers in the network are
called clients.
o A server performs all the major operations such as security and network management.
o A server is responsible for managing all the resources such as files, directories, printer,
etc.
o All the clients communicate with each other through a server. For example, if client1
wants to send some data to client 2, then it first sends the request to the server for the
permission. The server sends the response to the client 1 to initiate its communication
with the client 2.

Advantages of Client/Server network:
o A Client/Server network contains the centralized system. Therefore we can back up the
data easily.
o A Client/Server network has a dedicated server that improves the overall performance of
the whole system.
o Security is better in Client/Server network as a single server administers the shared
resources.
o It also increases the speed of resource sharing.

Disadvantages of Client/Server network:


o Client/Server network is expensive as it requires the server with large memory.
o A server has a Network Operating System (NOS) to provide the resources to the clients,
but the cost of NOS is very high.
o It requires a dedicated network administrator to manage all the resources.

What is knowledge representation?

Humans are best at understanding, reasoning, and interpreting knowledge. Humans know things (that is knowledge), and they use that knowledge to perform various actions in the real world. How machines do all of these things falls under knowledge representation and reasoning. Hence, we can describe knowledge representation as follows:

o Knowledge representation and reasoning (KR, KRR) is the part of Artificial Intelligence concerned with how AI agents think and how thinking contributes to their intelligent behavior.
o It is responsible for representing information about the real world so that a computer can understand it and can utilize this knowledge to solve complex real-world problems, such as diagnosing a medical condition or communicating with humans in natural language.
o It also describes how we can represent knowledge in artificial intelligence. Knowledge representation is not just storing data in some database; it also enables an intelligent machine to learn from that knowledge and from experience so that it can behave intelligently like a human.

What to Represent:

Following are the kind of knowledge which needs to be represented in AI systems:

o Object: All the facts about objects in our world domain. E.g., guitars contain strings; trumpets are brass instruments.
o Events: Events are the actions which occur in our world.
o Performance: It describes behavior which involves knowledge about how to do things.
o Meta-knowledge: It is knowledge about what we know.
o Facts: Facts are the truths about the real world and what we represent.
o Knowledge-Base: The central component of a knowledge-based agent is the knowledge base, represented as KB. The knowledge base is a group of sentences (here, "sentence" is used as a technical term and is not identical to a sentence in the English language).

Knowledge: Knowledge is awareness or familiarity gained by experience of facts, data, and situations. Following are the types of knowledge in artificial intelligence.

Types of knowledge

Following are the various types of knowledge:

1. Declarative Knowledge:

o Declarative knowledge is to know about something.


o It includes concepts, facts, and objects.
o It is also called descriptive knowledge and expressed in declarative sentences.
o It is simpler than procedural knowledge.

2. Procedural Knowledge

o It is also known as imperative knowledge.


o Procedural knowledge is a type of knowledge which is responsible for knowing how to
do something.
o It can be directly applied to any task.
o It includes rules, strategies, procedures, agendas, etc.
o Procedural knowledge depends on the task on which it can be applied.

3. Meta-knowledge:

o Knowledge about the other types of knowledge is called Meta-knowledge.

4. Heuristic knowledge:

o Heuristic knowledge represents the knowledge of some experts in a field or subject. Heuristic knowledge consists of rules of thumb based on previous experience and awareness of approaches that are likely to work but are not guaranteed.

5. Structural knowledge:

o Structural knowledge is knowledge basic to problem-solving.


o It describes relationships between various concepts such as kind of, part of, and grouping
of something.
o It describes the relationship that exists between concepts or objects.

The relation between knowledge and intelligence:

Knowledge of the real world plays a vital role in intelligence, and the same holds for creating artificial intelligence. Knowledge plays an important role in demonstrating intelligent behavior in AI agents. An agent is only able to act accurately on some input when it has some knowledge or experience about that input.

Suppose you meet some person who is speaking in a language you don't know; how will you be able to act on that? The same thing applies to the intelligent behavior of agents.

As we can see in the diagram below, there is a decision maker which acts by sensing the environment and using knowledge. But if the knowledge part is not present, it cannot display intelligent behavior.

AI knowledge cycle:

An Artificial intelligence system has the following components for displaying intelligent
behavior:

o Perception
o Learning
o Knowledge Representation and Reasoning
o Planning
o Execution

The above diagram shows how an AI system can interact with the real world and which components help it to show intelligence. An AI system has a Perception component by which it retrieves information from its environment; this can be visual, audio, or another form of sensory input. The Learning component is responsible for learning from the data captured by the Perception component. In the complete cycle, the main components are Knowledge Representation and Reasoning; these two components are involved in showing intelligence in machines, as in humans. The two components are independent of each other but are also coupled together. Planning and Execution depend on the analysis of Knowledge Representation and Reasoning.

Approaches to knowledge representation:

There are mainly four approaches to knowledge representation, which are given below:

1. Simple relational knowledge:


o It is the simplest way of storing facts; it uses the relational method, and each fact about a set of objects is set out systematically in columns.
o This approach of knowledge representation is famous in database systems where the
relationship between different entities is represented.
o This approach has little opportunity for inference.

Example: The following is the simple relational knowledge representation.

Player     Weight    Age

Player1    65        23
Player2    58        18
Player3    75        24

2. Inheritable knowledge:
o In the inheritable knowledge approach, all data must be stored in a hierarchy of classes.
o All classes should be arranged in a generalized form or in a hierarchical manner.
o In this approach, we apply the inheritance property.
o Elements inherit values from other members of a class.
o This approach contains inheritable knowledge, which shows a relation between instance and class; this is called an instance relation.
o Every individual frame can represent the collection of attributes and its value.
o In this approach, objects and values are represented in Boxed nodes.
o We use Arrows which point from objects to their values.
o Example:

3. Inferential knowledge:
o The inferential knowledge approach represents knowledge in the form of formal logic.
o This approach can be used to derive more facts.
o It guarantees correctness.
o Example: Let's suppose there are two statements:
a. Marcus is a man
b. All men are mortal
Then this can be represented as:

man(Marcus)
∀x: man(x) → mortal(x)

4. Procedural knowledge:
o The procedural knowledge approach uses small programs and code that describe how to do specific things and how to proceed.
o In this approach, one important rule is used: the If-Then rule.
o With this kind of knowledge, we can use various coding languages such as the LISP language and the Prolog language.
o We can easily represent heuristic or domain-specific knowledge using this approach.
o But it is not necessarily possible to represent all cases in this approach. (A small code sketch contrasting the inferential and procedural approaches follows this list.)
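As a minimal illustration of the contrast between the inferential and procedural approaches (the rule set and helper names are hypothetical, not from the text), the Marcus example can be encoded with an If-Then rule and forward chaining:

# Facts and a single If-Then rule: if x is a man, then x is mortal.
facts = {("man", "Marcus")}
rules = [("man", "mortal")]  # (if-predicate, then-predicate)

def forward_chain(facts, rules):
    # Inferential step: derive new facts until nothing new can be added
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for if_pred, then_pred in rules:
            for pred, subject in list(derived):
                if pred == if_pred and (then_pred, subject) not in derived:
                    derived.add((then_pred, subject))
                    changed = True
    return derived

print(("mortal", "Marcus") in forward_chain(facts, rules))  # True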

Requirements for knowledge Representation system:

A good knowledge representation system must possess the following properties.

1. Representational Accuracy: The KR system should have the ability to represent all kinds of required knowledge.
2. Inferential Adequacy: The KR system should have the ability to manipulate the representational structures to produce new knowledge corresponding to the existing structures.
3. Inferential Efficiency: The ability to direct the inferential mechanism in the most productive directions by storing appropriate guides.
4. Acquisitional Efficiency: The ability to acquire new knowledge easily using automatic methods.

What Is Learning in ANN?


Basically, learning means adapting to change as and when there is a change in the environment. An ANN is a complex system, or, more precisely, a complex adaptive system, which can change its internal structure based on the information passing through it.

Why is it important?

Being a complex adaptive system, learning in an ANN implies that a processing unit is capable of changing its input/output behavior due to a change in the environment. Learning matters because, when a particular network is constructed, the activation function and the input/output vectors are fixed; so, to change the input/output behavior, we need to adjust the weights.
Classification

It may be defined as the process of learning to sort sample data into different classes by finding common features between samples of the same class. For example, to train an ANN we have some training samples with unique features, and to test it we have some testing samples with other unique features. Classification is an example of supervised learning.

Neural Network Learning Rules

We know that, during ANN learning, to change the input/output behavior we need to adjust the weights. Hence, a method is required by which the weights can be modified. These methods are called learning rules, which are simply algorithms or equations. Following are some learning rules for neural networks −
Hebbian Learning Rule

This rule, one of the oldest and simplest, was introduced by Donald Hebb in his book The Organization of Behavior in 1949. It is a kind of feed-forward, unsupervised learning.

Basic Concept − This rule is based on a proposal given by Hebb, who wrote −

"When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased."

From the above postulate, we can conclude that the connections between two neurons might be strengthened if the neurons fire at the same time and might weaken if they fire at different times.

Mathematical Formulation − According to the Hebbian learning rule, the weight of a connection is increased at every time step in proportion to the product of the input and output of the neuron:

Δwkj(n) = η yk(n) xj(n)

where η is the learning rate, xj(n) the presynaptic (input) signal, and yk(n) the postsynaptic (output) signal.
Perceptron Learning Rule

This rule is an error-correcting, supervised learning algorithm for single-layer feedforward networks with a linear activation function, introduced by Rosenblatt.

Mathematical Formulation − To explain its mathematical formulation, suppose we have a finite number N of input vectors x(n), along with their desired/target outputs t(n), where n = 1 to N. The output 'y' can be calculated, as explained earlier, on the basis of the net input, with an activation function applied over that net input, as follows −

y = f(yin) = 1 if yin > θ, 0 if yin ⩽ θ

where θ is the threshold.

The updating of weights can be done in the following two cases −

Case I − when t ≠ y, then

w(new) = w(old) + t·x

Case II − when t = y, then

No change in weight
Delta Learning Rule (Widrow−Hoff Rule)

It was introduced by Bernard Widrow and Marcian Hoff, and is also called the Least Mean Square (LMS) method; it minimizes the error over all training patterns. It is a kind of supervised learning algorithm with a continuous activation function.

Basic Concept − The base of this rule is the gradient-descent approach, which continues forever. The delta rule updates the synaptic weights so as to minimize the difference between the net input to the output unit and the target value.

Mathematical Formulation − To update the synaptic weights, the delta rule is given by

Δwi = α (t − yin) xi

where α is the learning rate, xi the input, yin the net input to the output unit, and t the target output.
Competitive Learning Rule (Winner−Takes−All)

It is concerned with unsupervised training, in which the output nodes compete with each other to represent the input pattern. To understand this learning rule, we must understand the competitive network, which is given as follows −

Basic Concept of Competitive Network − This network is just like a single-layer feedforward network with feedback connections between the outputs. The connections between outputs are of the inhibitory type, shown by dotted lines, which means the competitors never support themselves.

Basic Concept of Competitive Learning Rule − As said earlier, there is a competition among the output nodes. Hence, the main concept is that during training, the output unit with the highest activation for a given input pattern is declared the winner. This rule is also called winner-takes-all because only the winning neuron is updated and the rest of the neurons are left unchanged.

Mathematical Formulation − Following are the three important factors for the mathematical formulation of this learning rule −

• Condition to be a winner − If a neuron yk wants to be the winner, then

yk = 1 if vk > vj for all j, j ≠ k; otherwise yk = 0

This means that if a neuron, say yk, wants to win, then its induced local field (the output of the summation unit), say vk, must be the largest among all the other neurons in the network.

• Condition on the sum total of weights − Another constraint of the competitive learning rule is that the sum total of weights to a particular output neuron must be 1. For example, for neuron k −

∑j wkj = 1 for all k

• Change of weight for the winner − If a neuron does not respond to the input pattern, then no learning takes place in that neuron. However, if a particular neuron wins, then the corresponding weights are adjusted as follows −

Δwkj = α (xj − wkj) if neuron k wins; Δwkj = 0 if neuron k loses

This clearly shows that we favor the winning neuron by adjusting its weights toward the input pattern; if a neuron loses, we need not bother to re-adjust its weights.
Outstar Learning Rule

This rule, introduced by Grossberg, is concerned with supervised learning because the desired outputs are known. It is also called Grossberg learning.

Basic Concept − This rule is applied over the neurons arranged in a layer. It is specially designed to produce a desired output d from the layer of p neurons.

Mathematical Formulation − The weight adjustments in this rule are computed as follows:

Δwj = α (d − wj)

Here d is the desired neuron output and α is the learning rate.
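To make the weight-update rules above concrete, here is a minimal NumPy sketch of one update step for each rule; the vectors, rates, and thresholds are illustrative assumptions:

import numpy as np

def hebbian_update(w, x, y, eta=0.1):
    # Hebb: strengthen weights in proportion to correlated pre/post activity
    return w + eta * y * x

def perceptron_update(w, x, t, theta=0.0, eta=1.0):
    # Perceptron: update only when the predicted class is wrong
    y = 1 if w @ x > theta else 0
    return w + eta * t * x if t != y else w

def delta_update(w, x, t, alpha=0.1):
    # Delta (Widrow-Hoff): step against the error between target and net input
    y_in = w @ x
    return w + alpha * (t - y_in) * x

def competitive_update(W, x, alpha=0.1):
    # Winner-takes-all: only the winning row moves toward the input
    k = np.argmax(W @ x)            # winner by largest induced local field
    W = W.copy()
    W[k] += alpha * (x - W[k])
    return W

x = np.array([1.0, 0.5]); w = np.zeros(2)
print(delta_update(w, x, t=1.0))  # weights move so as to reduce the error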

We define learning in the context of neural networks as:


Learning is a process by which the free parameters of a neural network are adapted through a
process of stimulation by the environment in which the network is embedded. The type of
learning is determined by the manner in which the parameter changes take place.

This definition of the learning process implies the following sequence of events:
1. The neural network is stimulated by an environment.

2. The neural network undergoes changes in its free parameters as a result of this stimulation.

3. The neural network responds in a new way to the environment because of the changes which
have occurred in its internal structure.

A prescribed set of well-defined rules for the solution of a learning problem is called a learning
algorithm. There is no unique learning algorithm for the design of neural networks. Rather, we
have a kit of tools represented by a diverse variety of learning algorithms, each of which offers
advantages of its own. Basically, learning algorithms differ from each other in the way in which
the adjustment to a synaptic weight of a neuron is formulated.

Another factor to be considered is the manner in which a neural network (learning machine),
made up of a set of interconnected neurons, reacts to its environment. In this latter context we
speak of a learning paradigm which refers to a model of the environment in which the neural
network operates.

The five learning rules:

1. Error-correction learning,
2. Memory-based learning,
3. Hebbian learning,
4. Competitive learning, and
5. Boltzmann learning

are basic to the design of neural networks.

Some of these algorithms require the use of a teacher and some do not; they are called supervised and unsupervised learning, respectively.

In the study of supervised learning, a key provision is a 'teacher' capable of supplying exact corrections to the network outputs when an error occurs. Such a method is not possible in a biological organism, which has neither the exact reciprocal nervous connections needed for the back propagation of error corrections nor the nervous means for the imposition of behavior from outside. Nevertheless, supervised learning has established itself as a powerful paradigm for the design of artificial neural networks. In contrast, self-organized (unsupervised) learning is motivated by neurobiological considerations.

Learning Rules of Neurons in Neural Networks:

The five basic learning rules for neurons are:

1. Error-correction learning,
2. Memory-based learning,
3. Hebbian learning,
4. Competitive learning, and
5. Boltzmann learning.

Error-correction learning is rooted in optimum filtering; memory-based learning and competitive learning are both inspired by neurobiological considerations. Boltzmann learning is different and is based on ideas borrowed from statistical mechanics. Two learning paradigms, learning with a teacher and learning without a teacher, including the credit-assignment problem, which is basic to the learning process, are also discussed.
1. Error-Correction Learning:
To illustrate our first learning rule, consider the simple case of a neuron k constituting the only computational node in the output layer of a feedforward neural network, as depicted in Fig. 11.21. Neuron k is driven by a signal vector x(n) produced by one or more layers of hidden neurons, which are themselves driven by an input vector (stimulus) applied to the source nodes (i.e., the input layer) of the neural network.

The argument n denotes discrete time, or more precisely, the time step of the iterative process involved in adjusting the synaptic weights of neuron k. The output signal of neuron k is denoted yk(n). This output signal, representing the only output of the neural network, is compared to a desired response or target output, denoted by dk(n). Consequently, an error signal, denoted by ek(n), is produced. By definition, we thus have

ek(n) = dk(n) − yk(n)

The error signal ek(n) actuates a control mechanism, the purpose of which is to apply a sequence of corrective adjustments to the synaptic weights of neuron k. The corrective adjustments are designed to make the output signal yk(n) come closer to the desired response dk(n) in a step-by-step manner.
This objective is achieved by minimizing a cost function or index of performance ԑ(n) defined in terms of the error signal ek(n) as:

ԑ(n) = (1/2) ek²(n)

That is, ԑ(n) is the instantaneous value of the error energy. The step-by-step adjustments to the synaptic weights of neuron k are continued until the system reaches a steady state (i.e., the synaptic weights are essentially stabilized). At that point the learning process is terminated.

The learning process described herein is obviously referred to as error-correction learning. In particular, minimization of the cost function ԑ(n) leads to a learning rule commonly referred to as the delta rule or Widrow-Hoff rule, named in honor of its originators. Let ωkj(n) denote the value of synaptic weight ωkj of neuron k excited by element xj(n) of the signal vector x(n) at time step n. According to the delta rule, the adjustment Δωkj(n) applied to the synaptic weight ωkj at time step n is defined by

Δωkj(n) = η ek(n) xj(n)

where η is a positive constant which determines the rate of learning as we proceed from one step in the learning process to another. It is therefore natural that we refer to η as the learning-rate parameter.

In other words, the delta rule may be stated as:


The adjustment made to a synaptic weight of a neuron is proportional to the product of the error
signal and the input signal of the synapse in question.

The delta rule, as stated herein, presumes that the error signal is directly measurable. For this
measurement to be feasible we clearly need a supply of desired response from some external
source, which is directly accessible to neuron k.

In other words, neuron k is visible to the outside world, as depicted in Fig. 11.21(a). From this figure we also observe that error-correction learning is in fact local in nature. This amounts to saying that the synaptic adjustments made by the delta rule are localized around neuron k.

Having computed the synaptic adjustment Δωkj(n), the updated value of the synaptic weight ωkj is given by

ωkj(n + 1) = ωkj(n) + Δωkj(n)    (11.26)

Effectively, ωkj(n) and ωkj(n + 1) may be viewed as the old and new values of synaptic weight ωkj, respectively.
In computational terms we may also write:

ωkj(n) = z⁻¹[ωkj(n + 1)]

where z⁻¹ is the unit-delay operator. That is, z⁻¹ represents a storage element.
Fig. 11.21(b) shows a signal-flow graph representation of the error-correction learning process,
with regard to neuron k. The input signal xj and the induced local field vk of the neuron k are
referred to as presynaptic and postsynaptic signals of the jth synapse of neuron k, respectively.
Also, the Fig. shows that the error-correction learning is an example of a closed-loop feedback
system.
But from control theory we know that the stability of such a system is determined by the parameters which constitute the feedback loops of the system. In this case there is a single feedback loop, and the parameter of interest is η, the learning rate. So, to ensure the stability and convergence of the iterative learning process, η should be selected judiciously.

2. Memory-Based Learning:
In memory-based learning, all (or most) of the past experiences are explicitly stored in a large memory of correctly classified input-output examples: {(xi, di)}, i = 1, …, N, where xi denotes an input vector and di denotes the corresponding desired response. Without loss of generality, we have restricted the desired response to be a scalar.
For example, in a binary pattern classification problem there are two classes (hypotheses), denoted by ԑ1 and ԑ2, to be considered. In this example, the desired response di takes the value 0 (or -1) for class ԑ1 and the value 1 for class ԑ2. When classification of a test vector xtest (not seen before) is required, the algorithm responds by retrieving and analyzing the training data in a "local neighborhood" of xtest.
All memory-based learning algorithms involve two essential ingredients:
a. Criterion used for defining the local neighborhood of the test vector xtest.
b. Learning rule applied to the training examples in the local neighborhood of xtest.
The algorithms differ from each other in the way in which these two ingredients are defined.

In a simple yet effective type of memory-based learning known as the nearest neighbor rule, the local neighborhood is defined as the training example that lies in the immediate neighborhood of the test vector xtest. In particular, the vector x'N ∈ {x1, x2, …, xN} is said to be the nearest neighbor of xtest if

min over i of d(xi, xtest) = d(x'N, xtest)

where d(xi, xtest) is the Euclidean distance between the vectors xi and xtest. The class associated with the minimum distance, that is, with vector x'N, is reported as the classification of xtest. This rule is independent of the underlying distribution responsible for generating the training examples.
Cover and Hart (1967) have formally studied the nearest neighbor rule as a tool for pattern
classification.

The analysis is based on two assumptions:


a. The classified examples (xi, di) are independently and identically distributed (iid), according to
the joint probability distribution of the example (x, d).
b. The sample size N is infinitely large.

Under these two assumptions, it is shown that the probability of classification error incurred by the nearest neighbor rule is bounded above by twice the Bayes probability of error, that is, twice the minimum probability of error over all decision rules. In this sense, it may be said that half the classification information in a training set of infinite size is contained in the nearest neighbor, which is a surprising result.

A variant of the nearest neighbor classifier is the k-nearest neighbor classifier, which proceeds as follows:

a. Identify the k classified patterns that lie nearest to the test vector xtest, for some integer k.
b. Assign xtest to the class (hypothesis) that is most frequently represented among the k nearest neighbors of xtest (i.e., use a majority vote to make the classification).

Thus, the k-nearest neighbor classifier acts like an averaging device.
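A minimal sketch of the k-nearest neighbor rule as just described, using Euclidean distance; the stored examples and the choice k = 3 are illustrative assumptions:

import numpy as np
from collections import Counter

def knn_classify(X_train, d_train, x_test, k=3):
    # Euclidean distance from x_test to every stored example
    distances = np.linalg.norm(X_train - x_test, axis=1)
    # Indices of the k nearest stored examples
    nearest = np.argsort(distances)[:k]
    # Majority vote among the k nearest neighbors' labels
    return Counter(d_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
d_train = np.array([0, 0, 1, 1])
print(knn_classify(X_train, d_train, np.array([0.8, 0.9])))  # 1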

3. Hebbian Learning (Generalized Learning) Unsupervised Learning:


Hebb’s postulate of learning is the oldest and the most famous of all learning rules; it is named in
honor of the neuropsychologist Hebb (1949).

When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part
in firing it, some growth process or metabolic changes take place in one or both cells such that
A’s efficiency as one of the cells firing B, is increased.

Hebb proposed this change as a basis of associative learning (at the cellular level), which would
result in an enduring modification in the activity pattern of a spatially distributed “assembly of
nerve cells”.

This statement is made in a neurobiological context. We may expand and rephrase it as a two-part rule:
a. If two neurons on either side of a synapse are activated simultaneously (i.e., synchronously),
then the strength of that synapse is selectively increased.

b. If two neurons on either side of a synapse are activated asynchronously, then that synapse is
selectively weakened or eliminated.

Such a synapse is called a Hebbian synapse. More precisely, we define a Hebbian synapse as a
synapse that uses a time-dependent, highly local, and strongly interactive mechanism to
increase synaptic efficiency as a function of the correlation between the presynaptic and
postsynaptic activities.

From this definition we may deduce the following four key properties which characterise a
Hebbian synapse:
i. Time-Dependent Mechanism:
This mechanism refers to the fact that the modifications in a Hebbian synapse depend on the
exact time of occurrence of the presynaptic and postsynaptic signals.

ii. Local Mechanism:
By its very nature, a synapse is the transmission site where information-bearing signals
(representing ongoing activity in the presynaptic and postsynaptic units) are in spatiotemporal
contiguity. This locally available information is used by a Hebbian synapse to produce a local
synaptic modification that is input specific.

iii. Interactive Mechanism:


The occurrence of a change in a Hebbian synapse depends on signals on both sides of the
synapse. That is, a Hebbian form of learning depends on a “true interaction” between presynaptic
and postsynaptic signals in the sense that we cannot make a prediction from either one of these
two activities by itself.

iv. Conjunctional or Correlation Mechanism:


One interpretation of Hebb’s postulate of learning is that the condition for a change in synaptic
efficiency is the conjunction of presynaptic and postsynaptic signals. Thus, according to this
interpretation, the co-occurrence, of presynaptic and postsynaptic signals (within a short interval
of time) is sufficient to produce the synaptic modification. It is for this reason that a Hebbian
synapse is sometimes referred to as a conjunctional synapse or correlation synapse.

4. Competitive Learning (Unsupervised Learning):


In competitive learning, as the name implies, the output neurons of a neural network compete
among themselves to become active. Whereas in a neural network based on Hebbian learning
several output neurons may be active simultaneously, in competitive learning only a single
output neuron is active at any one time. It is this feature which makes competitive learning
highly suited to discover statistically salient features which may be used to classify a set of input
patterns.

There are three basic elements to a competitive learning rule:


i. A set of neurons which are all the same except for some randomly distributed synaptic weights,
and which therefore respond differently to a given set of input patterns.

ii. A limit imposed on the ‘strength’ of each neuron.

iii. A mechanism which permits the neurons to compete for the right to respond to a given subset
of inputs, such that only one output neuron, or only one neuron per group, is active (i.e., ‘on’) at a
time. The neuron which wins the competition is called a winner-takes-all neuron.

Accordingly, the individual neurons of the network learn to specialize on ensembles of similar
patterns; in so doing, they become feature detectors for different classes of input patterns.

In the simplest form of competitive learning, the neural network has a single layer of output
neurons, each of which is fully connected to the input nodes. The network may also include
feedback connections among the neurons. In such an architecture the feedback connections
perform lateral inhibition, with each neuron tending to inhibit the neuron to which it is laterally
connected, whereas the feedforward synaptic connections are all excitatory.

For a neuron k to be the winning neuron, its induced local field vk for a specified input pattern x
must be the largest among all the neurons in the network. The output signal yk of winning neuron
k is set equal to one; the output signals of all the neurons that lose the competition are set
equal to zero.
We thus write:

yk = 1 if vk > vj for all j, j ≠ k
yk = 0 otherwise

where the induced local field vk represents the combined action of all the forward and feedback
inputs to neuron k.
Let ωkj denote the synaptic weight connecting input node j to neuron k. Suppose that each neuron
is allotted a fixed amount of synaptic weight (i.e., all synaptic weights are positive), which is
distributed among its input nodes; that is, for all k

∑j ωkj = 1
A neuron then learns by shifting synaptic weights from its inactive to active input nodes. If a
neuron does not respond to a particular input pattern, no learning takes place in that neuron.

If a particular neuron wins the competition, each input node of that neuron relinquishes some
proportion of its synaptic weight, and the weight relinquished is then distributed equally among
the active input nodes. According to the standard competitive learning rule, the change
Δωkj applied to synaptic weight ωkj is defined by

Δωkj = ƞ(xj – ωkj) if neuron k wins the competition
Δωkj = 0 if neuron k loses the competition

where ƞ is the learning-rate parameter. This rule has the overall effect of moving the synaptic
weight vector ωk of winning neuron k toward the input pattern x.
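
A minimal NumPy sketch of this winner-takes-all update follows; the network size, learning rate, and the explicit re-normalization step (which keeps each neuron's total weight fixed, as assumed above) are illustrative choices:

import numpy as np

def competitive_step(W, x, eta=0.1):
    # W: (neurons, inputs) matrix of positive weights, each row summing to 1
    v = W @ x                      # induced local fields v_k
    k = int(np.argmax(v))          # winner-takes-all neuron
    W[k] += eta * (x - W[k])       # Δω_kj = ƞ(x_j - ω_kj) for the winner only
    W[k] /= W[k].sum()             # keep the winner's total weight at 1
    return k

rng = np.random.default_rng(0)
W = rng.random((3, 4))
W /= W.sum(axis=1, keepdims=True)  # enforce Σ_j ω_kj = 1 for all k
winner = competitive_step(W, np.array([0.9, 0.1, 0.0, 0.0]))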
5. Boltzmann Learning:
The Boltzmann learning rule, named in honor of Ludwig Boltzmann, is a stochastic learning
algorithm derived from ideas rooted in statistical mechanics. A neural network designed on the
basis of the Boltzmann learning rule is called a Boltzmann machine.

In a Boltzmann machine the neurons constitute a recurrent structure, and they operate in a binary
manner; for example, they are either in an ‘on’ state denoted by +1 or in an ‘off’ state
denoted by -1. The machine is characterized by an energy function E, the value of which is
determined by the particular states occupied by the individual neurons of the machine, as shown
by

E = –½ ∑j ∑k ωkj xk xj, j ≠ k

Where xj is the state of neuron j and ωkj is the synaptic weight connecting neuron j to neuron k.
The fact that j ≠ k means simply that none of the neurons in the machine has self-feedback. The
machine operates by choosing a neuron at random, say neuron k, at some step of the learning
process, and then flipping the state of neuron k from state xk to state –xk at some
temperature T with probability

P(xk → –xk) = 1 / (1 + exp(–ΔEk/T))

where ΔEk is the energy change (i.e., the change in the energy function of the machine)
resulting from such a flip. We may note that T is not a physical temperature, but rather a
pseudotemperature, as in the stochastic model of a neuron. If this rule is applied repeatedly, the
machine will reach thermal equilibrium.
The neurons of a Boltzmann machine partition into two functional groups:
a. Visible and

b. Hidden.

The visible neurons provide an interface between the network and the environment in which it
operates, whereas the hidden neurons always operate freely.

There are two modes of operation to be considered:


I. Clamped condition, in which the visible neurons are all clamped onto specific states
determined by the environment.

II. Free-running condition, in which all the neurons (visible and hidden) are allowed to operate
freely.

Let ρ+kj denote the correlation between the states of neurons j and k with the network in its
clamped condition, and ρ–kj the correlation between the states of neurons j and k with the network
in its free-running condition. Both correlations are averaged over all possible states of the
machine when it is in thermal equilibrium.
Then, according to the Boltzmann learning rule, the change Δωkj applied to the synaptic
weight ωkj from neuron j to neuron k is defined by

Δωkj = ƞ(ρ+kj – ρ–kj), j ≠ k

where ƞ is the learning-rate parameter. Moreover, both ρ+kj and ρ–kj range in value from -1 to +1.
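
The flip rule can be sketched in a few lines of NumPy. This is a minimal illustration, not a full Boltzmann machine: the weight matrix W is assumed symmetric with a zero diagonal, the state vector x holds ±1 values, and the sign convention for ΔEk is stated in the comments because it varies across texts:

import numpy as np

rng = np.random.default_rng(1)

def boltzmann_flip(x, W, k, T):
    # Energy decrease produced by flipping x[k] -> -x[k], derived from
    # E = -1/2 * sum_{j != k} w_kj * x_k * x_j (W symmetric, zero diagonal).
    # Sign conventions differ across texts; here a positive delta_E
    # means the flip lowers the machine's energy.
    delta_E = -2.0 * x[k] * (W[k] @ x)
    # Flip with probability 1 / (1 + exp(-delta_E / T))
    if rng.random() < 1.0 / (1.0 + np.exp(-delta_E / T)):
        x[k] = -x[k]
    return x

Under this convention, energy-lowering flips are accepted with high probability, so repeated application drives the machine toward thermal equilibrium.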

Hebbian Learning Rule with Implementation of AND Gate


The Hebbian Learning Rule, also known as the Hebb Learning Rule, was proposed by Donald O. Hebb.
It is one of the first and also one of the simplest learning rules for neural networks. It is used for
pattern classification. The network here is a single-layer neural network, i.e., it has one input layer
and one output layer. The input layer can have many units, say n, while the output layer has only
one unit. The Hebbian rule works by updating the weights between neurons for each training sample.
Hebbian Learning Rule Algorithm:
1. Set all weights to zero, wi = 0 for i = 1 to n, and bias to zero.
2. For each training pair s : t (input vector s, target output t), repeat steps 3-5.
3. Set activations for input units with the input vector: xi = si for i = 1 to n.
4. Set the corresponding output value to the output neuron, i.e., y = t.
5. Update weight and bias by applying the Hebb rule for all i = 1 to n:

wi(new) = wi(old) + xi·y
b(new) = b(old) + y

Implementing AND Gate:

Truth table of the AND gate with bipolar inputs and targets:

x1 | x2 | b | t
-1 | -1 | 1 | -1
-1 |  1 | 1 | -1
 1 | -1 | 1 | -1
 1 |  1 | 1 |  1

There are 4 training samples, so there will be 4 iterations. Also, the representation
used here is bipolar, so the values lie in the range [-1, 1].
Step 1:
Set weights and bias to zero, w = [0 0 0]T (the third component is the bias weight, b = 0).
Step 2:
Set input vector Xi = Si for i = 1 to 4.
X1 = [-1 -1 1] T
X2 = [-1 1 1] T
X3 = [1 -1 1] T
X4 = [1 1 1] T
Step 3:
Output value is set to y = t.
Step 4:
Modifying weights using Hebbian Rule:
First iteration –
w(new) = w(old) + x1y1 = [0 0 0]T + [-1 -1 1]T · (-1) = [1 1 -1]T
For the second iteration, the final weight of the first one is used, and so on.
Second iteration –
w(new) = [1 1 -1]T + [-1 1 1]T · (-1) = [2 0 -2]T
Third iteration –
w(new) = [2 0 -2]T + [1 -1 1]T · (-1) = [1 1 -3]T
Fourth iteration –
w(new) = [1 1 -3]T + [1 1 1]T · (1) = [2 2 -2]T
So, the final weight matrix is [2 2 -2]T

Testing the network:

The network with the final weights

For x1 = -1, x2 = -1, b = 1, Y = (-1)(2) + (-1)(2) + (1)(-2) = -6


For x1 = -1, x2 = 1, b = 1, Y = (-1)(2) + (1)(2) + (1)(-2) = -2
For x1 = 1, x2 = -1, b = 1, Y = (1)(2) + (-1)(2) + (1)(-2) = -2
For x1 = 1, x2 = 1, b = 1, Y = (1)(2) + (1)(2) + (1)(-2) = 2
The sign of Y agrees with the target output in every case, so the results are all compatible with the truth table.
Decision Boundary:
y = 2x1 + 2x2 – 2b
Setting y = 0: 2x1 + 2x2 – 2b = 0
Since the bias input is b = 1: 2x1 + 2x2 – 2(1) = 0
2(x1 + x2) = 2
The final equation of the boundary: x2 = -x1 + 1

Decision Boundary of AND Function
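
The worked example above can be verified with a short NumPy sketch; it reproduces both the final weights [2 2 -2]T and the test outputs (variable names are illustrative):

import numpy as np

# Bipolar AND training set: columns are x1, x2, bias input
X = np.array([[-1, -1, 1],
              [-1,  1, 1],
              [ 1, -1, 1],
              [ 1,  1, 1]])
t = np.array([-1, -1, -1, 1])   # bipolar targets

w = np.zeros(3)                 # step 1: weights (incl. bias weight) at zero
for x_i, t_i in zip(X, t):      # steps 2-5: one Hebb update per training pair
    w += x_i * t_i              # w(new) = w(old) + x·y, with y = t
print(w)                        # -> [ 2.  2. -2.]

print(X @ w)                    # -> [-6. -2. -2.  2.]; signs match the targets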

Neural Networks with Memory

We have always heard that Neural Networks (NNs) are inspired by biological neural networks,
and the resemblance is reflected directly in their structure. Figure 1 shows the anatomy of a single
neuron. The central part is called the cell body, where the nucleus resides. Several fibres
(dendrites) pass stimuli to the cell body, and a few fibres (the axon and its branches) send the
output to other neurons. The thickness of a dendrite indicates the weight/bias/strength of the
stimulus. Many neurons with various cell bodies are stacked up to form the biological neural network.

Figure 1: Anatomy of Single Neuron (Source, Edited by author)

Figure 2: Single neuron neural network (Image created by author)

Advantages of neural networks over traditional machine learning algorithms

• Various types and sizes of data can be handled

• Multiple functions can be easily configured

• Non-linear data can be efficiently handled

Neural Networks with memory

The main difference between the functioning of artificial neural networks and the biological
neural network is memory. While both the human brain and neural networks have the ability to
read and write from the memory available, the brain can create/store the memory as well.
Researchers identified that this key difference is the major roadblock for today’s AI systems
to reach human-level intelligence. Researchers at DeepMind aimed to build a differentiable
computer by putting together a neural network and linking it to external memory. The neural
network would act as a CPU with a memory attached. Such differentiable computers aim to learn
programs (algorithms) from input and output data. Neural networks are used when the amount
of data is huge; for example, text data has an enormous number of dimensions, and image data
is split into a huge number of pixels.

Recurrent Neural Network

A movie consists of a sequence of scenes. When we watch a particular scene, we don’t try to
understand it in isolation, but rather in connection with previous scenes. In a similar fashion, a
machine learning model has to understand text by utilizing already-learned text, just like in a
human neural network. Traditional machine learning models cannot store a model’s previous
stages. However, Recurrent Neural Networks (commonly called RNNs) can do this for us.
Let’s take a closer look at RNNs below.

Figure 3: Working of a basic RNN (Image by Author)

An RNN has a repeating module that takes input from the previous stage and gives its output as
input to the next stage. However, in RNNs we can only retain information from the most recent
stage. That’s where LSTMs come into the picture.

Long Short Term Memory Networks

To learn long-term dependencies, our network needs memorization power. LSTMs are a special
case of RNNs which can do that. They have the same chain-like structure as RNNs, but with a
different repeating module structure.

Figure 4: Working of LSTM (Image by author)

LSTMs have a wide range of applications in sequence-to-sequence modeling tasks like speech
recognition, text summarization, and video classification. For example, a spam detection model
can be built by converting text data into vectors, creating an LSTM model, and fitting the
model with the vectors.
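
A minimal sketch of that spam-detection idea, assuming TensorFlow/Keras is available; the vocabulary size, sequence length, layer sizes, and placeholder data are illustrative choices, not prescribed values:

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

vocab_size, seq_len = 10000, 100          # assumed preprocessing choices

model = Sequential([
    Embedding(vocab_size, 32),            # token ids -> dense vectors
    LSTM(64),                             # sequence memory
    Dense(1, activation="sigmoid"),       # P(spam)
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

# Placeholder data: integer-encoded token sequences and 0/1 labels
X = np.random.randint(0, vocab_size, size=(8, seq_len))
y = np.random.randint(0, 2, size=(8,))
model.fit(X, y, epochs=1, verbose=0)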
Deep Learning models are broadly classified into supervised and unsupervised models.
Supervised DL models:
• Artificial Neural Networks (ANNs)
• Recurrent Neural Networks (RNNs)
• Convolutional Neural Networks (CNNs)
Unsupervised DL models:
• Self-Organizing Maps (SOMs)
• Boltzmann Machines
• Autoencoders

Let us learn what exactly Boltzmann machines are, how they work, and how they can be used to
build a recommender system that predicts whether a user will like a movie based on the movies
watched previously.
The Boltzmann Machine is an unsupervised DL model in which every node is connected to every
other node. That is, unlike ANNs, CNNs, RNNs and SOMs, Boltzmann Machines
are undirected (the connections are bidirectional). The Boltzmann Machine is not a deterministic
DL model but a stochastic or generative one. It is rather a representation of a certain
system. There are two types of nodes in the Boltzmann Machine: visible nodes, those
nodes which we can and do measure, and hidden nodes, those nodes which we cannot or
do not measure. Although the node types are different, the Boltzmann machine treats them all as
part of one single system. The training data is fed into the Boltzmann Machine and the weights of
the system are adjusted accordingly. Boltzmann machines help us understand abnormalities by
learning how the system works under normal conditions.

Boltzmann Machine

Energy-Based Models:
The Boltzmann Distribution is used as the sampling distribution of the Boltzmann Machine. It is
governed by the equation –

Pi = e^(–∈i/kT) / ∑j e^(–∈j/kT)

Pi – probability of the system being in state i
∈i – energy of the system in state i
T – temperature of the system
k – Boltzmann constant
∑j e^(–∈j/kT) – sum over all possible states of the system
The Boltzmann Distribution describes the different states of the system, and Boltzmann machines
create different states of the machine using this distribution. From the above equation, as the
energy of the system increases, the probability of the system being in state i decreases. Thus, the
system is most stable in its lowest energy state (just as a gas is most stable when it has spread out).
In Boltzmann machines, the energy of the system is defined in terms of the weights of the synapses.
Once the system is trained and the weights are set, the system always tries to find the lowest
energy state for itself by adjusting the weights.
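
As a quick numerical illustration of the equation above, here is a small NumPy sketch (the energy values and kT are arbitrary):

import numpy as np

def boltzmann_probs(energies, kT=1.0):
    # P_i = exp(-E_i / kT) / sum_j exp(-E_j / kT)
    w = np.exp(-np.asarray(energies) / kT)
    return w / w.sum()

print(boltzmann_probs([0.0, 1.0, 2.0], kT=1.0))   # ~[0.665, 0.245, 0.090]
print(boltzmann_probs([0.0, 1.0, 2.0], kT=10.0))  # nearly uniform at high T

Lower-energy states dominate at low temperature, while raising T flattens the distribution, which is exactly the stability behaviour described above.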

Types of Boltzmann Machines:
• Restricted Boltzmann Machines (RBMs)
• Deep Belief Networks (DBNs)
• Deep Boltzmann Machines (DBMs)
Restricted Boltzmann Machines (RBMs):
In a full Boltzmann machine, each node is connected to every other node, and hence the number
of connections grows very quickly (quadratically in the number of nodes). This is the reason we
use RBMs. The restrictions on the node connections in RBMs are as follows –
• Hidden nodes cannot be connected to one another.
• Visible nodes cannot be connected to one another.
The energy function for a Restricted Boltzmann Machine is –
E(v, h) = –∑i ai·vi – ∑j bj·hj – ∑i ∑j vi·wi,j·hj
ai, bj – biases of the system (constants)
vi, hj – visible node, hidden node
P(v, h) – probability of being in a certain state:
P(v, h) = e^(–E(v, h)) / Z
Z – partition function: the sum over all possible states
Suppose that we are using our RBM to build a recommender system that works on six (6)
movies. The RBM learns how to allocate the hidden nodes to certain features. Through the process
of Contrastive Divergence, we fit the RBM to our set of movies, that is, to our particular
scenario. The RBM identifies which features are important during the training process. Each
training value is either 1, 0 or missing, based on whether a user liked that movie (1), disliked that
movie (0) or did not watch the movie (missing data). The RBM automatically identifies the
important features.

Contrastive Divergence:
The RBM adjusts its weights by this method. Using some randomly assigned initial weights, the
RBM calculates the hidden nodes, which in turn use the same weights to reconstruct the visible
(input) nodes. Each hidden node is constructed from all the visible nodes, and each visible node is
reconstructed from all the hidden nodes; hence the reconstructed input differs from the original
input, even though the weights are the same. The process continues until the reconstructed input
matches the previous input, at which point the process is said to have converged. This iterative
procedure is known as Gibbs Sampling.

Gibbs sampling
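
A minimal NumPy sketch of one Gibbs-sampling cycle for a binary RBM (the parametrization with weights W, visible biases a, and hidden biases b is standard; names and sizes are illustrative):

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gibbs_step(v, W, a, b):
    # Sample hidden units given the visible units
    h = (rng.random(W.shape[1]) < sigmoid(b + v @ W)).astype(float)
    # Reconstruct visible units from the hidden units, using the same weights
    v_new = (rng.random(W.shape[0]) < sigmoid(a + h @ W.T)).astype(float)
    return v_new, h

# Repeating gibbs_step until the reconstruction stabilizes is the loop
# described above; contrastive divergence truncates it after a few steps.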

The Gradient Formula gives the gradient of the log probability of a certain state of the system
with respect to the weights of the system. It is given as follows –
∂/∂wij (log P(v0)) = <vi0 · hj0> – <vi∞ · hj∞>
v – visible state, h – hidden state
<vi0 · hj0> – initial state of the system
<vi∞ · hj∞> – final state of the system
P(v0) – probability that the system is in state v0
wij – weights of the system
The above equation tells us how a change in the weights of the system will change the log
probability of the system being in a particular state. The system tries to end up in the lowest
possible energy state (the most stable). Instead of continuing to adjust the weights until the
current input matches the previous one, we can also consider only the first few passes. It is
sufficient to understand how to adjust our curve so as to get the lowest energy state. Therefore,
we adjust the weights and redesign the system and energy curve such that we get the lowest
energy for the current position. This is known as Hinton’s shortcut.

Hinton’s Shortcut

Working of RBM – Illustrative Example –


Consider – Mary watches four of the six available movies and rates them. Say she watched
m1, m3, m4 and m5; she likes m3 and m5 (rated 1) and dislikes the other two, m1 and m4
(rated 0), whereas the other two movies, m2 and m6, are unrated. Now, using our RBM, we
will recommend one of these movies for her to watch next. Say –
• m3, m5 are of the ‘Drama’ genre.
• m1, m4 are of the ‘Action’ genre.
• ‘Vicario’ played a role in m5.
• m3, m5 have won an ‘Oscar.’
• ‘Tarantino’ directed m4.
• m2 is of the ‘Action’ genre.
• m6 is of both the genres ‘Action’ and ‘Drama’, ‘Vicario’ acted in it, and it has won an
‘Oscar’.
We have the following observations –
• Mary likes m3 and m5, and they are of the ‘Drama’ genre, so she probably likes ‘Drama’ movies.
• Mary dislikes m1 and m4, and they are of the ‘Action’ genre, so she probably dislikes ‘Action’ movies.
• Mary likes m3 and m5, and they have won an ‘Oscar’, so she probably likes ‘Oscar’-winning movies.
• Since ‘Vicario’ acted in m5 and Mary likes it, she will probably like a movie in
which ‘Vicario’ acted.
• Mary does not like m4, which is directed by ‘Tarantino’, so she probably dislikes any movie
directed by ‘Tarantino’.
Therefore, based on these observations and the details of m2 and m6, our RBM recommends m6 to
Mary (‘Drama’, ‘Vicario’ and ‘Oscar’ match both Mary’s interests and m6). This is how an
RBM works, and hence why it is used in recommender systems.

Working of RBM

Thus, RBMs are used to build Recommender Systems.


Deep Belief Networks (DBNs):
Suppose we stack several RBMs on top of each other so that the outputs of the first RBM are the
input to the second RBM, and so on. Such networks are known as Deep Belief Networks. The
connections within each layer are undirected (since each layer is an RBM), while those between
the layers are directed (except for the top two layers, whose connection is undirected). There are
two ways to train DBNs –
1. Greedy Layer-wise Training Algorithm – The RBMs are trained layer by layer. Once the
individual RBMs are trained (that is, the parameters – weights and biases – are set), the direction
between the DBN layers is established.
2. Wake-Sleep Algorithm – The DBN is trained all the way up (connections going up – wake)
and then down the network (connections going down – sleep).
Therefore, we stack the RBMs, train them, and once the parameters are trained, we make
sure that the connections between the layers only work downwards (except for the top two
layers).
Deep Boltzmann Machines (DBMs):
DBMs are similar to DBNs, except that in a DBM not only the connections within layers but also
the connections between the layers are undirected (unlike in DBNs, where the connections between
layers are directed). DBMs can extract more complex or sophisticated features and hence can be
used for more complex tasks.

What is learning?

A learning rule in a neural network means machine learning, i.e., how a machine learns. It is the
mathematical logic used to improve the performance of an Artificial Neural Network. The learning
rule is one of the most important factors deciding the accuracy of an ANN.

Synapse

The dictionary meaning of synapse is togetherness, or conjunction. The purpose of the synapse is
to pass signals from one neuron to a target neuron. It is the basis through which neurons interact
with each other.

Presynaptic and Postsynaptic Neuron

Information flow is directional. The neuron that releases the chemical messenger, called a
neurotransmitter, is the presynaptic neuron, and the neuron that receives the neurotransmitter is
the postsynaptic neuron.

Learning Rules

Hebbian Learning

It is one of the oldest learning algorithms. A synapse (connection) between two neurons is
strengthened if neuron A is near enough to excite neuron B and repeatedly or persistently takes
part in firing it. This leads to some growth process or metabolic change in one or both cells such
that A’s efficiency, as one of the cells firing B, is increased.

Time-dependent mechanism: – Modifications in a Hebbian synapse depend on the exact
times of occurrence of the presynaptic and postsynaptic activities.

Local mechanism: – A synapse holds information-bearing signals, and this locally available
information is used by the Hebbian synapse to produce a local synaptic modification that is
input specific.

Interactive mechanism: – A change in a Hebbian synapse depends upon the activity levels on
both sides of the synapse (i.e., the presynaptic and postsynaptic activities).

Conjunctional or correlation mechanism: – The co-occurrence of presynaptic and postsynaptic
activities is sufficient to produce synaptic modification, which is why it is referred to as a
conjunctional synapse. The synaptic change also depends upon the correlation between the
presynaptic and postsynaptic activities, due to which it is also called a correlation synapse.

Competitive Learning

In competitive learning, the output neurons compete among themselves to be fired. Unlike
Hebbian learning, only one neuron can be active at a time. It is a form of unsupervised
learning.

It plays an important role in the formation of topographic maps.

The three basic elements of a competitive learning rule are:

1. A set of neurons that are all the same except for some randomly distributed synaptic weights,
and which therefore respond differently to a given set of input patterns.
2. A limit imposed on the “strength” of each neuron.
3. A mechanism that allows the neurons to compete for the right to respond to a given set of
inputs. The winning neuron is called a winner-takes-all neuron.

The induced local field vj of the winning neuron must be the largest among all the neurons for a
specified input pattern x. The output signal yj of the winning neuron is set equal to one; the
output signals of all the neurons that lose the competition are set to zero.

Boltzmann Learning

Boltzmann learning is statistical in nature. It is derived from the field of thermodynamics. It is
similar to an error-correction learning rule, except that in Boltzmann learning we take the
difference between probability distributions of the system (clamped versus free-running) instead
of the direct difference between the actual output and the desired output.

The Boltzmann learning rule is slower than the error-correction learning rule because in it the
state of each individual neuron, in addition to the system output, is taken into account.

The neurons operate in a binary manner, with +1 representing the ‘on’ state and -1 the ‘off’ state.
The machine is characterized by an energy function E, as given earlier.

The neurons of a Boltzmann machine are divided into two functional groups, visible and hidden.
The visible neurons act as an interface between the network and the environment in which it
operates, whereas the hidden neurons always operate freely.

The Boltzmann machine works in two modes of operation: the clamped condition and the
free-running condition, as described earlier.

Linear Regression in Statistics

For a linear regression of statistical data with multiple predictors, let’s begin with a linear
equation to represent the relationship between y = (yᵢ) and X = (xᵢⱼ):

yᵢ = β₀ + β₁xᵢ₁ + … + βₚxᵢₚ + εᵢ

where yᵢ is the dependent response variable and the xᵢⱼ are the observed values of each independent
variable j, of which there are p, for each statistical unit i, of which there are n. The error term
is εᵢ. The coefficients are the βⱼ, of which there are p+1. Here’s a view of linear data when there
is one predicting variable (p = 1).

50
Linear Regression, Bivariate (image by author)

We can also use vectors and matrices to represent the linear equation. The vector y =
(y₁,…,yᵢ,…,yₙ)⊤ represents the values taken by the response variable. X, of dimension n×(p+1),
is the matrix of xᵢⱼ predictor values, with the first column defined as a constant, meaning
that xᵢ₀ ≔ 1.

Representing the linear equation with vectors and matrices gives us:

y = Xβ + ε

For the linear regression of y on X with error vector ε, the coefficient vector β is derived by
minimizing the sum of squares of the residuals, or errors:

∑ᵢ εᵢ² = ∑ᵢ (yᵢ – xᵢβ)²

Or in vector and matrix form:

ε⊤ε = (y – Xβ)⊤(y – Xβ)

Taking the partial derivatives with respect to the vector β, and then setting them equal to zero,
yields the minimizing value of β, which we will name β^ₒₗₛ because we are using the ordinary
least squares (OLS) method to derive the estimator for β:

β^ₒₗₛ = (X⊤X)⁻¹X⊤y

At this value, β^ is a true minimum because the Hessian matrix of second derivatives is positive
definite. From β^, we can predict y, giving ŷ, using the following equation:

ŷ = Xβ^ₒₗₛ

Statisticians use the above geometric derivations when investigating a linear statistical model, a
model that is tested before being put to use to make predictions. An example basic model is (Tile,
2019):

y = Xβ + ε

where this model is formalized as follows:

• y is the vector of n observed outcomes

• X is an n×(p+1) full-rank matrix of non-random constants containing the observed
independent data xᵢⱼ, with an added first column of 1s

• β is the vector of the unknown coefficients (that is, the parameters to be estimated) in ℝ⁽ᵖ⁺¹⁾

• ε is a vector of size n containing unknown random variables, or error terms, εᵢ

Typical hypotheses of the model are as follows:

• The matrix X is not random and is full rank. If the matrix X is not full rank, then at least one
of the columns of the matrix (that is, of the covariates) can be written as a linear combination
of other columns, suggesting a reconsideration of the data

• The expectation of the error terms is zero: 𝔼(ε) = 0

• The variance of the error terms is constant: Var(εᵢ) = σ² for all i, that is, homoscedastic

• The covariance of the error terms is zero: Cov(εᵢ, εⱼ) = 0 for all i≠j

The Gauss-Markov theorem states that, for this model, the ordinary-least-squares estimator of β
is the best linear unbiased estimator (and with Normally distributed error terms it is also the
maximum likelihood estimator). So we get:

β^ = (X⊤X)⁻¹X⊤y

A prediction can then be made for a new set of independent variables xₖ:

ŷₖ = xₖβ^

After testing to ensure that the model fits the data, statistical theory then defines other important
values, such as confidence intervals for the variance of the estimators and prediction intervals for
the model’s predictions.
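
The OLS estimator above can be computed directly with NumPy. A minimal sketch, with simulated data and illustrative coefficient values:

import numpy as np

rng = np.random.default_rng(42)
n, p = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # first column of 1s
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)           # y = Xβ + ε

# β^_ols = (X'X)^{-1} X'y, computed via a linear solve for numerical stability
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat                                        # ŷ = Xβ^
print(beta_hat)                                             # close to [1.0, 2.0, -0.5]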

Logistic Regression in Statistics

We can generalize the above linear statistical model, with Normal (Gaussian) error terms, via
mathematical transformations into the generalized linear model (GLM) in statistics, allowing
regression, estimator tests, and analysis of the exponential family of conditional distributions
of y given X, such as the Binomial, Multinomial, Exponential, Gamma, and Poisson distributions.

53
The parameters are estimated using the maximum likelihood method. For logistic regression,
where there is a binomial response, y ∈ {0,1}, the logistic function defines the probability of a
successful outcome, 𝜋 = P(y=1|x), where x is the vector of observed predictive variables, of which
there are p. If β is the vector of the unknown coefficients, of which there are p+1, and using
z = xβ, then (Matei, 2019):

𝜋(z) = eᶻ / (1 + eᶻ)

We can apply this function via the log-odds, or logit, to a linear model as follows:

logit(𝜋ᵢ) = log(𝜋ᵢ / (1 – 𝜋ᵢ)) = xᵢβ
where 𝜋ᵢ = P(yᵢ=1|xᵢ) and xᵢ is the ith observed outcome, of which there are n. The logistic
function above allows us to apply the theory behind linear regression to a probability, between 0
and 1, of predicting a successful outcome. Statistical tests and data measures, such as deviance,
goodness-of-fit measures, the Wald test, and the Pearson 𝜒² statistic, can be applied using this model.
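
A small NumPy sketch of the logistic function and its inverse, the logit (the values of z are arbitrary):

import numpy as np

def logistic(z):
    # π = P(y=1|x) = e^z / (1 + e^z), with z = xβ
    return np.exp(z) / (1.0 + np.exp(z))

def logit(p):
    # log-odds: the inverse of the logistic function
    return np.log(p / (1.0 - p))

z = np.array([-2.0, 0.0, 2.0])
p = logistic(z)
print(p)         # [0.119..., 0.5, 0.880...]
print(logit(p))  # recovers [-2., 0., 2.]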

Linear Regression using Machine Learning

From the machine learning point of view, predictive models are considered too complicated or
computationally intensive to solve mathematically. Instead, very small steps are taken on portions
of the data and iteratively cycled through to derive the solution. We’ll walk through the solution
to linear regression using machine learning. Before we proceed, however, it is important to first
understand that in machine learning the function to be solved isn’t typically predefined. In our
case, we already know that we want to perform only a linear regression, but typically in machine
learning various models (or functions) of the data are compared until the best trade-off between
being too general and imprecise, on the one hand, and overfitting the data, on the other hand, is
found empirically.

Taking the derivative of the loss function and setting it equal to zero yields the coefficient values,
but we’re going to perform the calculation stepwise, with one calculation for each training
instance, since the machine learning algorithm will pass through the data multiple times.

To find the direction of the stepwise updates, we’ll take the derivative of the loss function and
use that direction to move our learning a step towards the minimum:

β ← β – 𝛼 · ∂L/∂β

This process is known as gradient descent, and 𝛼, which defines the length of the small step, is
the learning rate.

In machine learning, we consider training pairs (x₁, y₁), …, (xₙ, yₙ), and we cycle through updates
for each pair multiple times when optimizing β. Let’s look at the derivative of the squared error
for each training instance:

∂/∂βⱼ (yᵢ – xᵢβ)² = –2(yᵢ – xᵢβ)xᵢⱼ

This equation gives us the direction in which to move the β values, which are also known
as weights, towards their minimum. The constant 2 is typically disregarded because it doesn’t
affect the optimal values of β (Ng, 2018). So in our case, for each mth training instance, β is
updated as follows:

βⱼ ← βⱼ + 𝛼(yₘ – xₘβ)xₘⱼ

We start the learning process by establishing random values for each weight in β and then begin
the algorithm. The learning rate 𝛼 should be set so that progress towards the minimum of the loss
function is sufficiently fast without overshooting and making the minimum impossible to reach. A
dynamic learning rate, where 𝛼 is reduced as the function nears its minimum, is often
implemented. Assuming the support of a good learning rate, this machine learning algorithm will
calculate the values of the coefficients β as precisely as desired, reaching the same values derived
mathematically in the section above.
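
A minimal sketch of this per-instance (stochastic) gradient descent in NumPy; the simulated data, learning rate, and number of passes are illustrative choices:

import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=50)])  # bias column + feature
beta_true = np.array([0.5, 2.0])
y = X @ beta_true + rng.normal(scale=0.1, size=50)

alpha = 0.05                         # learning rate
beta = rng.normal(size=2)            # random initial weights
for _ in range(200):                 # multiple passes over the data
    for x_m, y_m in zip(X, y):       # one update per training instance
        beta += alpha * (y_m - x_m @ beta) * x_m
print(beta)                          # approaches the OLS values [0.5, 2.0]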

Logistic Regression with a Neural Network

The idea of neural networks came from the concept of how neurons work in living animals: a
nerve signal is either amplified or dampened by each neuron it passes through, and it is the
sum of multiple neurons in series and in parallel, each filtering multiple inputs and feeding
that signal to additional neurons, that eventually provides the desired output. The feed-forward
neural network is the simplest form of neural network, where the calculation is done only in the
forward direction, from input to output. Neural networks allow for the use of multiple layers of
neurons, where each layer provides specific functions. A simple linear regression neural network,
however, can be constructed with a single layer of neurons operating linearly.

The figure below shows the framework for a simple feed-forward neural network that provides
logistic regression:

In a simple feed-forward neural network for classification, the weights wⱼ and the ‘bias’ term w₀
represent the coefficients β from the linear regression method and are trained by the network
using the error (ε), as shown in the figure.

The general neural network function takes the following form (Bishop, 2006):

y(x, w) = f( ∑ⱼ wⱼφⱼ(x) )
where f(·) is a nonlinear activation function and φⱼ(x) is a basis function. The basis function can
transform the inputs x before the weights w are determined. In the case of logistic regression, the
basis function is set to 1 so that the inputs remain linear.

The activation function f(·) is also set to 1 for linear regression. However, with logistic regression,
a specific activation function is needed to convert the output of the linearly determined weights to
a predicted probability of the binomial response, 0 or 1. The activation function is
the sigmoid function, which is equivalent to the logistic function defined for logistic regression in
statistics. The sigmoid function, in contrast to the logistic function 𝜋(z), is mathematically
converted to have only one exponent to simplify programming, as shown in the following
equation:

σ(z) = 1 / (1 + e⁻ᶻ)

where z = xβ. The sigmoid activation function provides the probability of the prediction.

Multinomial Logistic Regression

Previously we used the generalized linear model in statistics to expand linear regression to logistic
regression for a binomial response. We can do a similar transformation for situations where the
response is multinomial, i.e., multiclass. The key difference is that instead of using the sigmoid
activation function to provide a probability for the prediction, the softmax function is used:

softmax(z)ₖ = e^(zₖ) / ∑ⱼ e^(zⱼ), for k = 1, …, K

where z = xβ and K is the number of classes.
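
A short NumPy sketch of the softmax function (the logits z are arbitrary; subtracting max(z) is a standard numerical-stability step, not part of the definition):

import numpy as np

def softmax(z):
    # softmax(z)_k = exp(z_k) / sum_j exp(z_j)
    e = np.exp(z - np.max(z))  # shift for numerical stability
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])  # z = xβ for K = 3 classes
print(softmax(z))              # [0.659..., 0.242..., 0.098...]; sums to 1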
