ANN-unit 1
Outline
Introduction: A Neural Network
Human Brain, Models of a Neuron
Neural Networks viewed as Directed Graphs
Network Architectures
UNIT 1
A Neural Network
A neural network is a type of machine learning model inspired by the structure and function of
the human brain. It consists of interconnected nodes, or artificial neurons, organized in layers to
process and analyze data. Each neuron receives inputs, applies a mathematical operation to them,
and produces an output that is passed on to other neurons in the network.
Neural networks can be used for a wide range of tasks, such as image and speech recognition,
natural language processing, and predictive modeling. They are particularly effective for tasks
that involve large amounts of complex data and patterns that are difficult to discern with
traditional algorithms.
Training a neural network involves adjusting the weights and biases of the connections between
neurons to minimize the difference between the predicted output and the actual output. This
process is typically done using an optimization algorithm, such as gradient descent, and a loss
function, which measures the difference between the predicted output and the actual output.
The architecture of a neural network can vary depending on the task at hand. Some common
types of neural networks include feedforward neural networks, recurrent neural networks, and
convolutional neural networks.
There are several types of neural networks, each designed for specific tasks and data structures.
Some of the most common types of neural networks are:
Feedforward Neural Networks: These are the simplest and most common type of neural
network. They consist of input, output, and hidden layers, with data flowing in only one direction,
from input to output. They are often used for tasks such as classification and regression.
Recurrent Neural Networks: These neural networks have connections between nodes that form
a directed cycle. They are often used for tasks such as natural language processing and speech
recognition, where the sequence of input data is important.
Convolutional Neural Networks: These neural networks are designed for processing images and
other multi-dimensional data. They use convolution layers to extract features from the input data,
and pooling layers to reduce the size of the data.
Autoencoders: These neural networks are designed for unsupervised learning, where the goal is
to learn a compressed representation of the input data. They consist of an encoder and a decoder,
which learn to map the input data to a lower-dimensional space and back again.
Generative Adversarial Networks: These neural networks consist of a generator and a
discriminator, which are trained together in a game-like setup. The generator learns to generate
new data that is similar to the training data, while the discriminator learns to distinguish between
real and generated data.
Long Short-Term Memory Networks: These neural networks are a type of recurrent neural
network that is designed to handle long-term dependencies. They use memory cells to remember
information from earlier in the sequence and avoid the vanishing gradient problem.
There are also many other types of neural networks, including radial basis function networks,
self-organizing maps, and deep belief networks, each designed for specific tasks and data
structures.
Neural networks that leverage cloud or online services also have the benefit of risk mitigation
compared to systems that rely on local technology hardware. In addition, neural networks can
often perform multiple tasks simultaneously (or at least distribute tasks to be performed by
modular networks at the same time).
Last, neural networks are continually being expanded into new applications. While early,
theoretical neural networks were very limited in their applicability to different fields, neural
networks today are leveraged in medicine, science, finance, agriculture, and security.
Though the complexity of neural networks is a strength, this may mean it takes months (if not
longer) to develop a specific algorithm for a specific task. In addition, it may be difficult to spot
any errors or deficiencies in the process, especially if the results are estimates or theoretical
ranges.
Neural networks may also be difficult to audit. Some neural network processes may feel "like a
black box" where input is entered, networks perform complicated processes, and output is
reported. It may also be difficult for individuals to analyze weaknesses within the calculation or
learning process of the network if the network lacks general transparency on how the model
learns from prior activity.
Neural Networks
Pros
• Can often work more efficiently and for longer than humans
• Can be programmed to learn from prior outcomes to strive to make smarter future
calculations
• Often leverage online services that reduce (but do not eliminate) systematic risk
• Are continually being expanded in new fields with more difficult problems
Cons
• Still rely on hardware that may require labor and expertise to maintain
• May take long periods of time to develop the code and algorithms
• May be difficult to assess errors or adaptations of the assumptions if the system is self-
learning but lacks transparency
• Usually report an estimated range or estimated amount that may not actualize
There are three main components: an input layer, a processing layer, and an output layer. The
inputs may be weighted based on various criteria. Within the processing layer, which is hidden
from view, there are nodes and connections between these nodes, meant to be analogous to the
neurons and synapses in an animal brain.
Also known as a deep learning network, a deep neural network, at its most basic, is one that
involves two or more processing layers. Deep neural networks rely on machine learning
networks that continually evolve by comparing estimated outcomes to actual results, then
modifying future projections.
All neural networks have three main components. First, the input is the data entered into the
network that is to be analyzed. Second, the processing layer utilizes the data (and prior
knowledge of similar data sets) to formulate an expected outcome. That outcome is the third
component, and this third component is the desired end product from the analysis.
The Bottom Line
Neural networks are complex, integrated systems that can perform analytics much more deeply
and quickly than humans can. There are different types of neural networks, often best suited for
different purposes and target outputs. In finance, neural networks are used to analyze
transaction history, understand asset movement, and predict financial market outcomes.
The term "Artificial Neural Network" is derived from Biological neural networks that develop
the structure of a human brain. Similar to the human brain that has neurons interconnected to one
another; artificial neural networks also have neurons that are interconnected to one another in
various layers of the networks. These neurons are known as nodes.
The given figure illustrates the typical diagram of Biological Neural Network.
Dendrites from Biological Neural Network represent inputs in Artificial Neural Networks, cell
nucleus represents Nodes, synapse represents Weights, and Axon represents Output.
Biological Neural Network → Artificial Neural Network
Dendrites → Inputs
Cell nucleus → Nodes
Synapse → Weights
Axon → Output
An Artificial Neural Network is an attempt, in the field of Artificial Intelligence, to mimic the
network of neurons that makes up a human brain, so that computers have an option to
understand things and make decisions in a human-like manner. The artificial neural network is
designed by programming computers to behave simply like interconnected brain cells.
There are around 86 billion neurons in the human brain. Each neuron has somewhere in the
range of 1,000 to 100,000 connection points. In the human brain, data is stored in a distributed
manner, and we can extract more than one piece of this data from our memory in parallel when
necessary. We can say that the human brain is made up of incredibly powerful parallel
processors.
We can understand the artificial neural network with an example. Consider a digital logic gate
that takes inputs and gives an output, such as an "OR" gate, which takes two inputs. If one or
both inputs are "On," the output is "On." If both inputs are "Off," the output is "Off." Here the
output depends only on the input. Our brain does not perform the same task: the relationship of
outputs to inputs keeps changing because the neurons in our brain are "learning."
Input Layer:
As the name suggests, it accepts inputs in several different formats provided by the programmer.
Hidden Layer:
The hidden layer sits between the input and output layers. It performs all the calculations needed
to find hidden features and patterns.
Output Layer:
The input goes through a series of transformations using the hidden layer, which finally results in
output that is conveyed using this layer.
The artificial neural network takes the inputs, computes their weighted sum, and includes a bias.
This computation is represented in the form of a transfer function.
The weighted total is then passed as input to an activation function to produce the output.
Activation functions decide whether a node should fire or not. Only the nodes that fire make it
to the output layer. There are distinct activation functions available that can be applied depending on
the sort of task we are performing.
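As a minimal sketch of this computation (the step activation and the hand-picked weights below are illustrative assumptions, not from the text), a single neuron that computes a weighted sum plus a bias and passes it through an activation function reproduces the OR gate described earlier:

def step(x):
    # Step activation: the node "fires" (outputs 1) when the weighted total is >= 0.
    return 1 if x >= 0 else 0

def neuron(inputs, weights, bias):
    # Transfer function: weighted sum of the inputs plus the bias.
    weighted_sum = sum(i * w for i, w in zip(inputs, weights)) + bias
    return step(weighted_sum)

# Hand-picked weights that make this neuron behave like an OR gate.
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", neuron([x1, x2], weights=[1, 1], bias=-0.5))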
Artificial neural networks have parallel processing capability: they can perform more than one
task simultaneously.
Unlike traditional programming, where data is stored in a database, the data used by a neural
network is stored on the whole network. The disappearance of a couple of pieces of data in one
place does not prevent the network from working.
After training, an ANN may produce output even with inadequate data. The loss of performance
here depends on the significance of the missing data.
For an ANN to be able to adapt, it is important to determine suitable examples and to train the
network according to the desired output by demonstrating these examples to it. The success of
the network is directly proportional to the chosen instances, and if the event cannot be shown to
the network in all its aspects, the network can produce false output.
The corruption of one or more cells of an ANN does not prevent it from generating output, and
this feature makes the network fault-tolerant.
There is no particular guideline for determining the structure of artificial neural networks. The
appropriate network structure is accomplished through experience, trial, and error.
The unrecognized behavior of the network is the most significant issue of ANNs. When an ANN
produces a solution, it does not provide insight concerning why and how it was reached. This
decreases trust in the network.
Hardware dependence:
Artificial neural networks need processors with parallel processing power, in line with their
structure. Therefore, the realization of the network depends on suitable equipment.
ANNs can work only with numerical data. Problems must be converted into numerical values
before being introduced to the ANN. The representation mechanism chosen here directly impacts
the performance of the network, and it relies on the user's abilities.
A neural network comprises three main layers, which are as follows:
o Input layer: The input layer accepts all the inputs that are provided by the programmer.
o Hidden layer: In between the input and output layers there is a set of hidden layers, on
which computations are performed that further result in the output.
o Output layer: After the input layer undergoes a series of transformations while passing
through the hidden layer, it results in output that is delivered by the output layer.
Basically, the neural network is based on neurons, which are nothing but brain cells. A
biological neuron receives input from other sources, combines it in some way, performs a
nonlinear operation on the result, and produces the final result as output.
The dendrites act as receivers that receive signals from other neurons, which are then passed on
to the cell body. The cell body performs some operation, which can be a summation,
multiplication, etc. After the operations are performed on the set of inputs, the result is
transferred to the next neuron via the axon, which is the transmitter of the signal for the neuron.
o With a Neural Network: When we talk about the case with a neural network, even if we
have not trained our machine on that particular cat, it can still identify certain features of
cats that it has been trained on, match those features with the cat in that particular image,
and thereby identify the cat. With the help of this example, you can clearly see the
importance of the concept of a neural network.
Instead of directly getting into the working of Artificial Neural Networks, let's break down and try
to understand a Neural Network's basic unit, which is called a Perceptron.
A perceptron can be defined as a neural network with a single layer that classifies linear
data. It further constitutes four major components, which are as follows:
1. Inputs
2. Weights and Bias
3. Summation Function
4. Activation or transformation function
The inputs (x) are fed into the input layer and multiplied by the allotted weights (w); the products
are then added together to form the weighted sum, which is passed through the pertinent
activation function.
When an input variable is fed into the network, a random value is initially given as the weight of
that particular input, such that each individual weight represents the importance of that input
for making correct predictions of the result.
The bias, in turn, helps in adjusting the curve of the activation function so as to accomplish a
precise output.
Summation Function
After the weights are assigned to the inputs, the product of each input and its weight is
computed. The weighted sum is then calculated by the summation function, which adds all of the
products.
Activation Function
The main objective of the activation function is to map the weighted sum onto the output. The
transformation function comprises activation functions such as tanh, ReLU, sigmoid, etc.
1. Linear Activation Function
In the linear activation function, the output of the function is not restricted to any range; its
range goes from -infinity to infinity. For each individual neuron, the inputs are multiplied by the
weight of that neuron, which leads to the creation of an output signal proportional to the input.
If all the layers are linear in nature, then the final activation of the last layer is actually just a
linear function of the input to the first layer.
2. Non-Linear Activation Functions
These are among the most widely used activation functions. They help the model generalize and
adapt to any sort of data in order to correctly differentiate among the outputs. They solve the
following problems faced by linear activation functions:
o Since non-linear functions have usable derivative functions, the problems related to
backpropagation are successfully solved.
o They permit the stacking of several layers of neurons, which is needed for the creation
of deep neural networks.
The non-linear activation function is further divided into the following parts:
Sigmoid or Logistic Activation Function: It is defined as σ(x) = 1 / (1 + e^(−x)) and produces an
output value between 0 and 1, which helps in the normalization of each neuron's
output. For x between -2 and 2, the curve of y is very steep; in simple language, even a
small change in x can bring a lot of change in y.
Its value ranges between 0 and 1, due to which it is highly preferred for binary
classification, whose result is either 0 or 1.
Other non-linear activation functions, such as tanh and ReLU, lead to effectual as well as easier computations.
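A minimal sketch of the activation functions named above (the sample inputs are an illustrative assumption):

import math

def sigmoid(x):
    # Squashes any real input into (0, 1); steepest for x between -2 and 2.
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    # Squashes input into (-1, 1); a zero-centred relative of the sigmoid.
    return math.tanh(x)

def relu(x):
    # Passes positive values through unchanged and clips negatives to 0.
    return max(0.0, x)

for x in (-4, -2, 0, 2, 4):
    print(f"x={x:+d}  sigmoid={sigmoid(x):.3f}  tanh={tanh(x):.3f}  relu={relu(x):.1f}")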
Gradient descent is an optimization algorithm that is utilized to minimize the cost function used
in various machine learning algorithms so as to update the parameters of the learning model. In
linear regression, these parameters are coefficients, whereas, in the neural network, they are
weights.
Procedure:
It all starts with an initial value for the coefficient, which may be either 0.0 or any small
arbitrary value:
coefficient = 0.0
To estimate the cost of the coefficient, it is plugged into the cost function, which evaluates it:
cost = f(coefficient)
or, cost = evaluate(f(coefficient))
Next, the derivative will be calculated, which is a concept from calculus that refers to the
function's slope at a given point. We need to calculate the slope in order to know the direction in
which to move the coefficient values so as to accomplish a lower cost in the next
iteration.
Now that we have found the downhill direction, we can update the values of the coefficients. We
also need to specify alpha, a learning-rate parameter that controls the size of the adjustment
made to the coefficients on each update:
coefficient = coefficient - (alpha * delta)
Until the cost of the coefficient reaches 0.0, or comes close enough to it, the whole process
is repeated again and again.
It can be concluded that gradient descent is a very simple and straightforward concept. It
just requires you to know the gradient of the cost function, or simply of the function that you
are willing to optimize.
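A minimal sketch of this procedure, assuming an illustrative one-coefficient cost f(c) = (c - 3)^2 with derivative 2(c - 3) (both are assumptions chosen for demonstration, not from the text):

def cost(coefficient):
    # Illustrative cost function; its minimum is at coefficient = 3.0.
    return (coefficient - 3.0) ** 2

def derivative(coefficient):
    # Slope of the cost at the current coefficient value.
    return 2.0 * (coefficient - 3.0)

coefficient = 0.0   # initial value, as in the procedure above
alpha = 0.1         # learning-rate parameter

for iteration in range(100):
    delta = derivative(coefficient)            # direction and size of the slope
    coefficient = coefficient - alpha * delta  # move downhill
    if cost(coefficient) < 1e-8:               # close enough to zero cost
        break

print(coefficient)  # approaches 3.0, the minimiser of the cost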
Now assume that hθ represents the hypothesis for linear regression and ∑ computes the sum over
all training examples from i = 1 to m. Then the cost function is computed by:
J(θ) = (1/2m) ∑ᵢ₌₁..ᵐ (hθ(x(i)) − y(i))²
and batch gradient descent repeatedly updates every parameter θj:
Repeat {
    θj := θj − (α/m) ∑ᵢ₌₁..ᵐ (hθ(x(i)) − y(i)) xj(i)   (simultaneously for every j)
}
Here xj(i) indicates the jth feature of the ith training example. If m is very large, then each
update requires a pass over all examples, and the algorithm becomes very slow to converge to
the global minimum.
In a single iteration, stochastic gradient descent processes only one training example, which
means all the parameters are updated after each single training example is processed. It tends to
be much faster than batch gradient descent, but when we have a huge number of training
examples, it still processes a single example at a time, so the system may undergo a large
number of iterations. To train the parameters evenly across each type of data, properly shuffle
the dataset.
Algorithm for Stochastic Gradient Descent
Repeat {
    For i = 1 to m {
        θj := θj − α (hθ(x(i)) − y(i)) xj(i)   (for every j)
    }
}
The Batch Gradient Descent algorithm follows a straight-line path towards the minimum. The
algorithm converges towards the global minimum, in case the cost function is convex, else
towards the local minimum, if the cost function is not convex. Here the learning rate is typically
constant.
However, in the case of Stochastic Gradient Descent, the algorithm fluctuates around the global
minimum rather than converging directly to it. The learning rate is decreased slowly so that it
can converge. Since it processes only one example per iteration, it tends to be noisy.
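To make the contrast concrete, here is a hedged sketch comparing the two update schemes on a tiny one-parameter linear model y = θx (the toy data, learning rate, and epoch counts are illustrative assumptions):

import random

data = [(x, 2.0 * x) for x in (0.5, 1.0, 1.5, 2.0, 2.5)]  # toy data for y = 2x
alpha = 0.05

def batch_gradient_descent(epochs=100):
    theta = 0.0
    m = len(data)
    for _ in range(epochs):
        # One update per epoch, averaging the gradient over ALL m examples.
        grad = sum((theta * x - y) * x for x, y in data) / m
        theta -= alpha * grad
    return theta

def stochastic_gradient_descent(epochs=100):
    theta = 0.0
    for _ in range(epochs):
        random.shuffle(data)      # shuffle so every example trains the parameters evenly
        for x, y in data:
            # One update PER EXAMPLE: faster progress, but noisier steps.
            theta -= alpha * (theta * x - y) * x
    return theta

print(batch_gradient_descent(), stochastic_gradient_descent())  # both approach 2.0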
Backpropagation
A backpropagation network consists of an input layer of neurons, an output layer, and at least one
hidden layer. The neurons perform a weighted sum of their inputs, which is then fed to an
activation function, typically the sigmoid activation function. Backpropagation makes use of
supervised learning to teach the network: it keeps updating the weights of the network until the
desired output is produced. The following factors are responsible for the training and
performance of the network:
o A number of training cycles.
o A number of hidden neurons.
o The training set.
o Teaching parameter values such as learning rate and momentum.
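A minimal backpropagation sketch for one hidden layer with the sigmoid activation, trained here on the XOR problem (the dataset, layer sizes, learning rate, and number of training cycles are illustrative assumptions):

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # inputs
t = np.array([[0], [1], [1], [0]], dtype=float)              # targets (XOR)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)   # input -> hidden weights
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)   # hidden -> output weights
lr = 0.5

for epoch in range(10000):
    # Forward pass: weighted sums followed by the sigmoid activation.
    h = sigmoid(X @ W1 + b1)
    y = sigmoid(h @ W2 + b2)
    # Backward pass: propagate the output error back through the layers.
    d_out = (y - t) * y * (1 - y)
    d_hid = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_hid; b1 -= lr * d_hid.sum(axis=0)

print(np.round(y.ravel(), 2))  # should approach [0, 1, 1, 0]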
Building an ANN
Before starting with building an ANN model, we will require a dataset on which our model is
going to work. The dataset is the collection of data for a particular problem, which is in the form
of a CSV file.
CSV stands for Comma-Separated Values, a format that saves data in tabular form. We are using
a fictional dataset of a bank. The bank dataset contains data on 10,000 customers along with
their details. This exercise is undertaken because the bank is seeing unusual churn rates:
customers are leaving at an unusually high rate, and the bank wants to know the reason behind it
so that it can assess and address the problem.
So, we will start by installing the Keras library, the TensorFlow library, and the Theano library
on the Anaconda Prompt. For that, you need to open it as administrator and run the commands
one after another as given below.
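The exact commands are not reproduced in the source, but a plausible set for an Anaconda environment (version-dependent, so treat these as assumptions) is:

conda install -c conda-forge keras
pip install tensorflow
pip install theano

A minimal sketch of the churn model itself might then look as follows; the feature count, layer sizes, and random stand-in data are assumptions, and a real script would instead load and encode the bank's CSV file:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

X_train = np.random.rand(100, 11)                  # stand-in for encoded customer features
y_train = np.random.randint(0, 2, size=(100, 1))   # stand-in churn labels (1 = left)

model = Sequential()
model.add(Dense(6, activation='relu', input_dim=11))  # hidden layer 1
model.add(Dense(6, activation='relu'))                # hidden layer 2
model.add(Dense(1, activation='sigmoid'))             # churn probability
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, batch_size=10, epochs=5)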
Network Architecture
Computer network architecture is defined as the physical and logical design of the software,
hardware, protocols, and media for the transmission of data. Simply put, it describes how
computers are organized and how tasks are allocated among them. The two main types of
network architecture are:
o Peer-To-Peer network
o Client/Server network
Peer-To-Peer network
o Peer-To-Peer network is a network in which all the computers are linked together with
equal privilege and responsibilities for processing the data.
o Peer-To-Peer network is useful for small environments, usually up to 10 computers.
o Peer-To-Peer network has no dedicated server.
o Special permissions are assigned to each computer for sharing the resources, but this can
lead to a problem if the computer with the resource is down.
Advantages of Peer-To-Peer Network:
o It is less costly as it does not contain any dedicated server.
o If one computer stops working, the other computers will not stop working.
o It is easy to set up and maintain as each computer manages itself.
Client/Server Network
o A Client/Server network is a network model designed for end users, called clients, to
access resources such as songs, videos, etc. from a central computer known as the server.
o The central controller is known as a server while all other computers in the network are
called clients.
o A server performs all the major operations such as security and network management.
o A server is responsible for managing all the resources such as files, directories, printer,
etc.
o All the clients communicate with each other through the server. For example, if client 1
wants to send some data to client 2, it first sends a request to the server for
permission. The server then sends a response to client 1, authorizing it to initiate
communication with client 2.
Advantages of Client/Server network:
o A Client/Server network uses a centralized system. Therefore, we can back up the
data easily.
o A Client/Server network has a dedicated server that improves the overall performance of
the whole system.
o Security is better in Client/Server network as a single server administers the shared
resources.
o It also increases the speed of sharing resources.
Humans are best at understanding, reasoning, and interpreting knowledge. Humans know things,
which is knowledge, and according to their knowledge they perform various actions in the real
world. How machines do all these things comes under knowledge representation and
reasoning. Hence we can describe knowledge representation as follows:
o Knowledge representation and reasoning (KR, KRR) is the part of Artificial Intelligence
concerned with how AI agents think and how thinking contributes to their intelligent
behavior.
o It is responsible for representing information about the real world so that a computer can
understand it and utilize this knowledge to solve complex real-world problems, such
as diagnosing a medical condition or communicating with humans in natural language.
o It is also a way of describing how we can represent knowledge in artificial
intelligence. Knowledge representation is not just storing data in a database; it
also enables an intelligent machine to learn from that knowledge and those experiences so that
it can behave intelligently like a human.
What to Represent:
o Object: All the facts about objects in our world domain. E.g., guitars contain strings;
trumpets are brass instruments.
o Events: Events are the actions which occur in our world.
o Performance: It describes behavior which involves knowledge about how to do things.
o Meta-knowledge: It is knowledge about what we know.
o Facts: Facts are the truths about the real world and what we represent.
o Knowledge-Base: The central component of a knowledge-based agent is the
knowledge base, represented as KB. The knowledge base is a group of sentences
(here, "sentence" is used as a technical term, not identical to sentences in the English
language).
Types of knowledge
1. Declarative Knowledge: Knowledge of facts and concepts that describe the world, expressed
in declarative sentences.
2. Procedural Knowledge: Knowledge of how to do something, such as rules, strategies, and
procedures.
3. Meta-knowledge: Knowledge about other types of knowledge.
4. Heuristic Knowledge: Rules of thumb drawn from the experience of experts in a field.
5. Structural Knowledge: Basic problem-solving knowledge describing relationships between
concepts, such as kind-of, part-of, and grouping.
Knowledge of the real world plays a vital role in intelligence, and the same is true for creating
artificial intelligence. Knowledge plays an important role in demonstrating intelligent behavior in
AI agents. An agent is only able to act accurately on some input when it has some knowledge or
experience about that input.
Suppose you met a person who speaks a language you don't know; how would you be able to act
on that? The same applies to the intelligent behavior of agents. As we can see in the diagram
below, there is a decision maker which acts by sensing the environment and using knowledge. If
the knowledge part is not present, it cannot display intelligent behavior.
AI knowledge cycle:
An Artificial intelligence system has the following components for displaying intelligent
behavior:
o Perception
o Learning
o Knowledge Representation and Reasoning
o Planning
o Execution
The above diagram shows how an AI system can interact with the real world and which
components help it to show intelligence. The AI system has a Perception component by which it
retrieves information from its environment; this can be visual, audio, or another form of sensory
input. The Learning component is responsible for learning from the data captured by the
Perception component. In the complete cycle, the main components are Knowledge
Representation and Reasoning; these two components are involved in showing human-like
intelligence in machines. They are independent of each other but also coupled together. Planning
and Execution depend on the analysis of Knowledge Representation and Reasoning.
There are mainly four approaches to knowledge representation, which are given below:
1. Simple relational knowledge:
This is the simplest way of storing facts, using the relational method: each fact about a set of
objects is set out systematically in columns, as in a relational database. For example:

Player     Weight    Age
Player1    65        23
Player2    58        18
Player3    75        24
2. Inheritable knowledge:
o In the inheritable knowledge approach, all data must be stored in a hierarchy of classes.
o All classes should be arranged in a generalized form or in a hierarchical manner.
o In this approach, we apply the inheritance property.
o Elements inherit values from other members of a class.
o This approach supports inheritable knowledge, which shows a relation between an
instance and a class; this is called an instance relation.
o Every individual frame can represent a collection of attributes and their values.
o In this approach, objects and values are represented in boxed nodes.
o Arrows point from objects to their values.
3. Inferential knowledge:
o The inferential knowledge approach represents knowledge in the form of formal logic.
o This approach can be used to derive more facts.
o It guarantees correctness.
o Example: Let's suppose there are two statements:
a. Marcus is a man
b. All men are mortal
Then it can be represented as:
man(Marcus)
∀x: man(x) → mortal(x)
4. Procedural knowledge:
o The procedural knowledge approach uses small programs and code which describe how to
do specific things and how to proceed.
o In this approach, one important rule is used, namely the If-Then rule.
o In this knowledge, we can use various coding languages such as LISP
language and Prolog language.
o We can easily represent heuristic or domain-specific knowledge using this approach.
o However, it is not possible to represent all cases with this approach.
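As a small illustrative sketch (all the facts, names, and rules below are assumptions used only for demonstration), the relational, inferential, and procedural approaches can each be mimicked in code:

# Approach 1: simple relational knowledge stored as rows of facts.
players = [
    {"player": "Player1", "weight": 65, "age": 23},
    {"player": "Player2", "weight": 58, "age": 18},
    {"player": "Player3", "weight": 75, "age": 24},
]

# Approach 3: inferential knowledge as a rule, man(x) -> mortal(x).
facts = {"man": {"Marcus"}}

def is_mortal(x):
    # Approach 4: a procedural If-Then encoding of "all men are mortal".
    return x in facts["man"]

print(is_mortal("Marcus"))                               # True: mortal(Marcus) is derived
print([p["player"] for p in players if p["age"] > 20])   # querying the relational facts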
A good knowledge representation system should possess the following properties:
1. Representational Accuracy: The KR system should have the ability to represent all kinds
of required knowledge.
2. Inferential Adequacy: The KR system should have the ability to manipulate the
representational structures to produce new knowledge corresponding to existing structures.
3. Inferential Efficiency: The ability to direct the inferential mechanism in the most
productive direction by storing appropriate guides.
4. Acquisitional Efficiency: The ability to acquire new knowledge easily using
automatic methods.
We know that, during ANN learning, to change the input/output behavior, we need to adjust the
weights. Hence, a method is required with the help of which the weights can be modified. These
methods are called Learning rules, which are simply algorithms or equations. Following are
some learning rules for the neural network −
Hebbian Learning Rule
This rule, one of the oldest and simplest, was introduced by Donald Hebb in his book The
Organization of Behavior in 1949. It is a kind of feed-forward, unsupervised learning.
Basic Concept − This rule is based on a proposal given by Hebb, who wrote:
“When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes
part in firing it, some growth process or metabolic change takes place in one or both cells such
that A’s efficiency, as one of the cells firing B, is increased.”
From the above postulate, we can conclude that the connections between two neurons might be
strengthened if the neurons fire at the same time and might weaken if they fire at different times.
Mathematical Formulation − According to the Hebbian learning rule, the following is the formula
to increase the weight of a connection at every time step:
Δwji(t) = α xi(t) yj(t)
where Δwji(t) is the increment of the weight at time step t, α is a positive learning rate, xi(t) is
the input value, and yj(t) is the output at time step t.
Perceptron Learning Rule
This rule is an error-correcting, supervised learning algorithm for single-layer feedforward
networks with a linear activation function, introduced by Rosenblatt.
Mathematical Formulation − To explain its mathematical formulation, suppose we have 'n' finite
input vectors x(n), along with their desired/target output vectors t(n), where n = 1 to N. The
output 'y' can be calculated, as explained earlier, on the basis of the net input, with the
activation function applied over that net input as follows:
y = f(yin) = { 1, if yin > θ; 0, if yin ≤ θ }
where θ is the threshold.
The updating of the weights can be done in the following two cases:
Case I − when t ≠ y: w(new) = w(old) + t·x
Case II − when t = y: no change in weights.
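A minimal sketch of the perceptron rule in action; the OR-gate training data, the threshold, and the error-driven form of the update, w(new) = w(old) + (t − y)·x, are illustrative assumptions consistent with the two cases above:

import numpy as np

X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)  # third input = bias
t = np.array([0, 1, 1, 1], dtype=float)   # OR-gate targets
theta = 0.0                               # threshold
w = np.zeros(3)

for epoch in range(10):
    for x, target in zip(X, t):
        y = 1.0 if w @ x > theta else 0.0  # activation applied over the net input
        if target != y:                    # Case I: update the weights
            w = w + (target - y) * x       # Case II (target == y): no change

print(w)  # weights that separate the OR-gate classes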
Competitive Learning Rule (Winner-Takes-All)
It is concerned with unsupervised training in which the output nodes try to compete with each
other to represent the input pattern. To understand this learning rule, we must understand the
competitive network which is given as follows −
Basic Concept of Competitive Network − This network is just like a single-layer feedforward
network with feedback connections between the outputs. The connections between outputs are
of the inhibitory type, shown by dotted lines, which means the competitors never support themselves.
Basic Concept of Competitive Learning Rule − As said earlier, there will be a competition
among the output nodes. Hence, the main concept is that during training, the output unit with the
highest activation to a given input pattern will be declared the winner. This rule is also called
Winner-takes-all because only the winning neuron is updated and the rest of the neurons are left
unchanged.
Mathematical formulation − Following are the three important factors for mathematical
formulation of this learning rule −
• Condition to be a winner − Suppose a neuron yk wants to be the winner; then there
would be the following condition:
yk = { 1, if vk > vj for all j, j ≠ k; 0, otherwise }
It means that if a neuron, say yk, wants to win, then its induced local field (the output of the
summation unit), say vk, must be the largest among all the other neurons in the network.
• Condition on the sum total of weights − Another constraint of the competitive learning rule
is that the sum total of the weights to a particular output neuron must be 1. For example, if
we consider neuron k, then:
∑j wkj = 1   for all k
• Change of weight for the winner − If a neuron does not respond to the input pattern, then no
learning takes place in that neuron. However, if a particular neuron wins, then the
corresponding weights are adjusted as follows:
Δwkj = { α (xj − wkj), if neuron k wins; 0, if neuron k loses }
This clearly shows that we favor the winning neuron by adjusting its weights toward the
input pattern; if a neuron loses, we need not readjust its weights.
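A minimal winner-takes-all sketch reflecting the three conditions above; the input patterns, number of output neurons, and learning rate are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(1)
patterns = rng.random((20, 3))
patterns /= patterns.sum(axis=1, keepdims=True)   # inputs normalised to sum to 1

K, eta = 2, 0.2
W = rng.random((K, 3))
W /= W.sum(axis=1, keepdims=True)   # each neuron's weights start summing to 1

for x in patterns:
    v = W @ x                       # induced local fields of all output neurons
    k = int(np.argmax(v))           # condition to be a winner: the largest v_k
    W[k] += eta * (x - W[k])        # winner moves toward the input pattern
    # losing neurons are left unchanged

print(W.round(3))  # each row drifts toward a cluster of input patterns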
Outstar Learning Rule
This rule, introduced by Grossberg, is concerned with supervised learning because the desired
outputs are known. It is also called Grossberg learning.
Basic Concept − This rule is applied over the neurons arranged in a layer. It is specially designed
to produce a desired output d for the layer of p neurons.
Mathematical Formulation − The weight adjustments in this rule are computed as follows:
Δwj = α (d − wj)
This definition of the learning process implies the following sequence of events:
1. The neural network is stimulated by an environment.
2. The neural network undergoes changes in its free parameters as a result of this stimulation.
3. The neural network responds in a new way to the environment because of the changes which
have occurred in its internal structure.
A prescribed set of well-defined rules for the solution of a learning problem is called a learning
algorithm. There is no unique learning algorithm for the design of neural networks. Rather, we
have a kit of tools represented by a diverse variety of learning algorithms, each of which offers
advantages of its own. Basically, learning algorithms differ from each other in the way in which
the adjustment to a synaptic weight of a neuron is formulated.
Another factor to be considered is the manner in which a neural network (learning machine),
made up of a set of interconnected neurons, reacts to its environment. In this latter context we
speak of a learning paradigm which refers to a model of the environment in which the neural
network operates.
Some of these algorithms require the use of a teacher and some do not; they are called
supervised and unsupervised learning, respectively.
In the study of supervised learning, a key provision is a 'teacher' capable of supplying exact
corrections to the network outputs when an error occurs. Such a method is not possible in
biological organisms, which have neither the exact reciprocal nervous connections needed for the
backpropagation of error corrections nor the nervous means for the imposition of behavior from
outside.
Nevertheless, supervised learning has established itself as a powerful paradigm for the design
of artificial neural networks. In contrast, self-organized (unsupervised) learning is motivated by
neurobiological considerations.
The argument n denotes discrete time, or more precisely, the time step of an iterative process
involved in adjusting the synaptic weights of neuron k. The output signal of neuron k is denoted
yk(n). This output signal, representing the only output of the neural network, is compared to a
desired response or target output, denoted by dk(n). Consequently, an error signal, denoted by
ek(n), is produced. By definition, we thus have:
ek(n) = dk(n) − yk(n)
The error signal ek(n) actuates a control mechanism, the purpose of which is to apply a sequence
of corrective adjustments to the synaptic weights of neuron k. The corrective adjustments are
designed to make the output signal yk(n) come closer to the desired response dk(n) in a step-by-
step manner.
This objective is achieved by minimizing a cost function or index of performance ɛ(n),
defined in terms of the error signal ek(n) as:
ɛ(n) = ½ e²k(n)
That is, ɛ(n) is the instantaneous value of the error energy. The step-by-step adjustments to the
synaptic weights of neuron k are continued until the system reaches a steady state (i.e., the
synaptic weights are essentially stabilized). At that point the learning process is terminated.
The learning process described herein is referred to as error-correction learning. In
particular, minimization of the cost function ɛ(n) leads to a learning rule commonly referred to as
the delta rule or Widrow-Hoff rule, named in honor of its originators. Let ωkj(n) denote the
value of the synaptic weight ωkj of neuron k excited by element xj(n) of the signal vector x(n) at
time step n. According to the delta rule, the adjustment Δωkj(n) applied to the synaptic weight
ωkj at time step n is defined by
Δωkj(n) = ƞ ek(n) xj(n)
where ƞ is a positive constant which determines the rate of learning as we proceed from one
step in the learning process to another. It is therefore natural that we refer to ƞ as the learning-
rate parameter.
The delta rule, as stated herein, presumes that the error signal is directly measurable. For this
measurement to be feasible we clearly need a supply of desired response from some external
source, which is directly accessible to neuron k.
In other words, neuron k is visible to the outside world, as depicted in Fig. 11.21(a). From this
figure we also observe that error-correction learning is in fact local in nature. This amounts to
saying that the synaptic adjustments made by the delta rule are localized around neuron k.
Having computed the synaptic adjustment Δωkj(n), the updated value of the synaptic weight ωkj is
given by equation 11.26:
ωkj(n + 1) = ωkj(n) + Δωkj(n) …(11.26)
Effectively, ωkj(n) and ωkj(n + 1) may be viewed as the old and new values of the synaptic weight
ωkj, respectively.
In computational terms we may also write:
ωkj(n) = z⁻¹ [ωkj(n + 1)]
where z⁻¹ is the unit-delay operator; that is, z⁻¹ represents a storage element.
Fig. 11.21(b) shows a signal-flow graph representation of the error-correction learning process,
with regard to neuron k. The input signal xj and the induced local field vk of the neuron k are
referred to as presynaptic and postsynaptic signals of the jth synapse of neuron k, respectively.
Also, the Fig. shows that the error-correction learning is an example of a closed-loop feedback
system.
From control theory we know that the stability of such a system is determined by the
parameters which constitute its feedback loops. In this case there is a single
feedback loop, and one of the parameters of interest is ƞ, the learning rate. To ensure the
stability and convergence of iterative learning, ƞ should be selected judiciously.
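A minimal delta-rule sketch for a single linear neuron (the input data and the known target weights used to generate the desired responses are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((50, 2))
d = X @ np.array([1.5, -0.5])   # desired responses from an assumed target neuron

w = np.zeros(2)
eta = 0.1
for n in range(500):             # time steps of the iterative process
    x = X[n % len(X)]
    y = w @ x                    # output signal y_k(n)
    e = d[n % len(X)] - y        # error signal e_k(n) = d_k(n) - y_k(n)
    w = w + eta * e * x          # corrective adjustment, local to neuron k

print(w)  # approaches [1.5, -0.5] as the synaptic weights stabilise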
2. Memory-Based Learning:
In memory-based learning, all (or most) of the past experiences are explicitly stored in a large
memory of correctly classified input-output examples: {(xi, di)}, i = 1 to N, where xi denotes an
input vector and di denotes the corresponding desired response. Without loss of generality, we
have restricted the desired response to be a scalar.
For example, in a binary pattern classification problem there are two classes (hypotheses),
denoted by ɛ1 and ɛ2, to be considered. In this example, the desired response di takes the value 0
(or -1) for class ɛ1 and the value 1 for class ɛ2. When classification of a test vector xtest (not seen
before) is required, the algorithm responds by retrieving and analyzing the training data in a
"local neighborhood" of xtest.
All memory-based learning algorithms involve two essential ingredients:
a. Criterion used for defining the local neighborhood of the test vector xtest.
b. Learning rule applied to the training examples in the local neighborhood of xtest.
The algorithms differ from each other in the way in which these two ingredients are defined.
In a simple yet effective type of memory-based learning known as the nearest neighbor rule, the
local neighborhood is defined as the training example which lies in the immediate
neighborhood of the test vector xtest. In particular, the vector x'N ∈ {x1, x2, …, xN} is said to be
the nearest neighbor of xtest if
mini d(xi, xtest) = d(x'N, xtest)
where d(xi, xtest) is the Euclidean distance between the vectors xi and xtest. The class associated
with the minimum distance, that is, the class of vector x'N, is reported as the classification of xtest.
This rule is independent of the underlying distribution responsible for generating the training
examples. Cover and Hart (1967) have formally studied the nearest neighbor rule as a tool for
pattern classification.
Cover and Hart (1967) have formally studied the nearest neighbor rule as a tool for pattern
classification.
Under two assumptions − that the classified examples (xi, di) are independently and identically
distributed, and that the sample size N is infinitely large − it is shown that the probability of
classification error incurred by the nearest neighbor rule is bounded above by twice the Bayes
probability of error, that is, the minimum probability of error over all decision rules. In this
sense, it may be said that half the classification information in a training set of infinite size is
contained in the nearest neighbor, which is a surprising result.
A variant of the nearest neighbor classifier is the k-nearest neighbor classifier, which
proceeds as:
a. Identify the k classified patterns which lie nearest to the test vector xtest for some integer k.
b. Assign xtest to class (hypothesis) which is most frequently represented in the k nearest
neighbors to xtest (i.e., use a majority vote to make the classification).
Thus, the k-nearest neighbor classifier acts like an averaging device.
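A minimal sketch of the k-nearest neighbor classifier following steps a and b above (the tiny two-dimensional dataset is an illustrative assumption):

from collections import Counter
import math

train = [((1.0, 1.0), 0), ((1.2, 0.8), 0), ((3.0, 3.2), 1), ((3.1, 2.9), 1)]

def classify(x_test, k=3):
    # Step a: find the k stored examples nearest (Euclidean) to x_test.
    nearest = sorted(train, key=lambda ex: math.dist(ex[0], x_test))[:k]
    # Step b: majority vote among those k nearest neighbours.
    return Counter(label for _, label in nearest).most_common(1)[0][0]

print(classify((1.1, 1.0)))  # 0
print(classify((2.9, 3.0)))  # 1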
3. Hebbian Learning:
Hebb's postulate of learning states: "When an axon of cell A is near enough to excite a cell B and
repeatedly or persistently takes part in firing it, some growth process or metabolic changes take
place in one or both cells such that A's efficiency, as one of the cells firing B, is increased."
Hebb proposed this change as a basis of associative learning (at the cellular level), which would
result in an enduring modification in the activity pattern of a spatially distributed "assembly of
nerve cells". This statement can be expanded into a two-part rule:
a. If two neurons on either side of a synapse are activated simultaneously (synchronously), then
the strength of that synapse is selectively increased.
b. If two neurons on either side of a synapse are activated asynchronously, then that synapse is
selectively weakened or eliminated.
Such a synapse is called Hebbian synapse. More precisely, we define a Hebbian synapse as a
synapse which uses a time-dependent, highly local, and strongly interactive mechanism to
increase synaptic efficiency as a function of the correlation between the presynaptic and
postsynaptic activities.
From this definition we may deduce the following four key properties which characterise a
Hebbian synapse:
i. Time-Dependent Mechanism:
This mechanism refers to the fact that the modifications in a Hebbian synapse depend on the
exact time of occurrence of the presynaptic and postsynaptic signals.
ii. Local Mechanism:
By its very nature, a synapse is the transmission site where information-bearing signals
(representing ongoing activity in the presynaptic and postsynaptic units) are in spatio temporal
contiguity. This locally available information is used by a Hebbian synapse to produce a local
synaptic modification which is input specific.
iii. Interactive Mechanism: The change in a Hebbian synapse depends on the activity levels on
both sides of the synapse, that is, on both the presynaptic and postsynaptic activities.
iv. Conjunctional or Correlational Mechanism: The correlation over time between the presynaptic
and postsynaptic activities is sufficient to produce the synaptic modification.
4. Competitive Learning:
In competitive learning, the output neurons of a neural network compete among themselves to
become active (fired). The three basic elements of the competitive learning rule are:
i. A set of neurons which are all the same except for some randomly distributed synaptic
weights, and which therefore respond differently to a given set of input patterns.
ii. A limit imposed on the "strength" of each neuron.
iii. A mechanism which permits the neurons to compete for the right to respond to a given subset
of inputs, such that only one output neuron, or only one neuron per group, is active (i.e., 'on') at a
time. The neuron which wins the competition is called a winner-takes-all neuron.
Accordingly the individual neurons of the network learn to specialize on ensembles of similar
patterns; in so doing they become feature detectors for different classes of input patterns.
In the simplest form of competitive learning, the neural network has a single layer of output
neurons, each of which is fully connected to the input nodes. The network may include feedback
connections among the neurons, as indicated in Fig. 11.22. In the network architecture described
herein, the feedback connections perform lateral inhibition, with each neuron tending to inhibit
the neuron to which it is laterally connected. In contrast, the feed forward synaptic connections
in the network of Fig. 11.15 all are excitatory.
For a neuron k to be the winning neuron, its induced local field vk for a specified input pattern x
must be the largest among all the neurons in the network. The output signal yk of the winning neuron
k is set equal to one; the output signals of all the neurons which lose the competition are set
equal to zero.
We thus write:
yk = { 1, if vk > vj for all j, j ≠ k; 0, otherwise }
where the induced local field vk represents the combined action of all the forward and feedback
inputs to neuron k.
Let ωkj denote the synaptic weight connecting input node j to neuron k. Suppose that each neuron
is allotted a fixed amount of synaptic weight (i.e., all synaptic weights are positive), which is
distributed among its input nodes; that is, for all k:
∑j ωkj = 1
A neuron then learns by shifting synaptic weights from its inactive to active input nodes. If a
neuron does not respond to a particular input pattern, no learning takes place in that neuron.
If a particular neuron wins the competition, each input node of that neuron relinquishes some
proportion of its synaptic weight, and the weight relinquished is then distributed equally among
the active input nodes. According to the standard competitive learning rule, the change
Δωkj applied to synaptic weight ωkj is defined by:
Δωkj = { ƞ (xj − ωkj), if neuron k wins the competition; 0, if neuron k loses the competition }
where ƞ is the learning-rate parameter. This rule has the overall effect of moving the synaptic
weight vector ωk of the winning neuron k towards the input pattern x.
5. Boltzmann Learning:
The Boltzmann learning rule, named in honor of Ludwig Boltzmann, is a stochastic learning
algorithm derived from ideas rooted in statistical mechanics. A neural network designed on the
basis of the Boltzmann learning rule is called a Boltzmann machine.
In a Boltzmann machine the neurons constitute a recurrent structure, and they operate in a binary
manner: they are either in an 'on' state denoted by +1 or in an 'off' state denoted by -1. The
machine is characterized by an energy function E, the value of which is determined by the
particular states occupied by the individual neurons of the machine, as shown by:
E = −½ ∑j ∑k ωkj xk xj,   j ≠ k
where xj is the state of neuron j and ωkj is the synaptic weight connecting neuron j to neuron k.
The fact that j ≠ k means simply that none of the neurons in the machine has self-feedback. The
machine operates by choosing a neuron at random − say neuron k − at some step of the
learning process, and then flipping the state of neuron k from state xk to state −xk at some
temperature T with probability:
P(xk → −xk) = 1 / (1 + exp(−ΔEk / T))
where ΔEk is the energy change (i.e., the change in the energy function of the machine)
resulting from such a flip. Note that T is not a physical temperature but a pseudo-temperature, as
discussed under the stochastic model of a neuron. If this rule is applied repeatedly, the machine
will reach thermal equilibrium.
The neurons of a Boltzmann machine partition into two functional groups:
a. Visible and
b. Hidden.
The visible neurons provide an interface between the network and the environment in which it
operates, whereas the hidden neurons always operate freely. There are two modes of operation to
consider:
I. Clamped condition, in which the visible neurons are all clamped onto specific states
determined by the environment.
II. Free-running condition, in which all the neurons (visible and hidden) are allowed to operate
freely.
Let ρ+kj denote the correlation between the states of neurons j and k with the network in its
clamped condition, and ρ−kj the correlation between the states of neurons j and k with the network
in its free-running condition. Both correlations are averaged over all possible states of the
machine when it is in thermal equilibrium.
Then, according to the Boltzmann learning rule, the change Δωkj applied to the synaptic
weight ωkj from neuron j to neuron k is defined by:
Δωkj = ƞ (ρ+kj − ρ−kj),   j ≠ k   …(11.35)
where ƞ is the learning-rate parameter. Note that both ρ+kj and ρ−kj range in value from -1 to +1.
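A minimal sketch of the stochastic flip rule above; the network size, symmetric random weights, and temperature are illustrative assumptions, and the code accepts energy-lowering flips with high probability, in line with the rule:

import numpy as np

rng = np.random.default_rng(0)
N, T = 6, 1.0
W = rng.normal(size=(N, N))
W = (W + W.T) / 2                 # symmetric weights
np.fill_diagonal(W, 0.0)          # no self-feedback (j != k)
x = rng.choice([-1, 1], size=N)   # binary states, on (+1) or off (-1)

def energy(state):
    return -0.5 * np.sum(W * np.outer(state, state))

for step in range(1000):                  # repeated flips approach thermal equilibrium
    k = rng.integers(N)                   # choose a neuron at random
    x_new = x.copy()
    x_new[k] = -x_new[k]                  # candidate flip x_k -> -x_k
    dE = energy(x_new) - energy(x)        # energy change resulting from the flip
    if rng.random() < 1.0 / (1.0 + np.exp(dE / T)):
        x = x_new                         # accept the flip stochastically

print(x, energy(x))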
Implementing AND Gate:
There are 4 training samples, so there will be 4 iterations. The activation function used here is
the bipolar sigmoid function, so the range is [-1, 1].
Step 1:
Set the weights and bias to zero: w = [0 0 0]T and b = 0.
Step 2:
Set the input vector Xi = Si for i = 1 to 4 (the third component is the constant bias input):
X1 = [-1 -1 1]T
X2 = [-1 1 1]T
X3 = [1 -1 1]T
X4 = [1 1 1]T
Step 3:
The output value is set to the target, y = t, with bipolar AND targets t = (-1, -1, -1, 1).
Step 4:
Modify the weights using the Hebbian rule.
First iteration −
W(new) = W(old) + x1·y1 = [0 0 0]T + [-1 -1 1]T · (-1) = [1 1 -1]T
For the second iteration, the final weights of the first iteration are used, and so on.
Second iteration −
W(new) = [1 1 -1]T + [-1 1 1]T · (-1) = [2 0 -2]T
Third iteration −
W(new) = [2 0 -2]T + [1 -1 1]T · (-1) = [1 1 -3]T
Fourth iteration −
W(new) = [1 1 -3]T + [1 1 1]T · (1) = [2 2 -2]T
So, the final weight matrix is [2 2 -2]T.
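A short verification sketch of the worked example (the bipolar step output used for checking is an assumption consistent with the [-1, 1] range stated above):

samples = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
w = [2, 2, -2]   # final weights; the last component acts on the constant bias input

for x1, x2 in samples:
    net = w[0] * x1 + w[1] * x2 + w[2] * 1   # bias input fixed at 1
    y = 1 if net > 0 else -1                 # bipolar output
    print(f"{x1:+d} AND {x2:+d} -> {y:+d}")
# Only the (+1, +1) input yields +1, exactly as the AND gate requires.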
Neural Networks with Memory
We have always heard that Neural Networks (NNs) are inspired by biological neural networks,
and this representation was done in a fantastic way. Figure 1 shows the anatomy of a single
neuron. The central part is called the cell body, where the nucleus resides. Several wires pass the
stimulus to the cell body, and a few wires send the output to other neurons. The thickness of the
dendrites implies the weight/bias/power of the stimulus. Many neurons with various cell bodies
are stacked up to form the biological neural network.
The main difference between the functioning of artificial neural networks and the biological
neural network is memory. While both the human brain and neural networks have the ability to
read and write from the memory available, the brain can create and store the memory as well.
Researchers identified that this key difference is a major roadblock for today's AI systems, and
have tried to overcome it by putting together a neural network and linking it to external memory.
The neural network would act as a CPU with a memory attached. Such differentiable computers
aim to learn programs (algorithms) from input and output data. Neural networks are used when
the amount of data is huge; for example, text data has an enormous number of dimensions, as
does image data.
A movie consists of a sequence of scenes. When we watch a particular scene, we don't try to
understand it in isolation, but rather in connection with previous scenes. In a similar fashion, a
machine learning model has to understand text by utilizing already-learned text, just like a
human neural network. Traditional machine learning models cannot store a model's previous
stages, but Recurrent Neural Networks (commonly called RNNs) can do this for us.
An RNN has a repeating module that takes input from the previous stage and gives its output as
input to the next stage. However, a plain RNN can only retain information from the most recent
stages. To learn long-term dependencies, our network needs memorization power. LSTMs are a
special case of RNNs which can do that. They have the same chain-like structure as RNNs, but
with a different repeating module structure.
LSTMs have a wide range of applications in sequence-to-sequence modeling tasks like speech
recognition, text summarization, video classification, and so on. For example, a spam detection
model can be built by converting text data into vectors, creating an LSTM model, and fitting the
model with the vectors.
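A minimal Keras sketch of that spam-detection idea; the vocabulary size, sequence length, layer sizes, and random stand-in data are illustrative assumptions (real messages would first be tokenised into integer sequences):

import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

vocab_size, seq_len = 5000, 50
X = np.random.randint(1, vocab_size, size=(200, seq_len))  # stand-in tokenised messages
y = np.random.randint(0, 2, size=(200, 1))                 # 1 = spam, 0 = not spam

model = Sequential()
model.add(Embedding(vocab_size, 32, input_length=seq_len))  # text vectors
model.add(LSTM(32))                                         # memory over the sequence
model.add(Dense(1, activation='sigmoid'))                   # spam probability
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X, y, epochs=3, batch_size=32)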
Deep Learning models are broadly classified into supervised and unsupervised models.
Supervised DL models:
• Artificial Neural Networks (ANNs)
• Recurrent Neural Networks (RNNs)
• Convolutional Neural Networks (CNNs)
Unsupervised DL models:
• Self-Organizing Maps (SOMs)
• Boltzmann Machines
• Autoencoders
Let us learn what exactly Boltzmann machines are and how they work, and also implement a
recommender system which predicts whether a user will like a movie or not based on the movies
watched previously.
The Boltzmann Machine is an unsupervised DL model in which every node is connected to every
other node. That is, unlike ANNs, CNNs, RNNs, and SOMs, Boltzmann Machines
are undirected (the connections are bidirectional). The Boltzmann Machine is not a deterministic
DL model but a stochastic or generative DL model; it is rather a representation of a certain
system. There are two types of nodes in the Boltzmann Machine − visible nodes, those
which we can and do measure, and hidden nodes, those which we cannot or do not measure.
Although the node types are different, the Boltzmann machine considers them as the same, and
everything works as one single system. The training data is fed into the Boltzmann Machine and
the weights of the system are adjusted accordingly. Boltzmann machines help us understand
abnormalities by learning how the system works under normal conditions.
[Figure: Boltzmann Machine]
Energy-Based Models:
The Boltzmann distribution is used as the sampling distribution of the Boltzmann Machine. It is
governed by the equation:
pi = e^(−εi/kT) / ∑j e^(−εj/kT)
pi − probability of the system being in state i
εi − energy of the system in state i
T − temperature of the system
k − Boltzmann constant
∑j e^(−εj/kT) − sum over all possible states of the system
The Boltzmann distribution describes the different states of the system, and Boltzmann machines
create different states of the machine using this distribution. From the above equation, as the
energy of the system increases, the probability of the system being in state i decreases. Thus, the
system is most stable in its lowest energy state (just as a gas is most stable when it spreads out).
In Boltzmann machines, the energy of the system is defined in terms of the weights of the
synapses. Once the system is trained and the weights are set, the system always tries to find the
lowest energy state for itself by adjusting the weights.
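A minimal sketch of the distribution itself, for a few assumed energy levels:

import math

energies = [0.0, 1.0, 2.0]   # epsilon_i, illustrative values
kT = 1.0                     # Boltzmann constant times temperature

weights = [math.exp(-e / kT) for e in energies]
Z = sum(weights)                       # sum over all possible states
probs = [wt / Z for wt in weights]
print(probs)   # lower-energy states receive higher probability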
Types of Boltzmann Machines:
• Restricted Boltzmann Machines (RBMs)
• Deep Belief Networks (DBNs)
• Deep Boltzmann Machines (DBMs)
Restricted Boltzmann Machines (RBMs):
In a full Boltzmann machine, each node is connected to every other node, so the number of
connections grows rapidly (quadratically in the number of nodes). This is the reason we use
RBMs. The restrictions on the node connections in RBMs are as follows:
• Hidden nodes cannot be connected to one another.
• Visible nodes cannot be connected to one another.
The energy function for a Restricted Boltzmann Machine is:
E(v, h) = −∑i ai vi − ∑j bj hj − ∑i ∑j vi wi,j hj
ai, bj − biases of the system (constants)
vi, hj − visible node i, hidden node j
P(v, h) − probability of being in a certain state:
P(v, h) = e^(−E(v, h)) / Z
Z − sum of values over all possible states
Suppose that we are using our RBM to build a recommender system that works on six (6)
movies. The RBM learns how to allocate the hidden nodes to certain features. Through the
process of Contrastive Divergence, we make the RBM fit our set of movies, which is our case or
scenario. The RBM identifies which features are important through the training process. Each
training value is either 0, 1, or missing, based on whether a user liked that movie (1), disliked
that movie (0), or did not watch the movie (missing data). The RBM automatically identifies the
important features.
Contrastive Divergence:
The RBM adjusts its weights by this method. Starting from randomly assigned initial weights, the
RBM computes the hidden nodes, which in turn use the same weights to reconstruct the input
nodes. Each hidden node is constructed from all the visible nodes, and each visible node is
reconstructed from all the hidden nodes; hence, the reconstructed input differs from the original
input even though the weights are the same. The process continues until the reconstructed input
matches the previous input, at which point the process is said to have converged. This entire
procedure is known as Gibbs sampling.
[Figure: Gibbs sampling]
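A minimal sketch of one contrastive-divergence step (CD-1); the sizes, random data, and learning rate are illustrative assumptions, biases are omitted, and probabilities are used in place of binary samples for simplicity:

import numpy as np

rng = np.random.default_rng(0)
n_vis, n_hid, lr = 6, 3, 0.1                    # e.g. six movies, three hidden features
W = rng.normal(0, 0.1, (n_vis, n_hid))
v0 = rng.integers(0, 2, n_vis).astype(float)    # a user's liked/disliked vector

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

h0 = sigmoid(v0 @ W)       # hidden nodes computed from all visible nodes
v1 = sigmoid(W @ h0)       # visibles reconstructed from the hiddens, SAME weights
h1 = sigmoid(v1 @ W)       # hiddens recomputed: one step of Gibbs sampling

# Weight update mirroring the gradient formula below: <v0 h0> - <v1 h1>.
W += lr * (np.outer(v0, h0) - np.outer(v1, h1))
print(W.round(3))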
The gradient formula gives the gradient of the log probability of a certain state of the system
with respect to the weights of the system. It is given as follows:
∂/∂wij (log P(v0)) = <vi0 ∗ hj0> − <vi∞ ∗ hj∞>
v − visible state, h − hidden state
<vi0 ∗ hj0> − initial state of the system
<vi∞ ∗ hj∞> − final state of the system
P(v0) − probability that the system is in state v0
wij − weights of the system
The above equation tells us how a change in the weights of the system will change the log
probability of the system being in a particular state. The system tries to end up in the lowest
possible energy state (the most stable). Instead of continuing the weight-adjustment process until
the current input matches the previous one, we can also consider only the first few passes. It is
sufficient to understand how to adjust the curve so as to reach the lowest energy state. Therefore,
we adjust the weights and redesign the system and energy curve such that we get the lowest
energy for the current position. This is known as Hinton's shortcut.
[Figure: Hinton's shortcut]
• Mary does not like m4, which is directed by Tarantino; she probably dislikes any movie
directed by Tarantino.
Therefore, based on the observations and the details of m2 and m6, our RBM recommends m6 to
Mary ('Drama', 'Vicario' and 'Oscar' match both Mary's interests and m6). This is how an
RBM works and why it is used in recommender systems.
[Figure: Working of an RBM]
What is learning?
A learning rule in a neural network describes how a machine learns: it is a mathematical logic
for improving the performance of an artificial neural network. The learning rule is one of the
most important factors deciding the accuracy of an ANN.
Synapse
Information flow is directional. The neuron that releases the chemical called a neurotransmitter is the presynaptic neuron, and the neuron that receives the neurotransmitter is the postsynaptic neuron.
Learning Rules
Hebbian Learning
It is one of the oldest learning algorithms. A synapse (connection) between two neurons is strengthened if neuron A is near enough to excite neuron B and repeatedly or persistently takes part in firing it. This leads to some growth process or metabolic change in one or both cells such that A's efficiency as one of the cells firing B is increased.
Time-dependent mechanism: modifications in a Hebbian synapse depend on the exact times of occurrence of the presynaptic and postsynaptic activities.
Local mechanism: a synapse holds locally available information-bearing signals; a Hebbian synapse uses this local information to produce a local synaptic modification that is input-specific.
Interactive mechanism: the change in a Hebbian synapse depends on the activity levels on both sides of the synapse (i.e., the presynaptic and postsynaptic activities).
Competitive Learning
In competitive learning, the output neurons compete among themselves to be fired. Unlike Hebbian learning, only one neuron can be active at a time. It is a form of unsupervised learning. Its three basic elements are (see the sketch after this list):
1. A set of neurons that are all the same except for some randomly distributed weights, and which therefore respond differently to a given set of inputs.
2. A limit imposed on the “strength” of each neuron.
3. A mechanism that permits the neurons to compete for the right to respond to a given set of inputs. The winning neuron is called a winner-takes-all neuron.
The internal activity vⱼ of the winning neuron must be the largest among all the neurons for a specified input pattern x. The output signal yⱼ of the winning neuron is set equal to one; the output signals of all the neurons that lose the competition are set to zero.
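A minimal winner-takes-all sketch in Python/NumPy; the weight normalization and learning rate are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(2)
W = rng.random((3, 4))                # 3 competing neurons, 4 inputs
W /= W.sum(axis=1, keepdims=True)     # limit on the "strength" of each neuron

def competitive_step(x, W, lr=0.1):
    v = W @ x                 # internal activities v_j
    j = np.argmax(v)          # the winner-takes-all neuron
    y = np.zeros(len(v))
    y[j] = 1.0                # winner outputs 1, all losers output 0
    W[j] += lr * (x - W[j])   # only the winner moves its weights toward x
    return y, W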
Boltzmann Learning
The Boltzmann learning rule is slower than the error-correction learning rule because it takes into account the state of each individual neuron in addition to the system output.
The neurons operate in a binary manner, with +1 representing the on state and -1 the off state. The machine is characterized by an energy function E.
The neurons of a Boltzmann machine are divided into two functional groups, visible and hidden. The visible neurons act as an interface between the network and the environment in which it operates, whereas the hidden neurons always operate freely.
For a linear regression of statistical data with multiple predictors, let's begin with a linear equation:
yᵢ = β₀ + β₁xᵢ₁ + … + βₚxᵢₚ + εᵢ
Where yᵢ is the dependent response variable and xᵢⱼ are the observed values of each independent variable j, of which there are p for each statistical unit i, of which there are n. The error term is εᵢ. The coefficients are βⱼ, of which there are p+1. Here's a view of linear data when there is one predictor:
Linear Regression, Bivariate (image by author)
We can also use vectors and matrices to represent the linear equation. The vector y =
(y₁,…,yᵢ,…,yₙ)⊤ represents the values taken by the response variable. X of dimension n×(p+1) is
the matrix of xᵢⱼ predictor values, with the first column defined as a constant, meaning that xᵢ₀ ≔
1.
Representing the linear equation with vectors and matrices gives us:
y = Xβ + ε
For the linear regression of y on X with error vector ε, the coefficient vector β is derived by minimizing the sum of squared errors:
β̂ = argmin over β of ‖y - Xβ‖²
Taking the partial derivatives with respect to the vector β and setting them equal to zero derives the minimizing value of β, which we will name β^ₒₗₛ because we are using the ordinary least squares method:
β^ₒₗₛ = (X⊤X)⁻¹X⊤y
At this value, β^ₒₗₛ is a true minimum because the Hessian matrix of second derivatives, 2X⊤X, is positive definite.
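The closed-form solution can be checked numerically with a short NumPy sketch (the synthetic data below is purely illustrative):

import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # first column x_i0 := 1
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)           # add error terms

# Solve the least-squares problem (equivalent to (X^T X)^(-1) X^T y).
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_ols)   # close to beta_true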
Statisticians use the above geometric derivations when investigating a linear statistical model, a
model that is tested before being put to use to make predictions. An example basic model is (Tile,
2019):
• β is the vector of the unknown coefficients (that is, the estimators) in ℝ⁽ᵖ⁺¹⁾
• The matrix X is not random and is full rank. If the matrix X is not full rank, then at least one
of the columns of the matrix (that is, of the covariates) can be written as a linear combination
of other columns, suggesting a reconsideration of the data
• The variance of the error terms is constant: Var(εᵢ) = σ² for all i, that is, homoscedastic
• The covariance of the error terms is zero: Cov(εᵢ, εⱼ) = 0 for all i≠j
The Gauss-Markov theorem states that, for this model with Normal distribution error terms, the ordinary-least-squares derived estimator of β is the best linear unbiased estimator. So we get:
β^ₒₗₛ ~ N(β, σ²(X⊤X)⁻¹)
A prediction can then be made for a new set of independent variables xₖ:
ŷₖ = xₖβ^ₒₗₛ
After testing to ensure that the model fits the data, statistical theory then defines other important
values, such as confidence intervals for variance of the estimators and prediction intervals for the
model’s predictions.
We can generalize the above linear statistical model, with Normal (Gaussian) error terms, via
mathematical transformations into the generalized linear model (GLM) in statistics, allowing
regression, estimator tests, and analysis of the exponential family of conditional distributions
of y given X, such as Binomial, Multinomial, Exponential, Gamma, and Poisson.
The parameters are estimated using the maximum likelihood method. For logistic regression, when there is a binomial response, y ∈ {0,1}, the logistic function defines the probability of a successful outcome, 𝜋 = P(y=1|x), where x is the vector of observed predictive variables, of which there are p. If β is the vector of unknown coefficients, of which there are p+1, then, using z = xβ:
𝜋(z) = eᶻ / (1 + eᶻ)
We can apply this function via the log-odds, or logit, to a linear model as follows:
logit(𝜋ᵢ) = log(𝜋ᵢ / (1 - 𝜋ᵢ)) = xᵢβ
Where 𝜋ᵢ = P (yᵢ=1|xᵢ) and xᵢ is the ith observed outcome, of which there are n. The logistic
function above allows us to apply the theory behind linear regression to a probability between 0
and 1 of predicting a successful outcome. Statistical tests and data measures, such as deviance,
goodness of fit measures, Wald test, and Pearson 𝜒² statistic, can be applied using this model.
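As a usage sketch, a logistic regression of this kind can be fit with a standard library; the synthetic data below is purely illustrative:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 2))               # p = 2 observed predictors
z = 0.5 + 1.5 * X[:, 0] - 2.0 * X[:, 1]     # linear predictor x*beta
pi = 1.0 / (1.0 + np.exp(-z))               # logistic function pi = P(y=1|x)
y = (rng.random(200) < pi).astype(int)      # binomial response in {0, 1}

model = LogisticRegression().fit(X, y)
print(model.intercept_, model.coef_)   # maximum likelihood estimates of beta
print(model.predict_proba(X[:3]))      # predicted P(y=0|x) and P(y=1|x)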
Linear Regression using Machine Learning
From the machine learning point of view, predictive models are considered too complicated or computationally intensive to solve analytically. Instead, very small steps are taken on portions of the data and iteratively cycled through to derive the solution. We'll walk through the solution to linear regression using machine learning; before we proceed, however, it is important to understand that in machine learning the function to be solved isn't typically predefined. In our case, we already know that we want to perform only a linear regression, but typically in machine learning various models (or functions) of the data are compared until the best trade-off between being too general and imprecise, on the one hand, and overfitting the data, on the other hand, is found empirically.
Taking the derivative of the loss function and setting it equal to zero yields the coefficient values, but we're going to perform the calculation stepwise, with one calculation for each training instance, since the machine learning algorithm will pass through the data multiple times.
To find the direction of the stepwise updates, we'll take the derivative of the loss function and use that direction to move our learning a step towards the minimum:
θ ← θ - 𝛼 ∂L/∂θ
This process is known as gradient descent, and 𝛼 defines the length of the small step, which is known as the learning rate.
In machine learning, we consider training pairs (x₁, y₁), …, (xₙ, yₙ) and we cycle through updates for each pair multiple times when optimizing θ. Let's look at the derivative of the squared error for a single training pair:
∂/∂βⱼ (y - xβ)² = -2(y - xβ)xⱼ
This equation gives us the direction in which to move the β values, which are also known as weights, towards their minimum. The constant 2 is typically disregarded because it doesn't affect the optimal values of β (Ng, 2018). So in our case, for each mth training instance, β is updated as follows:
β ← β + 𝛼 (y⁽ᵐ⁾ - x⁽ᵐ⁾β) x⁽ᵐ⁾
We start the learning process by establishing random values for each weight in β and begin the algorithm. The learning rate 𝛼 should be set so that progress towards the minimum of the loss function is sufficiently fast without overshooting and making the minimum impossible to reach. A dynamic learning rate, where 𝛼 is reduced as the function nears its minimum, is often implemented. With a good learning rate, this machine learning algorithm will calculate the values of the coefficients β as precisely as desired, reaching the same values derived analytically by ordinary least squares.
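A minimal NumPy sketch of this stochastic update loop; the learning rate, number of passes, and synthetic data are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(5)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # bias column plus one predictor
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.1, size=n)

beta = rng.normal(size=2)    # random initial weights
alpha = 0.01                 # learning rate
for epoch in range(50):      # multiple passes through the data
    for m in range(n):       # one update per training instance
        error = y[m] - X[m] @ beta
        beta += alpha * error * X[m]   # step against the gradient
print(beta)   # approaches the ordinary-least-squares values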
Logistic Regression with a Neural Network
The idea of neural networks came from the concept of how neurons work in living animals: a nerve signal is either amplified or dampened by each neuron the signal passes through, and it is the sum of multiple neurons in series and in parallel, each filtering multiple inputs and feeding that signal to additional neurons, that eventually provides the desired output. The feed-forward neural network is the simplest form of a neural network, where the calculation is done only in the forward direction, from input to output. Neural networks allow for the use of multiple layers of neurons, where each layer provides specific functions.
The figure below shows the framework for a simple feed-forward neural network that provides logistic regression:
In a simple feed-forward neural network for classification, the weights wⱼ and ‘bias’ term w₀ represent the coefficients β from the linear regression method and are trained by the network.
The general neural network function takes the following form (Bishop, 2006):
y(x, w) = f(∑ⱼ wⱼφⱼ(x))
Where f(·) is a nonlinear activation function and φⱼ(x) is a basis function. The basis function can transform the inputs x before the weights w are determined; in the case of logistic regression, the basis function is simply the identity, φⱼ(x) = xⱼ. The activation function f(·) is also set to the identity for linear regression. With logistic regression, however, a specific activation function is needed to convert the output of the linearly determined weights to a probability: the sigmoid function, which is equivalent to the logistic function defined for logistic regression in statistics. The sigmoid function, in contrast to the logistic function 𝜋(z), is mathematically converted to have only one exponent to simplify programming, as shown in the following equation:
σ(z) = 1 / (1 + e⁻ᶻ)
Where z = xβ. The sigmoid activation function provides the probability of the prediction.
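A minimal sketch of this forward pass in Python/NumPy; the example vectors x and w are illustrative:

import numpy as np

def sigmoid(z):
    # One-exponent form: 1 / (1 + e^(-z)).
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, w, w0):
    # Weighted sum plus bias, then the sigmoid activation.
    z = w @ x + w0        # z = x*beta, with w0 playing the role of beta_0
    return sigmoid(z)     # probability of the positive class

x = np.array([0.5, -1.2, 3.0])
w = np.array([0.8, -0.4, 0.1])
print(forward(x, w, w0=0.2))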
Previously we used the generalized linear model in statistics to expand linear regression to logistic regression for a binomial response. We can do a similar transformation for situations where the response is multinomial, i.e., multiclass. The key difference is that instead of using the sigmoid activation function to provide a probability for the prediction, the softmax function is used:
softmax(z)ₖ = e^(zₖ) / ∑ⱼ e^(zⱼ)
Where z = xβ and K is the number of classes.
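A minimal NumPy sketch of the softmax; subtracting the maximum is a common numerical-stability trick, an implementation detail rather than part of the formula:

import numpy as np

def softmax(z):
    # softmax(z)_k = e^(z_k) / sum_j e^(z_j) over the K classes
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])          # K = 3 class scores
print(softmax(z), softmax(z).sum())    # probabilities summing to 1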