
Artificial Intelligence

Tic-Tac-Toe Learner AI

Tic-Tac-Toe is a simple game for two players that we enjoyed playing as kids
(especially in boring classrooms). The two players take turns placing their
respective symbols in a 3x3 grid. The player who manages to place three of their
symbols in a horizontal, vertical, or diagonal row wins the game. If neither player
manages to do so, the game ends in a draw. If both players always play their
optimal strategies, the game always ends in a draw.

Example: a Tic-Tac-Toe board.

Since the grid is small and there are only two players involved, the number of
possible moves for every board state is limited, allowing tree-based search
algorithms like Alpha-Beta pruning to provide a computationally feasible and
exact solution for building a computer-based Tic-Tac-Toe player.
In this article, we look at an approximate (learning-based) approach to the same
game. Even though an exact algorithm exists (i.e. Alpha-Beta pruning), the
approximate method provides an alternative that could remain useful if the
complexity of the board were to increase, and the code changes needed to
incorporate that would be minimal.

The idea is to pose the Tic-Tac-Toe game as a well-posed learning problem, as
described in Tom Mitchell's Machine Learning book (Chapter 1). The
learning system is described briefly in the following section.

Tic-Tac-Toe Learning System

The basic idea behind the learning system is that the system should be able to
improve its performance (P) with respect to a set of tasks (T) by learning from
training experience (E). The training experience (E) can be direct (a predefined
set of data with individual labels) or indirect feedback (no labels for each
training example). In our case:

 Task (T): Playing Tic-Tac-Toe

 Performance (P): Percentage of games won against humans

 Experience (E): Indirect feedback via solution traces (game histories)
generated from games played against itself (a clone)

Learning from Experience (E): Ideally, a function (the Ideal Target Function)
needs to be learned that gives the best move possible for any given board state.
In our problem, we represent the ITF as a linear function (V) that maps a
given board state to a real value (score). Then we use an approximation
algorithm (Least Mean Squares) to estimate the ITF from the solution traces.
1. V(boardState) → R, (R: a real score value for the given boardState.)

2. V_hat(boardState) ← (W.T) * X, (W: the weights of the target function, X:
the features extracted from the given boardState.)

3. LMS training rule to update the weights: Wi ← Wi + lr * (V_train(boardState)
− V_hat(boardState)) * Xi, (i: the ith weight/feature, lr: the learning rate). Note
that the error term is the training value (defined below) minus the current
estimate.

The training score (V_train) for each non-final board state is the estimated score
of the successor board state. The final board state is assigned a score based on
the end result of the game.

4. V_train(boardState) ← V_hat(Successor(boardState))
5. V_train(finalBoardState) ← 100 (Win) | 0 (Draw) | −100 (Loss)
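For concreteness, here is a minimal Python sketch of one LMS step (written for
this text, not taken from the original article), assuming the features X form a
NumPy vector:

import numpy as np

def lms_update(weights, features, v_train, lr=0.1):
    """One LMS step: move V_hat(boardState) toward the training value
    V_train(boardState) = V_hat(Successor(boardState))."""
    v_hat = weights @ features              # current estimate, (W.T) * X
    error = v_train - v_hat                 # V_train(b) - V_hat(b)
    return weights + lr * error * features  # Wi <- Wi + lr * error * Xi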

Implementation

The final design is split into four modules (Ch. 1, Tom Mitchell's Machine
Learning book):

1. Experiment Generator: Its job is to generate a new problem statement at the
beginning of every training epoch. In our case, it just returns an empty initial
board state.

2. Performance System: This module takes as input the problem provided by
the Experiment Generator and then uses the current learned target function to
produce a solution trace of the game at every epoch. In our case, this is done
by simulating a Tic-Tac-Toe game between two players (cloned programs).
These players make move decisions using the current target function.

3. Critic: This module takes the solution trace and outputs a set of training
examples to be input to the Generalizer.
Training Examples ← [<Features(boardState), V_train>, <>, …]

4. Generalizer: This module uses the training examples provided by the Critic
to update/improve the target function by learning the desired weights using
the LMS weight-update rule at each epoch.
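Wired together, the four modules form a simple self-play training loop. The
following is hypothetical glue code: the helpers generate_initial_board,
play_game, and extract_training_examples are illustrative names, not from the
original project, and lms_update is the sketch shown earlier:

NUM_FEATURES, NUM_EPOCHS = 6, 1000  # assumed sizes, for illustration only
weights = np.zeros(NUM_FEATURES)
for epoch in range(NUM_EPOCHS):
    board = generate_initial_board()                      # 1. Experiment Generator
    trace = play_game(board, weights)                     # 2. Performance System (self-play)
    examples = extract_training_examples(trace, weights)  # 3. Critic
    for features, v_train in examples:                    # 4. Generalizer (LMS updates)
        weights = lms_update(weights, features, v_train)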


Best programming languages for AI development


The origins of Artificial Intelligence (AI) date way back in time, so it is safe to
say that it is not an innovation of the year 2022. We have witnessed, used, and
gotten used to this area of technology, and it is getting better day by day.
That said, businesses and individuals are increasingly inclined towards AI
development these days. With benefits like enhanced customer experience,
smart decision making, automation, minimal errors, and data analytics, AI
development seems to be a perfect choice.
However, after you have made this choice, there is another hard decision you
need to make: choosing a programming language for AI development. While
there are many languages out there that will get the work done, you should
know which one will suit your project. To make your decision easy, we have
made a list of the ten best programming languages for AI development. Let us
look into it:

1. JAVA

Java by Oracle is one of the best programming languages available out there.
Over the years, this language has adapted to the latest innovations and
technological advancements, and the same is true for AI. Using Java for AI
development can help you build scalable applications.
For AI development, Java offers ease of use and debugging and simplifies
large-scale projects. You can represent data graphically and offer better user
interaction.
Java's Virtual Machine technology lets developers build a single version
of an app that runs on any Java-supporting platform. Developers can also
work on the graphics and interfaces, making them more appealing with the
Standard Widget Toolkit. In all, Java helps you build AI
applications that are maintainable, portable, and secure.

2. PYTHON

Another one on the list is Python, the programming language that requires the
least code among all the others. There are many reasons why Python is a strong
choice for AI development. These reasons include:

 Prebuilt libraries for advanced computing like NumPy, SciPy, and
PyBrain.
 An open-source language supported by developers from across the world.
There are many forums and tutorials for Python that you can seek help
from.
 The strength of Python's machine learning ecosystem.
 Python is an independent and flexible language, compatible with multiple
platforms with minimal tweaks.
 Options like scripting, an OOP approach, and IDEs allow
fast development with diverse algorithms.

With all these features and many others, Python has become one of the best
languages for AI development.

3. JAVASCRIPT

Just like Java, JavaScript is also an ideal match for AI development; however, it
is used more to develop secure and dynamic websites. While Python suits
developers who want to write as little code as possible, JavaScript is for those
who don't mind it.
The AI capabilities of JavaScript help it interact and work smoothly with other
web technologies like HTML and CSS. Like Java, JavaScript also has a large
community of developers that support the development process. With libraries
like jQuery, React.js, and Underscore.js, AI development becomes more
effective. From multimedia and buttons to data storage, you can manage both
frontend and backend functions using JavaScript.
With JavaScript, you can ensure security, high performance, and less
development time.
4. JULIA

While Julia does not come with a large community or broad support, it offers
many high-end features for top-notch AI development. When it comes to
handling data analysis and numbers, Julia is an excellent development tool.
If you need to make a dynamic interface, catchy graphics, and data visuals, Julia
provides you with the right tools for perfect execution. With features like
debugging, memory management, and metaprogramming, this language makes
AI development a breeze.
For machine learning AI projects, Julia is a strong bet. It comes with many
packages like Metahead, MLJ.jl, Turing.jl, and Flux.jl.

5. LISP

Lisp is one of the oldest languages used for AI development. It was developed
around 1960 and has always been an adaptable and smart language. If your
project requires modification of code, problem-solving, rapid prototyping, or
dynamic development, Lisp is for you.
Some successful projects made with Lisp are Routific, Grammarly, and DART.
Though it has its drawbacks, Lisp is still a promising programming language for
AI development.

6. R

A statistical programming language, R is one of the most suitable choices for
projects where you need statistical computations. It supports learning libraries
like MXNet, TensorFlow, Keras, etc. The language is adopted by many
industries like education, finance, telecommunications, pharmaceuticals, life
sciences, etc. It is a language that fuels tech giants like Microsoft, Google,
and Facebook, and businesses like Uber, Airbnb, etc.
R includes user-created packages for graphical devices, tools, import/export
capabilities, statistical techniques, etc. With built-in graphics and data-modeling
support, the language allows developers to work on deep learning models
without much hassle.

7. PROLOG

Prolog is short for Programming in Logic. The language was developed in 1972
and is rule-based in form. It is mainly used for projects that involve
computational linguistics and artificial intelligence. For projects that require a
database, natural language processing, or symbolic reasoning, Prolog is a strong
bet! It is a perfect language to support research in artificial intelligence.
Used for automated planning, theorem proving, and expert and type systems,
Prolog still sees limited usage. However, it is used to build some high-end NLP
applications and by giants like IBM Watson.

8. SCALA

Scala makes the coding process fast, easy, and much more productive. Scaladex,
an index of Scala libraries and resources, helps developers create
quality applications.
Scala runs on the Java Virtual Machine (JVM) and helps developers
program smart software. It is compatible with Java and JavaScript and offers
many features like pattern matching, high-performing functions, browser tools,
and flexible interfaces. For AI development, Scala is one of the best options,
and it has impressed developers in that area.

9. RUST

Everyone is looking for high-performance, fast, and safe software development,
and Rust helps you achieve that! It is a general-purpose programming language
that developers love to use for AI development. The syntax of Rust is similar to
C++, but Rust also offers memory safety without needing garbage collection.
Rust works in the backend of many well-known systems like Dropbox, Yelp,
Firefox, Azure, Polkadot, Cloudflare, npm, Discord, etc. Its memory safety,
speed, and ease of expression make Rust a strong choice for AI development
and scientific computing.

10. HASKELL

While Haskell comes with limited support, it is another good programming
language you can try for AI development. It offers pure functional programming
and abstraction capabilities that make the language very flexible. However, the
lack of support might delay the AI development process.
With Haskell, code reusability comes in handy for developers, along with
other features like its type system and memory management.
Python in detail:
Although Python was created before AI became crucial to businesses, it's one
of the most popular languages for Artificial Intelligence. Python is the most
used language for Machine Learning (which lives under the umbrella of AI).
One of the main reasons Python is so popular for AI development is that it
has grown into a powerful data analysis tool and has always been popular within
the field of big data.

As for modern technology, the most important reason why Python is always
ranked near the top is that there are AI-specific frameworks that were created
for the language. One of the most popular is TensorFlow, which is an open-
source library created specifically for machine learning and can be used for
training and inference of deep neural networks. Other AI-centric frameworks
include:

 scikit-learn – for training machine learning models.
 PyTorch – for visual and natural language processing.
 Keras – a high-level interface for building neural networks and running
complex mathematical calculations.
 Theano – a library for defining, optimizing, and evaluating mathematical
expressions.

Python is also one of the easiest languages to learn and use.
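As a small taste of that ecosystem, here is a minimal scikit-learn example
(training a classifier on the library's built-in iris dataset; the model choice is
arbitrary):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier().fit(X_train, y_train)   # train the model
print("accuracy:", model.score(X_test, y_test))          # evaluate on held-out data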

Lisp
Lisp has been around since the 1960s and has been widely used for scientific
research in the fields of natural language, theorem proving, and artificial
intelligence. Lisp was originally created as a practical mathematical
notation for programs, but eventually became a top choice of developers in the
field of AI.

Even though Lisp is the second oldest programming language still in use, it
includes several features that are critical to successful AI projects:

 Rapid prototyping.
 Dynamic object creation.
 Mandatory garbage collection.
 Data structures can be executed as programs.
 Programs can be modified as data.
 Uses recursion as a control structure rather than iteration.
 Great symbolic information processing capabilities.
 Read-Eval-Print-Loop to ease interactive programming.
More importantly, the man who created Lisp (John McCarthy) was very
influential in the field of AI, and much of his work remained in use for a long
time.

Java
It should go without saying that Java is an important language for AI. One
reason for that is how prevalent the language is in mobile app development.
And given how many mobile apps take advantage of AI, it’s a perfect match.

Not only can Java work with TensorFlow, but it also has other libraries and
frameworks specifically designed for AI:

 Deep Java Library – a library built by Amazon to create deep learning abilities.
 Kubeflow – makes it possible to deploy and manage Machine Learning stacks
on Kubernetes.
 OpenNLP – a Machine Learning tool for processing natural language.
 Java Machine Learning Library – provides several Machine Learning
algorithms.
 Neuroph – makes it possible to design neural networks.

Java also offers simplified debugging, and its easy-to-use syntax supports
graphical data presentation and incorporates both WORA ("write once, run
anywhere") and object-oriented patterns.

C++
C++ is another language that’s been around for quite some time, but still is a
legitimate contender for AI use. One of the reasons for this is how widely
flexible the language is, which makes it perfectly suited for resource-intensive
applications. C++ is a low-level language that provides better handling for the
AI model in production. And although C++ might not be the first choice for AI
engineers, it can’t be ignored that many of the deep and machine learning
libraries are written in C++.

And because C++ compiles user code down to machine code, it's incredibly
efficient and performant.

R
R might not be the perfect language for AI, but it's fantastic at crunching very
large numbers, which makes it better than Python at scale. And with R's built-in
functional programming, vectorized computation, and object-oriented nature, it
makes for a viable language for Artificial Intelligence.

R also enjoys a few packages that are specifically designed for AI:
 gmodels – provides several tools for the task of model fitting.
 tm – a framework used for text mining applications.
 RODBC – an ODBC interface.
 OneR – makes it possible to implement the One Rule Machine Learning
classification algorithm.

Julia
Julia is one of the newer languages on the list and was created to focus on
performance computing in scientific and technical fields. Julia includes several
features that directly apply to AI programming:

 Common numeric data types.
 Arbitrary precision values.
 Robust mathematical functions.
 Tuples, dictionaries, and code introspection.
 Built-in package manager.
 Dynamic type system.
 Ability to work for both parallel and distributed computing.
 Macros and metaprogramming capabilities.
 Support for multiple dispatch.
 Support for calling C functions.

https://nitsri.ac.in/Department/Computer%20Science%20&%20Engineering/ProblemSolving(L-2).pdf

State space of a problem.

The set of all possible states for a given problem is known as the state space of
the problem.
To find the solution, one can make use of an explicit search tree that is
generated from the initial state and the successor function, which together
define the state space.

In general, we may have a search graph rather than a search tree, as the same
state can be reached from multiple paths.

A state space essentially consists of a set of nodes representing each state of
the problem.
Terminology used in a search tree:
i) Node: a bookkeeping data structure used to represent the configuration of a
state in the search tree.
ii) State: reflects a world configuration; the successor function maps a state and
an action to another state.
iii) Fringe: the collection of nodes that have been generated but not yet
expanded.

State Space:
– The root of the search tree is a search node corresponding to the initial state;
at this state we first check whether the goal is reached.
– If the goal is not reached, we need to consider other states. This can be done
by expanding the current state by applying the successor function, which
generates new states; from this we may get multiple states.
– For each one of these, we again apply the goal test, or else repeat the
expansion of each state.
– The choice of which state to expand is determined by the search strategy.
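The expand-and-test loop above can be sketched in a few lines of Python. This
is a generic illustration written for these notes, not code from the linked PDF;
using a FIFO queue as the fringe gives breadth-first search, while a stack would
give depth-first search:

from collections import deque

def tree_search(initial_state, successors, is_goal):
    # The fringe holds nodes that have been generated but not yet expanded.
    fringe = deque([(initial_state, [initial_state])])
    while fringe:
        state, path = fringe.popleft()   # which node to expand = search strategy
        if is_goal(state):               # goal test on the chosen node
            return path
        for next_state in successors(state):   # expand: apply successor function
            fringe.append((next_state, path + [next_state]))
    return None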


Exhaustive Search

For discrete problems in which no efficient solution method is known, it might
be necessary to test each possibility sequentially in order to determine whether
it is the solution. Such exhaustive examination of all possibilities is known as
exhaustive search, direct search, or the "brute force" method. Unless it turns out
that NP problems are equivalent to P problems, which seems unlikely but has
not yet been proved, NP problems can only be solved by exhaustive search in
the worst case.
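As a small illustration (written for these notes, not from the source), here is a
brute-force subset-sum search in Python that tests all 2^n possibilities
sequentially:

from itertools import product

def subset_sum_brute_force(numbers, target):
    # Try every subset (2^n possibilities) until one sums to the target.
    for mask in product([0, 1], repeat=len(numbers)):
        subset = [x for x, keep in zip(numbers, mask) if keep]
        if sum(subset) == target:
            return subset
    return None

print(subset_sum_brute_force([3, 9, 8, 4], 12))  # finds [8, 4]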
UNIT II
Problem Reduction and Game Playing
Problem Reduction in AI with example
We already know the divide-and-conquer strategy: a solution to a problem
can be obtained by decomposing it into smaller sub-problems. Each of these
sub-problems can then be solved to get its sub-solution. These sub-solutions can
then be recombined to get a solution as a whole. This is called Problem
Reduction. This method generates arcs which are called AND arcs. One AND
arc may point to any number of successor nodes, all of which must be solved for
the arc to point to a solution.
Problem Reduction algorithm:
1. Initialize the graph to the starting node.
2. Loop until the starting node is labelled SOLVED or until its cost goes
above FUTILITY:
(i) Traverse the graph, starting at the initial node and following the current best
path, and accumulate the set of nodes that are on that path and have not yet been
expanded.
(ii) Pick one of these unexpanded nodes and expand it. If there are no
successors, assign FUTILITY as the value of this node. Otherwise, add its
successors to the graph and for each of them compute f'(n). If f'(n) of any node
is 0, mark that node as SOLVED.
(iii) Change the f'(n) estimate of the newly expanded node to reflect the new
information provided by its successors. Propagate this change backwards
through the graph. If any node contains a successor arc whose descendants are
all solved, label the node itself as SOLVED.
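To make the AND-arc idea concrete, here is a small illustrative Python sketch
(a hypothetical problem structure, not an implementation of the full algorithm
above): a node is SOLVED if it is primitive, or if every subproblem on one of
its AND arcs is solved:

def solved(node, and_arcs, primitive):
    if node in primitive:        # a primitive node is trivially solvable
        return True
    # and_arcs[node] is a list of AND arcs; each arc is a list of subproblems,
    # all of which must be solved for that arc to yield a solution.
    return any(all(solved(s, and_arcs, primitive) for s in arc)
               for arc in and_arcs.get(node, []))

and_arcs = {"goal": [["sub1", "sub2"]]}   # goal decomposes into sub1 AND sub2
print(solved("goal", and_arcs, {"sub1", "sub2"}))  # True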
Constraint Satisfaction Problem in Artificial Intelligence
A Constraint Satisfaction Problem (CSP) is a problem whose solution must
satisfy some limitations or conditions, called constraints. It consists of:
1. A finite set of variables which store the solution: V = {V1, V2, V3, …, Vn}
2. A set of discrete values, known as the domain, from which the solution is
picked: D = {D1, D2, D3, …, Dn}
3. A finite set of constraints: C = {C1, C2, C3, …, Cn}

Constraint Satisfaction
1. Until a complete solution is found or until all paths have led to dead ends, do:
(i) Select an unexpanded node of the search graph.
(ii) Apply the constraint inference rules to the selected node to generate all
possible new constraints.
(iii) If the set of constraints contains a contradiction, then report that this path is
a dead end.
(iv) If the set of constraints describes a complete solution, then report success.
(v) If neither a contradiction nor a complete solution has been found, then apply
the problem-space rules to generate new partial solutions that are consistent
with the current set of constraints. Insert these partial solutions into the search
graph.
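A classic concrete instance is map coloring. The following is a minimal
backtracking sketch for a CSP with variables V, domains D, and constraints C
requiring neighbouring regions to take different colors (the three-region
example is invented for illustration):

def backtrack(assignment, variables, domains, neighbours):
    if len(assignment) == len(variables):
        return assignment                 # complete and consistent: success
    var = next(v for v in variables if v not in assignment)
    for value in domains[var]:
        # Constraint check: no neighbour may already have this value.
        if all(assignment.get(n) != value for n in neighbours[var]):
            result = backtrack({**assignment, var: value},
                               variables, domains, neighbours)
            if result is not None:
                return result
    return None                           # contradiction: dead end, backtrack

variables = ["WA", "NT", "SA"]
domains = {v: ["red", "green", "blue"] for v in variables}
neighbours = {"WA": ["NT", "SA"], "NT": ["WA", "SA"], "SA": ["WA", "NT"]}
print(backtrack({}, variables, domains, neighbours))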
Means-End Analysis in Artificial Intelligence
Means-end analysis is a special type of knowledge-rich search that allows both
backward and forward searching.
Principle of Means-End Analysis:
“It allows us to solve parts of a problem first and then go back and solve
the smaller problems that arise while assembling the final solution”.
Technique: It is based on the use of the operations which transform the state of
the world. MEA works on three primary goals:
i. Transformation
ii. Reduction
iii. Application
Transformation: It means to transform object A into object B. It is an AND
graph which subdivides the problem into intermediate problems and then
transforms them into the goal state B.
Note: This process terminates when there is no difference between A and B,
i.e., when the goal is reached.
Reduction: It means to reduce the difference between object A and object B by
modifying object A.
Note: The goal of the operation is to reduce the difference between object A
and object B by transforming object A into an object A' nearer to goal B. The
operator that does this is called the relevant operator (R).
Application: It means to apply the operator R to object A. This will again be an
AND graph showing the goal of reducing the difference between object A and
the preconditions required for the operator R, giving an intermediate object A''.
Operator R is then applied to A'', transforming it to A', which is closer to goal B.

Algorithm for Means-End Analysis (MEA):

Step 1: Repeat until the goal is reached or no more procedures are available:
Step 2: Describe the current state, the goal state, and the difference between the
two.
Step 3: Use the difference between the current state and the goal state, possibly
with the description of the current state or goal state, to select a promising
procedure.
Step 4: Apply the promising procedure and update the current state.
Step 5: If the goal is reached, then success; else, failure.
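Below is a minimal Python sketch of this loop with a hypothetical operator
table; states and goals are modelled as sets of facts, and each operator reduces
some named difference when its preconditions hold (the ticket example is
invented for illustration):

def means_end(state, goal, operators):
    while state != goal:
        diff = goal - state                      # differences still to reduce
        op = next((o for o in operators
                   if o["reduces"] & diff and o["precond"] <= state), None)
        if op is None:
            return None                          # no applicable operator: failure
        state = state | op["adds"]               # apply the operator
    return state                                 # goal reached: success

ops = [{"reduces": {"at_goal"}, "precond": {"have_ticket"}, "adds": {"at_goal"}},
       {"reduces": {"have_ticket"}, "precond": set(), "adds": {"have_ticket"}}]
print(means_end(set(), {"have_ticket", "at_goal"}, ops))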
Alpha-beta pruning is an optimisation technique for the minimax algorithm.
Through the course of this blog, we will discuss what alpha-beta pruning
means, the minimax algorithm, rules to find good move ordering, and
more.

Game Playing in AI

1. Introduction
2. What is Alpha Beta pruning?
3. Condition for Alpha-beta pruning
4. Minimax algorithm
5. Key points in Alpha-beta Pruning
6. Working of Alpha-beta Pruning
7. Move Ordering in Pruning
8. Rules to find Good ordering
9. Codes in Python

Introduction

The word ‘pruning’ means cutting down branches and leaves. In data science,
pruning is a much-used term which refers to post- and pre-pruning in decision
trees and random forests. Alpha-beta pruning is nothing but the pruning of
useless branches in game trees. The alpha-beta pruning algorithm was
discovered independently by several researchers in the mid-20th century.

Alpha-beta pruning is an optimisation technique for the minimax algorithm,
which is discussed in the next section. The need for pruning came from the
fact that in some cases game trees become very complex. In such a tree,
some useless branches increase the complexity of the model. So, to avoid
this, alpha-beta pruning comes into play so that the computer does not have to
look at the entire tree. These unnecessary nodes make the algorithm slow;
hence, by removing them, the algorithm becomes fast.

What is Alpha Beta pruning?


The minimax algorithm is optimised via alpha-beta pruning, which is detailed
in the next section. The requirement for pruning arose from the fact that
game trees can become extremely complex in some circumstances.
Some superfluous branches in such a tree add to the model's complexity. To
circumvent this, alpha-beta pruning is used, which saves the computer from
having to examine the entire tree. The algorithm is slowed by these atypical
nodes; as a result of deleting them, the algorithm becomes more efficient.

Condition for Alpha-beta pruning

 Alpha: At any point along the Maximizer's path, alpha is the best
option or the highest value we've discovered. The initial value of
alpha is −∞.
 Beta: At any point along the Minimizer's path, beta is the best option
or the lowest value we've discovered. The initial value of beta is
+∞.
 The condition for alpha-beta pruning is α >= β.
 The alpha and beta values of each node must be kept track of. Alpha
can only be updated when it's MAX's turn, and beta can only be
updated when it's MIN's turn.
 MAX will update only alpha values, and the MIN player will update
only beta values.
 The node values, rather than the alpha and beta values, are passed to
upper nodes when backing up the tree.
 Alpha and beta values are only passed down to child nodes.

Minimax algorithm

Minimax is a classic depth-first search technique for a sequential two-player
game. The two players are called MAX and MIN. The minimax algorithm is
designed to find the optimal move for MAX, the player at the root node.
The search tree is created by recursively expanding all nodes from the root in
a depth-first manner until either the end of the game or the maximum search
depth is reached. Let us explore this algorithm in detail.
As already mentioned, there are two players in the game, viz. Max and Min.
Max plays first. Max's task is to maximise its reward while Min's
task is to minimise Max's reward, increasing its own reward at the same
time. Let's say Max can take actions a, b, or c. Which one of them will give
Max the best reward when the game ends? To answer this question, we need
to explore the game tree to a sufficient depth and assume that Min plays
optimally to minimise the reward of Max.

Here is an example. Four coins are in a row and each player can pick up one
coin or two coins on his/her turn. The player who picks up the last coin wins.
Assuming that Max plays first, what move should Max make to win?

If Max picks two coins, then only two coins remain and Min can pick those two
coins and win. Thus picking up one coin maximises Max's reward, as the
minimax sketch below confirms.
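Here is a minimal minimax implementation of this coin game (written for this
text), scoring a Max win as +1 and a Max loss as −1:

def minimax(coins, is_max_turn):
    if coins == 0:
        # The previous player took the last coin and won;
        # if it is Max's turn now, Max has lost.
        return -1 if is_max_turn else +1
    values = [minimax(coins - take, not is_max_turn)
              for take in (1, 2) if take <= coins]
    return max(values) if is_max_turn else min(values)

best = max((1, 2), key=lambda take: minimax(4 - take, False))
print(best)  # 1 -- taking one coin guarantees Max the win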

As you might have noticed, the nodes of the tree in the figure below have
some values inscribed on them; these are called minimax values. The minimax
value of a node is the utility of the node if it is a terminal node.

If the node is a non-terminal Max node, the minimax value of the node is the
maximum of the minimax values of all of the node’s successors. On the other
hand, if the node is a non-terminal Min node, the minimax value of the node
is the minimum of the minimax values of all of the node’s successors.

Now we will discuss the idea behind alpha-beta pruning. If we apply
alpha-beta pruning to the standard minimax algorithm, it gives the same
decision as the standard algorithm, but it prunes (cuts down) the nodes
that do not affect the final decision made by the algorithm. This helps to
avoid the complexity of interpreting complex trees.


Now let us discuss the intuition behind this technique. Let us try to find the
minimax decision in the below tree:

In this case,

Minimax Decision = MAX {MIN {3, 5, 10}, MIN {2, a, b}, MIN {2, 7, 3}}
= MAX {3, c, 2} = 3

Looking at this result, you may wonder how we can find the maximum when a
value is missing. Here is the resolution:

In the second node we chose the minimum value c, which must be less than or
equal to 2, i.e. c <= 2. Since c <= 2 < 3, when we take the max of 3, c, and 2,
the maximum value is 3.

We have reached a decision without looking at those nodes. And this is
where alpha-beta pruning comes into play.

Key points in Alpha-beta Pruning

 Alpha: Alpha is the best choice or the highest value that we have
found at any instance along the path of the Maximizer. The initial
value of alpha is −∞.

 Beta: Beta is the best choice or the lowest value that we have found
at any instance along the path of the Minimizer. The initial value of
beta is +∞.

 The condition for alpha-beta pruning is α >= β.

 Each node has to keep track of its alpha and beta values. Alpha can
be updated only when it's MAX's turn and, similarly, beta can be
updated only when it's MIN's chance.

 MAX will update only alpha values and the MIN player will update
only beta values.

 The node values, rather than the alpha and beta values, are passed to
upper nodes when backing up the tree.

 Alpha and beta values are only passed down to child nodes.

Working of Alpha-beta Pruning

1. We will first start with the initial move. We will initially define the
alpha and beta values as the worst case, i.e. α = −∞ and β = +∞. We
will prune a node only when alpha becomes greater than or equal
to beta.

2. Since the initial value of alpha is less than beta, we don't prune it. Now
it's MAX's turn. So, at node D, the value of alpha will be calculated. The value
of alpha at node D will be max(2, 3) = 3.

3. Now the next move will be on node B, and it's MIN's turn. So, at
node B, the value of beta will be min(3, +∞) = 3. So, at node B, alpha = −∞
and beta = 3.
In the next step, the algorithm traverses the next successor of node B, which is
node E, and the values α = −∞ and β = 3 will also be passed down.

4. Now it's MAX's turn. So, at node E we look for the max. The current
value of alpha at E is −∞ and it is compared with 5, so max(−∞, 5) = 5.
So, at node E, alpha = 5 and beta = 3. Now we can see that alpha is
greater than or equal to beta, which satisfies the pruning condition, so we can
prune the right successor of node E; it will not be traversed, and the
value at node E will be 5.

5. In the next step, the algorithm again comes to node A from node B. At node
A, alpha will be changed to the maximum value, max(−∞, 3) = 3. So now the
values of alpha and beta at node A are (3, +∞) respectively, and these will be
transferred to node C. The same values will be transferred to node F.

6. At node F, the value of alpha is compared with the left branch, which is
0: max(0, 3) = 3. Then it is compared with the right child, which is
1, and max(3, 1) = 3, so α remains 3, but the node value of F becomes 1.

7. Now node F returns the node value 1 to C, where it is compared with the
beta value. It is MIN's turn, so min(+∞, 1) = 1. Now at node C,
α = 3 and β = 1, and alpha is greater than beta, which again satisfies the
pruning condition. So, the next successor of node C, i.e. G, will be pruned, and
the algorithm does not compute the entire subtree G.

Now, C returns the node value 1 to A, and the best value of A will be
max(1, 3) = 3.
The tree described above is the final tree, showing the nodes that are computed
and the nodes that are not computed. So, for this example, the optimal value of
the maximizer is 3.


Move Ordering in Pruning

The effectiveness of alpha-beta pruning is based on the order in which nodes
are examined. Move ordering plays an important role in alpha-beta pruning.

There are two types of move ordering in alpha-beta pruning:

1. Worst Ordering: In some cases of alpha-beta pruning, none of the
nodes are pruned by the algorithm, and it works like the standard minimax
algorithm. This consumes a lot of time, because of the alpha and beta
bookkeeping, and gives no benefit. This is called worst ordering in
pruning. In this case, the best move occurs on the right side of the tree.
2. Ideal Ordering: In some cases of alpha-beta pruning, a lot of the
nodes are pruned by the algorithm. This is called ideal ordering in
pruning. In this case, the best move occurs on the left side of the
tree. Since we apply DFS, the left side of the tree is searched first, allowing
the algorithm to go roughly twice as deep as plain minimax in the same
amount of time.
Rules to find Good ordering

 The best move should be explored from the shallowest node.
 Use domain knowledge while finding the best move.
 Order the nodes in such a way that the best nodes are computed first.

Codes in Python

import time
from random import shuffle

# Note: AIElements (and its State type) is supplied by the surrounding project;
# it provides get_possible_action, result_function, and evaluation_function.

class MinimaxABAgent:
    """
    Minimax agent with alpha-beta pruning
    """
    def __init__(self, max_depth, player_color):
        """
        Initiation

        Parameters
        ----------
        max_depth : int
            The max depth of the tree
        player_color : int
            The player's index as MAX in the minimax algorithm
        """
        self.max_depth = max_depth
        self.player_color = player_color
        self.node_expanded = 0

    def choose_action(self, state):
        """
        Predict the move using the minimax algorithm

        Parameters
        ----------
        state : State

        Returns
        -------
        float, str:
            The evaluation or utility and the action key name
        """
        self.node_expanded = 0

        start_time = time.time()

        print("MINIMAX AB : Wait AI is choosing")
        list_action = AIElements.get_possible_action(state)
        eval_score, selected_key_action = self._minimax(0, state, True,
                                                        float('-inf'), float('inf'))
        print("MINIMAX : Done, eval = %d, expanded %d" % (eval_score, self.node_expanded))
        print("--- %s seconds ---" % (time.time() - start_time))

        return (selected_key_action, list_action[selected_key_action])

    def _minimax(self, current_depth, state, is_max_turn, alpha, beta):
        # Cut off the search at the maximum depth or at a terminal state
        if current_depth == self.max_depth or state.is_terminal():
            return AIElements.evaluation_function(state, self.player_color), ""

        self.node_expanded += 1

        possible_action = AIElements.get_possible_action(state)
        key_of_actions = list(possible_action.keys())

        shuffle(key_of_actions)  # randomness among equally good moves
        best_value = float('-inf') if is_max_turn else float('inf')
        action_target = ""
        for action_key in key_of_actions:
            new_state = AIElements.result_function(state, possible_action[action_key])

            eval_child, action_child = self._minimax(current_depth + 1, new_state,
                                                     not is_max_turn, alpha, beta)

            if is_max_turn and best_value < eval_child:
                best_value = eval_child
                action_target = action_key
                alpha = max(alpha, best_value)
                if beta <= alpha:
                    break  # beta cut-off: MIN will never allow this branch

            elif (not is_max_turn) and best_value > eval_child:
                best_value = eval_child
                action_target = action_key
                beta = min(beta, best_value)
                if beta <= alpha:
                    break  # alpha cut-off: MAX will never allow this branch

        return best_value, action_target
In this document we have seen an important component of game theory.
Although the minimax algorithm's results are good, the algorithm itself is
slow. To make it fast we use the alpha-beta pruning algorithm, which cuts
the irrelevant nodes out of the game tree to improve performance. This fast,
well-performing algorithm is now widely used.

Unit III
Assignment: 6 problems each on the following topics:
Propositional calculus
Propositional logic
Predicate logic
Unit III
Knowledge Representation

Unit IV

Probability Theory

Probability theory is a branch of mathematics that investigates the probabilities
associated with random phenomena. A random phenomenon can have several
outcomes. Probability theory describes the chance of occurrence of a particular
outcome by using certain formal concepts.

Probability theory makes use of some fundamentals such as sample spaces,
probability distributions, random variables, etc. to find the likelihood of
occurrence of an event. In this article, we will take a look at the definition,
basics, formulas, examples, and applications of probability theory.

What is Probability Theory?

Probability theory makes use of random variables and probability distributions
to assess uncertain situations mathematically. In probability theory, the concept
of probability is used to assign a numerical description to the likelihood of
occurrence of an event. Probability can be defined as the number of favorable
outcomes divided by the total number of possible outcomes of an event.

Probability Theory Definition

Probability theory is a field of mathematics and statistics that is concerned with
finding the probabilities associated with random events. There are two main
approaches available to study probability theory: theoretical probability and
experimental probability. Theoretical probability is determined on the basis of
logical reasoning without conducting experiments. In contrast, experimental
probability is determined on the basis of historical data, by performing repeated
experiments.

Probability Theory Example

Suppose the probability of obtaining the number 4 on rolling a fair die needs to
be established. The number of favorable outcomes is 1. The possible outcomes
of the die are {1, 2, 3, 4, 5, 6}. This implies that there are a total of 6 outcomes.
Thus, the probability of obtaining a 4 on a die roll, using probability theory, can
be computed as 1/6 ≈ 0.167.

Probability Theory Basics

There are some basic terminologies associated with probability theory that aid
in the understanding of this field of mathematics.

Random Experiment

A random experiment, in probability theory, can be defined as a trial that is
repeated multiple times in order to get a well-defined set of possible outcomes.
Tossing a coin is an example of a random experiment.

Sample Space

Sample space can be defined as the set of all possible outcomes that result from
conducting a random experiment. For example, the sample space of tossing a
fair coin is {heads, tails}.

Event

Probability theory defines an event as a set of outcomes of an experiment that
forms a subset of the sample space. The types of events are given as follows:

 Independent events: Events that are not affected by other events are
independent events.
 Dependent events: Events that are affected by other events are known as
dependent events.
 Mutually exclusive events: Events that cannot take place at the same time
are mutually exclusive events.
 Equally likely events: Two or more events that have the same chance of
occurring are known as equally likely events.
 Exhaustive events: An exhaustive event is one that is equal to the sample
space of an experiment.

Random Variable

In probability theory, a random variable can be defined as a variable that
assumes the value of all possible outcomes of an experiment. There are two
types of random variables, as given below.
 Discrete Random Variable: Discrete random variables can take an exact
countable value such as 0, 1, 2... It can be described by the cumulative
distribution function and the probability mass function.
 Continuous Random Variable: A variable that can take on an infinite number
of values is known as a continuous random variable. The cumulative
distribution function and probability density function are used to define the
characteristics of this variable.

Probability

Probability, in probability theory, can be defined as the numerical likelihood of
occurrence of an event. The probability of an event taking place will always lie
between 0 and 1. This is because the number of desired outcomes can never
exceed the total number of outcomes of an event. Theoretical probability and
empirical probability are used in probability theory to measure the chance of an
event taking place.

Conditional Probability

When the likelihood of occurrence of an event needs to be determined given
that another event has already taken place, it is known as conditional
probability. It is denoted as P(A | B). This represents the conditional probability
of event A given that event B has already occurred.

Expectation

The expectation of a random variable, X, can be defined as the average value of
the outcomes of an experiment when it is conducted multiple times. It is
denoted as E[X]. It is also known as the mean of the random variable.

Variance
Variance is the measure of dispersion that shows how the distribution of a
random variable varies with respect to the mean. It can be defined as the
average of the squared differences from the mean of the random variable.
Variance can be denoted as Var[X].

Probability Theory Distribution Function

A probability distribution or cumulative distribution function is a function that
models all the possible values of an experiment along with their probabilities
using a random variable. The Bernoulli distribution and the binomial
distribution are examples of discrete probability distributions in probability
theory. The normal distribution is an example of a continuous probability
distribution.

Probability Mass Function

The probability mass function can be defined as the probability that a discrete
random variable will be exactly equal to a specific value.

Probability Density Function

The probability density function is used to find the probability that a continuous
random variable falls within a given set of possible values.

Probability Theory Formulas

There are many formulas in probability theory that help in calculating the
various probabilities associated with events. The most important probability
theory formulas are listed below.

 Theoretical probability: Number of favorable outcomes / Number of possible
outcomes.
 Empirical probability: Number of times an event occurs / Total number of
trials.
 Addition Rule: P(A ∪ B) = P(A) + P(B) - P(A∩B), where A and B are
events.
 Complementary Rule: P(A') = 1 - P(A). P(A') denotes the probability of an
event not happening.
 Independent events: P(A∩B) = P(A) ⋅ P(B)
 Conditional probability: P(A | B) = P(A∩B) / P(B)
 Bayes' Theorem: P(A | B) = P(B | A) ⋅ P(A) / P(B)
 Probability mass function: f(x) = P(X = x)
 Probability density function: f(x) = dF(x)/dx = F'(x), where F(x) is the
cumulative distribution function.
 Expectation of a continuous random variable: E[X] = ∫ x f(x) dx, where f(x)
is the pdf.
 Expectation of a discrete random variable: E[X] = ∑ x p(x), where p(x) is the
pmf.
 Variance: Var(X) = E[X²] − (E[X])²
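As a small numeric check of the conditional probability and Bayes' theorem
formulas above (the probability values are assumed for illustration):

p_a = 0.01                 # P(A)
p_b_given_a = 0.9          # P(B | A)
p_b_given_not_a = 0.05     # P(B | A')
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)   # total probability
p_a_given_b = p_b_given_a * p_a / p_b                   # Bayes' theorem
print(round(p_a_given_b, 4))  # 0.1538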
Applications of Probability Theory

Probability theory is used in every field to assess the risk associated with a
particular decision. Some of the important applications of probability theory are
listed below:

 In the finance industry, probability theory is used to create mathematical
models of the stock market to predict future trends. This helps investors to
invest in the least risky assets which give the best returns.
 The consumer industry uses probability theory to reduce the probability of
failure in a product's design.
 Casinos use probability theory to design games of chance so as to make
profits.


Important Notes on Probability Theory

 Probability theory is a branch of mathematics that deals with the
probabilities of random events.
 The concept of probability in probability theory gives the measure of the
likelihood of occurrence of an event.
 The probability value will always lie between 0 and 1.
 In probability theory, all the possible outcomes of a random experiment give
the sample space.
 Probability theory uses important concepts such as random variables and
cumulative distribution functions to model a random event and determine
various associated probabilities.

Certainty Factor in Artificial Intelligence

This article is all about the certainty factor, which tells us how likely a
situation is to be true. Through this, we can determine an estimate of the
amount of certainty or uncertainty in our decisions. In this article, we are going
to study what the certainty factor is and how it is determined for a particular
statement.

As we all know, when analyzing a situation and drawing conclusions about it in
the real world, we cannot be a hundred percent sure about our conclusions;
there is certainly some uncertainty in them. We as human beings have the
capability of deciding whether a statement is true or false according to how
certain we are about our observations, but machines do not have this analyzing
power. So, there needs to be some method to quantify this estimate of certainty
or uncertainty in any decision made. To implement this method, the certainty
factor was introduced for systems that work on Artificial Intelligence.

The Certainty Factor (CF) is a numeric value which tells us how likely an event
or a statement is to be true. It is somewhat similar to what we define in
probability, but the difference is that an agent, after finding the probability of an
event, cannot decide what to do on that basis alone. Based on the probability
and other knowledge that the agent has, the certainty factor is decided, through
which the agent can decide whether to declare the statement true or false.

The value of the Certainty Factor lies between −1.0 and +1.0, where the value
−1.0 suggests that the statement can never be true in any situation, and the value
+1.0 means that the statement can never be false. The value of the Certainty
Factor after analyzing any situation will be a positive or a negative value lying
within this range. The value 0 suggests that the agent has no information about
the event or the situation.

A minimum Certainty Factor is decided for every case, through which the agent
decides whether the statement is true or false. This minimum Certainty Factor
is also known as the threshold value. For example, if the minimum Certainty
Factor (threshold value) is 0.4, then if the value of CF is less than this value,
the agent claims that the particular statement is false.
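That threshold rule is easy to state in code. A minimal sketch (the 0.4 threshold
is just the example value from above, not a fixed standard):

def decide(cf, threshold=0.4):
    # CF lies in [-1.0, +1.0]; claim the statement true only at or above the threshold.
    return "true" if cf >= threshold else "false"

print(decide(0.35))  # false: CF is below the 0.4 threshold
print(decide(0.70))  # true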

Dempster Shafer Theory

Dempster–Shafer Theory was given by Arthur P. Dempster in 1967 and his
student Glenn Shafer in 1976.
This theory was developed for the following reasons:

 Bayesian theory is only concerned with single pieces of evidence.
 Bayesian probability cannot describe ignorance.

DST is an evidence theory; it combines all possible outcomes of the problem.
Hence it is used to solve problems where there may be a chance that different
evidence will lead to different results.
The uncertainty in this model is handled by:
1. Considering all possible outcomes.
2. Belief, which commits support to some possibility based on the evidence
brought forward.
3. Plausibility, which measures how compatible the evidence is with a possible
outcome.
For example:
Let us consider a room where four people are present: A, B, C and D. Suddenly
the lights go out, and when the lights come back, B has been stabbed in the back
with a knife, leading to his death. No one came into the room and no one left
the room. We know that B has not committed suicide. Now we have to find out
who the murderer is.
To solve this, there are the following possibilities:
 Either {A} or {C} or {D} has killed him.
 Either {A, C} or {C, D} or {A, D} have killed him.
 Or all three of them have killed him, i.e. {A, C, D}.
 None of them has killed him: {Φ} (say).
There will be possible evidence by which we can find the murderer, by the
measure of plausibility.
Using the above example we can say:
Set of possible conclusions (P): {p1, p2, …, pn}
where P is the set of possible conclusions, which must be exhaustive (at least
one pi must be true) and whose elements pi must be mutually exclusive.
The power set will contain 2^n elements, where n is the number of elements in
the possible set.
For example:
If P = {a, b, c}, then the power set is given as
{Φ, {a}, {b}, {c}, {a, b}, {b, c}, {a, c}, {a, b, c}} = 2³ = 8 elements.
Mass function m(K): an interpretation of m({K or B}), i.e. there is evidence for
{K or B} which cannot be divided among the more specific beliefs for K and B
alone.
Belief in K: the belief in element K of the power set is the sum of the masses of
the elements which are subsets of K. This can be explained through an example.
Let's say K = {a, b, c}:
Bel(K) = m(a) + m(b) + m(c) + m(a, b) + m(a, c) + m(b, c) + m(a, b, c)
Plausibility in K: the sum of the masses of the sets that intersect with K,
i.e. Pl(K) = m(a) + m(b) + m(c) + m(a, b) + m(b, c) + m(a, c) + m(a, b, c)
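Both sums are straightforward to compute directly. Here is a small Python
sketch with an assumed mass assignment (the numbers are invented for
illustration):

masses = {frozenset({"A"}): 0.4,
          frozenset({"C"}): 0.2,
          frozenset({"A", "C", "D"}): 0.4}   # masses sum to 1

def bel(k):   # sum of the masses of all subsets of K
    return sum(m for s, m in masses.items() if s <= k)

def pl(k):    # sum of the masses of all sets that intersect K
    return sum(m for s, m in masses.items() if s & k)

K = frozenset({"A"})
print(bel(K), pl(K))  # 0.4 0.8 -- the uncertainty interval [Bel, Pl]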
Characteristics of Dempster–Shafer Theory:
 It models ignorance explicitly, while the masses over all events still
aggregate to 1.
 Ignorance is reduced in this theory by adding more and more evidence.
 A combination rule is used to combine various types of possibilities.
Advantages:
 As we add more information, the uncertainty interval reduces.
 DST has a much lower level of ignorance.
 Diagnosis hierarchies can be represented using this.
 A person dealing with such problems is free to think about evidence.
Disadvantages:
 The computational effort is high, as we have to deal with 2^n sets.

Fuzzy Logic

Fuzzy logic is an approach to variable processing that allows for multiple
possible truth values to be processed through the same variable. Fuzzy logic
attempts to solve problems with an open, imprecise spectrum of data and
heuristics that makes it possible to obtain an array of accurate conclusions.

In simpler words, a fuzzy logic state can be 0, 1, or any value in between, e.g.
0.17 or 0.54. For example, in Boolean logic we may say a glass of hot water
(i.e. 1, or high) or a glass of cold water (i.e. 0, or low), but in fuzzy logic we
may say a glass of warm water (neither hot nor cold).

Why is fuzzy logic used?


Fuzzy logic can be used for situations in which conventional logic
technologies are not effective, such as systems and devices that cannot be
precisely described by mathematical models, those that have significant
uncertainties or contradictory conditions, and linguistically controlled devices
or systems.

Fuzzy logic has been used in numerous applications such as facial pattern
recognition, air conditioners, washing machines, vacuum cleaners, antiskid
braking systems, transmission systems, control of subway systems and
unmanned helicopters, and knowledge-based systems for multiobjective
optimization of power systems.

What is a crisp set?

A crisp set is a collection of unordered distinct elements, which are derived
from a universal set. The universal set consists of all possible elements which
take part in any experiment. A set is quite a useful and important way of
representing data. For example, let X represent the set of natural numbers.

What are the steps in fuzzy logic?

Development:
1. Step 1 − Define linguistic variables and terms. Linguistic variables are input
and output variables in the form of simple words or sentences.
2. Step 2 − Construct membership functions for them.
3. Step 3 − Construct knowledge base rules.
4. Step 4 − Obtain fuzzy values.
5. Step 5 − Perform defuzzification.
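Step 2 can be illustrated with a simple triangular membership function (the
15/25/35 degree breakpoints for the linguistic term "warm" are assumed for
illustration):

def triangular(x, a, b, c):
    # Membership rises linearly from a to b, then falls from b to c.
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

print(triangular(20, 15, 25, 35))  # 0.5 -- 20 degrees is "warm" to degree 0.5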
What are the characteristics of fuzzy logic?
Characteristics of Fuzzy Logic

It is a flexible and easy-to-implement machine learning technique. It helps you
mimic the logic of human thought. Whereas classical logic has only two values,
representing two possible solutions, fuzzy logic is a highly suitable method for
uncertain or approximate reasoning.

Fuzzy Sets
https://www.tutorialspoint.com/fuzzy_logic/fuzzy_logic_set_theory.htm
What is Fuzzy Logic?

Fuzzy Logic resembles the human decision-making methodology. It deals with
vague and imprecise information. It is a gross oversimplification of real-world
problems, based on degrees of truth rather than the usual true/false (1/0) of
Boolean logic.
In fuzzy systems, values are indicated by a number in the range from 0 to 1,
where 1.0 represents absolute truth and 0.0 represents absolute falseness. The
number which indicates the value in a fuzzy system is called the truth value.

In other words, we can say that fuzzy logic is not logic that is fuzzy, but logic
that is used to describe fuzziness. There can be numerous other examples like
this with the help of which we can understand the concept of fuzzy logic.
Fuzzy Logic was introduced in 1965 by Lotfi A. Zadeh in his research paper
“Fuzzy Sets”. He is considered the father of Fuzzy Logic.
Fuzzy Logic - Classical Set Theory
A set is an unordered collection of different elements. It can be written
explicitly by listing its elements using the set bracket. If the order of the
elements is changed or any element of a set is repeated, it does not make any
changes in the set.
Example
 A set of all positive integers.
 A set of all the planets in the solar system.
 A set of all the states in India.
 A set of all the lowercase letters of the alphabet.

Mathematical Representation of a Set

Sets can be represented in two ways −

Roster or Tabular Form
In this form, a set is represented by listing all the elements comprising it. The
elements are enclosed within braces and separated by commas.
Following are the examples of set in Roster or Tabular Form −
 Set of vowels in English alphabet, A = {a,e,i,o,u}
 Set of odd numbers less than 10, B = {1,3,5,7,9}
Set Builder Notation
In this form, the set is defined by specifying a property that elements of the set
have in common. The set is described as A = {x:p(x)}
Example 1 − The set {a,e,i,o,u} is written as
A = {x:x is a vowel in English alphabet}
Example 2 − The set {1,3,5,7,9} is written as
B = {x:1 ≤ x < 10 and (x%2) ≠ 0}
If an element x is a member of any set S, it is denoted by x∈S and if an element
y is not a member of set S, it is denoted by y∉S.
Example − If S = {1,1.2,1.7,2},1 ∈ S but 1.5 ∉ S
Cardinality of a Set
The cardinality of a set S, denoted by |S|, is the number of elements of the set.
The number is also referred to as the cardinal number. If a set has an infinite
number of elements, its cardinality is ∞.
Example − |{1,4,3,5}| = 4, |{1,2,3,4,5,…}| = ∞
If there are two sets X and Y, |X| = |Y| denotes two sets X and Y having same
cardinality. It occurs when the number of elements in X is exactly equal to the
number of elements in Y. In this case, there exists a bijective function ‘f’ from
X to Y.
|X| ≤ |Y| denotes that set X’s cardinality is less than or equal to set Y’s
cardinality. It occurs when the number of elements in X is less than or equal to
that of Y. Here, there exists an injective function ‘f’ from X to Y.
|X| < |Y| denotes that set X’s cardinality is less than set Y’s cardinality. It
occurs when the number of elements in X is less than that of Y. Here, the
function ‘f’ from X to Y is injective function but not bijective.
If |X| ≤ |Y| and |Y| ≤ |X| then |X| = |Y|. The sets X and Y are commonly referred
to as equivalent sets.

Types of Sets

Sets can be classified into many types; some of which are finite, infinite, subset,
universal, proper, singleton set, etc.
Finite Set
A set which contains a definite number of elements is called a finite set.
Example − S = {x|x ∈ N and 70 > x > 50}
Infinite Set
A set which contains infinite number of elements is called an infinite set.
Example − S = {x|x ∈ N and x > 10}
Subset
A set X is a subset of set Y (Written as X ⊆ Y) if every element of X is an
element of set Y.
Example 1 − Let, X = {1,2,3,4,5,6} and Y = {1,2}. Here set Y is a subset of set
X as all the elements of set Y is in set X. Hence, we can write Y⊆X.
Example 2 − Let, X = {1,2,3} and Y = {1,2,3}. Here set Y is a subset (not a
proper subset) of set X as all the elements of set Y is in set X. Hence, we can
write Y⊆X.
Proper Subset
The term “proper subset” can be defined as “subset of but not equal to”. A Set
X is a proper subset of set Y (Written as X ⊂ Y) if every element of X is an
element of set Y and |X| < |Y|.
Example − Let, X = {1,2,3,4,5,6} and Y = {1,2}. Here set Y ⊂ X, since all
elements in Y are contained in X too and X has at least one element which is
more than set Y.
Universal Set
It is a collection of all elements in a particular context or application. All the
sets in that context or application are essentially subsets of this universal set.
Universal sets are represented as U.
Example − We may define U as the set of all animals on earth. In this case, a
set of all mammals is a subset of U, a set of all fishes is a subset of U, a set of
all insects is a subset of U, and so on.
Empty Set or Null Set
An empty set contains no elements. It is denoted by Φ. As the number of
elements in an empty set is finite, empty set is a finite set. The cardinality of
empty set or null set is zero.
Example – S = {x|x ∈ N and 7 < x < 8} = Φ
Singleton Set or Unit Set
A Singleton set or Unit set contains only one element. A singleton set is denoted
by {s}.
Example − S = {x|x ∈ N, 7 < x < 9} = {8}
Equal Set
If two sets contain the same elements, they are said to be equal.
Example − If A = {1,2,6} and B = {6,1,2}, they are equal as every element of
set A is an element of set B and every element of set B is an element of set A.
Equivalent Set
If the cardinalities of two sets are same, they are called equivalent sets.
Example − If A = {1,2,6} and B = {16,17,22}, they are equivalent as
cardinality of A is equal to the cardinality of B. i.e. |A| = |B| = 3
Overlapping Set
Two sets that have at least one common element are called overlapping sets. In
case of overlapping sets −
n(A∪B) = n(A) + n(B) − n(A∩B)
n(A∪B) = n(A−B) + n(B−A) + n(A∩B)
n(A) = n(A−B) + n(A∩B)
n(B) = n(B−A) + n(A∩B)
Example − Let, A = {1,2,6} and B = {6,12,42}. There is a common element
‘6’, hence these sets are overlapping sets.
Disjoint Set
Two sets A and B are called disjoint sets if they do not have even one element
in common. Therefore, disjoint sets have the following properties −
A∩B = Φ, i.e., n(A∩B) = 0
n(A∪B) = n(A) + n(B)
Example − Let, A = {1,2,6} and B = {7,9,14}. There is not a single common
element, hence these sets are disjoint sets.

Operations on Classical Sets

Set Operations include Set Union, Set Intersection, Set Difference, Complement
of Set, and Cartesian Product.
Union
The union of sets A and B (denoted by A ∪ B) is the set of elements
which are in A, in B, or in both A and B. Hence, A ∪ B = {x | x ∈ A OR x ∈ B}.
Example − If A = {10,11,12,13} and B = {13,14,15}, then A ∪ B =
{10,11,12,13,14,15} – The common element occurs only once.

Intersection
The intersection of sets A and B (denoted by A ∩ B) is the set of elements
which are in both A and B. Hence, A ∩ B = {x | x ∈ A AND x ∈ B}.
Example − If A = {10,11,12,13} and B = {13,14,15}, then A ∩ B = {13}.

Difference/ Relative Complement


The set difference of sets A and B (denoted by A–B) is the set of elements
which are only in A but not in B. Hence, A − B = {x|x ∈ A AND x ∉ B}.
Example − If A = {10,11,12,13} and B = {13,14,15}, then (A − B) =
{10,11,12} and (B − A) = {14,15}. Here, we can see (A − B) ≠ (B − A)

Complement of a Set
The complement of a set A (denoted by A′) is the set of elements which are not
in set A. Hence, A′ = {x|x ∉ A}.
More specifically, A′ = (U−A) where U is a universal set which contains all
objects.
Example − If A = {x | x belongs to the set of odd integers} then A′ = {y | y does not
belong to the set of odd integers}
Cartesian Product / Cross Product
The Cartesian product of n number of sets A1,A2,…An denoted as A1 × A2...×
An can be defined as all possible ordered pairs (x1,x2,…xn) where x1 ∈ A1,x2
∈ A2,…xn ∈ An
Example − If we take two sets A = {a,b} and B = {1,2},
The Cartesian product of A and B is written as − A × B = {(a,1),(a,2),(b,1),
(b,2)}
And, the Cartesian product of B and A is written as − B × A = {(1,a),(1,b),(2,a),
(2,b)}
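For illustration, the Cartesian product can be computed with Python's itertools.product; here is a minimal sketch using the sets A and B from the example above:

from itertools import product

A = {'a', 'b'}
B = {1, 2}

AxB = set(product(A, B))   # all ordered pairs (x, y) with x ∈ A, y ∈ B
BxA = set(product(B, A))

print(AxB)          # {('a', 1), ('a', 2), ('b', 1), ('b', 2)}
print(BxA)          # {(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')}
print(AxB == BxA)   # False: A × B ≠ B × A in general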

Properties of Classical Sets

Properties on sets play an important role for obtaining the solution. Following
are the different properties of classical sets −
Commutative Property
Having two sets A and B, this property states −
A ∪ B = B ∪ A
A ∩ B = B ∩ A
Associative Property
Having three sets A, B and C, this property states −
A ∪ (B ∪ C) = (A ∪ B) ∪ C
A ∩ (B ∩ C) = (A ∩ B) ∩ C
Distributive Property
Having three sets A, B and C, this property states −
A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C)
A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)
Idempotency Property
For any set A, this property states −
A ∪ A = A
A ∩ A = A
Identity Property
For set A and universal set X, this property states −
A ∪ Φ = A
A ∩ X = A
A ∩ Φ = Φ
A ∪ X = X
Transitive Property
Having three sets A, B and C, the property states −
If A ⊆ B ⊆ C, then A ⊆ C
Involution Property
For any set A, this property states −
(A′)′ = A
De Morgan’s Law
It is a very important law that helps in proving tautologies and contradictions.
This law states −
(A ∩ B)′ = A′ ∪ B′
(A ∪ B)′ = A′ ∩ B′
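As a quick sanity check, De Morgan's law can be verified on a small example with Python's built-in set type (the universal set U below is our own choice):

U = set(range(1, 11))   # a small universal set
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}

def complement(S):
    return U - S

print(complement(A & B) == complement(A) | complement(B))  # True
print(complement(A | B) == complement(A) & complement(B))  # True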
Fuzzy Sets
Fuzzy sets can be considered an extension and generalization of classical sets.
They are best understood in the context of set membership. Basically, a fuzzy
set allows partial membership, which means it contains elements that have
varying degrees of membership in the set. From this, we can understand the
difference between a classical set and a fuzzy set: a classical set contains
elements that satisfy precise properties of membership, while a fuzzy set
contains elements that satisfy imprecise properties of membership.

Mathematical Concept

A fuzzy set Ã in the universe of information U can be defined as a set of
ordered pairs and it can be represented mathematically as −
Ã = {(y, μÃ(y)) | y ∈ U}
Here μÃ(y) = degree of membership of y in Ã; it assumes
values in the range from 0 to 1, i.e., μÃ(y) ∈ [0,1].

Operations on Fuzzy Sets

Having two fuzzy sets Ã and B̃, the universe of information U and an
element y of the universe, the following relations express the union, intersection
and complement operation on fuzzy sets.
Union/Fuzzy ‘OR’
Let us consider the following representation to understand how
the Union/Fuzzy ‘OR’ relation works −
μÃ∪B̃(y) = μÃ(y) ∨ μB̃(y), ∀ y ∈ U
Here ∨ represents the ‘max’ operation.
Intersection/Fuzzy ‘AND’
Let us consider the following representation to understand how
the Intersection/Fuzzy ‘AND’ relation works −
μÃ∩B̃(y) = μÃ(y) ∧ μB̃(y), ∀ y ∈ U
Here ∧ represents the ‘min’ operation.

Complement/Fuzzy ‘NOT’
Let us consider the following representation to understand how
the Complement/Fuzzy ‘NOT’ relation works −
μÃ′(y) = 1 − μÃ(y), y ∈ U
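A minimal sketch of these three operations on a discrete universe, with fuzzy sets represented as {element: membership} dictionaries (the membership values are hypothetical):

U = ['y1', 'y2', 'y3', 'y4']
A = {'y1': 0.25, 'y2': 0.75, 'y3': 1.0, 'y4': 0.0}
B = {'y1': 0.5, 'y2': 0.25, 'y3': 0.5, 'y4': 0.75}

union        = {y: max(A[y], B[y]) for y in U}  # Fuzzy OR  (max)
intersection = {y: min(A[y], B[y]) for y in U}  # Fuzzy AND (min)
complement_A = {y: 1 - A[y] for y in U}         # Fuzzy NOT

print(union)         # {'y1': 0.5, 'y2': 0.75, 'y3': 1.0, 'y4': 0.75}
print(intersection)  # {'y1': 0.25, 'y2': 0.25, 'y3': 0.5, 'y4': 0.0}
print(complement_A)  # {'y1': 0.75, 'y2': 0.25, 'y3': 0.0, 'y4': 1.0}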

Properties of Fuzzy Sets

Let us discuss the different properties of fuzzy sets.


Commutative Property
Having two fuzzy sets Ã and B̃, this property states −
Ã ∪ B̃ = B̃ ∪ Ã
Ã ∩ B̃ = B̃ ∩ Ã
Associative Property
Having three fuzzy sets Ã, B̃ and C̃, this property states −
Ã ∪ (B̃ ∪ C̃) = (Ã ∪ B̃) ∪ C̃
Ã ∩ (B̃ ∩ C̃) = (Ã ∩ B̃) ∩ C̃
Distributive Property
Having three fuzzy sets Ã, B̃ and C̃, this property states −
Ã ∪ (B̃ ∩ C̃) = (Ã ∪ B̃) ∩ (Ã ∪ C̃)
Ã ∩ (B̃ ∪ C̃) = (Ã ∩ B̃) ∪ (Ã ∩ C̃)
Idempotency Property
For any fuzzy set Ã, this property states −
Ã ∪ Ã = Ã
Ã ∩ Ã = Ã
Identity Property
For fuzzy set Ã and universal set U, this property states −
Ã ∪ Φ = Ã
Ã ∩ U = Ã
Ã ∩ Φ = Φ
Ã ∪ U = U
Transitive Property
Having three fuzzy sets Ã, B̃ and C̃, this property states −
If Ã ⊆ B̃ ⊆ C̃, then Ã ⊆ C̃
Involution Property
For any fuzzy set Ã, this property states −
(Ã′)′ = Ã
De Morgan’s Law
This law plays a crucial role in proving tautologies and contradiction. This law
states −
(Ã ∩ B̃)′ = Ã′ ∪ B̃′
(Ã ∪ B̃)′ = Ã′ ∩ B̃′

We already know that fuzzy logic is not logic that is fuzzy but logic that is used
to describe fuzziness. This fuzziness is best characterized by its membership
function. In other words, we can say that membership function represents the
degree of truth in fuzzy logic.

Following are a few important points relating to the membership function −


 Membership functions were first introduced in 1965 by Lotfi A. Zadeh in
his research paper “Fuzzy Sets”.
 Membership functions characterize fuzziness (i.e., all the information in
fuzzy set), whether the elements in fuzzy sets are discrete or continuous.
 Membership functions can be defined as a technique to solve practical
problems by experience rather than knowledge.
 Membership functions are represented by graphical forms.
 Rules for defining fuzziness are fuzzy too.

Mathematical Notation

We have already studied that a fuzzy set Ã in the universe of information U can
be defined as a set of ordered pairs and it can be represented mathematically as −
Ã = {(y, μÃ(y)) | y ∈ U}
Here μÃ(∙) = membership function of Ã; this assumes values in the
range from 0 to 1, i.e., μÃ(∙) ∈ [0,1]. The membership
function μÃ(∙) maps U to the membership space M.
The dot (∙) in the membership function described above represents the
element in a fuzzy set, whether it is discrete or continuous.
Features of Membership Functions

We will now discuss the different features of Membership Functions.


Core
For any fuzzy set Ã, the core of a membership function is that region of the
universe that is characterized by complete or full membership in the set. Hence, the
core consists of all those elements y of the universe of information such that
μÃ(y) = 1
Support
For any fuzzy set Ã, the support of a membership function is the region of the
universe that is characterized by nonzero membership in the set. Hence, the
support consists of all those elements y of the universe of information such that
μÃ(y) > 0
Boundary
For any fuzzy set Ã, the boundary of a membership function is the region of the
universe that is characterized by nonzero but incomplete membership in the
set. Hence, the boundary consists of all those elements y of the universe of
information such that
0 < μÃ(y) < 1
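A small sketch computing all three regions for a discrete fuzzy set (the membership values below are hypothetical):

A = {'a': 0.0, 'b': 0.4, 'c': 1.0, 'd': 0.8, 'e': 1.0}

core     = {y for y, mu in A.items() if mu == 1}      # full membership
support  = {y for y, mu in A.items() if mu > 0}       # nonzero membership
boundary = {y for y, mu in A.items() if 0 < mu < 1}   # partial membership

print(core)      # {'c', 'e'}
print(support)   # {'b', 'c', 'd', 'e'}
print(boundary)  # {'b', 'd'}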

Fuzzification

It may be defined as the process of transforming a crisp set into a fuzzy set, or a
fuzzy set into a fuzzier set. Basically, this operation translates accurate crisp input
values into linguistic variables.
Following are the two important methods of fuzzification −
Support Fuzzification(s-fuzzification) Method
In this method, the fuzzified set can be expressed with the help of the following
relation −
Ã = μ1Q(x1) + μ2Q(x2) + ... + μnQ(xn)
Here the fuzzy set Q(xi) is called the kernel of fuzzification. This method is
implemented by keeping μi constant and transforming xi into a fuzzy
set Q(xi).
Grade Fuzzification (g-fuzzification) Method
It is quite similar to the above method, but the main difference is that it
keeps xi constant and expresses μi as a fuzzy set.

Defuzzification

It may be defined as the process of reducing a fuzzy set into a crisp set or to
convert a fuzzy member into a crisp member.
We have already studied that the fuzzification process involves conversion from
crisp quantities to fuzzy quantities. In a number of engineering applications, it is
necessary to defuzzify the result or rather “fuzzy result” so that it must be
converted to crisp result. Mathematically, the process of Defuzzification is also
called “rounding it off”.
The different methods of Defuzzification are described below −
Max-Membership Method
This method is limited to peak output functions and also known as height
method. Mathematically it can be represented as follows −
μA˜(x∗)>μA˜(x)forallx∈XμA~(x∗)>μA~(x)forallx∈X
Here, x∗x∗ is the defuzzified output.
Centroid Method
This method is also known as the center of area or the center of gravity method.
Mathematically, the defuzzified output x∗x∗ will be represented as −
x* = ∫ μÃ(x)·x dx / ∫ μÃ(x) dx
Weighted Average Method
In this method, each membership function is weighted by its maximum
membership value. Mathematically, the defuzzified output x∗x∗ will be
represented as −
x* = Σ μÃ(x̄i)·x̄i / Σ μÃ(x̄i)
Mean-Max Membership
This method is also known as the middle of the maxima. Mathematically, the
defuzzified output x∗x∗ will be represented as −
x* = (x̄1 + x̄2 + … + x̄n) / n
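As an illustration, the centroid method can be approximated numerically by sampling the membership function on a grid; a minimal sketch with a hypothetical triangular membership function:

import numpy as np

x = np.linspace(0, 10, 1001)                 # sampled universe
mu = np.maximum(0, 1 - np.abs(x - 4) / 2)    # triangle peaking at x = 4

# x* = ∫ μ(x)·x dx / ∫ μ(x) dx; with uniform spacing, dx cancels out
x_star = np.sum(mu * x) / np.sum(mu)
print(round(x_star, 3))  # 4.0, the centroid of a symmetric triangle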
UNIT V
Machine Learning

What is Machine Learning?

In 1959, Arthur Samuel, a computer scientist who pioneered the study of


artificial intelligence, described machine learning as “the study that gives
computers the ability to learn without being explicitly programmed.”

Alan Turing’s seminal paper (Turing, 1950) introduced a benchmark standard


for demonstrating machine intelligence, such that a machine has to be intelligent
and responsive in a manner that cannot be differentiated from that of a human
being.

Machine Learning is an application of artificial intelligence where a
computer/machine learns from past experiences (input data) and makes future
predictions. The performance of such a system should be at least at human level.

A more technical definition was given by Tom M. Mitchell (1997): “A computer
program is said to learn from experience E with respect to some class of tasks T
and performance measure P, if its performance at tasks in T, as measured by P,
improves with experience E.”
Example: A handwriting recognition learning problem:
Task T: recognizing and classifying handwritten words within images
Performance measure P: percentage of words correctly classified (accuracy)
Training experience E: a data-set of handwritten words with given classifications
In order to perform the task T, the system learns from the data-set provided. A
data-set is a collection of many examples. An example is a collection of features.

Machine Learning Categories

Machine Learning is generally categorized into three types: Supervised
Learning, Unsupervised Learning, and Reinforcement Learning.

Supervised Learning:

In supervised learning the machine experiences the examples along with the
labels or targets for each example. The labels in the data help the algorithm to
correlate the features.

Two of the most common supervised machine learning tasks


are classification and regression.
In classification problems the machine must learn to predict discrete values. That
is, the machine must predict the most probable category, class, or label for new
examples. Applications of classification include predicting whether a stock's price
will rise or fall, or deciding if a news article belongs to the politics or leisure
section. In regression problems the machine must predict the value of a continuous
response variable. Examples of regression problems include predicting the sales for
a new product, or the salary for a job based on its description.

Unsupervised Learning:

When we have unclassified and unlabeled data, the system attempts to uncover
patterns from the data. There is no label or target given for the examples. One
common task is to group similar examples together, which is called clustering.
Reinforcement Learning:

Reinforcement learning refers to goal-oriented algorithms, which learn how to


attain a complex objective (goal) or maximize along a particular dimension over
many steps. This method allows machines and software agents to automatically
determine the ideal behavior within a specific context in order to maximize its
performance. Simple reward feedback is required for the agent to learn which
action is best; this is known as the reinforcement signal. For example, maximize
the points won in a game over many moves.

Techniques of Supervised Machine Learning

Regression is a technique used to predict the value of a response (dependent)
variable from one or more predictor (independent) variables.

The most commonly used regression techniques are Linear
Regression and Logistic Regression. We will discuss the theory behind these
two prominent techniques, alongside many other key concepts
like the gradient-descent algorithm, over-fitting/under-fitting, error
analysis, regularization, hyper-parameters, and cross-validation techniques
involved in machine learning.

Linear Regression

In linear regression problems, the goal is to predict a real-valued variable y from a
given pattern X. In the case of linear regression the output is a linear function of
the input. Let ŷ be the output our model predicts: ŷ = WX + b

Here X is a vector (features of an example), W are the weights (vector of
parameters) that determine how each feature affects the prediction, and b is the
bias term. So our task T is to predict y from X; now we need to measure
performance P to know how well the model performs.

Now to calculate the performance of the model, we first calculate the error of
each example i as:
error(i) = |ŷ(i) − y(i)|
We take the absolute value of the error to take into account both positive and
negative values of error.
Finally, we calculate the mean of all recorded absolute errors:
Mean Absolute Error (MAE) = average of all absolute errors

A more popular way of measuring model performance is the
Mean Squared Error (MSE): the average of the squared differences between
prediction and actual observation:
J(w) = (1/2m) Σ (ŷ(i) − y(i))²
The mean is halved (1/2) as a convenience for the computation of gradient
descent [discussed later], as the derivative of the square term will cancel
out the 1/2 term. For more discussion on MAE vs MSE please refer to [1] &
[2].

The main aim of training the ML algorithm is to adjust the weights W to reduce

the MAE or MSE.
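A quick sketch of both metrics on hypothetical predictions, using numpy:

import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

mae = np.mean(np.abs(y_pred - y_true))   # Mean Absolute Error
mse = np.mean((y_pred - y_true) ** 2)    # Mean Squared Error

print(mae)  # 0.75
print(mse)  # 0.875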

To minimize the error, the model, while experiencing the examples of the
training set, updates the model parameters W. These error calculations, when
plotted against W, are also called the cost function J(w), since they determine the
cost/penalty of the model. So minimizing the error is also called minimizing
the cost function J.

Gradient descent Algorithm:

When we plot the cost function J(w) against w, it is represented as below:

As we see from the curve, there exists a value of parameters W which has the
minimum cost Jmin. Now we need to find a way to reach this minimum cost.
In the gradient descent algorithm, we start with random model parameters and
calculate the error for each learning iteration, keep updating the model
parameters to move closer to the values that results in minimum cost.

repeat until minimum cost: {
wj := wj − α · ∂J(w)/∂wj
}
In the above equation, we are updating the model parameters after each iteration.
The second term of the equation calculates the slope or gradient of the curve at
each iteration.

The gradient of the cost function is calculated as the partial derivative of the cost
function J with respect to each model parameter wj, where j takes values from 1
to n (the number of features). α (alpha) is the learning rate, which controls how
quickly we move towards the minimum. If α is too large, we can overshoot. If α
is too small, learning proceeds in small steps, so the overall time taken by the
model to observe all examples will be longer.
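Before moving on to the variants below, here is a minimal sketch of batch gradient descent applied to one-feature linear regression with the halved MSE cost (the learning rate and data are illustrative):

import numpy as np

def gradient_descent(x, y, lr=0.05, iters=1000):
    w, b = 0.0, 0.0
    for _ in range(iters):
        error = (w * x + b) - y          # ŷ − y for every example
        # partial derivatives of J(w, b) = (1/2m) Σ (ŷ − y)²
        w -= lr * (error * x).mean()
        b -= lr * error.mean()
    return w, b

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0                        # data generated from y = 2x + 1
w, b = gradient_descent(x, y)
print(round(w, 2), round(b, 2))          # ≈ 2.0 1.0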

There are three ways of doing gradient descent:

Batch gradient descent: Uses all of the training instances to update the model
parameters in each iteration.

Mini-batch Gradient Descent: Instead of using all examples, Mini-batch
Gradient Descent divides the training set into smaller subsets, called batches,
denoted by ‘b’. A mini-batch of size ‘b’ is thus used to update the model
parameters in each iteration.

Stochastic Gradient Descent (SGD): updates the parameters using only a


single training instance in each iteration. The training instance is usually selected
randomly. Stochastic gradient descent is often preferred to optimize cost
functions when there are hundreds of thousands of training instances or more, as
it will converge more quickly than batch gradient descent [3].

Logistic Regression

In some problems the response variable is not normally distributed. For instance,
a coin toss can result in two outcomes: heads or tails. The Bernoulli distribution
describes the probability distribution of a random variable that can take the
positive case with probability P or the negative case with probability 1−P. If the
response variable represents a probability, it must be constrained to the
range [0, 1].

In logistic regression, the response variable describes the probability that the
outcome is the positive case. If the response variable is equal to or exceeds a
discrimination threshold, the positive class is predicted; otherwise, the negative
class is predicted.

The response variable is modeled as a function of a linear combination of the


input variables using the logistic function.

Since our hypothesis ŷ has to satisfy 0 ≤ ŷ ≤ 1, this can be accomplished by
plugging the linear output into the logistic function or “Sigmoid Function”:
g(z) = 1 / (1 + e^(−z))
The function g(z) maps any real number to the (0, 1) interval, making it useful
for transforming an arbitrary-valued function into a function better suited for
classification. The following is a plot of the value of the sigmoid function over the
range [−6, 6]:
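A small sketch of the sigmoid and the values it produces over that range:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # g(z) = 1 / (1 + e^(−z))

z = np.array([-6, -2, 0, 2, 6])
print(np.round(sigmoid(z), 3))        # [0.002 0.119 0.5   0.881 0.998]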

Now coming back to our logistic regression problem, let us assume that z is a
linear function of a single explanatory variable x. We can then express z as
follows:
z = b + w·x
And the logistic function can now be written as:
g(x) = 1 / (1 + e^(−(b + w·x)))
Note that g(x) is interpreted as the probability of the dependent variable.
g(x) = 0.7, gives us a probability of 70% that our output is 1. Our probability
that our prediction is 0 is just the complement of our probability that it is 1 (e.g.
if probability that it is 1 is 70%, then the probability that it is 0 is 30%).

The input to the sigmoid function ‘g’ doesn’t need to be linear function. It can
very well be a circle or any shape.

Cost Function

We cannot use the same cost function that we used for linear regression because
the Sigmoid Function will cause the output to be wavy, causing many local
optima. In other words, it will not be a convex function.

Non-convex cost function

In order to ensure the cost function is convex (and therefore ensure convergence
to the global minimum), the cost function is transformed using the logarithm of
the sigmoid function. The cost function for logistic regression looks like:
Cost(ŷ, y) = −log(ŷ) if y = 1, and −log(1 − ŷ) if y = 0
Which can be written as:
Cost(ŷ, y) = −y·log(ŷ) − (1 − y)·log(1 − ŷ)
So the cost function for logistic regression is:
J(w) = −(1/m) Σ [ y(i)·log(ŷ(i)) + (1 − y(i))·log(1 − ŷ(i)) ]

Since the cost function is a convex function, we can run the gradient descent
algorithm to find the minimum cost.
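A sketch of this cost (log loss) computed on hypothetical predicted probabilities:

import numpy as np

def log_loss(y, y_hat, eps=1e-12):
    y_hat = np.clip(y_hat, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y     = np.array([1, 0, 1, 1])
y_hat = np.array([0.9, 0.2, 0.7, 0.6])
print(round(log_loss(y, y_hat), 3))  # 0.299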

Under-fitting & Over-fitting


We try to make the machine learning algorithm fit the input data by increasing or
decreasing the model's capacity. In linear regression problems, we increase or
decrease the degree of the polynomial.

Consider the problem of predicting y from x ∈ R. The leftmost figure below
shows the result of fitting a line to a data-set. Since the data doesn’t lie on a
straight line, the fit is not very good (left-hand figure).

To increase model capacity, we add another feature by adding the term x² to the
model. This produces a better fit (middle figure). But if we keep on doing so (x⁵,
a 5th-order polynomial, figure on the right side), we may fit the data better but
will not generalize well to new data. The first figure represents under-fitting and
the last figure represents over-fitting.

Under-fitting:

When the model has too few features and hence is not able to learn from the data
very well. Such a model has high bias.
Over-fitting:

When the model has complex functions and hence is able to fit the data very well
but is not able to generalize and predict new data. Such a model has high variance.

There are three main options to address the issue of over-fitting:

1. Reduce the number of features: Manually select which features to keep.


Doing so, we may miss some important information, if we throw away some
features.

2. Regularization: Keep all the features, but reduce the magnitude of the weights
W. Regularization works well when we have a lot of slightly useful features.

3. Early stopping: When we are training a learning algorithm iteratively such


as using gradient descent, we can measure how well each iteration of the
model performs. Up to a certain number of iterations, each iteration
improves the model. After that point, however, the model’s ability to
generalize can weaken as it begins to over-fit the training data.

Regularization
Regularization can be applied to both linear and logistic regression by adding a
penalty term to the error function in order to discourage the coefficients or
weights from reaching large values.
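As an illustration, an L2 (ridge) penalty λ·Σ wj² can be added to the halved-MSE cost from earlier; a minimal sketch, where lam stands for λ and the data are hypothetical:

import numpy as np

def regularized_cost(w, x, y, lam):
    m = len(y)
    y_hat = x @ w
    mse_term = ((y_hat - y) ** 2).sum() / (2 * m)   # halved MSE
    penalty = lam * (w ** 2).sum()                  # discourages large weights
    return mse_term + penalty

w = np.array([0.5, -2.0])
x = np.array([[1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 0.5])
print(regularized_cost(w, x, y, lam=0.1))

Increasing lam pushes the optimal weights towards zero; lam = 0 recovers the unregularized cost.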

Hyper-parameters

Hyper-parameters are “higher-level” parameters that describe structural
information about a model and must be decided before fitting the model
parameters. Examples of hyper-parameters we have discussed so far are the
learning rate α and the regularization parameter λ.

Cross-Validation

The process of selecting the optimal values of hyper-parameters is called model
selection. If we reuse the same test data-set over and over again during model
selection, it will effectively become part of our training data and the model will
be more likely to over-fit.

The overall data set is divided into:

1. the training data set

2. validation data set

3. test data set.

The training set is used to fit the different models, and the performance on the
validation set is then used for the model selection. The advantage of keeping a
test set that the model hasn’t seen before during the training and model selection
steps is that we avoid over-fitting the model and the model is able to better
generalize to unseen data.

In many applications, however, the supply of data for training and testing will be
limited, and in order to build good models, we wish to use as much of the
available data as possible for training. However, if the validation set is small, it
will give a relatively noisy estimate of predictive performance. One solution to
this dilemma is to use cross-validation, which is illustrated in Figure below.

The cross-validation steps below are included here for completeness.

Cross-Validation Step-by-Step:

These are the steps for selecting hyper-parameters using K-fold cross-validation:

1. Split your training data into K = 4 equal parts, or “folds.”

2. Choose a set of hyper-parameters, you wish to optimize.

3. Train your model with that set of hyper-parameters on the first 3 folds.

4. Evaluate it on the 4th fold, or the “hold-out” fold.


5. Repeat steps (3) and (4) K times (here, 4 times) with the same set of hyper-parameters,
each time holding out a different fold.

6. Aggregate the performance across all 4 folds. This is your performance


metric for the set of hyper-parameters.

7. Repeat steps (2) to (6) for all sets of hyper-parameters you wish to consider.

Cross-validation allows us to tune hyper-parameters with only our training set.


This allows us to keep the test set as a truly unseen data-set for selecting final
model.
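A sketch of this workflow with scikit-learn's GridSearchCV, which runs the K-fold loop internally (the dataset and parameter grid here are hypothetical):

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# each value of C is scored with K = 4 folds on the training set only
grid = GridSearchCV(SVC(kernel='linear'), {'C': [0.1, 1, 10]}, cv=4)
grid.fit(X_train, y_train)

print(grid.best_params_)           # best hyper-parameters found
print(grid.score(X_test, y_test))  # final check on the untouched test set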

Conclusion

We’ve covered some of the key concepts in the field of Machine Learning,
starting with the definition of machine learning and then covering different types
of machine learning techniques. We discussed the theory behind the most
common regression techniques (Linear and Logistic), alongside other key
concepts of machine learning.

Clustering in Machine Learning

Clustering or cluster analysis is a machine learning technique, which groups the


unlabelled dataset. It can be defined as "A way of grouping the data points into
different clusters, consisting of similar data points. The objects with the
possible similarities remain in a group that has less or no similarities with
another group."

It does this by finding similar patterns in the unlabelled dataset, such as
shape, size, color, and behavior, and dividing the data as per the presence and
absence of those patterns.

It is an unsupervised learning method, hence no supervision is provided to the


algorithm, and it deals with the unlabeled dataset.
After applying this clustering technique, each cluster or group is provided with
a cluster-ID. The ML system can use this ID to simplify the processing of large
and complex datasets.

The clustering technique is commonly used for statistical data analysis.

Note: Clustering is somewhere similar to the classification algorithm, but the


difference is the type of dataset that we are using. In classification, we work
with the labeled data set, whereas in clustering, we work with the unlabelled
dataset.

Example: Let's understand the clustering technique with a real-world example of
a shopping mall: when we visit any shopping mall, we can observe that things
with similar usage are grouped together, such as t-shirts in one section and
trousers in another; similarly, in the vegetable section, apples, bananas,
mangoes, etc., are grouped in separate sections so that we can easily find
things. The clustering technique works in the same way. Other examples of
clustering include grouping documents according to topic.

The clustering technique can be widely used in various tasks. Some most
common uses of this technique are:

o Market Segmentation
o Statistical data analysis
o Social network analysis
o Image segmentation
o Anomaly detection, etc.

Apart from these general usages, it is used by Amazon in its
recommendation system to provide recommendations based on a user's past
product searches. Netflix also uses this technique to recommend movies and web
series to its users based on their watch history.

The below diagram explains the working of the clustering algorithm. We can
see the different fruits are divided into several groups with similar properties.
Types of Clustering Methods

The clustering methods are broadly divided into Hard clustering (each data point
belongs to only one group) and Soft clustering (data points can belong to
more than one group). Other approaches to clustering also exist. Below are the
main clustering methods used in Machine Learning:

1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering

Partitioning Clustering

It is a type of clustering that divides the data into non-hierarchical groups. It is


also known as the centroid-based method. The most common example of
partitioning clustering is the K-Means Clustering algorithm.

In this type, the dataset is divided into a set of k groups, where k defines the
number of pre-defined groups. The cluster centers are created in such a
way that the distance between the data points within one cluster is minimal
compared to the distance to other cluster centroids.
Density-Based Clustering

The density-based clustering method connects the highly-dense areas into


clusters, and the arbitrarily shaped distributions are formed as long as the dense
region can be connected. This algorithm does it by identifying different clusters
in the dataset and connects the areas of high densities into clusters. The dense
areas in data space are divided from each other by sparser areas.

These algorithms can face difficulty in clustering the data points if the dataset
has varying densities and high dimensions.
Distribution Model-Based Clustering

In the distribution model-based clustering method, the data is divided based on
the probability that a data point belongs to a particular distribution. The
grouping is done by assuming some distribution, most commonly the Gaussian
distribution.

The example of this type is the Expectation-Maximization Clustering


algorithm that uses Gaussian Mixture Models (GMM).

Hierarchical Clustering

Hierarchical clustering can be used as an alternative to partitional
clustering, as there is no requirement to pre-specify the number of clusters to
be created. In this technique, the dataset is divided into clusters to create a tree-
like structure, which is also called a dendrogram. The observations or any
number of clusters can be selected by cutting the tree at the correct level. The
most common example of this method is the Agglomerative Hierarchical
algorithm.
Fuzzy Clustering

Fuzzy clustering is a type of soft method in which a data object may belong to
more than one group or cluster. Each data point has a set of membership
coefficients, which depend on its degree of membership in each
cluster. The Fuzzy C-means algorithm is the example of this type of clustering; it
is sometimes also known as the Fuzzy k-means algorithm.

Clustering Algorithms

The clustering algorithms can be divided based on the models explained
above. There are different types of clustering algorithms published,
but only a few are commonly used. The choice of clustering algorithm depends
on the kind of data that we are using: some algorithms need to guess the
number of clusters in the given dataset, whereas others need to find the
minimum distance between the observations of the dataset.

Here we are discussing mainly popular Clustering algorithms that are widely
used in machine learning:

1. K-Means algorithm: The k-means algorithm is one of the most popular


clustering algorithms. It classifies the dataset by dividing the samples into
different clusters of equal variances. The number of clusters must be
specified in this algorithm. It is fast with fewer computations required,
with the linear complexity of O(n).
2. Mean-shift algorithm: Mean-shift algorithm tries to find the dense areas
in the smooth density of data points. It is an example of a centroid-based
model, that works on updating the candidates for centroid to be the center
of the points within a given region.
3. DBSCAN Algorithm: It stands for Density-Based Spatial Clustering
of Applications with Noise. It is an example of a density-based model
similar to the mean-shift, but with some remarkable advantages. In this
algorithm, the areas of high density are separated by the areas of low
density. Because of this, the clusters can be found in any arbitrary shape.
4. Expectation-Maximization Clustering using GMM: This algorithm
can be used as an alternative to the k-means algorithm, or for those cases
where K-means fails. In GMM, it is assumed that the data points
are Gaussian distributed.
5. Agglomerative Hierarchical algorithm: The Agglomerative
hierarchical algorithm performs the bottom-up hierarchical clustering. In
this, each data point is treated as a single cluster at the outset and then
successively merged. The cluster hierarchy can be represented as a tree-
structure.
6. Affinity Propagation: It is different from other clustering algorithms as
it does not require the number of clusters to be specified. In this, each
pair of data points exchanges messages until convergence.
It has O(N²T) time complexity, which is the main drawback of this
algorithm.
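As a quick illustration of the first algorithm in the list above, here is a minimal K-Means sketch with scikit-learn on toy data:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)    # a cluster-ID for every data point

print(labels[:10])                # e.g. [2 0 0 1 ...]
print(kmeans.cluster_centers_)    # coordinates of the 3 centroids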

Applications of Clustering

Below are some commonly known applications of clustering technique in


Machine Learning:

o In Identification of Cancer Cells: The clustering algorithms are widely


used for the identification of cancerous cells. It divides the cancerous and
non-cancerous data sets into different groups.
o In Search Engines: Search engines also work on the clustering
technique. The search result appears based on the closest object to the
search query. It does it by grouping similar data objects in one group that
is far from the other dissimilar objects. The accurate result of a query
depends on the quality of the clustering algorithm used.
o Customer Segmentation: It is used in market research to segment the
customers based on their choice and preferences.
o In Biology: It is used in the biology stream to classify different species of
plants and animals using the image recognition technique.
o In Land Use: The clustering technique is used in identifying areas of
similar land use in a GIS database. This can be very useful for
determining the purpose for which a particular piece of land is most
suitable.

Inductive vs. Deductive Research Approach (with Examples)


Published on April 18, 2019 by Raimo Streefkerk. Revised on May 6, 2022.

The main difference between inductive and deductive reasoning is that


inductive reasoning aims at developing a theory while deductive reasoning
aims at testing an existing theory.

Inductive reasoning moves from specific observations to broad generalizations,
while deductive reasoning works the other way around.

Both approaches are used in various types of research, and it’s not uncommon
to combine them in one large study.


Inductive research approach


When there is little to no existing literature on a topic, it is common to
perform inductive research because there is no theory to test. The inductive
approach consists of three stages:

1. Observation
o A low-cost airline flight is delayed
o Dogs A and B have fleas
o Elephants depend on water to exist
2. Observe a pattern
o Another 20 flights from low-cost airlines are delayed
o All observed dogs have fleas
o All observed animals depend on water to exist
3. Develop a theory or general (preliminary) conclusion
o Low-cost airlines always have delays
o All dogs have fleas
o All biological life depends on water to exist

Limitations of an inductive approach


A conclusion drawn on the basis of an inductive method can never be proven,
but it can be invalidated.

Example
You observe 1000 flights from low-cost airlines. All of them experience a
delay, which is in line with your theory. However, you can never prove that
flight 1001 will also be delayed. Still, the larger your dataset, the more reliable
the conclusion.

Deductive research approach


When conducting deductive research, you always start with a theory (the result
of inductive research). Reasoning deductively means testing these theories. If
there is no theory yet, you cannot conduct deductive research.

The deductive research approach consists of five stages:

1. Start with an existing theory (and create a problem statement)


o Low cost airlines always have delays
o All dogs have fleas
o All biological life depends on water to exist
2. Formulate a falsifiable hypothesis based on existing theory
o If passengers fly with a low cost airline, then they will always
experience delays
o All pet dogs in my apartment building have fleas
o All land mammals depend on water to exist
3. Collect data to test the hypothesis
o Collect flight data of low-cost airlines
o Test all dogs in the building for fleas
o Study all land mammal species to see if they depend on water
4. Analyze and test the data
o 5 out of 100 flights of low-cost airlines are not delayed
o 10 out of 20 dogs didn’t have fleas
o All land mammal species depend on water
5. Decide whether you can reject the null hypothesis
o 5 out of 100 flights of low-cost airlines are not delayed = reject
hypothesis
o 10 out of 20 dogs didn’t have fleas = reject hypothesis
o All land mammal species depend on water = support hypothesis

Limitations of a deductive approach


The conclusions of deductive reasoning can only be true if all the premises set
in the inductive study are true and the terms are clear.

Example

 All dogs have fleas (premise)


 Benno is a dog (premise)
 Benno has fleas (conclusion)

Based on the premises we have, the conclusion must be true. However, if the
first premise turns out to be false, the conclusion that Benno has fleas cannot be
relied upon.

Combining inductive and deductive research


Many scientists conducting a larger research project begin with an inductive
study (developing a theory). The inductive study is followed up with deductive
research to confirm or invalidate the conclusion.

In the examples above, the conclusion (theory) of the inductive study is also
used as a starting point for the deductive study.

Support Vector Machine Algorithm

Support Vector Machine or SVM is one of the most popular Supervised


Learning algorithms, which is used for Classification as well as Regression
problems. However, primarily, it is used for Classification problems in Machine
Learning.

The goal of the SVM algorithm is to create the best line or decision boundary
that can segregate n-dimensional space into classes so that we can easily put the
new data point in the correct category in the future. This best decision boundary
is called a hyperplane.

SVM chooses the extreme points/vectors that help in creating the hyperplane.
These extreme cases are called support vectors, and hence the algorithm is
termed Support Vector Machine. Consider the below diagram in which there
are two different categories that are classified using a decision boundary or
hyperplane:

Example: SVM can be understood with the example that we have used in the
KNN classifier. Suppose we see a strange cat that also has some features of
dogs; if we want a model that can accurately identify whether it is a cat or a
dog, such a model can be created by using the SVM algorithm. We will first
train our model with lots of images of cats and dogs so that it can learn about the
different features of cats and dogs, and then we test it with this strange creature.
The SVM creates a decision boundary between these two classes (cat and dog)
and chooses the extreme cases (support vectors), so it will see the extreme cases
of cat and dog. On the basis of the support vectors, it will classify it as a
cat. Consider the below diagram:
SVM algorithm can be used for Face detection, image classification, text
categorization, etc.

Types of SVM

SVM can be of two types:

o Linear SVM: Linear SVM is used for linearly separable data, which
means if a dataset can be classified into two classes by using a single
straight line, then such data is termed as linearly separable data, and
classifier is used called as Linear SVM classifier.
o Non-linear SVM: Non-Linear SVM is used for non-linearly separated
data, which means if a dataset cannot be classified by using a straight
line, then such data is termed as non-linear data and classifier used is
called as Non-linear SVM classifier.

Hyperplane and Support Vectors in the SVM algorithm:

Hyperplane: There can be multiple lines/decision boundaries to segregate the


classes in n-dimensional space, but we need to find out the best decision
boundary that helps to classify the data points. This best boundary is known as
the hyperplane of SVM.

The dimensions of the hyperplane depend on the number of features present in the
dataset: if there are 2 features (as shown in the image), then the hyperplane will
be a straight line, and if there are 3 features, then the hyperplane will be a
two-dimensional plane.
We always create a hyperplane that has a maximum margin, i.e., the maximum
distance between the hyperplane and the nearest data points.

Support Vectors:

The data points or vectors that are closest to the hyperplane and which affect
the position of the hyperplane are termed support vectors. Since these vectors
support the hyperplane, they are called support vectors.

How does SVM work?

Linear SVM:

The working of the SVM algorithm can be understood by using an example.


Suppose we have a dataset that has two tags (green and blue), and the dataset
has two features x1 and x2. We want a classifier that can classify the pair (x1,
x2) of coordinates as either green or blue. Consider the below image:

Since this is a 2-D space, just by using a straight line we can easily separate these
two classes. But there can be multiple lines that can separate these classes.
Consider the below image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this
best boundary or region is called a hyperplane. The SVM algorithm finds the
closest points of the lines from both classes. These points are called support
vectors. The distance between the vectors and the hyperplane is called
the margin, and the goal of SVM is to maximize this margin.
The hyperplane with maximum margin is called the optimal hyperplane.
Non-Linear SVM:

If data is linearly arranged, then we can separate it by using a straight line, but
for non-linear data, we cannot draw a single straight line. Consider the below
image:
So to separate these data points, we need to add one more dimension. For linear
data, we have used two dimensions x and y, so for non-linear data, we will add
a third dimension z. It can be calculated as:

z = x² + y²

By adding the third dimension, the sample space will become as below image:

So now, SVM will divide the datasets into classes in the following way.
Consider the below image:
Since we are in 3-D space, the boundary looks like a plane parallel to the x-axis.
If we convert it back to 2-D space with z = 1, then it will become:
Hence we get a circumference of radius 1 in the case of non-linear data.


Python Implementation of Support Vector Machine

Now we will implement the SVM algorithm using Python. Here we will use the
same dataset user_data, which we have used in Logistic regression and KNN
classification.

o Data Pre-processing step

Till the Data pre-processing step, the code will remain the same. Below is the
code:

#Data Pre-processing Step
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

#importing datasets
data_set= pd.read_csv('user_data.csv')

#Extracting Independent and dependent Variable
x= data_set.iloc[:, [2,3]].values
y= data_set.iloc[:, 4].values

# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size=0.25, random_state=0)

#feature Scaling
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)

After executing the above code, we will pre-process the data. The code will give
the dataset as:
The scaled output for the test set will be:
Fitting the SVM classifier to the training set:

Now the training set will be fitted to the SVM classifier. To create the SVM
classifier, we will import the SVC class from the sklearn.svm library. Below is the
code for it:

from sklearn.svm import SVC  # "Support vector classifier"
classifier = SVC(kernel='linear', random_state=0)
classifier.fit(x_train, y_train)

In the above code, we have used kernel='linear', as here we are creating an SVM
for linearly separable data. However, we can change it for non-linear data. Then
we fitted the classifier to the training dataset (x_train, y_train).

Output:

Out[8]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
kernel='linear', max_iter=-1, probability=False, random_state=0,
shrinking=True, tol=0.001, verbose=False)

The model performance can be altered by changing the value


of C(Regularization factor), gamma, and kernel.

o Predicting the test set result:


Now, we will predict the output for test set. For this, we will create a new
vector y_pred. Below is the code for it:

#Predicting the test set result
y_pred= classifier.predict(x_test)

After getting the y_pred vector, we can compare the result


of y_pred and y_test to check the difference between the actual value and
predicted value.

Output: Below is the output for the prediction of the test set:
o Creating the confusion matrix:
Now we will see how many incorrect predictions the SVM classifier
makes compared to the Logistic regression classifier. To create the
confusion matrix, we need to import
the confusion_matrix function of the sklearn library. After importing the
function, we will call it using a new variable cm. The function takes two
parameters, mainly y_true (the actual values) and y_pred (the predicted
values returned by the classifier). Below is the code for it:

#Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm= confusion_matrix(y_test, y_pred)

Output:
As we can see in the above output image, there are 66 + 24 = 90 correct
predictions and 8 + 2 = 10 incorrect predictions. Therefore we can say that our
SVM model improved as compared to the Logistic regression model.

o Visualizing the training set result:


Now we will visualize the training set result, below is the code for it:

from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
                     nm.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)
mtp.title('SVM classifier (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:

By executing the above code, we will get the output as:

As we can see, the above output appears similar to the Logistic regression
output. In the output, we got a straight line as the hyperplane because we
have used a linear kernel in the classifier. And we have also discussed above
that for 2-D space, the hyperplane in SVM is a straight line.

o Visualizing the test set result:

#Visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
                     nm.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)
mtp.title('SVM classifier (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:

By executing the above code, we will get the output as:

As we can see in the above output image, the SVM classifier has divided the
users into two regions (Purchased or Not purchased). Users who purchased the
SUV are in the red region with the red scatter points, and users who did not
purchase the SUV are in the green region with green scatter points. The
hyperplane has divided the two classes according to the Purchased variable.

ML | Case Based Reasoning (CBR) Classifier


As we know, Nearest Neighbour classifiers store training tuples as points in
Euclidean space. Case-Based Reasoning (CBR) classifiers, by contrast, use a
database of problem solutions to solve new problems. They store the tuples or
cases for problem-solving as complex symbolic descriptions.
How does CBR work?
When a new case arises to classify, a Case-based Reasoner (CBR) will first
check if an identical training case exists. If one is found, then the accompanying
solution to that case is returned. If no identical case is found, then the CBR will
search for training cases having components that are similar to those of the new
case. Conceptually, these training cases may be considered as neighbours of the
new case. If cases are represented as graphs, this involves searching for
subgraphs that are similar to subgraphs within the new case. The CBR tries to
combine the solutions of the neighbouring training cases to propose a solution
for the new case. If incompatibilities arise with the individual solutions, then
backtracking to search for other solutions may be necessary. The CBR may
employ background knowledge and problem-solving strategies to propose a
feasible solution.
Applications of CBR include:
1. Problem resolution for customer service help desks, where cases describe
product-related diagnostic problems.
2. It is also applied to areas such as engineering and law, where cases are either
technical designs or legal rulings, respectively.
3. Medical education, where patient case histories and treatments are used to
help diagnose and treat new patients.
Challenges with CBR
 Finding a good similarity metric (e.g., for matching subgraphs) and suitable
methods for combining solutions.
 Selecting salient features for indexing training cases and the development of
efficient indexing techniques.
CBR becomes more intelligent as the number of stored cases grows, but a
trade-off between accuracy and efficiency evolves as that number becomes very
large. After a certain point, the system’s efficiency will suffer, as the time
required to search for and process the relevant cases increases.

Neural networks are parallel computing devices, which is basically an attempt


to make a computer model of the brain. The main objective is to develop a
system to perform various computational tasks faster than the traditional
systems. These tasks include pattern recognition and classification,
approximation, optimization, and data clustering.

What is Artificial Neural Network?


An Artificial Neural Network (ANN) is an efficient computing system whose
central theme is borrowed from the analogy of biological neural networks.
ANNs are also named “artificial neural systems,” “parallel distributed
processing systems,” or “connectionist systems.” An ANN consists of a large
collection of units that are interconnected in some pattern to allow
communication between the units. These units, also referred to as nodes or
neurons, are simple processors which operate in parallel.
Every neuron is connected to other neurons through connection links. Each
connection link is associated with a weight that carries information about the input
signal. This is the most useful information for neurons to solve a particular
problem, because the weight usually excites or inhibits the signal being
communicated. Each neuron has an internal state, which is called an activation
signal. Output signals, which are produced after combining the input signals and
the activation rule, may be sent to other units.

A Brief History of ANN

The history of ANN can be divided into the following three eras −
ANN during 1940s to 1960s
Some key developments of this era are as follows −
 1943 − It has been assumed that the concept of neural network started
with the work of physiologist, Warren McCulloch, and mathematician,
Walter Pitts, when in 1943 they modeled a simple neural network using
electrical circuits in order to describe how neurons in the brain might
work.
 1949 − Donald Hebb’s book, The Organization of Behavior, put forth the
fact that repeated activation of one neuron by another increases its
strength each time they are used.
 1956 − An associative memory network was introduced by Taylor.
 1958 − A learning method for McCulloch and Pitts neuron model named
Perceptron was invented by Rosenblatt.
 1960 − Bernard Widrow and Marcian Hoff developed models called
"ADALINE" and “MADALINE.”
ANN during 1960s to 1980s
Some key developments of this era are as follows −
 1961 − Rosenblatt made an unsuccessful attempt but proposed the
“backpropagation” scheme for multilayer networks.
 1964 − Taylor constructed a winner-take-all circuit with inhibitions
among output units.
 1969 − Multilayer perceptron (MLP) was invented by Minsky and Papert.
 1971 − Kohonen developed Associative memories.
 1976 − Stephen Grossberg and Gail Carpenter developed Adaptive
resonance theory.
ANN from 1980s till Present
Some key developments of this era are as follows −
 1982 − The major development was Hopfield’s Energy approach.
 1985 − Boltzmann machine was developed by Ackley, Hinton, and
Sejnowski.
 1986 − Rumelhart, Hinton, and Williams introduced Generalised Delta
Rule.
 1988 − Kosko developed Bidirectional Associative Memory (BAM) and also gave
the concept of Fuzzy Logic in ANN.
The historical review shows that significant progress has been made in this
field. Neural network based chips are emerging and applications to complex
problems are being developed. Surely, today is a period of transition for neural
network technology.
Biological Neuron

A nerve cell (neuron) is a special biological cell that processes information.
According to one estimate, the human brain contains a huge number of neurons,
approximately 10^11, with numerous interconnections, approximately 10^15.
[Schematic diagram of a biological neuron]

Working of a Biological Neuron

As shown in the above diagram, a typical neuron consists of the following four
parts with the help of which we can explain its working −
 Dendrites − They are tree-like branches, responsible for receiving
information from the other neurons the neuron is connected to. In a sense, we
can say they are like the ears of the neuron.
 Soma − It is the cell body of the neuron and is responsible for processing
the information received from the dendrites.
 Axon − It is just like a cable through which the neuron sends information.
 Synapses − They are the connections between the axon of one neuron and the
dendrites of other neurons.
ANN versus BNN
Before taking a look at the differences between Artificial Neural Network
(ANN) and Biological Neural Network (BNN), let us take a look at the
similarities in terminology between the two.

Biological Neural Network (BNN)    Artificial Neural Network (ANN)
Soma                               Node
Dendrites                          Input
Synapse                            Weights or Interconnections
Axon                               Output

The following table shows the comparison between ANN and BNN based on
some criteria mentioned.

Criteria          BNN                                  ANN

Processing        Massively parallel, slow but         Massively parallel, fast but
                  superior to ANN                      inferior to BNN

Size              10^11 neurons and 10^15              10^2 to 10^4 nodes (mainly depends
                  interconnections                     on the type of application and
                                                       the network designer)

Learning          They can tolerate ambiguity          Very precise, structured and
                                                       formatted data is required

Fault             Performance degrades with even       Capable of robust performance,
tolerance         partial damage                       hence has the potential to be
                                                       fault tolerant

Storage           Stores the information in the        Stores the information in
capacity          synapse                              continuous memory locations

Model of Artificial Neural Network

The following diagram represents the general model of ANN followed by its
processing.

[Diagram: general model of an artificial neural network]

For the above general model of artificial neural network, the net input can be
calculated as follows −

y_in = x1·w1 + x2·w2 + x3·w3 + … + xm·wm

i.e., net input y_in = Σ (i = 1 to m) xi·wi

The output can be calculated by applying the activation function over the net
input −

Y = F(y_in)

i.e., Output = function (calculated net input)
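As a concrete illustration of these formulas, here is a minimal sketch
(assuming NumPy; the input and weight values are made up) of a single neuron
computing its net input and applying an activation function:

import numpy as np

def neuron_output(x, w, activation):
    # y_in = x1*w1 + x2*w2 + ... + xm*wm, then Y = F(y_in)
    y_in = np.dot(x, w)
    return activation(y_in)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
x = np.array([0.5, 0.2, 0.8])   # inputs (made-up values)
w = np.array([0.4, 0.7, 0.1])   # weights (made-up values)
print(neuron_output(x, w, sigmoid))  # output after the activation function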
Processing of ANN depends upon the following three building blocks −
 Network Topology
 Adjustments of Weights or Learning
 Activation Functions
In this chapter, we will discuss these three building blocks of ANN in detail.
Network Topology
A network topology is the arrangement of a network along with its nodes and
connecting lines. According to the topology, ANN can be classified as the
following kinds −
Feedforward Network
It is a non-recurrent network having processing units/nodes arranged in
layers, where all the nodes in a layer are connected with the nodes of the
previous layer. The connections carry different weights. There is no feedback
loop, which means the signal can only flow in one direction, from input to
output. It may be divided into the following two types −
 Single layer feedforward network − The concept is of a feedforward ANN
having only one weighted layer. In other words, we can say the input layer is
fully connected to the output layer.
 Multilayer feedforward network − The concept is of a feedforward ANN having
more than one weighted layer. As this network has one or more layers between
the input and the output layer, these intermediate layers are called hidden
layers. A minimal forward-pass sketch is given below.
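Here is a minimal sketch of a forward pass through a multilayer feedforward
network with one hidden layer (NumPy assumed; the layer sizes and random
weights are made up for illustration):

import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def forward(x, w_hidden, w_output):
    # One forward pass: input -> hidden layer -> output layer (no feedback).
    hidden = sigmoid(w_hidden @ x)      # first weighted layer
    return sigmoid(w_output @ hidden)   # second weighted layer

rng = np.random.default_rng(0)
x = rng.random(4)                # 4 inputs (hypothetical sizes)
w_hidden = rng.random((3, 4))    # 4 inputs -> 3 hidden nodes
w_output = rng.random((2, 3))    # 3 hidden -> 2 output nodes
print(forward(x, w_hidden, w_output))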
Feedback Network
As the name suggests, a feedback network has feedback paths, which means the
signal can flow in both directions using loops. This makes it a non-linear
dynamic system, which changes continuously until it reaches a state of
equilibrium. It may be divided into the following types −
 Recurrent networks − They are feedback networks with closed loops.
Following are the two types of recurrent networks.
 Fully recurrent network − It is the simplest neural network architecture
because all nodes are connected to all other nodes and each node works as
both input and output.
 Jordan network − It is a closed-loop network in which the output is fed
back to the input again as feedback; a single-step sketch is given below.
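Below is a minimal single-step sketch of the Jordan-style feedback idea, where
the previous output re-enters the network alongside the new input (NumPy
assumed; the weight shapes and random data are made up):

import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def jordan_step(x, prev_output, w_in, w_back):
    # The previous output is fed back as an extra input (the feedback loop).
    return sigmoid(w_in @ x + w_back @ prev_output)

rng = np.random.default_rng(0)
w_in, w_back = rng.random((2, 3)), rng.random((2, 2))
output = np.zeros(2)
for x in rng.random((5, 3)):        # process a short input sequence
    output = jordan_step(x, output, w_in, w_back)
print(output)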
Adjustments of Weights or Learning
Learning, in an artificial neural network, is the method of modifying the
weights of connections between the neurons of a specified network. Learning in
ANN can be classified into three categories, namely supervised learning,
unsupervised learning, and reinforcement learning.
Supervised Learning
As the name suggests, this type of learning is done under the supervision of a
teacher; the learning process is dependent on that teacher’s feedback.
During the training of ANN under supervised learning, the input vector is
presented to the network, which produces an output vector. This output vector
is compared with the desired output vector. An error signal is generated if
there is a difference between the actual and the desired output. On the basis
of this error signal, the weights are adjusted until the actual output matches
the desired output, as in the sketch below.
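This update loop can be sketched with a least-mean-squares style rule for a
single linear unit, echoing the LMS rule used earlier in this document (the
input and desired values here are made up):

import numpy as np

def supervised_update(w, x, desired, lr=0.1):
    actual = np.dot(w, x)        # actual output of a linear unit
    error = desired - actual     # error signal (desired vs actual)
    return w + lr * error * x    # adjust weights to reduce the error

x, desired = np.array([1.0, 2.0, 0.5]), 2.0
w = np.zeros(3)
for _ in range(100):             # repeat until the output approaches the target
    w = supervised_update(w, x, desired)
print(np.dot(w, x))              # ≈ 2.0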
Unsupervised Learning
As the name suggests, this type of learning is done without the supervision of
a teacher; the learning process is independent of any labeled feedback.
During the training of ANN under unsupervised learning, input vectors of a
similar type are combined to form clusters. When a new input pattern is
applied, the neural network gives an output response indicating the cluster to
which the input pattern belongs.
There is no feedback from the environment as to what the desired output should
be or whether it is correct. Hence, in this type of learning, the network
itself must discover the patterns and features in the input data, and the
relation between the input data and the output, as in the clustering sketch
below.
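A minimal sketch of this idea is simple winner-take-all competitive
clustering, one of many possible unsupervised schemes (the data, prototype
count, and learning rate are made-up assumptions):

import numpy as np

def competitive_update(prototypes, x, lr=0.1):
    # Winner-take-all: find the closest prototype and move it toward the input.
    winner = np.argmin(np.linalg.norm(prototypes - x, axis=1))
    prototypes[winner] += lr * (x - prototypes[winner])
    return winner

rng = np.random.default_rng(1)
prototypes = rng.random((2, 2))      # two cluster prototypes in 2-D
for x in rng.random((100, 2)):       # unlabeled input vectors
    competitive_update(prototypes, x)
print(competitive_update(prototypes, np.array([0.2, 0.9])))  # cluster index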
Reinforcement Learning
As the name suggests, this type of learning is used to reinforce or strengthen
the network based on some critic information. This learning process is similar
to supervised learning; however, we have much less information available.
During the training of the network under reinforcement learning, the network
receives some feedback from the environment. This makes it somewhat similar to
supervised learning. However, the feedback obtained here is evaluative, not
instructive, which means there is no teacher as in supervised learning. After
receiving the feedback, the network adjusts its weights so as to obtain better
critic feedback in the future, as in the toy sketch below.
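As a deliberately simplified toy sketch of evaluative feedback (not any
standard algorithm; the environment’s hidden reward rule and all values are
made up), the network only ever sees a scalar “good/bad” signal and nudges its
weights accordingly:

import numpy as np

def reinforcement_update(w, x, reward, lr=0.05):
    # Strengthen or weaken the weights in proportion to the scalar reward.
    return w + lr * reward * x

rng = np.random.default_rng(2)
w = np.zeros(3)
for _ in range(200):
    x = rng.random(3)
    action = np.dot(w, x) > 0.5                        # network's current decision
    reward = 1.0 if action == (x[0] > 0.5) else -1.0   # critic says only good/bad
    w = reinforcement_update(w, x if action else -x, reward)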
Activation Functions

An activation function may be defined as the extra force or effort applied
over the net input to obtain an exact output. In ANN, activation functions are
applied over the net input to compute the output. Following are some
activation functions of interest −
Linear Activation Function
It is also called the identity function, as it performs no transformation of
the input. It can be defined as −
F(x) = x
Sigmoid Activation Function
It is of two types, as follows −
 Binary sigmoidal function − This activation function maps the input to
values between 0 and 1. It is positive in nature. It is always bounded, which
means its output cannot be less than 0 or more than 1. It is also strictly
increasing in nature, which means the higher the input, the higher the output.
It can be defined as
F(x) = sigm(x) = 1 / (1 + exp(−x))
 Bipolar sigmoidal function − This activation function maps the input to
values between -1 and 1. It can be positive or negative in nature. It is
always bounded, which means its output cannot be less than -1 or more than 1.
It is also strictly increasing in nature, like the binary sigmoid function. It
can be defined as
F(x) = sigm(x) = 2 / (1 + exp(−x)) − 1 = (1 − exp(−x)) / (1 + exp(−x))
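As a quick check of the two definitions, here is a minimal sketch (NumPy
assumed) implementing both sigmoid variants and evaluating them at a few
points:

import numpy as np

def binary_sigmoid(x):
    # Bounded in (0, 1), strictly increasing.
    return 1.0 / (1.0 + np.exp(-x))

def bipolar_sigmoid(x):
    # Bounded in (-1, 1); equals 2*binary_sigmoid(x) - 1.
    return (1.0 - np.exp(-x)) / (1.0 + np.exp(-x))

x = np.array([-5.0, 0.0, 5.0])
print(binary_sigmoid(x))   # ~[0.007, 0.5, 0.993]
print(bipolar_sigmoid(x))  # ~[-0.987, 0.0, 0.987]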