
UNIT III

CLASSIFICATION AND PREDICTION TECHNIQUES

Classification by Decision Tree – Bayesian Classification – Rule Based Classification – Bayesian Belief
Networks – Classification by Back propagation – Support Vector Machines – K-Nearest Neighbor
Algorithm – Linear Regression, Nonlinear Regression

There are two forms of data analysis that can be used for extracting models describing
important classes or to predict future data trends. These two forms are as follows –

Classification

Classification is the process of finding a model that describes and distinguishes data classes or concepts. The purpose is to use this model to predict the class of objects whose class label is unknown.

Below are the major differences between classification and prediction.

Classification models predict categorical class labels. These labels are discrete (discrete data contain values that fall under integers or whole numbers; the total number of students in a class is an example of discrete data, and such data cannot be broken into decimal or fraction values) and unordered (categorical variables represent types of data which may be divided into groups; examples of categorical variables are race, sex, age group, and educational level).

Prediction

Prediction models predict continuous-valued functions, i.e. they estimate a missing or unavailable numerical value rather than a class label.

What is Classification?
Classification is the task of identifying the category or class label of a new observation. First, a set of data is used as training data, and the set of input data with the corresponding outputs is given to the algorithm.

So, the training data set includes the input data and their associated class labels. Using the
training dataset, the algorithm derives a model or the classifier. The derived model can be a
decision tree, mathematical formula, or a neural network. In classification, when unlabeled
data is given to the model, it should find the class to which it belongs. The new data provided
to the model is the test data set.

Classification is the process of classifying a record. One simple example of classification is to check whether it is raining or not. The answer can be either yes or no, so there is a fixed number of choices.

Sometimes there can be more than two classes to classify; that is called multi-class classification. For example, a bank needs to analyze whether giving a loan to a particular customer is risky or not. Based on observable data for multiple loan borrowers, a classification model may be established that forecasts credit risk.
The data could track job records, home ownership or leasing, years of residency, number and type of deposits, historical credit ranking, etc.

The goal would be credit ranking, the predictors would be the other characteristics, and the
data would represent a case for each consumer. In this example, a model is constructed to find
the categorical label. The labels are risky or safe.
CLASSIFICATION BY DECISION TREE:

Introduction

Decision Trees are a type of Supervised Machine Learning (that is, you explain what the input is and what the corresponding output is in the training data) where the data is continuously split according to a certain parameter. The tree can be explained by two entities, namely decision nodes and leaves.

This algorithm can be used for regression and classification problems, yet it is mostly used for classification problems. A decision tree follows a set of if-else conditions to visualize the data and classify it according to those conditions.

Important terminology

1. Root Node: The topmost node, whose attribute is used for dividing the data into two or
more sets. The feature attribute in this node is selected based on attribute selection techniques.
2. Branch or Sub-Tree: A part of the entire decision tree is called a
branch or sub-tree.
3. Splitting: Dividing a node into two or more sub-nodes based on if-
else conditions.
4. Decision Node: A sub-node that splits into further sub-nodes is
called a decision node.
5. Leaf or Terminal Node: This is the end of the decision tree, where a
node cannot be split into further sub-nodes.
6. Pruning: Removing a sub-node from the tree is called pruning.

The leaves are the decisions or the final outcomes, and the decision nodes
are where the data is split.

An example of a decision tree can be explained using such a binary tree. Let's say you want to predict whether
a person is fit given information like age, eating habits, physical activity, etc. The decision nodes here
are questions like 'What's the age?', 'Does he exercise?', 'Does he eat a lot of pizza?', and the leaves
are outcomes like 'fit' or 'unfit'. In this case it is a binary classification problem (a yes/no type
problem). There are two main types of Decision Trees

1. Classification trees(Yes/No types)

What we’ve seen above is an example of classification tree, where the outcome was a variable like ‘fit’ or
‘unfit’. Here the decision variable is Categorical.

2. Regression trees (Continuous data types)


Here the decision or the outcome variable is Continuous, e.g. a number like 123.

Working

Now that we know what a Decision Tree is, we'll see how it works internally. There are many algorithms out there which construct Decision Trees, but one of the best known is the ID3 algorithm. ID3 stands for Iterative Dichotomiser 3.

Before discussing the ID3 algorithm, we’ll go through few definitions.

Entropy:

Entropy, also called Shannon entropy and denoted by H(S) for a finite set S, is the measure
of the amount of uncertainty or randomness in data:

H(S) = - Σ p(x) log2 p(x), summed over the possible outcomes x.

Intuitively, it tells us about the predictability of a certain event. For example, consider a coin toss whose
probability of heads is 0.5 and probability of tails is 0.5.

Here the entropy is the highest possible (1 bit), since there's no way of determining what the outcome might
be.

Alternatively, consider a coin which has heads on both sides. The outcome of such a toss can be
predicted perfectly, since we know beforehand that it'll always be heads.

In other words, this event has no randomness, hence its entropy is zero. In general, lower values
imply less uncertainty while higher values imply higher uncertainty.
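To make the two coin examples concrete, here is a small numerical sketch (the helper function `entropy` and its name are illustrative, not from the notes):

```python
import math

def entropy(probs):
    # Shannon entropy H(S) = -sum(p * log2(p)); zero-probability terms are skipped.
    h = -sum(p * math.log2(p) for p in probs if p > 0)
    return abs(h)  # entropy is never negative; abs() avoids returning -0.0

print(entropy([0.5, 0.5]))  # fair coin: maximum uncertainty -> 1.0
print(entropy([1.0]))       # double-headed coin: no randomness -> 0.0
```

The same helper also reproduces the 0.94 value computed for the 9-yes/5-no dataset later in this unit.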

Information Gain:

Information gain, also discussed in terms of the Kullback-Leibler divergence and denoted by IG(S, A) for
a set S, is the effective change in entropy after deciding on a particular attribute A.

It measures the relative change in entropy with respect to the independent variables:

IG(S, A) = H(S) - H(S | A)

Alternatively,

IG(S, A) = H(S) - Σ P(x) H(x)

where IG(S, A) is the information gain from applying feature A, H(S) is the entropy of the
entire set, and the second term calculates the entropy after applying the feature A; P(x)
is the probability of value x of the attribute and H(x) is the entropy of the corresponding subset.

Let's understand this with the help of an example. Consider a piece of data collected over
the course of 14 days where the features are Outlook, Temperature, Humidity, and Wind, and
the outcome variable is whether golf was played on the day.

Now, our job is to build a predictive model which takes in the above 4 parameters and predicts
whether golf will be played on the day. We'll build a decision tree to do that using the
ID3 algorithm.
Day Outlook Temperature Humidity Wind Play Golf
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No

The ID3 algorithm will perform the following tasks recursively:

1. Create a root node for the tree.
2. If all examples are positive, return leaf node 'positive'.
3. Else if all examples are negative, return leaf node 'negative'.
4. Calculate the entropy of the current state, H(S).
5. For each attribute, calculate the entropy with respect to the attribute 'x',
denoted by H(S, x).
6. Select the attribute which has the maximum value of IG(S, x).
7. Remove the attribute that offers the highest IG from the set of attributes.
8. Repeat until we run out of attributes, or the decision tree has all
leaf nodes.
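The eight steps above translate almost directly into code. The sketch below is a simplified illustration (function and variable names are mine; ties and continuous attributes are ignored):

```python
import math
from collections import Counter

def entropy(labels):
    # H(S) = -sum(p * log2(p)) over the class proportions in this node.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    # IG(S, A) = H(S) - sum over values v of A of P(v) * H(S_v).
    n = len(labels)
    total = entropy(labels)
    for value in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        total -= (len(subset) / n) * entropy(subset)
    return total

def id3(rows, labels, attrs):
    # Steps 2-3: a pure node becomes a leaf.
    if len(set(labels)) == 1:
        return labels[0]
    # Step 8: no attributes left -> majority-class leaf.
    if not attrs:
        return Counter(labels).most_common(1)[0][0]
    # Steps 4-6: pick the attribute with maximum information gain.
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    remaining = [a for a in attrs if a != best]  # step 7
    tree = {best: {}}
    for value in set(row[best] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        tree[best][value] = id3([rows[i] for i in idx],
                                [labels[i] for i in idx], remaining)
    return tree
```

Each recursive call removes the chosen attribute and splits the rows by its values, exactly as steps 6-8 describe.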

Constructing a Decision Tree


Let us take an example of the 14-day weather dataset below with attributes outlook,
temperature, wind, and humidity. The outcome variable will be whether cricket is played or not. We
will use the ID3 algorithm to build the decision tree.

Day Outlook Temperature Humidity Wind Play cricket

1 Sunny Hot High Weak No

2 Sunny Hot High Strong No

3 Overcast Hot High Weak Yes

4 Rain Mild High Weak Yes


5 Rain Cool Normal Weak Yes

6 Rain Cool Normal Strong No

7 Overcast Cool Normal Strong Yes

8 Sunny Mild High Weak No

9 Sunny Cool Normal Weak Yes

10 Rain Mild Normal Weak Yes

11 Sunny Mild Normal Strong Yes

12 Overcast Mild High Strong Yes

13 Overcast Hot Normal Weak Yes

14 Rain Mild High Strong No

Step 1: The first step will be to create a root node.

Step 2: If all results are yes, the leaf node "yes" will be returned; else the leaf node
"no" will be returned.

Step 3: Find the entropy of all observations and the entropy with respect to attribute "x",
that is, E(S) and E(S, x).

Step 4: Find the information gain and select the attribute with the highest information gain.

Step 5: Repeat the above steps until all attributes are covered.

Calculation of Entropy:

Yes No

9 5

H(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.94

If entropy is zero, it means that all members belong to the same class, and if entropy is
one, it means that half of the tuples belong to one class and half of them belong to the
other class. 0.94 means a fairly even distribution.

Find the attribute which gives the maximum information gain.

For example "Wind": it takes two values, Strong and Weak, therefore x =
{Strong, Weak}.

Find H(x) and P(x) for x = weak and x = strong. H(S) has already been calculated above.

Weak = 8
Strong = 6

For "weak" wind, 6 of the days say "Yes" to play cricket and 2 of them say "No". So the
entropy will be:

H(S, weak) = -(6/8) log2(6/8) - (2/8) log2(2/8) = 0.811

For "strong" wind, 3 said "No" to play cricket and 3 said "Yes":

H(S, strong) = -(3/6) log2(3/6) - (3/6) log2(3/6) = 1.0

This shows perfect randomness, as half the items belong to one class and the remaining half
belong to the other.

Calculate the information gain:

IG(S, Wind) = H(S) - P(weak) × H(S, weak) - P(strong) × H(S, strong)
            = 0.94 - (8/14) × 0.811 - (6/14) × 1.0 = 0.048

Similarly, the information gain for the other attributes is:

IG(S, Outlook) = 0.246
IG(S, Temperature) = 0.029
IG(S, Humidity) = 0.151

The attribute outlook has the highest information gain of 0.246, thus it is
chosen as root.
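These gains can be reproduced with a short script over the same 14-day table. The snippet is an illustrative check (helper names are mine) and agrees with the hand calculation up to rounding:

```python
import math
from collections import Counter

# (Outlook, Temperature, Humidity, Wind, Play) for the 14 days in the table.
data = [
    ("Sunny","Hot","High","Weak","No"),      ("Sunny","Hot","High","Strong","No"),
    ("Overcast","Hot","High","Weak","Yes"),  ("Rain","Mild","High","Weak","Yes"),
    ("Rain","Cool","Normal","Weak","Yes"),   ("Rain","Cool","Normal","Strong","No"),
    ("Overcast","Cool","Normal","Strong","Yes"), ("Sunny","Mild","High","Weak","No"),
    ("Sunny","Cool","Normal","Weak","Yes"),  ("Rain","Mild","Normal","Weak","Yes"),
    ("Sunny","Mild","Normal","Strong","Yes"),("Overcast","Mild","High","Strong","Yes"),
    ("Overcast","Hot","Normal","Weak","Yes"),("Rain","Mild","High","Strong","No"),
]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(col):
    # IG(S, A) = H(S) - sum_v P(v) * H(S_v)
    labels = [row[-1] for row in data]
    ig = entropy(labels)
    for v in set(row[col] for row in data):
        subset = [row[-1] for row in data if row[col] == v]
        ig -= len(subset) / len(data) * entropy(subset)
    return ig

for i, name in enumerate(["Outlook", "Temperature", "Humidity", "Wind"]):
    print(name, round(gain(i), 3))
# Outlook 0.247, Temperature 0.029, Humidity 0.152, Wind 0.048
```

The unrounded values are 0.2467, 0.0292, 0.1518, and 0.0481, which is why hand-truncated figures such as 0.246 and 0.151 appear in the worked example.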

Outlook has 3 values: Sunny, Overcast and Rain. Overcast with play cricket is always
"Yes", so it ends up with a leaf node, "yes". The other values, "Sunny" and "Rain", are considered next.

Table for Outlook as “Sunny” will be:

Temperature Humidity Wind Golf

Hot High Weak No

Hot High Strong No

Mild High Weak No

Cool Normal Weak Yes

Mild Normal Strong Yes


Entropy for Outlook = "Sunny" is:

H(Sunny) = -(2/5) log2(2/5) - (3/5) log2(3/5) = 0.971

Among the attributes of the Sunny subset, the information gain for humidity is highest (humidity splits
the subset perfectly), therefore it is chosen as the next node. Similarly, entropy is calculated for
Rain; there, wind gives the highest information gain. The decision tree would look like below:

Advantages

Decision trees are easy to visualize.

Non-linear patterns in the data can be captured easily.

They can be used for predicting missing values and are suitable for feature engineering techniques.

Disadvantages

Over-fitting of the data is possible.

A small variation in the input data can result in a different decision tree. This can be reduced by
using feature engineering techniques.

We have to balance the dataset before training the model.


Classification and Regression Trees or CART for short is a term introduced by Leo
Breiman to refer to Decision Tree algorithms that can be used for classification or
regression predictive modeling problems.

Classically, this algorithm is referred to as "decision trees", but on some platforms like
R it is referred to by the more modern term CART.

The CART algorithm provides a foundation for important algorithms like bagged
decision trees, random forest and boosted decision trees.

CART Model Representation.

Refer: A Step by Step CART Decision Tree Example - Sefik Ilkin Serengil (sefiks.com)
https://www.codingninjas.com/studio/library/classification-and-regression-tree-
algorithm

The representation for the CART model is a binary tree.

This is your binary tree from algorithms and data structures, nothing too fancy. Each internal
node represents a single input variable (x) and a split point on that variable (assuming the
variable is numeric).

The leaf nodes of the tree contain an output variable (y) which is used to make a prediction.

Given a dataset with two inputs (x) of height in centimeters and weight in kilograms and the
output of sex as male or female, below is a crude example of a binary decision tree
(completely fictitious, for demonstration purposes only).

CART can handle both classification and regression tasks. For classification, this algorithm creates decision
points using a metric named the Gini index: Gini = 1 - Σ (p_i)^2, where p_i is the proportion of instances of
class i in the node.

A CART tree is a binary decision tree that is constructed by splitting a node into two child nodes repeatedly,
beginning with the root node that contains the whole learning sample.

Decision tree learning uses a decision tree (as a predictive model) to go from observations about an item
(represented in the branches) to conclusions about the item's target value (represented in the leaves).

It is one of the predictive modeling approaches used in statistics, data mining and machine learning.
Tree models where the target variable can take a discrete set of values are called classification trees; in
these tree structures, leaves represent class labels

Outlook

Outlook is a nominal feature. It can be sunny, overcast or rain. I will summarize the final decisions
for outlook feature.

Outlook Yes No Number of


instances

Sunny 2 3 5

Overcast 4 0 4

Rain 3 2 5

Gini(Outlook=Sunny) = 1 – (2/5)^2 – (3/5)^2 = 1 – 0.16 – 0.36 = 0.48
Gini(Outlook=Overcast) = 1 – (4/4)^2 – (0/4)^2 = 0
Gini(Outlook=Rain) = 1 – (3/5)^2 – (2/5)^2 = 1 – 0.36 – 0.16 = 0.48

Then, we will calculate the weighted sum of Gini indexes for the outlook feature:

Gini(Outlook) = (5/14) × 0.48 + (4/14) × 0 + (5/14) × 0.48 = 0.171 + 0 + 0.171 = 0.342
 Temperature

Similarly, temperature is a nominal feature and it could have 3 different values: Cool, Hot and Mild.
Let’s summarize decisions for temperature feature.

Temperature Yes No Number


of
instances

Hot 2 2 4

Cool 3 1 4

Mild 4 2 6

Gini(Temp=Hot) = 1 – (2/4)^2 – (2/4)^2 = 0.5
Gini(Temp=Cool) = 1 – (3/4)^2 – (1/4)^2 = 1 – 0.5625 – 0.0625 = 0.375
Gini(Temp=Mild) = 1 – (4/6)^2 – (2/6)^2 = 1 – 0.444 – 0.111 = 0.445

We'll calculate the weighted sum of the Gini index for the temperature feature:

Gini(Temp) = (4/14) × 0.5 + (4/14) × 0.375 + (6/14) × 0.445 = 0.142 + 0.107 + 0.190 = 0.439

 Humidity

Humidity is a binary class feature. It can be high or normal.

Humidity Yes No Number of


instances

High 3 4 7

Normal 6 1 7

Gini(Humidity=High) = 1 – (3/7)^2 – (4/7)^2 = 1 – 0.183 – 0.326 = 0.489
Gini(Humidity=Normal) = 1 – (6/7)^2 – (1/7)^2 = 1 – 0.734 – 0.02 = 0.244

The weighted sum for the humidity feature will be calculated next:

Gini(Humidity) = (7/14) × 0.489 + (7/14) × 0.244 = 0.367
 Wind

Wind is a binary class similar to humidity. It can be weak and strong.

Wind Yes No Number of instances

Weak 6 2 8

Strong 3 3 6

Gini(Wind=Weak) = 1 – (6/8)^2 – (2/8)^2 = 1 – 0.5625 – 0.0625 = 0.375
Gini(Wind=Strong) = 1 – (3/6)^2 – (3/6)^2 = 1 – 0.25 – 0.25 = 0.5
Gini(Wind) = (8/14) × 0.375 + (6/14) × 0.5 = 0.428
We’ve calculated gini index values for each feature. The winner will be outlook feature because its
cost is the lowest.

Feature Gini Index

Outlook 0.342

Temperature 0.439

Humidity 0.367

Wind 0.428
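As a check, the weighted Gini values in the table can be recomputed over the same 14 days. This sketch is illustrative (helper names are mine); the tiny differences from the hand-rounded figures above (e.g. 0.343 vs 0.342) are rounding artifacts:

```python
from collections import Counter

# (Outlook, Temperature, Humidity, Wind, Decision) for the 14 days.
data = [
    ("Sunny","Hot","High","Weak","No"),      ("Sunny","Hot","High","Strong","No"),
    ("Overcast","Hot","High","Weak","Yes"),  ("Rain","Mild","High","Weak","Yes"),
    ("Rain","Cool","Normal","Weak","Yes"),   ("Rain","Cool","Normal","Strong","No"),
    ("Overcast","Cool","Normal","Strong","Yes"), ("Sunny","Mild","High","Weak","No"),
    ("Sunny","Cool","Normal","Weak","Yes"),  ("Rain","Mild","Normal","Weak","Yes"),
    ("Sunny","Mild","Normal","Strong","Yes"),("Overcast","Mild","High","Strong","Yes"),
    ("Overcast","Hot","Normal","Weak","Yes"),("Rain","Mild","High","Strong","No"),
]

def gini(labels):
    # Gini impurity: 1 - sum(p_i^2) over the class proportions.
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def weighted_gini(col):
    # Weighted sum of the Gini impurities of the subsets induced by the attribute.
    total = 0.0
    for v in set(row[col] for row in data):
        subset = [row[-1] for row in data if row[col] == v]
        total += len(subset) / len(data) * gini(subset)
    return total

for i, name in enumerate(["Outlook", "Temperature", "Humidity", "Wind"]):
    print(name, round(weighted_gini(i), 3))
# Outlook 0.343, Temperature 0.44, Humidity 0.367, Wind 0.429
```

Outlook remains the winner with the lowest weighted Gini index, so the same root is chosen.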

We’ll put outlook decision at the top of the tree.


First decision would be outlook feature
You might realize that the sub-dataset in the overcast leaf has only yes decisions. This means that
the overcast branch is complete.

Tree is complete for the overcast outlook leaf


We will apply same principles to those sub datasets in the following steps.
Focus on the sub dataset for sunny outlook. We need to find the gini index scores for temperature,
humidity and wind features respectively.

Day Outlook Temp. Humidity Wind Decision

1 Sunny Hot High Weak No

2 Sunny Hot High Strong No

8 Sunny Mild High Weak No

9 Sunny Cool Normal Weak Yes

11 Sunny Mild Normal Strong Yes

 Gini of temperature for sunny outlook

Temperature Yes No Number of instances

Hot 0 2 2

Cool 1 0 1

Mild 1 1 2

Gini(Outlook=Sunny and Temp.=Hot) = 1 – (0/2)^2 – (2/2)^2 = 0
Gini(Outlook=Sunny and Temp.=Cool) = 1 – (1/1)^2 – (0/1)^2 = 0
Gini(Outlook=Sunny and Temp.=Mild) = 1 – (1/2)^2 – (1/2)^2 = 1 – 0.25 – 0.25 = 0.5
Gini(Outlook=Sunny and Temp.) = (2/5) × 0 + (1/5) × 0 + (2/5) × 0.5 = 0.2

 Gini of humidity for sunny outlook

Humidity Yes No Number of instances

High 0 3 3

Normal 2 0 2

Gini(Outlook=Sunny and Humidity=High) = 1 – (0/3)^2 – (3/3)^2 = 0
Gini(Outlook=Sunny and Humidity=Normal) = 1 – (2/2)^2 – (0/2)^2 = 0
Gini(Outlook=Sunny and Humidity) = (3/5) × 0 + (2/5) × 0 = 0

 Gini of wind for sunny outlook

Wind Yes No Number of instances

Weak 1 2 3

Strong 1 1 2

Gini(Outlook=Sunny and Wind=Weak) = 1 – (1/3)^2 – (2/3)^2 = 0.444
Gini(Outlook=Sunny and Wind=Strong) = 1 – (1/2)^2 – (1/2)^2 = 0.5
Gini(Outlook=Sunny and Wind) = (3/5) × 0.444 + (2/5) × 0.5 = 0.266 + 0.2 = 0.466

 Decision for sunny outlook

We've calculated the gini index scores for each feature when the outlook is sunny. The winner is humidity
because it has the lowest value.
Feature Gini index

Temperature 0.2

Humidity 0

Wind 0.466
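The same computation restricted to the five sunny days reproduces this table (0.467 vs the truncated 0.466 for wind is a rounding artifact); an illustrative sketch:

```python
from collections import Counter

# Sunny-outlook sub-dataset: (Temp, Humidity, Wind, Decision)
sunny = [
    ("Hot", "High", "Weak", "No"),
    ("Hot", "High", "Strong", "No"),
    ("Mild", "High", "Weak", "No"),
    ("Cool", "Normal", "Weak", "Yes"),
    ("Mild", "Normal", "Strong", "Yes"),
]

def gini(labels):
    # Gini impurity: 1 - sum(p_i^2)
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def weighted_gini(rows, col):
    # Weighted Gini of the subsets induced by attribute `col`.
    return sum(
        len([r for r in rows if r[col] == v]) / len(rows)
        * gini([r[-1] for r in rows if r[col] == v])
        for v in set(r[col] for r in rows)
    )

for i, name in enumerate(["Temperature", "Humidity", "Wind"]):
    print(name, round(weighted_gini(sunny, i), 3))
# Temperature 0.2, Humidity 0.0, Wind 0.467
```

Humidity scores exactly 0 because it splits the sunny days into two pure subsets, which is why the branch terminates immediately below it.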

We’ll put humidity check at the extension of sunny outlook.

Sub datasets for high and normal humidity


As seen, decision is always no for high humidity and sunny outlook. On the other hand, decision
will always be yes for normal humidity and sunny outlook. This branch is over.
Decisions for high and normal humidity
Now, we need to focus on rain outlook.
Rain outlook

Day Outlook Temp. Humidity Wind Decision

4 Rain Mild High Weak Yes

5 Rain Cool Normal Weak Yes

6 Rain Cool Normal Strong No

10 Rain Mild Normal Weak Yes

14 Rain Mild High Strong No

We’ll calculate gini index scores for temperature, humidity and wind features when outlook is rain.

 Gini of temperature for rain outlook

Temperature Yes No Number of instances

Cool 1 1 2

Mild 2 1 3

Gini(Outlook=Rain and Temp.=Cool) = 1 – (1/2)^2 – (1/2)^2 = 0.5
Gini(Outlook=Rain and Temp.=Mild) = 1 – (2/3)^2 – (1/3)^2 = 0.444
Gini(Outlook=Rain and Temp.) = (2/5) × 0.5 + (3/5) × 0.444 = 0.466

 Gini of humidity for rain outlook

Humidity Yes No Number of instances

High 1 1 2

Normal 2 1 3

Gini(Outlook=Rain and Humidity=High) = 1 – (1/2)^2 – (1/2)^2 = 0.5
Gini(Outlook=Rain and Humidity=Normal) = 1 – (2/3)^2 – (1/3)^2 = 0.444
Gini(Outlook=Rain and Humidity) = (2/5) × 0.5 + (3/5) × 0.444 = 0.466

 Gini of wind for rain outlook

Wind Yes No Number of instances

Weak 3 0 3

Strong 0 2 2

Gini(Outlook=Rain and Wind=Weak) = 1 – (3/3)^2 – (0/3)^2 = 0
Gini(Outlook=Rain and Wind=Strong) = 1 – (0/2)^2 – (2/2)^2 = 0
Gini(Outlook=Rain and Wind) = (3/5) × 0 + (2/5) × 0 = 0

 Decision for rain outlook

The winner is the wind feature for rain outlook because it has the minimum gini index score among the
features.

Feature Gini index

Temperature 0.466

Humidity 0.466

Wind 0

Put the wind feature for rain outlook branch and monitor the new sub data sets.
Sub data sets for weak and strong wind and rain outlook
As seen, decision is always yes when wind is weak. On the other hand, decision is always no if
wind is strong. This means that this branch is over.

Final form of the decision tree built by CART algorithm


Branches represent conjunctions of features that lead to those class labels. Decision trees where
the target variable can take continuous values (typically real numbers) are called regression trees.
In decision analysis, a decision tree can be used to visually and explicitly represent decisions and
decision making. In data mining, a decision tree describes data (but the resulting classification tree
can be an input for decision making).

Bayesian classification
Bayesian classification is a probabilistic approach used in data mining and machine learning for classifying
data into different categories or classes based on the principles of Bayesian probability. It is particularly useful
for tasks like text classification, spam email detection, and medical diagnosis.
Bayesian classification is based on Bayes' Theorem. Bayesian classifiers are statistical classifiers:
they can predict class membership probabilities, such as the probability that a given tuple
belongs to a particular class.
Marginal probability
The marginal probability is the probability of a single event occurring, independent of other events.
Example:
Marginal probability refers to the probability of a specific event occurring without considering any other
related events. In the case of rolling a fair six-sided die, each outcome (1, 2, 3, 4, 5, or 6) is equally likely, so
the probability of getting a 6 on a single roll of the die is:
P(6) = 1/6
This is because there is only one favourable outcome (rolling a 6), and there are a total of six possible outcomes
(the numbers 1 through 6 on the die).
So, the marginal probability of getting a 6 when throwing a fair six-sided die is 1/6 or approximately 0.1667
(16.67%).
A conditional probability, on the other hand, is the probability that an event occurs given that another
specific event has already occurred.
Conditional probability
It is the probability of one event occurring given that another event has already occurred. Whether
events are independent or dependent affects how you calculate conditional probabilities.
Independent Events:
When two events are independent, the occurrence of one event does not affect the probability of the
other event happening.
Example 1: Independent Events
Consider flipping a fair coin twice. The outcome of the first flip (Heads or Tails) does not affect the outcome
of the second flip. The conditional probability of getting a Head on the second flip given that you got a Tail
on the first flip is:
P(Second flip is Head | First flip is Tail) = P(Second flip is Head) = 0.5
This is because the probability of getting a Head on the second flip remains 0.5, regardless of what happened
on the first flip.
Dependent Events:
When two events are dependent, the occurrence of one event does affect the probability of the other
event happening.
Example 2: Dependent Events
Consider drawing two cards without replacement from a standard deck of 52 cards. The probability of the
second card being a certain suit depends on what the first card was.
If you draw a Heart on the first card, there are now 51 cards left in the deck, with 12 Hearts remaining. So,
the conditional probability of the second card being a Heart given that the first card was a Heart is:
P(Second card is Heart | First card is Heart) = 12/51
However, if you draw a Spade on the first card, there are still 51 cards left in the deck, but only 11 Spades
remaining. So, the conditional probability of the second card being a Heart given that the first card was a
Spade is:

P(Second card is Heart | First card is Spade) = 11/51


In this example, the probability of the second card being a Heart depends on what happened with the
first card, making the events dependent.
Joint probability:
Joint probability is the probability of two or more events occurring together. The nature of
independence or dependence between events also affects how you calculate joint probabilities. Here are
examples of joint probabilities for both independent and dependent events:
Independent Events:
Example 1: Independent Events
Suppose you are rolling two fair six-sided dice. The probability of getting a specific outcome on each die is
independent of the other. Let's find the joint probability of rolling a 3 on the first die and a 4 on the second
die:
P(First die = 3 and Second die = 4) = P(First die = 3) * P(Second die = 4) = (1/6) * (1/6) = 1/36
In this case, the events (rolling a 3 on the first die and rolling a 4 on the second die) are independent, so we
can simply multiply the individual probabilities to find the joint probability.
Dependent Events:
Example 2: Dependent Events
Consider drawing two cards without replacement from a standard deck of 52 cards, just like in the previous
example. The probability of the second card being a certain suit depends on the outcome of the first card
drawn.
Let's find the joint probability of drawing a Heart on the first card and then drawing another Heart on the
second card:
P(First card is Heart and Second card is Heart) = P(First card is Heart) * P(Second card is Heart |
First card is Heart)
P(First card is Heart) = 13/52 (since there are 13 Hearts in a deck of 52 cards)
Now, if the first card is a Heart, there are 51 cards left in the deck, including 12 Hearts. So, the conditional
probability of drawing a Heart on the second card given that the first card is a Heart is:
P(Second card is Heart | First card is Heart) = 12/51
Now, we can calculate the joint probability:
P(First card is Heart and Second card is Heart) = (13/52) * (12/51) = 1/17
In this case, the events (drawing Hearts on both cards) are dependent, so we calculate the joint probability by
multiplying the probability of the first event by the conditional probability of the second event given the first
event has occurred.
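Both the dice and the card calculations above can be verified with exact rational arithmetic; a small illustrative check using Python's fractions module:

```python
from fractions import Fraction

# Independent events: a 3 on the first die AND a 4 on the second die.
p_3_and_4 = Fraction(1, 6) * Fraction(1, 6)
print(p_3_and_4)      # 1/36

# Dependent events: two Hearts drawn without replacement.
p_first_heart = Fraction(13, 52)                 # 13 Hearts in 52 cards
p_second_heart_given_first = Fraction(12, 51)    # 12 Hearts left in 51 cards
p_both_hearts = p_first_heart * p_second_heart_given_first
print(p_both_hearts)  # 1/17
```

Using `Fraction` keeps the results exact, so the joint probability comes out as 1/17 rather than a rounded decimal.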
Formulas for the types of probability:
Marginal probability: P(A)
Joint probability: P(A and B) = P(A) × P(B|A)
Conditional probability: P(A|B) = P(A and B) / P(B)
Bayes' Theorem
Bayes' Theorem describes the probability of an event based on prior knowledge of conditions which
might be related to the event. In other words, Bayes' Theorem is an extension of conditional probability.
It plays an important role in probability and statistics, and it helps to calculate probabilities.
It answers the question: "What is the probability of Y given that X occurred?"
It is based on the premise that the more information we gather about an event, the better the estimate of its
probability we can make; each time you get a new piece of information, your estimate for the probability
of the event becomes more and more accurate.
Bayes' Theorem is a way to formalize this type of logic and put it into a formula.
Example: To find out the probability of the occurrence of fire in a forest.
Assume we don't have much information about this forest. So, based on historical records, you guess and give
a rough estimate that P(fire) is very small.
Later we get additional information like weather and other conditions: if you see rain, the probability
of fire goes down, and if you see smoke, the probability of fire goes up. Therefore our probability estimate
updates/changes when we get new information about related events.
Assumption:
What is the probability of observing FIRE given that SMOKE was spotted?
Solution:
Event A = Fire
Event B = Smoke

P(A|B) = how often A occurs given B .... P(Fire|Smoke)

P(B|A) = how often B occurs given A .... P(Smoke|Fire)
P(A) = how often A occurs on its own .... P(Fire)
P(B) = how often B occurs on its own .... P(Smoke)

By Bayes' Theorem:

P(Fire|Smoke) = P(Smoke|Fire) × P(Fire) / P(Smoke)

Assume we found the below value using historical data that was collected:
the probability of spotting a dangerous fire in that location is low, 3% = P(Fire).

Bayes' theorem is used for the calculation of a conditional probability where intuition often fails. Although
it originates in probability, the theorem is widely applied in the machine learning field too. Its uses in
machine learning include fitting a model to a training dataset and developing classification models.
Bayes' theorem (alternatively Bayes' law or Bayes' rule) has been called the most powerful rule of probability and
statistics. It describes the probability of an event based on prior knowledge of conditions that might be related to
the event.
For example, if a disease is related to age, then, using Bayes' theorem, a person's age can be used to more accurately
assess the probability that they have the disease, compared to an assessment of the probability of disease made
without knowledge of the person's age.

It is a powerful law of probability that brings the concept of 'subjectivity', or 'the degree of belief', into cold,
hard statistical modeling. Bayes' rule is the mechanism used to gradually update the probability
of an event as the evidence or data is gathered sequentially.

Bayes' Theorem is named after Thomas Bayes, who first used conditional probability to provide an
algorithm that uses evidence to calculate limits on an unknown parameter. Bayes' Theorem involves two types
of probabilities:
1. Prior Probability [P(H)]
2. Posterior Probability [P(H|X)] Where,

1. Prior Probability (initial belief)

Prior Probability is the probability of an event occurring before the collection of new data. It is the best logical
evaluation of the probability of an outcome based on the present knowledge of the event before the
inspection is performed.

2. Posterior Probability (updated belief after getting additional information)

When new data or information is collected then the Prior Probability of an event will be revised to produce
a more accurate measure of a possible outcome. This revised probability becomes the Posterior Probability
and is calculated using Bayes’ theorem. So, the Posterior Probability is the probability of an event X occurring
given that event H has occurred.

Examples

1. SpamAssassin works as a mail filter to identify spam, in which users train the system. It
considers patterns in the words which are marked as spam by the users. For example, it may have
learned that the word "release" appears in 30% of spam emails, that 0.8% of non-spam
mails include the word "release", and that 40% of all emails received by the user
are spam. Find the probability that a mail is spam if the word "release" appears in it.
Solution :

Given,
P(Release | Spam) = 0.30
P(Release | Non Spam) = 0.008
P(Spam) = 0.40
=> P(Non Spam) = 1 - 0.40 = 0.60

P(Spam | Release) = ?

Now, using Bayes' Theorem:

P(Spam | Release) = P(Release | Spam) * P(Spam) / P(Release)
= (0.30 * 0.40) / (0.40 * 0.30 + 0.60 * 0.008)
= 0.12 / 0.1248
= 0.962

Hence, the required probability is approximately 0.962.

2. Bag1 contains 4 white and 8 black balls and Bag2 contains 5 white and 3 black balls. From one
of the bag one ball is drawn at random and the ball which is drawn comes out as black. Find the
probability that the ball is drawn from Bag1.
Solution:

Given,
Let E1, E2 and A be the three events where,
E1 = Event of selecting Bag1
E2 = Event of selecting Bag2
A = Event of drawing black ball

Now,
P(E1) = P(E2) = 1/2
P(drawing a black ball from Bag1) = P(A|E1) = 8/12 = 2/3
P(drawing a black ball from Bag2) = P(A|E2) = 3/8

By using Bayes' Theorem, the probability of drawing a black ball from Bag1,
P(E1|A) = P(A|E1) * P(E1) / [P(A|E1) * P(E1) + P(A|E2) * P(E2)]
[The denominator P(A|E1) * P(E1) + P(A|E2) * P(E2) is the Total Probability P(A)]
= (2/3 * 1/2) / (2/3 * 1/2 + 3/8 * 1/2)
= (1/3) / (25/48)
= 16/25
Hence, the probability that the ball is drawn from Bag1 is 16/25.
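The same result can be verified exactly with Python’s fractions module:

```python
from fractions import Fraction

p_e1 = p_e2 = Fraction(1, 2)      # each bag is equally likely to be chosen
p_a_given_e1 = Fraction(8, 12)    # black balls in Bag1: 8 out of 12
p_a_given_e2 = Fraction(3, 8)     # black balls in Bag2: 3 out of 8

# Total probability of drawing a black ball
p_a = p_a_given_e1 * p_e1 + p_a_given_e2 * p_e2

# Bayes' theorem
p_e1_given_a = p_a_given_e1 * p_e1 / p_a
print(p_e1_given_a)  # 16/25
```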
Naive Bayes Classifier

Categorical Values:

Categorical values, also known as discrete values, represent distinct categories or labels that do not have a natural
order or numerical meaning.

They are used to classify data into specific groups or classes.

Examples of categorical values include:

Colors (e.g., red, blue, green)

Types of animals (e.g., cat, dog, bird)

Marital status (e.g., single, married, divorced)

Customer ratings (e.g., low, medium, high)

Categorical data can be further divided into nominal and ordinal data:

Nominal data: Categories have no inherent order or ranking (e.g., colors).

Ordinal data: Categories have a meaningful order or ranking (e.g., customer ratings from low to high).

Continuous Values:

Continuous values, also known as numerical values, represent quantities that can take on any real value within a
certain range.

They are often measured and can have decimal or fractional values.

Continuous data is used for measurements that can be quantified.

Examples of continuous values include:

Height (e.g., 165.5 cm)

Temperature (e.g., 27.3°C)

Age (e.g., 35.5 years)

Weight (e.g., 68.2 kg)

NAIVE BAYES THEOREM

The "naive" part of the name comes from the assumption that Naive Bayes makes. It assumes that all features
(attributes) are conditionally independent given the class label. In other words, it treats each feature as if it is unrelated
to all other features when predicting the class label. This is a simplifying assumption to make the calculations tractable.

Depending on the type of features (categorical or continuous) and the specific variant of Naive Bayes (Gaussian,
Multinomial, Bernoulli), the way conditional probabilities are estimated may vary.
ALGORITHM:

STEP 1: Data Preparation:

Gather a labeled dataset where each data point has features and corresponding class labels (binary: Yes/No or 1/0).

STEP2: Estimate Prior Probabilities:

Calculate the prior probabilities of each class:

P(Class=Yes): Count of data points with class "Yes" divided by the total number of data points.

P(Class=No): Count of data points with class "No" divided by the total number of data points.

STEP 3: Estimate Conditional Probabilities:

For each feature, estimate the conditional probabilities for each class:

For categorical features:

Calculate the conditional probabilities P(Feature=Value|Class=Yes) and
P(Feature=Value|Class=No) for each unique value of the feature.

Use Laplace smoothing or other techniques to handle cases where certain feature values have zero counts in one
class.

For continuous features:

Calculate the mean (μ) and standard deviation (σ) of the feature for each class.

Use these parameters to compute the conditional probabilities using the Gaussian distribution formula.

STEP 4:

Prediction Phase:

Data Preparation:

Obtain a new, unlabeled data point with features.

Calculate Class Posteriors:


For each class, calculate the class posterior probability using Bayes' theorem:

For each feature, use the conditional probabilities estimated during training.

Make a Prediction:

Compare the class posterior probabilities and choose the class with the highest probability as the predicted class for
the new data point.


EXAMPLE

Estimating Conditional Probabilities For Categorical Attributes

Let's create a simple sample dataset for estimating conditional probabilities for categorical attributes using the
Naive Bayes theorem, with numerical calculations. In this example, we'll consider a dataset of weather conditions
and whether people go for a picnic or not. The attributes are "Outlook," "Temperature," and "Humidity," and the
target variable is "Go for Picnic" (Yes/No).

Sample Dataset (a 10-instance dataset consistent with the counts used in the calculations below):

No.  Outlook   Temperature  Humidity  Go for Picnic
1    Sunny     Cool         High      No
2    Sunny     Hot          High      No
3    Sunny     Mild         Normal    No
4    Rainy     Hot          Normal    No
5    Sunny     Hot          High      Yes
6    Sunny     Mild         High      Yes
7    Overcast  Cool         High      Yes
8    Overcast  Mild         Normal    Yes
9    Rainy     Hot          Normal    Yes
10   Rainy     Mild         Normal    Yes
Given Observation:

Outlook: Sunny

Temperature: Cool

Humidity: High

Now, let's calculate the probability of "Go for Picnic" being "No" based on this observation:

Step 1: Calculate Prior Probabilities P(Go for Picnic=No) & P(Go for Picnic=Yes):

From the dataset, there are 4 instances of "No" out of 10 total instances.

P(Go for Picnic=No) = 4/10 = 0.4

From the dataset, there are 6 instances of "Yes" out of 10 total instances.

P(Go for Picnic=Yes) = 6/10 = 0.6

Step 2: Calculate Conditional Probabilities for Each Attribute:

Calculate P(Outlook=Sunny|Go for Picnic=No):

There are 3 instances where Outlook is Sunny and Go for Picnic is No (out of 4 "No" instances).

P(Outlook=Sunny|Go for Picnic=No) = 3/4 = 0.75

Calculate P(Temperature=Cool|Go for Picnic=No):

There is 1 instance where Temperature is Cool and Go for Picnic is No.

P(Temperature=Cool|Go for Picnic=No) = 1/4 = 0.25

Calculate P(Humidity=High|Go for Picnic=No):

There are 2 instances where Humidity is High and Go for Picnic is No.

P(Humidity=High|Go for Picnic=No) = 2/4 = 0.5

Similarly for Yes,

Calculate P(Outlook=Sunny|Go for Picnic=Yes):

There are 2 instances where Outlook is Sunny and Go for Picnic is Yes (out of 6 "Yes" instances).

P(Outlook=Sunny|Go for Picnic=Yes) = 2/6 ≈ 0.333

Calculate P(Temperature=Cool|Go for Picnic=Yes):

There is 1 instance where Temperature is Cool and Go for Picnic is Yes.

P(Temperature=Cool|Go for Picnic=Yes) = 1/6 ≈ 0.167

Calculate P(Humidity=High|Go for Picnic=Yes):

There are 3 instances where Humidity is High and Go for Picnic is Yes.

P(Humidity=High|Go for Picnic=Yes) = 3/6 = 0.5


Step 3: Calculate the Marginal Probabilities (P(Sunny), P(Cool), P(High)):

P(Outlook=Sunny) = 5/10 = 0.5

P(Temperature=Cool) = 2/10 = 0.2

P(Humidity=High) = 5/10 = 0.5

Step 4: Apply the Naive Bayes Theorem:

Now, we can apply the Naive Bayes theorem to calculate the probability of "Go for Picnic" being "No" based on
the given observation:

P(Go for Picnic=No|Outlook=Sunny, Temperature=Cool, Humidity=High)

= (P(Outlook=Sunny|Go for Picnic=No) * P(Temperature=Cool|Go for Picnic=No) * P(Humidity=High|Go for
Picnic=No) * P(Go for Picnic=No)) / (P(Outlook=Sunny) * P(Temperature=Cool) * P(Humidity=High))

= (0.75 * 0.25 * 0.5 * 0.4) / (0.5 * 0.2 * 0.5)

Now, we can apply the Naive Bayes theorem to calculate the probability of "Go for Picnic" being "Yes" based on
the given observation:

P(Go for Picnic=Yes|Outlook=Sunny, Temperature=Cool, Humidity=High)

= (P(Outlook=Sunny|Go for Picnic=Yes) * P(Temperature=Cool|Go for Picnic=Yes) * P(Humidity=High|Go for
Picnic=Yes) * P(Go for Picnic=Yes)) / (P(Outlook=Sunny) * P(Temperature=Cool) * P(Humidity=High))

= (0.3333 * 0.1667 * 0.5 * 0.6) / (0.5 * 0.2 * 0.5)

Step 5: Calculate the Final Probability:

Calculate the final probabilities by performing the above calculations:

P(Go for Picnic=No|Outlook=Sunny, Temperature=Cool, Humidity=High) = 0.0375 / 0.05 = 0.75

So, based on the given observation, the probability of "Go for Picnic" being "No" is approximately 75%.

P(Go for Picnic=Yes|Outlook=Sunny, Temperature=Cool, Humidity=High) ≈ 0.0167 / 0.05 ≈ 0.33

So, based on the given observation, the probability of "Go for Picnic" being "Yes" is approximately 33%.

(Note that the two values need not sum to exactly 1, because the naive independence assumption is also applied
to the denominator. In practice only the numerators are compared, since the denominator is the same for both
classes.)

Based on the calculations:

Since the probability of "Go for Picnic" being "No" (0.75) is higher than the probability of "Go for Picnic" being
"Yes" (0.33), the final classification for the given observation (Outlook=Sunny, Temperature=Cool,
Humidity=High) is:

"Go for Picnic = No"
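As a sketch, the whole classification can be reproduced in a few lines of Python. The class counts and per-class attribute counts below are the ones assumed in this worked example (4 "No" and 6 "Yes" instances); the shared denominator P(X) is dropped, since it is identical for both classes and does not change which class wins:

```python
# Priors from the assumed class counts: 4 "No" rows, 6 "Yes" rows out of 10
priors = {"No": 4 / 10, "Yes": 6 / 10}

# P(attribute = value | class), estimated from the per-class counts above
cond = {
    "No":  {"Sunny": 3 / 4, "Cool": 1 / 4, "High": 2 / 4},
    "Yes": {"Sunny": 2 / 6, "Cool": 1 / 6, "High": 3 / 6},
}

observation = ["Sunny", "Cool", "High"]

# Naive Bayes score: prior times the product of the conditionals
scores = {}
for label in priors:
    score = priors[label]
    for value in observation:
        score *= cond[label][value]
    scores[label] = score

for label, score in scores.items():
    print(label, round(score, 4))   # No ≈ 0.0375, Yes ≈ 0.0167

prediction = max(scores, key=scores.get)
print(prediction)  # No
```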


Estimate Conditional Probabilities For The Continuous Attribute

Sample Dataset: a set of patient records, each with a continuous "Age" attribute and a "Has Disease" (Yes/No)
class label.

Now, let's estimate conditional probabilities for the continuous attribute "Age" given the target variable "Has
Disease" (Yes/No) using Gaussian Naive Bayes.

Step1:

Estimating Conditional Probabilities for Age:

For each class ("Has Disease=Yes" and "Has Disease=No"), we'll calculate the mean (μ) and standard deviation (σ)
of the "Age" attribute.

Class "Has Disease=Yes":

Calculate mean (μ_yes) and standard deviation (σ_yes) of age for individuals with the disease.

Class "Has Disease=No":

Calculate mean (μ_no) and standard deviation (σ_no) of age for individuals without the disease.

For this example, assume the dataset gives μ_yes = 43.83, σ_yes = 10.42 and μ_no = 34.5, σ_no = 9.72.

Step 2: Calculate Conditional Probabilities Using Gaussian (Normal) Distribution:

Now that we have the mean (μ) and standard deviation (σ) for each class, we can calculate the conditional
probabilities for a given age value and each class using the Gaussian (normal) distribution formula:

the probability density function (PDF) of the Gaussian (normal) distribution. The Gaussian distribution PDF is the
foundation for calculating probabilities for continuous attributes in Gaussian Naive Bayes.

You can apply this formula for various age values and both classes to calculate conditional probabilities. This
approach is called Gaussian Naive Bayes and is commonly used for continuous attributes in Naive Bayes
classification.

P(Age|Has Disease=Yes) = (1 / (σ_yes * √(2π))) * exp(-((Age - μ_yes)² / (2σ_yes²)))

P(Age|Has Disease=No) = (1 / (σ_no * √(2π))) * exp(-((Age - μ_no)² / (2σ_no²)))

Let's calculate P(Age=40|Has Disease=Yes) and P(Age=40|Has Disease=No):


For "Has Disease=Yes":

P(Age=40|Has Disease=Yes) = (1 / (10.42 * √(2π))) * exp(-((40 - 43.83)² / (2 * 10.42²)))

P(Age=40|Has Disease=Yes) ≈ 0.0383 * exp(-0.0675) ≈ 0.0358

For "Has Disease=No":

P(Age=40|Has Disease=No) = (1 / (9.72 * √(2π))) * exp(-((40 - 34.5)² / (2 * 9.72²)))

P(Age=40|Has Disease=No) ≈ 0.0410 * exp(-0.1601) ≈ 0.0350

Step 3: Compare Conditional Probabilities:

Now that we have calculated both conditional probabilities for Age=40, we can compare them to predict the class
("Has Disease=Yes" or "Has Disease=No") based on the higher probability.

P(Has Disease=Yes) is the prior probability for "Has Disease=Yes" based on your dataset.

P(Has Disease=No) is the prior probability for "Has Disease=No" based on your dataset.

Let's assume:

P(Has Disease=Yes) = 0.6 (60% of the dataset has "Has Disease=Yes").

P(Has Disease=No) = 0.4 (40% of the dataset has "Has Disease=No").

Step 4: Calculate Final Probabilities:

Calculate the final (unnormalized) probability of each class given Age=40 by multiplying the likelihood by the
prior. The denominator P(Age=40) is the same for both classes, so it can be ignored when comparing:

P(Has Disease=Yes|Age=40) ∝ P(Age=40|Has Disease=Yes) * P(Has Disease=Yes) = 0.0358 * 0.6 ≈ 0.0215

P(Has Disease=No|Age=40) ∝ P(Age=40|Has Disease=No) * P(Has Disease=No) = 0.0350 * 0.4 ≈ 0.0140

Step 5: Make a Prediction:

Since P(Has Disease=Yes|Age=40) is higher than P(Has Disease=No|Age=40), we predict that the person with Age=40
is likely to have the disease, i.e., "Has Disease=Yes."

So, based on the given age value of 40, the prediction is "Has Disease=Yes."
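These density and posterior calculations can be rechecked with a short script; the means, standard deviations, and priors used are the values assumed in this example:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Probability density of a normal distribution at x."""
    coeff = 1.0 / (sigma * math.sqrt(2 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# Class parameters assumed in the example above
p_age_yes = gaussian_pdf(40, mu=43.83, sigma=10.42)
p_age_no = gaussian_pdf(40, mu=34.5, sigma=9.72)

# Multiply by the assumed priors and compare; P(Age=40) cancels out
score_yes = p_age_yes * 0.6
score_no = p_age_no * 0.4
print(round(p_age_yes, 4), round(p_age_no, 4))  # densities for the two classes
print("Yes" if score_yes > score_no else "No")  # prints: Yes
```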

Applications of Naive Bayes Classifier

Text Classification: Naive Bayes classifiers are mostly used in text classification, to automatically classify web
pages, forum posts, blog snippets and tweets without manually going through them.

Ranking Pages: We can use the Naive Bayes Classifier Algorithm for ranking pages, indexing relevancy scores and
classifying data categorically.

Recommendation System: Naive Bayes Classifier and Collaborative Filtering together builds a Recommendation
System that leverages machine learning and data mining to filter unseen information and predict whether a user would
like a given resource (item, movie) or not.

KNN ALGORITHM:

o The k-Nearest Neighbors (kNN) algorithm determines the target value of a new data point or instance
by comparing it with existing data points or instances that are closest to it. The target values of the
k closest instances are aggregated and taken as the target output for the new data point.
o K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on Supervised
Learning technique.
o K-NN algorithm assumes the similarity between the new case/data and available cases and put the new
case into the category that is most similar to the available categories.
o K-NN algorithm stores all the available data and classifies a new data point based on the similarity.
This means that when new data appears, it can be easily classified into a well-suited category by using
the K-NN algorithm.
o K-NN algorithm can be used for Regression as well as for Classification but mostly it is used for the
Classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any assumption on underlying
data.
o It is also called a lazy learner algorithm because it does not learn from the training set immediately;
instead it stores the dataset and, at the time of classification, performs an action on the dataset.
o KNN algorithm at the training phase just stores the dataset and when it gets new data, then it classifies
that data into a category that is much similar to the new data.
o Example: Suppose, we have an image of a creature that looks similar to cat and dog, but we want to
know either it is a cat or dog. So for this identification, we can use the KNN algorithm, as it works on
a similarity measure. Our KNN model will find the similar features of the new data set to the cats and
dogs images and based on the most similar features it will put it in either cat or dog category.

Suppose there are two categories, i.e., Category A and Category B, and we have a new data point x1, so this
data point will lie in which of these categories. To solve this type of problem, we need a K-NN algorithm. With
the help of K-NN, we can easily identify the category or class of a particular dataset. Consider the below
diagram:

How does K-NN work?

The K-NN working can be explained on the basis of the below algorithm:
Step #1 - Assign a value to K.
Step #2 - Calculate the distance between the new data entry and all other existing data entries (you'll learn
how to do this shortly). Arrange them in ascending order.
Step #3 - Find the K nearest neighbors to the new entry based on the calculated distances.
Step #4 - Assign the new data entry to the majority class in the nearest neighbors.

K-Nearest Neighbors Classifiers and Model Example With Diagrams

Consider the diagram below:

The graph above represents a data set consisting of two classes — red and blue.

A new data entry has been introduced to the data set. This is represented by the green point in the graph
above.

We'll then assign a value to K which denotes the number of neighbors to consider before classifying the new
data entry. Let's assume the value of K is 3.
Since the value of K is 3, the algorithm will only consider the 3 nearest neighbors to the green point (new
entry). This is represented in the graph above.

Out of the 3 nearest neighbors in the diagram above, the majority class is red so the new entry will be
assigned to that class.

The last data entry has been classified as red.

K-Nearest Neighbors Classifiers and Model Example With Data Set

The table above represents our data set. We have two columns — Brightness and Saturation. Each row in
the table has a class of either Red or Blue.

let's assume the value of K is 5.


How to Calculate Euclidean Distance in the K-Nearest Neighbors Algorithm
Here's the new data entry:

We have a new entry but it doesn't have a class yet. To know its class, we have to calculate the distance
from the new entry to other entries in the data set using the Euclidean distance formula.

Here's the formula: √((X₂-X₁)² + (Y₂-Y₁)²)

Where:

X₂ = New entry's brightness (20).

X₁= Existing entry's brightness.

Y₂ = New entry's saturation (35).


Y₁ = Existing entry's saturation.

Distance #1
For the first row, d1:

d1 = √((20 - 40)² + (35 - 20)²)


   = √(400 + 225)
   = √625
   = 25

Distance #2
For the second row, d2:

d2 = √((20 - 50)² + (35 - 50)²)


   = √(900 + 225)
   = √1125
   ≈ 33.54

Distance #3
For the third row, d3:

d3 = √((20 - 60)² + (35 - 90)²)


   = √(1600 + 3025)
   = √4625
   ≈ 68.01
Similarly, calculate the distance for the last four rows.

Here's what the table will look like after all the distances have been calculated:

Let's rearrange the distances in ascending order:

Since we chose 5 as the value of K, we'll only consider the first five rows. That is:
As you can see above, the majority class within the 5 nearest neighbors to the new entry is Red. Therefore,
we'll classify the new entry as Red.
Here's the updated table:
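The distances worked out above can be verified with a short script. Only the three rows computed explicitly are included here; the remaining rows of the pictured table would be handled the same way:

```python
import math

def euclidean(p, q):
    """Euclidean distance between two (brightness, saturation) points."""
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

new_entry = (20, 35)  # brightness and saturation of the new entry

# The three existing entries whose distances were worked out above
rows = [(40, 20), (50, 50), (60, 90)]

distances = [round(euclidean(new_entry, r), 2) for r in rows]
print(distances)  # [25.0, 33.54, 68.01]
```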

How to Choose the Value of K in the K-NN Algorithm


There is no particular way of choosing the value K, but here are some common conventions to keep in mind:

1. Choosing a very low value will most likely lead to inaccurate predictions.

2. The commonly used value of K is 5.

3. Always use an odd number as the value of K, to avoid ties. A common rule of thumb is k = sqrt(N),
where "N" stands for the number of samples in your training data set.

Advantages of K-NN Algorithm


1. It is simple to implement.

2. No training is required before classification.

3. It's flexible to different feature/distance choices.

4. It naturally handles multi-class cases.

5. It can do well in practice with enough representative data.


Disadvantages of K-NN Algorithm
1. It can be cost-intensive when working with a large data set: the distance from each query instance to
all training samples must be computed, so the computation cost is quite high.

2. A lot of memory is required, because the entire training data set must be stored.

3. Choosing the right value of the parameter "K" (the number of nearest neighbors) can be tricky.

4. A meaningful distance function is required.

EXAMPLE 2:

Let’s take a small example examining age vs. loan amount.

 We need to predict Andrew’s default status — either yes or no.

 Then calculate the Euclidean distance for all the data points.


 With K=5, there are two Default=N and three Default=Y out of five closest neighbors. We can safely
say the default status for Andrew is “Y” based on the majority similarity in three points out of five.
 KNN is also a lazy learner because it doesn’t learn a discriminative function from the training data
but “memorizes” the training data set instead.
Backpropagation

Backpropagation is a key algorithm used in the training of artificial neural networks. It is a method for
adjusting the network's weights(In short, weights in neural networks determine how much importance each
input has in making predictions. They help the network learn patterns in the data and adjust its connections to
make accurate predictions.) and biases(In short, biases in neural networks help the network adapt to different
data patterns, shift activation functions, and model non-linear relationships, making the network more flexible
and capable of learning from data. They are important for the network to fit the data accurately.) to minimize
the error (or loss) between the predicted outputs and the actual target outputs.

Backpropagation Algorithms in Neural Networks

Backpropagation is used in neural networks to improve output. A neural network is a collection of connected
input and output nodes.

The network's accuracy is expressed through a loss function, which is also known as an error rate. Back propagation
calculates the mathematical gradient, or slope, of the error rate with respect to each of the weights in the
neural network.

Based on the calculations, neural network nodes with high error rates are given less weight than nodes with
lower error rates, which are given more weight. Weights determine how much influence an input will have on
an output.

How Backpropagation Algorithms Work


Backpropagation trains a neural network by first assigning random weights, then analyzing where the error in
the system increases.

When errors occur, the difference between the model output and the actual output is calculated. Once
calculated, a different weight is assigned and the system is run again, to see if the error is minimized.

If the error is not minimized, then an update of parameters is required. To update parameters, the weights and
biases are adjusted. A bias can be thought of as a weight on an extra input that is fixed at the value 1; like
the weights, biases are learned during training.

After the parameters are updated, the process is run again. Once the error is at a minimum, the model is ready
to start predicting.

Looking at the diagram below: if the weight "W" is changed in one direction, the error of the system goes
up; if "W" is changed in the other direction, the error goes down.

Once the error is as close to zero as possible that weight is set as the parameter and the model can start
predicting.

Types of Backpropagation

There are two types of back propagation: static and recurrent.


Static Back propagation

A static back propagation network aims to produce a map of static inputs to fixed outputs. This type
of network can solve static classification problems such as optical character recognition (OCR), which
allows computers to understand written documents.

Recurrent Back propagation

Recurrent back propagation is used in data mining to find a fixed value. Once the fixed value is found,
the error is computed and then propagated backward through the network.

The difference between the two types of back propagation is that static mapping is immediate and
recurrent back propagation takes a longer time to map.

Why Use Back propagation

Back propagation increases the accuracy of predictions as it is able to calculate derivatives quickly. Back
propagation algorithms are intended to develop learning algorithms for multilayer feed forward neural
networks. This algorithm is trained to capture mapping, which in turn aids in data mining and machine
learning. Back propagation increases efficiency by reducing the errors found in a network.

Why We Need Back propagation?

While designing neural network, in the beginning, we initialize weights with some random values or any
variable for that fact.

Now obviously, we are not superhuman. So, it's not necessary that whatever weight values we have
selected will be correct, or that they fit our model the best.

Okay, fine, we have selected some weight values in the beginning, but our model output is way different
than our actual output, i.e. the error value is huge.

Now, how will you reduce the error? Basically, we need to somehow explain to the model that it should
change the parameters (weights) such that the error becomes minimum.

One way to train our model is called Back propagation. Consider the diagram below:

The steps:

Calculate the error — How far is your model output from the actual output.

Error minimum—Check whether the error is minimized or not.


Update the parameters —If the error is huge then, update the parameters (weights and biases). After that
again check the error. Repeat the process until the error becomes minimum.

Model is ready to make a prediction —Once the error becomes minimum, you can feed some inputs to
your model and it will produce the output.

Step 1: Forward Propagation

Step 2: Backward Propagation

Step 3: Putting all the values together and calculating the updated weight value.

Backpropagation Algorithm:

Suppose you have a simple dataset for a regression problem. It consists of input and target pairs like this:

1. Training Data:
You need to provide a dataset that includes input features and corresponding target values. This dataset is used for
training the network. In the example, the training data would be the pairs of X (input) and Y (target) values

2. Network Architecture:
You need to define the architecture of your neural network, which includes the number of layers, the number of
neurons in each layer, and the activation functions used. In the example, this was specified as:

3. Initialization of Weights and Biases:


You may specify how the initial weights and biases of the network are set. Often, they are initialized with small
random values.

Loss Function: Choose a loss function that measures the error between the predicted outputs and the actual targets.
In the example, the mean squared error (MSE) is commonly used for regression problems.

4. Forward Pass:
Pass each input through the network, layer by layer, to compute the predicted output and the resulting loss.

5. Backpropagation:
Calculate the gradient of the loss with respect to the network's parameters (weights and biases) by using the chain
rule. Update the weights and biases in the direction that reduces the loss, typically using an optimization algorithm
like gradient descent.

6. Repeat steps 4 and 5 for all data points in the dataset (this constitutes one training epoch).

7. Repeat steps 4-6 for multiple epochs until the loss converges to a minimum value or a predefined stopping
criterion is met.

Sample dataset with calculation using forward propagation:

Backpropagation calculates the gradients needed to adjust the weights and biases during the training process. The
process iteratively refines the network's parameters to minimize the loss and improve its ability to make accurate
predictions on the given dataset.

Each step in backpropagation involves calculating partial derivatives of the loss function with respect to the network's
parameters. The specific formulas and calculations will depend on the network architecture, activation functions, and
the choice of loss function. This is a simplified overview, and actual neural networks may have more complex
architectures and variations, but the fundamental principles of backpropagation remain the same.
Let's walk through a step-by-step forward pass (forward propagation) and backward pass (backpropagation).
There are two units in the Input Layer, two units in the Hidden Layer and two units in the Output Layer. The
w1, w2, w3, …, w8 represent the respective weights. b1 and b2 are the biases for the Hidden Layer and Output
Layer, respectively.

we’ll be passing two inputs i1 and i2, and perform a forward pass to compute total error and then a backward
pass to distribute the error inside the network and update weights accordingly.

Peeking inside a single neuron

Inside h1 (first unit of the hidden layer)

Inside a unit, two operations happen (i) computation of weighted sum and (ii) squashing of the weighted
sum using an activation function. The result from the activation function becomes an input to the next layer
(until the next layer is an Output Layer). In this example, we’ll be using the Sigmoid function (Logistic
function) as the activation function. The Sigmoid function basically takes an input and squashes the value
between 0 and +1.

The Forward Pass

Remember that each unit of a neural network performs two operations: compute weighted sum and process
the sum through an activation function. The outcome of the activation function determines if that particular
unit should activate or become insignificant.
Let’s get started with the forward pass.

For h1,

Now we pass this weighted sum through the logistic function (sigmoid function) so as to squash the weighted
sum into the range (0 and +1). The logistic function is an activation function for our example neural network.

Similarly for h2, we perform the weighted sum operation sumh2 and compute the activation value outputh2.

Now, outputh1 and outputh2 will be considered as inputs to the next layer.

Computing the total error

We started off supposing the expected outputs to be 0.05 and 0.95 respectively for outputo1 and outputo2.
Now we will compute the errors based on the outputs received until now and the expected outputs.

We’ll use the following error formula,


To compute Etotal, we need to first find out respective errors at o1 and o2.

The Backpropagation

The aim of backpropagation (backward pass) is to distribute the total error back through the network so as to
update the weights in order to minimize the cost function (loss). The weights are updated in such a way that
when the next forward pass utilizes the updated weights, the total error will be reduced by a certain margin
(until the minimum is reached).

For weights in the output layer (w5, w6, w7, w8)

For w5,

Let’s compute how much contribution w5 has on E1. If we become clear on how w5 is updated, then it would
be really easy for us to generalize the same to the rest of the weights. If we look closely at the example neural
network, we can see that E1 is affected by outputo1, outputo1 is affected by sumo1, and sumo1 is affected
by w5. It’s time to recall the Chain Rule.
Let’s deal with each component of the above chain separately.
Note:
For weights in the hidden layer (w1, w2, w3, w4)

Similar calculations are made to update the weights in the hidden layer. However, this time the chain becomes
a bit longer. It does not matter how deep the neural network goes, all we need to find out is how much error is
propagated (contributed) by a particular weight to the total error of the network. For that purpose, we need to
find the partial derivative of Error w.r.t. to the particular weight. Let’s work on updating w1 and we’ll be able
to generalize similar calculations to update the rest of the weights.

For w1 (with respect to E1),


Once we’ve computed all the new weights, we need to update all the old weights with these new weights.
Once the weights are updated, one back propagation cycle is finished. Now the forward pass is done and the
total new error is computed. And based on this newly computed total error the weights are again updated. This
goes on until the loss value converges to a minimum. This way a neural network starts with random values for
its weights and finally converges to optimum values.
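One full forward-and-backward cycle for this 2-2-2 sigmoid network can be sketched in plain Python. The input values, initial weights, biases, and learning rate below are assumed purely for illustration (only the expected outputs 0.05 and 0.95 come from the example above):

```python
import math

def sigmoid(z):
    """Logistic activation: squashes z into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Assumed values, for illustration only
i1, i2 = 0.05, 0.10                                    # inputs
t1, t2 = 0.05, 0.95                                    # expected outputs
w = [0.15, 0.20, 0.25, 0.30, 0.40, 0.45, 0.50, 0.55]  # w1..w8
b1, b2 = 0.35, 0.60                                    # hidden / output biases
lr = 0.5                                               # learning rate

def forward(w):
    # Each unit computes a weighted sum, then squashes it with the sigmoid
    h1 = sigmoid(w[0] * i1 + w[1] * i2 + b1)
    h2 = sigmoid(w[2] * i1 + w[3] * i2 + b1)
    o1 = sigmoid(w[4] * h1 + w[5] * h2 + b2)
    o2 = sigmoid(w[6] * h1 + w[7] * h2 + b2)
    return h1, h2, o1, o2

def total_error(o1, o2):
    # E_total = 1/2 (t1 - o1)^2 + 1/2 (t2 - o2)^2
    return 0.5 * (t1 - o1) ** 2 + 0.5 * (t2 - o2) ** 2

h1, h2, o1, o2 = forward(w)
e_before = total_error(o1, o2)

# Backward pass: chain rule, starting at the output layer (w5..w8)
d_o1 = (o1 - t1) * o1 * (1 - o1)   # dE/d(sum_o1)
d_o2 = (o2 - t2) * o2 * (1 - o2)   # dE/d(sum_o2)
grads = [0.0] * 8
grads[4], grads[5] = d_o1 * h1, d_o1 * h2
grads[6], grads[7] = d_o2 * h1, d_o2 * h2

# Hidden layer (w1..w4): the chain is longer, passing through both outputs
d_h1 = (d_o1 * w[4] + d_o2 * w[6]) * h1 * (1 - h1)
d_h2 = (d_o1 * w[5] + d_o2 * w[7]) * h2 * (1 - h2)
grads[0], grads[1] = d_h1 * i1, d_h1 * i2
grads[2], grads[3] = d_h2 * i1, d_h2 * i2

# Gradient-descent update of all eight weights at once
w = [wi - lr * g for wi, g in zip(w, grads)]

_, _, o1, o2 = forward(w)
e_after = total_error(o1, o2)
print(e_before > e_after)  # True: one update cycle reduced the total error
```

Repeating this cycle many times drives the total error toward a minimum, exactly as described above.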

EPOCH

An epoch is a complete pass through the entire training dataset during the training of a neural network. It's an
important concept in the training process because it represents one cycle where the network learns from all
the training examples.

Number of Training Epochs: The number of training epochs specifies how many times the entire training
dataset is passed through the network for weight updates. It is a user-defined parameter and depends on when
you decide to stop training. Training for more epochs may lead to better convergence, but there's a risk of
overfitting if you train for too long.
Mini-Batch Size (for Stochastic Gradient Descent): Mini-batch size is used when you're training with
stochastic gradient descent (SGD) or its variants like mini-batch gradient descent. Instead of updating the
weights after processing the entire dataset (batch gradient descent) or after processing a single data point (pure
stochastic gradient descent), you update the weights after processing a mini-batch of data points. This strikes
a balance between the high variance of pure SGD and the high computational cost of batch gradient descent.
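The relationship between epochs, mini-batches, and weight updates can be illustrated with a small loop skeleton; the dataset size, batch size, and epoch count here are arbitrary toy values:

```python
# A toy dataset of 10 samples and an assumed mini-batch size of 4
dataset = list(range(10))
batch_size = 4
num_epochs = 3

updates = 0
for epoch in range(num_epochs):
    # One epoch = one complete pass through the entire training dataset
    for start in range(0, len(dataset), batch_size):
        mini_batch = dataset[start:start + batch_size]
        # ... forward pass, backpropagation, and weight update on mini_batch ...
        updates += 1

# 10 samples with batch size 4 -> 3 mini-batches per epoch (last batch has 2 samples)
print(updates)  # 9 weight updates over 3 epochs
```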

Sigmoid Function

The sigmoid function is a commonly used activation function in neural networks. It is a non-linear function
that takes an input and maps it to an output between 0 and 1. The sigmoid function is often used to introduce
non-linearity in the network, making it capable of modeling complex relationships in the data.

The sigmoid function, denoted as σ(z), where z is the input, is defined as follows: σ(z) = 1 / (1 + e^(-z))

In this formula, "e" is the base of the natural logarithm.

It takes an input value (z) and produces an output between 0 and 1. When z is very negative, the output is
close to 0, and when z is very positive, the output is close to 1.
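A direct implementation of this formula in Python:

```python
import math

def sigmoid(z):
    """Squash z into the range (0, 1) using the logistic function."""
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0))            # 0.5
print(round(sigmoid(5), 4))  # close to 1
print(round(sigmoid(-5), 4)) # close to 0
```

Note how large positive inputs saturate near 1 and large negative inputs saturate near 0, which is why sigmoid outputs are often read as probabilities or activation strengths.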

SUMMARY:

Back propagation is a critical component of the training process in neural networks, and it is used to rectify
errors that occur during forward propagation. Here's how it works:

Forward Propagation: During forward propagation, the neural network makes predictions based on its
current weights and biases. It computes an output for a given input, and this output is compared to the actual
target value. Errors, or the differences between predictions and actual values, occur during this step.

Backpropagation: After forward propagation, back propagation is used to calculate how much each weight
and bias contributed to the errors in the predictions. It does this by computing gradients (derivatives) of the
error with respect to the model's parameters (weights and biases).

Weight and Bias Updates: With the gradients in hand, the network updates its weights and biases using
optimization algorithms like gradient descent. These updates are made in such a way that they help reduce the
errors during the next forward pass.

Repeating the Process: This process is repeated for multiple epochs, allowing the network to progressively
adjust its weights and biases to make more accurate predictions. Over time, the errors are minimized, and the
network converges to a state where it makes accurate predictions on the training data.

Back propagation is the mechanism by which the neural network learns from its errors in forward propagation.
It calculates how much each parameter contributed to these errors and then updates the parameters to minimize
those errors in subsequent forward passes. This iterative process of forward and backward passes is how neural
networks learn to make increasingly accurate predictions.
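The forward-pass / backward-pass / update cycle can be sketched with a single sigmoid neuron trained on one example (the input, target, initial weight, bias, and learning rate below are illustrative assumptions, not values from the text):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, target = 1.0, 0.0        # one training example (assumed)
w, b, lr = 0.5, 0.0, 1.0    # initial weight, bias, learning rate (assumed)

for step in range(100):
    # Forward propagation: compute the prediction
    out = sigmoid(w * x + b)
    # Backward propagation: chain rule gives gradients of the squared error
    d_out = 2 * (out - target)        # d(error)/d(out) for squared error
    d_z = d_out * out * (1 - out)     # multiply by the sigmoid derivative
    # Weight and bias updates (gradient descent)
    w -= lr * d_z * x
    b -= lr * d_z

print(round(sigmoid(w * x + b), 3))   # prediction is driven toward the target
```

Each pass repeats exactly the steps described above: predict, measure the error, compute how much each parameter contributed to it, and nudge the parameters to shrink that error.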
BAYESIAN BELIEF NETWORK

It is also called a Bayesian network, belief network, decision network, or Bayesian model.
A Bayesian belief network is a probabilistic graphical model that represents a set of variables (attributes) and their conditional dependencies using a directed acyclic graph (DAG).
It assumes that, within the set of attributes, the probability distribution can have
1. conditional probability relationships, as well as
2. conditional independence assumptions.

The prior knowledge or belief about the influence of one attribute over another is handled through joint probabilities.
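As a small worked sketch of how such a network combines conditional probabilities into a joint distribution, consider a hypothetical two-node DAG Rain → WetGrass (all probabilities below are invented for illustration):

```python
# Hypothetical two-node Bayesian belief network: Rain -> WetGrass
p_rain = 0.2                             # prior P(Rain = true)
p_wet_given = {True: 0.9, False: 0.1}    # CPT: P(WetGrass = true | Rain)

def joint(rain, wet):
    """DAG factorization: P(Rain, WetGrass) = P(Rain) * P(WetGrass | Rain)."""
    pr = p_rain if rain else 1 - p_rain
    pw = p_wet_given[rain] if wet else 1 - p_wet_given[rain]
    return pr * pw

# Inference by marginalization: P(Rain = true | WetGrass = true)
p_wet = joint(True, True) + joint(False, True)
p_rain_given_wet = joint(True, True) / p_wet
print(round(p_rain_given_wet, 3))        # belief in Rain rises after seeing wet grass
```

Observing wet grass raises the belief in rain from the prior 0.2 to 0.18 / 0.26 ≈ 0.692, which is exactly the "prior belief updated through joint probabilities" idea described above.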
Rule-based classification
Rule-based classification is a method of categorizing data or making decisions based on a set of predefined rules or
conditions. In rule-based classification, you establish a set of rules or conditions, and data is classified into different
categories or classes based on how well it matches those rules. This approach is often used in various fields, including
machine learning, expert systems, and decision support systems.

IF-THEN Rule
To define the IF-THEN rule, we can split it into two parts:

•Rule Antecedent: This is the "if condition" part of the rule, written on the LHS (left-hand side). The antecedent can have one or more attribute conditions, combined with the logical AND operator.
•Rule Consequent: This is the RHS (right-hand side) of the rule. The rule consequent consists of the class prediction.

A rule-based classifier makes use of a set of IF-THEN rules for classification. We can express a rule in the following form:
IF condition THEN conclusion
Let us consider a rule R1,
R1: IF age = youth AND student = yes
THEN buy_computer = yes

A rule-based classifier classifies records by using a collection of "if…then…" rules.


(Condition) → Class Label

Here are the key components and characteristics of rule-based classification:


Rules: Rules are defined as "if-then" statements. Each rule consists of one or more conditions and an associated
action or decision to be taken if the conditions are met.

For example:
If Temperature > 30°C, then classify as "Hot."

Condition: Conditions are based on the attributes or features of the data being classified. These conditions can be
simple or complex, depending on the problem.
Action or Classification: The action or classification specifies what should be done when a data point meets the
conditions of a rule. It may involve assigning a category or label to the data point.
Priority: In some rule-based systems, rules may have a priority or order. When multiple rules apply to a data point,
the system may use the rule with the highest priority.
Combining Rules: Rule-based classification systems can contain multiple rules, and these rules may be combined
to make complex decisions. The order of rule evaluation can be important, and there may be strategies for resolving
conflicts.

Assessment of Rule
In rule-based classification in data mining, there are two factors based on which we can access the rules. These are:

•Coverage of Rule: The fraction of the records which satisfy the antecedent conditions of a particular rule is called
the coverage of that rule.
We can calculate this by dividing the number of records satisfying the rule(n1) by the total number of records(n).
Coverage(R) = n1/n

Accuracy of a rule: The fraction of the records that satisfy the antecedent conditions and meet the consequent values
of a rule is called the accuracy of that rule.
We can calculate this by dividing the number of records satisfying the consequent values(n2) by the number of records
satisfying the rule(n1).
Accuracy(R) = n2/n1

Generally, we convert them into percentages by multiplying them by 100. We do so to make it easy for the layman
to understand these terms and their values.

Rule-based classification is commonly used in the following contexts:


Expert Systems: Rule-based expert systems use knowledge encoded in rules to make decisions or provide expert
advice in specific domains, such as medical diagnosis or troubleshooting.
Decision Support Systems: Rule-based decision support systems help users make informed decisions by applying
predefined rules to data.
Quality Control: In manufacturing, rule-based systems can be used to classify products as pass/fail based on various
quality criteria.
Natural Language Processing (NLP): In NLP, rule-based systems use grammatical and linguistic rules to parse and
analyze text.
Business Rules Engines: Rule-based systems are used in business processes and workflows to enforce business rules
and automate decision-making.

One of the advantages of rule-based classification is its transparency and interpretability. Users can easily understand
why a particular decision was made because it's based on explicit rules. However, creating and maintaining a
comprehensive set of rules can be challenging in complex domains. In some cases, machine learning approaches,
such as decision trees, can be used to automatically generate rule-based systems from data.

Here is an example of rule-based classification with a simple dataset and numerical calculations.
Let's consider a dataset of students' exam scores, and we want to classify their performance as "Good," "Average,"
or "Poor" based on their scores. We will define rules to do this.
Dataset:
Student Exam Score
Student A 85
Student B 72
Student C 60
Student D 93
Student E 78
Student F 45
Student G 89

Rule-Based Classification:

Let's create simple rules to classify students based on their exam scores:

If Exam Score >= 90, then classify as "Good."


If 70 <= Exam Score < 90, then classify as "Average."
If Exam Score < 70, then classify as "Poor."

Numerical Calculation:
We'll apply these rules to the dataset and calculate the classification for each student:
Student A (85): 70 <= 85 < 90, so "Average."
Student B (72): 70 <= 72 < 90, so "Average."
Student C (60): 60 < 70, so "Poor."
Student D (93): 93 >= 90, so "Good."
Student E (78): 70 <= 78 < 90, so "Average."
Student F (45): 45 < 70, so "Poor."
Student G (89): 70 <= 89 < 90, so "Average."

Classification Results:
Based on the rules and numerical calculations, we have classified the students as follows:
Student A: "Average"
Student B: "Average"
Student C: "Poor"
Student D: "Good"
Student E: "Average"
Student F: "Poor"
Student G: "Average"

This is a simple example of rule-based classification, where we've applied predefined rules to a dataset to categorize
students based on their exam scores. In practice, you might have more complex rules and larger datasets, but the
concept remains the same: you use rules to make data-driven classifications or decisions.
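The three rules and the walkthrough above can be reproduced with a short script (names and scores taken from the table):

```python
scores = {"Student A": 85, "Student B": 72, "Student C": 60,
          "Student D": 93, "Student E": 78, "Student F": 45,
          "Student G": 89}

def classify(score):
    """Apply the three IF-THEN rules in order."""
    if score >= 90:
        return "Good"
    if 70 <= score < 90:
        return "Average"
    return "Poor"              # Exam Score < 70

results = {name: classify(s) for name, s in scores.items()}
for name, label in results.items():
    print(f"{name} ({scores[name]}): {label}")
```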

In rule-based classification, the coverage of a rule is determined by its antecedent: the conditions specify the range or criteria of input data that activates the rule. For the student dataset above, for instance, the rule "IF Exam Score >= 90 THEN Good" is satisfied by 1 of the 7 records (Student D), so its coverage is 1/7 ≈ 14.3%.

In the context of rule-based classification, accuracy is typically used to measure how well the classification rules correctly classify data points in a dataset. It is defined as the ratio of correctly classified data points to the total number of data points.

In rule-based classification, it is also common for rules to be neither mutually exclusive nor exhaustive. These concepts are explained below using the same example.
Characteristics of Rule-Based Classifier/ Properties of Rule-Based Classifiers
If we convert the result of a decision tree into classification rules, these rules would be mutually exclusive and exhaustive
at the same time.

There are two significant properties of rule-based classification in data mining. They are:
• Rules may not be mutually exclusive
• Rules may not be exhaustive

Rules May Not Be Mutually Exclusive:


In a rule-based classification system, rules may not be mutually exclusive, meaning that more than one rule can apply
to the same data point. This can happen when a data point meets the conditions of multiple rules simultaneously. In
our example:

• Rules overlap when their conditions are not disjoint. For example, if one rule were "IF Exam Score >= 70 THEN Average" and another were "IF Exam Score <= 75 THEN Poor", Student B (72) would satisfy both conditions. This demonstrates that such rules are not mutually exclusive.

In such cases, there should be a mechanism to prioritize or resolve conflicts between rules. For instance, you may
assign a rule priority, and the highest priority rule is applied in case of conflicts. This way, you can handle situations
where multiple rules could potentially classify a data point.
Many different rules are generated for the dataset, so it is possible and likely that many of them satisfy the same data
record. This condition makes the rules not mutually exclusive.

Example 2 for Overlapping (Non-Mutually-Exclusive) Rules:

Suppose we have a dataset of patients with numerical values (age, blood pressure, cholesterol) and want to classify
them as "Healthy" or "High Risk" based on the following rules:

Rule 1: If age > 50 and blood pressure > 140, classify as "High Risk."
Rule 2: If cholesterol > 200, classify as "High Risk."

These rules are not mutually exclusive: a patient who is 55 years old with a blood pressure of 150 and a cholesterol of 210 would satisfy both. If the rules are evaluated in order, that patient is classified as "High Risk" based on Rule 1, and Rule 2 is not considered.

Since the rules are not mutually exclusive, different rules may assign the same record to different classes, yet assigning each record a single class was our main objective. So, to solve this problem, we have two ways:
• The first way is using an ordered set of rules. By ordering the rules, we set priority orders. Thus, this ordered
rule set is called a decision list. So the class with the highest priority rule is taken as the final class.
• The second solution can be assigning votes for each class depending on their weights. So, in this, the set of
rules remains unordered.

Mutually exclusive rules


•Classifier contains mutually exclusive rules if the rules are independent of each other.
•Every record is covered by at most one rule.
Solution to make the rule set mutually exclusive:
•Ordered rule set
•Unordered rule set – use voting schemes
Mutually Exclusive Rules:
•Mutually exclusive rules ensure that each data point is assigned to only one class. There are two common approaches
to achieve this:
a. Ordered Rule Set:
In an ordered rule set, rules are prioritized, and when a data point matches a rule, it is assigned to the class associated
with that rule. Once a match is found, the classification process stops. If you have rules in a specific order, the first
matching rule is applied, and subsequent rules are not considered for classification.
b. Unordered Rule Set with Voting Schemes:
In an unordered rule set, all rules are evaluated for a given data point, and each rule may cast a vote for a specific
class. The class with the most votes is selected as the final classification. Voting schemes, such as majority voting,
can help decide the class based on the rule that receives the highest number of votes.
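Both strategies can be sketched generically; the rule list, record fields, and default class below are illustrative assumptions in the style of the patient example used later in this section:

```python
# Each rule is a (condition, class label) pair
rules = [
    (lambda r: r["age"] > 50 and r["bp"] > 140, "High Risk"),
    (lambda r: r["chol"] > 200, "High Risk"),
]

def classify_ordered(record, default="Healthy"):
    """Ordered rule set (decision list): the first matching rule wins."""
    for condition, label in rules:
        if condition(record):
            return label
    return default                     # default class keeps the set exhaustive

def classify_voting(record, default="Healthy"):
    """Unordered rule set: every matching rule casts one vote."""
    votes = {}
    for condition, label in rules:
        if condition(record):
            votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get) if votes else default

patient = {"age": 55, "bp": 150, "chol": 220}
print(classify_ordered(patient), classify_voting(patient))
```

The default class in both functions also previews the exhaustiveness fix discussed next: any record matching no rule still receives a label.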
Exhaustive rules
•Classifier has exhaustive coverage if it accounts for every possible combination of attribute values.
•Each record is covered by at least one rule.
These rules can be simplified. However, simplified rules may no longer be mutually exclusive since a record may
trigger more than one rule. Simplified rules may no longer be exhaustive either since a record may not trigger any
rules.
Solution to make the rule set exhaustive:
•Use a default class

Rule-based classification is a method of categorizing data into different classes or categories
based on a set of predefined rules. These rules can be mutually exclusive, meaning that they don't overlap with each
other, or they can be exhaustive, covering all possible scenarios. Here's how you can achieve both mutually exclusive
and exhaustive rule sets:
Exhaustive Rules:
Exhaustive rules ensure that every possible scenario is covered by at least one rule. To make the rule set exhaustive,
you can use a default class, which catches cases that don't match any specific rule.
Rules May Not Be Exhaustive:
Rules may not be exhaustive, meaning that not all possible cases are covered by the defined rules. In our example the three score ranges happen to cover every possible score, but a small change makes the gap visible:
If the rules had instead been written as "Score > 90", "70 < Score < 90", and "Score < 70", then a student scoring exactly 70 or exactly 90 would match no rule.
This demonstrates that cases exactly at the rule boundaries can be left uncovered by any of the defined rules. To make the rules exhaustive, you would need to consider and explicitly define rules for these boundary cases.
There is no guarantee that the rules will cover all the data entries; some entries may not be matched by any rule, which makes the rule set non-exhaustive. To solve this problem, we can make use of a default class: all data entries not covered by any rule are assigned to the default class, which resolves the non-exhaustiveness.
Use a Default Class:
Create a rule that captures all cases that do not meet any of the specific rules. This rule assigns the data point to a
default class. This ensures that no data point goes unclassified.
Example for Exhaustive Rules:
Consider a classification task with the same patient dataset as before. To make the rules exhaustive, you could add a
default rule:
•Rule 3 (Default Rule): If none of the conditions in Rule 1 and Rule 2 are met, classify as "Healthy."
•Now, any patient not meeting Rule 1 or Rule 2 conditions will be classified as "Healthy" by the default rule.
In summary, mutually exclusive rules ensure that data points are assigned to only one class, and you can achieve this
with ordered or unordered rule sets. To make the rule set exhaustive, add a default class rule to cover cases that don't
fit any specific rule.
ORDERED AND UNORDERED EXPLAIN WITH SAME DATASET
The example dataset of patients and their classification into "Healthy" or "High Risk" based on age, blood pressure,
and cholesterol values. We'll explore how ordered and unordered rule sets work for the same dataset.
Example Dataset:
Patient 1: Age = 55, Blood Pressure = 150, Cholesterol = 180
Patient 2: Age = 45, Blood Pressure = 160, Cholesterol = 220
Ordered Rule Set:
•In an ordered rule set, rules are prioritized, and the classification process stops when a match is found. Let's consider
the following ordered rules:
Rule 1 (Age > 50 and Blood Pressure > 140): Classify as "High Risk."
Rule 2 (Cholesterol > 200): Classify as "High Risk."
Classification Process with Ordered Rules:
Patient 1: Age > 50 and Blood Pressure > 140 → Matches Rule 1 → Classified as "High Risk."
Patient 2: Age > 50 and Blood Pressure > 140 → Does not match Rule 1.
Cholesterol > 200 → Matches Rule 2 → Classified as "High Risk."
In an ordered rule set, once a match is found, the classification process stops. Patient 1 is classified as "High Risk"
based on Rule 1, and Patient 2 is classified as "High Risk" based on Rule 2.
Unordered Rule Set with Voting Schemes:
In an unordered rule set, all rules are evaluated for each data point, and voting schemes determine the final
classification. Let's consider the following unordered rules:
Rule A (Age > 50 and Blood Pressure > 140): Classify as "High Risk."
Rule B (Cholesterol > 200): Classify as "High Risk."
Classification Process with Unordered Rules and Majority Voting:
Patient 1 (Age = 55, Blood Pressure = 150, Cholesterol = 180): Matches Rule A, which casts a vote for "High Risk."
Does not match Rule B (180 is not > 200).
Patient 2 (Age = 45, Blood Pressure = 160, Cholesterol = 220): Does not match Rule A (age is not > 50).
Matches Rule B, which casts a vote for "High Risk."
In an unordered rule set with majority voting, votes are tallied separately for each patient. "High Risk" receives the
only vote for each patient, so both patients are classified as "High Risk."
In summary, ordered rule sets prioritize rules, and the first matching rule determines the classification. Unordered
rule sets evaluate all rules, and voting schemes decide the final classification based on the votes from each rule. In
the example, ordered rules assigned different classes to the patients, while unordered rules with majority voting
assigned the same class to both patients.

Unordered Rule Set (Voting Scheme):


In an unordered rule set, all rules are evaluated for each data point, and voting schemes determine the final
classification. The idea is that multiple rules may apply to a data point, and each rule can cast a vote for a particular
class. The class with the most votes is selected as the final classification.
Let's consider the following unordered rules:
Rule A: If Age > 50 and Blood Pressure > 140, cast a vote for "High Risk."
Rule B: If Cholesterol > 200, cast a vote for "High Risk."
Classification Process with Unordered Rules (Voting):
Patient 1 (Age = 55, Blood Pressure = 150, Cholesterol = 180): Matches Rule A, which casts a vote for "High Risk."
Does not match Rule B (180 is not > 200).
Patient 2 (Age = 45, Blood Pressure = 160, Cholesterol = 220): Does not match Rule A (age is not > 50).
Matches Rule B, which casts a vote for "High Risk."
In this scenario, both patients have received votes for "High Risk" from the rules.
Now, you apply a voting scheme to determine the final classification. Votes are counted separately for each patient:
Patient 1 received 1 vote for "High Risk" and none for "Healthy."
Patient 2 received 1 vote for "High Risk" and none for "Healthy."
Since "High Risk" has the majority of votes for each patient, the final classification for both patients is "High Risk."
Unordered rule sets evaluate all relevant rules for each data point and use a voting scheme to decide the class with
the most votes. In this example, "High Risk" received all the votes for both patients, resulting in the "High Risk"
classification for both.

Example 1:
Consider a training set that contains 60 positive examples and 100 negative examples. Rule r1 covers 50 positive
examples and 5 negative examples. Rule r2 covers 2 positive examples and no negative examples.
Ans:
Accuracy of r1=50/55=90.9%
Accuracy of r2=2/2=100%
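These values can be checked with a few lines of arithmetic; the script also computes each rule's coverage over the 160 training examples, which the exercise does not ask for but which follows from the same counts:

```python
total = 60 + 100                 # 60 positive + 100 negative training examples

# Rule r1 covers 50 positives and 5 negatives; r2 covers 2 positives, 0 negatives
covered_r1, correct_r1 = 50 + 5, 50
covered_r2, correct_r2 = 2, 2

acc_r1 = correct_r1 / covered_r1       # accuracy = n2 / n1
acc_r2 = correct_r2 / covered_r2
cov_r1 = covered_r1 / total            # coverage = n1 / n
cov_r2 = covered_r2 / total

print(f"r1: accuracy {acc_r1:.1%}, coverage {cov_r1:.1%}")
print(f"r2: accuracy {acc_r2:.1%}, coverage {cov_r2:.1%}")
```

Note that r2 has the higher accuracy (100%) but covers far fewer records than r1, illustrating why both measures are needed to assess a rule.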

Advantages of Rule-Based Classifiers

•As highly expressive as decision trees
•Easy to interpret
•Easy to generate
•Can classify new instances rapidly
•Performance comparable to decision trees

Support Vector Machine

SVM is a powerful supervised algorithm that works best on smaller but complex datasets. Support Vector
Machine, abbreviated as SVM, can be used for both regression and classification tasks, but generally it works
best in classification problems.
What is a Support Vector Machine?

Support Vector Machines (SVM) is another classification algorithm that classifies data into one category or the other
by using hyperplanes.

SVMs are particularly good at solving binary classification problems, which require classifying the elements of a
data set into two groups.
The aim of a support vector machine algorithm is to find the best possible line, or decision boundary, that separates
the data points of different data classes. This boundary is called a hyperplane when working in high-dimensional
feature spaces. The idea is to maximize the margin, which is the distance between the hyperplane and the closest data
points of each category, thus making it easy to distinguish data classes.
It is a supervised machine learning method in which we try to find the hyperplane that best separates the two classes.
Note the difference between SVM and logistic regression: both algorithms try to find the best hyperplane, but
logistic regression takes a probabilistic approach, whereas the support vector machine is based on statistical
(margin-maximizing) approaches.

SVM helps by finding the maximum-margin hyperplane, that is, the hyperplane with the maximum distance to the
nearest points of the two classes.

Types of Support Vector Machine Algorithms


Support vector machines have different types and variants that provide specific functionalities and address specific
problem scenarios. Here are two types of SVMs and their significance:

Linear SVM. Linear SVMs use a linear kernel to create a straight-line decision boundary that separates different
classes. They are effective when the data is linearly separable or when a linear approximation is sufficient. Linear
SVMs are computationally efficient and have good interpretability, as the decision boundary is a hyperplane in the
input feature space.

Nonlinear SVM. Nonlinear SVMs address scenarios where the data cannot be separated by a straight line in the input
feature space. They achieve this by using kernel functions that implicitly map the data into a higher-dimensional
feature space, where a linear decision boundary can be found. Popular kernel functions used in this type of SVM
include the polynomial kernel, Gaussian (RBF) kernel and sigmoid kernel. Nonlinear SVMs can capture complex
patterns and achieve higher classification accuracy when compared to linear SVMs.
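A small sketch of the difference, using XOR-style toy data that no straight line can separate (the data is invented for illustration; scikit-learn's SVC, also used in the demos later in this unit, supplies the kernels):

```python
import numpy as np
from sklearn.svm import SVC

# XOR-style data: class 1 iff exactly one coordinate is 1 -- not linearly separable
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]] * 10, dtype=float)
y = np.array([0, 1, 1, 0] * 10)

linear_acc = SVC(kernel="linear").fit(X, y).score(X, y)  # straight-line boundary
rbf_acc = SVC(kernel="rbf").fit(X, y).score(X, y)        # Gaussian (RBF) kernel

print("linear kernel accuracy:", linear_acc)
print("rbf kernel accuracy:", rbf_acc)
```

No straight line can classify more than three of the four XOR point patterns, while the RBF kernel implicitly maps the data into a space where the classes become linearly separable.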

How do support vector machines work?


The key idea behind SVMs is to transform the input data into a higher-dimensional feature space. This transformation
makes it easier to find a linear separation or to more effectively classify the data set.

Terms used in SVM:

Support Vectors: These are the points that are closest to the hyperplane. A separating line will be defined with
the help of these data points.

Margin: It is the distance between the hyperplane and the observations closest to the hyperplane (the support vectors).
In SVM, a large margin is considered a good margin. There are two types of margins: hard margin and soft margin.
What are the steps of SVM algorithm?
A. The steps of the SVM algorithm involve: (1) selecting the appropriate kernel function, (2) defining the parameters
and constraints, (3) solving the optimization problem to find the optimal hyperplane, and (4) making predictions
based on the learned model.

What does SVM do in machine learning?


A. In machine learning, SVM is used to classify data by finding the optimal decision boundary that maximally
separates different classes. It aims to find the best hyperplane that maximizes the margin between support vectors,
enabling effective classification even in complex, non-linear scenarios.

Example: If you want to classify an iris flower as setosa or non-setosa based on petal width and petal length, we could
use a separating line to do the same as shown below.

Note: the separating line here has been drawn manually for the purpose of illustration.

Consider the following plot which illustrates a few lines that can separate data into setosa or non-setosa.
Note: The lines and the additional grey points in the above plot have been manually inserted for the purpose of
illustration.

From the above plot, you can observe that all three lines – Line 1, 2 and 3 can separate the given sample data points
into the two classes (setosa and non-setosa). Points on the left of these lines represent setosa flowers and the ones on
the right represent the non-setosa flowers.

Let us now consider three new data points (A, B and C) where the class labels are unknown. We can observe that while
Line 3 classifies A as setosa, Line 1 and 2 classify it as non-setosa.

Thus, you can conclude that:

1. There can be many lines that can separate the known data correctly.
2. These lines may agree/disagree with each other when new data points are presented.
3. Point A lies very close to Line 2 and 3. It is possible that the lines would have misclassified point A.

Based on the above observations, the goal of SVM can be redefined to finding an optimal hyperplane that separates
the data into classes.

Choosing the optimal hyperplane

In the plot shown below, you can choose 3 parallel lines that divide the data.
Note: The lines and the additional grey points in the above plot have been manually inserted for the purpose
of illustration.

You can observe that all 3 lines drawn here can separate the given data. Line 3 is closest to non-setosa and Line 1 is
closest to setosa. These lines (Line 1 and 3) are called as margin lines as they represent the boundary of a category.
Consequently, the line that is midway (Line 2) between the margin lines would be the optimal line.

When the number of predictors is 3, you get an optimal plane. In general, for n features (represented in n-dimensional
space), you can find the optimal hyperplane.

SVM essentially creates a model such that it finds the widest possible margins and thus the optimal hyperplane.

When data is not separable by a line or a plane, SVM maps the data into a higher dimensional space, where it can be
separated using a linear hyperplane. The mapping function used to transform the data into the higher dimensional
space is called 'kernel' function.

Building an SVM model

You can look at a demo on building an SVM model.

Reading data

Consider the Iris dataset which provides measurements of sepal length, sepal width, petal length, and petal width for
50 flowers from each of 3 species (setosa, versicolor, and virginica) of Iris. Click here to download the iris dataset.

First, you will see how to build a binary classifier using SVM, that will help us identify whether a given instance of
flower is of the species 'versicolor'.

#reading input from csv file
import pandas as pd

iris_data = pd.read_csv("datasets/iris.csv")
iris_data.head()

Feature Engineering

Create a new column in the dataframe (v_nv), that distinguishes the species - 'versicolor'(marked by 0) from rest.
Then, build binary classifier based on this new column.

#creating new column 'v_nv', to distinguish versicolor species from rest


#the below lambda function returns 0 for 'versicolor' species and returns 1 for rest.
v_nv_fn = lambda x: 0 if x=="versicolor" else 1
# new column added into dataframe
iris_data["v_nv"] = iris_data["Species"].apply(v_nv_fn)
iris_data[iris_data['v_nv']==0].head()

#visualization using seaborn - pairplot


import seaborn as sns
sns.pairplot(iris_data, x_vars = "Petal.Length",y_vars="Petal.Width",
hue="v_nv",height=5)
In the above plot, you can try to find a separating hyperplane to classify data as versicolor and non-versicolor. You can
observe that the data is not linearly separable, i.e., it cannot be separated using a straight line.

Model Creation

Now let us build a model based on Support Vector Classification, to predict if a data instance is of the species
'versicolor'.

#Support Vector Classification

from sklearn.svm import SVC


#setting predictors and target
X = iris_data[["Petal.Length","Petal.Width"]]
Y = iris_data["v_nv"]
# model building
model = SVC()
model.fit(X,Y)
model.score(X,Y)
# 0.9533333333333334

Model Visualization

A separating hyperplane that divides the iris dataset into versicolor and non-versicolor categories, looks as shown
below.

Note:'mlxtend' is another machine learning library and provides various useful tools for data science applications.

Here, plot_decision_regions() function of mlxtend library is used for plotting the decision regions.

from mlxtend.plotting import plot_decision_regions
import numpy as np
import matplotlib.pyplot as plt

features = np.array(X)
target = np.array(Y).ravel()
plot_decision_regions(features, target, clf=model)
plt.xlabel("Petal length")
plt.ylabel("Petal width")
plt.title('Decision boundary of SVM on iris data')

Multi-class classification

Further, even though SVM is considered as a binary classifier, it can be used for multi-class classification as well.
This is achieved in one of the 2 ways:

1. One-vs-One classification: It builds a binary classification model for each pair of classes. Thus, there will
be n * (n-1) / 2 models, where n is the number of classes. So, if there are 3 classes (as in our case), then
(3*2)/2=3 models are used.
2. One-vs-All classification: It compares every class with the remaining classes thereby building a model for
every class. The class with the highest probability is chosen. So, if there are 3 classes (in our case) then 3
models are used.
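Both schemes are available as wrappers in scikit-learn; here is a minimal sketch on the iris data (using sklearn's bundled copy of the dataset rather than the CSV file used above):

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)           # 3 classes

ovo = OneVsOneClassifier(SVC()).fit(X, y)   # n*(n-1)/2 = 3 pairwise models
ovr = OneVsRestClassifier(SVC()).fit(X, y)  # n = 3 one-vs-all models

print(len(ovo.estimators_), len(ovr.estimators_))   # 3 and 3 for n = 3 classes
```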

The below demo shows the classification of Iris dataset using SVM for multiple classes – setosa, versicolor and
virginica.

Feature Engineering

You can encode the species column with numerical values. And replace label 'setosa' with '0', 'versicolor' with '1' and
'virginica' with '2'. This encoding technique, of converting distinct values of column into unique numbers is called
'label encoding'.

# encoding the species column


iris_data.loc[iris_data.Species=="setosa","Species"] = 0
iris_data.loc[iris_data.Species=="versicolor","Species"] = 1
iris_data.loc[iris_data.Species=="virginica","Species"] = 2
#data type (dtype) of the column will be converted to 'category'
iris_data.Species = iris_data.Species.astype("category")
#https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html
iris_data.head()

Multiclass Model Building using SVM

X = iris_data[["Petal.Length","Petal.Width"]]
Y = iris_data["Species"]
model = SVC()
model.fit(X,Y)
[Output: the rows of the iris dataset, with columns Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species (50 rows each of setosa, versicolor, and virginica).]
To map data into a higher-dimensional space efficiently, SVMs use a kernel function. Instead of explicitly calculating
the coordinates of the transformed space, the kernel function lets the SVM implicitly compute the dot products between
the transformed feature vectors, avoiding the expensive computation of the transformation itself.

SVMs can handle both linearly separable and non-linearly separable data. They do this by using different types of
kernel functions, such as the linear kernel, polynomial kernel or radial basis function (RBF) kernel. These kernels
enable SVMs to effectively capture complex relationships and patterns in the data.

During the training phase, SVMs use a mathematical formulation to find the optimal hyperplane in a higher-
dimensional space, often called the kernel space. This hyperplane is crucial because it maximizes the margin between
data points of different classes, while minimizing the classification errors.

The kernel function plays a critical role in SVMs, as it makes it possible to map the data from the original feature
space to the kernel space. The choice of kernel function can have a significant impact on the performance of the SVM
algorithm; choosing the best kernel function for a particular problem depends on the characteristics of the data.

Some of the most popular kernel functions for SVMs are the following:

Linear kernel. This is the simplest kernel function, and it maps the data to a higher-dimensional space, where the
data is linearly separable.

Polynomial kernel. This kernel function is more powerful than the linear kernel, and it can be used to map the data
to a higher-dimensional space, where the data is non-linearly separable.

RBF kernel. This is the most popular kernel function for SVMs, and it is effective for a wide range of classification
problems.

Sigmoid kernel. This kernel function is similar to the RBF kernel, but it has a different shape that can be useful for
some classification problems.
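As a sketch of how these kernels are selected in practice, the snippet below (using scikit-learn's SVC, with the full iris data as a stand-in for the notebook's iris_data) fits one model per kernel and compares training accuracy:

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

# Load the iris measurements (4 features, 3 classes)
X, y = load_iris(return_X_y=True)

# Fit one SVC per kernel and report accuracy on the training set
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    model = SVC(kernel=kernel)
    model.fit(X, y)
    print(kernel, round(model.score(X, y), 3))
```

Training accuracy alone does not decide the best kernel; in practice each candidate would be compared with cross-validation.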

The choice of kernel function for an SVM algorithm is a tradeoff between accuracy and complexity. The more
powerful kernel functions, such as the RBF kernel, can achieve higher accuracy than the simpler kernel functions,
but they also require more data and computation time to train the SVM algorithm. But this is becoming less of an
issue due to technological advances.

Once trained, SVMs can classify new, unseen data points by determining which side of the decision boundary they
fall on. The output of the SVM is the class label associated with the side of the decision boundary.
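For example, continuing the iris sketch above (assuming scikit-learn's integer class labels 0 = setosa, 1 = versicolor, 2 = virginica), unseen petal measurements are classified with predict:

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

# Re-fit on the two petal features, as in the section above
X, y = load_iris(return_X_y=True)
model = SVC().fit(X[:, 2:4], y)

# Classify unseen measurements by which side of the boundary they fall on
print(model.predict([[1.4, 0.2], [5.5, 2.0]]))  # [0 2] -> setosa, virginica
```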

Advantages of SVMs

SVMs are powerful machine learning algorithms that have the following advantages:

Effective in high-dimensional spaces. High-dimensional data refers to data in which the number of features is larger
than the number of observations, i.e., data points. SVMs perform well even when the number of features is larger
than the number of samples. They can handle high-dimensional data efficiently, making them suitable for applications
with a large number of features.

Resistant to overfitting. SVMs are less prone to overfitting compared to other algorithms, like decision trees --
overfitting is where a model performs extremely well on the training data but becomes too specific to that data and
can't generalize to new data. SVMs' use of the margin maximization principle helps in generalizing well to unseen
data.

Versatile. SVMs can be applied to both classification and regression problems. They support different kernel
functions, enabling flexibility in capturing complex relationships in the data. This versatility makes SVMs applicable
to a wide range of tasks.

Effective in cases of limited data. SVMs can work well even when the training data set is small. The use of support
vectors ensures that only a subset of data points influences the decision boundary, which can be beneficial when data
is limited.

Ability to handle nonlinear data. SVMs can implicitly handle non-linearly separable data by using kernel functions.
The kernel trick enables SVMs to transform the input space into a higher-dimensional feature space, making it
possible to find linear decision boundaries.

Disadvantages of SVMs

While support vector machines are popular for the reasons listed above, they also come with some limitations and
potential issues:

Computationally intensive. SVMs can be computationally expensive, especially when dealing with large data sets.
The training time and memory requirements increase significantly with the number of training samples.

Sensitive to parameter tuning. SVMs have parameters such as the regularization parameter and the choice of kernel
function. The performance of SVMs can be sensitive to these parameter settings. Improper tuning can lead to
suboptimal results or longer training times.

Lack of probabilistic outputs. SVMs provide binary classification outputs and do not directly estimate class
probabilities. Additional techniques, such as Platt scaling or cross-validation, are needed to obtain probability
estimates.
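As an illustration, scikit-learn's SVC exposes Platt scaling through its probability=True option (a sketch on the iris data; the option fits the scaling via an internal cross-validation, so training is slower):

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# probability=True enables Platt-scaled class probability estimates
model = SVC(probability=True, random_state=0).fit(X, y)

probs = model.predict_proba(X[:1])
print(probs)  # one probability per class, summing to 1
```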

Difficulty in interpreting complex models. SVMs can create complex decision boundaries, especially when using
nonlinear kernels. This complexity may make it challenging to interpret the model and understand the underlying
patterns in the data.

Scalability issues. SVMs may face scalability issues when applied to extremely large data sets. Training an SVM on
millions of samples can become impractical due to memory and computational constraints.

Regression

Regression is a supervised machine learning technique for predicting continuous numerical values, such as temperature
or price.

Regression Model can be a linear or a non-linear function. Let us understand the process of Linear Regression.

Regression Types

Linear regression is a statistical method that fits a straight line describing the mathematical relationship between
two variables. As an algorithm, it models a linear relationship between an independent variable and a dependent
variable in order to predict the outcome of future events, and it is widely used in data science and machine learning
for predictive analysis.

Regression Techniques:

To predict the value of time taken to repair a computer based on the number of units being replaced, a regression
model can be built using Regression analysis.

Regression analysis is a statistical process for estimating the relationships between variables. It can be used to build
a model to predict the value of the target variable from the predictor variables.
Mathematically, a regression model is represented as y= f(X), where y is the target or dependent variable and X is
the set of predictors or independent variables (x1, x2, …, xn).

Types of Linear Regression:

If a linear regression model involves only one predictor variable, it is called a Simple Linear Regression model.

f(X) = ß0 + ß1x1 + ε

If a linear regression model involves multiple predictor variables, it is called a Multiple Linear Regression
model.

f(X) = ß0 + ß1x1 + ß2x2 + ... + ßnxn + ε

The ß values are known as weights (ß0 is also called the intercept, and the subsequent ß1, ß2, etc. are called
coefficients). The error term ε is assumed to be normally distributed with a constant variance.

Simple Linear Regression

In the Regression example, we tried predicting the time to repair a computer from only a single variable, i.e. the
number of faulty units. Now, let us try using the Simple Linear Regression model.

Thus, the model can be framed as:

Time taken to repair a computer = ß0 + (ß1 * Units being replaced) + ε

We have to find out the best values of (ß0, ß1) that can represent the true nature of the relationship between the
number of faulty units in a computer and the time taken to repair the computer.
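As a sketch with made-up repair records (the numbers below are illustrative, not from the text), scikit-learn's LinearRegression can estimate ß0 and ß1:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical repair records: units replaced vs. minutes taken (assumed data)
units = np.array([[1], [2], [3], [4], [5], [6]])
minutes = np.array([23, 29, 49, 64, 74, 87])

model = LinearRegression().fit(units, minutes)
print("b0 (intercept):", round(model.intercept_, 2))  # b0 ≈ 7.33
print("b1 (slope):", round(model.coef_[0], 2))        # b1 ≈ 13.43
```

With these estimates, a repair replacing 4 units is predicted to take about ß0 + 4·ß1 minutes.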

In the Simple Linear Regression model, the dependent variable y is related to a single predictor variable x. The model
can be described as:

y = ß0 + ß1x + ε

However, there may be more than one predictor variables available. For example:

The volume of a tree trunk might depend on its height as well as girth.

The price of a house might depend on the number of bedrooms, the built-up area of the plot, the age of the house etc.

The height of a child might depend on age, weight, heights of the parents etc.

In order to predict the dependent variable based on multiple predictors, the Multiple Linear Regression model is used.
It is described as:

y = ß0 + ß1x1 + ß2x2 + ... + ßnxn + ε

where ßj (j = 1 to n) are the regression coefficients.
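As an illustration of the tree-trunk example, the sketch below fits a multiple linear regression on a few rows of the classic trees measurements (girth and height predicting timber volume):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Measurements of felled trees: [girth (in), height (ft)] -> volume (cu ft)
X = np.array([[8.3, 70], [8.6, 65], [8.8, 63], [10.5, 72],
              [10.7, 81], [10.8, 83], [11.0, 66], [11.1, 80]])
y = np.array([10.3, 10.3, 10.2, 16.4, 18.8, 19.7, 15.6, 18.2])

model = LinearRegression().fit(X, y)
print("coefficients (b1, b2):", model.coef_)  # girth and height effects
print("intercept (b0):", model.intercept_)
```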

Evaluating the model performance using RMSE

Root Mean Squared Error (RMSE) measure can be used to give an estimate of the average error that can be expected
from the model.
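As a sketch, RMSE is the square root of the mean squared difference between actual and predicted values (the numbers below are illustrative):

```python
import numpy as np

# RMSE = sqrt(mean((actual - predicted)^2))
actual = np.array([23, 29, 49, 64, 74, 87])
predicted = np.array([21, 34, 48, 61, 75, 88])

rmse = np.sqrt(np.mean((actual - predicted) ** 2))
print(round(rmse, 2))  # 2.61
```

A lower RMSE means the model's predictions are, on average, closer to the observed values, in the same units as the target variable.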

Prediction accuracy of the regression model

During the evaluation of a model on train and test data, following are the situations that may be faced:

Model performance on train and test data is poor (High RMSE)


Such models are typically referred to as underfit models because the model cannot explain the variation in the data
reasonably. In such situations, quality and veracity of the data should be relooked and the features selected to build
the model should be analysed. This might require more time and effort in engineering the features.

Model performance on train data is good (Low RMSE, High R-squared) but on test data is poor (High RMSE)

Such models are typically referred to as overfit models, i.e. they have been fit perfectly for the train data but are not
generalized enough. In such situations, we may need to:

Gather more data instances as it is difficult to overfit larger size of data.

Reduce the complexity of the model - Evaluate which features are important and use only those features to build the
model.
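The underfit/overfit contrast can be sketched on synthetic data (assumed, not from the text): a degree-15 polynomial fits the training points at least as well as a straight line, but typically does much worse on held-out data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30).reshape(-1, 1)
y = 3 * x.ravel() + rng.normal(0, 0.3, size=30)  # linear signal plus noise

x_train, y_train = x[:20], y[:20]   # fit on the first 20 points
x_test, y_test = x[20:], y[20:]     # hold out the rest for evaluation

def rmse(actual, predicted):
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

for degree in (1, 15):
    poly = PolynomialFeatures(degree)
    model = LinearRegression().fit(poly.fit_transform(x_train), y_train)
    print(degree,
          "train:", round(rmse(y_train, model.predict(poly.transform(x_train))), 3),
          "test:", round(rmse(y_test, model.predict(poly.transform(x_test))), 3))
```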

Nonlinear regression is a form of regression analysis in which data is fit to a model and then expressed as a
mathematical function. Simple linear regression relates two variables (X and Y) with a straight line (y = mx + b),
while nonlinear regression relates the two variables in a nonlinear (curved) relationship.
The goal of the model is to make the sum of the squares as small as possible. The sum of squares is a measure that
tracks how far the Y observations vary from the nonlinear (curved) function that is used to predict Y.

Non-linear regression involves models where the relationship between the dependent and independent variables is
not linear. It encompasses a broader range of functional forms. For instance, it could be a quadratic, exponential,
logarithmic, or any other non-linear function. A general non-linear regression equation might be:

y = f(X, β) + ε

In non-linear regression, the goal is to find the best parameters β that minimize the difference between the observed
and predicted values based on the chosen non-linear function.

It is computed by first finding the difference between the fitted nonlinear function and every Y point of data in the
set. Then, each of those differences is squared. Lastly, all of the squared figures are added together. The smaller the
sum of these squared figures, the better the function fits the data points in the set. Nonlinear regression uses
logarithmic functions, trigonometric functions, exponential functions, power functions, Lorenz curves, Gaussian
functions, and other fitting methods.
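A minimal sketch of this sum-of-squares computation, using scipy.optimize.curve_fit to fit an exponential function (the data below is illustrative, roughly following y = e^x):

```python
import numpy as np
from scipy.optimize import curve_fit

# Nonlinear model to fit: y = a * exp(b * x)
def model(x, a, b):
    return a * np.exp(b * x)

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.6, 7.5, 19.8, 54.3])

params, _ = curve_fit(model, x, y, p0=(1.0, 1.0))

# Difference between the fitted function and each data point, squared and summed:
residuals = y - model(x, *params)
sum_sq = np.sum(residuals ** 2)  # the quantity nonlinear regression minimizes
print(params, round(sum_sq, 3))
```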

Both linear and nonlinear regression predict Y responses from an X variable (or variables). Nonlinear regression is a
curved function of an X variable (or variables) used to predict a Y variable; for example, it can model a prediction of
population growth over time. Nonlinear regression modeling is similar to linear regression modeling in that both seek
to track a particular response from a set of variables graphically. Nonlinear models are more complicated than linear
models to develop because the function is created through a series of approximations (iterations) that may stem from
trial and error. Mathematicians use several established methods, such as the Gauss-Newton method and the
Levenberg-Marquardt method.
Often, regression models that appear nonlinear upon first glance are actually linear. The curve estimation procedure
can be used to identify the nature of the functional relationships at play in your data, so you can choose the correct
regression model, whether linear or nonlinear. Linear regression models, while they typically form a straight line,
can also form curves, depending on the form of the linear regression equation. Likewise, it’s possible to use algebra
to transform a nonlinear equation so that it mimics a linear equation—such a nonlinear equation is referred to as
“intrinsically linear.”

One example of how nonlinear regression can be used is to predict population growth over time. A scatterplot of
changing population data over time shows that there seems to be a relationship between time and population growth,
but that it is a nonlinear relationship, requiring the use of a nonlinear regression model. A logistic population growth
model can provide estimates of the population for periods that were not measured, and predictions of future
population growth.
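A sketch of such a logistic fit with scipy.optimize.curve_fit, on illustrative population counts (assumed data, not real measurements):

```python
import numpy as np
from scipy.optimize import curve_fit

# Logistic growth: P(t) = K / (1 + exp(-r * (t - t0)))
def logistic(t, K, r, t0):
    return K / (1 + np.exp(-r * (t - t0)))

# Hypothetical population counts at times 0..9
t = np.arange(10, dtype=float)
P = np.array([8, 14, 23, 35, 50, 65, 77, 86, 92, 95], dtype=float)

params, _ = curve_fit(logistic, t, P, p0=(90.0, 1.0, 5.0))
K, r, t0 = params
print("carrying capacity K ~", round(K, 1))
print("predicted population at t=12:", round(logistic(12, *params), 1))
```

The fitted curve can estimate the population at unmeasured periods and predict future growth, as described above.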

Independent and dependent variables used in nonlinear regression should be quantitative. Categorical variables, like
region of residence or religion, should be coded as binary variables or other types of quantitative variables.

In order to obtain accurate results from the nonlinear regression model, you should make sure the function you specify
describes the relationship between the independent and dependent variables accurately. Good starting values are also
necessary. Poor starting values may result in a model that fails to converge, or a solution that is only optimal locally,
rather than globally, even if you’ve specified the right functional form for the model.
