AIML MODULE-04
A decision tree has a structure that consists of a root node, internal nodes/decision nodes, branches
and terminal nodes/leaf nodes.
The topmost node in the tree is the root node. Internal nodes are the test nodes, also called decision nodes. These nodes represent a choice or test on an input attribute, and the outcomes of the test condition are the branches emanating from the decision node.
The branches are labelled as per the outcomes or output values of the test condition
Each branch represents a sub-tree or subsection of the entire tree.
Every decision node is part of a path to a leaf node. The leaf nodes represent the labels or the outcome of a decision path.
The decision tree model, in general, represents a collection of logical rules of classification in the
form of a tree structure.
Decision networks, also called influence diagrams, have a directed graph structure with nodes and links. They are an extension of Bayesian belief networks that represents information about each node's current state, its possible actions, the possible outcomes of those actions, and their utility.
The figure below shows the symbols used to represent the different nodes in the construction of a decision tree. A circle is used to represent the root node, a diamond is used to represent a decision node (internal node), and all leaf nodes are represented with rectangles.
A decision tree consists of two major procedures discussed below.
1. Building the Tree
Goal: Construct a decision tree from the given training dataset. The tree is constructed in a top-down fashion, starting from the root node. At every level of tree construction, one needs to find the best split attribute (best decision node) among all attributes. This process is recursive and is continued until we reach the last level of the tree, i.e., a leaf node which cannot be split further.
2. Knowledge Inference or Classification
Goal: Given a test instance, infer the target class it belongs to.
Classification: Inferring the target class for the test instance or object is based on inductive inference on the constructed decision tree. To classify an object, we start traversing the tree from the root. At every decision node we evaluate the test condition on the test object's attribute value and walk to the branch corresponding to the test's outcome. This process is repeated until we end up in a leaf node, which contains the target class of the test object.
Advantages
1. Simple to understand.
2. The input and output attributes can be discrete or continuous predictor variables.
3. Can model a high degree of nonlinearity in the relationship between the target variables and the predictor variables.
4. Quick to train.
Disadvantages
1. It is difficult to determine how deeply a decision tree can be grown or when to stop growing it.
2. If training data has errors or missing attribute values, then the decision tree constructed may
become unstable or biased.
3. If the training data has continuous-valued attributes, handling them is computationally complex, and they have to be discretized.
4. A complex decision tree may also overfit the training data.
5. Decision tree learning is not well suited for classifying multiple output classes.
Fundamentals of Entropy
Given the training dataset with a set of attributes or features, the decision tree is constructed by
finding the attribute or feature that best describes the target class for the given test instances.
The best split feature is the one which contains more information about how to split the dataset
among all features so that the target class is accurately identified for the test instances.
In other words, the best split attribute is the most informative one for splitting the dataset into subsets, and this process is continued until the stopping criterion is reached. The split should be as pure as possible at every stage of selecting the best feature.
The best feature is selected based on the amount of information among the features which are
basically calculated on probabilities.
Quantifying information is closely related to information theory. In the field of information theory,
the features are quantified by a measure called Shannon Entropy which is calculated based on the
probability distribution of the events.
Entropy is the amount of uncertainty or randomness in the outcome of a random variable or an event. Moreover, entropy describes the homogeneity of the data instances.
The best feature is selected based on the entropy value.
Higher the entropy → Higher the uncertainty
Lower the entropy → Lower the uncertainty.
Similarly, if all instances belong to the same class (only positive or only negative), then the entropy is 0.
On the other hand, if the instances are equally distributed, which means 50% positive and 50%
negative, then the entropy is 1.
If there are 10 data instances, out of which some belong to the positive class and some to the negative class, then the entropy is calculated as Entropy = −(P_positive log2 P_positive + P_negative log2 P_negative), where P_positive and P_negative are the fractions of positive and negative instances.
If the entropy is 0, then the split is pure which means that all samples in the set will partition into one
class or category. But if the entropy is 1, the split is impure and the distribution of the samples is
more random.
The stopping criterion is based on the entropy value
Let P be the probability distribution of the data instances from 1 to n, as shown below.
So, P = {P1, ..., Pn}
The entropy of P is the information measure of this probability distribution, given as
Entropy_Info(P) = Entropy_Info(P1, ..., Pn) = −(P1 log2 P1 + P2 log2 P2 + ... + Pn log2 Pn)
where P1 is the probability of data instances classified as class 1, P2 is the probability of data instances classified as class 2, and so on, with
P1 = |Number of data instances belonging to class 1| / |Total number of data instances in the training dataset|
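The entropy calculation can be sketched in Python as below (an illustrative addition; the function name entropy_info and the example counts are ours):

import math

def entropy_info(counts):
    """Shannon entropy of a class distribution given as raw counts."""
    total = sum(counts)
    entropy = 0.0
    for c in counts:
        if c == 0:          # 0 * log2(0) is taken as 0
            continue
        p = c / total
        entropy -= p * math.log2(p)
    return entropy

# A pure split (all instances in one class) has entropy 0,
# an even 50/50 split has entropy 1.
print(entropy_info([10, 0]))   # 0.0
print(entropy_info([5, 5]))    # 1.0
print(entropy_info([7, 3]))    # ≈ 0.88 (the Job Offer target used later)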
1. Find the best attribute from the training dataset using an attribute selection measure and place it
at the root of the tree.
2. Split the training dataset into subsets based on the outcomes of the test attribute and each subset
in a branch contains the data instance or tuples with the same value for the selected test attribute.
3. Repeat step 1 and step 2 on each subset until we end up in leaf nodes in all the branches of the tree.
4. This splitting process is recursive until the stopping criterion is reached.
Stopping Criteria
CART, which stands for Classification and Regression Trees, is another decision tree algorithm, developed by Breiman et al. in 1984.
The accuracy of the tree constructed depends upon the selection of the best split attribute.
Different algorithms are used for building decision trees which use different measures to decide on
the splitting criterion.
Algorithms such as ID3, C4.5 and CART are popular algorithms used in the construction of decision
trees.
The algorithm ID3 uses 'Information Gain' as the splitting criterion whereas the algorithm C4.5
uses 'Gain Ratio' as the splitting criterion. The CART algorithm is popularly used for classifying
both categorical and continuous-valued target variables. CART uses GINI Index to construct a
decision tree.
Decision trees constructed using ID3 and C4.5 are also called univariate decision trees, which consider only one feature/attribute to split at each decision node, whereas decision trees constructed using the CART algorithm are multivariate decision trees, which consider a conjunction of univariate splits.
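As a practical aside (an addition, not from the source text): scikit-learn's DecisionTreeClassifier builds CART-style binary trees, but it lets you choose the impurity measure, so entropy-based and Gini-based splitting can be compared on a toy dataset:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# Entropy-based splits (information gain, as in ID3/C4.5)
tree_entropy = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
# Gini-based splits (as in CART)
tree_gini = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)

print(export_text(tree_entropy, feature_names=load_iris().feature_names))
print(export_text(tree_gini, feature_names=load_iris().feature_names))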
ID3 TREE CONSTRUCTION
1. Compute Entropy_Info for the whole training dataset based on the target attribute.
2. Compute Entropy_Info and Information_Gain for each of the attributes in the training dataset.
3. Choose the attribute for which entropy is minimum and therefore the gain is maximum as the best split attribute.
4. The best split attribute is placed as the root node.
5. The root node is branched into subtrees with each subtree as an outcome of the test condition of
the root node attribute. Accordingly, the training dataset is also split into subsets.
6. Recursively apply the same operation for the subset of the training set with the remaining
attributes until a leaf node is derived or no more training instances are available in the subset.
Note: We stop branching a node if entropy is 0. The best split attribute at every iteration is the
attribute with the highest information gain.
Problems
Problem 1: Construct a decision tree using the ID3 algorithm for the Job Offer training dataset of 10 instances (attributes: CGPA, Interactiveness, Practical Knowledge, Communication Skills; target class: Job Offer, with 7 Yes and 3 No instances).
Step 1: Calculate the entropy for the target class Job Offer.
Entropy_Info(Job Offer) = Entropy_Info(7, 3) = −((7/10) log2(7/10) + (3/10) log2(3/10)) = 0.8807
Iteration 1:
Step 2: Calculate the Entropy_Info and Information_Gain for each of the attributes in the training dataset.
Entropy of CGPA
Entropy_Info(T, CGPA) = (4/10) × Entropy(3, 1) + (4/10) × Entropy(4, 0) + (2/10) × Entropy(0, 2)
= 4/10 × 0.8 + 0 + 0 = 0.3243
Gain(CGPA) = 0.8807 − 0.3243 = 0.5564
Entropy of Interactiveness
Entropy_Info(T, Interactiveness) = (6/10) × Entropy(5, 1) + (4/10) × Entropy(2, 2) = 0.7896
Gain(Interactiveness) = 0.8807 − 0.7896 = 0.0911
Entropy of Practical Knowledge
Entropy_Info(T, Practical Knowledge) = 2/10 × 0 + 3/10 × 0.9177 + 5/10 × 0.7215 = 0.6361
Gain(Practical Knowledge) = 0.8807 − 0.6361 = 0.2446
Entropy of Communication Skills
Communication Skills Job Offer = Yes Job Offer = No Total
Good 4 1 5
Moderate 3 0 3
Poor 0 2 2
Entropy_Info(T, Communication Skills) = 5/10 × 0.7215 + 3/10 × 0 + 2/10 × 0 = 0.36096
Gain(Communication Skill) = 0.8807 − 0.36096 = 0.5203
Gain Table
Attribute Gain
CGPA 0.5564
Interactiveness 0.0911
Practical Knowledge 0.2446
Communication Skill 0.5203
Step 3: Since the gain of CGPA is maximum, we choose CGPA as the best split attribute and place it at the root node. The resulting tree is as shown below.
Iteration 2:
In this iteration the same process of computing Entropy_Info and Gain is repeated with the subset of the training set.
The subset (the instances with CGPA ≥ 9) consists of 4 data instances, as shown below, with 3 Yes and 1 No, so
Entropy_Info(Job Offer) = Entropy_Info(3, 1) = 0.8108
Entropy of Interactiveness
Entropy_Info(T, Interactiveness) = 0 + 0.4997 = 0.4997
Gain(Interactiveness) = 0.8108 − 0.4997 = 0.3111
Entropy of Practical Knowledge
Entropy_Info(T, Practical Knowledge) = 0 + 0 + 0 = 0
Gain(Practical Knowledge) = 0.8108 − 0 = 0.8108
Entropy of Communication Skill
Entropy_Info(T, Communication Skill) = 0 + 0 + 0 = 0
Gain(Communication Skill) = 0.8108 − 0 = 0.8108
Gain Table
Attribute Gain
Interactiveness 0.3111
Practical Knowledge 0.8108
Communication Skill 0.8108
Since both the attributes Practical Knowledge and Communication Skills have the same gain, we can construct the decision tree using either Practical Knowledge or Communication Skills.
The final decision tree is as shown below.
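The Iteration 1 gain table can be checked with a short Python sketch; the class counts per attribute value are read off the frequency tables of this problem, and the helper names are ours:

import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

def information_gain(target_counts, partition_counts):
    """Gain = entropy(target) - weighted entropy of the partitions."""
    total = sum(target_counts)
    weighted = sum(sum(part) / total * entropy(part) for part in partition_counts)
    return entropy(target_counts) - weighted

target = [7, 3]  # Job Offer: 7 Yes, 3 No
splits = {
    "CGPA":                 [[3, 1], [4, 0], [0, 2]],   # >=9, >=8, <8
    "Interactiveness":      [[5, 1], [2, 2]],           # Yes, No
    "Practical Knowledge":  [[2, 0], [4, 1], [1, 2]],   # Very Good, Good, Average
    "Communication Skills": [[4, 1], [3, 0], [0, 2]],   # Good, Moderate, Poor
}
for name, parts in splits.items():
    print(name, round(information_gain(target, parts), 4))
# CGPA comes out highest (≈ 0.556), matching the gain table above up to rounding.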
C4.5 CONSTRUCTION
C4.5 is an improvement over ID3. C4.5 works with continuous and discrete attributes and missing
values, and it also supports post-pruning.
C5.0 is the successor of C4.5 and is more efficient and used for building smaller decision trees.
C4.5 handles missing values, which are marked with '?', but these missing attribute values are not considered in the calculations.
The C4.5 algorithm is based on Occam's razor, which says that, given two correct solutions, the simpler solution has to be chosen.
Moreover, the algorithm requires a larger training set for better accuracy.
It uses Gain Ratio as a measure during the construction of decision trees.
ID3 is more biased towards attributes with larger values. To overcome this bias issue, C4.5 uses a
purity measure Gain ratio to identify the best split attribute.
In the C4.5 algorithm, the Information Gain measure used in ID3 is normalized by computing a factor called Split_Info.
This normalized information gain of an attribute, called the Gain Ratio, is computed as the ratio of the Information Gain of the attribute to its Split_Info.
The attribute with the highest normalized information gain, that is, the highest gain ratio, is then used as the splitting criterion.
Given a training dataset T, the Split_Info of an attribute A is computed as
Split_Info(T, A) = − ∑ (|Ai| / |T|) × log2(|Ai| / |T|)
where the sum runs over the 'v' distinct values {a1, a2, ..., av} of attribute A, and |Ai| is the number of instances with the distinct value ai of attribute A.
The Gain Ratio of an attribute A is computed as
Gain Ratio(A)= Info_Gain(A) / Split_Info(T, A)
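A minimal Python sketch of Split_Info and Gain Ratio under the same conventions (helper names ours; the CGPA partition counts come from the worked problem below):

import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

def split_info(partition_sizes):
    """Entropy of the split proportions themselves (C4.5's Split_Info)."""
    total = sum(partition_sizes)
    return -sum((s / total) * math.log2(s / total) for s in partition_sizes if s)

def gain_ratio(target_counts, partition_counts):
    total = sum(target_counts)
    weighted = sum(sum(p) / total * entropy(p) for p in partition_counts)
    gain = entropy(target_counts) - weighted
    return gain / split_info([sum(p) for p in partition_counts])

# CGPA partitions (>=9, >=8, <8) of the Job Offer dataset:
print(round(gain_ratio([7, 3], [[3, 1], [4, 0], [0, 2]]), 4))
# ≈ 0.366, matching the 0.3658 in the gain ratio table below up to rounding.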
Problems
Problem: Construct a decision tree using the C4.5 algorithm for the same Job Offer training dataset.
Step 1: Calculate the entropy for the target class Job Offer.
Entropy_Info(Job Offer) = Entropy_Info(7, 3) = −((7/10) log2(7/10) + (3/10) log2(3/10)) = 0.8807
Iteration 1:
Step 2: Calculate the Entropy_Info, Information_Gain, Split_Info and Gain_Ratio for each of the attributes in the training dataset.
Entropy of CGPA
CGPA Job Offer = Yes Job Offer = No Total
≥9 3 1 4
≥8 4 0 4
<8 0 2 2
Entropy_Info(T, CGPA) = (4/10) × Entropy(3, 1) + (4/10) × Entropy(4, 0) + (2/10) × Entropy(0, 2)
= 4/10 × 0.8 + 0 + 0 = 0.3243
Gain(CGPA) = 0.8807 − 0.3243 = 0.5564
Split_Info(T, CGPA) = −(4/10 log2(4/10) + 4/10 log2(4/10) + 2/10 log2(2/10)) = 1.5211
Gain_Ratio(CGPA) = Gain(CGPA) / Split_Info(T, CGPA) = 0.5564 / 1.5211 = 0.3658
Entropy of Interactiveness
Entropy_Info(T, Interactiveness) = (6/10) × Entropy(5, 1) + (4/10) × Entropy(2, 2) = 0.7896
Gain(Interactiveness) = 0.8807 − 0.7896 = 0.0911
Split_Info(T, Interactiveness) = −(6/10 log2(6/10) + 4/10 log2(4/10)) = 0.9709
Gain_Ratio(Interactiveness) = 0.0911 / 0.9709 = 0.0939
Entropy of Practical Knowledge
Entropy_Info(T, Practical Knowledge) = 2/10 × 0 + 3/10 × 0.9177 + 5/10 × 0.7215 = 0.6361
Gain(Practical Knowledge) = 0.8807 − 0.6361 = 0.2446
Split_Info(T, Practical Knowledge) = −(2/10 log2(2/10) + 5/10 log2(5/10) + 3/10 log2(3/10)) = 1.4853
Gain_Ratio(Practical Knowledge) = 0.2446 / 1.4853 = 0.1648
Entropy of Communication Skills
Entropy_Info(T, Communication Skills) = 5/10 × 0.7215 + 3/10 × 0 + 2/10 × 0 = 0.36096
Gain(Communication Skills) = 0.8807 − 0.36096 = 0.5203
Split_Info(T, Communication Skills) = −(5/10 log2(5/10) + 3/10 log2(3/10) + 2/10 log2(2/10)) = 1.4853
Gain_Ratio(Communication Skills) = 0.5203 / 1.4853 = 0.3502
Gain Table
Attribute Gain_Ratio
CGPA 0.3658
Interactiveness 0.0939
Practical Knowledge 0.1648
Communication Skill 0.3502
Step 3: Since the gain ratio of CGPA is maximum, we choose CGPA as the best split attribute and place it at the root node. The resulting tree is as shown below.
Iteration 2:
In this iteration the same process of computing Entropy_Info, Gain, Split_Info and Gain_Ratio is repeated with the subset of the training set.
The subset (the instances with CGPA ≥ 9) consists of 4 data instances, as shown below, with Entropy_Info(Job Offer) = Entropy_Info(3, 1) = 0.8108.
Entropy of Interactiveness
Entropy_Info(T, Interactiveness) = 0 + 0.4997 = 0.4997
Gain(Interactiveness) = 0.8108 − 0.4997 = 0.3111
Split_Info(T, Interactiveness) = 1
Gain_Ratio(Interactiveness) = Gain(Interactiveness) / Split_Info(T, Interactiveness) = 0.3111 / 1 = 0.3111
Entropy of Practical Knowledge
Entropy_Info(T, Practical Knowledge) = 0 + 0 + 0 = 0
Gain(Practical Knowledge) = 0.8108 − 0 = 0.8108
Split_Info(T, Practical Knowledge) = 1.5
Gain_Ratio(Practical Knowledge) = Gain(Practical Knowledge) / Split_Info(T, Practical Knowledge) = 0.8108 / 1.5 = 0.5408
Entropy of Communication Skill
Entropy_Info(T, Communication Skill) = 0 + 0 + 0 = 0
Gain(Communication Skill) = 0.8108 − 0 = 0.8108
Split_Info(T, Communication Skill) = 1.5
Gain_Ratio(Communication Skill) = Gain(Communication Skill) / Split_Info(T, Communication Skill) = 0.8108 / 1.5 = 0.5408
Gain Table
Attribute Gain_Ratio
Interactiveness 0.3111
Practical Knowledge 0.5408
Communication Skill 0.5408
Since both the attributes Practical Knowledge and Communication Skills have the same gain ratio, we can construct the decision tree using either Practical Knowledge or Communication Skills.
Example: Handling a continuous-valued attribute. Consider the continuous-valued attribute CGPA and find the best split point for it.
Solution:
First sort the values in ascending order, removing the duplicates, and consider only the unique values of the attribute.
6.8 7.9 8.2 8.5 8.8 9.1 9.5
For each candidate split value v, the instances are partitioned into CGPA ≤ v and CGPA > v, giving the following class counts:
CGPA value v   Yes (≤ v)   Yes (> v)   No (≤ v)   No (> v)
6.8   0   7   1   2
7.9   0   7   2   1
8.2   1   6   2   1
8.5   2   5   2   1
8.8   4   3   2   1
9.1   5   2   3   0
9.5   7   0   3   0
Entropy of each partition:
Entropy(0, 1) = 0
Entropy(7, 2) = 0.763
Entropy(0, 2) = 0
Entropy(7, 1) = 0.543
Entropy(1, 2) = 0.917
Entropy(6, 1) = 0.591
Entropy(2, 2) = 1
Entropy(5, 1) = 0.649
Entropy(4, 2) = 0.917
Entropy(3, 1) = 0.810
Entropy(5, 3) = 0.953
Entropy(2, 0) = 0
Entropy(7, 3) = 0.880
Weighted Entropy_Info for each candidate split value:
CGPA value v   6.8   7.9   8.2   8.5   8.8   9.1   9.5
Entropy (≤ v, > v)   0, 0.763   0, 0.543   0.917, 0.591   1, 0.649   0.917, 0.810   0.953, 0   0.880, 0
Entropy_Info   0.687   0.434   0.689   0.789   0.874   0.763   0.880
Since the split at CGPA = 7.9 has the minimum Entropy_Info (0.434), it has the maximum gain, 0.4462. Hence, CGPA = 7.9 is chosen as the split point.
We can discretize the continuous values of CGPA into 2 categories, CGPA ≤ 7.9 and CGPA > 7.9. The resulting discretized instances are shown in the table below.
S. No.   CGPA   Discretized CGPA   Job Offer
4   6.8   ≤ 7.9   No
7   7.9   ≤ 7.9   No
CART CONSTRUCTION
The Gini Index of a training dataset T is computed as
Gini_Index(T) = 1 − ∑ Pi²  (the sum runs over the classes)
where Pi is the probability that a data instance or a tuple 'd' belongs to class Ci, with
Pi = |Number of data instances belonging to class Ci| / |Total number of data instances in the training dataset|
The Gini Index assumes a binary split on each attribute; therefore, every attribute is considered as a binary attribute which splits the data instances into two subsets S1 and S2.
Gini_Index(T, A) = (|S1| / |T|) × Gini(S1) + (|S2| / |T|) × Gini(S2)
The splitting subset with minimum Gini_Index is chosen as the best splitting subset for an attribute.
The best splitting attribute is chosen by the minimum Gini_Index, or equivalently the maximum ∆Gini, where ∆Gini(A) = Gini_Index(T) − Gini_Index(T, A), because it reduces the impurity the most.
1. Compute Gini_Index for the whole training dataset based on the target attribute.
2. Compute Gini_Index for each of the attributes and for the subsets of each attribute in the training dataset.
3. Choose the best splitting subset which has the minimum Gini_Index for an attribute.
4. Compute ∆Gini for the best splitting subset of each attribute.
5. Choose the attribute with the maximum ∆Gini as the best split attribute.
6. The best split attribute with the best split subset is placed as the root node.
7. The root node is branched into two subtrees with each subtree an outcome of the test condition of the
root node attribute. Accordingly, the training dataset is also split into two subsets.
8. Recursively apply the same operation for the subset of the training set with the remaining attributes
until a leaf node is derived or no more training instances are available in the subset.
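A minimal Python sketch of the Gini computations used in the CART procedure above (helper names ours; the example counts are taken from the worked problem that follows):

def gini(counts):
    """Gini impurity of a class distribution given as raw counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_index_binary_split(left_counts, right_counts):
    """Weighted Gini impurity of a binary split into subsets S1 and S2."""
    n_left, n_right = sum(left_counts), sum(right_counts)
    total = n_left + n_right
    return (n_left / total) * gini(left_counts) + (n_right / total) * gini(right_counts)

# Whole Job Offer dataset: 7 Yes, 3 No
print(gini([7, 3]))                                       # 0.42
# CGPA split {>=9, >=8} (7 Yes, 1 No) vs {<8} (0 Yes, 2 No)
print(round(gini_index_binary_split([7, 1], [0, 2]), 4))  # 0.175 (the 0.1755 below, up to rounding)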
Problem 1:
Construct a decision tree using CART algorithm for the following dataset
Solution
Step 1: Calculate the Gini_Index for the dataset, which consists of 10 data instances. The target attribute 'Job Offer' has 7 instances as Yes and 3 instances as No.
Gini_Index(T) = 1 − (7/10)² − (3/10)² = 1 − 0.49 − 0.09 = 0.42
Attribute- CGPA
Gini_Index(T, CGPA ∈ {(≥ 8, < 8), ≥ 9}) = (6/10) × 0.445 + (4/10) × 0.375 = 0.417
Gini_Index of CGPA
Subset Gini_Index
(≥ 9, ≥ 8) <8 0.1755
(≥ 9, < 8) ≥8 0.3
(≥ 8, < 8) ≥9 0.417
Attribute- Interactiveness
Gini_Index(T, Interactiveness ∈ {Yes, No}) = (6/10) × 0.28 + (4/10) × 0.5 = 0.168 + 0.2 = 0.368
Gini_Index of Interactiveness
Subset Gini_Index
Yes No 0.368
Compute ∆Gini for the best splitting subset of that attribute.
Attribute – Practical Knowledge
Practical Knowledge Job Offer = Yes Job Offer = No
Very Good 2 0
Good 4 1
Average 1 2
Gini_Index(T, Practical Knowledge ∈ {Very Good, Good}) = 1 − (6/7)² − (1/7)² = 1 − 0.7544 = 0.2456
Gini_Index(T, Practical Knowledge ∈ {Very Good, Good}, Average) = (7/10) × 0.245 + (3/10) × 0.445 = 0.3054
Gini_Index(T, Practical Knowledge ∈ {Very Good, Average}) = 1 − (3/5)² − (2/5)² = 1 − 0.52 = 0.48
Gini_Index(T, Practical Knowledge ∈ {Very Good, Average}, Good) = (5/10) × 0.48 + (5/10) × 0.32 = 0.40
Gini_Index(T, Practical Knowledge ∈ {Good, Average}) = 1 − (5/8)² − (3/8)² = 1 − 0.5312 = 0.4688
Gini_Index(T, Practical Knowledge ∈ {Good, Average}, Very Good) = (8/10) × 0.4688 + (2/10) × 0 = 0.3750
Subset Gini_Index
{Very Good, Good} Average 0.3054
{Very Good, Average} Good 0.40
{Good, Average} Very Good 0.3750
Attribute – Communication Skills
Communication Skills Job Offer = Yes Job Offer = No
Good 4 1
Moderate 3 0
Poor 0 2
Gini_Index(T, Communication Skills ∈ {Good, Moderate}, Poor) = (8/10) × 0.2194 + (2/10) × 0 = 0.1755
Gini_Index(T, Communication Skills ∈ {Good, Poor}) = 1 − (4/7)² − (3/7)² = 1 − 0.5101 = 0.489
Gini_Index(T, Communication Skills ∈ {Good, Poor}, Moderate) = (7/10) × 0.489 + (3/10) × 0 = 0.3429
Gini_Index(T, Communication Skills ∈ {Moderate, Poor}, Good) = (5/10) × 0.48 + (5/10) × 0.32 = 0.40
Subset Gini_Index
{ Good, Moderate} Poor 0.1755
{Good, Poor} Moderate 0.3429
{Moderate, Poor} Good 0.40
Since both CGPA and Communication Skills have the same ∆Gini value, which is the highest among the attributes, we can choose either CGPA or Communication Skills as the best split attribute. Here we have chosen CGPA.
Iteration 2:
The branch CGPA ∈ {≥ 9, ≥ 8} contains 8 instances (7 Yes, 1 No), so the Gini_Index of this subset is 1 − (7/8)² − (1/8)² = 0.2184.
Attribute – Interactiveness
Gini_Index of Interactiveness
Subset Gini_Index
Yes No 0.056
Compute ∆Gini for the best splitting subset of that attribute.
Attribute – Practical Knowledge
Practical Knowledge Job Offer = Yes Job Offer = No
Very Good 2 0
Good 4 0
Average 1 1
Gini_Index(T, Practical Knowledge ∈ {Very Good, Good}, Average) = (6/8) × 0 + (2/8) × 0.5 = 0.125
Gini_Index(T, Practical Knowledge ∈ {Very Good, Average}, Good) = (4/8) × 0.375 + (4/8) × 0 = 0.1875
Gini_Index(T, Practical Knowledge ∈ {Good, Average}, Very Good) = (6/8) × 0.278 + (2/8) × 0 = 0.2085
Subset Gini_Index
{Very Good, Good} Average 0.125
{Very Good, Average} Good 0.1875
{Good, Average} Very Good 0.2085
Attribute – Communication Skills
Communication Skills Job Offer = Yes Job Offer = No
Good 4 0
Moderate 3 0
Poor 0 1
Gini_Index(T, Communication Skills ∈ {Good, Poor}) = 1 − (4/5)² − (1/5)² = 1 − 0.64 − 0.04 = 0.32
Gini_Index(T, Communication Skills ∈ {Good, Poor}, Moderate) = (5/8) × 0.32 + (3/8) × 0 = 0.2
Gini_Index(T, Communication Skills ∈ {Moderate, Poor}, Good) = (4/8) × 0.375 + (4/8) × 0 = 0.1875
Subset Gini_Index
{ Good, Moderate} Poor 0
{Good, Poor} Moderate 0.2
{Moderate, Poor} Good 0.1875
∆Gini(Communication Skills) = 0.2184 − 0 = 0.2184, which is the maximum ∆Gini, so Communication Skills with the subset {Good, Moderate} is chosen as the best split attribute for this branch.
Problem 2:
Apply the CART algorithm on the below dataset and construct a decision tree.
Solution:
Regression Trees
Regression trees are a variant of decision trees where the target feature is a continuous-valued variable. These trees can be constructed using an approach called reduction in variance, which uses the standard deviation to choose the best splitting attribute, as illustrated in the short sketch after the steps below.
1. Compute standard deviation for each attribute with respect to target attribute.
2. Compute standard deviation for the number of data instances of each distinct value of an attribute.
3. Compute weighted standard deviation for each attribute.
4. Compute standard deviation reduction by subtracting weighted standard deviation for each attribute
from standard deviation of each attribute.
5. Choose the attribute with the highest standard deviation reduction as the best split attribute.
6. The best split attribute is placed as the root node.
7. The root node is branched into subtrees with each subtree as an outcome of the test condition of the
root node attribute. Accordingly, the training dataset is also split into different subsets.
8. Recursively apply the same operation for the subset of the training set with the remaining attributes
until a leaf node is derived or no more training instances are available in the subset
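Here is the promised sketch of the standard deviation reduction criterion (the function name and the toy numbers are ours, purely for illustration):

import statistics

def sdr(target_values, partitions):
    """Standard deviation reduction of a candidate split.

    target_values: all target values reaching this node.
    partitions: list of lists, the target values falling in each branch of the split.
    """
    total = len(target_values)
    sd_before = statistics.pstdev(target_values)
    sd_after = sum(len(p) / total * statistics.pstdev(p) for p in partitions)
    return sd_before - sd_after

# Toy continuous target split two ways; the split with the larger SDR wins.
y = [10, 12, 11, 30, 32, 31]
print(round(sdr(y, [[10, 12, 11], [30, 32, 31]]), 3))  # large reduction (≈ 9.2): good split
print(round(sdr(y, [[10, 30, 11], [12, 32, 31]]), 3))  # small reduction (≈ 0.8): poor split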
Overfitting is a general problem with decision trees. Once the decision tree is constructed it must be
validated for better accuracy and to avoid over-fitting and under-fitting.
There is always a tradeoff between accuracy and complexity of the tree. The tree must be simple
and accurate.
If the tree is more complex, it can classify the data instances accurately for the training set, but when test data is given, the constructed tree may perform poorly, which means misclassifications are higher and accuracy is reduced. This problem is called over-fitting.
To avoid overfitting of the tree, one needs to prune the tree and construct an optimal decision tree.
Trees can be pre-pruned or post-pruned.
If tree nodes are pruned during construction, or the construction is stopped early without exploring the nodes' branches, then it is called pre-pruning.
If tree nodes are pruned after the construction is over, then it is called post-pruning.
Basically, the dataset is split into three sets called training dataset, validation dataset and test
dataset. Generally, 40% of the dataset is used for training the decision tree and the remaining 60%
is used for validation and testing.
Once the decision tree is constructed, it is validated with the validation dataset and the
misclassifications are identified.
Using the number of instances correctly classified and number of instances wrongly classified,
Average Squared Error (ASE) is computed.
The tree nodes are pruned based on these computations and the resulting tree is validated until we
get a tree that performs better.
Cross validation is another way to construct an optimal decision tree.
Another approach is that after the tree is constructed using the training set, statistical tests like error
estimation and Chi-square test are used to estimate whether pruning or splitting is required for a
particular node to find a better accurate tree.
The third approach uses a principle called Minimum Description Length, which uses a complexity measure for encoding the training set; the growth of the decision tree is stopped when the encoding size, i.e., size(tree) + size(misclassifications(tree)), is minimized.
Some of the tree pruning methods are listed below:
1. Reduced Error Pruning
2. Minimum Error Pruning (MEP)
3. Pessimistic Pruning
4. Error-based Pruning (EBP)
5. Optimal Pruning
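As one concrete, library-level example of post-pruning (an addition, not one of the methods listed above): scikit-learn's minimal cost-complexity pruning follows the same idea of trading tree complexity against error on held-out data:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

# Candidate pruning strengths (alphas) for the fully grown tree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
alphas = [a for a in path.ccp_alphas if a >= 0]   # guard against tiny negative values

# Pick the alpha whose pruned tree does best on the held-out validation data
best = max(
    alphas,
    key=lambda a: DecisionTreeClassifier(random_state=0, ccp_alpha=a)
    .fit(X_train, y_train)
    .score(X_valid, y_valid),
)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=best).fit(X_train, y_train)
print("chosen alpha:", best, "validation accuracy:", pruned.score(X_valid, y_valid))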
BAYESIAN LEARNING
Bayesian Learning is a learning method that describes and represents knowledge in an uncertain domain
and provides a way to reason about this knowledge using probability measure. It uses Bayes theorem to
infer the unknown parameters of a model.
Bayesian inference is useful in many applications which involve reasoning and diagnosis, such as game theory, medicine, etc. Bayesian inference is much more powerful in handling missing data and for estimating any uncertainty in predictions.
Naive Bayes Model relies on Bayes theorem that works on the principle of three kinds of probabilities
called prior probability, likelihood probability, and posterior probability.
Prior Probability
It is the general probability of an uncertain event before an observation is seen or some evidence is
collected. It is the initial probability that is believed before any new information is collected.
Likelihood Probability
Likelihood probability is the relative probability of the observation occurring for each class or the
sampling density for the evidence given the hypothesis.
It is stated as P (Evidence | Hypothesis), which denotes the likeliness of the occurrence of the evidence
given the parameters.
Posterior Probability
It is the updated or revised probability of an event taking into account the observations from the training
data.
P (Hypothesis | Evidence) is the posterior distribution representing the belief about the hypothesis, given
the evidence from the training data.
Therefore, informally, Posterior probability = Prior probability + New evidence.
Naive Bayes classification models work on the principle of Bayes theorem. Bayes' rule is a mathematical formula used to determine the posterior probability, given prior probabilities of events.
Generally, Bayes theorem is used to select the most probable hypothesis from data, considering
both prior knowledge and posterior distributions.
It is based on the calculation of the posterior probability and is stated as:
P(Hypothesis h | Evidence E) = (P(Evidence E | Hypothesis h) × P(Hypothesis h)) / P(Evidence E)
where Hypothesis h is the target class to be classified and Evidence E is the given test instance.
The maximum a posteriori (MAP) hypothesis is the hypothesis with the highest posterior probability:
h_MAP = max over h ∈ H of P(h | E) = max over h ∈ H of P(E | h) × P(h)
Problems
1. Dangerous fires are rare (1%), but smoke is fairly common (10%) due to barbecues, and 90% of dangerous fires produce smoke. What is the probability of a dangerous fire when there is smoke?
Given:
P(fire) = 1%
P(smoke) = 10%
P(smoke | fire) = 90%
P(fire | smoke) = ?
P(fire | smoke) = (P(smoke | fire) × P(fire)) / P(smoke) = (90% × 1%) / 10% = 9%
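The same calculation in a few lines of Python (a trivial sketch; the helper name is ours):

def posterior(likelihood, prior, evidence):
    """Bayes rule: P(H | E) = P(E | H) * P(H) / P(E)."""
    return likelihood * prior / evidence

# Dangerous fire given smoke: P(smoke|fire)=0.9, P(fire)=0.01, P(smoke)=0.10
print(posterior(0.9, 0.01, 0.10))   # 0.09, i.e. 9%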
2. 50% of all rainy days start off cloudy. But cloudy mornings are common: about 40% of days start cloudy. And this is usually a dry month (only 3 of 30 days tend to be rainy, or 10%). You are planning a picnic today, but the morning is cloudy. What is the chance of rain during the day?
Given:
P(Rain) = 10%
P(Cloud | Rain) = 50%
P(Cloud) = 40%
P(Rain | Cloud) = (P(Cloud | Rain) × P(Rain)) / P(Cloud) = (50% × 10%) / 40% = 12.5%
3. Consider a boy who has a volleyball tournament the next day, but today he feels sick. It is unusual for him to fall sick: there is only a 40% chance of this, since he is a healthy boy. Find the probability of the boy participating in the tournament given that he is sick. The boy is very much interested in volleyball, so there is a 90% probability that he would participate in tournaments, and a 20% probability that he will fall sick given that he participates in the tournament.
Given:
P (Boy participating in the tournament) = 90%
P (He is sick | Boy participating in the tournament) = 20%
P (He is Sick) = 40%
P (Boy participating in the tournament | He is sick) = ?
= (20% × 90%) / 40% = 45%
4. Consider a medical diagnosis problem in which there are two alternative hypotheses: (1) that the patient has a particular form of cancer, and (2) that the patient does not. The available data is
from a particular laboratory test with two possible outcomes: + (positive) and - (negative).
We have prior knowledge that over the entire population of people only .008 have this disease.
Furthermore, the lab test is only an imperfect indicator of the disease. The test returns a correct
positive result in only 98% of the cases in which the disease is actually present and a correct
negative result in only 97% of the cases in which the disease is not present. In other cases, the
test returns the opposite result.
Given:
P(cancer) = 0.008, P(¬cancer) = 0.992
P(+ | cancer) = 0.98, P(− | cancer) = 0.02
P(+ | ¬cancer) = 0.03, P(− | ¬cancer) = 0.97
Suppose a new patient is observed for whom the lab test returns a positive (+) result. Should we diagnose the patient as having cancer or not?
P(+ | cancer) P(cancer) = 0.98 × 0.008 = 0.0078
P(+ | ¬cancer) P(¬cancer) = 0.03 × 0.992 = 0.0298
Since P(+ | ¬cancer) P(¬cancer) > P(+ | cancer) P(cancer), h_MAP = ¬cancer, and the patient is diagnosed as not having cancer.
Problem
Consider the below table and apply Naïve Bayes classifier to find the class of the new test instance
{Sunny, Cold, Normal, Weak}
Step 1: Compute the prior probability for the target class Play Tennis: P(Play Tennis = Yes) = 9/14, P(Play Tennis = No) = 5/14.
Step 2: Compute the frequency matrix and likelihood probability for each of the features/attributes.
Attribute – Outlook
Frequency matrix Likelihood Probability
Outlook Play Tennis = Yes Play Tennis = No Outlook Play Tennis = Yes Play Tennis = No
Sunny 2 3 Sunny 2/9 3/5
Overcast 4 0 Overcast 4/9 0/5
Rain 3 2 Rain 3/9 2/5
Total 9 5
Attribute – Temperature
Frequency matrix Likelihood Probability
Temperature Play Tennis = Yes Play Tennis = No Temperature Play Tennis = Yes Play Tennis = No
Hot 2 2 Hot 2/9 2/5
Cold 3 1 Cold 3/9 1/5
Mild 4 2 Mild 4/9 2/5
Total 9 5
Attribute – Humidity
Frequency matrix Likelihood Probability
Humidity Play Tennis = Yes Play Tennis = No Humidity Play Tennis = Yes Play Tennis = No
High 3 4 High 3/9 4/5
Normal 6 1 Normal 6/9 1/5
Total 9 5
Attribute – Wind
Frequency matrix Likelihood Probability
Wind Play Tennis = Yes Play Tennis = No Wind Play Tennis = Yes Play Tennis = No
Strong 3 3 Strong 3/9 3/5
Weak 6 2 Weak 6/9 2/5
Total 9 5
Step 3: Applying Bayes Theorem for new test data to calculate the probability
P(PlayTennis =Yes | Test data) = P(Test data | Yes) * P(Yes)
= P ( Sunny | Yes) * P (Cold| Yes) * P (Normal | Yes ) * P (Weak| Yes) * P (Yes)
= 2/9 × 3/9 × 6/9 × 6/9 × 9/14 = 0.0211
P(PlayTennis = No | Test data) = P(Test data | No) × P(No)
= P(Sunny | No) × P(Cold | No) × P(Normal | No) × P(Weak | No) × P(No)
= 3/5 × 1/5 × 1/5 × 2/5 × 5/14 = 0.0034
Step 4: Using h_MAP, we classify the new test data as Yes, since P(PlayTennis = Yes | Test data) = 0.0211 is the maximum.
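The Step 3 arithmetic can be reproduced with a short Python sketch; the likelihood values are copied from the tables above, and the variable names are ours:

from math import prod

priors = {"Yes": 9 / 14, "No": 5 / 14}
# P(attribute value | class) for the test instance {Sunny, Cold, Normal, Weak}
likelihoods = {
    "Yes": [2 / 9, 3 / 9, 6 / 9, 6 / 9],   # Sunny, Cold, Normal, Weak given Yes
    "No":  [3 / 5, 1 / 5, 1 / 5, 2 / 5],   # Sunny, Cold, Normal, Weak given No
}

# Unnormalised posteriors: P(class | test) is proportional to P(test | class) * P(class)
scores = {c: prod(likelihoods[c]) * priors[c] for c in priors}
print(scores)                        # {'Yes': ~0.0211, 'No': ~0.0034}
print(max(scores, key=scores.get))   # 'Yes'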
Problem 2:
Assess the student's performance using the Naïve Bayes algorithm with the dataset provided in the table. Predict whether a student gets a job offer or not in the final year of the course for the given test data {CGPA ≥ 9, Interactiveness = Yes, Practical Knowledge = Average, Communication Skill = Good}.
Solution:
Step 1: Compute the prior probability of the target concept Job Offer.
Job Offer No. of Instance Probability Value
Yes 7 7/10
No 3 3/10
Step 2: Compute Frequency matrix and Likelihood Probability for each of the feature/attribute.
Attribute – CGPA
Frequency matrix Likelihood Probability
CGPA Job offer = Yes Job offer = No CGPA Job offer = Yes Job offer = No
≥9 3 1 ≥9 3/7 1/3
≥8 4 0 ≥8 4/7 0/3
<8 0 2 <8 0/7 2/3
Total 7 3
Attribute – Interactiveness
Frequency matrix Likelihood Probability
Interactiveness Job offer = Yes Job offer = No Interactiveness Job offer = Yes Job offer = No
Yes 5 1 Yes 5/7 1/3
No 2 2 No 2/7 2/3
Total 7 3
Attribute – Practical Knowledge
Frequency matrix Likelihood Probability
Practical Knowledge Job offer = Yes Job offer = No Practical Knowledge Job offer = Yes Job offer = No
Very Good 2 0 Very Good 2/7 0/3
Average 1 2 Average 1/7 2/3
Good 4 1 Good 4/7 1/3
Total 7 3
Attribute – Communication Skills
Frequency matrix Likelihood Probability
Communication Skills Job offer = Yes Job offer = No Communication Skills Job offer = Yes Job offer = No
Good 4 1 Good 4/7 1/3
Moderate 3 0 Moderate 3/7 0/3
Poor 0 2 Poor 0/7 2/3
Total 7 3
Step 3: Applying Bayes Theorem for new test data to calculate the probability
P(Job offer =Yes | Test data) = P(Test data | Yes) * P(Yes)
= P(CGPA ≥ 9 | Yes) × P(Interactiveness = Yes | Yes) × P(Practical Knowledge = Average | Yes) × P(Communication Skills = Good | Yes) × P(Yes)
= 3/7 × 5/7 × 1/7 × 4/7 × 7/10 = 0.0175
P(Job offer = No | Test data) = P(Test data | No) × P(No)
= 1/3 × 1/3 × 2/3 × 1/3 × 3/10 = 0.0074
Step 4: Using h_MAP, we classify the new test data as Yes, since P(Job offer = Yes | Test data) = 0.0175 is the maximum.
Consider the test data to be {CGPA ≥ 8, Interactiveness = Yes, Practical Knowledge = Average, Communication Skills = Good}.
Since the probability value is zero, the model fails to predict, and this is called the zero-probability error.
This problem arises because there are no instances for the attribute value CGPA ≥ 8 and Job Offer = No
and hence the probability value of this case is zero.
This probability error can be solved by applying a smoothing technique called Laplace correction.
Given 1000 data instances in the training dataset, if there are zero instances for a particular value of a feature, we can add 1 instance for each attribute-value pair of that feature. This will not make much difference for 1000 data instances, and the overall probability does not become zero.
Now add 1 instance for each CGPA-value pair for Job offer = No. Then the likelihood probabilities become P(CGPA ≥ 9 | No) = 2/6, P(CGPA ≥ 8 | No) = 1/6 and P(CGPA < 8 | No) = 3/6, so none of them is zero.
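A minimal sketch of Laplace (add-one) smoothing for a categorical likelihood table; the counts are the CGPA counts for Job Offer = No from this example, and the function name is ours:

def laplace_likelihoods(counts, alpha=1):
    """Add-one smoothed estimates P(value | class) from raw value counts."""
    total = sum(counts.values()) + alpha * len(counts)
    return {value: (c + alpha) / total for value, c in counts.items()}

# CGPA counts within Job Offer = No: >=9 -> 1, >=8 -> 0, <8 -> 2
print(laplace_likelihoods({">=9": 1, ">=8": 0, "<8": 2}))
# {'>=9': 2/6, '>=8': 1/6, '<8': 3/6} — no zero probabilities remain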
Bayes Optimal Classifier
Instead of choosing the single most probable (MAP) hypothesis, the Bayes optimal classifier combines the predictions of all hypotheses, weighted by their posterior probabilities.
Example:
Given a hypothesis space with 4 hypotheses h1, h2, h3 and h4, determine whether the patient is diagnosed as COVID positive or COVID negative using the Bayes Optimal classifier.
Solution: From the training dataset T, the posterior probabilities of the four different hypotheses for
a new instance are given
hi P(hi | T) P(Covid Positive | hi) P(Covid Negative | hi)
h1 0.3 0 1
h2 0.1 1 0
h3 0.2 1 0
h4 0.1 1 0
∑ over hi ∈ H of P(Covid Negative | hi) P(hi | T) = 0.3 × 1 = 0.3
∑ over hi ∈ H of P(Covid Positive | hi) P(hi | T) = 0.1 × 1 + 0.2 × 1 + 0.1 × 1 = 0.4
The Bayes Optimal classifier combines the predictions of h2, h3 and h4, which is 0.4, and gives the result that the patient is COVID positive.
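The combination above in a few lines of Python (variable names ours; the probabilities are taken from the table):

# Posterior of each hypothesis and its prediction: (P(h|T), P(positive|h), P(negative|h))
hypotheses = [
    (0.3, 0, 1),   # h1
    (0.1, 1, 0),   # h2
    (0.2, 1, 0),   # h3
    (0.1, 1, 0),   # h4
]

p_positive = sum(post * pos for post, pos, _ in hypotheses)   # 0.4
p_negative = sum(post * neg for post, _, neg in hypotheses)   # 0.3
print("COVID positive" if p_positive > p_negative else "COVID negative")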
Gibbs Algorithm
The main drawback of Bayes optimal classifier is that it computes the posterior probability for all
hypotheses in the hypothesis space and then combines the predictions to classify a new instance.
Gibbs algorithm is a sampling technique which randomly selects a hypothesis from the hypothesis
space according to the posterior probability distribution and classifies a new instance.
It is found that the expected prediction error of the Gibbs algorithm is at most twice that of the Bayes Optimal classifier.
Gaussian Naive Bayes (Continuous Attributes)
For a continuous attribute, the likelihood is modelled with a Gaussian (normal) distribution:
P(Xi = xk | Cj) = (1 / (√(2π) σij)) × exp(−(xk − µij)² / (2 σij²))
Where Xi is the ith continuous attribute in the given dataset
xk is a value of the attribute.
Cj denotes the jth class of the target feature.
µij denotes the mean of the values of that continuous attribute Xi with respect to the class j of
the target feature.
σij denotes the standard deviation of the values of that continuous attribute Xi w.r.t the class j of
the target feature.
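The Gaussian likelihood can be sketched directly in Python (function name ours; the mean and standard deviation plugged in are the Job Offer = Yes values computed in the problem that follows):

import math

def gaussian_likelihood(x, mean, std):
    """Gaussian (normal) density used as P(Xi = x | Cj) for a continuous attribute."""
    coeff = 1.0 / (math.sqrt(2 * math.pi) * std)
    return coeff * math.exp(-((x - mean) ** 2) / (2 * std ** 2))

# CGPA = 8.5 under the Job Offer = Yes class (mean ≈ 8.81, std ≈ 0.5383, see below)
print(gaussian_likelihood(8.5, 8.81, 0.5383))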
Problem
Consider the following table, apply Naïve Bayes algorithm to find the target concept of the new test data
{CGPA =8.5, Interactiveness = Yes}
CGPA Interactiveness Job Offer
9.5 Yes Yes
8.2 No Yes
9.3 No No
7.6 No No
8.4 Yes Yes
9.1 Yes Yes
7.5 Yes No
9.6 No Yes
8.6 Yes Yes
8.3 Yes Yes
Solution
Step 1: Compute the prior probability for the target concept Job Offer: P(Job Offer = Yes) = 7/10, P(Job Offer = No) = 3/10.
Step 2: Computing Frequency matrix and Likelihood Probability for each of the feature/attribute.
Attribute – CGPA: Since the values of CGPA are continuous, we apply the Gaussian distribution formula to calculate the likelihood probability.
First we compute:
1. Mean and standard deviation of CGPA w.r.t. the target class Job Offer = Yes:
µCGPA–Yes = (9.5 + 8.2 + 8.4 + 9.1 + 9.6 + 8.6 + 8.3) / 7 = 8.81
σCGPA–Yes = √(∑(xk − µCGPA–Yes)² / 7) = 0.5383
2. Mean and standard deviation of CGPA w.r.t. the target class Job Offer = No:
µCGPA–No = (9.3 + 7.6 + 7.5) / 3 = 8.13
σCGPA–No = √(∑(xk − µCGPA–No)² / 3) = 0.825
Attribute – Interactiveness
From the table, P(Interactiveness = Yes | Job Offer = Yes) = 5/7 and P(Interactiveness = Yes | Job Offer = No) = 1/3.
Since P(Job offer = Yes | Test data) has the highest probability value, i.e. 0.4375, the test data is classified as 'Job Offer' = Yes.
Bernoulli Naive Bayes
In this algorithm, the features used for making predictions are Boolean variables that take only two values, either 'yes' or 'no'.
This is particularly useful for text classification, where all features are binary, each indicating whether a word occurs in the document or not.
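For reference (an addition, not from the source text), scikit-learn provides this variant as BernoulliNB; a tiny sketch with made-up binary word-occurrence features:

import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Each row: does the document contain word_0, word_1, word_2? (1 = yes, 0 = no)
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 0, 1],
              [0, 1, 0]])
y = np.array(["spam", "spam", "ham", "ham"])

clf = BernoulliNB().fit(X, y)
print(clf.predict(np.array([[1, 0, 0]])))  # predicts one of the two classes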