
Artificial Intelligence and Machine Learning (21CS54)

MODULE-04

DECISION TREE LEARNING


 The decision tree learning model, one of the most popular supervised predictive learning models, classifies data instances with high accuracy and consistency. The model performs an inductive inference that reaches a general conclusion from observed examples.
 A decision tree is a concept tree which summarizes the information contained in the training dataset in the form of a tree structure.
 Once the concept model is built, test data can be easily classified.
 This model can be used to classify both categorical target variables and continuous-valued target variables.
 Given a training dataset X, this model computes a hypothesis function f(X) as a decision tree.
 Inputs to the model are data instances or objects with a set of features or attributes, which can be discrete or continuous, and the output of the model is a decision tree which predicts the target class for the test data object.
 In statistical terms, attributes or features are called independent variables. The target feature or target class is called the response variable, which indicates the category to be predicted for a test object.
 The decision tree learning model generates a complete hypothesis space in the form of a tree structure with the given training dataset and allows a search through the possible set of hypotheses. This kind of search bias is called preference bias.

Structure of a Decision Tree

 A decision tree has a structure that consists of a root node, internal nodes/decision nodes, branches
and terminal nodes/leaf nodes.
 The topmost node in the tree is the root node. Internal nodes are the test nodes and are also called decision nodes. These nodes represent a choice or test of an input attribute, and the outcomes or outputs of the test condition are the branches emanating from this decision node.
 The branches are labelled as per the outcomes or output values of the test condition.
 Each branch represents a sub-tree or subsection of the entire tree.
 Every decision node is part of a path to a leaf node. The leaf nodes represent the labels or the outcome of a decision path.
 The decision tree model, in general, represents a collection of logical rules of classification in the
form of a tree structure.


 Decision networks, otherwise called as influence diagrams, have a directed graph structure with
nodes and links. It is an extension of Bayesian belief networks that represents information about each
node's current state, its possible actions, the possible outcome of those actions, and their utility.
 Figure shows symbols that are used to represent different nodes in the construction of a decision tree.

 A circle is used to represent a root node, a diamond symbol is used to represent a decision node or the
internal nodes, and all leaf nodes are represented with a rectangle.
 A decision tree consists of two major procedures discussed below.
1. Building the Tree

Goal: Construct a decision tree with the given training dataset. The tree is constructed in a top-down fashion. It starts from the root node. At every level of tree construction, one needs to find the best split attribute or best decision node among all attributes. This process is recursive and continues until we reach the last level of the tree or find a leaf node which cannot be split further.

Output: Decision tree representing the complete hypothesis space.

2. Knowledge Inference or Classification

Goal: Given a test instance, infer to the target class it belongs to.

Classification: Inferring the target class for the test instance or object is based on inductive
inference on the constructed decision tree. In order to classify an object, we need to start
traversing the tree from the root. We traverse as we evaluate the test condition on every decision
node with the test object attribute value and walk to the branch corresponding to the test's
outcome. This process is repeated until we end up in a leaf node which contains the target class
of the test object.

Output: Target label of the test instance.

Advantages and Disadvantages of Decision Trees


Advantages

1. Easy to model and interpret

2. Simple to understand

3. The input and output attributes can be discrete or continuous predictor variables.

4. Can model a high degree of nonlinearity in the relationship between the target variables and the
predictor variables


5. Quick to train

Disadvantages

1. It is difficult to determine how deeply a decision tree can be grown or when to stop growing it.

2. If training data has errors or missing attribute values, then the decision tree constructed may
become unstable or biased.

3. If the training data has continuous-valued attributes, handling them is computationally complex and they have to be discretized.

4. A complex decision tree may also be over-fitting with the training data.

5. Decision tree learning is not well suited for classifying multiple output classes.

6. Learning an optimal decision tree is also known to be NP-complete.

Fundamentals of Entropy
 Given the training dataset with a set of attributes or features, the decision tree is constructed by
finding the attribute or feature that best describes the target class for the given test instances.
 The best split feature is the one which, among all features, contains the most information about how to split the dataset so that the target class is accurately identified for the test instances.
 In other words, the best split attribute is the most informative one for splitting the dataset into subsets, and this process is continued until the stopping criterion is reached. This splitting should be pure at every stage of selecting the best feature.
 The best feature is selected based on the amount of information among the features which are
basically calculated on probabilities.
 Quantifying information is closely related to information theory. In the field of information theory,
the features are quantified by a measure called Shannon Entropy which is calculated based on the
probability distribution of the events.
 Entropy is the amount of uncertainty or randomness in the outcome of a random variable or an event. Moreover, entropy describes the homogeneity of the data instances.
 The best feature is selected based on the entropy value.
 Higher the entropy → Higher the uncertainty
 Lower the entropy → Lower the uncertainty
 If all instances belong to the same class (only positive or only negative), then the entropy is 0.
 On the other hand, if the instances are equally distributed, which means 50% positive and 50% negative, then the entropy is 1.
 If there are 10 data instances, out of which some belong to the positive class and some belong to the negative class, then the entropy is calculated as
Entropy = -(p+) log2(p+) - (p-) log2(p-), where p+ and p- are the proportions of positive and negative instances respectively.


 If the entropy is 0, then the split is pure which means that all samples in the set will partition into one
class or category. But if the entropy is 1, the split is impure and the distribution of the samples is
more random.
 The stopping criterion is based on the entropy value
 Let P be the probability distribution of the data instances over the classes 1 to n, as shown below.
 So, P = (P1, ..., Pn)
 The entropy of P is the information measure of this probability distribution, given by
Entropy_Info(P) = Entropy_Info(P1, ..., Pn)
= - (P1 log2 P1 + P2 log2 P2 + ... + Pn log2 Pn)
where P1 is the probability of data instances classified as class 1, P2 is the probability of data instances classified as class 2, and so on.

 P1 = |No of data instances belonging to class 1| / |Total no of data instances in the training
dataset|
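To make the entropy formula above concrete, here is a minimal Python sketch (illustrative only; entropy_info is an assumed helper name, not from the original notes) that computes Entropy_Info from a list of class counts:

    import math

    def entropy_info(counts):
        # Shannon entropy of a class distribution given as a list of class counts.
        total = sum(counts)
        entropy = 0.0
        for count in counts:
            if count == 0:
                continue              # 0 * log2(0) is treated as 0
            p = count / total         # Pi = |instances of class i| / |total instances|
            entropy -= p * math.log2(p)
        return entropy

    print(entropy_info([7, 3]))    # ~0.881 (mixed classes)
    print(entropy_info([5, 5]))    # 1.0   (50% positive, 50% negative)
    print(entropy_info([10, 0]))   # 0.0   (pure: all instances in one class)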

Algorithm: General Algorithm for Decision Trees

1. Find the best attribute from the training dataset using an attribute selection measure and place it
at the root of the tree.
2. Split the training dataset into subsets based on the outcomes of the test attribute and each subset
in a branch contains the data instance or tuples with the same value for the selected test attribute.
3. Repeat step 1 and step 2 on each subset until we end up in leaf nodes in all the branches of the tree.
4. This splitting process is recursive until the stopping criterion is reached.

Stopping Criteria

The following are some of the common stopping conditions:


1. The data instances are homogeneous, which means they all belong to the same class Ci, and hence its entropy is 0.
2. A node having fewer than some defined minimum number of data instances becomes a leaf.
3. The maximum tree depth is reached, so further splitting is not done and the node becomes a leaf
node.

DECISION TREE INDUCTION ALGORITHMS


 There are many decision tree algorithms, such as ID3, C4.5, CART, CHAID, QUEST, GUIDE,
CRUISE, and CTREE, that are used for classification in real-time environment.
 The most commonly used decision tree algorithms are ID3 (Iterative Dichotomizer 3), developed by
J.R Quinlan in 1986, and C4.5 is an advancement of ID3 presented by the same author in 1993.


CART, that stands for Classification and Regression Trees, is another algorithm which was
developed by Breiman et al. in 1984.
 The accuracy of the tree constructed depends upon the selection of the best split attribute.
 Different algorithms are used for building decision trees which use different measures to decide on
the splitting criterion.
 Algorithms such as ID3, C4.5 and CART are popular algorithms used in the construction of decision
trees.
 The algorithm ID3 uses 'Information Gain' as the splitting criterion whereas the algorithm C4.5
uses 'Gain Ratio' as the splitting criterion. The CART algorithm is popularly used for classifying
both categorical and continuous-valued target variables. CART uses GINI Index to construct a
decision tree.
 Decision trees constructed using ID3 and C4.5 are also called as univariate decision trees which
consider only one feature/attribute to split at each decision node whereas decision trees constructed
using CART algorithm are multivariate decision trees which consider a conjunction of univariate
splits.

ID3 Tree Construction


 ID3 is a supervised learning algorithm which uses a training dataset with labels and constructs a
decision tree.
 ID3 is an example of univariate decision trees as it considers only one feature at each decision node.
 This leads to axis-aligned splits.
 The tree is then used to classify the future test instances. It constructs the tree using a greedy
approach in a top-down fashion by identifying the best attribute at each level of the tree.
 ID3 works well if the attributes or features are considered as discrete/categorical value.
 The algorithm builds the tree using a purity measure called 'Information Gain' with the given training data instances and then uses the constructed tree to classify the test data.
 ID3 works well for a large dataset. If the dataset is small, over-fitting may occur. Moreover, it is not accurate if the dataset has missing attribute values.
 No pruning is done during or after construction of the tree and it is prone to outliers.

Algorithm: Procedure to Construct a Decision Tree using ID3


1. Compute Entropy_Info for the whole training dataset based on the target attribute.
Entropy_Info(T) = - ∑ i=1..c Pi log2(Pi), where c is the number of target classes and Pi is the probability of class i.
2. Compute Entropy_Info and Information_Gain for each of the attributes in the training dataset.
Entropy_Info(T, A) = ∑ v=1..V (|Av| / |T|) × Entropy_Info(Av), where Av is the subset of T for which attribute A takes its v-th distinct value.


Information_Gain(A) = Entropy_Info(T) - Entropy_Info(T, A)

3. Choose the attribute for which entropy is minimum and therefore the gain is maximum as the best
split attribute.
4. The best split attribute is placed as the root node.
5. The root node is branched into subtrees with each subtree as an outcome of the test condition of
the root node attribute. Accordingly, the training dataset is also split into subsets.
6. Recursively apply the same operation for the subset of the training set with the remaining
attributes until a leaf node is derived or no more training instances are available in the subset.

Note: We stop branching a node if entropy is 0. The best split attribute at every iteration is the
attribute with the highest information gain.

Information_Gain is a metric that measures how much information is gained by branching on an


attribute A. In other words, it measures the reduction in impurity in an arbitrary subset of data.
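As an illustration of Entropy_Info(T, A) and Information_Gain(A), the following sketch (information_gain is an assumed helper name; it reuses the entropy_info function from the earlier sketch and assumes the dataset is a list of dictionaries) computes the gain of one attribute:

    from collections import Counter, defaultdict

    def information_gain(rows, attribute, target):
        # Information_Gain(A) = Entropy_Info(T) - Entropy_Info(T, A)
        total = len(rows)
        base = entropy_info(list(Counter(r[target] for r in rows).values()))

        # Partition T on the values of attribute A.
        partitions = defaultdict(list)
        for r in rows:
            partitions[r[attribute]].append(r[target])

        # Weighted entropy of the partitions, i.e. Entropy_Info(T, A).
        weighted = sum((len(v) / total) * entropy_info(list(Counter(v).values()))
                       for v in partitions.values())
        return base - weighted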

Problems

Sl. No CGPA Interactiveness Practical Knowledge Communication Skill Job Offer


1 ≥9 Yes Very Good Good Yes
2 ≥8 No Good Moderate Yes
3 ≥9 No Average Poor No
4 <8 No Average Good No
5 ≥8 Yes Good Moderate Yes
6 ≥9 Yes Good Moderate Yes
7 <8 Yes Good Poor No
8 ≥9 No Very Good Good Yes
9 ≥8 Yes Good Good Yes
10 ≥8 Yes Average Good Yes

Step 1: Calculate the Entropy for the target class Job offer
Entropy_Info(Job Offer) = Entropy_Info(7, 3) = -(7/10) log2(7/10) - (3/10) log2(3/10) ≈ 0.8807

Iteration 1:

Step 2: Calculate the Entropy_Info and Information_Gain for each of the attributes in the training dataset.

Entropy of CGPA


CGPA Job Offer = Yes Job Offer = No Total


≥9 3 1 4
≥8 4 0 4
<8 0 2 2

Entropy_Info(Job Offer, CGPA)
= 4/10 × [-(3/4) log2(3/4) - (1/4) log2(1/4)] + 4/10 × [-(4/4) log2(4/4)] + 2/10 × [-(2/2) log2(2/2)]
= 4/10 × 0.8108 + 0 + 0 ≈ 0.3243
Gain(CGPA) = 0.8807 - 0.3243 ≈ 0.5564

Entropy of Interactiveness

Interactivness Job Offer = Yes Job Offer = No Total


Yes 5 1 6
No 2 2 4

Entropy_Info(Job Offer, Interactiveness)
= 6/10 × [-(5/6) log2(5/6) - (1/6) log2(1/6)] + 4/10 × [-(2/4) log2(2/4) - (2/4) log2(2/4)]
= 6/10 × 0.6497 + 4/10 × 1 ≈ 0.7896

Gain(Interactiveness) = 0.8807 - 0.7896 ≈ 0.0911

Entropy of Practical Knowledge

Practical Knowledge Job Offer = Yes Job Offer = No Total


Very Good 2 0 2
Average 1 2 3
Good 4 1 5

Entropy_Info(Job Offer, Practical Knowledge)
= 2/10 × [-(2/2) log2(2/2)] + 3/10 × [-(1/3) log2(1/3) - (2/3) log2(2/3)] + 5/10 × [-(4/5) log2(4/5) - (1/5) log2(1/5)]
= 2/10 × 0 + 3/10 × 0.9177 + 5/10 × 0.7215 ≈ 0.6361
Gain(Practical Knowledge) = 0.8807 - 0.6361 ≈ 0.2446

Entropy of Communication Skill

Communication Skill Job Offer = Yes Job Offer = No Total


Good 4 1 5


Moderate 3 0 3
Poor 0 2 2

Entropy_Info(Job Offer, Communication Skill)
= 5/10 × [-(4/5) log2(4/5) - (1/5) log2(1/5)] + 3/10 × [-(3/3) log2(3/3)] + 2/10 × [-(2/2) log2(2/2)]
= 5/10 × 0.7219 + 3/10 × 0 + 2/10 × 0 ≈ 0.3610
Gain(Communication Skill) = 0.8807 - 0.3610 ≈ 0.5203

Gain Table

Attribute Gain
CGPA 0.5564
Interactiveness 0.0911
Practical Knowledge 0.2446
Communication Skill 0.5203

Step 3: Since the gain of CGPA is maximum, we choose this attribute as the best split attribute and place it at the root node. The resulting tree is as shown below.

Iteration 2:

In this iteration, the same process of computing the Entropy_Info and Gain is repeated with the subset of the training set corresponding to the branch CGPA ≥ 9.
The subset consists of 4 data instances, as shown below.


Interactiveness Practical Knowledge Communication Skill Job Offer


Yes Very Good Good Yes
No Average Poor No
Yes Good Moderate Yes
No Very Good Good Yes

Entropy_Info(Job Offer) = Entropy_Info(3, 1) = -(3/4) log2(3/4) - (1/4) log2(1/4) ≈ 0.8108

Entropy of Interactiveness

Interactivness Job Offer = Yes Job Offer = No Total


Yes 2 0 2
No 1 1 2

Entropy_Info(Job Offer, Interactiveness)
= 2/4 × [-(2/2) log2(2/2)] + 2/4 × [-(1/2) log2(1/2) - (1/2) log2(1/2)]
= 0 + 0.4997 ≈ 0.4997
Gain(Interactiveness) = 0.8108 - 0.4997 ≈ 0.3111

Entropy of Practical Knowledge

Practical Knowledge Job Offer = Yes Job Offer = No Total


Very Good 2 0 2
Average 0 1 1
Good 1 0 1

Entropy_Info(Job Offer, Practical Knowledge)
= 2/4 × [-(2/2) log2(2/2)] + 1/4 × [-(1/1) log2(1/1)] + 1/4 × [-(1/1) log2(1/1)]
= 0 + 0 + 0 = 0
Gain(Practical Knowledge) = 0.8108 - 0 ≈ 0.8108

Entropy of Communication Skill

Communication Skill Job Offer = Yes Job Offer = No Total


Good 2 0 2
Moderate 1 0 1
Poor 0 1 1


Entropy_Info(Job Offer, Communication Skill)
= 2/4 × [-(2/2) log2(2/2)] + 1/4 × [-(1/1) log2(1/1)] + 1/4 × [-(1/1) log2(1/1)]
= 0 + 0 + 0 = 0
Gain(Communication Skill) = 0.8108 - 0 ≈ 0.8108

Gain Table

Attribute Gain
Interactiveness 0.3111
Practical Knowledge 0.8108
Communication Skill 0.8108

Here both the attributes Practical Knowledge and Communication Skill have the same gain, so we can construct the decision tree using either Practical Knowledge or Communication Skill.
The final decision tree is as shown below.
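As a usage check of Iteration 1, the information_gain sketch given earlier reproduces the gain table above (the dictionary encoding of the rows is an assumption; the printed values agree with the worked figures up to rounding):

    data = [
        {"CGPA": ">=9", "Interactiveness": "Yes", "Practical Knowledge": "Very Good", "Communication Skill": "Good",     "Job Offer": "Yes"},
        {"CGPA": ">=8", "Interactiveness": "No",  "Practical Knowledge": "Good",      "Communication Skill": "Moderate", "Job Offer": "Yes"},
        {"CGPA": ">=9", "Interactiveness": "No",  "Practical Knowledge": "Average",   "Communication Skill": "Poor",     "Job Offer": "No"},
        {"CGPA": "<8",  "Interactiveness": "No",  "Practical Knowledge": "Average",   "Communication Skill": "Good",     "Job Offer": "No"},
        {"CGPA": ">=8", "Interactiveness": "Yes", "Practical Knowledge": "Good",      "Communication Skill": "Moderate", "Job Offer": "Yes"},
        {"CGPA": ">=9", "Interactiveness": "Yes", "Practical Knowledge": "Good",      "Communication Skill": "Moderate", "Job Offer": "Yes"},
        {"CGPA": "<8",  "Interactiveness": "Yes", "Practical Knowledge": "Good",      "Communication Skill": "Poor",     "Job Offer": "No"},
        {"CGPA": ">=9", "Interactiveness": "No",  "Practical Knowledge": "Very Good", "Communication Skill": "Good",     "Job Offer": "Yes"},
        {"CGPA": ">=8", "Interactiveness": "Yes", "Practical Knowledge": "Good",      "Communication Skill": "Good",     "Job Offer": "Yes"},
        {"CGPA": ">=8", "Interactiveness": "Yes", "Practical Knowledge": "Average",   "Communication Skill": "Good",     "Job Offer": "Yes"},
    ]

    for attr in ["CGPA", "Interactiveness", "Practical Knowledge", "Communication Skill"]:
        print(attr, round(information_gain(data, attr, "Job Offer"), 4))
    # CGPA ~0.557, Interactiveness ~0.091, Practical Knowledge ~0.245, Communication Skill ~0.520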

C4.5 CONSTRUCTION
 C4.5 is an improvement over ID3. C4.5 works with continuous and discrete attributes and missing
values, and it also supports post-pruning.
 C5.0 is the successor of C4.5 and is more efficient and used for building smaller decision trees.
 C4.5 works with missing values by marking as '?', but these missing attribute values are not
considered in the calculations.
 The algorithm C4.5 is based on Occam's Razor, which says that given two correct solutions, the simpler one has to be chosen.


 Moreover, the algorithm requires a larger training set for better accuracy.
 It uses Gain Ratio as a measure during the construction of decision trees.
 ID3 is more biased towards attributes with a larger number of distinct values. To overcome this bias, C4.5 uses a purity measure called Gain Ratio to identify the best split attribute.
 In the C4.5 algorithm, the Information Gain measure used in ID3 is normalized by computing a factor called Split_Info.
 This normalized information gain, called the Gain Ratio of an attribute, is computed as the ratio of the Information Gain of the attribute to its Split_Info.
 The attribute with the highest normalized information gain, that is, the highest gain ratio, is used as the splitting criterion.
 Given a training dataset T,
The Split_Info of an attribute A is computed as
Split_Info(T, A) = - ∑ i=1..v (|Ai| / |T|) log2(|Ai| / |T|)

where the attribute A has 'v' distinct values {a1, a2, ..., av}, and |Ai| is the number of instances with the i-th distinct value of attribute A.
The Gain Ratio of an attribute A is computed as
Gain_Ratio(A) = Info_Gain(A) / Split_Info(T, A)
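A minimal sketch of the two formulas above (split_info and gain_ratio are assumed names; information_gain is the helper sketched in the ID3 section):

    import math
    from collections import Counter

    def split_info(rows, attribute):
        # Split_Info(T, A) = - sum over the distinct values ai of (|Ai|/|T|) * log2(|Ai|/|T|)
        total = len(rows)
        info = 0.0
        for count in Counter(r[attribute] for r in rows).values():
            p = count / total
            info -= p * math.log2(p)
        return info

    def gain_ratio(rows, attribute, target):
        # Gain_Ratio(A) = Info_Gain(A) / Split_Info(T, A)
        si = split_info(rows, attribute)
        return information_gain(rows, attribute, target) / si if si > 0 else 0.0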

Algorithm: Procedure to Construct a Decision Tree using C4.5


1. Compute Entropy_Info for the whole training dataset based on the target attribute
2. Compute Entropy_Info, Info_Gain, Split_Info and Gain_Ratio for each of the attributes in the training dataset.
3. Choose the attribute for which Gain_Ratio is maximum as the best split attribute.
4. The best split attribute is placed as the root node.
5. The root node is branched into subtrees with each subtree as an outcome of the test condition of the
root node attribute. Accordingly, the training dataset is also split into subsets.
6. Recursively apply the same operation for the subset of the training set with the remaining attributes
until a leaf node is derived or no more training instances are available in the subset.

Problems

Sl. No CGPA Interactiveness Practical Knowledge Communication Skill Job Offer


1 ≥9 Yes Very Good Good Yes
2 ≥8 No Good Moderate Yes
3 ≥9 No Average Poor No
4 <8 No Average Good No


5 ≥8 Yes Good Moderate Yes


6 ≥9 Yes Good Moderate Yes
7 <8 Yes Good Poor No
8 ≥9 No Very Good Good Yes
9 ≥8 Yes Good Good Yes
10 ≥8 Yes Average Good Yes

Step 1: Calculate the Entropy for the target class Job offer
Entropy_Info(Job Offer) = Entropy_Info(7, 3) = -(7/10) log2(7/10) - (3/10) log2(3/10) ≈ 0.8807

Iteration 1:
Step 2: Calculate the Entropy_Info and Information_Gain for each of the attributes in the training dataset.

Entropy of CGPA
CGPA Job Offer = Yes Job Offer = No Total
≥9 3 1 4
≥8 4 0 4
<8 0 2 2

Entropy_Info(Job Offer, CGPA)
= 4/10 × [-(3/4) log2(3/4) - (1/4) log2(1/4)] + 4/10 × [-(4/4) log2(4/4)] + 2/10 × [-(2/2) log2(2/2)]
= 4/10 × 0.8108 + 0 + 0 ≈ 0.3243
Gain(CGPA) = 0.8807 - 0.3243 ≈ 0.5564

Split_Info(T, CGPA) = -(4/10) log2(4/10) - (4/10) log2(4/10) - (2/10) log2(2/10)
= 0.5285 + 0.5285 + 0.4641 ≈ 1.5211

Gain_Ratio(CGPA) = Gain(CGPA) / Split_Info(T, CGPA)
= 0.5564 / 1.5211 ≈ 0.3658

Entropy of Interactiveness

Interactivness Job Offer = Yes Job Offer = No Total


Yes 5 1 6
No 2 2 4


Entropy_Info(Job Offer, Interactiveness)
= 6/10 × [-(5/6) log2(5/6) - (1/6) log2(1/6)] + 4/10 × [-(2/4) log2(2/4) - (2/4) log2(2/4)]
= 6/10 × 0.6497 + 4/10 × 1 ≈ 0.7896

Gain(Interactiveness) = 0.8807 - 0.7896 ≈ 0.0911

Split_Info(T, Interactiveness) = -(6/10) log2(6/10) - (4/10) log2(4/10)
≈ 0.9704

Gain_Ratio(Interactiveness) = Gain(Interactiveness) / Split_Info(T, Interactiveness)
= 0.0911 / 0.9704 ≈ 0.0939

Entropy of Practical Knowledge

Practical Knowledge Job Offer = Yes Job Offer = No Total


Very Good 2 0 2
Average 1 2 3
Good 4 1 5

Entropy_Info(Job Offer, Practical Knowledge)
= 2/10 × [-(2/2) log2(2/2)] + 3/10 × [-(1/3) log2(1/3) - (2/3) log2(2/3)] + 5/10 × [-(4/5) log2(4/5) - (1/5) log2(1/5)]
= 2/10 × 0 + 3/10 × 0.9177 + 5/10 × 0.7215 ≈ 0.6361
Gain(Practical Knowledge) = 0.8807 - 0.6361 ≈ 0.2446

Split_Info(T, Practical Knowledge) = -(2/10) log2(2/10) - (3/10) log2(3/10) - (5/10) log2(5/10)
≈ 1.4853

Gain_Ratio(Practical Knowledge) = Gain(Practical Knowledge) / Split_Info(T, Practical Knowledge)
= 0.2446 / 1.4853 ≈ 0.1648

Entropy of Communication Skill

Communication Skill Job Offer = Yes Job Offer = No Total


Good 4 1 5
Moderate 3 0 3
Poor 0 2 2


Entropy_Info(Job Offer, Communication Skill)
= 5/10 × [-(4/5) log2(4/5) - (1/5) log2(1/5)] + 3/10 × [-(3/3) log2(3/3)] + 2/10 × [-(2/2) log2(2/2)]
= 5/10 × 0.7219 + 3/10 × 0 + 2/10 × 0 ≈ 0.3610
Gain(Communication Skill) = 0.8807 - 0.3610 ≈ 0.5203

Split_Info(T, Communication Skill) = -(5/10) log2(5/10) - (3/10) log2(3/10) - (2/10) log2(2/10)
≈ 1.4853

Gain_Ratio(Communication Skill) = Gain(Communication Skill) / Split_Info(T, Communication Skill)
= 0.5203 / 1.4853 ≈ 0.3502

Gain Table
Attribute Gain_Ratio
CGPA 0.3658
Interactiveness 0.0939
Practical Knowledge 0.1648
Communication Skill 0.3502

Step 3: Since the Gain_Ratio of CGPA is maximum, we choose this attribute as the best split attribute and place it at the root node. The resulting tree is as shown below.

Iteration 2:

In this iteration, the same process of computing the Entropy_Info, Info_Gain, Split_Info and Gain_Ratio is repeated with the subset of the training set corresponding to the branch CGPA ≥ 9.
The subset consists of 4 data instances, as shown below.


Interactiveness Practical Knowledge Communication Skill Job Offer


Yes Very Good Good Yes
No Average Poor No
Yes Good Moderate Yes
No Very Good Good Yes

Entropy_Info(Job Offer) = Entropy_Info(3, 1) = -(3/4) log2(3/4) - (1/4) log2(1/4) ≈ 0.8108

Entropy of Interactiveness

Interactivness Job Offer = Yes Job Offer = No Total


Yes 2 0 2
No 1 1 2

Entropy_Info(Job Offer, Interactiveness)
= 2/4 × [-(2/2) log2(2/2)] + 2/4 × [-(1/2) log2(1/2) - (1/2) log2(1/2)]
= 0 + 0.4997 ≈ 0.4997
Gain(Interactiveness) = 0.8108 - 0.4997 ≈ 0.3111

Split_Info(T, Interactiveness) = -(2/4) log2(2/4) - (2/4) log2(2/4) = 1
Gain_Ratio(Interactiveness) = Gain(Interactiveness) / Split_Info(T, Interactiveness)
= 0.3111 / 1 ≈ 0.3111

Entropy of Practical Knowledge

Practical Knowledge Job Offer = Yes Job Offer = No Total


Very Good 2 0 2
Average 0 1 1
Good 1 0 1

Entropy_Info(Job Offer, Practical Knowledge)
= 2/4 × [-(2/2) log2(2/2)] + 1/4 × [-(1/1) log2(1/1)] + 1/4 × [-(1/1) log2(1/1)]
= 0 + 0 + 0 = 0
Gain(Practical Knowledge) = 0.8108 - 0 ≈ 0.8108

Split_Info(T, Practical Knowledge) = -(2/4) log2(2/4) - (1/4) log2(1/4) - (1/4) log2(1/4)
= 1.5
Gain_Ratio(Practical Knowledge) = Gain(Practical Knowledge) / Split_Info(T, Practical Knowledge)
= 0.8108 / 1.5 ≈ 0.5408

Entropy of Communication Skill

Communication Skill Job Offer = Yes Job Offer = No Total


Good 2 0 2
Moderate 1 0 1
Poor 0 1 1

Entropy_Info(Job Offer, Communication Skill)
= 2/4 × [-(2/2) log2(2/2)] + 1/4 × [-(1/1) log2(1/1)] + 1/4 × [-(1/1) log2(1/1)]
= 0 + 0 + 0 = 0
Gain(Communication Skill) = 0.8108 - 0 ≈ 0.8108

Split_Info(T, Communication Skill) = -(2/4) log2(2/4) - (1/4) log2(1/4) - (1/4) log2(1/4)
= 1.5

Gain_Ratio(Communication Skill) = Gain(Communication Skill) / Split_Info(T, Communication Skill)
= 0.8108 / 1.5 ≈ 0.5408

Gain Table
Attribute Gain_Ratio
Interactiveness 0.3111
Practical Knowledge 0.5408
Communication Skill 0.5408

Here both the attributes Practical Knowledge and Communication Skill have the same Gain_Ratio, so we can construct the decision tree using either Practical Knowledge or Communication Skill.

The final decision tree is as shown below.


Dealing with Continuous Attributes in C4.5


 The C4.5 algorithm is further improved by considering attributes which are continuous; a continuous attribute is discretized by finding a split point or threshold.
 When an attribute 'A' has numerical values which are continuous, a threshold or best split point 's' is found such that the set of values is categorized into two sets, A ≤ s and A > s.
 The best split point is the attribute value which has the maximum information gain for that attribute.

Example:

Sl. No CGPA Job Offer


1 9.5 Yes
2 8.2 Yes
3 9.1 No
4 6.8 No
5 8.5 Yes
6 9.5 Yes
7 7.9 No
8 9.1 Yes
9 8.8 Yes
10 8.8 Yes


Solution:
First sort the values in ascending order, remove the duplicates and consider only the unique values of the attribute:
6.8 7.9 8.2 8.5 8.8 9.1 9.5

Compute the gain for each distinct value of this continuous attribute.

Candidate split points and the class distribution on each side (≤ / >):

Threshold 6.8: ≤ (Yes 0, No 1), > (Yes 7, No 2)
Threshold 7.9: ≤ (Yes 0, No 2), > (Yes 7, No 1)
Threshold 8.2: ≤ (Yes 1, No 2), > (Yes 6, No 1)
Threshold 8.5: ≤ (Yes 2, No 2), > (Yes 5, No 1)
Threshold 8.8: ≤ (Yes 4, No 2), > (Yes 3, No 1)
Threshold 9.1: ≤ (Yes 5, No 3), > (Yes 2, No 0)
Threshold 9.5: ≤ (Yes 7, No 3), > (Yes 0, No 0)

Entropy(0,1) = 0

Entropy(7,2) =  0.763

Entropy_Info(T, CGPA ∈ 6.8 = 1/10 *Entropy (0,1) + 9/10 *Entropy (7,2)

= 1/10 *0 + 9/10 *0.763  0.6873

Gain(CGPA ∈ 6.8) = 0.8808 – 0.6873  0.1935

Entropy(0,2) = 0

Entropy(7,1) =  0.543

Entropy_Info(T, CGPA ∈ 7.9) = 2/10 *Entropy (0,2) + 8/10 *Entropy (7,1)

= 2/10 *0 + 8/10 *0.543  0.434

Gain(CGPA ∈ 7.9) = 0.8808 – 0.4346  0.4462

Entropy(1,2) =  0.917

Entropy(6,1) =  0.591

Entropy_Info(T, CGPA ∈ 8.2) = 3/10 *Entropy (1,2) + 7/10 *Entropy (6,1)

Prepared by Ms. Shilpa, Assistant Professor, Dept. of CSE, NCEH 18


Artificial Intelligence and Machine Learning (21CS54)

= 3/10 *0.9177 + 7/10 *0.5913  0.6892

Gain(CGPA ∈ 8.2) = 0.8808 – 0.6892  0.1916

Entropy(2,2) = 1

Entropy(5,1) =  0.649

Entropy_Info(T, CGPA ∈ 8.5) = 4/10 *Entropy (2,2) + 6/10 *Entropy (5,1)

= 4/10 *1 + 6/10 *0.649  0.7898

Gain(CGPA ∈ 8.5) = 0.8808 – 0.7898  0.091

Entropy(4,2) =  0.917

Entropy(3,1) =  0.810

Entropy_Info(T, CGPA ∈ 8.8) = 6/10 *Entropy (4,2) + 4/10 *Entropy (3,1)

= 6/10 *0.9177 + 4/10 *0.8108  0.8749

Gain(CGPA ∈ 8.8) = 0.8808 – 0.8749 0.0059

Entropy(5,3) =  0.953

Entropy(2,0) = 0

Entropy_Info(T, CGPA ∈ 9.1) = 8/10 *Entropy (5,3) + 2/10 *Entropy (2,0)

= 8/10 *0.953 + 2/10 *0  0.7630

Gain(CGPA ∈ 9.1) = 0.8808 – 0.7630 0.1178

Entropy(7,3) =  0.880

Entropy_Info(T, CGPA ∈ 9.1) = 10/10 *Entropy (7,3) + 0/10 *Entropy (0,0)

= 10/10 *0.880 + 0/10 *0  0.880

Gain(CGPA ∈ 9.1) = 0.8808 – 0.880 0


Summary of the candidate split points:

Threshold 6.8: Entropy(≤) = 0, Entropy(>) = 0.763, Entropy_Info = 0.687, Gain = 0.1935
Threshold 7.9: Entropy(≤) = 0, Entropy(>) = 0.543, Entropy_Info = 0.434, Gain = 0.4462
Threshold 8.2: Entropy(≤) = 0.917, Entropy(>) = 0.591, Entropy_Info = 0.689, Gain = 0.1916
Threshold 8.5: Entropy(≤) = 1, Entropy(>) = 0.649, Entropy_Info = 0.789, Gain = 0.091
Threshold 8.8: Entropy(≤) = 0.917, Entropy(>) = 0.810, Entropy_Info = 0.874, Gain = 0.0059
Threshold 9.1: Entropy(≤) = 0.953, Entropy(>) = 0, Entropy_Info = 0.763, Gain = 0.1178
Threshold 9.5: Entropy(≤) = 0.880, Entropy(>) = 0, Entropy_Info = 0.880, Gain = 0

Since CGPA with threshold 7.9 has the maximum gain of 0.4462, CGPA ∈ 7.9 is chosen as the split point.

We can discretize the continuous values of CGPA into 2 categories, CGPA ≤ 7.9 and CGPA > 7.9. The resulting discretized instances are shown in the table below.

Sl. No   CGPA (Continuous)   CGPA (Discretized)   Job Offer

1 9.5 > 7.9 Yes

2 8.2 > 7.9 Yes

3 9.1 > 7.9 No

4 6.8 ≤ 7.9 No

5 8.5 > 7.9 Yes

6 9.5 > 7.9 Yes

7 7.9 ≤ 7.9 No

8 9.1 > 7.9 Yes

9 8.8 > 7.9 Yes

10 8.8 > 7.9 Yes
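The threshold search carried out above can be sketched as follows (best_split_point is an assumed name; it reuses entropy_info from the earlier sketch and tries each distinct value as the candidate threshold, exactly as in the worked example):

    from collections import Counter

    def best_split_point(values, labels):
        # Pick the threshold t maximising the gain of the binary split value <= t vs. value > t.
        base = entropy_info(list(Counter(labels).values()))
        best_t, best_gain = None, -1.0
        for t in sorted(set(values)):
            left  = [y for x, y in zip(values, labels) if x <= t]
            right = [y for x, y in zip(values, labels) if x > t]
            weighted = (len(left) / len(labels)) * entropy_info(list(Counter(left).values())) \
                     + (len(right) / len(labels)) * entropy_info(list(Counter(right).values()))
            gain = base - weighted
            if gain > best_gain:
                best_t, best_gain = t, gain
        return best_t, best_gain

    cgpa  = [9.5, 8.2, 9.1, 6.8, 8.5, 9.5, 7.9, 9.1, 8.8, 8.8]
    offer = ["Yes", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes"]
    print(best_split_point(cgpa, offer))   # (7.9, ~0.446): the split point chosen above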


Classification and Regression Trees Construction


 The Classification and Regression Trees (CART) algorithm is a multivariate decision tree learning algorithm used for classifying categorical and continuous-valued target variables.
 The CART algorithm is an example of a multivariate decision tree that gives oblique splits.
 It solves both classification and regression problems. If the target feature is categorical, it constructs a classification tree, and if the target feature is continuous, it constructs a regression tree.
 CART uses the GINI Index to construct a decision tree.
 The GINI Index measures the impurity of a set of data instances; it is based on the proportion of instances belonging to each class.
 It constructs the tree as a binary tree by recursively splitting a node into two nodes. Therefore, even if an attribute has more than two possible values, the GINI Index is calculated for all subsets of the attribute's values, and the subset which gives the minimum value is selected as the best split subset.
 For example, if an attribute A has three distinct values, say {a1, a2, a3}, the possible subsets are { }, {a1}, {a2}, {a3}, {a1, a2}, {a1, a3}, {a2, a3} and {a1, a2, a3}. So, if an attribute has 3 distinct values, the number of possible subsets is 2³ = 8.
 Excluding the empty set { } and the full set {a1, a2, a3}, we have 6 subsets. With 6 subsets, we can
form three possible combinations such as:
o {a1} with {a2, a3}
o {a2} with {a1, a3}
o {a3} with {a1, a2}
 Hence, in the CART algorithm, we need to compute both the best splitting attribute and the best split subset of the chosen attribute.
 Lower the GINI value, higher is the homogeneity of the data instances.
 Gini_Index(T) is computed as
Gini_Index(T) = 1 - ∑ Pi²  (summed over all classes)

where Pi is the probability that a data instance or a tuple 'd' belongs to class Ci, given by

Pi = |Ci, T| / |T|

 GINI Index assumes a binary split on each attribute, therefore, every attribute is considered as a
binary attribute which splits the data instances into two subsets S1, and S2

Gini_Index(T, A) = (|S1| / |T|) × Gini_Index(S1) + (|S2| / |T|) × Gini_Index(S2)

 The splitting subset with minimum Gini_Index is chosen as the best splitting subset for an attribute.
The best splitting attribute is chosen by the minimum Gini_Index which is otherwise maximum
∆Gini because it reduces the impurity.


∆Gini is computed as ∆Gini (A) = Gini(T) - Gini(T, A)
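A small sketch of the Gini computations defined above (gini_index, gini_for_binary_split and best_subset are assumed names; the subset enumeration mirrors CART's binary splits described earlier):

    from collections import Counter
    from itertools import combinations

    def gini_index(labels):
        # Gini_Index(T) = 1 - sum of Pi^2 over the classes present in labels.
        total = len(labels)
        return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

    def gini_for_binary_split(rows, attribute, subset, target):
        # Weighted Gini_Index(T, A) for the split: attribute value in subset (S1) vs. not (S2).
        s1 = [r[target] for r in rows if r[attribute] in subset]
        s2 = [r[target] for r in rows if r[attribute] not in subset]
        n = len(rows)
        return (len(s1) / n) * gini_index(s1) + (len(s2) / n) * gini_index(s2)

    def best_subset(rows, attribute, target):
        # Try every non-trivial subset of the attribute's values and keep the minimum Gini.
        values = sorted({r[attribute] for r in rows})
        best = None
        for k in range(1, len(values)):
            for subset in combinations(values, k):
                g = gini_for_binary_split(rows, attribute, set(subset), target)
                if best is None or g < best[1]:
                    best = (set(subset), g)
        return best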

Algorithm: Procedure to Construct a Decision Tree using CART


1. Compute Gini_Index for the whole training dataset based on the target attribute.

2. Compute Gini_Index for each of the attribute and for the subsets of each attribute in the training
dataset.

3. Choose the best splitting subset which has minimum Gini_Index for an attribute.

4. Compute ∆Gini for the best splitting subset of that attribute.

5. Choose the best splitting attribute that has maximum ∆Gini.

6. The best split attribute with the best split subset is placed as the root node.

7. The root node is branched into two subtrees with each subtree an outcome of the test condition of the
root node attribute. Accordingly, the training dataset is also split into two subsets.

8. Recursively apply the same operation for the subset of the training set with the remaining attributes
until a leaf node is derived or no more training instances are available in the subset.

Problem 1:
Construct a decision tree using CART algorithm for the following dataset

Sl. No CGPA Interactiveness Practical Knowledge Communication Skill Job Offer


1 ≥9 Yes Very Good Good Yes
2 ≥8 No Good Moderate Yes
3 ≥9 No Average Poor No
4 <8 No Average Good No
5 ≥8 Yes Good Moderate Yes
6 ≥9 Yes Good Moderate Yes
7 <8 Yes Good Poor No
8 ≥9 No Very Good Good Yes
9 ≥8 Yes Good Good Yes
10 ≥8 Yes Average Good Yes

Solution
Step 1: Calculate the Gini_Index for the dataset, which consists of 10 data instances. The target attribute 'Job Offer' has 7 instances as Yes and 3 instances as No.

Gini_Index(T) = 1 - (7/10)² - (3/10)²

= 1 - 0.49 - 0.09 = 1 - 0.58


Gini_Index(T) = 0.42

Attribute- CGPA

CGPA Job Offer = Yes Job Offer = No


≥9 3 1
≥8 4 0
<8 0 2

Gini_Index(T, CGPA ∈ {≥ 9, ≥ 8}) = 1 - (7/8)² - (1/8)² = 1 - 0.7806 ≈ 0.2194

Gini_Index(T, CGPA ∈ {< 8}) = 1 - (0/2)² - (2/2)² = 1 - 1 = 0

Gini_Index(T, CGPA ∈ {≥ 9, ≥ 8}, {< 8}) = (8/10) × 0.2194 + (2/10) × 0 ≈ 0.17552

Gini_Index(T, CGPA ∈ {≥ 9, < 8}) = 1 - (3/6)² - (3/6)² = 1 - 0.5 = 0.5

Gini_Index(T, CGPA ∈ {≥ 8}) = 1 - (4/4)² - (0/4)² = 1 - 1 = 0

Gini_Index(T, CGPA ∈ {≥ 9, < 8}, {≥ 8}) = (6/10) × 0.5 + (4/10) × 0 = 0.3

Gini_Index(T, CGPA ∈ {≥ 8, < 8}) = 1 - (4/6)² - (2/6)² = 1 - 0.555 ≈ 0.445

Gini_Index(T, CGPA ∈ {≥ 9}) = 1 - (3/4)² - (1/4)² = 1 - 0.625 = 0.375

Gini_Index(T, CGPA ∈ {≥ 8, < 8}, {≥ 9}) = (6/10) × 0.445 + (4/10) × 0.375 ≈ 0.417

Gini_Index of CGPA

Subset Gini_Index
{≥ 9, ≥ 8} {< 8} 0.1755
{≥ 9, < 8} {≥ 8} 0.3
{≥ 8, < 8} {≥ 9} 0.417

∆Gini(CGPA) = Gini(T) - Gini(T, CGPA)

= 0.42 - 0.1755 ≈ 0.2445

Attribute- Interactiveness

Interactiveness Job Offer = Yes Job Offer = No


Yes 5 1
No 2 2

Gini_Index(T, Interactiveness ∈ {Yes}) = 1 - (5/6)² - (1/6)² = 1 - 0.72 ≈ 0.28

Gini_Index(T, Interactiveness ∈ {No}) = 1 - (2/4)² - (2/4)² = 1 - 0.5 = 0.5

Gini_Index(T, Interactiveness ∈ {Yes}, {No}) = (6/10) × 0.28 + (4/10) × 0.5 = 0.168 + 0.2 ≈ 0.368

Gini_Index of Interactiveness

Subset Gini_Index
{Yes} {No} 0.368

Compute ∆Gini for the best splitting subset of that attribute:

∆Gini(Interactiveness) = Gini(T) - Gini(T, Interactiveness)

= 0.42 - 0.368 ≈ 0.052

Attribute – Practical Knowledge

Practical Knowledge Job Offer = Yes Job Offer = No

Very Good 2 0
Good 4 1
Average 1 2

Gini_Index(T, Practical Knowledge ∈ {Very Good, Good}) = 1 - (6/7)² - (1/7)² = 1 - 0.7544 ≈ 0.2456

Gini_Index(T, Practical Knowledge ∈ {Average}) = 1 - (1/3)² - (2/3)² = 1 - 0.555 ≈ 0.445

Gini_Index(T, Practical Knowledge ∈ {Very Good, Good}, {Average}) = (7/10) × 0.245 + (3/10) × 0.445 ≈ 0.3054

Gini_Index(T, Practical Knowledge ∈ {Very Good, Average}) = 1 - (3/5)² - (2/5)² = 1 - 0.52 = 0.48

Gini_Index(T, Practical Knowledge ∈ {Good}) = 1 - (4/5)² - (1/5)² = 1 - 0.68 = 0.32

Gini_Index(T, Practical Knowledge ∈ {Very Good, Average}, {Good}) = (5/10) × 0.48 + (5/10) × 0.32 = 0.40

Gini_Index(T, Practical Knowledge ∈ {Good, Average}) = 1 - (5/8)² - (3/8)² = 1 - 0.5312 ≈ 0.4688

Gini_Index(T, Practical Knowledge ∈ {Very Good}) = 1 - (2/2)² - (0/2)² = 1 - 1 = 0

Gini_Index(T, Practical Knowledge ∈ {Good, Average}, {Very Good}) = (8/10) × 0.4688 + (2/10) × 0 ≈ 0.3750

Gini_Index for Practical Knowledge

Subset Gini_Index
{Very Good, Good} {Average} 0.3054
{Very Good, Average} {Good} 0.40
{Good, Average} {Very Good} 0.3750

∆Gini(Practical Knowledge) = Gini(T) - Gini(T, Practical Knowledge)

= 0.42 - 0.3054 ≈ 0.1146


Attribute – Communication Skill

Communication Skill Job Offer = Yes Job Offer = No

Good 4 1
Moderate 3 0
Poor 0 2

Gini_Index(T, Communication Skill ∈ {Good, Moderate}) = 1 - (7/8)² - (1/8)² = 1 - 0.7806 ≈ 0.2194

Gini_Index(T, Communication Skill ∈ {Poor}) = 1 - (0/2)² - (2/2)² = 1 - 1 = 0

Gini_Index(T, Communication Skill ∈ {Good, Moderate}, {Poor}) = (8/10) × 0.2194 + (2/10) × 0 ≈ 0.1755

Gini_Index(T, Communication Skill ∈ {Good, Poor}) = 1 - (4/7)² - (3/7)² = 1 - 0.5101 ≈ 0.489

Gini_Index(T, Communication Skill ∈ {Moderate}) = 1 - (3/3)² - (0/3)² = 1 - 1 = 0

Gini_Index(T, Communication Skill ∈ {Good, Poor}, {Moderate}) = (7/10) × 0.489 + (3/10) × 0 ≈ 0.342

Gini_Index(T, Communication Skill ∈ {Moderate, Poor}) = 1 - (3/5)² - (2/5)² = 1 - 0.52 = 0.48

Gini_Index(T, Communication Skill ∈ {Good}) = 1 - (4/5)² - (1/5)² = 1 - 0.68 = 0.32

Gini_Index(T, Communication Skill ∈ {Moderate, Poor}, {Good}) = (5/10) × 0.48 + (5/10) × 0.32 = 0.40

Gini_Index for Communication Skill

Subset Gini_Index
{Good, Moderate} {Poor} 0.1755
{Good, Poor} {Moderate} 0.3429
{Moderate, Poor} {Good} 0.40

∆Gini(Communication Skill) = Gini(T) - Gini(T, Communication Skill)

= 0.42 - 0.1755 ≈ 0.2445

Gini_Index and ∆Gini values of all the attributes

Attribute Gini_index ∆Gini


CGPA 0.1755 0.2445
Interactiveness 0.368 0.052
Practical Knowledge 0.3054 0.1146
Communication Skills 0.1755 0.2445

Since both CGPA and Communication Skill have the same ∆Gini value, which is the highest among the attributes, we can choose either CGPA or Communication Skill as the best split attribute. Here we have chosen CGPA.


Table: Subset of the Training Dataset after Iteration 1

Sl. No CGPA Interactiveness Practical Knowledge Communication Skill Job Offer


1 ≥9 Yes Very Good Good Yes
2 ≥8 No Good Moderate Yes
3 ≥9 No Average Poor No
5 ≥8 Yes Good Moderate Yes
6 ≥9 Yes Good Moderate Yes
8 ≥9 No Very Good Good Yes
9 ≥8 Yes Good Good Yes
10 ≥8 Yes Average Good Yes

Gini_Index(T) = 1 - (7/8)² - (1/8)² = 1 - 0.766 - 0.0156 ≈ 0.2184

Attribute- Interactiveness

Interactiveness Job Offer = Yes Job Offer = No


Yes 5 0
No 2 1

Gini_Index(T, Interactiveness ∈ {Yes}) = 1 - (5/5)² - (0/5)² = 1 - 1 = 0

Gini_Index(T, Interactiveness ∈ {No}) = 1 - (2/3)² - (1/3)² = 1 - 0.44 - 0.111 ≈ 0.449

Gini_Index(T, Interactiveness ∈ {Yes}, {No}) = (5/8) × 0 + (3/8) × 0.449 ≈ 0.168

Gini_Index of Interactiveness

Subset Gini_Index
{Yes} {No} 0.168

Compute ∆Gini for the best splitting subset of that attribute:

∆Gini(Interactiveness) = Gini(T) - Gini(T, Interactiveness)

= 0.2184 - 0.168 ≈ 0.050

Attribute – Practical Knowledge

Practical Knowledge Job Offer = Yes Job Offer = No

Very Good 2 0
Good 4 0
Average 1 1

Gini_Index(T, Practical Knowledge ∈ {Very Good, Good}) = 1 - (6/6)² - (0/6)² = 1 - 1 = 0

Gini_Index(T, Practical Knowledge ∈ {Average}) = 1 - (1/2)² - (1/2)² = 1 - 0.25 - 0.25 = 0.5

Gini_Index(T, Practical Knowledge ∈ {Very Good, Good}, {Average}) = (6/8) × 0 + (2/8) × 0.5 = 0.125

Gini_Index(T, Practical Knowledge ∈ {Very Good, Average}) = 1 - (3/4)² - (1/4)² = 0.375

Gini_Index(T, Practical Knowledge ∈ {Good}) = 1 - (4/4)² - (0/4)² = 1 - 1 = 0

Gini_Index(T, Practical Knowledge ∈ {Very Good, Average}, {Good}) = (4/8) × 0.375 + (4/8) × 0 ≈ 0.187

Gini_Index(T, Practical Knowledge ∈ {Good, Average}) = 1 - (5/6)² - (1/6)² ≈ 0.278

Gini_Index(T, Practical Knowledge ∈ {Very Good}) = 1 - (2/2)² - (0/2)² = 1 - 1 = 0

Gini_Index(T, Practical Knowledge ∈ {Good, Average}, {Very Good}) = (6/8) × 0.278 + (2/8) × 0 ≈ 0.2085

Gini_Index for Practical Knowledge

Subset Gini_Index
{Very Good, Good} {Average} 0.125
{Very Good, Average} {Good} 0.1875
{Good, Average} {Very Good} 0.2085

∆Gini(Practical Knowledge) = Gini(T) - Gini(T, Practical Knowledge)

= 0.2184 - 0.125 ≈ 0.0934

Attribute – Communication Skill

Communication Skill Job Offer = Yes Job Offer = No

Good 4 0
Moderate 3 0
Poor 0 1


Gini_Index(T, Communication Skill ∈ {Good, Moderate}) = 1 - (7/7)² - (0/7)² = 1 - 1 = 0

Gini_Index(T, Communication Skill ∈ {Poor}) = 1 - (0/1)² - (1/1)² = 1 - 1 = 0

Gini_Index(T, Communication Skill ∈ {Good, Moderate}, {Poor}) = (7/8) × 0 + (1/8) × 0 = 0

Gini_Index(T, Communication Skill ∈ {Good, Poor}) = 1 - (4/5)² - (1/5)² = 1 - 0.64 - 0.04 = 0.32

Gini_Index(T, Communication Skill ∈ {Moderate}) = 1 - (3/3)² - (0/3)² = 1 - 1 = 0

Gini_Index(T, Communication Skill ∈ {Good, Poor}, {Moderate}) = (5/8) × 0.32 + (3/8) × 0 = 0.2

Gini_Index(T, Communication Skill ∈ {Moderate, Poor}) = 1 - (3/4)² - (1/4)² = 0.375

Gini_Index(T, Communication Skill ∈ {Good}) = 1 - (4/4)² - (0/4)² = 1 - 1 = 0

Gini_Index(T, Communication Skill ∈ {Moderate, Poor}, {Good}) = (4/8) × 0.375 + (4/8) × 0 ≈ 0.1875

Gini_Index for Communication Skill

Subset Gini_Index
{Good, Moderate} {Poor} 0
{Good, Poor} {Moderate} 0.2
{Moderate, Poor} {Good} 0.1875

∆Gini(Communication Skill) = Gini(T) - Gini(T, Communication Skill)

= 0.2184 - 0 ≈ 0.2184

Gini_Index and ∆Gini values of all the attributes

Attribute Gini_index ∆Gini

Interactiveness 0.168 0.050
Practical Knowledge 0.125 0.0934
Communication Skill 0 0.2184

Since Communication Skill has the minimum Gini_Index (maximum ∆Gini), it is chosen as the next split attribute.


Problem 2:

Apply CART algorithm on the below dataset and construct a decision Tree.

Assessment Assignment Project Seminar Result


Good Yes Yes Good Pass
Average Yes No Poor Fail
Good No Yes Good Pass
Poor No No Poor Fail
Good Yes Yes Good Pass
Average No Yes Good Pass
Good No No Fair Pass
Poor Yes Yes Good Fail
Average No No Poor Fail
Good Yes Yes Fair Pass

Solution:

Regression Trees
Regression trees are a variant of decision trees where the target feature is a continuous valued variable.
These trees can be constructed using an algorithm called reduction in variance which uses standard
deviation to choose the best splitting attribute.

Algorithm: Procedure for Constructing Regression Trees

1. Compute standard deviation for each attribute with respect to target attribute.
2. Compute standard deviation for the number of data instances of each distinct value of an attribute.
3. Compute weighted standard deviation for each attribute.
4. Compute standard deviation reduction by subtracting weighted standard deviation for each attribute
from standard deviation of each attribute.
5. Choose the attribute with a higher standard deviation reduction as the best split attribute
6. The best split attribute is placed as the root node.
7. The root node is branched into subtrees with each subtree as an outcome of the test condition of the
root node attribute. Accordingly, the training dataset is also split into different subsets.
8. Recursively apply the same operation for the subset of the training set with the remaining attributes
until a leaf node is derived or no more training instances are available in the subset
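A hedged sketch of the standard deviation reduction measure used in the algorithm above (sd_reduction is an assumed name; the target values are assumed to be numeric):

    from collections import defaultdict
    from statistics import pstdev

    def sd_reduction(rows, attribute, target):
        # Standard deviation of the target minus the weighted standard deviation
        # after splitting on the attribute; a larger reduction means a better split.
        total_sd = pstdev([r[target] for r in rows])

        groups = defaultdict(list)
        for r in rows:
            groups[r[attribute]].append(r[target])

        weighted_sd = sum((len(v) / len(rows)) * pstdev(v) for v in groups.values())
        return total_sd - weighted_sd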

Problem: refer to class notes.


VALIDATING AND PRUNING OF DECISION TREES

 Overfitting is a general problem with decision trees. Once the decision tree is constructed it must be
validated for better accuracy and to avoid over-fitting and under-fitting.
 There is always a tradeoff between accuracy and complexity of the tree. The tree must be simple
and accurate.
 If the tree is more complex, it can classify the data instances accurately for the training set but when
test data is given, the tree constructed may perform poorly which means misclassifications are
higher and accuracy is reduced. This problem is called as over-fitting.
 To avoid overfitting of the tree, one need to prune the trees and construct an optimal decision tree.
Trees can be pre-pruned or post-pruned.
 If tree nodes are pruned during construction, or the construction is stopped early without exploring the nodes' branches, then it is called pre-pruning.
 If tree nodes are pruned after the construction is over, then it is called post-pruning.
 Basically, the dataset is split into three sets called training dataset, validation dataset and test
dataset. Generally, 40% of the dataset is used for training the decision tree and the remaining 60%
is used for validation and testing.
 Once the decision tree is constructed, it is validated with the validation dataset and the
misclassifications are identified.
 Using the number of instances correctly classified and number of instances wrongly classified,
Average Squared Error (ASE) is computed.
 The tree nodes are pruned based on these computations and the resulting tree is validated until we
get a tree that performs better.
 Cross validation is another way to construct an optimal decision tree.
 Another approach is that after the tree is constructed using the training set, statistical tests like error
estimation and Chi-square test are used to estimate whether pruning or splitting is required for a
particular node to find a better accurate tree.
 The third approach uses a principle called Minimum Description Length, which uses a complexity measure for encoding the training set; the growth of the decision tree is stopped when the encoding size, i.e., size(tree) + size(misclassifications(tree)), is minimized.
 Some of the tree pruning methods are listed below:
1. Reduced Error Pruning
2. Minimum Error Pruning (MEP)
3. Pessimistic Pruning
4. Error-based Pruning (EBP)
5. Optimal Pruning


6. Minimum Description Length (MDL) Pruning


7. Minimum Message Length Pruning
8. Critical Value Pruning


BAYESIAN LEARNING
Bayesian Learning is a learning method that describes and represents knowledge in an uncertain domain
and provides a way to reason about this knowledge using probability measure. It uses Bayes theorem to
infer the unknown parameters of a model.
Bayesian inference is useful in many applications which involve reasoning and diagnosis, such as game theory, medicine, etc. Bayesian inference is much more powerful in handling missing data and in estimating the uncertainty in predictions.

INTRODUCTION TO PROBABILITY-BASED LEARNING


 Probability-based learning is one of the most important practical learning methods which combines
prior knowledge or prior probabilities with observed data.
 Probabilistic learning uses the concept of probability theory that describes how to model randomness,
uncertainty, and noise to predict future events.
 It is a tool for modelling large datasets and uses Bayes rule to infer unknown quantities, predict and
learn from data.
 In a probabilistic model, randomness plays a major role and the solution is given as a probability distribution, while in a deterministic model there is no randomness and hence it produces the same output for the same initial conditions every time.
 Bayesian learning differs from probabilistic learning as it uses subjective probabilities to infer
parameters of a model.
 Two practical learning algorithms called Naïve Bayes learning and Bayesian Belief Network (BBN) form the major part of Bayesian learning.
 These algorithms use prior probabilities and apply Bayes rule to infer useful information.

FUNDAMENTALS OF BAYES THEOREM

Naive Bayes Model relies on Bayes theorem that works on the principle of three kinds of probabilities
called prior probability, likelihood probability, and posterior probability.

Prior Probability
It is the general probability of an uncertain event before an observation is seen or some evidence is
collected. It is the initial probability that is believed before any new information is collected.

Likelihood Probability
Likelihood probability is the relative probability of the observation occurring for each class or the
sampling density for the evidence given the hypothesis.


It is stated as P (Evidence | Hypothesis), which denotes the likeliness of the occurrence of the evidence
given the parameters.

Posterior Probability
It is the updated or revised probability of an event taking into account the observations from the training
data.
P (Hypothesis | Evidence) is the posterior distribution representing the belief about the hypothesis, given
the evidence from the training data.
Therefore, the posterior probability is obtained by updating the prior probability with the new evidence (likelihood).

CLASSIFICATION USING BAYES MODEL

 Naive Bayes classification models work on the principle of Bayes theorem. Bayes' rule is a mathematical formula used to determine the posterior probability, given prior probabilities of events.
 Generally, Bayes theorem is used to select the most probable hypothesis from data, considering
both prior knowledge and posterior distributions.
 It is based on the calculation of the posterior probability and is stated as:

P (Hypothesis h | Evidence E)

where, Hypothesis h is the target class to be classified and Evidence E is the given test instance.

P (Hypothesis h | Evidence E) = [P (Evidence E | Hypothesis h) × P (Hypothesis h)] / P (Evidence E)

where,

 P (Hypothesis h) is the prior probability of the hypothesis.


 P (Evidence E) is the prior probability of the evidence E from the training dataset without any
knowledge of which hypothesis holds. It is also called the marginal probability.
 P (Evidence E | Hypothesis h) is the prior probability of Evidence E given Hypothesis h. It is
the likelihood probability of the Evidence E after observing the training data that the
hypothesis h is correct.
 P (Hypothesis h | Evidence E) is the posterior probability of Hypothesis h given Evidence E.
It is the probability of the hypothesis h after observing the training data that the evidence E is
correct.
 Bayes theorem helps in calculating the posterior probability for a number of hypotheses, from
which the hypothesis with the highest probability can be selected.
 This selection of the most probable hypothesis from a set of hypotheses is formally defined as
Maximum A Posteriori (MAP) Hypothesis.


Maximum A Posteriori (MAP) Hypothesis, hMAP


 Given a set of candidate hypotheses, the hypothesis which has the maximum value is considered as
the maximum probable hypothesis or most probable hypothesis.
 This most probable hypothesis is called the Maximum A Posteriori Hypothesis hMAP.
 Bayes theorem can be used to find the hMAP
hMAP = max h∈H P(Hypothesis h | Evidence E)

= max h∈H [P(Evidence E | Hypothesis h) × P(Hypothesis h)] / P(Evidence E)

= max h∈H P(Evidence E | Hypothesis h) × P(Hypothesis h)
(P(Evidence E) is the same for all hypotheses, so it can be dropped from the comparison.)

Maximum Likelihood (ML) Hypothesis, hML


 Given a set of candidate hypotheses, if every hypothesis is equally probable, only P (E| h) is used to
find the most probable hypothesis.
 The hypothesis that gives the maximum likelihood for P (E | h) is called the Maximum Likelihood
(ML) Hypothesis, hML

hML = max h∈H P(Evidence E | Hypothesis h)
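The posterior and hMAP computations above can be sketched as follows (posterior and map_hypothesis are assumed names, not from the original notes); the usage line reproduces the fire-and-smoke figures of Problem 1 below.

    def posterior(prior_h, likelihood_e_given_h, prior_e):
        # Bayes rule: P(h | E) = P(E | h) * P(h) / P(E)
        return likelihood_e_given_h * prior_h / prior_e

    def map_hypothesis(hypotheses):
        # hypotheses: dict mapping h -> (P(h), P(E | h)).
        # P(E) is the same for every h, so comparing P(E | h) * P(h) is enough.
        return max(hypotheses, key=lambda h: hypotheses[h][0] * hypotheses[h][1])

    # P(fire) = 0.01, P(smoke) = 0.10, P(smoke | fire) = 0.90
    print(posterior(0.01, 0.90, 0.10))   # 0.09, i.e. 9%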

Problems

1. Dangerous fires are rare (1%), but smoke is fairly common (10%) due to barbecues, and 90% of dangerous fires make smoke. What is the probability of a dangerous fire when there is smoke?

Given:

P(fire) = 1 %

P(smoke) = 10 %

P(smoke| fire) = 90%

P(fire | smoke) = ?

P(fire | smoke) = P(smoke | fire) × P(fire) / P(smoke)

P(fire | smoke) = (90% × 1%) / 10% = 9%

2. 50% of all rainy days start off cloudy. But cloudy mornings are common: about 40% of days start cloudy. And this is usually a dry month (only 3 of 30 days tend to be rainy, or 10%). You are planning a picnic today, but the morning is cloudy. What is the chance of rain during the day?
Given:
P(Rain) = 10%
P(Cloud | Rain) = 50%
P(Cloud) = 40%


P(Rain | Cloud) = P(Cloud | Rain) × P(Rain) / P(Cloud)

P(Rain | Cloud) = (50% × 10%) / 40% = 12.5%

3. Consider a boy who has a volleyball tournament on the next day, but today he feels sick. It is unusual for him to fall sick: there is only a 40% chance he would fall sick, since he is a healthy boy. Now, find the probability of the boy participating in the tournament given that he is sick. The boy is very much interested in volleyball, so there is a 90% probability that he would participate in tournaments, and a 20% probability that he will fall sick given that he participates in the tournament.

Given:
P (Boy participating in the tournament) = 90%
P (He is sick | Boy participating in the tournament) = 20%
P (He is Sick) = 40%
P (Boy participating in the tournament | He is sick) = ?

P(Boy participating in the tournament | He is sick)
= P(He is sick | Boy participating in the tournament) × P(Boy participating in the tournament) / P(He is sick)

= (20% × 90%) / 40% = 45%

4. Consider a medical diagnosis problem in which there are two alternative hypotheses: (1) that the patient has a particular form of cancer, and (2) that the patient does not. The available data is from a particular laboratory test with two possible outcomes: + (positive) and - (negative). We have prior knowledge that over the entire population of people only 0.008 have this disease. Furthermore, the lab test is only an imperfect indicator of the disease. The test returns a correct positive result in only 98% of the cases in which the disease is actually present and a correct negative result in only 97% of the cases in which the disease is not present. In other cases, the test returns the opposite result.
Given:
P(cancer) = 0.008, P(¬cancer) = 0.992
P(+ | cancer) = 0.98, P(- | cancer) = 0.02
P(+ | ¬cancer) = 0.03, P(- | ¬cancer) = 0.97

Suppose a new patient is observed for whom the lab test returns a positive (+) result. Should we diagnose the patient as having cancer or not?


NAIVE BAYES ALGORITHM


 It is a supervised binary class or multi class classification algorithm that works on the principle of
Bayes theorem.
 There is a family of Naïve Bayes classifiers based on a common principle.
 These algorithms classify datasets whose features are assumed to be independent, with each feature given equal weightage.
 It works particularly well for a large dataset and is very fast. It is one of the most effective and simple classification algorithms.
 This algorithm considers all features to be independent of each other even though they may individually be dependent on the classified object.
 Each of the features contributes a probability value independently during classification, and hence this algorithm is called a Naive algorithm.
 Some important applications of these algorithms are text classification, recommendation system and
face recognition.

Algorithm: Naïve Bayes


1. Compute the prior probability for the target class.
2. Compute the frequency matrix and likelihood probability for each of the features.
3. Use Bayes theorem to calculate the probability of all hypotheses.
4. Use Maximum a Posteriori (MAP) Hypothesis, hMAP to classify the test object to the hypothesis
with the highest probability.
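A compact Python sketch of the four steps above (naive_bayes_classify is an assumed name; it uses raw frequency-based likelihoods without smoothing, exactly as the worked examples below do):

    from collections import Counter, defaultdict

    def naive_bayes_classify(rows, target, test):
        # Step 1: prior probabilities of the target classes.
        classes = Counter(r[target] for r in rows)
        total = len(rows)

        # Step 2: frequency matrices, freq[attribute][class][value].
        freq = defaultdict(lambda: defaultdict(Counter))
        for r in rows:
            for attr, value in r.items():
                if attr != target:
                    freq[attr][r[target]][value] += 1

        # Steps 3-4: posterior score for each class (P(E) is ignored), pick the maximum.
        scores = {}
        for c, class_count in classes.items():
            score = class_count / total
            for attr, value in test.items():
                score *= freq[attr][c][value] / class_count   # likelihood P(value | c)
            scores[c] = score
        return max(scores, key=scores.get), scores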

Problem
Consider the table below and apply the Naïve Bayes classifier to find the class of the new test instance {Sunny, Cool, Normal, Weak}.

Outlook Temperature Humidity Wind Play Tennis

Sunny Hot High Weak No

Sunny Hot High Strong No

Overcast Hot High Weak Yes

Rain Mild High Weak Yes

Rain Cool Normal Weak Yes

Rain Cool Normal Strong No

Overcast Cool Normal Strong Yes

Sunny Mild High Weak No

Sunny Cool Normal Weak Yes

Rain Mild Normal Weak Yes

Sunny Mild Normal Strong Yes

Overcast Mild High Strong Yes

Overcast Hot Normal Weak Yes

Rain Mild High Strong No


Solution:
Prior Probability of Target Concept:
Play Tennis No. of Instance Probability Value
Yes 9 9/14
No 5 5/14

Step 2: Compute Frequency matrix and Likelihood Probability for each of the feature/attribute.
Attribute – Outlook

Frequency matrix Likelihood Probability

Outlook   Play Tennis = Yes   Play Tennis = No        Outlook   Play Tennis = Yes   Play Tennis = No
Sunny 2 3 Sunny 2/9 3/5
Overcast 4 0 Overcast 4/9 0/5
Rain 3 2 Rain 3/9 2/5
Total 9 5

Attribute – Temperature

Frequency matrix Likelihood Probability

Temperature   Play Tennis = Yes   Play Tennis = No        Temperature   Play Tennis = Yes   Play Tennis = No
Hot 2 2 Hot 2/9 2/5
Cool 3 1 Cool 3/9 1/5
Mild 4 2 Mild 4/9 2/5
Total 9 5


Attribute – Humidity

Frequency matrix Likelihood Probability

Humidity   Play Tennis = Yes   Play Tennis = No        Humidity   Play Tennis = Yes   Play Tennis = No
High 3 4 High 3/9 4/5
Normal 6 1 Normal 6/9 1/5
Total 9 5

Attribute – Wind

Frequency matrix Likelihood Probability

Wind   Play Tennis = Yes   Play Tennis = No        Wind   Play Tennis = Yes   Play Tennis = No
Strong 3 3 Strong 3/9 3/5
Weak 6 2 Weak 6/9 2/5
Total 9 5

Step 3: Apply Bayes theorem to the new test data to calculate the probabilities
P(PlayTennis = Yes | Test data) = P(Test data | Yes) * P(Yes)
= P(Sunny | Yes) * P(Cool | Yes) * P(Normal | Yes) * P(Weak | Yes) * P(Yes)
= 2/9 * 3/9 * 6/9 * 6/9 * 9/14 ≈ 0.0211

P(PlayTennis = No | Test data) = P(Test data | No) * P(No)
= P(Sunny | No) * P(Cool | No) * P(Normal | No) * P(Weak | No) * P(No)
= 3/5 * 1/5 * 1/5 * 2/5 * 5/14 ≈ 0.0034

Step 4: Using hMAP, we classify the new test data as Play Tennis = Yes, since P(PlayTennis = Yes | Test
data) ≈ 0.0211 is the maximum.
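As a quick check, the two unnormalized scores above can be reproduced with Python's fractions module (a small illustrative snippet, not part of the original worked example):

from fractions import Fraction as F

score_yes = F(2, 9) * F(3, 9) * F(6, 9) * F(6, 9) * F(9, 14)   # the ≈ 0.0211 value above
score_no  = F(3, 5) * F(1, 5) * F(1, 5) * F(2, 5) * F(5, 14)   # the ≈ 0.0034 value above

# Normalizing gives the actual posterior probability of Yes, about 0.86.
print(float(score_yes), float(score_no), float(score_yes / (score_yes + score_no)))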

Problem 2:
Assess the student's performance using the Naïve Bayes algorithm with the dataset provided in the table.
Predict whether a student gets a job offer or not in the final year of the course for the given test data
{CGPA ≥ 9, Interactiveness = Yes, Practical Knowledge = Average, Communication Skill = Good}.


Sl. No CGPA Interactiveness Practical Knowledge Communication Skill Job Offer


1 ≥9 Yes Very Good Good Yes
2 ≥8 No Good Moderate Yes
3 ≥9 No Average Poor No
4 <8 No Average Good No
5 ≥8 Yes Good Moderate Yes
6 ≥9 Yes Good Moderate Yes
7 <8 Yes Good Poor No
8 ≥9 No Very Good Good Yes
9 ≥8 Yes Good Good Yes
10 ≥8 Yes Average Good Yes

Solution:
Step 1: Compute the prior probability of the target concept (Job Offer):
Job Offer No. of Instances Probability Value
Yes 7 7/10
No 3 3/10

Step 2: Compute the frequency matrix and likelihood probability for each feature/attribute.
Attribute – CGPA
Frequency matrix Likelihood Probability

CGPA Job offer = Yes Job offer = No CGPA Job offer = Yes Job offer = No
≥9 3 1 ≥9 3/7 1/3
≥8 4 0 ≥8 4/7 0/3
<8 0 2 <8 0/7 2/3
Total 7 3

Attribute – Interactiveness
Frequency matrix Likelihood Probability

Interactiveness   Job offer = Yes   Job offer = No        Interactiveness   Job offer = Yes   Job offer = No
Yes 5 1 Yes 5/7 1/3
No 2 2 No 2/7 2/3
Total 7 3


Attribute – Practical Knowledge

Frequency matrix Likelihood Probability

Practical Knowledge   Job offer = Yes   Job offer = No        Practical Knowledge   Job offer = Yes   Job offer = No
Very Good 2 0 Very Good 2/7 0/3
Average 1 2 Average 1/7 2/3
Good 4 1 Good 4/7 1/3
Total 7 3

Attribute – Communication Skills

Frequency matrix Likelihood Probability

Communication Skills   Job offer = Yes   Job offer = No        Communication Skills   Job offer = Yes   Job offer = No
Good 4 1 Good 4/7 1/3
Moderate 3 0 Moderate 3/7 0/3
Poor 0 2 Poor 0/7 2/3
Total 7 3

Step 3: Apply Bayes theorem to the new test data to calculate the probabilities
P(Job Offer = Yes | Test data) = P(Test data | Yes) * P(Yes)
= P(CGPA ≥ 9 | Yes) * P(Interactiveness = Yes | Yes) * P(Practical Knowledge = Average | Yes) * P(Communication Skill = Good | Yes) * P(Yes)
= 3/7 * 5/7 * 1/7 * 4/7 * 7/10 ≈ 0.0175

P(Job Offer = No | Test data) = P(Test data | No) * P(No)
= P(CGPA ≥ 9 | No) * P(Interactiveness = Yes | No) * P(Practical Knowledge = Average | No) * P(Communication Skill = Good | No) * P(No)
= 1/3 * 1/3 * 2/3 * 1/3 * 3/10 ≈ 0.0074

Step 4: Using hMAP, we classify the new test data as Job Offer = Yes, since P(Job Offer = Yes | Test data)
≈ 0.0175 is the maximum.

Zero Probability Error

Consider the test data to be {CGPA ≥ 8, Interactiveness = Yes, Practical Knowledge = Average,
Communication Skills = Good}.


When computing the posterior probability


P(Job Offer = Yes | Test data) = P(Test data | Yes) * P(Yes)
= P(CGPA ≥ 8 | Yes) * P(Interactiveness = Yes | Yes) * P(Average | Yes) * P(Good | Yes) * P(Yes)
= 4/7 * 5/7 * 1/7 * 4/7 * 7/10 ≈ 0.0233

P(Job Offer = No | Test data) = P(Test data | No) * P(No)
= P(CGPA ≥ 8 | No) * P(Interactiveness = Yes | No) * P(Average | No) * P(Good | No) * P(No)
= 0/3 * 1/3 * 2/3 * 1/3 * 3/10 = 0

Since the probability value is zero, the model fails to predict, and this is called the Zero-Probability
Error.
This problem arises because there are no training instances with CGPA ≥ 8 and Job Offer = No, and hence
the likelihood value for this case is zero.
This error can be solved by applying a smoothing technique called Laplace correction.

The idea: scale the counts up to, say, 1000 training instances. If a particular value of a feature has zero
instances for a class, add 1 instance to each attribute-value pair of that feature; this makes hardly any
difference over 1000 instances, and the overall probability no longer becomes zero. (A small code sketch is
given at the end of this subsection.)

The scaled-value table of likelihood probabilities for CGPA is

CGPA Job offer = Yes Job offer = No


≥9 300/700 100/300
≥8 400/700 0/300
<8 0/700 200/300

Now add 1 instance to each CGPA value for the class Job Offer = No. Then

CGPA Job offer = Yes Job offer = No


≥9 300/700 101/303
≥8 400/700 1/303
<8 0/700 201/303

With Scaled Values we get


P(Job Offer = Yes | Test data) = P(Test data | Yes) * P(Yes)
= P(CGPA ≥ 8 | Yes) * P(Interactiveness = Yes | Yes) * P(Average | Yes) * P(Good | Yes) * P(Yes)
= 400/700 * 500/700 * 100/700 * 400/700 * 700/1003 ≈ 0.02325

P(Job Offer = No | Test data) = P(Test data | No) * P(No)
= P(CGPA ≥ 8 | No) * P(Interactiveness = Yes | No) * P(Average | No) * P(Good | No) * P(No)


= 1/303 * 100/300 * 200/300 * 100/300 * 303/1003 ≈ 0.0000738


Thus, using Laplace correction, the Zero-Probability Error can be avoided in the Naïve Bayes classifier.
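A minimal sketch of the standard add-one form of Laplace correction (the scaled-count version used in the table above is a variant of the same idea; the function and variable names below are illustrative, not from the text):

def laplace_likelihood(value_count, class_count, n_values, alpha=1):
    # value_count : instances of this attribute value within the class
    # class_count : total instances of the class
    # n_values    : number of distinct values of this attribute
    # With alpha = 1 this is the classic add-one (Laplace) correction:
    # a zero count becomes a small non-zero probability instead of 0.
    return (value_count + alpha) / (class_count + alpha * n_values)

# CGPA >= 8 with Job Offer = No: the raw estimate is 0/3 = 0,
# while the smoothed estimate is small but non-zero.
print(0 / 3)                          # 0.0 (unsmoothed)
print(laplace_likelihood(0, 3, 3))    # 1/6 ≈ 0.167 (smoothed)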

Brute Force Bayes Algorithm


 Applying Bayes theorem, the Brute Force Bayes algorithm relies on the idea of concept learning: given
a hypothesis space H for the training dataset T, the algorithm computes the posterior probabilities of
all the hypotheses hi ∈ H.
 Then, the Maximum A Posteriori (MAP) hypothesis, hMAP, is used to output the hypothesis with the
maximum posterior probability.
 The algorithm is quite expensive since it requires computations for all the hypotheses.
 Although computing posterior probabilities for every hypothesis is inefficient, the same idea is applied
in several other algorithms, which makes it interesting.

Bayes Optimal Classifier


 Bayes optimal classifier is a probabilistic model which uses Bayes theorem to find the most probable
classification for a new instance, given the training data, by combining the predictions of all the
hypotheses weighted by their posterior probabilities.
 This is different from the Maximum A Posteriori (MAP) hypothesis, hMAP, which chooses only the single
most probable hypothesis.
 Here, a new instance is assigned the classification value Ci that maximizes
arg max (Ci) ∑ (hi ∈ H) P(Ci | hi) P(hi | T)

Example:
Given a hypothesis space with 4 hypotheses h1, h2, h3 and h4, determine whether the patient is diagnosed
as COVID Positive or COVID Negative using the Bayes Optimal classifier.
Solution: From the training dataset T, the posterior probabilities of the four hypotheses and their
predictions for a new instance are given below.
Hypothesis   P(hi | T)   P(Covid Positive | hi)   P(Covid Negative | hi)
h1           0.3         0                        1
h2           0.1         1                        0
h3           0.2         1                        0
h4           0.1         1                        0

The Bayes Optimal classifier combines the predictions of h2, h3 and h4, whose posterior probabilities sum
to 0.4, and gives the result that the patient is COVID Positive.


∑ (hi ∈ H) P(Covid Negative | hi) P(hi | T) = 0.3 × 1 = 0.3

∑ (hi ∈ H) P(Covid Positive | hi) P(hi | T) = 0.1 × 1 + 0.2 × 1 + 0.1 × 1 = 0.4

Since the maximum value is for ∑ (hi ∈ H) P(Covid Positive | hi) P(hi | T), the new instance is
diagnosed as COVID Positive.
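A small Python sketch of this weighted vote (the posteriors and per-hypothesis predictions are taken from the table above; the function name and data layout are illustrative choices):

def bayes_optimal_class(posteriors, predictions, classes):
    # posteriors[i]     : P(hi | T) for hypothesis hi
    # predictions[i][c] : P(c | hi), the probability hypothesis hi assigns to class c
    # Returns the class with the largest combined (weighted) probability.
    scores = {c: sum(p * pred[c] for p, pred in zip(posteriors, predictions)) for c in classes}
    return max(scores, key=scores.get), scores

posteriors = [0.3, 0.1, 0.2, 0.1]
predictions = [
    {"Positive": 0, "Negative": 1},   # h1
    {"Positive": 1, "Negative": 0},   # h2
    {"Positive": 1, "Negative": 0},   # h3
    {"Positive": 1, "Negative": 0},   # h4
]
# Prints ('Positive', ...) with combined scores of about 0.4 (Positive) and 0.3 (Negative).
print(bayes_optimal_class(posteriors, predictions, ["Positive", "Negative"]))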

Gibbs Algorithm
 The main drawback of the Bayes optimal classifier is that it computes the posterior probability for all
hypotheses in the hypothesis space and then combines the predictions to classify a new instance.
 Gibbs algorithm is a sampling technique which randomly selects a single hypothesis from the hypothesis
space according to the posterior probability distribution and uses it to classify a new instance.
 It is found that, in expectation, the prediction error of the Gibbs algorithm is at most twice that of
the Bayes Optimal classifier.

NAIVE BAYES ALGORITHM FOR CONTINUOUS ATTRIBUTES


There are two ways to predict with the Naive Bayes algorithm for continuous attributes:
1. Discretize the continuous feature into a discrete feature.
2. Apply the Normal (Gaussian) distribution to the continuous feature.

Gaussian Naive Bayes Algorithm


In Gaussian Naive Bayes, the values of continuous features are assumed to be sampled from a Gaussian
distribution.
The Gaussian distribution for a continuous feature is calculated using
P(Xi = xk | Cj) = g(xk, µij, σij)

where g(xk, µij, σij) = (1 / (σij √(2π))) · e^( −(xk − µij)² / (2 σij²) )

Where Xi is the ith continuous attribute in the given dataset
xk is a value of the attribute.
Cj denotes the jth class of the target feature.
µij denotes the mean of the values of that continuous attribute Xi with respect to the class j of
the target feature.
σij denotes the standard deviation of the values of that continuous attribute Xi w.r.t the class j of
the target feature.
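A short Python sketch of this density computation (the population form of the standard deviation, dividing by n, is assumed here to match the worked example below; the function names are illustrative):

import math

def mean_and_std(values):
    # Mean and standard deviation of an attribute's values within one class
    # (population form, dividing by n).
    mu = sum(values) / len(values)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in values) / len(values))
    return mu, sigma

def gaussian_likelihood(x, mu, sigma):
    # g(x, mu, sigma): Normal density used as the likelihood P(Xi = x | Cj).
    return (1.0 / (sigma * math.sqrt(2 * math.pi))) * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

For each class Cj, mean_and_std is applied to the attribute values belonging to that class, and gaussian_likelihood then replaces the frequency-based likelihood in Step 3 of the earlier problems.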


Problem
Consider the following table and apply the Gaussian Naïve Bayes algorithm to find the target class of the
new test data {CGPA = 8.5, Interactiveness = Yes}.
CGPA Interactiveness Job Offer
9.5 Yes Yes
8.2 No Yes
9.3 No No
7.6 No No
8.4 Yes Yes
9.1 Yes Yes
7.5 Yes No
9.6 No Yes
8.6 Yes Yes
8.3 Yes Yes

Solution

Step 1: Computing the prior probability for the target concept = Job offer

Job Offer No. of Instances Probability Value


Yes 7 7/10
No 3 3/10

Step 2: Compute the frequency matrix and likelihood probability for each feature/attribute.
Attribute – CGPA: Since the values of CGPA are continuous, we apply the Gaussian distribution formula
to calculate the likelihood probability.
First we compute
1. mean and standard deviation for CGPA w.r.t target class Job offer = Yes

µij = µCGPA–Yes = (9.5 + 8.2 + 8.4 + 9.1 + 9.6 + 8.6 + 8.3) / 7 ≈ 8.814

σij = σCGPA–Yes = √[ ((9.5 − 8.814)² + (8.2 − 8.814)² + (8.4 − 8.814)² + (9.1 − 8.814)² + (9.6 − 8.814)²
+ (8.6 − 8.814)² + (8.3 − 8.814)²) / 7 ] ≈ 0.5383

2. mean and standard deviation for CGPA w.r.t target class Job offer = No

µij = µCGPA–No = (9.3 + 7.6 + 7.5) / 3 ≈ 8.133


σij = σCGPA–No = √[ ((9.3 − 8.133)² + (7.6 − 8.133)² + (7.5 − 8.133)²) / 3 ] ≈ 0.825

The likelihood probabilities for the test data (CGPA = 8.5) are

P(CGPA = 8.5 | Job Offer = Yes) = g(8.5, µCGPA–Yes, σCGPA–Yes) = g(8.5, 8.814, 0.5383) ≈ 0.875

P(CGPA = 8.5 | Job Offer = No) = g(8.5, µCGPA–No, σCGPA–No) = g(8.5, 8.133, 0.825) ≈ 0.538

Attribute – Interactiveness

Frequency matrix Likelihood Probability

Interactiveness   Job offer = Yes   Job offer = No        Interactiveness   Job offer = Yes   Job offer = No
Yes 5 1 Yes 5/7 1/3
No 2 2 No 2/7 2/3
Total 7 3

Consider the test data to be {CGPA = 8.5, Interactiveness = Yes}.

When computing the posterior probability


P(Job Offer = Yes | Test data) = P(Test data | Yes) * P(Yes)
= P(CGPA = 8.5 | Yes) * P(Interactiveness = Yes | Yes) * P(Yes)
= 0.875 * 5/7 * 7/10 = 0.4375

P(Job Offer = No | Test data) = P(Test data | No) * P(No)
= P(CGPA = 8.5 | No) * P(Interactiveness = Yes | No) * P(No)
= 0.538 * 1/3 * 3/10 ≈ 0.0538

Since P(Job Offer = Yes | Test data) has the highest probability value, i.e. 0.4375, the test data is
classified as Job Offer = Yes.

OTHER POPULAR TYPES OF NAIVE BAYES CLASSIFIERS


Some of the popular variants of Bayesian classifier are listed below:

Bernoulli Naive Bayes Classifier


 Bernoulli Naive Bayes works with discrete features.


 In this algorithm, the features used for making predictions are Boolean variables that take only two
values, such as 'yes' or 'no'.
 This is particularly useful for text classification where every feature is binary, indicating whether a
word occurs in the document or not.

Multinomial Naive Bayes Classifier


 This algorithm is a generalization of the Bernoulli Naive Bayes model that works for categorical data,
in particular integer count features.
 This classifier is useful for text classification where each feature holds an integer value that
represents the frequency of occurrence of a word (a brief usage sketch is given below).
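A brief illustrative sketch of both classifiers using scikit-learn (assuming scikit-learn is installed; the tiny word-count matrix and labels are made up for illustration):

import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Each row is a document, each column a word; the values are word counts.
X_counts = np.array([[2, 0, 1],
                     [0, 3, 0],
                     [1, 0, 2],
                     [0, 2, 1]])
y = np.array(["spam", "ham", "spam", "ham"])

# Multinomial Naive Bayes uses the counts directly.
print(MultinomialNB().fit(X_counts, y).predict([[1, 0, 1]]))

# Bernoulli Naive Bayes binarizes the features: only word presence/absence matters.
print(BernoulliNB(binarize=0.5).fit(X_counts, y).predict([[1, 0, 1]]))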

Multi-class Naïve Bayes Classifier


 This classifier is useful for classification problems with more than two classes, where the target
feature contains multiple classes and the test instance has to be assigned to the class it belongs to.

