Department of Computer Engineering
Subject: Machine Learning & Statistics
Unit VI: Classification Models
Departmental Vision & Mission
Vision
Achieve academic excellence through education in computing, to
create intellectual manpower to explore professional, higher
educational and social opportunities.
Mission
To impart learning by educating students with conceptual knowledge and hands-on
practice using modern tools, FOSS technologies and competency skills, thereby
igniting young minds for innovative thinking, professional expertise and research.
UNIT VI
Classification Models
Contents
● Decision tree representation
● Constructing Decision Trees
● Classification and Regression Trees
● Hypothesis space search in decision tree learning
● Bayes' Theorem
● Working of Naïve Bayes' Classifier
● Types of Naïve Bayes Model
● Advantages, Disadvantages and Application of Naïve Bayes Model
Decision Tree Representation
● Decision trees classify instances by sorting them down the tree from
the root to some leaf node, which provides the classification of the
instance.
● Each node in the tree specifies a test of some attribute of the instance,
and each branch descending from that node corresponds to one of the
possible values for this attribute.
● An instance is classified by starting at the root node of the tree, testing
the attribute specified by this node, then moving down the tree branch
corresponding to the value of the attribute in the given example.
● This process is then repeated for the subtree rooted at the new node.
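The traversal described above can be sketched in a few lines of Python. The tree, attribute names, and values below are hypothetical examples (not taken from these slides): internal nodes are dicts mapping the tested attribute to its branches, and leaves are class labels.

```python
# A decision tree stored as nested dicts: each internal node maps an
# attribute name to a dict of {attribute value: subtree}; leaves are labels.
# The Outlook/Humidity/Wind tree here is a made-up example.
tree = {"Outlook": {
    "Sunny":    {"Humidity": {"High": "No", "Normal": "Yes"}},
    "Overcast": "Yes",
    "Rain":     {"Wind": {"Strong": "No", "Weak": "Yes"}},
}}

def classify(node, instance):
    """Walk from the root down to a leaf, at each node following the
    branch that matches the instance's value for the tested attribute."""
    while isinstance(node, dict):
        attribute = next(iter(node))            # attribute tested at this node
        node = node[attribute][instance[attribute]]
    return node                                  # leaf: the classification

print(classify(tree, {"Outlook": "Sunny", "Humidity": "Normal"}))  # -> Yes
```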
Constructing Decision Trees
● Pros of Decision Trees:
Computationally cheap to use, easy for humans to interpret the learned results,
tolerates missing values, can deal with irrelevant features
● Cons of Decision Trees:
Prone to overfitting
● Works with:
Numeric values, nominal values
● To build a decision tree, you need to make a first decision on the dataset to
dictate which feature is used to split the data. To determine this, you try every
feature and measure which split will give you the best results.
● After that, you’ll split the dataset into subsets. The subsets will then traverse
down the branches of the first decision node. If the data on the branches is the
same class, then you’ve properly classified it and don’t need to continue
splitting it.
● If the data isn’t the same, then you need to repeat the splitting process on this
subset. The decision on how to split this subset is done the same way as the
original dataset, and you repeat this process until you’ve classified all the data.
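The recursive procedure above can be sketched in Python. This is a simplified illustration, not the slides' exact algorithm: it scores a candidate split by how many rows land in the majority class of their subset (a crude stand-in for the entropy-based measures the following slides introduce), and the toy dataset and feature names are invented.

```python
from collections import Counter

def best_feature(rows, features):
    # Score a split by how many rows fall in the majority class of their
    # subset; higher is better (a crude purity measure for illustration).
    def score(feature):
        groups = {}
        for row in rows:
            groups.setdefault(row[feature], []).append(row["label"])
        return sum(Counter(g).most_common(1)[0][1] for g in groups.values())
    return max(features, key=score)

def build_tree(rows, features):
    labels = [row["label"] for row in rows]
    if len(set(labels)) == 1:       # subset is pure: properly classified, stop
        return labels[0]
    if not features:                # nothing left to split on: majority vote
        return Counter(labels).most_common(1)[0][0]
    feature = best_feature(rows, features)
    remaining = [f for f in features if f != feature]
    subsets = {}
    for row in rows:                # split the dataset on the chosen feature
        subsets.setdefault(row[feature], []).append(row)
    return {feature: {value: build_tree(subset, remaining)
                      for value, subset in subsets.items()}}

# A made-up four-row dataset.
data = [
    {"Outlook": "Sunny",    "Windy": "No",  "label": "Play"},
    {"Outlook": "Sunny",    "Windy": "Yes", "label": "Stay"},
    {"Outlook": "Overcast", "Windy": "No",  "label": "Play"},
    {"Outlook": "Overcast", "Windy": "Yes", "label": "Play"},
]
tree = build_tree(data, ["Outlook", "Windy"])
print(tree)
```

The pure "Overcast" branch becomes a leaf immediately, while the mixed "Sunny" branch is split again, exactly as the procedure above describes.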
Entropy
● Entropy is an information-theory metric that measures the impurity or
uncertainty in a group of observations. It determines how a decision tree chooses
to split data.
● Intuitively, a pure set contains observations of a single class, while an
impure set mixes observations from several classes.
Consider a dataset with N classes. The entropy may be calculated using the formula
below:

E = −∑ p_i log2(p_i)   (the sum runs over the classes, i = 1 … N)

Where,
p_i is the probability of randomly selecting an example in class i.
● Let’s have an example to better our understanding of entropy and its
calculation. Let’s have a dataset made up of three colors; red, purple, and
yellow.
● If we have one red, three purple, and four yellow observations in our set,
our equation becomes:

E = −(p_r log2(p_r) + p_p log2(p_p) + p_y log2(p_y))

Where p_r, p_p and p_y are the probabilities of choosing a red, purple and yellow
example respectively.
We have p_r = 1/8 because only ⅛ of the dataset represents red.
3/8 of the dataset is purple, hence p_p = 3/8.
Finally, p_y = 4/8, since half the dataset is yellow.
As such, we can represent p_y as p_y = 1/2.
Our equation now becomes:

E = −(1/8 log2(1/8) + 3/8 log2(3/8) + 1/2 log2(1/2))
  = 0.375 + 0.531 + 0.5
  ≈ 1.41

Our entropy is approximately 1.41.
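The calculation above is easy to check in code. A minimal sketch of the entropy formula, applied to the 1-red / 3-purple / 4-yellow example:

```python
import math

def entropy(probabilities):
    """E = -sum(p_i * log2(p_i)), over classes with nonzero probability."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# The 8-observation example from the slides: 1 red, 3 purple, 4 yellow.
e = entropy([1/8, 3/8, 4/8])
print(round(e, 2))  # -> 1.41
```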
You might wonder, what happens when all observations belong to the same
class?
In such a case, the entropy will always be zero.
Such a dataset has no impurity. This implies that such a dataset would not be useful
for learning.
If we have a dataset with, say, two classes, half of it yellow and the
other half purple, the entropy will be one.
This kind of dataset is good for learning.
Information gain
We can define information gain as a measure of how much information a feature
provides about a class. Information gain helps to determine the order of attributes in the
nodes of a decision tree.
Gain = E(parent) − E(children)

The term Gain represents information gain, E(parent) is the entropy of the parent
node, and E(children) is the weighted average entropy of the child nodes.
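The definition above can be expressed directly in Python. The two example splits below are hypothetical: a perfectly separating feature yields the maximum gain, an uninformative one yields zero.

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(parent_labels, child_label_groups):
    """Gain = E(parent) - weighted average entropy of the child nodes."""
    def e(labels):
        n = len(labels)
        return entropy([labels.count(c) / n for c in set(labels)])
    n = len(parent_labels)
    children = sum(len(g) / n * e(g) for g in child_label_groups)
    return e(parent_labels) - children

# Hypothetical split: a perfectly separating feature has maximal gain.
gain = information_gain(["yes", "yes", "no", "no"], [["yes", "yes"], ["no", "no"]])
print(gain)  # -> 1.0
```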
Classification & Regression Trees
● CART (Classification and Regression Trees) is a variation of the
decision tree algorithm. It can handle both classification and regression
tasks.
● CART was introduced by Leo Breiman, Jerome Friedman, Richard
Olshen, and Charles Stone in 1984.
● This algorithm builds the tree on the basis of the Gini index.
Gini Index
● The Gini index is the metric for classification tasks in CART. It is
based on the sum of squared probabilities of each class.
● It measures how often a randomly chosen element would be wrongly
classified if it were labelled randomly according to the class
distribution; it is a variation of the Gini coefficient.
● It works on categorical variables, gives outcomes as either
"success" or "failure", and hence performs binary splitting only.
The degree of the Gini index varies from 0 to 1,
● where 0 indicates that all the elements belong to a single
class, i.e. only one class exists there,
● a Gini index of 1 signifies that the elements are randomly
distributed across various classes, and
● a value of 0.5 indicates that the elements are uniformly distributed
across two classes.
Mathematically, we can write the Gini impurity as follows:

Gini = 1 − ∑ p_i²

where p_i is the probability of an object being classified to a particular class.
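The formula is a one-liner in Python; the two calls below check the boundary cases discussed above (a pure node, and two evenly mixed classes):

```python
def gini(probabilities):
    """Gini impurity = 1 - sum(p_i^2)."""
    return 1 - sum(p * p for p in probabilities)

print(gini([1.0]))       # -> 0.0  (pure node: a single class)
print(gini([0.5, 0.5]))  # -> 0.5  (two classes, evenly mixed)
```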
Bayes’ Theorem
● Bayes' theorem determines the probability of an event given
uncertain knowledge.
● In probability theory, it relates the conditional probabilities of two
random events.
● Bayes' theorem states that:

P(A|B) = P(B|A) · P(A) / P(B)

Where,
P(A|B) is known as the posterior, which we need to calculate; it is read as the
probability of hypothesis A given that evidence B has occurred.
P(B|A) is called the likelihood: assuming the hypothesis is true, we
calculate the probability of the evidence.
P(A) is called the prior probability, the probability of the hypothesis before
considering the evidence.
P(B) is called the marginal probability, the pure probability of the evidence.
Bayes’ Theorem Example
The Art Competition has entries from three painters: Pam,
Pia and Pablo
● Pam put in 15 paintings, 4% of her works have won First Prize.
● Pia put in 5 paintings, 6% of her works have won First Prize.
● Pablo put in 10 paintings, 3% of his works have won First Prize.
What is the chance that Pam will win First Prize?
Put in the values:

P(Pam | Win) = P(Win | Pam) × P(Pam) / P(Win)
             = 0.04 × (15/30) / (0.04 × 15/30 + 0.06 × 5/30 + 0.03 × 10/30)

Multiply all terms by 30 (makes the calculation easier):

P(Pam | Win) = (0.04 × 15) / (0.04 × 15 + 0.06 × 5 + 0.03 × 10)
             = 0.6 / (0.6 + 0.3 + 0.3)
             = 0.6 / 1.2 = 0.5

So Pam has a 50% chance of winning First Prize.
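The painter example can be checked numerically; a quick sketch using the entry counts and win rates stated above:

```python
# Expected number of First Prizes per painter = entries * win rate.
expected = {
    "Pam":   15 * 0.04,   # 0.6
    "Pia":    5 * 0.06,   # 0.3
    "Pablo": 10 * 0.03,   # 0.3
}
total = sum(expected.values())     # 1.2 expected First Prizes overall
p_pam = expected["Pam"] / total    # Pam's share of the expected wins
print(round(p_pam, 2))  # -> 0.5
```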
Working of Naïve Bayes' Classifier
The Naïve Bayes algorithm comprises two words, Naïve and Bayes, which can be
described as:
○ Naïve: It is called naïve because it assumes that the occurrence of a certain feature
is independent of the occurrence of other features. For example, if a fruit is
identified on the basis of colour, shape, and taste, then a red, spherical, and
sweet fruit is recognized as an apple. Each feature individually contributes to
identifying it as an apple, without depending on the others.
○ Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.
● Working of Naïve Bayes' Classifier can be understood with the help of the below
example:
● Suppose we have a dataset of weather conditions and a corresponding target
variable "Play". Using this dataset, we need to decide whether we should
play on a particular day according to the weather conditions.
● So to solve this problem, we need to follow the below steps:
1. Convert the given dataset into frequency tables.
2. Generate Likelihood table by finding the probabilities of given features.
3. Now, use Bayes theorem to calculate the posterior probability.
Problem: If the weather is sunny, should the player play or not?
Solution: To solve this, first consider the weather dataset shown on the slide
(each row records the day's weather condition and whether "Play" was Yes or No).
Frequency table for the weather conditions (the full table appears on the slide;
the Sunny row implied by the probabilities used below is: Sunny → Yes: 3, No: 2,
with column totals of 10 Yes and 4 No):
Likelihood table for the weather conditions (the values implied by the figures
used below are P(Sunny) = 5/14 ≈ 0.35, P(Yes) = 10/14 ≈ 0.71 and
P(No) = 4/14 ≈ 0.29):
Applying Bayes' theorem:

P(Yes|Sunny) = P(Sunny|Yes) × P(Yes) / P(Sunny)
P(Sunny|Yes) = 3/10 = 0.3
P(Sunny) = 0.35
P(Yes) = 0.71

So, P(Yes|Sunny) = 0.3 × 0.71 / 0.35 ≈ 0.61
P(No|Sunny) = P(Sunny|No) × P(No) / P(Sunny)
P(Sunny|No) = 2/4 = 0.5
P(No) = 0.29
P(Sunny) = 0.35

So, P(No|Sunny) = 0.5 × 0.29 / 0.35 ≈ 0.41

As we can see from the above calculations, P(Yes|Sunny) > P(No|Sunny).
Hence, on a sunny day, the player can play the game.
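The two posterior calculations above can be reproduced directly from the stated probabilities:

```python
# Probabilities as stated on the slides (from the frequency/likelihood tables).
p_sunny_given_yes, p_yes, p_sunny = 0.3, 0.71, 0.35
p_sunny_given_no, p_no = 0.5, 0.29

# Bayes' theorem for each class; the larger posterior wins.
p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
p_no_given_sunny = p_sunny_given_no * p_no / p_sunny

print(round(p_yes_given_sunny, 2))  # -> 0.61
print(round(p_no_given_sunny, 2))   # -> 0.41
```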
Types of Naïve Bayes Model:
There are three types of Naïve Bayes model, which are given below:
○ Gaussian: The Gaussian model assumes that features follow a normal distribution. This means that if
predictors take continuous values instead of discrete ones, the model assumes these values are
sampled from a Gaussian distribution.
○ Multinomial: The Multinomial Naïve Bayes classifier is used when the data is multinomially distributed. It
is primarily used for document classification problems, i.e., deciding which category a particular
document belongs to, such as Sports, Politics, Education, etc.
The classifier uses the frequencies of words as the predictors.
○ Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier, but the predictor variables
are independent Boolean variables, such as whether a particular word is present in a document or not.
This model is also popular for document classification tasks.
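To make the Gaussian model concrete, here is a from-scratch sketch in plain Python. It is a toy illustration rather than a practical implementation (no smoothing, no zero-variance handling, no log-probabilities), and the one-feature dataset is invented:

```python
import math
from collections import defaultdict

def gaussian_pdf(x, mean, var):
    # Normal density; assumes var > 0 (a real implementation would guard this).
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit(samples):
    """samples: list of (feature_vector, label). Returns, per class, the
    prior and per-feature (mean, variance) estimates."""
    by_class = defaultdict(list)
    for x, y in samples:
        by_class[y].append(x)
    model, n = {}, len(samples)
    for y, xs in by_class.items():
        stats = []
        for i in range(len(xs[0])):
            col = [x[i] for x in xs]
            mean = sum(col) / len(col)
            var = sum((v - mean) ** 2 for v in col) / len(col)
            stats.append((mean, var))
        model[y] = (len(xs) / n, stats)
    return model

def predict(model, x):
    # Naive independence assumption: multiply the prior by each feature's
    # Gaussian likelihood, then pick the class with the largest product.
    def score(y):
        prior, stats = model[y]
        p = prior
        for xi, (mean, var) in zip(x, stats):
            p *= gaussian_pdf(xi, mean, var)
        return p
    return max(model, key=score)

# Hypothetical 1-feature data: class "a" clusters near 1, class "b" near 10.
data = [([1.0], "a"), ([1.2], "a"), ([0.8], "a"),
        ([10.0], "b"), ([9.8], "b"), ([10.2], "b")]
model = fit(data)
print(predict(model, [1.1]))   # -> a
print(predict(model, [9.9]))   # -> b
```

In practice a library implementation (e.g. a GaussianNB-style estimator) would be used instead; this sketch only shows the mechanics.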
Advantages of Naïve Bayes Classifier:
○ Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the class of a dataset.
○ It can be used for binary as well as multi-class classification.
○ It performs well in multi-class predictions as compared to other algorithms.
○ It is the most popular choice for text classification problems.
Disadvantages of Naïve Bayes Classifier:
○ Naive Bayes assumes that all features are independent or unrelated, so it cannot learn the
relationship between features.
Applications of Naïve Bayes Classifier:
○ It is used for Credit Scoring.
○ It is used in medical data classification.
○ It can be used in real-time predictions because Naïve Bayes Classifier is an eager learner.
○ It is used in Text classification such as Spam filtering and Sentiment analysis.