Unit 4 Learning
Classification is the task of mapping an input attribute set (x), through a model, into its class label (y).
Example applications: speech recognition, pattern recognition.
Classification Example: Grading
If x >= 90 then grade = A.
If 80 <= x < 90 then grade = B.
If 70 <= x < 80 then grade = C.
If 60 <= x < 70 then grade = D.
If x < 60 then grade = F.
[Figure: the same rules drawn as a decision tree that repeatedly tests x against the thresholds 90, 80, 70, and 60.]
Classify the following marks: 78, 56, 99.
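The grading rules above can be written directly as a small rule-based classifier. Below is a minimal sketch in Python; the function name classify_grade is illustrative, and it assumes the last rule covers every mark below 60, so that 56 also receives a grade.

```python
def classify_grade(x: float) -> str:
    """Map a mark x to a letter grade using the if-then rules on the slide."""
    if x >= 90:
        return "A"
    elif x >= 80:
        return "B"
    elif x >= 70:
        return "C"
    elif x >= 60:
        return "D"
    else:
        return "F"

# Classify the marks from the slide: 78, 56, 99
print([classify_grade(x) for x in (78, 56, 99)])   # ['C', 'F', 'A']
```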
Topics Covered
What is Classification
General Approach to Classification
Issues in Classification
Classification Algorithms
⚫ Statistical Based
• Bayesian Classification
⚫ Distance Based
• KNN
⚫ Decision Tree Based
• ID3
⚫ Neural Network Based
⚫ Rule Based
General approach to Classification
[Figure: a classification algorithm is applied to the Training Data to build a Classifier (learning phase); the Classifier is then applied to Testing Data and to Unseen Data (classification phase).]

Training data:
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4) -> Tenured?
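As an illustration of the two phases, here is a minimal sketch that learns a classifier from the tenure table and then applies it to the unseen tuple (Jeff, Professor, 4). The use of scikit-learn's DecisionTreeClassifier and the integer encoding of RANK are assumptions made for the example, not part of the slide.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical integer encoding of the RANK attribute.
rank_code = {"Assistant Prof": 0, "Associate Prof": 1, "Professor": 2}

# Training data from the table above: (RANK, YEARS) -> TENURED.
# NAME is dropped because it cannot generalize to unseen people.
X_train = [[rank_code["Assistant Prof"], 2],   # Tom
           [rank_code["Associate Prof"], 7],   # Merlisa
           [rank_code["Professor"],      5],   # George
           [rank_code["Assistant Prof"], 7]]   # Joseph
y_train = ["no", "no", "yes", "yes"]

# Phase 1 (learning): the classification algorithm builds the classifier.
classifier = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Phase 2 (classification): apply the classifier to the unseen tuple.
jeff = [[rank_code["Professor"], 4]]
print("Tenured?", classifier.predict(jeff)[0])
```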
Issues in Classification
Defining Classes
Missing Data
⚫ Ignore missing value
⚫ Replace with assumed value
Measuring Performance
⚫ Classification accuracy on test data
⚫ Confusion matrix
• provides the information needed to determine how
well a classification model performs
Confusion Matrix

                    Predicted Class = 1   Predicted Class = 0
Actual Class = 1    f11                   f10
Actual Class = 0    f01                   f00
• Each entry fij in this table denotes the number of records from
class i predicted to be of class j.
• For instance, f01 is the number of records from class 0 incorrectly
predicted as class 1.
• The total number of correct predictions: (f11+ f00)
• The total number of incorrect predictions: (f01 + f10)
Classification Performance
Accuracy: Overall, how often is the classifier correct?
(TP+TN)/(TP+TN+FP+FN)
Error Rate: Overall, how often is it wrong?
(FP+FN)/(TP+TN+FP+FN)
Error rate is equivalent to 1 minus Accuracy.
Specificity: measures how well the classifier identifies negative examples (the true negative rate).
TN/(FP+TN)
Sensitivity/Recall: the ratio of the number of correctly classified positive examples to the total number of positive examples.
TP/(TP+FN)
High Recall indicates the class is correctly recognized.
Class Statistics Measures
Precision: is a measure of how accurate a model’s positive
predictions are. TP/(TP+FP)
⚫ High Precision indicates an example labeled as positive is indeed positive
F Score = 2*Precision*Recall/(Precision+Recall), the harmonic mean of precision and recall.
High recall, low precision: This means that most of the positive examples are
correctly recognized (low FN) but there are a lot of false positives.
Low recall, high precision: This shows that we miss a lot of positive examples
(high FN) but those we predict as positive are indeed positive (low FP)
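These class statistics can all be computed from the four confusion-matrix counts. The sketch below is illustrative; the function name class_statistics and the example counts are made up.

```python
def class_statistics(tp, tn, fp, fn):
    """Compute the measures defined above from confusion-matrix counts."""
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    error_rate  = (fp + fn) / (tp + tn + fp + fn)   # equals 1 - accuracy
    specificity = tn / (fp + tn)                    # true negative rate
    recall      = tp / (tp + fn)                    # sensitivity
    precision   = tp / (tp + fp)
    f_score     = 2 * precision * recall / (precision + recall)
    return dict(accuracy=accuracy, error_rate=error_rate,
                specificity=specificity, recall=recall,
                precision=precision, f_score=f_score)

# Made-up example counts, just to show the calculation.
print(class_statistics(tp=70, tn=20, fp=5, fn=5))
```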
Example: Interpreting a Confusion Matrix
Height Example Data
Name       Gender  Height  Output1 (Correct)  Output2 (Actual Assignment)
Kristina F 1.6m Short Medium
Jim M 2m Tall Medium
Maggie F 1.9m Medium Tall
Martha F 1.88m Medium Tall
Stephanie F 1.7m Short Medium
Bob M 1.85m Medium Medium
Kathy F 1.6m Short Medium
Dave M 1.7m Short Medium
Worth M 2.2m Tall Tall
Steven M 2.1m Tall Tall
Debbie F 1.8m Medium Medium
Todd M 1.95m Medium Medium
Kim F 1.9m Medium Tall
Amy F 1.8m Medium Medium
Wynette F 1.75m Medium Medium
Confusion Matrix Example
Accuracy is used when the True Positives and True Negatives are
more important. Accuracy is a better metric for Balanced Data.
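The confusion matrix for the height data above can be tabulated by counting (correct class, assigned class) pairs. This is a small illustrative sketch; the two lists simply transcribe the Output1 and Output2 columns of the table.

```python
from collections import Counter

correct  = ["Short", "Tall", "Medium", "Medium", "Short", "Medium", "Short",
            "Short", "Tall", "Tall", "Medium", "Medium", "Medium", "Medium", "Medium"]
assigned = ["Medium", "Medium", "Tall", "Tall", "Medium", "Medium", "Medium",
            "Medium", "Tall", "Tall", "Medium", "Medium", "Tall", "Medium", "Medium"]

# Count how many records of each actual class were assigned to each class.
counts = Counter(zip(correct, assigned))
classes = ["Short", "Medium", "Tall"]
print("rows = actual, columns = predicted:", classes)
for actual in classes:
    print(actual, [counts[(actual, predicted)] for predicted in classes])

# Accuracy = correctly assigned / total = 7 / 15 for this data.
accuracy = sum(a == b for a, b in zip(correct, assigned)) / len(correct)
print("accuracy:", round(accuracy, 3))
```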
Bayes Theorem
P(c|x) = P(x|c) · P(c) / P(x)
• P(c|x) is the posterior probability of the class (c, target) given the predictor (x, attributes).
• P(c) is the prior probability of the class.
• P(x|c) is the likelihood, the probability of the predictor given the class.
• P(x) is the prior probability of the predictor.
Data tuple (A): 35-year-old customer with an income of $40,000
Hypothesis (B): customer will buy a computer
Likelihood: P(A|B),
⚫ P(A|B) is the likelihood. It represents the probability of observing the data
(35-year-old with an income of $40,000) given that the hypothesis
(customer will buy a computer) is true.
Prior Probability: P(A),
⚫ is the prior probability of a customer being a 35-year-old with an income
of $40,000.
Posterior Probability: P(B|A),
⚫ probability of the customer buying a computer given their age and income.
For all entries in the dataset, the denominator does not change; it remains constant. Therefore, the denominator can be removed and a proportionality can be introduced.
Prior probabilities: P(p) = 9/14, P(n) = 5/14
The conditional (derived) probabilities P(attribute value | class) are estimated from the training data.
Play-Tennis example: classifying X = (rain, hot, high, false)
Posterior (unnormalized) probabilities:
P(X|p)·P(p) = P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p)
            = 3/9 · 2/9 · 3/9 · 6/9 · 9/14 = 0.010582
P(X|n)·P(n) = P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n)
            = 2/5 · 2/5 · 4/5 · 2/5 · 5/14 = 0.018286
Since 0.018286 > 0.010582, X is classified as class n (do not play).
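The same computation can be scripted. The sketch below hard-codes the conditional probabilities quoted above; the dictionary and variable names are illustrative.

```python
# Class-conditional probabilities for X = (rain, hot, high, false), as quoted above.
p_given_play    = {"rain": 3/9, "hot": 2/9, "high": 3/9, "false": 6/9}
p_given_no_play = {"rain": 2/5, "hot": 2/5, "high": 4/5, "false": 2/5}
prior = {"p": 9/14, "n": 5/14}

x = ["rain", "hot", "high", "false"]

score_p = prior["p"]
score_n = prior["n"]
for value in x:                       # multiply the class-conditional probabilities
    score_p *= p_given_play[value]
    score_n *= p_given_no_play[value]

print(round(score_p, 6), round(score_n, 6))   # 0.010582  0.018286
print("classified as:", "p (play)" if score_p > score_n else "n (do not play)")
```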
Example: Height Classification
Classify the tuple t = (Adam, M, 1.95 m).
P(short) = 4/15 = 0.267
P(medium) = 8/15 = 0.533
P(tall) = 3/15 = 0.2
Finally, we obtain the actual probabilities of each event (Using Bayes Theorem):
Therefore, based on these probabilities, we classify the new tuple as tall because
it has the highest probability.
Advantages of Naïve Bayes
It is easy to use.
Unlike other classification approaches, only one
scan of the training data is required.
The naive Bayes approach can easily handle missing
values by simply omitting that probability when
calculating the likelihoods of membership in each
class.
In cases where there are simple relationships, the
technique often does yield good results.
Disadvantages of Naïve Bayes
It assumes that the attributes are independent of each other, which is often not true in practice.
It does not handle continuous data directly; continuous attributes must first be divided into ranges (discretized).
Splitting Attributes

Training data:
ID  Home Owner  Marital Status  Annual Income  Defaulted Borrower
1   Yes         Single          125K           No
2   No          Married         100K           No
3   No          Single          70K            No
4   Yes         Married         120K           No
5   No          Divorced        95K            Yes
6   No          Married         60K            No
7   Yes         Divorced        220K           No
8   No          Single          85K            Yes
9   No          Married         75K            No
10  No          Single          90K            Yes

Tree 1: the root tests Home Owner (Yes -> NO); for Home Owner = No, Marital Status (MarSt) is tested (Married -> NO); for Single or Divorced, Income is tested (< 80K -> NO, > 80K -> YES).
Tree 2 (built from the same training data): the root tests MarSt (Married -> NO); for Single or Divorced, Home Owner is tested (Yes -> NO); for Home Owner = No, Income is tested (< 80K -> NO, > 80K -> YES).

There could be more than one tree that fits the same data!
Apply Model to Test Data

Test data:
Home Owner  Marital Status  Annual Income  Defaulted Borrower
No          Married         80K            ?

Start from the root of the tree and follow the branch that matches the record at each internal node:
⚫ Home Owner = No, so take the No branch to the MarSt node.
⚫ Marital Status = Married, so take the Married branch, which leads to the leaf NO.
The model therefore predicts Defaulted Borrower = No for this record.
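The traversal can be written as a few nested tests. The sketch below encodes Tree 1 from above (the function name predict_default is illustrative) and applies it to the test record.

```python
def predict_default(home_owner: str, marital_status: str, income_k: float) -> str:
    """Walk the decision tree and return the predicted 'Defaulted Borrower' label."""
    if home_owner == "Yes":
        return "No"                        # Home Owner = Yes -> leaf NO
    if marital_status == "Married":
        return "No"                        # Married -> leaf NO
    # Single or Divorced: test Annual Income
    return "No" if income_k < 80 else "Yes"

# Test record: Home Owner = No, Marital Status = Married, Annual Income = 80K
print(predict_default("No", "Married", 80))   # -> No
```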
Decision Tree
Given:
⚫ D = {t1, …, tn} where ti=<ti1, …, tih>
⚫ Attributes {A1, A2, …, Ah}
⚫ Classes C={C1, …., Cm}
A Decision or Classification Tree is a tree associated with D such that:
⚫ Each internal node is labeled with an attribute Ai
⚫ Each arc is labeled with a predicate that can be applied to the attribute at its parent node
⚫ Each leaf node is labeled with a class Cj
Decision Tree based Algorithms
Pruning
⚫ Once a tree is constructed, some modifications to the tree
might be needed to improve the performance of the tree
during the classification phase.
⚫ The pruning phase might remove redundant comparisons
or remove subtrees to achieve better performance.
Comparing Decision Trees
ID3
ID3 stands for Iterative Dichotomiser 3
ID3 creates a tree using information theory concepts and tries to reduce the expected number of comparisons.
ID3 chooses the split attribute with the highest information gain:
Information gain = (entropy of the distribution before the split) - (entropy of the distribution after it)
Entropy
⚫ Entropy is used to measure the amount of uncertainty, surprise, or randomness in a set of data.
⚫ When all data belongs to a single class, entropy is zero, as there is no uncertainty.
⚫ An equally divided sample has an entropy of 1.
⚫ The mathematical formula for entropy is E(S) = -Σ pi·log2(pi), where pi is the proportion of examples in S that belong to class i.
[Figure: two candidate nodes; the left node has lower entropy (more purity) than the right node, since the left node has a greater number of "yes" examples and the decision there is easy.]
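A short sketch of the entropy formula above, computed from the class counts at a node (the helper name entropy is illustrative):

```python
import math

def entropy(class_counts):
    """E(S) = -sum(p_i * log2(p_i)) over the classes present in S."""
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total)
                for c in class_counts if c > 0)

print(entropy([5, 0]))    # pure node          -> 0.0
print(entropy([5, 5]))    # equally divided    -> 1.0
print(entropy([9, 5]))    # 9 "yes" / 5 "no"   -> about 0.940
```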
Information Gain
Let's see how the decision tree will be built using these two features. We'll use information gain to decide which feature should be the root node and which feature should be placed after the split.
Information Gain
Now that we have E(Parent) and E(Parent|Energy), the information gain is:
Information Gain = E(Parent) - E(Parent|Energy)
Our parent entropy was near 0.99, and given this value of information gain, we can say that the entropy of the dataset will decrease by 0.37 if we make "Energy" the root node.
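The same calculation in code: information gain is the parent's entropy minus the weighted entropy of the child nodes produced by the split. The counts below are made up purely to illustrate the formula; they are not the slide's "Energy" example.

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    """IG = E(parent) - sum over children of (|child| / |parent|) * E(child)."""
    total = sum(parent_counts)
    weighted = sum(sum(child) / total * entropy(child)
                   for child in children_counts)
    return entropy(parent_counts) - weighted

# Hypothetical split of a 9-positive / 5-negative parent into two children.
parent = [9, 5]
children = [[6, 1], [3, 4]]
print(round(information_gain(parent, children), 3))   # about 0.152
```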
Information Gain
[Figure: entropy reduction by partitioning the data set on the Color attribute, which splits the examples into red, yellow, and green subsets.]
Information Gain of the Attributes
⚫ Gain(Color) = 0.246
⚫ Gain(Outline) = 0.151
⚫ Gain(Dot) = 0.048
Heuristic: the attribute with the highest gain is chosen, so Color is chosen as the root node.
Splitting a Color branch (5 examples, entropy = 0.971):

Gain(Outline):
P(dashed) = 3/5, P(solid) = 2/5
I(dashed) = -(3/3)·log2(3/3) - 0 = 0
I(solid) = -0 - (2/2)·log2(2/2) = 0
I(Outline) = (3/5)·0 + (2/5)·0 = 0
Gain(Outline) = 0.971 - 0 = 0.971

Gain(Dot):
P(yes) = 2/5, P(no) = 3/5
I(yes) = -(1/2)·log2(1/2) - (1/2)·log2(1/2) = 1
I(no) = -(1/3)·log2(1/3) - (2/3)·log2(2/3) = 0.917
I(Dot) = (2/5)·1 + (3/5)·0.917 = 0.950
Gain(Dot) = 0.971 - 0.950 = 0.020

Outline has the higher gain, so it is chosen to split this branch.
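These branch gains can be checked numerically. The sketch below recomputes them from the class counts implied above: a [2, 3] parent split either into two pure subsets by Outline, or into a [1, 1] and a [1, 2] subset by Dot.

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def gain(parent, children):
    total = sum(parent)
    return entropy(parent) - sum(sum(c) / total * entropy(c) for c in children)

parent = [2, 3]                                    # 5 examples, entropy ~ 0.971
print(round(gain(parent, [[3, 0], [2, 0]]), 3))    # Gain(Outline) -> 0.971
print(round(gain(parent, [[1, 1], [1, 2]]), 3))    # Gain(Dot)     -> about 0.020
```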
At the other impure Color branch (also with entropy 0.971):
Gain(Outline) = 0.971 - 0.951 = 0.020 bits
Gain(Dot) = 0.971 - 0 = 0.971 bits
Dot has the higher gain, so it is chosen to split this branch; the Outline test (solid / dashed) chosen in the previous step is already part of the tree.
[Figure: the resulting decision tree, with Color at the root (red / yellow / green branches) and further tests on Dot (yes / no) and Outline (solid / dashed) at the impure branches.]
Decision Tree
[Figure: the final decision tree, with Color as the root attribute and branches for red, yellow, and green.]
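Putting the pieces of this unit together, here is a compact, illustrative sketch of the ID3 procedure: at every node it picks the attribute with the highest information gain and recurses until the subset is pure. The attribute names and toy records are made up for illustration and do not reproduce the slide's full dataset.

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum(n / total * math.log2(n / total)
                for n in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Information gain of splitting (rows, labels) on attribute attr."""
    total = len(labels)
    split = {}
    for row, label in zip(rows, labels):
        split.setdefault(row[attr], []).append(label)
    remainder = sum(len(subset) / total * entropy(subset)
                    for subset in split.values())
    return entropy(labels) - remainder

def id3(rows, labels, attributes):
    if len(set(labels)) == 1:                 # pure node -> leaf
        return labels[0]
    if not attributes:                        # no attributes left -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: info_gain(rows, labels, a))
    tree = {best: {}}
    for value in {row[best] for row in rows}:
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        tree[best][value] = id3([rows[i] for i in idx],
                                [labels[i] for i in idx],
                                [a for a in attributes if a != best])
    return tree

# Tiny made-up run on records with Color / Outline / Dot attributes.
rows = [{"Color": "red",    "Outline": "dashed", "Dot": "yes"},
        {"Color": "red",    "Outline": "solid",  "Dot": "no"},
        {"Color": "green",  "Outline": "solid",  "Dot": "no"},
        {"Color": "yellow", "Outline": "dashed", "Dot": "yes"}]
labels = ["+", "-", "-", "+"]
print(id3(rows, labels, ["Color", "Outline", "Dot"]))
```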