
Module - 3 (Advanced classification and

Cluster analysis)

Module – 3.1
(Advanced classification)

Major Source For This Material:


Data Mining: Concepts and Techniques (3rd ed.) — Chapter 8 —
Jiawei Han, Micheline Kamber, and Jian Pei

1
Module - 3 (Advanced classification and Cluster analysis)
1.1 Classification - Introduction
1.2 Decision tree construction principle
1.3 Splitting indices - Information Gain, Gini index
1.4 Decision tree construction algorithms - ID3
1.5 Decision tree construction with presorting - SLIQ
2 Classification Accuracy - Precision, Recall
3.1 Introduction to clustering - Clustering Paradigms
3.2 Partitioning Algorithm - PAM
3.3 Hierarchical Clustering - DBSCAN
3.4 Categorical Clustering - ROCK
Note: Data Analytics Coverage
1. Classification: Naïve Bayes, KNN
2. Clustering: Hierarchical, Partitioning
2
Module – 3.1 Classification
1.1 Classification - Introduction
1.2 Decision tree construction principle
1.3 Splitting indices - Information Gain, Gini index
1.4 Decision tree construction algorithms - ID3
1.5 Decision tree construction with presorting - SLIQ
Note: Data Analytics Coverage
1. Classification: Naïve Bayes, KNN

3
3.1 Short Notes
3.1.1. Define the entropy of a dataset. Write a formula to compute the entropy of a two-class dataset. (Slide: 17, 18)
3.1.2. How is Gain Ratio calculated? What is the advantage of Gain Ratio over Information Gain? (3)
Define information gain and Gini index. (Slide 19, 29)
3.1.3. Explain the steps in creating a decision tree using the ID3 algorithm. (4)
Explain the ID3 algorithm for building decision trees - Slide 12, 13, 14
3.1.4. Discuss the issues in the implementation of a decision tree. (3)
What are the challenges in building a decision tree? How are they overcome?
3.1.5. Explain the construction of a decision tree using the SLIQ algorithm with an example. (8)
Explain the working of the SLIQ algorithm. (6)

4
Regression Vs Classification

5
Classification Methods
◼ Bayes Classification Methods
◼ K nearest neighbours (KNN)
◼ Support Vector Machine (SVM)
◼ Neural Networks
◼ Decision Tree
◼ Logistic Regression
◼ Linear Discriminant Analysis (LDA)
(Covered here: 1.2 Decision tree construction principle; 1.3 Splitting indices - Information Gain, Gini index; 1.4 Decision tree construction algorithms - ID3; 1.5 Decision tree construction with presorting - SLIQ)

6
3.1.6.
Consider the dataset for a binary classification problem with class labels "yes" and "no". The table shows a class-labeled dataset of customers in a bank. Explain the information gain attribute selection measure, and find the information gain of the attribute "age".

7
Decision Tree Construction - Illustration:
(Note: first, categorize the attribute “Age”)
age income student credit buys computer
<=30 high no fair no
<=30 high no poor no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes poor no
31…40 low yes poor yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes poor yes
31…40 medium no poor yes
31…40 high yes fair yes
>40 medium no poor no
8
Select the attribute ‘Age’ for the root node split. Sort the data by age categories.

age      buys                        counts (Buys Yes : Buys No)
<=30     no, no, no, yes, yes        2 : 3
31…40    yes, yes, yes, yes          4 : 0
>40      no, yes, no, yes, yes       3 : 2

Root split on age?: <=30 (2:3), 31..40 (4:0) -> “yes”, >40 (3:2)
9
For the branch Age <= 30, select attribute ‘Student’ for the node split.

Within Age <= 30 (2 Yes : 3 No):
  student = no  -> 0 Yes : 3 No
  student = yes -> 2 Yes : 0 No
Within Age > 40 (3 Yes : 2 No), ‘Student’ gives impure partitions:
  student = no  -> 1 Yes : 1 No
  student = yes -> 2 Yes : 1 No
Age 31…40 (4 Yes : 0 No) is already pure.
10
age?
  <=30 (2:3): Student?
    Stud-Yes (2:0) -> yes
    Stud-No (0:3) -> no
  31..40 (4:0) -> yes
  >40 (3:2)
For Age > 40, select attribute ‘Credit Rating’ for the split.

age?
  <=30 (2:3): Student?
    Stud-Yes (2:0) -> yes
    Stud-No (0:3) -> no
  31..40 (4:0) -> yes
  >40 (3:2): credit rating?
    Fair (3:0) -> yes
    Poor (0:2) -> no
Decision Tree Construction
◼ Decision tree construction is a top-down recursive tree
induction algorithm, which uses an attribute selection
measure to select the best attribute to split each non-leaf node.
◼ Algorithms like ID3, C4.5, and CART employ different
selection criteria to build efficient decision trees.

13
Algorithm for Decision Tree Construction …
◼ At the start, all the training samples are at the root
◼ Attributes are assumed to be categorical (if the attributes
are continuous-valued, discretize them in advance)
◼ Construct a decision tree by splitting a node and growing
each branch. This is done in a top-down recursive manner,
until the conditions for termination of the algorithm are
reached (see the next slide). The steps involved are:
◼ If the tuples are all from the same class, then the node
becomes a leaf, labeled with that class.
◼ Otherwise, partition the node using a selected attribute.
The attribute selection is done based on a heuristic or
statistical measure such as information gain, Gini index,
etc.

14
… Algorithm for Decision Tree Construction
◼ Conditions for termination of the algorithm
◼ All samples for a given node belong to the same class
◼ There are no more attributes for further partitioning. In this
case, majority voting is employed for classifying the leaf
◼ There are no samples left
◼ Other conditions set by the researcher. E.g., maximum tree
depth is attained

15
The ID3 Algorithm for Decision Tree Construction
◼ ID3 (Iterative Dichotomiser 3) is an algorithm for
decision tree construction.
◼ It follows the basic steps listed in the ‘Algorithm for
Decision Tree Construction’. In addition, the following
characteristics apply.
◼ The ID3 algorithm assumes that there are only two
class labels, namely, “+” and “−”.
◼ The attributes can be multi-valued
◼ The algorithm uses information gain to select the
attribute for node split.

16
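To make the steps above concrete, here is a minimal runnable sketch of ID3-style tree growth in Python. It is illustrative only, not the slides' pseudocode: the list-of-dictionaries data layout and the function names are assumptions.

```python
import math
from collections import Counter

def entropy(rows, target):
    """Entropy of the class-label distribution in `rows`."""
    counts = Counter(row[target] for row in rows)
    n = len(rows)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def info_gain(rows, attr, target):
    """Gain(A) = Info(D) - Info_A(D) for a categorical attribute `attr`."""
    n = len(rows)
    remainder = 0.0
    for value in {row[attr] for row in rows}:
        subset = [row for row in rows if row[attr] == value]
        remainder += (len(subset) / n) * entropy(subset, target)
    return entropy(rows, target) - remainder

def id3(rows, attributes, target):
    """Grow the tree top-down; returns a class label (leaf) or a nested dict (internal node)."""
    labels = [row[target] for row in rows]
    if len(set(labels)) == 1:             # all tuples in the same class -> leaf
        return labels[0]
    if not attributes:                    # no attributes left -> majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: info_gain(rows, a, target))
    node = {best: {}}
    for value in {row[best] for row in rows}:
        subset = [row for row in rows if row[best] == value]
        node[best][value] = id3(subset, [a for a in attributes if a != best], target)
    return node
```

Called on the 14-row 'buys computer' table used in the following slides, with attributes ['age', 'income', 'student', 'credit'], a routine like this would place 'age' at the root, in line with the worked example.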
Decision Tree Construction using the ID3 Algorithm

Info(D, Age) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694
Gain(Age) = 0.246
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit) = 0.048

Resulting tree:
Age?
  Young (2:3): Student?
    Stud-Yes (2:0) -> yes
    Stud-No (0:3) -> no
  Middle (4:0) -> yes
  Senior (3:2): credit rating?
    Fair (3:0) -> yes
    Poor (0:2) -> no
Attribute Selection Measures

◼ Information Gain
◼ Gini Index

18
Entropy
◼ Entropy, in information theory, is a measure of the uncertainty
associated with a random variable.
◼ Calculation: For a discrete random variable Y taking m distinct values {y1, ..., ym},

  H(Y) = − Σ_{i=1}^{m} p_i log2(p_i),   where p_i = P(Y = y_i)

◼ Higher entropy => higher uncertainty
◼ Lower entropy => lower uncertainty

19
Entropy of a 2-class dataset
◼ Entropy(D) = I(p1, p2) = −p1 log2(p1) − p2 log2(p2)
where:
p1 is the proportion of samples in D belonging to class 1
p2 is the proportion of samples in D belonging to class 2
◼ If p1 or p2 equals zero, the entropy = 0. The dataset is pure.
◼ For a 2-class dataset, if p1 = p2 = 0.5, entropy = 1. The
dataset is perfectly impure (equal probability for each class).
◼ Consider a dataset with 9 objects labeled ‘Yes’ and 5 labeled ‘No’.
Entropy(D) = I(9,5) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940

20
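The 0.940 value (and the pure/impure extremes) can be checked in a few lines of Python; a minimal sketch, with an illustrative helper name:

```python
import math

def two_class_entropy(p1, p2):
    # I(p1, p2) = -p1*log2(p1) - p2*log2(p2); a term is taken as 0 when its proportion is 0
    return sum(-p * math.log2(p) for p in (p1, p2) if p > 0)

print(two_class_entropy(9/14, 5/14))   # ~0.940 for the 9 'Yes' / 5 'No' dataset above
print(two_class_entropy(0.5, 0.5))     # 1.0 -> perfectly impure
print(two_class_entropy(1.0, 0.0))     # 0.0 -> pure dataset
```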
Attribute Selection Measure: Information Gain
◼ Let pi be the probability that an arbitrary tuple in D belongs
to class Ci. Therefore, pi = |Ci,D| / |D|
◼ Entropy of D: Info(D) = − Σ_{i=1}^{m} p_i log2(p_i)
◼ If attribute A takes v values, A splits D into v partitions:

  Info_A(D) = Σ_{j=1}^{v} (|Dj| / |D|) × Info(Dj)

◼ Information gained by a node split using attribute A:
  Gain(A) = Info(D) − Info_A(D)
◼ Compute the information gain for all attributes. Select the
attribute with the highest information gain for the node split.

21
Decision Tree Construction - Illustration:
(Note: first, categorize the attribute “Age”)
age income student credit buys computer
<=30 high no fair no
<=30 high no poor no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes poor no
31…40 low yes poor yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes poor yes
31…40 medium no poor yes
31…40 high yes fair yes
>40 medium no poor no
22
Information Gain Example
Use the example of a laptop purchase. Identify the best attribute based on the ‘information gain’.
In database ‘D’, there are 14 samples, with 9 ‘yes’ and 5 ‘no’ (the ‘buys computer’ column).
Probabilities p1 = 9/14 and p2 = 5/14.
Info(D) = I(9,5) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940
23
Information Gain: Consider ‘Age’ for the Root Node Split
Info(D) = 0.940
Attribute ‘age’ splits ‘D’ into 3 subsets: Young (2 yes, 3 no), Middle (4 yes, 0 no), Senior (3 yes, 2 no).
Info(D, Age) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2)
             = (5/14)(0.97) + (4/14)(0) + (5/14)(0.97)
             = 0.694
Gain(D, Age) = Info(D) − Info(D, Age) = 0.246
24
Info. Gain: Consider ‘Income’ for the Root Node Split
Info(D) = 0.940
Attribute ‘income’ splits ‘D’ into 3 subsets: high (2 yes, 2 no), medium (4 yes, 2 no), low (3 yes, 1 no).
Info(D, Income) = (4/14) I(2,2) + (6/14) I(4,2) + (4/14) I(3,1)
                = (4/14)(1.0) + (6/14)(0.92) + (4/14)(0.81)
                = 0.911
Gain(D, Income) = Info(D) − Info(D, Income) = 0.029
25
Info. Gain: Consider ‘Student’ for the Root Node Split
Info(D) = 0.940
Attribute ‘student’ splits ‘D’ into 2 subsets: student = yes (6 yes, 1 no), student = no (3 yes, 4 no).
Info(D, Student) = (7/14) I(6,1) + (7/14) I(3,4)
                 = (7/14)(0.59) + (7/14)(0.99) = 0.79
Gain(D, Student) = Info(D) − Info(D, Student) = 0.151

26
Information Gain Example: Feature Selection for the Root Node Split
Gain(age) = Info(D) − Info_age(D) = 0.246
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048
Based on the above, the attribute ‘age’ provides the maximum information gain. So, we select ‘age’ for the root node split.
27
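For reference, the four gain values quoted above can be reproduced from the 14-row table with a short script. This is a sketch (the tuple layout and column map are assumptions), using the same entropy/gain definitions as the earlier ID3 sketch:

```python
import math
from collections import Counter

# (age, income, student, credit, buys_computer) rows from the 14-row table above
data = [
    ("<=30", "high",   "no",  "fair", "no"),  ("<=30",   "high",   "no",  "poor", "no"),
    ("31..40", "high", "no",  "fair", "yes"), (">40",    "medium", "no",  "fair", "yes"),
    (">40",  "low",    "yes", "fair", "yes"), (">40",    "low",    "yes", "poor", "no"),
    ("31..40", "low",  "yes", "poor", "yes"), ("<=30",   "medium", "no",  "fair", "no"),
    ("<=30", "low",    "yes", "fair", "yes"), (">40",    "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "poor", "yes"), ("31..40", "medium", "no",  "poor", "yes"),
    ("31..40", "high", "yes", "fair", "yes"), (">40",    "medium", "no",  "poor", "no"),
]
COLS = {"age": 0, "income": 1, "student": 2, "credit": 3}

def entropy(rows):
    n = len(rows)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(r[-1] for r in rows).values())

def gain(rows, col):
    n = len(rows)
    remainder = 0.0
    for value in {r[col] for r in rows}:
        subset = [r for r in rows if r[col] == value]
        remainder += (len(subset) / n) * entropy(subset)
    return entropy(rows) - remainder

for name, col in COLS.items():
    print(name, round(gain(data, col), 3))   # age 0.246, income 0.029, student 0.151, credit 0.048
```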
Attribute Selection Measures

◼ Information Gain
◼ Gini Index

28
Gini Index of a Dataset
◼ If a data set D contains samples from n classes, the Gini index Gini(D) is defined as

  Gini(D) = 1 − Σ_{j=1}^{n} p_j²

  where p_j is the relative frequency of class j in D.
◼ The computer purchase database D has customers who buy computers and those who do not buy (9 ‘yes’, 5 ‘no’).
  The probability of a customer buying a computer: p1 = 9/14
◼ Probability of not buying a computer: p2 = 5/14
  Gini(D) = 1 − ((9/14)² + (5/14)²) = 0.46
29
Attribute Selection Measure: Gini Index
◼ If attribute A splits the data set D into two subsets D1 and D2, the resulting Gini index is computed as:

  Gini(D, A) = (|D1| / |D|) Gini(D1) + (|D2| / |D|) Gini(D2)

  where |D|, |D1| and |D2| are the sizes of the datasets D, D1, and D2 respectively.
◼ Reduction in impurity when ‘A’ splits D into D1 and D2:
  δGini(D, A) = Gini(D) − Gini(D, A)
◼ Compute the Gini index for splitting node ‘D’ using each attribute. Choose the attribute that results in the greatest impurity reduction.
30
‘Age’ for the Root Node Split
Gini(D) = 1 − ((9/14)² + (5/14)²) = 0.46
Attribute ‘Age’ splits ‘D’ into 3 subsets: Young (2 yes, 3 no), Middle (4 yes, 0 no), Senior (3 yes, 2 no).
Gini(D, Age) = (5/14)[1 − ((2/5)² + (3/5)²)]
             + (4/14)[0]
             + (5/14)[1 − ((3/5)² + (2/5)²)]
             = 0.34
Reduction in impurity: δGini(D, Age) = Gini(D) − Gini(D, Age) = 0.46 − 0.34 = 0.12
31
‘Income’ for the Root Node Split
Gini(D) = 1 − ((9/14)² + (5/14)²) = 0.46
Attribute ‘income’ splits ‘D’ into 3 subsets: high (2 yes, 2 no), medium (4 yes, 2 no), low (3 yes, 1 no).
Gini(D, Income) = (4/14)[1 − ((2/4)² + (2/4)²)]
                + (6/14)[1 − ((4/6)² + (2/6)²)]
                + (4/14)[1 − ((3/4)² + (1/4)²)]
                = 0.44
Reduction in impurity: δGini(D, Income) = Gini(D) − Gini(D, Income) = 0.46 − 0.44 = 0.02

32
DT Using the Gini Index
Gini(D, Age) = (5/14)[1 − ((2/5)² + (3/5)²)] + 0 + (5/14)[1 − ((3/5)² + (2/5)²)] = 0.34
Gini(D, Income) = 0.44

Age?
  Young (2:3): Student?
    Stud-Yes (2:0) -> yes
    Stud-No (0:3) -> no
  Middle (4:0) -> yes
  Senior (3:2): credit rating?
    Fair (3:0) -> yes
    Poor (0:2) -> no
Gini Index example:
Feature Selection for Root Node Split
◼ Gini(D) = 0.46

◼ Reduction in Impurity δ Gini(D, Age)


= 0.46 – 0.34 = 0.12

◼ Reduction in Impurity δ Gini(D, Income)


= 0.46 – 0.44 = 0.02

◼ From the above analysis, the attribute ‘Age’ reduces Gini
impurity more than ‘Income’ when used for the root node split.
◼ Therefore, we select ‘Age’ for the root node split.

34
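The same comparison can be verified numerically from the per-partition class counts; a minimal sketch (helper names are illustrative):

```python
def gini(counts):
    """Gini index from a list of per-class counts, e.g. [9, 5]."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def gini_split(partitions):
    """Weighted Gini of a split, given the class counts of each partition."""
    n = sum(sum(p) for p in partitions)
    return sum((sum(p) / n) * gini(p) for p in partitions)

gini_d = gini([9, 5])                                  # 0.459, rounded to 0.46 on the slides
gini_age = gini_split([[2, 3], [4, 0], [3, 2]])        # 0.343 (0.34)
gini_income = gini_split([[2, 2], [4, 2], [3, 1]])     # 0.443 (0.44)
print(round(gini_d - gini_age, 2), round(gini_d - gini_income, 2))   # 0.12 0.02
```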
Comparing Attribute Selection Measures

Information gain:
- used in the ID3 algorithm
- biased towards multivalued attributes (e.g., ‘income’ takes the values low, medium, high)
- handles multiple classes effectively
- requires computation of logarithms

Gini index:
- used in the CART algorithm
- also biased towards multivalued attributes; however, the bias is less compared to information gain
- has difficulty when the number of classes is large
- tends to favor tests that result in equal-sized partitions with high purity in both partitions
- computation is simpler
35
Challenges in Decision Tree Construction
1. Overfitting: The tree becomes too complex and captures
noise instead of patterns. Use pruning (removing
unnecessary branches), set minimum split criteria, or apply
ensemble methods like Random Forest (PTO)
2. Handling Missing Data: Missing values affect decision-
making at nodes. Use imputation techniques
3. Class Imbalance: When one class dominates, the tree
may favor it. Use oversampling (SMOTE), undersampling, or
adjust class weights during training.
4. Computational Complexity: Large datasets with many
features slow down tree construction. Use feature selection,
sampling, or scalable algorithms like RainForest.
5. Bias-Variance Tradeoff: Simple trees have high bias,
deep trees have high variance. Use cross-validation to find
the optimal tree depth.

36
Overfitting
◼ Overfitting:
◼ If there are too many branches, some branches
may reflect anomalies due to noise or outliers
◼ This results in poor accuracy for unseen samples
◼ Two approaches to avoid overfitting
◼ Pre-pruning: Halt tree construction early; do not split
a node if this would result in the goodness measure
falling below a threshold. Problem: it is difficult to
choose an appropriate threshold
◼ Post-pruning: Remove branches from a “fully grown”
tree. Reserve some data as test data to evaluate
and identify the “best pruned tree”

37
For the dataset given below, find the first splitting attribute for the
decision tree by using the ID3 algorithm (4)

38
3.1.7
Consider the following data for a binary
classification problem with class labels C1 and
C2.
i.Calculate the gain in Gini index when splitting
the root node using the attributes ‘A’ and ‘B’.
Which attribute would the decision tree
induction algorithm choose?
ii.Calculate the information gain when splitting
the root node using the attributes ‘A’ and ‘B’.
Which attribute would the decision tree
induction algorithm choose? (8)
(i) Calculation of the gain in Gini index – 2
marks Attribute selection based on gain value
comparison –2 marks
ii) Calculation of the information gain – 2 marks
Attribute selection based on information gain –
2 marks]

39
Gini Index at the Root Node

A  B  Class
T  F  -
F  F  -
F  F  -
F  F  -
T  T  -
T  F  -
T  F  +
T  T  +
T  T  +
T  T  +

10 records: 6 of class ‘-’ and 4 of class ‘+’.
Gini(D) = 1 − ((6/10)² + (4/10)²) = 0.48
40
Gini Index when we split D using attribute A

Subset D1 (A = F): 3 records (0 ‘+’ : 3 ‘-’)
  Gini(D1) = 1 − ((0/3)² + (3/3)²) = 0
Subset D2 (A = T): 7 records (4 ‘+’ : 3 ‘-’)
  Gini(D2) = 1 − ((4/7)² + (3/7)²) = 0.49
Gini(D, A) = 0 × |D1|/|D| + 0.49 × |D2|/|D| = 0.49 × 7/10 = 0.34
Reduction in impurity = Gini(D) − Gini(D, A) = 0.48 − 0.34 = 0.14
41
Gini Index when we split D using attribute B

Subset D1 (B = F): 6 records (1 ‘+’ : 5 ‘-’)
  Gini(D1) = 1 − ((1/6)² + (5/6)²) = 0.28
Subset D2 (B = T): 4 records (3 ‘+’ : 1 ‘-’)
  Gini(D2) = 1 − ((3/4)² + (1/4)²) = 0.38
Gini(D, B) = 0.28 × |D1|/|D| + 0.38 × |D2|/|D| = 0.28 × 6/10 + 0.38 × 4/10 = 0.32
Reduction in impurity = Gini(D) − Gini(D, B) = 0.48 − 0.32 = 0.16
42
Select Attribute
▪ The reduction in impurity resulting from a root node split by attribute
‘B’ (0.16) is higher than that of ‘A’ (0.14).
▪ Therefore, select the attribute 'B' for the root node split.

43
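A script along these lines can be used to check part (i), since it reproduces the 0.14 and 0.16 reductions above, and to run the analogous entropy-based computation needed for part (ii). The row layout is an assumption; the '+'/'-' labels stand in for C1/C2.

```python
import math
from collections import Counter

# (A, B, class) rows from the exercise table
rows = [("T", "F", "-"), ("F", "F", "-"), ("F", "F", "-"), ("F", "F", "-"), ("T", "T", "-"),
        ("T", "F", "-"), ("T", "F", "+"), ("T", "T", "+"), ("T", "T", "+"), ("T", "T", "+")]

def impurity(subset, measure):
    n = len(subset)
    probs = [c / n for c in Counter(r[-1] for r in subset).values()]
    if measure == "gini":
        return 1 - sum(p * p for p in probs)
    return -sum(p * math.log2(p) for p in probs)        # entropy

def reduction(col, measure):
    """Gini gain (or information gain) of splitting the root on column `col`."""
    n = len(rows)
    split = 0.0
    for value in {r[col] for r in rows}:
        part = [r for r in rows if r[col] == value]
        split += (len(part) / n) * impurity(part, measure)
    return impurity(rows, measure) - split

for measure in ("gini", "entropy"):
    print(measure, "A:", round(reduction(0, measure), 3), "B:", round(reduction(1, measure), 3))
# The Gini line reproduces the 0.14 (A) and 0.16 (B) reductions computed on the slides.
```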
3.1.8.
Find the first splitting attribute for the decision tree by using the
ID3 algorithm with the following dataset. (8)
[Finding the first splitting attribute – 4 marks
Calculating the gain values of attributes – 4 marks]

44
3.1.9
◼ See the table below. The goal is to build a decision tree using the ID3
algorithm to predict whether a person buys a computer game based
on the given features. The target variable is "Buy Computer Game,"
which can be either "Yes" or "No." The features are "Genre" (with
values: Action, Puzzle, Adventure) and "Price Range" (with values:
Low, Medium, High). Which node will be selected at the root?
Data Point  Genre      Price Range  Buy Game
1           Action     Low          Yes
2           Puzzle     Medium       No
3           Adventure  High         Yes
4           Action     Medium       No
5           Puzzle     Low          Yes
6           Adventure  Medium       Yes
7           Action     High         No
8           Puzzle     Medium       No
9           Adventure  Low          Yes
10          Action     Low          Yes
45
The target variable is "Buy Computer Game," which can be either
"Yes" or "No." Which node will be selected at the root?

Data Point  Genre  Price Range  Buy Computer Game
1 Action Low Yes
2 Puzzle Medium No
3 Adventure High Yes
4 Action Medium No
5 Puzzle Low Yes
6 Adventure Medium Yes
7 Action High No
8 Puzzle Medium No
9 Adventure Low Yes
10 Action Low Yes

46
Root Node Split by Price

The attribute ‘Price Range’ splits the root node into 3 subsets (the values in brackets give the counts of Yes and No):
  Low     I(4,0)   (4 Yes : 0 No)
  Medium  I(1,3)   (1 Yes : 3 No)
  High    I(1,1)   (1 Yes : 1 No)
47
Root Node Split by Price
• I(4,0) = 0
• I(1,3) = −(1/4) log2(1/4) − (3/4) log2(3/4) = 0.811
• I(1,1) = 1
• Info(D, Price) = weighted sum of the entropy of the three data subsets
  = (4/10) I(4,0) + (4/10) I(1,3) + (2/10) I(1,1)
  = 0 + (4/10)(0.811) + (2/10)(1) = 0.52

48
Sort by Genre                    Sort by Price Range
Genre      Buy Game              Price Range  Buy Game
Action     Yes                   Low          Yes
Action     Yes                   Low          Yes
Action     No                    Low          Yes
Action     No                    Low          Yes
Adventure  Yes                   Medium       Yes
Adventure  Yes                   Medium       No
Adventure  Yes                   Medium       No
Puzzle     Yes                   Medium       No
Puzzle     No                    High         Yes
Puzzle     No                    High         No
49
Root Node Split by Genre

• Genre-wise: 4 Action, 3 Adventure, and 3 Puzzle games
• Info(D, Genre) = (4/10) I(2,2) + (3/10) I(3,0) + (3/10) I(1,2)
  = (4/10)(1) + 0 + (3/10)[−(1/3) log2(1/3) − (2/3) log2(2/3)]
  = 0.4 + (3/10)(0.918) = 0.68
• Info(D, Genre) = 0.68
• Recall that Info(D, Price) = 0.52
• Info(D, Price) < Info(D, Genre)
∴ Information gain by Price > information gain by Genre
∴ The Price Range attribute must be selected for the root node split
50
3.1.11 (From Data Analytics, S6)

51
52
3.1.12

53
54
Module – 3.1 Classification

1.1 Classification- Introduction,

1.2 Decision tree construction principle,

1.3 Splitting indices -Information Gain, Gini index,

1.4 Decision tree construction algorithms-ID3,

1.5 Decision tree construction with presorting-SLIQ

55
Decision tree construction with presorting-SLIQ

3.1.5. Explain the construction of decision tree using SLIQ algorithm


with an example. (8)
Explain the working of SLIQ algorithm (6)
• SLIQ algorithm explanation – 4 marks
• Example/ diagram 2 marks
• References
1. https://www.cs.cmu.edu/~natassa/courses/15-721/papers/mehta96sliq.pdf
2. Mehta, M., Agrawal, R., & Rissanen, J. (1996). SLIQ: A fast scalable
classifier for data mining. In Advances in Database Technology—
EDBT'96: 5th International Conference on Extending Database
Technology Avignon, France, March 25–29, 1996 Proceedings 5
(pp. 18-32). Springer Berlin Heidelberg.
SLIQ

• Is a decision tree classifier


• Handles both numeric and categorical attributes.
SLIQ Lists
• Attribute List
• Each entry in the sorted attribute list contains two values.
• Sorted Attribute Values: A list of attribute values in
sorted order.
• Pointers to the Class List: Each attribute value links to
corresponding entries in the class list.
• Class list.
• There is an entry for every data item in this list.
• It consists of
• the data items’ class-label
• and the data item’s location in the decision tree.
Dataset
Index | Age | Income | Education Level | Class Label |

0 | 25 | 50000 | High School | Yes |


1 | 30 | 60000 | College | No |
2 | 30 | 70000 | College | Yes |
3 | 35 | 80000 | Graduate | Yes |
4 | 35 | 90000 | Graduate | No |

Attribute List Class List


Age | Pointer to Class List Index | Class Label | Node |
--------------------------- ----------------------------
25 | Index 0 0 | Yes | N0 |
30 | Index 1, Index 2 1 | No | N1 |
35 | Index 3, Index 4 2 | Yes | N2 |
3 | Yes | N3 |
4 | No | N4 |
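One way to picture the two SLIQ lists in code; an illustrative sketch only (the record layout and the use of a single root label are assumptions, following the algorithm description that all records start at the root):

```python
# Toy records mirroring the table above: (Age, Income, Education Level, Class Label)
records = [
    (25, 50000, "High School", "Yes"),
    (30, 60000, "College",     "No"),
    (30, 70000, "College",     "Yes"),
    (35, 80000, "Graduate",    "Yes"),
    (35, 90000, "Graduate",    "No"),
]

# Class list: one entry per record, holding the class label and the record's current tree node.
class_list = [{"label": cls, "node": "root"} for (_age, _inc, _edu, cls) in records]

# Attribute list: attribute values in sorted order, each pointing back into the class list.
def attribute_list(col):
    return sorted((rec[col], idx) for idx, rec in enumerate(records))

age_list = attribute_list(0)       # [(25, 0), (30, 1), (30, 2), (35, 3), (35, 4)]
income_list = attribute_list(1)

# During tree growth the attribute lists are never re-sorted; only the 'node'
# field of the class list entries is updated as records move to child nodes.
for value, idx in age_list:
    print(value, "->", class_list[idx])
```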
SLIQ Tree Growth Algorithm
The SLIQ (Supervised Learning in Quest) algorithm constructs
a decision tree in a level-wise (breadth-first) manner rather
than the traditional recursive depth-first approach.
1. Initial Step – Prepare sorted attribute list and class list
▪ The dataset is first sorted for each attribute separately.
▪ This sorting is done once at the beginning to avoid
repeated sorting during the tree growth.
▪ Each attribute value links to corresponding entries in the
class list.
▪ The class list contains one entry for each data
item: the data item’s class label and the data item’s
location (node number) in the decision tree.
▪ Initially, assume that all data items are in the root node
SLIQ Tree Growth Algorithm
2. Entropy Calculation at Each Level (Impure Frontier Nodes)
• This is an iterative step.
• At each step, the algorithm processes all the impure
nodes at the current frontier simultaneously (breadth-
first) to effect node split.
• For splitting a node
• Scan all the pre-sorted attribute lists. For each
attribute, calculate the entropy for all distinct attribute
values.
• Select the best attribute for splitting the given node
SLIQ Tree Growth Algorithm
3. The frontier nodes are split, and the tree is expanded to a
new frontier. Scan the sorted attribute lists once to update
the new node location in the class list
4. Repeat steps 2 and 3 until a termination criterion is met:
• Leaf nodes are pure
• No further significant splits can be made.
• A predefined stopping criterion (like a minimum number
of samples per node) is met.
Key Advantages of SLIQ
1. Optimized Sorting: Numeric attributes are sorted once,
reducing computational overhead.
2. Breadth-first tree-growing strategy.
• All nodes at a level are processed simultaneously. Therefore,
redundant calculations are avoided, and splitting is efficient.
• This strategy also enables SLIQ to classify disk-resident datasets.
3. Uses a fast sub-setting algorithm to determine splits for
categorical attributes.
4. Uses a tree-pruning algorithm based on the minimum
description length (MDL) principle. This algorithm is
inexpensive, and results in compact and accurate trees.
The combination of the above techniques enables SLIQ to scale to
large data sets and to classify data sets with many classes and
many attributes.
Example …
Dataset, Sorted Attribute Lists, and Class List (training data after pre-sorting)

Training data                Age list (sorted)        Salary list (sorted)       Class list
Index  Age  Salary  Class    Age  Class List Index    Salary  Class List Index   Index  Class  Leaf
1      30   65      G        23   2                   15      2                  1      G      N1
2      23   15      B        30   1                   40      4                  2      B      N1
3      40   75      G        40   3                   60      6                  3      G      N1
4      55   40      B        45   6                   65      1                  4      B      N1
5      55   100     G        55   5                   75      3                  5      G      N1
6      45   60      G        55   4                   100     5                  6      G      N1
Example continued …
• Assume that the root node is split using (Age <= 35).
• The left and right children are then split using the salary attribute:

(Age <= 35)
  left:  (Salary <= 40)
  right: (Salary <= 50)
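The benefit of the pre-sorted attribute lists is that candidate numeric splits such as (Age <= v) or (Salary <= v) can be evaluated in a single scan while running class histograms are maintained. The sketch below illustrates that idea for a single node using the Gini index; it is a simplification, not the paper's evaluation code, and the threshold it reports need not match the (Age <= 35) split assumed in the example above.

```python
from collections import Counter
from itertools import groupby

# (Age, Salary, Class) records from the SLIQ example above
data = [(30, 65, "G"), (23, 15, "B"), (40, 75, "G"),
        (55, 40, "B"), (55, 100, "G"), (45, 60, "G")]

def gini(counts):
    n = sum(counts.values())
    return 1 - sum((c / n) ** 2 for c in counts.values())

def best_numeric_split(col):
    """Scan one pre-sorted attribute list, keeping running class histograms."""
    n = len(data)
    left, right = Counter(), Counter(r[-1] for r in data)
    best_value, best_gini = None, float("inf")
    attribute_list = sorted((r[col], r[-1]) for r in data)     # (value, class) pairs in sorted order
    for value, group in groupby(attribute_list, key=lambda pair: pair[0]):
        for _, label in group:                                 # move this value's records to the left side
            left[label] += 1
            right[label] -= 1
        if sum(right.values()) == 0:                           # '<= largest value' is not a real split
            break
        weighted = (sum(left.values()) / n) * gini(left) + (sum(right.values()) / n) * gini(right)
        if weighted < best_gini:
            best_value, best_gini = value, weighted
    return best_value, best_gini

print("Age:", best_numeric_split(0))       # best 'Age <= v' candidate for this node
print("Salary:", best_numeric_split(1))    # best 'Salary <= v' candidate
```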
SLIQ -
More Details and Illustrations

66
MDL
• The pruning strategy used in SLIQ is based on the principle of
Minimum Description Length (MDL)
• The MDL principle states that the best model for encoding data is
the one that minimizes the sum of the cost of describing the
data in terms of the model and the cost of describing the
model.
• If M is a model that encodes the data D, the total cost of the
encoding is:
cost(M, D) = cost(D|M) + cost(M),
where cost(D|M) is the number of bits of encoding the data given
a model M and cost(M) is the cost of encoding the model M.
• The models are the set of trees obtained by pruning the initial
decision tree T, and the data is the training data set S. The
objective of MDL pruning is to find the subtree of T that best
describes the training set S.
Subsetting for Categorical Attributes
• Let S be the set of possible values of a categorical attribute A
• The evaluation of all the subsets of S can be very expensive,
if the cardinality of S is large.
• SLIQ uses a hybrid approach to overcome this issue. If the
cardinality of S is less than a threshold, MAXSETSIZE, then all
of the subsets of S are evaluated
• Otherwise, a greedy algorithm is used to obtain the desired
subset. The greedy algorithm starts with an empty subset S’
and adds that one element of S to S’ which gives the best
split. The process is repeated until there is no improvement
in the splits. This hybrid approach finds the optimal subset, if
S is small. It also performs well for larger subsets.
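A condensed sketch of that hybrid strategy for one categorical attribute is shown below; it is illustrative only. The MAXSETSIZE threshold, the evaluation of a binary split, and the greedy growth loop follow the description above, while the Gini-based scoring, the data layout, and the example rows (the 'income' column paired with the buys_computer label) are assumptions.

```python
from collections import Counter
from itertools import chain, combinations

def gini(counts):
    n = sum(counts.values())
    return 1 - sum((c / n) ** 2 for c in counts.values()) if n else 0.0

def split_gini(rows, subset):
    """Weighted Gini of the binary split: attribute value in `subset` vs. not in `subset`."""
    left = Counter(label for value, label in rows if value in subset)
    right = Counter(label for value, label in rows if value not in subset)
    n = len(rows)
    return sum(left.values()) / n * gini(left) + sum(right.values()) / n * gini(right)

def best_subset(rows, maxsetsize=10):
    values = {value for value, _ in rows}
    if len(values) <= maxsetsize:
        # Small domain: evaluate every non-empty proper subset (exhaustive search).
        candidates = chain.from_iterable(combinations(values, k) for k in range(1, len(values)))
        return min((set(c) for c in candidates), key=lambda s: split_gini(rows, s))
    # Large domain: grow the subset S' greedily, one value at a time.
    subset, best_score = set(), float("inf")
    while len(subset) < len(values):
        score, value = min((split_gini(rows, subset | {v}), v) for v in values - subset)
        if score >= best_score:
            break                              # no improvement in the split: stop
        subset.add(value)
        best_score = score
    return subset

# Example: the 'income' values paired with the buys_computer label from the 14-row table
rows = [("high", "no"), ("high", "no"), ("high", "yes"), ("medium", "yes"),
        ("low", "yes"), ("low", "no"), ("low", "yes"), ("medium", "no"),
        ("low", "yes"), ("medium", "yes"), ("medium", "yes"), ("medium", "yes"),
        ("high", "yes"), ("medium", "no")]
print(best_subset(rows))                       # subset giving the lowest weighted Gini
```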
Module – 3.2
Classification Accuracy-Precision, Recall.
1.1 Classification- Introduction,
1.2 Decision tree construction
principle,
1.3 Splitting indices -
Information Gain, Gini
index,
1.4 Decision tree construction
algorithms-ID3,
1.5 Decision tree construction
with presorting-SLIQ
2 Classification Accuracy-
Precision, Recall.

1
3.2.1 Exercise
Suppose the dataset has 9700 cancer-free images out of 10000
images from patients. A clinic conducts cancer tests.
The test finds 230 positive. However, 140 are wrongly
categorized as positive. Find the precision, recall and accuracy.
Is it a good classifier? Justify.

2
3.2.1 Exercise
Actual

+ 300

- 9700

10000

The dataset has 9700 cancer-free images from 10000


images from patients.

3
3.2.1 Exercise
+ -

Predicted 230 9770 10000

There are 10000 images. A clinic conducts cancer tests.


The test finds 230 positive.

4
3.2.1 Exercise
+ - Actual

+ 300

- 9700

Predicted 230 9770 10000

The dataset has 9700 cancer-free images from 10000


images from patients.
A clinic conducts cancer tests. The test finds 230
positive.

5
3.2.1 Exercise
+ - Actual

+ 300

- 140 (FP) 9700

Predicted 230 9770 10000

A clinic conducts cancer tests. The test finds 230


positive. However, 140 are wrongly categorized as
positive – False Positive

6
3.2.1 Exercise
+ - Actual

+ 90 (TP) 300

- 140 (FP) 9700

Predicted 230 9770 10000

The test finds 230 positive. However, 140 are wrongly


categorized as positive – False Positive.
Therefore, 90 are True positive

7
3.2.1 Exercise
+ - Actual

+ 90 (TP) 210 (FN) 300

- 140 (FP) 9700

Predicted 230 9770 10000

The test finds 230 positive. However, 140 are wrongly


categorized as positive – False Positive.
Therefore, 90 are True Positive.
210 are False Negative

8
3.2.1 Exercise
+ - Actual

+ 90 (TP) 210 (FN) 300

- 140 (FP) 9560 (TN) 9700

Predicted 230 9770 10000

Total images 10000; Cancer-free images 9700.


A clinic conducts cancer tests. The test finds 230
positive. But 140 are wrongly categorized as positive.

9
3.2.1 Exercise
Confusion matrix (rows: actual class, columns: predicted class):

                    Predicted +   Predicted -
Actual +            TP            FN
Actual -            FP            TN

Recall = TP / (TP+FN)
Precision = TP / (TP+FP)
Accuracy = (TP+TN) / (TP + FP + FN + TN)

10
3.2.1 Exercise
                    Predicted +   Predicted -   Total
Actual +            90 (TP)       210 (FN)      300
Actual -            140 (FP)      9560 (TN)     9700
Total               230           9770          10000

Precision = TP / (TP+FP) = 90 / (90+140) = 39%
Recall = TP / (TP+FN) = 90 / (90+210) = 30%
Accuracy = (TP+TN) / (TP + FP + FN + TN) = (90 + 9560) / 10000 = 96.5%
11
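The same arithmetic, expressed as a small helper (a minimal sketch):

```python
def metrics(tp, fp, fn, tn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy

p, r, a = metrics(tp=90, fp=140, fn=210, tn=9560)
print(f"precision={p:.1%}  recall={r:.1%}  accuracy={a:.1%}")   # ~39.1%, 30.0%, 96.5%
```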
Precision, Recall, Accuracy – A Comparison
◼ Accuracy is best when both false positives and false negatives
need balance (e.g., surveillance studies like tracking covid cases
globally).
◼ Precision matters when false positives are risky (e.g., immunity
tests).
◼ Recall matters when missing real cases (or false negatives) is
dangerous (e.g., the presence of cancer, airport screening).

12
Classifier Performance:
Precision and Recall, and F-measures

Accuracy = {TP + TN} / {TP + FP + FN + TN}

13
3.2.2 Exercise
◼ A binary classification result, where the correct labels are [T, T,
F, T, F, T, F, T] and the predicted labels are [T, F, T, T, F, F, F,
T]. Assume T means “true” (the desired class) and F (“false”) is
the “default” class. Compute the recall, precision, and accuracy.

• True labels:

• [ T, T, F, T, F, T, F, T]

• Prediction:

• [ T, F, T, T, F, F, F, T]

14
3.2.2 Exercise (1. Actual, 2. Prediction)
[ T, T, F, T, F, T, F, T]
True Positives (3)
[ T, F, T, T, F, F, F, T]

[ T, T, F, T, F, T, F, T]
False Positives (1)
[ T, F, T, T, F, F, F, T]

[ T, T, F, T, F, T, F, T]
True Negatives (2)
[ T, F, T, T, F, F, F, T]

[ T, T, F, T, F, T, F, T]
False Negatives (2)
[ T, F, T, T, F, F, F, T]

15
3.2.2 Exercise
◼ Prediction : [ T, F, T, T, F, F, F, T]
◼ True labels: [ T, T, F, T, F, T, F, T]

◼ TP = 3, FP = 1, FN = 2, TN = 2

                    Predicted +   Predicted -
Actual +            3 (TP)        2 (FN)
Actual -            1 (FP)        2 (TN)

◼ P = Precision = TP/(TP+FP) = 3 / (3+1) = 3/4
◼ R = Recall = TP/(TP+FN) = 3 / (3+2) = 3/5
◼ Accuracy = (TP+TN) / (TP + FP + FN + TN) = (3+2) / 8 = 5/8

16
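When the input is a pair of label lists, as here, the four counts can be tallied pairwise before applying the same formulas (a minimal sketch):

```python
actual    = ["T", "T", "F", "T", "F", "T", "F", "T"]
predicted = ["T", "F", "T", "T", "F", "F", "F", "T"]

pairs = list(zip(actual, predicted))
tp = sum(a == "T" and p == "T" for a, p in pairs)   # 3
fp = sum(a == "F" and p == "T" for a, p in pairs)   # 1
fn = sum(a == "T" and p == "F" for a, p in pairs)   # 2
tn = sum(a == "F" and p == "F" for a, p in pairs)   # 2

print(tp / (tp + fp), tp / (tp + fn), (tp + tn) / len(pairs))   # 0.75 (3/4), 0.6 (3/5), 0.625 (5/8)
```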
3.2.3 Exercise
◼ Suppose a computer program for recognizing dogs in
photographs identifies eight dogs in a picture containing 12
dogs and some cats. Of the eight dogs identified, five actually
are dogs, while the rest are cats. Compute the recall, precision,
F1 score, and accuracy.

17
3.2.3 Exercise
Actual

Dogs 12

Cats ?

a picture containing 12 dogs and some cats

18
3.2.3 Exercise
Dogs Cats

Predicted 8

a computer program identifies 8 dogs.


Of the 8 dogs identified, 5 actually are dogs, while the
rest are cats.

19
3.2.3 Exercise
Dogs Cats Actual

Dogs 5 12

Cats 3 ?

Predicted 8

A picture containing 12 dogs and some cats.


A computer identifies 8 dogs. Of these, 5 are actually
dogs

20
3.2.3 Exercise
Dogs Cats Actual

Dogs 5 7 12

Cats 3 ?

Predicted 8

A picture containing 12 dogs and some cats.


Computer identifies 8 dogs. Of these, 5 are actually dogs.
Therefore, there are 7 cats

21
3.2.3 Exercise
Dogs Cats Actual

Dogs 5 (TP) 7 (FN) 12

Cats 3 (FP) ?

Predicted 8

P = Precision = TP/(TP+FP) = 5 / (5+3) = 5 / 8
R = Recall = TP/(TP+FN) = 5 / (5+7) = 5 / 12
F Score = 2.P.R / (P+R) = 0.5

22
3.2.3 Exercise
Dogs Cats Actual

Dogs 5 (TP) 7 (FN) 12

Cats 3 (FP) TN? ?

Predicted 8

Accuracy = (TP+TN) / (TP + FP + FN + TN);


The number of cats is not given. So, let us consider
only the accuracy of predicting dogs alone.
Accuracy = TP / (TP + FN) = 5 / 12

23
June 2023
A database contains 80 records on a particular topic of which 55
are relevant to a certain investigation.
A search was conducted on that topic and 50 records were
retrieved.
Of the 50 records retrieved, 40 were relevant.
Construct the confusion matrix and calculate the precision and
recall scores for the search.

24
3.2.3 Exercise
Actual

Relevant 55
Not
Relevant

80

A database contains 80 records on a particular topic of


which 55 are relevant to a certain investigation

25
3.2.3 Exercise
Not
Relevant
Relevant
40

Predicted 50

A search was conducted on that topic


50 records were retrieved.
Of the records retrieved, 40 were relevant

26
3.2.3 Exercise
Not
Relevant Actual
Relevant
Relevant 40 55
Not
Relevant

Predicted 50 80

A database contains 80 records on a particular topic of


which 55 are relevant to a certain investigation.
A search was conducted on that topic. 50 records were
retrieved. Of the records retrieved, 40 were relevant

27
3.2.3 Exercise
Not
Relevant Actual
Relevant
Relevant 40 15 55
Not
10
Relevant

Predicted 50 80

28
3.2.3 Exercise

                     Predicted Relevant   Predicted Not Relevant   Total
Actual Relevant      40 (TP)              15 (FN)                  55
Actual Not Relevant  10 (FP)              TN?
Total                50                                            80

P = Precision = 40 / (40 + 10) = 40 / 50
R = Recall = 40 / (40 + 15) = 40 / 55
F Score = 2.P.R / (P+R)

29
3.2.3 Exercise

                     Predicted Relevant   Predicted Not Relevant   Total
Actual Relevant      40 (TP)              15 (FN)                  55
Actual Not Relevant  10 (FP)              15 (TN)                  25
Total                50                   30                       80

Accuracy = (TP+TN) / (TP + FP + FN + TN)
Here 80 − 55 = 25 records are not relevant and 10 of them were retrieved, so TN = 25 − 10 = 15.
Accuracy = (40 + 15) / 80 = 55 / 80 ≈ 0.69
(Restricted to the relevant documents alone, TP / (TP + FN) = 40 / 55, which equals the recall.)

30
May 2024
Draw the confusion matrix and calculate precision and recall of the given
data. (3)
Data Target Prediction
1 cat cat
2 cat dog
3 dog dog
4 dog dog
5 dog cat

31
