Mod 3 part1_merged
Module - 3 (Advanced classification and Cluster analysis)
Module – 3.1 (Advanced classification)
1
Module - 3 (Advanced classification and Cluster analysis)
1.1 Classification - Introduction
1.2 Decision tree construction principle
1.3 Splitting indices - Information Gain, Gini index
1.4 Decision tree construction algorithms - ID3
1.5 Decision tree construction with presorting - SLIQ
2 Classification Accuracy - Precision, Recall
3.1 Introduction to clustering - Clustering Paradigms
3.2 Partitioning Algorithm - PAM
3.3 Hierarchical Clustering - DBSCAN
3.4 Categorical Clustering - ROCK
Note: Data Analytics Coverage
1. Classification: Naïve Bayes, KNN
2. Clustering: Hierarchical, Partitioning
2
Module – 3.1 Classification
1.1 Classification - Introduction
1.2 Decision tree construction principle
1.3 Splitting indices - Information Gain, Gini index
1.4 Decision tree construction algorithms - ID3
1.5 Decision tree construction with presorting - SLIQ
Note: Data Analytics Coverage
1. Classification: Naïve Bayes, KNN
3
3.1 Short Notes
3.1.1. Define the entropy of a dataset. Write a formula to compute the
entropy of a two-class dataset. (Slide: 17, 18)
3.1.2. How is Gain Ratio calculated? What is the advantage of Gain Ratio
over Information Gain? (3)
Define information gain and Gini index. (Slide 19, 29)
3.1.3. Explain the steps in creating a decision tree using the ID3 algorithm. (4)
Explain the ID3 algorithm for building decision trees. (Slide 12, 13, 14)
3.1.4. Discuss the issues in the implementation of a decision tree (3)
What are the challenges in building a decision tree? How are they
overcome?
3.1.5. Explain the construction of a decision tree using the SLIQ algorithm with an example. (8)
Explain the working of the SLIQ algorithm. (6)
4
Regression Vs Classification
5
Classification Methods
6
3.1.6.
Consider the dataset for a binary classification problem with class labels “yes” and “no”. The table shows a class-labeled dataset of customers in a bank. Explain the information gain attribute selection measure, and find the information gain of the attribute “age”.
7
Decision Tree Construction - Illustration:
(Note: At First Categorize the attribute “Age”)
age income student credit buys computer
<=30 high no fair no
<=30 high no poor no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes poor no
31…40 low yes poor yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes poor yes
31…40 medium no poor yes
31…40 high yes fair yes
>40 medium no poor no
8
Select the attribute ‘Age’ for the root node split.
Sort the data by age category and count the class labels:

age      Buys-Yes  Buys-No   buys (sorted rows)
<=30     2         3         no, no, no, yes, yes
31..40   4         0         yes, yes, yes, yes
>40      3         2         no, yes, no, yes, yes

Root split on ‘age’: <=30 (2:3), 31..40 (4:0) -> leaf ‘yes’, >40 (3:2)
9
For Age <= 30, select attribute ‘Student’ for the node split.

age?  ->  <=30 (2:3), 31..40 (4:0) -> leaf ‘yes’, >40 (3:2)
Under <=30: Student = yes (2:0) -> leaf ‘yes’; Student = no (0:3) -> leaf ‘no’

For Age > 40, select attribute ‘Credit Rating’ for the split.

age?  ->  <=30 (2:3), 31..40 (4:0), >40 (3:2)
13
Algorithm for Decision Tree Construction …
◼ At the start, all the training samples are at the root
◼ Attributes are assumed to be categorical (if the attributes
are continuous-valued, discretize them in advance)
◼ Construct a decision tree by splitting a node and growing each branch. This is done in a top-down recursive manner until the conditions for termination of the algorithm are reached (PTO). The steps involved are:
◼ If the tuples are all from the same class, then the node
becomes a leaf, labeled with that class.
◼ Otherwise, partition the node using a selected attribute.
The attribute selection is done based on a heuristic or
statistical measure such as information gain, Gini index,
etc.
14
… Algorithm for Decision Tree Construction
◼ Conditions for termination of the algorithm
◼ All samples for a given node belong to the same class
◼ There are no more attributes for further partitioning. In this
case, majority voting is employed for classifying the leaf
◼ There are no samples left
◼ Other conditions set by the researcher. E.g., maximum tree
depth is attained
15
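The top-down recursion and termination conditions above can be sketched in Python. This is only a minimal illustration, not the course's reference implementation: the helper names (entropy, info_gain, majority_class, build_tree), the dict-per-record encoding, and the choice of information gain as the selection measure are assumptions of the sketch.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels: -sum(p_i * log2(p_i))."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Gain(D, A) = Info(D) - weighted sum of the entropies of the subsets."""
    total = len(labels)
    subsets = {}
    for row, y in zip(rows, labels):
        subsets.setdefault(row[attr], []).append(y)
    info_a = sum(len(ys) / total * entropy(ys) for ys in subsets.values())
    return entropy(labels) - info_a

def majority_class(labels):
    return Counter(labels).most_common(1)[0][0]

def build_tree(rows, labels, attrs):
    """Top-down recursive construction using the termination conditions above."""
    if len(set(labels)) == 1:       # all samples belong to one class -> leaf
        return labels[0]
    if not attrs:                   # no attributes left -> majority voting
        return majority_class(labels)
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    node = {best: {}}
    for value in {row[best] for row in rows}:
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        node[best][value] = build_tree([rows[i] for i in idx],
                                       [labels[i] for i in idx],
                                       [a for a in attrs if a != best])
    return node
```

Each record is assumed to be a dict mapping attribute names to categorical values; the returned tree is a nested dict with class labels at the leaves.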
The ID3 Algorithm for Decision Tree Construction
◼ ID3 (Iterative Dichotomiser 3) is an algorithm for decision tree construction.
◼ It follows the basic steps listed in the ‘Algorithm for Decision Tree Construction’. In addition, the following characteristics apply.
◼ The ID3 algorithm assumes that there are only two
class labels, namely, “+” and “−”.
◼ The attributes can be multi-valued
◼ The algorithm uses information gain to select the
attribute for node split.
16
Decision Tree Construction using ID3 Algorithm
Info(D, Age) = (5/14)·I(2,3) + (4/14)·I(4,0) + (5/14)·I(3,2) = 0.694
Gain(Age) = 0.246
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit) = 0.048
Root split on ‘Age’: Young (2:3), Middle (4:0), Senior (3:2)
Attribute Selection Measures
◼ Information Gain
◼ Gini Index
18
Entropy
◼ Entropy, in information theory, is a measure of uncertainty
associated with a random variable.
H(Y) = − Σ_{i=1..m} p_i · log2(p_i)
19
Entropy of a 2-class dataset
◼ Entropy(D) = I(p1, p2) = −p1·log2(p1) − p2·log2(p2)
where:
𝑝1 is the proportion of samples in D belonging to class 1
𝑝2 is the proportion of samples in D belonging to class 2
◼ If 𝑝1 or 𝑝2 equals zero, the entropy = 0. The dataset is pure
◼ For a 2-class dataset, if p1 = p2 = 0.5, entropy = 1. The dataset is perfectly impure (equal probability for each class).
◼ Consider a dataset with 9 objects labeled ‘Yes’ and 5 labeled ‘No’.
Entropy(D) = I(9,5) = −(9/14)·log2(9/14) − (5/14)·log2(5/14) = 0.940
20
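As a quick check of the formula and the worked value I(9,5) = 0.940, here is a small Python helper; the function name is mine.

```python
import math

def two_class_entropy(n1, n2):
    """Entropy(D) = -p1*log2(p1) - p2*log2(p2) for a 2-class dataset."""
    total = n1 + n2
    result = 0.0
    for n in (n1, n2):
        p = n / total
        if p > 0:                   # by convention 0 * log2(0) = 0, so a pure dataset gives 0
            result -= p * math.log2(p)
    return result

print(f"{two_class_entropy(9, 5):.3f}")   # 0.940, as on the slide
print(two_class_entropy(7, 7))            # 1.0  (perfectly impure)
print(two_class_entropy(14, 0))           # 0.0  (pure)
```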
Attribute Selection Measure:
Information Gain
◼ Let pi be the probability that an arbitrary tuple in D belongs
to class Ci. Therefore, pi = |Ci, D| / |D|
◼ Entropy of D = Info(D) = − Σ_{i=1..m} p_i · log2(p_i)
21
Decision Tree Construction - Illustration:
(Note: At First Categorize the attribute “Age”)
age income student credit buys computer
<=30 high no fair no
<=30 high no poor no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes poor no
31…40 low yes poor yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes poor yes
31…40 medium no poor yes
31…40 high yes fair yes
>40 medium no poor no
22
Information Gain Example
Use the example of a laptop purchase. Identify the best attribute based on the ‘information gain’.
In database D there are 14 samples: 9 labeled ‘yes’ and 5 labeled ‘no’ (buys computer).
Probabilities: p1 = 9/14 and p2 = 5/14
Info(D) = I(9,5) = −(9/14)·log2(9/14) − (5/14)·log2(5/14) = 0.940
23
Information Gain: Consider ‘Age’ for Root Node Split
Info(D) = 0.940
Attribute ‘age’ splits D into 3 subsets: Young (2 yes, 3 no), Middle (4 yes, 0 no), Senior (3 yes, 2 no).
Info(D, Age) = (5/14)·I(2,3) + (4/14)·I(4,0) + (5/14)·I(3,2)
             = (5/14)·0.97 + (4/14)·0 + (5/14)·0.97 = 0.694
Gain(D, Age) = Info(D) − Info(D, Age) = 0.940 − 0.694 = 0.246
24
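The gain values quoted on these slides can be reproduced with a short script over the 14-record table; the tuple encoding and helper names below are illustrative, not part of the slides.

```python
import math
from collections import Counter

# The 14-record buys_computer dataset from the illustration slide:
# (age, income, student, credit, buys_computer)
data = [
    ("<=30", "high", "no", "fair", "no"),     ("<=30", "high", "no", "poor", "no"),
    ("31..40", "high", "no", "fair", "yes"),  (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),     (">40", "low", "yes", "poor", "no"),
    ("31..40", "low", "yes", "poor", "yes"),  ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "poor", "yes"), ("31..40", "medium", "no", "poor", "yes"),
    ("31..40", "high", "yes", "fair", "yes"), (">40", "medium", "no", "poor", "no"),
]
attrs = {"age": 0, "income": 1, "student": 2, "credit": 3}
labels = [row[4] for row in data]

def entropy(ys):
    n = len(ys)
    return -sum(c / n * math.log2(c / n) for c in Counter(ys).values())

def gain(attr_idx):
    groups = {}
    for row in data:
        groups.setdefault(row[attr_idx], []).append(row[4])
    info_a = sum(len(ys) / len(data) * entropy(ys) for ys in groups.values())
    return entropy(labels) - info_a

for name, idx in attrs.items():
    print(f"Gain({name}) = {gain(idx):.3f}")
# Prints ≈ 0.247, 0.029, 0.152, 0.048; the slides quote 0.246 and 0.151
# because they round the intermediate entropies to three decimals.
```

Since ‘age’ has the largest gain, it is chosen for the root split, as on the slides.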
Info. Gain: Consider ‘Income’ for Root Node Split
Info(D) = 0.940
Attribute ‘income’ splits D into 3 subsets: high (2 yes, 2 no), medium (4 yes, 2 no), low (3 yes, 1 no).
Info(D, Income) = (4/14)·I(2,2) + (6/14)·I(4,2) + (4/14)·I(3,1)
               = (4/14)·1.0 + (6/14)·0.92 + (4/14)·0.81 = 0.911
Gain(D, Income) = Info(D) − Info(D, Income) = 0.940 − 0.911 = 0.029
25
Info. Gain: Consider ‘Student’ for Root Node Split
Info(D) = 0.940
Attribute ‘student’ splits D into 2 subsets: student = yes (6 yes, 1 no), student = no (3 yes, 4 no).
Info(D, Student) = (7/14)·I(6,1) + (7/14)·I(3,4)
                = (7/14)·0.59 + (7/14)·0.99 = 0.79
Gain(D, Student) = Info(D) − Info(D, Student) = 0.940 − 0.79 = 0.151
26
Information Gain example
Feature Selection for Root Node Split
27
Attribute Selection Measures
◼ Information Gain
◼ Gini Index
28
Gini Index of a Dataset
◼ If a data set D contains samples from n classes, the Gini index Gini(D) is defined as
  Gini(D) = 1 − Σ_{j=1..n} (p_j)^2
  where p_j is the relative frequency of class j in D.
◼ The computer purchase database D has customers who buy computers and those who do not. The probability of a customer buying a computer is p1 = 9/14.
◼ The probability of not buying a computer is p2 = 5/14.
  Gini(D) = 1 − ((9/14)^2 + (5/14)^2) = 0.46
29
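A one-line check of Gini(D) = 0.46 for the 9-yes/5-no dataset; the helper name is mine.

```python
def gini(counts):
    """Gini(D) = 1 - sum_j (p_j)^2 for the class counts of D."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(round(gini([9, 5]), 2))   # 0.46 for the buys_computer dataset (9 yes, 5 no)
```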
Attribute Selection Measure: Gini Index
◼ If attribute A splits the data set D into two subsets D1
and D2, the resulting Gini index is computed as:-
Gini(D, A) = (|D1|/|D|)·Gini(D1) + (|D2|/|D|)·Gini(D2)
where |D|, |D1| and |D2| are the sizes of the databases
D, D1, and D2 respectively
◼ Reduction in Impurity when ‘A’ splits D into D1 and D2
δ Gini(D, A) = Gini(D) - Gini(D, A)
◼ Compute the Gini index for splitting node ‘D’ using each
attribute. Choose the attribute that results in the
greatest impurity reduction.
30
‘Age’ for Root Node Split
Gini Impurity(D) = 1 − ((9/14)^2 + (5/14)^2) = 0.46
Attribute ‘Age’ splits D into 3 subsets: Young (2 yes, 3 no), Middle (4 yes, 0 no), Senior (3 yes, 2 no).
Gini Impurity(D, Age) = (5/14)·[1 − ((2/5)^2 + (3/5)^2)]
                      + (4/14)·[0]
                      + (5/14)·[1 − ((3/5)^2 + (2/5)^2)]
                      = 0.34
Reduction in Impurity = δ Gini(D, Age) = Gini(D) − Gini(D, Age) = 0.46 − 0.34 = 0.12
31
‘Income’ for Root Node Split
Gini Impurity(D) = Gini(D) = 1 − ((9/14)^2 + (5/14)^2) = 0.46
Attribute ‘income’ splits D into 3 subsets: high (2 yes, 2 no), medium (4 yes, 2 no), low (3 yes, 1 no).
Gini Impurity(D, Income) = (4/14)·[1 − ((2/4)^2 + (2/4)^2)]
                         + (6/14)·[1 − ((4/6)^2 + (2/6)^2)]
                         + (4/14)·[1 − ((3/4)^2 + (1/4)^2)]
                         = 0.44
Reduction in Impurity = δ Gini(D, Income) = Gini(D) − Gini(D, Income) = 0.46 − 0.44 = 0.02
32
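The two weighted-Gini computations above (0.34 for ‘Age’, 0.44 for ‘income’) and the corresponding impurity reductions can be verified with a small sketch; passing each subset as a (yes, no) count pair is an encoding chosen here for brevity.

```python
def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def gini_split(subsets):
    """Gini(D, A): weighted sum of the Gini index of each subset produced by A."""
    n = sum(sum(s) for s in subsets)
    return sum(sum(s) / n * gini(s) for s in subsets)

g_d = gini([9, 5])                                  # Gini(D) = 0.46

# 'Age' splits D into Young (2 yes, 3 no), Middle (4, 0), Senior (3, 2).
g_age = gini_split([[2, 3], [4, 0], [3, 2]])
print(round(g_age, 2), round(g_d - g_age, 2))       # 0.34, reduction 0.12

# 'Income' splits D into high (2, 2), medium (4, 2), low (3, 1).
g_income = gini_split([[2, 2], [4, 2], [3, 1]])
print(round(g_income, 2), round(g_d - g_income, 2)) # 0.44, reduction 0.02
```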
DT Using Gini Index
Gini Impurity(D, Age) = (5/14)·[1 − ((2/5)^2 + (3/5)^2)] + (4/14)·[0] + (5/14)·[1 − ((3/5)^2 + (2/5)^2)] = 0.34
Gini Impurity(D, Income) = 0.44
Root split on ‘Age’: Young (2:3), Middle (4:0), Senior (3:2)
34
Comparing Attribute Selection Measures
Information gain:
- used in the ID3 algorithm
- biased towards multivalued attributes (e.g., ‘income’ takes the values low, medium, high)
- handles multiple classes effectively
- requires computation of logarithms

Gini index:
- used in the CART algorithm
- also biased towards multivalued attributes; however, the bias is less compared to information gain
- has difficulty when the number of classes is large
- tends to favor tests that result in equal-sized partitions with high purity in both partitions
- computation is simpler
35
Challenges in Decision Tree Construction
1. Overfitting: The tree becomes too complex and captures
noise instead of patterns. Use pruning (removing
unnecessary branches), set minimum split criteria, or apply
ensemble methods like Random Forest (PTO)
2. Handling Missing Data: Missing values affect decision-
making at nodes. Use imputation techniques
3. Class Imbalance: When one class dominates, the tree
may favor it. Use over sampling (SMOTE), under sampling, or
adjust class weights during training.
4. Computational Complexity: Large datasets with many features slow down tree construction. Use feature selection, sampling, or scalable algorithms like RainForest.
5. Bias-Variance Tradeoff: Simple trees have high bias,
deep trees have high variance. Use cross-validation to find
the optimal tree depth.
36
Overfitting
◼ Overfitting:
◼ If there are too many branches, some branches
may reflect anomalies due to noise or outliers
◼ This results in poor accuracy for unseen samples
◼ Two approaches to avoid overfitting
◼ Pre-pruning: Halt tree construction early: do not split a node if this would result in the goodness measure falling below a threshold. Problem: it is difficult to choose an appropriate threshold.
◼ Post-pruning: Remove branches from a “fully grown”
tree. Reserve some data as test data to evaluate
and identify the “best pruned tree”
37
For the dataset given below, find the first splitting attribute for the
decision tree by using the ID3 algorithm (4)
38
3.1.7
Consider the following data for a binary
classification problem with class labels C1 and
C2.
i.Calculate the gain in Gini index when splitting
the root node using the attributes ‘A’ and ‘B’.
Which attribute would the decision tree
induction algorithm choose?
ii.Calculate the information gain when splitting
the root node using the attributes ‘A’ and ‘B’.
Which attribute would the decision tree
induction algorithm choose? (8)
[(i) Calculation of the gain in Gini index – 2 marks; attribute selection based on gain value comparison – 2 marks.
(ii) Calculation of the information gain – 2 marks; attribute selection based on information gain – 2 marks.]
39
Gini Index at Root Node
Data (A, B, Class):
(T,F,−), (F,F,−), (F,F,−), (F,F,−), (T,T,−), (T,F,−), (T,F,+), (T,T,+), (T,T,+), (T,T,+)
Class counts: 6 ‘−’ and 4 ‘+’ out of 10 records.
Gini(D) = 1 − ((6/10)^2 + (4/10)^2) = 0.48
40
Gini Index, when we split D using attribute A
A = F (subset D1): 3 records (0 ‘+’, 3 ‘−’); Gini(D1) = 1 − ((0/3)^2 + (3/3)^2) = 0
A = T (subset D2): 7 records (4 ‘+’, 3 ‘−’); Gini(D2) = 1 − ((4/7)^2 + (3/7)^2) = 0.49
Gini(D, A) = (3/10)·Gini(D1) + (7/10)·Gini(D2) = 0 + 0.49·7/10 = 0.34
Reduction in Impurity = Gini(D) − Gini(D, A) = 0.48 − 0.34 = 0.14
41
Gini Index, when we split D using attribute B
B = F (subset D1): 6 records (1 ‘+’, 5 ‘−’); Gini(D1) = 1 − ((1/6)^2 + (5/6)^2) = 0.28
B = T (subset D2): 4 records (3 ‘+’, 1 ‘−’); Gini(D2) = 1 − ((3/4)^2 + (1/4)^2) = 0.38
Gini(D, B) = 0.28·|D1|/|D| + 0.38·|D2|/|D| = 0.28·6/10 + 0.38·4/10 = 0.32
Reduction in Impurity = Gini(D) − Gini(D, B) = 0.48 − 0.32 = 0.16
42
Select Attribute
▪ The reduction in impurity from splitting the root node on attribute ‘B’ (0.16) is higher than that from attribute ‘A’ (0.14).
▪ Therefore, select attribute ‘B’ for the root node split.
43
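A short script that reproduces the Gini reductions above (0.14 for ‘A’, 0.16 for ‘B’) and also computes the information gains asked in part (ii) of question 3.1.7. The record list is transcribed from the slide; the helper names are assumptions.

```python
import math

# The 10 records from the exercise: (A, B, class)
records = [("T", "F", "-"), ("F", "F", "-"), ("F", "F", "-"), ("F", "F", "-"),
           ("T", "T", "-"), ("T", "F", "-"), ("T", "F", "+"), ("T", "T", "+"),
           ("T", "T", "+"), ("T", "T", "+")]
labels = [r[2] for r in records]

def gini(ys):
    n = len(ys)
    return 1 - sum((ys.count(c) / n) ** 2 for c in set(ys))

def entropy(ys):
    n = len(ys)
    return -sum(ys.count(c) / n * math.log2(ys.count(c) / n) for c in set(ys))

def split(attr_idx):
    groups = {}
    for r in records:
        groups.setdefault(r[attr_idx], []).append(r[2])
    return list(groups.values())

for name, idx in (("A", 0), ("B", 1)):
    parts = split(idx)
    w_gini = sum(len(p) / len(labels) * gini(p) for p in parts)
    w_ent = sum(len(p) / len(labels) * entropy(p) for p in parts)
    print(name,
          "Gini reduction:", round(gini(labels) - w_gini, 2),
          "Information gain:", round(entropy(labels) - w_ent, 3))
# Gini reductions: A -> 0.14, B -> 0.16, matching the slides.
```

With these counts, the Gini criterion prefers ‘B’ while information gain prefers ‘A’, which is the contrast part (ii) of the question is probing.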
3.1.8.
Find the first splitting attribute for the decision tree by using the
ID3 algorithm with the following dataset. (8)
[Finding the first splitting attribute – 4 marks
Calculating the gain values of attributes – 4 marks]
44
3.1.9
◼ See the table below. The goal is to build a decision tree using the ID3
algorithm to predict whether a person buys a computer game based
on the given features. The target variable is "Buy Computer Game,"
which can be either "Yes" or "No." The features are "Genre" (with
values: Action, Puzzle, Adventure) and "Price Range" (with values:
Low, Medium, High). Which node will be selected at the root?
Data Point  Genre  Price Range  Buy Game
1 Action Low Yes
2 Puzzle Medium No
3 Adventure High Yes
4 Action Medium No
5 Puzzle Low Yes
6 Adventure Medium Yes
7 Action High No
8 Puzzle Medium No
9 Adventure Low Yes
10 Action Low Yes
45
The target variable is "Buy Computer Game," which can be either
"Yes" or "No." Which node will be selected at the root?
46
Root Node Split by Price
47
Root Node Split by Price
Sorted by Price Range (Buy Game):
Low: Yes, Yes, Yes, Yes → I(4:0)
Medium: Yes, No, No, No → I(1:3)
High: Yes, No → I(1:1)
• I(4,0) = 0
• I(1,3) = −(1/4)·log2(1/4) − (3/4)·log2(3/4) = 0.811
• I(1,1) = 1
• Info(D, Price) = weighted sum of the entropy of the three data subsets
  = (4/10)·I(4,0) + (4/10)·I(1,3) + (2/10)·I(1,1)
  = 0 + (4/10)·0.811 + (2/10)·1 = 0.52
48
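The slide computes Info(D, Price) = 0.52. The sketch below repeats that calculation and the matching one for ‘Genre’ so the two gains can be compared; the list encoding and helper names are assumptions.

```python
import math
from collections import Counter

# (Genre, Price Range, Buy Game) for the 10 data points in the table.
games = [("Action", "Low", "Yes"), ("Puzzle", "Medium", "No"), ("Adventure", "High", "Yes"),
         ("Action", "Medium", "No"), ("Puzzle", "Low", "Yes"), ("Adventure", "Medium", "Yes"),
         ("Action", "High", "No"), ("Puzzle", "Medium", "No"), ("Adventure", "Low", "Yes"),
         ("Action", "Low", "Yes")]
labels = [g[2] for g in games]

def entropy(ys):
    n = len(ys)
    return -sum(c / n * math.log2(c / n) for c in Counter(ys).values())

def info_after_split(attr_idx):
    groups = {}
    for g in games:
        groups.setdefault(g[attr_idx], []).append(g[2])
    return sum(len(ys) / len(games) * entropy(ys) for ys in groups.values())

for name, idx in (("Genre", 0), ("Price Range", 1)):
    print(f"Gain({name}) = {entropy(labels) - info_after_split(idx):.3f}")
# Info(D, Price) comes out as 0.52, as on the slide.
```

With these counts, ‘Price Range’ yields the larger gain, which is consistent with the slides splitting the root node by Price.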
Sort by Genre; Sort by Price Range
51
52
3.1.12
53
54
Module – 3.1 Classification
55
Decision tree construction with presorting-SLIQ
Data list (Index, Age, Salary, Class):
1: 30, 65, G
2: 23, 15, B
3: 40, 75, G
4: 55, 40, B
5: 55, 100, G
6: 45, 60, G

Age list, presorted (Age, Index): (23, 2), (30, 1), (40, 3), (45, 6), (55, 5), (55, 4)
Salary list, presorted (Salary, Index): (15, 2), (40, 4), (60, 6), (65, 1), (75, 3), (100, 5)
Class list (Index, Class, Leaf): (1, G, N1), (2, B, N1), (3, G, N1), (4, B, N1), (5, G, N1), (6, G, N1)
Example continued …
• Assume that the root node is split using (Age <= 35).
• The left child (records with Age 23 and 30) and the right child (records with Age 40, 45, 55, 55) are then split using the salary attribute: (Salary <= 40) for the left child and (Salary <= 50) for the right child.
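A minimal sketch of how SLIQ's presorted attribute lists and memory-resident class list could be represented for the six records above; the dict/tuple representation and variable names are illustrative, not the data structures from the SLIQ paper.

```python
# Six training records from the slide: (index, age, salary, class)
data = [(1, 30, 65, "G"), (2, 23, 15, "B"), (3, 40, 75, "G"),
        (4, 55, 40, "B"), (5, 55, 100, "G"), (6, 45, 60, "G")]

# Class list: one entry per record, holding its class label and current leaf node.
class_list = {idx: {"class": cls, "leaf": "N1"} for idx, _, _, cls in data}

# One attribute list per numeric attribute, presorted by attribute value.
# Each entry keeps (attribute value, record index) so the class list can be consulted.
age_list    = sorted((age, idx) for idx, age, _, _ in data)
salary_list = sorted((sal, idx) for idx, _, sal, _ in data)

print(age_list)     # [(23, 2), (30, 1), (40, 3), (45, 6), (55, 4), (55, 5)]
print(salary_list)  # [(15, 2), (40, 4), (60, 6), (65, 1), (75, 3), (100, 5)]
# (Ties at age 55 may appear in either order; the slide lists (55, 5) before (55, 4).)
```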
SLIQ -
More Details and Illustrations
66
MDL
• The pruning strategy used in SLIQ is based on the principle of
Minimum Description Length (MDL)
• The MDL principle states that the best model for encoding data is
the one that minimizes the sum of the cost of describing the
data in terms of the model and the cost of describing the
model.
• If M is a model that encodes the data D, the total cost of the
encoding is:
cost(M, D) = cost(D|M) + cost(M),
where cost(D|M) is the cost, in bits, of encoding the data given a model M, and cost(M) is the cost of encoding the model M.
• The models are the set of trees obtained by pruning the initial
decision tree T, and the data is the training data set S. The
objective of MDL pruning is to find the subtree of T that best
describes the training set S.
Subsetting for Categorical Attributes
• Let S be the set of possible values of a categorical attribute A
• The evaluation of all the subsets of S can be very expensive,
if the cardinality of S is large.
• SLIQ uses a hybrid approach to overcome this issue. If the
cardinality of S is less than a threshold, MAXSETSIZE, then all
of the subsets of S are evaluated
• Otherwise, a greedy algorithm is used to obtain the desired
subset. The greedy algorithm starts with an empty subset S’
and adds that one element of S to S’ which gives the best
split. The process is repeated until there is no improvement
in the splits. This hybrid approach finds the optimal subset, if
S is small. It also performs well for larger subsets.
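The hybrid subset search described above can be sketched as follows; `score` stands for whatever goodness measure the caller supplies (for example, the reduction in the Gini index), and the function name and MAXSETSIZE default are assumptions of the sketch.

```python
from itertools import combinations

def best_subset(values, score, maxsetsize=10):
    """Return a subset S' of `values` that maximizes score(S')."""
    values = list(values)
    if len(values) < 2:                      # nothing to split on
        return set(values)
    if len(values) < maxsetsize:             # small cardinality: evaluate all subsets
        candidates = [set(c) for r in range(1, len(values))
                      for c in combinations(values, r)]
        return max(candidates, key=score)
    best, best_score = set(), float("-inf")  # large cardinality: grow S' greedily
    while True:
        gains = [(score(best | {v}), v) for v in values if v not in best]
        if not gains:
            break
        s, v = max(gains, key=lambda t: t[0])
        if s <= best_score:                  # stop when there is no improvement
            break
        best, best_score = best | {v}, s
    return best
```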
Module – 3.2
Classification Accuracy-Precision, Recall.
1.1 Classification - Introduction
1.2 Decision tree construction principle
1.3 Splitting indices - Information Gain, Gini index
1.4 Decision tree construction algorithms - ID3
1.5 Decision tree construction with presorting - SLIQ
2 Classification Accuracy - Precision, Recall
1
3.2.1 Exercise
Suppose that, of 10000 images from patients, 9700 are cancer-free. A clinic conducts cancer tests on these images. The test finds 230 positive; however, 140 of these are wrongly categorized as positive. Find the precision, recall and accuracy. Is it a good classifier? Justify.
2
3.2.1 Exercise
Actual class counts: positive = 300, negative = 9700, total = 10000.
Build the confusion matrix step by step: the test reports 230 positives, of which 140 are false positives, so TP = 90.
3.2.1 Exercise
Confusion matrix layout (rows = actual class, columns = predicted class):
                 Predicted +    Predicted −
Actual +         TP             FN             Recall = TP / (TP + FN)
Actual −         FP             TN
Precision = TP / (TP + FP)
Accuracy = (TP + TN) / (TP + FP + FN + TN)
10
3.2.1 Exercise
                 Predicted +    Predicted −    Total
Actual +         90 (TP)        210 (FN)       300
Actual −         140 (FP)       9560 (TN)      9700
Total                                          10000
Recall = 90 / (90 + 210) = 30%
Precision = 90 / (90 + 140) = 39%
12
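The confusion-matrix formulas above reduce to a few lines of Python; the function name is mine, and the counts are the ones worked out on the slide.

```python
def precision_recall_accuracy(tp, fp, fn, tn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy

# Cancer-test exercise: 300 actual positives; 230 predicted positive, 140 of them wrong.
p, r, a = precision_recall_accuracy(tp=90, fp=140, fn=210, tn=9560)
print(f"precision = {p:.2f}, recall = {r:.2f}, accuracy = {a:.3f}")
# precision = 0.39, recall = 0.30; accuracy is high only because the
# negative class dominates, which is why accuracy alone is misleading here.
```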
Classifier Performance:
Precision and Recall, and F-measures
13
3.2.2 Exercise
◼ A binary classification result, where the correct labels are [T, T, F, T, F, T, F, T] and the predicted labels are [T, F, T, T, F, F, F, T]. Assume T means “true” (the desired class) and F (“false”) is the “default” class. Compute recall, precision, and accuracy.
• True labels: [T, T, F, T, F, T, F, T]
• Prediction: [T, F, T, T, F, F, F, T]
14
3.2.2 Exercise (1. Actual, 2. Prediction)
Actual:     [T, T, F, T, F, T, F, T]
Prediction: [T, F, T, T, F, F, F, T]
True Positives (3): positions 1, 4, 8
False Positives (1): position 3
True Negatives (2): positions 5, 7
False Negatives (2): positions 2, 6
15
3.2.2 Exercise
◼ True labels: [T, T, F, T, F, T, F, T]
◼ Prediction:  [T, F, T, T, F, F, F, T]
◼ TP = 3, FP = 1, FN = 2, TN = 2
                 Predicted +    Predicted −
Actual +         3 (TP)         2 (FN)
Actual −         1 (FP)         2 (TN)
Recall = 3 / (3 + 2)
Precision = 3 / (3 + 1)
16
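The counts can also be derived directly from the two label lists; this is a small check of the slide's tally, with variable names chosen for the sketch.

```python
actual    = ["T", "T", "F", "T", "F", "T", "F", "T"]
predicted = ["T", "F", "T", "T", "F", "F", "F", "T"]

tp = sum(a == "T" and p == "T" for a, p in zip(actual, predicted))
fp = sum(a == "F" and p == "T" for a, p in zip(actual, predicted))
fn = sum(a == "T" and p == "F" for a, p in zip(actual, predicted))
tn = sum(a == "F" and p == "F" for a, p in zip(actual, predicted))

print(tp, fp, fn, tn)                          # 3 1 2 2, as on the slide
print("recall    =", tp / (tp + fn))           # 0.6
print("precision =", tp / (tp + fp))           # 0.75
print("accuracy  =", (tp + tn) / len(actual))  # 0.625
```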
3.2.3 Exercise
◼ Suppose a computer program for recognizing dogs in photographs identifies eight dogs in a picture containing 12 dogs and some cats. Of the eight identified as dogs, five actually are dogs, while the rest are cats. Compute the accuracy, recall, precision, and F1 score.
17
3.2.3 Exercise
                 Predicted Dogs   Predicted Cats   Actual total
Actual Dogs      5 (TP)           7 (FN)           12
Actual Cats      3 (FP)           TN ?             ? (number of cats not given)
Predicted total  8
23
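With the counts in the matrix above, precision, recall, and F1 follow directly; the variable names are mine.

```python
tp, fp = 5, 3          # of the 8 identified as dogs, 5 are dogs and 3 are cats
fn = 12 - tp           # dogs in the picture that were missed

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(f"precision = {precision:.3f}, recall = {recall:.3f}, F1 = {f1:.3f}")
# precision = 0.625, recall = 0.417, F1 = 0.500; accuracy cannot be computed
# because the number of cats (and hence the true negatives) is not given.
```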
June 2023
A database contains 80 records on a particular topic of which 55
are relevant to a certain investigation.
A search was conducted on that topic and 50 records were
retrieved.
Of the 50 records retrieved, 40 were relevant.
Construct the confusion matrix and calculate the precision and
recall scores for the search.
24
June 2023 Exercise
Actual: 55 relevant, 25 not relevant, 80 records in total. Retrieved (predicted relevant): 50, of which 40 are relevant.
                     Predicted Relevant   Predicted Not Relevant   Actual total
Actual Relevant      40 (TP)              15 (FN)                  55
Actual Not Relevant  10 (FP)              TN = 25 − 10 = 15        25
Predicted total      50                   30                       80
Precision = 40 / 50 = 0.80
Recall = 40 / 55 = 0.73
30
May 2024
Draw the confusion matrix and calculate precision and recall of the given
data. (3)
Data Target Prediction
1 cat cat
2 cat dog
3 dog dog
4 dog dog
5 dog cat
31