CSE4261 Lecture-10

Advanced Analytical Theory and Methods: Classification

Prof. Dr. Shamim Akhter
Professor, Dept. of CSE
Ahsanullah University of Science and Technology
Classification
• In classification learning, a classifier is presented with a set of already classified examples; from these examples, it learns to assign class labels to new, unseen observations.
• Logistic regression is one of the popular classification methods.
Tree-Based Learning
• Segment the predictor space into several simple regions by applying a set of splitting rules.
• Rather than learning a single rule over all training observations, formulate the classification as tree-based learning.

Decision Tree Methods


A decision tree (also called a prediction tree) uses a tree structure to specify sequences of decisions and consequences.
Given input X = {x1, x2, …, xn}, the goal is to predict a response or output variable Y. Each member of the set {x1, x2, …, xn} is called an input variable.
Decision Tree
• Decision trees have two varieties: classification trees and regression trees.
  – Classification trees usually apply to output variables that are categorical, often binary in nature, such as yes or no, purchase or not purchase, and so on.
  – Regression trees, on the other hand, apply to output variables that are numeric or continuous, such as the predicted price of a consumer good or the likelihood that a subscription will be purchased.
Solution: Regression Trees (worked example slides; figures not reproduced here)

Classification error at a node: E = 1 - P(m == k)
General Algorithm
Example
There are 2^6 x 3^2 x 4^2 = 9,216 possible combinations of values for the input attributes. But we are given the correct output for only 12 of them; each of the other 9,204 could be either true or false; we don't know.

Objective: find the smallest tree that is consistent with the given data set.

Unfortunately, it is intractable to find a guaranteed smallest consistent tree. But by applying some simple heuristics, we can find one that is close to the smallest.

The LEARN-DECISION-TREE algorithm adopts a greedy divide-and-conquer strategy: always test the most important attribute first, then recursively solve the smaller subproblems that are defined by the possible results of the test. By "most important attribute," we mean the one that makes the most difference to the classification of an example.

• The decision tree learning algorithm chooses the attribute with the highest IMPORTANCE.
• How do we measure importance?
• Using the notion of information gain, which is defined in terms of entropy, the fundamental quantity in information theory. (A sketch of this greedy procedure follows below.)
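A minimal Python sketch of this greedy procedure (not the slides' own pseudocode; `rows` is assumed to be a list of attribute-to-value dicts and `labels` the matching class labels):

```python
import math
from collections import Counter

def entropy(labels):
    """H = -sum p*log2(p) over the class proportions in `labels`."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy of the parent minus the weighted entropy of the children."""
    n = len(labels)
    remainder = 0.0
    for value in {row[attr] for row in rows}:
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

def learn_decision_tree(rows, labels, attrs):
    """Greedy divide-and-conquer: test the most important attribute first,
    then recurse on the subsets defined by its values."""
    if len(set(labels)) == 1:                 # pure node -> leaf
        return labels[0]
    if not attrs:                             # no attributes left -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: information_gain(rows, labels, a))
    tree = {best: {}}
    for value in {row[best] for row in rows}:
        keep = [i for i, row in enumerate(rows) if row[best] == value]
        tree[best][value] = learn_decision_tree(
            [rows[i] for i in keep], [labels[i] for i in keep],
            [a for a in attrs if a != best])
    return tree
```

On the Play Tennis data used later in this lecture, this procedure picks Outlook at the root, matching the worked example.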
Entropy
Entropy is a measure of the uncertainty of a random variable; the more information we acquire, the less entropy remains.

To know the outcome of an event, we need to decrease the uncertainty to 0 (i.e., increase certainty to 1). If the probability of some event A is p, knowing its outcome means decreasing the uncertainty by a factor of 1/p. So we need lg(1/p) bits to know about the event, which is equal to -lg(p).

(Figure: entropy of a Bernoulli distribution.)
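As a quick illustration (a sketch, not from the slides), the entropy of a Bernoulli variable with success probability p:

```python
import math

def bernoulli_entropy(p):
    """H(p) = p*lg(1/p) + (1-p)*lg(1/(1-p)); a zero-probability term contributes 0."""
    return sum(q * math.log2(1 / q) for q in (p, 1 - p) if q > 0)

print(bernoulli_entropy(0.5))   # 1.0 bit: a fair coin is maximally uncertain
print(bernoulli_entropy(0.99))  # ~0.08 bits: an almost-certain event carries little uncertainty
```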
Cross-Entropy
• The average number of bits needed to know about
the event is different from the average number of
bits used to transfer the information.
• Cross-entropy is the average number of bits used to
transfer the information. The cross-entropy is always
greater than or equal to the entropy.
Let's take a probability distribution with four possible outcomes, with probabilities of 0.5, 0.25, 0.125, and 0.125. If we use two bits to transfer each outcome, the cross-entropy becomes 2. What is the entropy in this case?

Entropy = 0.5*lg(2) + 0.25*lg(4) + 0.125*lg(8) + 0.125*lg(8)
        = 0.5 + 0.5 + 0.375 + 0.375 = 1.75

The entropy in this case is 1.75, but we used 2 bits to transfer the information, so the cross-entropy is 2. The difference between cross-entropy and entropy is called the KL divergence.
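A small sketch reproducing this four-outcome example, assuming a fixed 2-bit code per outcome:

```python
import math

p = [0.5, 0.25, 0.125, 0.125]   # true distribution of the four outcomes
code_lengths = [2, 2, 2, 2]     # bits actually used: a fixed 2-bit code per outcome

entropy = sum(pi * math.log2(1 / pi) for pi in p)                # 1.75 bits
cross_entropy = sum(pi * li for pi, li in zip(p, code_lengths))  # 2.0 bits
kl_divergence = cross_entropy - entropy                          # 0.25 bits

print(entropy, cross_entropy, kl_divergence)
```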
Attribute Selection Measure 1: Information Gain

• Test attributes are selected based on a heuristic or statistical measure (e.g., information gain).
• Information gain is the selection measure used in ID3.
Example: Play Tennis (Yes/No)
Selecting the Next Attribute (Sunny subset)

      Temp  Hum     Wind    Play
D1    Hot   High    Weak    No
D2    Hot   High    Strong  No
D8    Mild  High    Weak    No
D9    Cool  Normal  Weak    Yes
D11   Mild  Normal  Strong  Yes

H(Sunny) = -2/5*log2(2/5) - 3/5*log2(3/5) = -0.4*log2(0.4) - 0.6*log2(0.6) = 0.53 + 0.44 = 0.97
(change of base: log2(x) = log10(x) / log10(2))

Gain(Temperature) = 0.97 - 2/5*[-0.5*log2(0.5) - 0.5*log2(0.5)]   (the Hot and Cool subsets are pure, entropy 0)
                  = 0.97 - 0.4 = 0.57
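A short sketch (assuming Python) that reproduces H(Sunny) ≈ 0.97 and Gain(Temperature) ≈ 0.57 from the five rows above:

```python
import math
from collections import Counter

def entropy(labels):
    """-sum p*log2(p) over the class proportions in `labels`."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

# The five Sunny days from the table: (Temp, Humidity, Wind, Play)
sunny = [("Hot",  "High",   "Weak",   "No"),
         ("Hot",  "High",   "Strong", "No"),
         ("Mild", "High",   "Weak",   "No"),
         ("Cool", "Normal", "Weak",   "Yes"),
         ("Mild", "Normal", "Strong", "Yes")]

labels  = [row[3] for row in sunny]
h_sunny = entropy(labels)                      # ~0.97 bits

# Gain(Temperature) = H(Sunny) minus the weighted entropy of the Hot/Mild/Cool subsets
n, remainder = len(sunny), 0.0
for temp in {row[0] for row in sunny}:
    subset = [row[3] for row in sunny if row[0] == temp]
    remainder += len(subset) / n * entropy(subset)

gain_temperature = h_sunny - remainder         # ~0.57
print(round(h_sunny, 2), round(gain_temperature, 2))
```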
Attribute Selection Measure 2: Gain Ratio

The highest number of distinct values, student


ID

Super attributes: Such an attribute will split into as many partitions as the number of
values and each partition would be impure i.e. information gain would be highest and entropy
would be zero which is not good for training a machine learning model. It would lead to
overfitting the model.
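A minimal sketch of the C4.5-style gain ratio, which divides the information gain by the split information of the attribute; the ID-like attribute below is purely illustrative:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def gain_ratio(values, labels):
    """Information gain of the split divided by its split information."""
    n = len(labels)
    remainder, split_info = 0.0, 0.0
    for v, count in Counter(values).items():
        subset = [lab for val, lab in zip(values, labels) if val == v]
        remainder += count / n * entropy(subset)
        split_info += (count / n) * math.log2(n / count)
    gain = entropy(labels) - remainder
    return gain / split_info if split_info else 0.0

# An ID-like attribute has gain 1.0 here, but a large split information,
# so its gain ratio is heavily penalized (~0.39).
labels     = ["Yes", "No", "Yes", "No", "Yes", "No"]
student_id = ["S1", "S2", "S3", "S4", "S5", "S6"]   # one partition per row
print(gain_ratio(student_id, labels))
```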
Example: Gain Ratio
Attribute Selection Measure 3: Gini Index

• The Gini index measures node impurity in terms of the proportion of each class at the node.
Confusion Matrix-Binary Classification
Sensitivity: how many positive records are correctly predicted?
Specificity: how many negative records are correctly predicted?
Precision: how many of the records predicted positive are actually positive?
Recall: how many of the actual positive records are correctly predicted?

F1 score is a measure that combines recall and precision. As we have seen, there is a trade-off between precision and recall; F1 can therefore be used to measure how effectively our models make that trade-off. (A calculation sketch follows below.)
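A small sketch computing these binary metrics from the four confusion-matrix counts (the counts used in the example call are hypothetical):

```python
def binary_metrics(tp, fp, fn, tn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)          # recall / true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    precision   = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return dict(sensitivity=sensitivity, specificity=specificity,
                precision=precision, recall=sensitivity, f1=f1, accuracy=accuracy)

# Hypothetical counts, just to show the calculation
print(binary_metrics(tp=40, fp=10, fn=5, tn=45))
```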
Confusion Matrix-Multi-Class Classification
       Setosa         Versicolor       Virginica
TP     C1 = 16        C5 = 17          C9 = 11
FP     C4+C7 = 0      C2+C8 = 0        C3+C6 = 1
FN     C2+C3 = 0      C4+C6 = 1        C7+C8 = 0
TN     C5+C6+C8+C9    C1+C3+C7+C9      C1+C2+C4+C5
       = 29           = 27             = 33

• FN: the false-negative count for a class is the sum of the values in the corresponding row, excluding the TP value.
• FP: the false-positive count for a class is the sum of the values in the corresponding column, excluding the TP value.
• TN: the true-negative count for a class is the sum of the values of all remaining rows and columns, excluding those of the class we are calculating the values for.
• TP: the true-positive count is where the actual value and predicted value are the same (the diagonal cell). (See the sketch below.)
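A sketch that derives the per-class TP, FP, FN, and TN values from a 3x3 confusion matrix whose cells are consistent with the C1..C9 numbers above (assuming rows are actual classes and columns are predicted classes):

```python
import numpy as np

# Rows = actual class, columns = predicted class; the cell values are
# consistent with the C1..C9 numbers quoted on the slide.
cm = np.array([[16,  0,  0],     # Setosa
               [ 0, 17,  1],     # Versicolor
               [ 0,  0, 11]])    # Virginica
classes = ["Setosa", "Versicolor", "Virginica"]

for i, name in enumerate(classes):
    tp = cm[i, i]
    fn = cm[i, :].sum() - tp          # rest of the row
    fp = cm[:, i].sum() - tp          # rest of the column
    tn = cm.sum() - tp - fn - fp      # everything else
    print(name, dict(TP=int(tp), FP=int(fp), FN=int(fn), TN=int(tn)))
```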
Confusion Matrix-Multi-Class Classification

The x-axis shows the model's predicted values and the y-axis the actual values. A 10x12 (MxN) matrix structure was used to build this matrix. According to our model, some images have numerous objects detected while others have no objects detected at all. As a result, the matrix has two additional columns labeled "Multiple obj detection" and "No Object Detection".

In the figure, the number of no-object-detection images is zero, but four images are identified as multiple-object detections of a person, for Person2, Person3, Person6, and Person10.
Confusion Matrix-Multi-Class Classification
• Precision measures the model's ability to correctly identify instances of a particular class.
• Recall is the fraction of instances in a class that the model correctly classified, out of all instances in that class.
• Accuracy measures the proportion of correctly classified cases out of the total number of objects in the dataset.
Macro-averaging VS Micro-averaging
• Macro-averaging: average the precision and recall across all classes to get the final macro-averaged precision and recall scores. A macro-average computes the metric independently for each class and then takes the average (hence treating all classes equally).
• Micro-averaging: a micro-average aggregates the contributions of all classes to compute the average metric.
• In a multi-class classification setup, the micro-average is preferable if you suspect a class imbalance. It gives equal importance to every individual prediction, regardless of the class distribution in the dataset. This makes it a useful metric for understanding the model's performance on a global scale, particularly when classes are well-balanced. (A comparison sketch follows below.)
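A sketch contrasting macro- and micro-averaged precision on the same 3x3 matrix used above (again assuming rows are actual classes and columns are predicted classes):

```python
import numpy as np

cm = np.array([[16,  0,  0],
               [ 0, 17,  1],
               [ 0,  0, 11]])    # rows = actual, columns = predicted

tp = np.diag(cm).astype(float)
fp = cm.sum(axis=0) - tp          # column totals minus the diagonal
fn = cm.sum(axis=1) - tp          # row totals minus the diagonal

# Macro: compute precision per class, then average (every class weighs the same)
per_class_precision = tp / (tp + fp)
macro_precision = per_class_precision.mean()

# Micro: pool the counts over all classes, then compute a single precision
micro_precision = tp.sum() / (tp.sum() + fp.sum())

print(macro_precision, micro_precision)
```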
Receiver Operating Characteristic (ROC)
• Evaluating different machine learning configurations
  – May involve dozens, hundreds, or thousands of confusion matrices (one per classification threshold)
  – Tedious to review; summarize them with a receiver operating characteristic (ROC) curve.
• True Positive Rate (TPR) is a synonym for recall and is defined as TPR = TP / (TP + FN).
• False Positive Rate (FPR) is defined as FPR = FP / (FP + TN).
• A ROC curve plots TPR vs. FPR at different classification thresholds.
• Lowering the classification threshold makes a logistic regression model classify more items as positive, thus increasing both false positives and true positives.
• Which threshold is better?
  – From the ROC curve we can determine this without creating every confusion matrix.
Area Under the ROC Curve (AUC)
• To compute the points in a ROC curve, we could evaluate a logistic regression model many times with different classification thresholds, but this would be inefficient.
• Fortunately, an efficient, sorting-based algorithm can provide this information to us: the AUC.
• AUC stands for "Area Under the ROC Curve." That is, AUC measures the entire two-dimensional area underneath the whole ROC curve (think integral calculus) from (0,0) to (1,1). (A usage sketch follows below.)
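A usage sketch, assuming scikit-learn is available; `roc_curve` returns the FPR/TPR pairs at each threshold and `roc_auc_score` the area under that curve (the labels and scores below are hypothetical):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical labels and predicted probabilities from some classifier
y_true  = np.array([0, 0, 1, 1, 0, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.55, 0.65, 0.3])

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # TPR/FPR at each threshold
auc = roc_auc_score(y_true, y_score)                # area under the ROC curve

for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold={th:.2f}  FPR={f:.2f}  TPR={t:.2f}")
print("AUC =", round(auc, 3))
```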
ROC for Logistic Regression (figure)

ROC for Random Forest (figure)

• Which ROC is better?
  – From the ROC curves we can determine the best classification model.
• People often replace the False Positive Rate with Precision.
Example
ID  Outlook   Temperature  Humidity  Wind    Play Tennis
1   Sunny     >25          High      Weak    No
2   Sunny     >25          High      Strong  No
3   Overcast  >25          High      Weak    Yes
4   Rain      15-25        High      Weak    Yes
5   Rain      <15          Normal    Weak    Yes
6   Rain      <15          Normal    Strong  No
7   Overcast  <15          Normal    Strong  Yes
8   Sunny     15-25        High      Weak    No
9   Sunny     <15          Normal    Weak    Yes
10  Rain      15-25        Normal    Weak    Yes
11  Sunny     15-25        Normal    Strong  Yes
12  Overcast  15-25        High      Strong  Yes
13  Overcast  >25          Normal    Weak    Yes
14  Rain      15-25        High      Strong  No

Tree induction example
• Entropy of data S
  Info(S) = -9/14*log2(9/14) - 5/14*log2(5/14) = 0.94

• Split data by attribute Outlook
  S [9+,5-]  -->  Sunny    [2+,3-]
                  Overcast [4+,0-]
                  Rain     [3+,2-]

  Gain(Outlook) = 0.94 - 5/14*[-2/5*log2(2/5) - 3/5*log2(3/5)]
                       - 4/14*[-4/4*log2(4/4) - 0/4*log2(0/4)]
                       - 5/14*[-3/5*log2(3/5) - 2/5*log2(2/5)]
                = 0.94 - 0.69 = 0.25

Tree induction example
• Split data by attribute Temperature
  S [9+,5-]  -->  <15    [3+,1-]
                  15-25  [4+,2-]
                  >25    [2+,2-]

  Gain(Temperature) = 0.94 - 4/14*[-3/4*log2(3/4) - 1/4*log2(1/4)]
                           - 6/14*[-4/6*log2(4/6) - 2/6*log2(2/6)]
                           - 4/14*[-2/4*log2(2/4) - 2/4*log2(2/4)]
                    = 0.94 - 0.91 = 0.03

Tree induction example
• Split data by attribute Humidity
  S [9+,5-]  -->  High   [3+,4-]
                  Normal [6+,1-]

  Gain(Humidity) = 0.94 - 7/14*[-3/7*log2(3/7) - 4/7*log2(4/7)]
                        - 7/14*[-6/7*log2(6/7) - 1/7*log2(1/7)]
                 = 0.94 - 0.79 = 0.15

• Split data by attribute Wind
  S [9+,5-]  -->  Weak   [6+,2-]
                  Strong [3+,3-]

  Gain(Wind) = 0.94 - 8/14*[-6/8*log2(6/8) - 2/8*log2(2/8)]
                    - 6/14*[-3/6*log2(3/6) - 3/6*log2(3/6)]
             = 0.94 - 0.89 = 0.05

Tree induction example

Summary of the information gains at the root (data table as shown above):
  Gain(Outlook) = 0.25
  Gain(Temperature) = 0.03
  Gain(Humidity) = 0.15
  Gain(Wind) = 0.05

Outlook has the highest information gain, so it is chosen as the root node:

            Outlook
   Sunny   Overcast   Rain
    ??        Yes      ??

(A code sketch that reproduces these four gains follows.)
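A sketch that recomputes the four root-node gains from the 14-row table (the printed values should round to the gains quoted above):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def gain(rows, labels, col):
    """Information gain of splitting on column index `col`."""
    n = len(labels)
    remainder = 0.0
    for value in {row[col] for row in rows}:
        subset = [lab for row, lab in zip(rows, labels) if row[col] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# (Outlook, Temperature, Humidity, Wind, Play Tennis) for the 14 examples
data = [("Sunny", ">25", "High", "Weak", "No"),
        ("Sunny", ">25", "High", "Strong", "No"),
        ("Overcast", ">25", "High", "Weak", "Yes"),
        ("Rain", "15-25", "High", "Weak", "Yes"),
        ("Rain", "<15", "Normal", "Weak", "Yes"),
        ("Rain", "<15", "Normal", "Strong", "No"),
        ("Overcast", "<15", "Normal", "Strong", "Yes"),
        ("Sunny", "15-25", "High", "Weak", "No"),
        ("Sunny", "<15", "Normal", "Weak", "Yes"),
        ("Rain", "15-25", "Normal", "Weak", "Yes"),
        ("Sunny", "15-25", "Normal", "Strong", "Yes"),
        ("Overcast", "15-25", "High", "Strong", "Yes"),
        ("Overcast", ">25", "Normal", "Weak", "Yes"),
        ("Rain", "15-25", "High", "Strong", "No")]

rows   = [d[:4] for d in data]
labels = [d[4] for d in data]
for i, name in enumerate(["Outlook", "Temperature", "Humidity", "Wind"]):
    print(name, round(gain(rows, labels, i), 2))   # ~0.25, 0.03, 0.15, 0.05
```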
• Entropy of branch Sunny
  Info(Sunny) = -2/5*log2(2/5) - 3/5*log2(3/5) = 0.97

• Split the Sunny branch by attribute Temperature
  Sunny [2+,3-]  -->  <15    [1+,0-]
                      15-25  [1+,1-]
                      >25    [0+,2-]

  Gain(Temperature) = 0.97 - 1/5*[-1/1*log2(1/1) - 0/1*log2(0/1)]
                           - 2/5*[-1/2*log2(1/2) - 1/2*log2(1/2)]
                           - 2/5*[-0/2*log2(0/2) - 2/2*log2(2/2)]
                    = 0.97 - 0.4 = 0.57

• Split the Sunny branch by attribute Humidity
  Sunny [2+,3-]  -->  High   [0+,3-]
                      Normal [2+,0-]

  Gain(Humidity) = 0.97 - 3/5*[-0/3*log2(0/3) - 3/3*log2(3/3)]
                        - 2/5*[-2/2*log2(2/2) - 0/2*log2(0/2)]
                 = 0.97 - 0 = 0.97

• Split the Sunny branch by attribute Wind
  Sunny [2+,3-]  -->  Weak   [1+,2-]
                      Strong [1+,1-]

  Gain(Wind) = 0.97 - 3/5*[-1/3*log2(1/3) - 2/3*log2(2/3)]
                    - 2/5*[-1/2*log2(1/2) - 1/2*log2(1/2)]
             = 0.97 - 0.95 = 0.02

Humidity has the highest gain on the Sunny branch, so it is chosen for that node.

Tree induction example

Outlook

Sunny Overcast Rain

Humidity Yes ??

High Normal

No Yes

• Entropy of branch Rain
  Info(Rain) = -3/5*log2(3/5) - 2/5*log2(2/5) = 0.97

• Split the Rain branch by attribute Temperature
  Rain [3+,2-]  -->  <15    [1+,1-]
                     15-25  [2+,1-]
                     >25    [0+,0-]

  Gain(Temperature) = 0.97 - 2/5*[-1/2*log2(1/2) - 1/2*log2(1/2)]
                           - 3/5*[-2/3*log2(2/3) - 1/3*log2(1/3)]
                    = 0.97 - 0.95 = 0.02   (the empty >25 subset contributes 0)

• Split the Rain branch by attribute Humidity
  Rain [3+,2-]  -->  High   [1+,1-]
                     Normal [2+,1-]

  Gain(Humidity) = 0.97 - 2/5*[-1/2*log2(1/2) - 1/2*log2(1/2)]
                        - 3/5*[-2/3*log2(2/3) - 1/3*log2(1/3)]
                 = 0.97 - 0.95 = 0.02

• Split the Rain branch by attribute Wind
  Rain [3+,2-]  -->  Weak   [3+,0-]
                     Strong [0+,2-]

  Gain(Wind) = 0.97 - 3/5*[-3/3*log2(3/3) - 0/3*log2(0/3)]
                    - 2/5*[-0/2*log2(0/2) - 2/2*log2(2/2)]
             = 0.97 - 0 = 0.97

Wind has the highest gain on the Rain branch, so it is chosen for that node.

Tree induction example

Outlook

Sunny Overcast Rain

Humidity Yes Wind

High Normal Weak Strong

No Yes Yes No

Example: Gini Index
• Gini index for the root node for the Student Background attribute.
• The overall Gini index for this split.
• The overall Gini index for the split with the Work Status variable.
• The overall Gini index for the split with the Online variable.
• The Gini index is lowest for the Student Background variable; thus, we pick this variable for the root node. (A sketch of how a split's overall Gini index is computed follows below.)
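Since the slide's actual class counts are not reproduced here, the following sketch uses hypothetical counts purely to show how the overall (weighted) Gini index of a split is computed:

```python
def gini(counts):
    """Gini impurity 1 - sum(p_k^2) for the class counts in one partition."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def split_gini(partitions):
    """Weighted average of the partition impurities (overall Gini for a split)."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * gini(p) for p in partitions)

# Hypothetical (positive, negative) class counts in each branch of a split
student_background = [(5, 1), (2, 4), (3, 3)]
work_status        = [(6, 4), (4, 4)]
print(split_gini(student_background), split_gini(work_status))
```

The attribute whose split yields the lowest overall Gini index is chosen, which is how the Student Background variable ends up at the root in the slide's example.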
