
Data classification

Xuan–Hieu Phan

VNU University of Engineering and Technology


[email protected]

Updated: September 5, 2023

data analysis and mining course @ Xuan–Hieu Phan data classification 1 / 159
Outline

1 Introduction

2 Bayes classification

3 Decision tree classification

4 Rule–based classification

5 Instance–based learning

6 Logistic regression

7 Classification model assessment

8 References and Summary

data analysis and mining course @ Xuan–Hieu Phan data classification 2 / 159
Outline

1 Introduction

2 Bayes classification

3 Decision tree classification

4 Rule–based classification

5 Instance–based learning

6 Logistic regression

7 Classification model assessment

8 References and Summary

data analysis and mining course @ Xuan–Hieu Phan data classification 3 / 159
Data classification

Classification is one of the central data mining problems.


Supervised learning vs. semi–supervised learning.
Various approaches and techniques.
Many real–life problems can be seen as data classification.
They have a wide range of applications.

data analysis and mining course @ Xuan–Hieu Phan data classification 4 / 159
Classification problem

Let X = (X1, X2, . . . , Xd) be a d–dimensional space, where each attribute/variable Xj is numeric or categorical.
Let C = {c1 , c2 , . . . , ck } be a set of k distinct class labels.
Let D = {(xi, yi)}_{i=1}^{n} be a training dataset consisting of n data examples (a.k.a. data
instances, data observations, data points, data tuples) xi = (x1 , x2 , . . . , xd ) ∈ X
together with their true class label yi ∈ C.
Data classification, in the simplest form, is a supervised learning problem that learns
a classification model (i.e., a classifier ) f based on the training data D in order to
map any given data instance x into its most likely class c:

f : X −→ C (1)

Basically, f is an efficient and effective classification model if it is robust (i.e., accurate and reliable), compact, fast, and scalable.

data analysis and mining course @ Xuan–Hieu Phan data classification 5 / 159
Classification methods

Naive Bayes
Decision trees
Rule–based classification
k–nearest neighbors (k–NN)
Logistic regression
Support vector machines (SVMs)
Neural networks
Ensemble learning: bagging, boosting, random forests
Etc.

data analysis and mining course @ Xuan–Hieu Phan data classification 6 / 159
Applications of data classification
Document classification or categorization
Spam filtering
OCR (optical character recognition) and handwriting recognition
Pattern/object classification/recognition (in computer vision)
Various classification tasks in natural language processing
Agriculture, e.g., fruit categorization (type, shape, quality, . . . )
Medicine and biology: disease, drug, MRI, gene classification, . . .
Finance and banking: credit rating, fraud detection, etc.
Telecommunication: fraud detection, churn prediction, etc.
Online marketing, e–commerce: user analysis and understanding (demographic
prediction like gender, age; income prediction)
And various applications in education, transportation, fintech, online social network
and social media, etc.
data analysis and mining course @ Xuan–Hieu Phan data classification 7 / 159
Outline

1 Introduction

2 Bayes classification

3 Decision tree classification

4 Rule–based classification

5 Instance–based learning

6 Logistic regression

7 Classification model assessment

8 References and Summary

data analysis and mining course @ Xuan–Hieu Phan data classification 8 / 159
Classification problem revisited

Let X = (X1, X2, . . . , Xd) be a d–dimensional space, where each attribute/variable Xj is numeric or categorical.
Let C = {c1 , c2 , . . . , ck } be a set of k distinct class labels.
Let D = {(xi, yi)}_{i=1}^{n} be a training dataset consisting of n data examples (a.k.a. data
instances, data observations, data points, data tuples) xi = (x1 , x2 , . . . , xd ) ∈ X
together with their true class label yi ∈ C.
Data classification, in the simplest form, is a supervised learning problem that learns
a classification model (i.e., a classifier ) f based on the training data D in order to
map any given data instance x into its most likely class c:

f : X −→ C (2)

Basically, f is an efficient and effective classification model if it is robust (i.e., accurate and reliable), compact, fast, and scalable.

data analysis and mining course @ Xuan–Hieu Phan data classification 9 / 159
Bayes classification
The Bayes classifier directly uses the Bayes theorem to predict the class for a new
data instance, x.
It estimates the posterior probability P (ci |x) for each class ci , and chooses the class
that has the largest probability. The predicted class for x is given as:

ŷ = arg max_{ci} P(ci | x)    (3)

The Bayes theorem allows us to invert the posterior probability in terms of the likelihood and the prior probability, as follows:

P(ci | x) = P(x | ci) · P(ci) / P(x)    (4)

where P(x | ci) is the likelihood, defined as the probability of observing x assuming that the true class is ci, P(ci) is the prior probability of class ci, and P(x) is the probability of observing x from any of the k classes, given as P(x) = Σ_{j=1}^{k} P(x | cj) · P(cj).

data analysis and mining course @ Xuan–Hieu Phan data classification 10 / 159
Bayes classification (cont’d)

Since P (x) is fixed for a given data instance, the Bayes classification equation (3) can be
rewritten as:

ŷ = arg max_{ci} P(ci | x)
   = arg max_{ci} [ P(x | ci) · P(ci) / P(x) ]    (5)
   = arg max_{ci} P(x | ci) · P(ci)

In other words, the predicted class of an instance x essentially depends on the likelihood
of that class and its prior probability.

data analysis and mining course @ Xuan–Hieu Phan data classification 11 / 159
Training Bayes classification model

To train the Bayes classifier (i.e., P (x|ci ) · P (ci )), we need to


estimate the empirical prior probability P̂ (ci ), and
estimate the empirical likelihood P̂ (x|ci )
from the training data D = {(xi, yi)}_{i=1}^{n}.

data analysis and mining course @ Xuan–Hieu Phan data classification 12 / 159
Estimating the empirical prior probability P̂ (ci )

Let Di denote the subset of data instances in D that are labeled with class ci :

Di = {xj ∈ D | yj = ci}

Let n be the size of the whole training dataset n = |D|, and let the size of each
class–specific subset Di be given as |Di | = ni . The empirical prior probability for
class ci can be estimated from the training data as follows:
P̂(ci) = ni / n    (6)

data analysis and mining course @ Xuan–Hieu Phan data classification 13 / 159
Estimating the empirical likelihood P̂ (x|ci )

To estimate the empirical likelihood P̂(x|ci), we have to estimate the joint probability of x across all the d dimensions, i.e., we have to estimate P̂(x = (x1, x2, . . . , xd)|ci).
However, estimating the joint probability P̂ (x|ci ) faces several challenges:
The lack of enough data to reliably estimate the joint probability density or mass
function, especially for high dimensional data.
Computing the joint probability is complicated and time–consuming.

We need to simplify the estimation of P̂(x|ci) in some way to avoid computing the
joint probability.

data analysis and mining course @ Xuan–Hieu Phan data classification 14 / 159
Naive Bayes classification

The naive Bayes approach makes a simplifying assumption that all the attributes X1, X2, . . . , Xd are (conditionally) independent given the class.
This leads to a much simpler, though surprisingly effective classifier in practice.
The independence assumption immediately implies that the likelihood can be
decomposed into a product of dimension–wise probabilities as follows:
P(x | ci) = P(x1, x2, . . . , xd | ci) = Π_{j=1}^{d} P(xj | ci)    (7)

Attribute Xj can be either numeric or categorical. Depending on the type of Xj, the estimation of P̂(xj | ci) is different.

data analysis and mining course @ Xuan–Hieu Phan data classification 15 / 159
Estimating P̂ (xj |ci ) when Xj is numeric

For a numeric attribute Xj, we make the default assumption that it is normally distributed for each class ci.
Let µij and σij² denote the mean and variance of attribute Xj given class ci. Then the likelihood is given as:

P(xj | ci) ∝ f(xj | µij, σij²) = (1 / (√(2π) σij)) exp{ −(xj − µij)² / (2σij²) }    (8)

When µij and σij² are replaced by the sample mean µ̂ij and the sample variance σ̂ij², computed over all values of attribute Xj in the class-specific subset Di = {xj ∈ D | yj = ci}, P̂(xj | ci) is estimated as follows:

P̂(xj | ci) ∝ f(xj | µ̂ij, σ̂ij²) = (1 / (√(2π) σ̂ij)) exp{ −(xj − µ̂ij)² / (2σ̂ij²) }    (9)

data analysis and mining course @ Xuan–Hieu Phan data classification 16 / 159
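As a quick illustration, the following is a minimal Python sketch of the estimate in equation (9); the function name, the plain-list representation of Di's attribute values, and the use of the (biased) sample variance are assumptions made here for brevity, not something prescribed by the slides.

import math

def gaussian_likelihood(x_j, values_ij):
    # values_ij: the values of attribute Xj observed in Di (the tuples of class ci).
    n = len(values_ij)
    mu = sum(values_ij) / n                              # sample mean (estimate of mu_ij)
    var = sum((v - mu) ** 2 for v in values_ij) / n      # sample variance (estimate of sigma_ij^2); assumes var > 0
    return math.exp(-(x_j - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)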
Estimating P̂ (xj |ci ) when Xj is categorical

Suppose Xj has mj categorical values {v1j, v2j, . . . , vmj j}.

Estimating P̂(xj | ci) amounts to estimating the conditional probability for a particular value vrj (1 ≤ r ≤ mj) of Xj as follows:

P̂(Xj = vrj | ci) = ni(vrj) / ni    (10)

where ni is the size of Di and ni(vrj) is the number of data tuples in Di in which the value of attribute Xj equals vrj.

Since there are often cases where ni(vrj) = 0 in the training data (the zero-probability problem), which would lead to wrong classifications for future data instances, we need to adjust or smooth (e.g., with Laplacian smoothing) the above estimate as follows:

P̂(Xj = vrj | ci) = (ni(vrj) + 1) / (ni + mj)    (11)

data analysis and mining course @ Xuan–Hieu Phan data classification 17 / 159
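Similarly, a minimal Python sketch of the Laplace-smoothed estimate in equation (11); the function name and the list-based representation of the class-specific attribute values are illustrative assumptions.

def categorical_likelihood(v_rj, values_ij, m_j):
    # values_ij: the values of attribute Xj over the tuples of Di (class ci).
    # m_j: the number of distinct categorical values of Xj.
    n_i = len(values_ij)
    count = sum(1 for v in values_ij if v == v_rj)       # ni(v_rj)
    return (count + 1) / (n_i + m_j)                     # (ni(v_rj) + 1) / (ni + m_j)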
Sample of AllElectronics customer database

Class-labeled training data tuples from the AllElectronics customer database [1]

data analysis and mining course @ Xuan–Hieu Phan data classification 18 / 159
Example of smoothing to avoid zero probability

Suppose that, for the class buys_computer = yes, some larger training dataset contains 1000 tuples: 0 tuples with income = low, 990 tuples with income = medium, and 10 tuples with income = high.
The probabilities of these events, without the Laplacian correction, are

P̂(income = low | buys_computer = yes) = 0/1000 = 0.000
P̂(income = medium | buys_computer = yes) = 990/1000 = 0.990
P̂(income = high | buys_computer = yes) = 10/1000 = 0.010

After applying Laplacian smoothing (adding 1 to each count, so the denominator becomes 1003):

P̂(income = low | buys_computer = yes) = 1/1003 = 0.001
P̂(income = medium | buys_computer = yes) = 991/1003 = 0.988
P̂(income = high | buys_computer = yes) = 11/1003 = 0.011

data analysis and mining course @ Xuan–Hieu Phan data classification 19 / 159
Example of Naive Bayes classification

Suppose we have built a naive Bayes classifier from the training data in the previous
slide. How can we classify the following data instance?
x = (age = youth, income = medium, student = yes, credit_rating = fair)

We need to maximize P̂ (x|ci )P̂ (ci ), for i = 1, 2. P̂ (ci ), the prior probability of each
class, can be computed based on the training tuples:
P̂(buys_computer = yes) = 9/14 = 0.643
P̂(buys_computer = no) = 5/14 = 0.357

data analysis and mining course @ Xuan–Hieu Phan data classification 20 / 159
Example of Naive Bayes classification (cont’d)

To compute P̂ (x|ci ), for i = 1, 2, we compute the following conditional probabilities:


P̂(age = youth | buys_computer = yes) = 2/9 = 0.222
P̂(age = youth | buys_computer = no) = 3/5 = 0.600
P̂(income = medium | buys_computer = yes) = 4/9 = 0.444
P̂(income = medium | buys_computer = no) = 2/5 = 0.400
P̂(student = yes | buys_computer = yes) = 6/9 = 0.667
P̂(student = yes | buys_computer = no) = 1/5 = 0.200
P̂(credit_rating = fair | buys_computer = yes) = 6/9 = 0.667
P̂(credit_rating = fair | buys_computer = no) = 2/5 = 0.400

data analysis and mining course @ Xuan–Hieu Phan data classification 21 / 159
Example of Naive Bayes classification (cont’d)
Using these probabilities, we obtain
P̂(x | buys_computer = yes) = P̂(age = youth | buys_computer = yes)
    × P̂(income = medium | buys_computer = yes)
    × P̂(student = yes | buys_computer = yes)
    × P̂(credit_rating = fair | buys_computer = yes)
  = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
Similarly,
P̂(x | buys_computer = no) = 0.600 × 0.400 × 0.200 × 0.400 = 0.019

To find the class, ci, that maximizes P̂(x | ci)P̂(ci), we compute

P̂(x | buys_computer = yes)P̂(buys_computer = yes) = 0.044 × 0.643 = 0.028
P̂(x | buys_computer = no)P̂(buys_computer = no) = 0.019 × 0.357 = 0.007
Therefore, the model predicts buys_computer = yes for x.
data analysis and mining course @ Xuan–Hieu Phan data classification 22 / 159
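To make the arithmetic easy to check, here is a small Python sketch that reproduces this example; the probability values are taken directly from the previous slides rather than recomputed from the raw table, and the dictionary layout is just one possible representation.

prior = {"yes": 9 / 14, "no": 5 / 14}
likelihood = {
    "yes": {"age=youth": 2 / 9, "income=medium": 4 / 9, "student=yes": 6 / 9, "credit_rating=fair": 6 / 9},
    "no":  {"age=youth": 3 / 5, "income=medium": 2 / 5, "student=yes": 1 / 5, "credit_rating=fair": 2 / 5},
}

x = ["age=youth", "income=medium", "student=yes", "credit_rating=fair"]

scores = {}
for c in prior:
    score = prior[c]
    for attribute_value in x:
        score *= likelihood[c][attribute_value]   # naive independence assumption, equation (7)
    scores[c] = score

print(scores)                                     # roughly 0.028 for "yes" and 0.007 for "no"
print(max(scores, key=scores.get))                # "yes"  ->  buys_computer = yes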
Naive Bayes classification: pros and cons

Pros:
It is easy and fast to predict the class of a test instance, and it also performs well in multi-class prediction.
When the independence assumption holds, a naive Bayes classifier performs very well and needs less training data than other models such as logistic regression.
It performs well with categorical input variables.
Cons:
If a categorical variable has a category in the test (or future) data that was not observed in the training data, the model assigns it a zero probability and cannot make a prediction. To solve this, we can use the smoothing technique described above.
Model parameters are not estimated from data rigorously in terms of mathematical
optimization.
The independence assumption is strong and normally does not hold in real data.
The normality assumption for numeric attributes is also strong.

data analysis and mining course @ Xuan–Hieu Phan data classification 23 / 159
Outline

1 Introduction

2 Bayes classification

3 Decision tree classification

4 Rule–based classification

5 Instance–based learning

6 Logistic regression

7 Classification model assessment

8 References and Summary

data analysis and mining course @ Xuan–Hieu Phan data classification 24 / 159
Classification problem revisited

Let X = (X1 , X2 , . . . , Xd ) be a d–dimensional space, where each attribute/variable


Xj is numeric or categorical.
Let C = {c1 , c2 , . . . , ck } be a set of k distinct class labels.
Let D = {(xi , yi )}ni=1 be a training dataset consisting of n data examples (a.k.a data
instances, data observations, data points, data tuples) xi = (x1 , x2 , . . . , xd ) ∈ X
together with their true class label yi ∈ C.
Data classification, in the simplest form, is a supervised learning problem that learns
a classification model (i.e., a classifier ) f based on the training data D in order to
map any given data instance x into its most likely class c:

f : X −→ C (12)

Basically, f is an efficient and effective classification model if it is robust (i.e.,


accurate and reliable), compact, fast, and scalable.

data analysis and mining course @ Xuan–Hieu Phan data classification 25 / 159
Decision tree classification

A decision tree for the concept buys_computer, indicating whether a customer is likely to purchase a computer.
Each internal node represents a test on an attribute. Each leaf node represents a class (either buys_computer =
yes or buys_computer = no). [1]

Decision tree induction is the learning of decision trees from class–labeled


training data tuples.
A decision tree is a flowchart–like tree structure, where each internal node
denotes a test on an attribute, each branch represents an outcome of the test, and
each leaf node holds a class label. The topmost node in a tree is the root node.
data analysis and mining course @ Xuan–Hieu Phan data classification 26 / 159
How are decision trees used for classification?

Given a new data tuple x ∈ X whose class label is unknown, the attribute values of x are tested against the decision tree. A path is traced from the root to a leaf node, which holds the class prediction for that tuple.
Decision trees can easily be converted to classification rules.

data analysis and mining course @ Xuan–Hieu Phan data classification 27 / 159
Why are decision tree classifiers so popular?

The construction of decision tree classifiers does not require any domain
knowledge or parameter setting, and therefore is appropriate for exploratory
knowledge discovery.
Decision trees can handle multidimensional data.
Their representation of acquired knowledge in tree form is intuitive and generally
easy to assimilate by humans.
The learning and classification steps of decision tree induction are simple and fast.
In general, decision tree classifiers have good accuracy. However, successful use may
depend on the data at hand. Decision trees have been used in many application
areas such as medicine, manufacturing and production, financial analysis,
astronomy, and molecular biology.

data analysis and mining course @ Xuan–Hieu Phan data classification 28 / 159
Decision tree induction algorithms

ID3, C4.5, and CART are the most popular tree induction algorithms.
ID3: Iterative Dichotomiser 3 (R. Quinlan, 1986)
C4.5: An improvement of ID3 (R. Quinlan, 1993)
CART: Classification and regression trees (L. Breiman et al., 1984)

ID3, C4.5, and CART adopt a greedy (i.e., nonbacktracking) approach in which
decision trees are constructed in a top–down recursive divide–and–conquer
manner.
Most algorithms for decision tree induction also follow a top–down approach, which
starts with a training set of data tuples and their associated class labels. The
training set is recursively partitioned into smaller subsets as the tree is being built.

data analysis and mining course @ Xuan–Hieu Phan data classification 29 / 159
A basic decision tree induction algorithm
1: procedure GenerateDecisionTree(D, Attributes, SelectAttribute)
2: output: A decision tree.
3: create a node N ;
4: if (tuples in D are all of the same class c) then
5: return N as a leaf node labeled with the class c; // stop at a leaf
6: end if
7: if (Attributes is empty) then
8: return N as a leaf node labeled with the majority class in D; // stop at a leaf
9: end if
10: apply SelectAttribute(D, Attributes) to find the “best” splitting–criterion;
11: label node N with splitting–criterion;
12: if (splitting–attribute is discrete-valued and multiway splits allowed) then
13: Attributes ←− Attributes \ splitting–attribute; // remove this attribute
14: end if
15: for (each outcome j of splitting–criterion) do
16: // partition the tuples and grow subtrees for each partition
17: let Dj be the set of data tuples in D satisfying outcome j; // a partition
18: if (Dj is empty) then
19: attach a leaf labeled with the majority class in D to node N ;
20: else
21: node N ′ ←− GenerateDecisionTree(Dj , Attributes, SelectAttribute); // recursive function call
22: attach node N ′ to node N ;
23: end if
24: end for
25: return N ;
26: end procedure
data analysis and mining course @ Xuan–Hieu Phan data classification 30 / 159
The algorithm explained
If the data tuples in D are all of the same class, then node N becomes a leaf and
is labeled with that class (lines 4, 5).
If there are no candidate attributes to examine, then N also becomes a leaf node and is labeled with the class that appears most frequently in D (lines 7, 8).
Otherwise, the algorithm calls SelectAttribute method to determine the splitting
criterion. The splitting criterion tells us which attribute to test at node N by
determining the “best” way to separate or partition the tuples in D into individual
classes (line 10).
The splitting criterion also tells us which branches to grow from node N with
respect to the outcomes of the test. More specifically, the splitting criterion
indicates the splitting attribute and may also indicate either a split-point or a
splitting subset.
The splitting criterion is determined so that, ideally, the resulting partitions at each
branch are as “pure” as possible. A partition is pure if all the tuples in it belong to
the same class.
data analysis and mining course @ Xuan–Hieu Phan data classification 31 / 159
The algorithm explained (cont’d)

The node N is labeled with the splitting criterion, which serves as a test at the node
(line 11).
A branch is grown from node N for each of the outcomes of the splitting criterion.
The tuples in D are partitioned accordingly (lines 15, 17).
Let A be the splitting attribute, and suppose A has v distinct values, {a1 , a2 , . . . , av },
based on the training data. Then there are three possible scenarios:
A is discrete-valued;
A is continuous-valued; or
A is discrete-valued and a binary tree must be produced.

data analysis and mining course @ Xuan–Hieu Phan data classification 32 / 159
Three possibilities for partitioning data tuples

Three possibilities for partitioning tuples: (a) If A is discrete-valued; (b) If A is continuous-valued; (c) If A is
discrete-valued and a binary tree must be produced. [1]
data analysis and mining course @ Xuan–Hieu Phan data classification 33 / 159
A is discrete–valued

In this case, the outcomes of the test at node N correspond directly to the known
values of A. A branch is created for each known value, aj , of A and labeled with
that value.
Partition Dj is the subset of class-labeled tuples in D having value aj of A.
Because all the data tuples in a given partition have the same value for A, A need
not be considered in any future partitioning of the tuples in this branch. Therefore,
it is removed from Attributes.

data analysis and mining course @ Xuan–Hieu Phan data classification 34 / 159
A is continuous–valued

In this case, the test at node N has two possible outcomes, corresponding to the
conditions A ≤ split–point and A > split–point, respectively, where split–point is the
split–point returned by SelectAttribute method as part of the splitting criterion.
In practice, the split–point, a, is often taken as the midpoint of two known adjacent
values of A and therefore may not actually be a preexisting value of A from the
training data.
Two branches are grown from N and labeled according to the previous outcomes.
The tuples are partitioned such that D1 holds the subset of class-labeled tuples in D
for which A ≤ split–point, while D2 holds the rest.

data analysis and mining course @ Xuan–Hieu Phan data classification 35 / 159
A is discrete–valued and a binary tree must be produced

The test at node N is of the form “A ∈ SA ?,” where SA is the splitting subset for A,
returned by SelectAttribute method as part of the splitting criterion.
It is a subset of the known values of A. If a given tuple has value aj of A and if
aj ∈ SA , then the test at node N is satisfied.
Two branches are grown from N .
By convention, the left branch out of N is labeled yes so that D1 corresponds to
the subset of class-labeled tuples in D that satisfy the test. The right branch out
of N is labeled no so that D2 corresponds to the subset of class-labeled tuples from
D that do not satisfy the test.

data analysis and mining course @ Xuan–Hieu Phan data classification 36 / 159
The algorithm explained (cont’d)

The algorithm uses the same process recursively to form a decision tree for the
tuples at each resulting partition, Dj , of D (lines 21, 22).
The recursive partitioning stops only when any one of the following terminating
conditions is true:
All the tuples in partition D (at node N ) belong to the same class (lines 4, 5).
There are no remaining attributes on which the tuples may be further partitioned
(lines 7, 8). In this case, majority voting is employed. This involves converting node
N into a leaf and labeling it with the most common class in D. Alternatively, the
class distribution of the node tuples may be stored.
There are no tuples for a given branch, that is, a partition Dj is empty (line 18). In
this case, a leaf is created with the majority class in D (line 19).

The resulting decision tree is returned (line 25).

data analysis and mining course @ Xuan–Hieu Phan data classification 37 / 159
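For concreteness, here is a compact Python sketch that mirrors the pseudocode of GenerateDecisionTree above for discrete-valued attributes with multiway splits; the dictionary-based tuple and tree representations and the select_attribute callback (e.g., one based on the attribute selection measures discussed next) are assumptions of this sketch.

from collections import Counter

def generate_decision_tree(D, attributes, select_attribute):
    # D: list of (x, y) pairs, with x a dict mapping attribute name -> value.
    # attributes: dict mapping each remaining attribute to the list of its known values.
    # select_attribute: callable(D, attributes) returning the "best" splitting attribute.
    labels = [y for _, y in D]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1:                  # lines 4-5: all tuples of the same class
        return labels[0]
    if not attributes:                         # lines 7-8: no attributes left -> majority leaf
        return majority
    A = select_attribute(D, attributes)        # line 10: choose the splitting attribute
    remaining = {a: vals for a, vals in attributes.items() if a != A}   # line 13
    node = {"attribute": A, "branches": {}}
    for a_j in attributes[A]:                  # lines 15-24: one branch per known value of A
        D_j = [(x, y) for x, y in D if x[A] == a_j]
        if not D_j:                            # lines 18-19: empty partition -> majority leaf
            node["branches"][a_j] = majority
        else:                                  # lines 21-22: grow the subtree recursively
            node["branches"][a_j] = generate_decision_tree(D_j, remaining, select_attribute)
    return node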
The computational complexity of the algorithm

The computational complexity of the algorithm given the training dataset D is

O(d × |D| × log(|D|))


where d is the number of attributes describing the data tuples in D, and |D| is the total
number of data tuples in D.

data analysis and mining course @ Xuan–Hieu Phan data classification 38 / 159
Attribute selection measures

An attribute selection measure is a heuristic for selecting the splitting criterion that “best” separates a given data partition, D, of class-labeled training tuples into individual classes.
If we were to split D into smaller partitions according to the outcomes of the
splitting criterion, ideally each partition would be pure (i.e., all the tuples that fall
into a given partition would belong to the same class).
Conceptually, the “best” splitting criterion is the one that most closely results in
such a scenario. Attribute selection measures are also known as splitting rules
because they determine how the tuples at a given node are to be split.
The attribute selection measure provides a ranking for each attribute describing
the given training tuples. The attribute having the best score for the measure is
chosen as the splitting attribute for the given tuples.

data analysis and mining course @ Xuan–Hieu Phan data classification 39 / 159
Attribute selection measures (cont’d)

If the splitting attribute is continuous–valued or if we are restricted to binary trees, then, respectively, either a split point or a splitting subset must also be determined as part of the splitting criterion.
The tree node created for partition D is labeled with the splitting criterion, branches
are grown for each outcome of the criterion, and the tuples are partitioned
accordingly.
There are three popular attribute selection measures: (1) information gain, (2) gain
ratio, and (3) Gini index.
Let D, the data partition, be a training set of class-labeled tuples. Suppose the class
label attribute has k distinct values defining k distinct classes, ci (for i = 1..k). Let
Dci be the set of tuples of class ci in D. Let |D| and |Dci | denote the number of
tuples in D and Dci , respectively.

data analysis and mining course @ Xuan–Hieu Phan data classification 40 / 159
Information gain

ID3 uses information gain (IG) as its attribute selection measure. This measure is
based on pioneering work by Claude Shannon on information theory, which
studied the value or “information content” of messages.
Let node N represent or hold the tuples of partition D. The attribute with the
highest information gain is chosen as the splitting attribute for node N .
This attribute minimizes the information needed to classify the tuples in the
resulting partitions and reflects the least randomness or “impurity” in these
partitions.
Such an approach minimizes the expected number of tests needed to classify a
given tuple and guarantees that a simple (but not necessarily the simplest) tree is
found.

data analysis and mining course @ Xuan–Hieu Phan data classification 41 / 159
Information gain: entropy of D by class distribution

The expected information needed to classify a tuple in D is given by


Info(D) = − Σ_{i=1}^{k} pi log2(pi)    (13)

where pi is the non-zero probability that an arbitrary tuple in D belongs to class ci and is estimated by |Dci| / |D|.
A log function to the base 2 is used, because the information is encoded in bits.
Info(D) is just the average amount of information needed to identify the class
label of a tuple in D.
Note that, at this point, the information we have is based solely on the proportions
of tuples of each class. Info(D) is also known as the entropy of D.

data analysis and mining course @ Xuan–Hieu Phan data classification 42 / 159
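A minimal Python sketch of equation (13), working directly from the per-class counts |Dci|; the function name is an assumption of this sketch.

import math

def info(class_counts):
    # class_counts: list of |D_ci| values for the k classes in partition D.
    n = sum(class_counts)
    return -sum((c / n) * math.log2(c / n) for c in class_counts if c > 0)

print(info([9, 5]))   # about 0.940 bits for a 9-vs-5 class split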
Information gain: splitting attribute and data partitions

Now, suppose we were to partition the tuples in D on some attribute A having v distinct values, {a1 , a2 , . . . , av }, as observed from the training data.
If A is discrete-valued, these values correspond directly to the v outcomes of a test
on A. Attribute A can be used to split D into v partitions or subsets,
{D1 , D2 , . . . , Dv }, where Dj contains those tuples in D that have outcome aj of A.
These partitions would correspond to the branches grown from node N .
Ideally, we would like this partitioning to produce an exact classification of the data
tuples. That is, we would like for each partition to be pure.
However, it is quite likely that the partitions will be impure (e.g., where a partition
may contain a collection of tuples from different classes rather than from a single
class).

data analysis and mining course @ Xuan–Hieu Phan data classification 43 / 159
Information gain (cont’d)

How much more information would we still need (after the partitioning) to arrive at
an exact classification? This amount is measured by
InfoA(D) = Σ_{j=1}^{v} (|Dj| / |D|) × Info(Dj)    (14)

The term |Dj| / |D| acts as the weight of the partition Dj.

Info A (D) is the expected information required to classify a tuple from D based on
the partitioning by A. The smaller the expected information (still) required,
the greater the purity of the partitions.

data analysis and mining course @ Xuan–Hieu Phan data classification 44 / 159
Information gain (cont’d)

Information gain is defined as the difference between the original information requirement (i.e., based on just the proportion of classes) and the new requirement (i.e., obtained after partitioning on A). That is,

IG(A) = Info(D) − Info A (D) (15)

In other words, IG(A) tells us how much would be gained by branching on A. It is the expected reduction in the information requirement caused by knowing the value of A.
The attribute A with the highest information gain, IG(A), is chosen as the
splitting attribute at node N .
This is equivalent to saying that we want to partition on the attribute A that would
do the “best classification,” so that the amount of information still required to
finish classifying the tuples is minimal (i.e., minimum Info A (D)).

data analysis and mining course @ Xuan–Hieu Phan data classification 45 / 159
Sample of AllElectronics customer database

Class-labeled training data tuples from the AllElectronics customer database [1]

data analysis and mining course @ Xuan–Hieu Phan data classification 46 / 159
Example of information gain computation

The data sample (previous slide) presents a training set, D, of 14 class-labeled tuples randomly selected from the AllElectronics customer database.
Each attribute is discrete-valued.
The class label attribute, buys_computer, has two distinct values (namely, yes, no);
therefore, there are two distinct classes (i.e., k = 2). Let class c1 correspond to yes
and class c2 correspond to no. There are nine tuples of class yes and five tuples of
class no.
Initially, a root node N is created for the tuples in D.
To find the splitting criterion for these tuples, we must compute the information
gain of each attribute. We first compute the expected information needed to classify
a tuple in D:
Info(D) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940 bits

data analysis and mining course @ Xuan–Hieu Phan data classification 47 / 159
Example of information gain computation (cont’d)
Next, we need to compute the expected information requirement for each attribute.
Let’s start with the attribute age.
We need to look at the distribution of yes and no tuples for each category of age.
For the age category “youth,” there are 2 yes tuples and 3 no tuples. For the
category “middle aged,” there are 4 yes tuples and 0 no tuples. For the category
“senior,” there are 3 yes tuples and 2 no tuples.
The expected information needed to classify a tuple in D if the tuples are
partitioned according to age is

Infoage(D) = (5/14) × (−(2/5) log2(2/5) − (3/5) log2(3/5))
    + (4/14) × (−(4/4) log2(4/4))
    + (5/14) × (−(3/5) log2(3/5) − (2/5) log2(2/5))
    = 0.694 bits
data analysis and mining course @ Xuan–Hieu Phan data classification 48 / 159
Example of information gain computation (cont’d)
Then, IG(age) is
IG(age) = Info(D) − Info age (D) = 0.940 − 0.694 = 0.246 bits
Similarly, we can compute
IG(income) = 0.029 bits,
IG(student) = 0.151 bits, and
IG(credit_rating) = 0.048 bits.

Because age has the highest information gain among the attributes, it is selected
as the splitting attribute. Node N is labeled with age, and branches are grown for
each of the attribute’s values.
The tuples are then partitioned accordingly, as shown the next slide. Notice that the
tuples falling into the partition for age = “middle_aged” all belong to the same
class. Because they all belong to class yes, a leaf should therefore be created at the
end of this branch and labeled yes.

data analysis and mining course @ Xuan–Hieu Phan data classification 49 / 159
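The computation above can be checked with a short Python sketch; the helper names are assumptions, and the class-count lists are exactly the yes/no distributions read off the slides (9/5 overall; 2/3, 4/0, and 3/2 for youth, middle_aged, and senior).

import math

def info(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def weighted_info(partitions):
    # Info_A(D), equation (14): partitions is a list of per-branch class-count lists.
    n = sum(sum(p) for p in partitions)
    return sum((sum(p) / n) * info(p) for p in partitions)

age_partitions = [[2, 3], [4, 0], [3, 2]]    # youth, middle_aged, senior

info_D = info([9, 5])                        # about 0.940 bits
info_age = weighted_info(age_partitions)     # about 0.694 bits
ig_age = info_D - info_age                   # about 0.247 unrounded; 0.940 - 0.694 = 0.246 as on the slide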
Decision tree with the first splitting attribute age

The decision tree built based on the AllElectronics customer training data with the first splitting attribute age.
The tree is still under construction. [1]

data analysis and mining course @ Xuan–Hieu Phan data classification 50 / 159
Information gain for continuous–valued attribute

Suppose attribute A is continuous–valued and A has v distinct values in the training


data D.
Sort the values of A in increasing order.
Typically, the midpoint between each pair of adjacent values is considered as a possible split-point. Therefore, given v values of A, v − 1 possible splits are evaluated. For example, the midpoint between the values ai and ai+1 of A is (ai + ai+1) / 2.
For each possible split-point for A, we evaluate Info A (D), where the number of
partitions is two.
The point with the minimum expected information requirement for A is selected as
the split_point for A. D1 is the set of tuples in D satisfying A ≤ split_point, and D2
is the set of tuples in D satisfying A > split_point.

data analysis and mining course @ Xuan–Hieu Phan data classification 51 / 159
Gain ratio

The information gain measure is biased toward tests with many outcomes. That
is, it prefers to select attributes having a large number of values. For example,
consider an attribute that acts as a unique identifier such as customer_id or
phone_number.
A split on customer_id would result in a large number of partitions (as many as
there are values), each one containing just one data tuple.
Because each partition is pure, the information required to classify data set D
based on this partitioning would be Info customer_id (D) = 0. Therefore, the
information gained by partitioning on this attribute is maximal.
However, such a partitioning is clearly useless for classification.

data analysis and mining course @ Xuan–Hieu Phan data classification 52 / 159
Gain ratio (cont’d)
C4.5, a successor of ID3, uses an extension to information gain known as gain
ratio, which attempts to overcome this bias.
It applies a kind of normalization to information gain using a “split information”
value defined analogously with Info(D) as
SplitInfoA(D) = − Σ_{j=1}^{v} (|Dj| / |D|) log2(|Dj| / |D|)    (16)

This value represents the potential information generated by splitting the training
data set, D, into v partitions, corresponding to the v outcomes of a test on attribute
A. Split information is large when v is large.
The gain ratio is defined as
GainRatio(A) = IG(A) / SplitInfoA(D)    (17)

The attribute with the maximum gain ratio is selected as the splitting attribute.
data analysis and mining course @ Xuan–Hieu Phan data classification 53 / 159
Gain ratio example

A test on income splits D into three partitions, namely low, medium, and high,
containing 4, 6, and 4 data tuples, respectively.
The gain ratio of income is

SplitInfoincome(D) = −(4/14) log2(4/14) − (6/14) log2(6/14) − (4/14) log2(4/14) = 1.557

IG(income) = 0.029. Therefore, GainRatio(income) = 0.029 / 1.557 = 0.019.

data analysis and mining course @ Xuan–Hieu Phan data classification 54 / 159
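The same numbers can be reproduced with a short Python sketch; the function name is an assumption, and IG(income) = 0.029 is taken from the earlier slide.

import math

def split_info(partition_sizes):
    # SplitInfo_A(D), equation (16): partition_sizes holds |D_j| for each outcome of A.
    n = sum(partition_sizes)
    return -sum((s / n) * math.log2(s / n) for s in partition_sizes if s > 0)

si_income = split_info([4, 6, 4])        # about 1.557
gain_ratio_income = 0.029 / si_income    # about 0.019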
Gini index
The Gini index is used in CART. Using the notation previously described, the Gini
index measures the impurity of D, a data partition or set of training tuples, as
Gini(D) = 1 − Σ_{i=1}^{k} pi²    (18)

where pi is the probability that a tuple in D belongs to class ci and is estimated by |Dci| / |D|. The sum is computed over k classes.

The Gini index considers a binary split for each attribute. Let’s first consider the
case where A is a discrete-valued attribute having v distinct values, {a1 , a2 , . . . , av },
occurring in D.
To determine the best binary split on A, we examine all the possible subsets that
can be formed using known values of A. Each subset, SA , can be considered as a
binary test for attribute A of the form “A ∈ SA ”?
Given a tuple, this test is satisfied if the value of A for the tuple is among the values
listed in SA .
data analysis and mining course @ Xuan–Hieu Phan data classification 55 / 159
Gini index (cont’d)

If A has v possible values, then there are 2^v possible subsets.


For example, if income has three possible values, namely {low, medium, high}, then
the possible subsets are {low, medium, high}, {low, medium}, {low, high}, {medium,
high}, {low }, {medium}, {high}, and {}.
We exclude the full set, {low, medium, high}, and the empty set from consideration since, conceptually, they do not represent a split. Therefore, there are 2^v − 2 possible ways to form two partitions of the data, D, based on a binary split on A.
When considering a binary split, we compute a weighted sum of the impurity of
each resulting partition. For example, if a binary split on A partitions D into D1
and D2 , the Gini index of D given that partitioning is
GiniA(D) = (|D1| / |D|) Gini(D1) + (|D2| / |D|) Gini(D2)    (19)

data analysis and mining course @ Xuan–Hieu Phan data classification 56 / 159
Gini index (cont’d)

For each attribute, each of the possible binary splits is considered. For a
discrete-valued attribute, the subset that gives the minimum Gini index for that
attribute is selected as its splitting subset.
For continuous-valued attributes, each possible split-point must be considered. The
strategy is similar to that described earlier for information gain, where the
midpoint between each pair of (sorted) adjacent values is taken as a possible
split-point.
The point giving the minimum Gini index for a given (continuous-valued)
attribute is taken as the split-point of that attribute.
Recall that for a possible split-point of A, D1 is the set of tuples in D satisfying A ≤
split_point, and D2 is the set of tuples in D satisfying A > split_point.

data analysis and mining course @ Xuan–Hieu Phan data classification 57 / 159
Gini index (cont’d)

The reduction in impurity that would be incurred by a binary split on a discrete- or continuous-valued attribute A is

∆Gini(A) = Gini(D) − Gini A (D) (20)

The attribute that maximizes the reduction in impurity (or, equivalently, has
the minimum Gini index) is selected as the splitting attribute.
This attribute and either its splitting subset (for a discrete-valued splitting
attribute) or split-point (for a continuous-valued splitting attribute) together form
the splitting criterion.

data analysis and mining course @ Xuan–Hieu Phan data classification 58 / 159
Gini index example

We first use the Gini index to compute the impurity of D:

Gini(D) = 1 − (9/14)² − (5/14)² = 0.459

To find the splitting criterion for the tuples in D, we need to compute the Gini index
for each attribute.
Let’s start with the attribute income and consider each of the possible splitting
subsets. Consider the subset {low, medium}. This would result in 10 tuples in
partition D1 satisfying the condition “income ∈ {low, medium}.” The remaining
four tuples of D would be assigned to partition D2 . The Gini index value computed
based on this partitioning is

data analysis and mining course @ Xuan–Hieu Phan data classification 59 / 159
Gini index example (cont’d)

Giniincome∈{low,medium}(D) = (10/14) Gini(D1) + (4/14) Gini(D2)
    = (10/14) (1 − (7/10)² − (3/10)²) + (4/14) (1 − (2/4)² − (2/4)²)
    = 0.443 = Giniincome∈{high}(D)

Similarly, the Gini index values for splits on the remaining subsets are 0.458 (for the
subsets {low, high} and {medium}) and 0.450 (for the subsets {medium, high} and
{low }).
Therefore, the best binary split for attribute income is on {low, medium} (or
{high}) because it minimizes the Gini index.
∆Gini income∈{low,medium} (D) = 0.459 − 0.443 = 0.016

data analysis and mining course @ Xuan–Hieu Phan data classification 60 / 159
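For reference, a small Python sketch that reproduces the numbers in this example; the helper names are assumptions, and the class counts (9/5 overall; 7/3 and 2/2 for the two income partitions) come from the slides.

def gini(counts):
    # Gini impurity of a partition, equation (18); counts are the per-class counts.
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def gini_split(partitions):
    # Gini_A(D) for a binary split, equation (19).
    n = sum(sum(p) for p in partitions)
    return sum((sum(p) / n) * gini(p) for p in partitions)

gini_D = gini([9, 5])                        # about 0.459
gini_income = gini_split([[7, 3], [2, 2]])   # income in {low, medium} vs {high}: about 0.443
delta = gini_D - gini_income                 # reduction in impurity: about 0.016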
Gini index example (cont’d)

Evaluating age, we obtain {youth, senior } (or {middle aged }) as the best split for
age with a Gini index of 0.357; the attributes student and credit_rating are both
binary, with Gini index values of 0.367 and 0.429, respectively.
The attribute age and splitting subset {youth, senior } therefore give the minimum
Gini index overall, with a reduction in impurity of 0.459 − 0.357 = 0.102.
The binary split “age ∈ {youth, senior }?” results in the maximum reduction in
impurity of the tuples in D and is returned as the splitting criterion.
Node N is labeled with the criterion, two branches are grown from it, and the tuples
are partitioned accordingly.

data analysis and mining course @ Xuan–Hieu Phan data classification 61 / 159
Outline

1 Introduction

2 Bayes classification

3 Decision tree classification

4 Rule–based classification

5 Instance–based learning

6 Logistic regression

7 Classification model assessment

8 References and Summary

data analysis and mining course @ Xuan–Hieu Phan data classification 62 / 159
Rule–based classification

1 If–then rules for classification


2 Rule extraction from a decision tree
3 Rule induction using a sequential covering algorithm
4 Rule quality measures
5 Rule pruning

data analysis and mining course @ Xuan–Hieu Phan data classification 63 / 159
If–then rules for classification

Rules are a good way of representing information or bits of knowledge.


A rule–based classifier uses a set of IF–THEN rules for classification. An
IF–THEN rule is an expression of the form
IF condition THEN conclusion
An example is rule R1 :
IF age=youth AND student=yes THEN buys_computer=yes
The “IF” part (or left side) of a rule is known as the rule antecedent. The “THEN”
part (or right side) is the rule consequent. In the rule antecedent, the condition
consists of one or more attribute tests (e.g., age = youth and student = yes) that
are logically ANDed. The rule consequent contains a class prediction. R1 can also
be written as
(age=youth) ∧ (student=yes) ⇒ (buys_computer=yes)

data analysis and mining course @ Xuan–Hieu Phan data classification 64 / 159
Coverage and accuracy of rules

If the condition (i.e., all the attribute tests) in a rule antecedent holds true for a
given tuple, we say that the rule antecedent is satisfied (or simply, that the rule is
satisfied) and that the rule covers the tuple.
A rule R can be assessed by its coverage and accuracy. Given a data tuple, x, from
a class-labeled data set, D, let ncovers be the number of tuples covered by R; ncorrect
be the number of tuples correctly classified by R; and |D| be the number of tuples in
D. The coverage and accuracy of rule R are defined as
coverage(R) = ncovers / |D|    (21)
accuracy(R) = ncorrect / ncovers    (22)

data analysis and mining course @ Xuan–Hieu Phan data classification 65 / 159
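A minimal Python sketch of equations (21) and (22); the (antecedent, consequent) rule representation and the dict-based tuples are assumptions made here for illustration.

def rule_coverage_accuracy(rule, D):
    # rule: (antecedent, consequent), where antecedent maps attribute -> required value
    # and consequent is the predicted class; D: list of (x, y) pairs with x a dict.
    antecedent, consequent = rule
    covered = [(x, y) for x, y in D
               if all(x.get(a) == v for a, v in antecedent.items())]
    n_covers = len(covered)
    n_correct = sum(1 for _, y in covered if y == consequent)
    coverage = n_covers / len(D) if D else 0.0             # equation (21)
    accuracy = n_correct / n_covers if n_covers else 0.0   # equation (22)
    return coverage, accuracy

# Example: R1 = IF age=youth AND student=yes THEN buys_computer=yes
R1 = ({"age": "youth", "student": "yes"}, "yes")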
Sample of AllElectronics customer database

Class-labeled training data tuples from the AllElectronics customer database [1]

data analysis and mining course @ Xuan–Hieu Phan data classification 66 / 159
Coverage and accuracy of rules (cont’d)

Considering the rule R1 :

(age=youth) ∧ (student=yes) ⇒ (buys_computer=yes)

Looking at the training data consisting of 14 data tuples in the previous slide, the coverage and accuracy of R1 are:

coverage(R1) = ncovers / |D| = 2/14 = 14.28%    (23)
accuracy(R1) = ncorrect / ncovers = 2/2 = 100%    (24)

data analysis and mining course @ Xuan–Hieu Phan data classification 67 / 159
How rules can be used to classify data

How does rule-based classification predict the class label of a given data tuple, x? If a rule is satisfied by x, the rule is said to be triggered. For example, the following data tuple:
x = (age=youth, income=medium, student=yes, credit_rating=fair )
will trigger the rule R1 :
(age=youth) ∧ (student=yes) ⇒ (buys_computer=yes)
If R1 is the only rule satisfied, then the rule fires by returning the class prediction
for x.
Note that triggering does not always mean firing because there may be more
than one rule that is satisfied. If more than one rule is triggered, we have a
potential problem. What if they each specify a different class? Or what if no rule
is satisfied by x?

data analysis and mining course @ Xuan–Hieu Phan data classification 68 / 159
What if satisfied rules of x specify different classes?

If more than one rule is triggered, we need a conflict resolution strategy to figure
out which rule gets to fire and assign its class prediction to x.
There are two main strategies: size ordering and rule ordering.
The size ordering scheme assigns the highest priority to the triggering rule that
has the “toughest” requirements, where toughness is measured by the rule
antecedent size. That is, the triggering rule with the most attribute tests is fired.
The rule ordering scheme prioritizes the rules beforehand. The ordering may be
class-based or rule-based. With class-based ordering, the classes are sorted in
order of decreasing “importance” such as by decreasing order of prevalence. That is,
all the rules for the most prevalent (or most frequent) class come first, the rules for
the next prevalent class come next, and so on.
Alternatively, they may be sorted based on the misclassification cost per class.

data analysis and mining course @ Xuan–Hieu Phan data classification 69 / 159
What if satisfied rules of x specify different classes? (cont’d)

With rule-based ordering, the rules are organized into one long priority list,
according to some measure of rule quality, such as accuracy, coverage, or size
(number of attribute tests in the rule antecedent), or based on advice from domain
experts.
When rule ordering is used, the rule set is known as a decision list. With rule
ordering, the triggering rule that appears earliest in the list has the highest priority,
and so it gets to fire its class prediction. Any other rule that satisfies x is ignored.
Most rule-based classification systems use a class-based rule-ordering strategy.
How can we classify x if no rule is satisfied by it? In this case, a fallback or default rule can be set up to specify a default class, based on the training set. This may be the overall majority class or the majority class of the tuples that were not
covered by any rule. The default rule is evaluated at the end, if and only if no other
rule covers x.

data analysis and mining course @ Xuan–Hieu Phan data classification 70 / 159
Rule extraction from a decision tree
To extract rules from a decision tree, one rule is created for each path from the root to a
leaf node. Each splitting criterion along a given path is logically ANDed to form the rule
antecedent. The leaf node holding the class prediction is the rule consequent.

data analysis and mining course @ Xuan–Hieu Phan data classification 71 / 159
Rule induction using a sequential covering algorithm

IF-THEN rules can be extracted directly from the training data (i.e., without having
to generate a decision tree first) using a sequential covering algorithm.
The name comes from the notion that the rules are learned sequentially (one at a
time), where each rule for a given class will ideally cover many of the class’s tuples
(and hopefully none of the tuples of other classes).
Sequential covering algorithms are the most widely used approach to mining
disjunctive sets of classification rules.
There are many sequential covering algorithms. Popular variations include AQ,
CN2, and the more recent RIPPER. The general strategy is as follows. Rules are
learned one at a time. Each time a rule is learned, the tuples covered by the rule are
removed, and the process repeats on the remaining tuples.
This sequential learning of rules is in contrast to decision tree induction. Because
the path to each leaf in a decision tree corresponds to a rule, we can consider
decision tree induction as learning a set of rules simultaneously.

data analysis and mining course @ Xuan–Hieu Phan data classification 72 / 159
Basic sequential covering algorithm

1: procedure SequentialCoveringAlgorithm(D, AttributesValues)


2: output: A set of IF–THEN rules.
3: RuleSet ←− {}; // initial set of rules learned is empty
4: for (each class c ∈ C) do
5: repeat
6: Rule ←− LearnOneRule(D, AttributesValues, c);
7: remove tuples covered by Rule from D;
8: RuleSet ←− RuleSet ∪ {Rule}; // add new rule to the rule set
9: until terminating condition;
10: end for
11: return RuleSet;
12: end procedure

data analysis and mining course @ Xuan–Hieu Phan data classification 73 / 159
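A minimal Python sketch of the loop in lines 4-9 above, reusing the (antecedent, consequent) rule representation from the previous section; learn_one_rule (e.g., the greedy general-to-specific search described on the following slides) and the accuracy threshold are assumptions of this sketch.

def sequential_covering(D, classes, learn_one_rule, min_accuracy=0.9):
    # D: list of (x, y) pairs; classes: the class labels to learn rules for.
    rule_set = []
    for c in classes:                                   # line 4: one class at a time
        remaining = list(D)
        while remaining:                                # lines 5-9: repeat ... until
            rule = learn_one_rule(remaining, c)         # line 6: learn the "best" rule for c
            if rule is None:
                break
            antecedent, _ = rule
            covered = [(x, y) for x, y in remaining
                       if all(x.get(a) == v for a, v in antecedent.items())]
            if not covered:
                break
            accuracy = sum(1 for _, y in covered if y == c) / len(covered)
            if accuracy < min_accuracy:                 # terminating condition: rule quality too low
                break
            rule_set.append(rule)                       # line 8: add the new rule
            remaining = [t for t in remaining if t not in covered]   # line 7: remove covered tuples
    return rule_set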
How are rules learned?

Ideally, when learning a rule for a class, c, the rule tries to cover all (or many) of the
training tuples of class c and none (or few) of the tuples from other classes. In this
way, the rules learned should be of high accuracy.
The rules need not necessarily be of high coverage. This is because we can have
more than one rule for a class, so that different rules may cover different tuples
within the same class.
The process continues until the terminating condition is met, such as when there are
no more training tuples or the quality of a rule returned is below a user-specified
threshold.
The LearnOneRule procedure finds the “best” rule for the current class, given the
current set of training tuples.

data analysis and mining course @ Xuan–Hieu Phan data classification 74 / 159
How are rules learned? (cont’d)

Typically, rules are grown in a general-to-specific manner. We can think of this as a beam search, where we start off with an empty rule and then gradually keep appending attribute tests to it.
We append by adding the attribute test as a logical conjunct to the existing
condition of the rule antecedent.
Suppose our training set, D, consists of loan application data. Attributes regarding
each applicant include their age, income, education_level, residence,
credit_rating, and the term_of_the_loan. The classifying attribute is
loan_decision, which indicates whether a loan is accepted (considered safe) or
rejected (considered risky).

data analysis and mining course @ Xuan–Hieu Phan data classification 75 / 159
How are rules learned? (cont’d)

To learn a rule for the class “accept,” we start off with the most general rule
possible, that is, the condition of the rule antecedent is empty. The rule is
IF THEN loan_decision = accept
We then consider each possible attribute test that may be added to the rule. These
can be derived from the parameter AttributesValues, which contains a list of
attributes with their associated values.
For example, for an attribute–value pair (att, val), we can consider attribute tests
such as att = val, att ≤ val, att > val, and so on.
Typically, the training data will contain many attributes, each of which may have
several possible values. Finding an optimal rule set becomes computationally
explosive.

data analysis and mining course @ Xuan–Hieu Phan data classification 76 / 159
General-to-specific search through rule space

A general-to-specific search through rule space [1]

data analysis and mining course @ Xuan–Hieu Phan data classification 77 / 159
How are rules learned? (cont’d)
Instead, LearnOneRule adopts a greedy depth-first strategy.
Each time it is faced with adding a new attribute test (conjunct) to the current rule,
it picks the one that most improves the rule quality, based on the training
samples.
Suppose LearnOneRule finds that the attribute test income = high best improves
the accuracy of our current (empty) rule. We append it to the condition, so that the
current rule becomes
IF income = high THEN loan_decision = accept
Each time we add an attribute test to a rule, the resulting rule should cover
relatively more of the “accept” tuples.
During the next iteration, we again consider the possible attribute tests and end up selecting credit_rating = excellent. Our current rule grows to become
IF income = high AND credit_rating = excellent THEN loan_decision = accept
The process repeats; at each step we continue to greedily grow the rule until it meets an acceptable quality level.
data analysis and mining course @ Xuan–Hieu Phan data classification 78 / 159
Learning rules with beam search

Greedy search does not allow for backtracking. At each step, we heuristically add
what appears to be the best choice at the moment.
What if we unknowingly made a poor choice along the way?
To lessen the chance of this happening, instead of selecting the best attribute test to
append to the current rule, we can select the best k attribute tests.
In this way, we perform a beam search of width k, wherein we maintain the k best
candidates overall at each step, rather than a single best candidate.

data analysis and mining course @ Xuan–Hieu Phan data classification 79 / 159
Rule quality measures

Rules for the class loan_decision = accept, showing accept (a) and reject (r) tuples [1]

(1) The combination of accuracy and coverage
(2) Entropy
(3) Information gain
data analysis and mining course @ Xuan–Hieu Phan data classification 80 / 159
Outline

1 Introduction

2 Bayes classification

3 Decision tree classification

4 Rule–based classification

5 Instance–based learning

6 Logistic regression

7 Classification model assessment

8 References and Summary

data analysis and mining course @ Xuan–Hieu Phan data classification 81 / 159
Instance–based learning

Instance–based learning is a family of learning techniques that compares new instances with the instances in the training data in order to predict the labels for the new instances.
Instance–based learning does not perform explicit generalization but relies on the whole training data stored in memory. That is why this approach is also called memory–based learning or lazy learning.
Typical instance–based techniques are k–nearest neighbors (k–NN), locally weighted
regression (LWR), learning vector quantization (LVQ), etc.
Advantages: (1) easy to implement, (2) no need to train a model, (3) adapts to new data easily.
Disadvantages: (1) memory complexity (the whole training data must be stored), (2) classification cost is high – an approximate search solution is normally needed, (3) sensitive to noise (lack of generalization).

data analysis and mining course @ Xuan–Hieu Phan data classification 82 / 159
k–nearest neighbors (k–NN)

k–nearest neighbors (k–NN), the most popular instance–based learning technique, is a non–parametric supervised learning method.
k–NN can be used for both classification and regression.
k–NN predicts the label for a new instance z by performing a vote among the k nearest instances of z: {x1, x2, . . . , xk}.
Voting can be uniform (majority voting).
Voting can be based on weights/distances of xi .

Distance measure between data instances is important.


Choosing the right value for k is important, and it normally depends on the (nature of the) data.
Finding/searching the k nearest instances is very time–consuming. This needs
a fast distance computation or a smarter approximate search.

data analysis and mining course @ Xuan–Hieu Phan data classification 83 / 159
k–nearest neighbors algorithm

Class-labeled data: D = {(xi, yi)}_{i=1}^{n}.


Input: A new instance z; the number of selected neighbors: k.
Step 1: Calculate distances from z to all xi ∈ D.
Step 2: Find the k closest neighbors of z.
Step 3: Vote and select the most likely label for z.
data analysis and mining course @ Xuan–Hieu Phan data classification 84 / 159
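A minimal Python sketch of these three steps, assuming numeric feature vectors, Euclidean distance, and uniform majority voting; the function name is illustrative.

import math
from collections import Counter

def knn_predict(D, z, k=3):
    # D: list of (x, y) pairs with x a numeric feature vector; z: the new instance.
    distances = [(math.dist(x, z), y) for x, y in D]              # Step 1: distances from z to all xi
    neighbors = sorted(distances, key=lambda d: d[0])[:k]         # Step 2: k closest neighbors
    return Counter(y for _, y in neighbors).most_common(1)[0][0]  # Step 3: majority vote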
k–nearest neighbors algorithm (cont’d)

Distance/similarity measure: normally the Euclidean distance.


Calculating the (Euclidean) distance: can be optimized.
Finding most similar neighbors: normal pairwise matching or approximate search,
e.g., locality sensitive hashing (LSH), string edit distance, fuzzy string search,
simstring, Faiss (an efficient similarity search library from Facebook).
Label selection can be normal majority voting or weighted voting (distances can be
weight values).
Select a suitable k value by exploring the data and performing experiments.

data analysis and mining course @ Xuan–Hieu Phan data classification 85 / 159
Outline

1 Introduction

2 Bayes classification

3 Decision tree classification

4 Rule–based classification

5 Instance–based learning

6 Logistic regression

7 Classification model assessment

8 References and Summary

data analysis and mining course @ Xuan–Hieu Phan data classification 86 / 159
Classification problem revisited

Let X = (X1, X2, . . . , Xd) be a d–dimensional space, where each attribute/variable Xj is numeric or categorical.
Let C = {c1 , c2 , . . . , ck } be a set of k distinct class labels.
Let D = {(xi , yi )}ni=1 be a training dataset consisting of n data examples (a.k.a data
instances, data observations, data points, data tuples) xi = (x1 , x2 , . . . , xd ) ∈ X
together with their true class label yi ∈ C.
Data classification, in the simplest form, is a supervised learning problem that learns
a classification model (i.e., a classifier ) f based on the training data D in order to
map any given data instance x into its most likely class c:

f : X −→ C (25)

Basically, f is an efficient and effective classification model if it is robust (i.e.,


accurate and reliable), compact, fast, and scalable.

data analysis and mining course @ Xuan–Hieu Phan data classification 87 / 159
Logistic regression
Logistic regression belongs to the log–linear model family (i.e., taking the logarithm
yields a linear combination of the model parameters).
Logistic regression is also known as maximum entropy model (maxent) that is
widely used for classification, particularly for sparse data like text and natural
language.
First, considering logistic regression for binary classification (y ∈ C = {0, 1}).
Multiclass classification will be considered later.
Logistic regression has the following form:
hθ(x) = g(θ^T x) = 1 / (1 + e^{−θ^T x})   (26)
where:
x = (1, x1, x2, . . . , xd) is a (d + 1)–dimensional vector representing the data instance x.
θ = (θ0, θ1, θ2, . . . , θd) is the vector of d + 1 model parameters.
θ^T x = θ0 + θ1 x1 + θ2 x2 + · · · + θd xd is the linear combination of θ and x.
data analysis and mining course @ Xuan–Hieu Phan data classification 88 / 159
Logistic (or sigmoid) function

The model uses the logistic or sigmoid function:

g(z) = 1 / (1 + e^{−z})   (27)

Notice that g(z) tends towards 1 as z → ∞; and g(z) tends towards 0 as z → −∞.
When z = 0, g(z) = 0.5. Moreover, g(z), and hence also hθ (x), is always bounded
between 0 and 1.
Sigmoid function is smooth and differentiable everywhere. Its derivative is:

g ′ (z) = g(z)(1 − g(z)) (28)

The next slide depicts the form of g(z).

data analysis and mining course @ Xuan–Hieu Phan data classification 89 / 159
Logistic (or sigmoid) function visualization

The sigmoid function g(z) = 1 / (1 + e^{−z}) [source: from Andrew Ng's lecture note]
data analysis and mining course @ Xuan–Hieu Phan data classification 90 / 159
Logistic regression (cont’d)
Logistic regression is a discriminative model, i.e., it directly models the conditional
probability of the output class given the input, P(y|x; θ), rather than the joint
probability P(y, x; θ).
Let us assume that:
P (y = 1|x; θ) = hθ (x) (29)
P (y = 0|x; θ) = 1 − hθ (x) (30)

The above equations can be put into a more compact form as:

P (y | x; θ) = (hθ (x))y (1 − hθ (x))1−y (31)

Once the model has been trained (i.e., the parameters θ̂ have been estimated), the
classification is performed as follows: x is classified into class 1 (positive) if
P(y = 1|x; θ̂) > P(y = 0|x; θ̂), i.e., hθ̂(x) > 0.5 or θ̂^T x > 0. Otherwise, x is
classified into class 0 (negative).
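A tiny illustrative sketch of this decision rule in Python/NumPy (theta_hat and x below are made-up values, with the leading 1 in x standing for the bias term θ0):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta_hat, x):
    p_pos = sigmoid(theta_hat @ x)      # P(y = 1 | x; theta_hat)
    return 1 if p_pos > 0.5 else 0      # equivalent to theta_hat @ x > 0

theta_hat = np.array([-1.0, 2.0, 0.5])  # hypothetical estimated parameters (theta_0, theta_1, theta_2)
x = np.array([1.0, 0.8, -0.3])          # instance with the bias feature prepended
print(predict(theta_hat, x))            # theta^T x = -1 + 1.6 - 0.15 = 0.45 > 0, so class 1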
data analysis and mining course @ Xuan–Hieu Phan data classification 91 / 159
Training logistic regression model
Training logistic regression model is based on the maximum likelihood estimation
(MLE), i.e., finding the parameters θ̂ that maximize the likelihood function.
The likelihood function of the model is the product of the probabilities of all the
training examples in D:
L(θ, D) = Π_{i=1}^{n} P(yi | xi; θ)   (32)
        = Π_{i=1}^{n} (hθ(xi))^{yi} (1 − hθ(xi))^{1−yi}   (33)
For ease of computation and to avoid possible numerical errors, the log–likelihood
function is normally used:
ℓ(θ, D) = log L(θ, D)   (34)
        = Σ_{i=1}^{n} [yi log hθ(xi) + (1 − yi) log(1 − hθ(xi))]   (35)
data analysis and mining course @ Xuan–Hieu Phan data classification 92 / 159
Partial derivative of the log–likelihood function

∂ℓ(θ, D)/∂θj = Σ_{i=1}^{n} [ yi · 1/g(θ^T xi) − (1 − yi) · 1/(1 − g(θ^T xi)) ] · ∂g(θ^T xi)/∂θj
             = Σ_{i=1}^{n} [ yi · 1/g(θ^T xi) − (1 − yi) · 1/(1 − g(θ^T xi)) ] · g(θ^T xi)(1 − g(θ^T xi)) · ∂(θ^T xi)/∂θj
             = Σ_{i=1}^{n} [ yi (1 − g(θ^T xi)) − (1 − yi) g(θ^T xi) ] xij
             = Σ_{i=1}^{n} [ yi − g(θ^T xi) ] xij
             = Σ_{i=1}^{n} [ yi − hθ(xi) ] xij   (36)

data analysis and mining course @ Xuan–Hieu Phan data classification 93 / 159
Training model with gradient ascent

Gradient ascent is a first–order iterative optimization algorithm that attempts to


find the (local) maximum of a function by taking repeated steps in the direction of
the gradient of the function at the current point (i.e., the steepest ascent).
Objective function: ℓ(θ, D)
At each step, update θ:

θj ← θj + α Σ_{i=1}^{n} [yi − hθ(xi)] xij   (37)

Objective function with L2 regularization: ℓ(θ, D) − λ Σ_{j=1}^{d} θj²/2
At each step, update θ:

θj ← θj + α [ Σ_{i=1}^{n} [yi − hθ(xi)] xij − λθj ]   (38)
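A minimal batch sketch of update (38) in Python/NumPy (illustrative only; the learning rate α, the regularization weight λ, and the number of iterations are arbitrary choices, not values from the slides):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg_gradient_ascent(X, y, alpha=0.1, lam=0.01, n_iters=1000):
    # X: (n, d+1) design matrix with a leading column of ones; y: (n,) array of 0/1 labels
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (y - sigmoid(X @ theta))   # sum_i [y_i - h_theta(x_i)] * x_ij for every j
        theta += alpha * (grad - lam * theta)   # L2-regularized update, as in (38)
    return theta

# toy usage
X = np.array([[1, 0.2], [1, 0.9], [1, 2.5], [1, 3.1]], dtype=float)
y = np.array([0, 0, 1, 1], dtype=float)
print(train_logreg_gradient_ascent(X, y))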

data analysis and mining course @ Xuan–Hieu Phan data classification 94 / 159
Training model with gradient descent
Gradient descent is a first–order iterative optimization algorithm that attempts to
find the (local) minimum of a function by taking repeated steps in the opposite
direction of the gradient of the function at the current point (i.e., the direction of
steepest descent).
Objective function: −ℓ(θ, D)
At each step, update θ:

θj ← θj − α Σ_{i=1}^{n} [hθ(xi) − yi] xij   (39)

Objective function with L2 regularization: −ℓ(θ, D) + λ Σ_{j=1}^{d} θj²/2
At each step, update θ:

θj ← θj − α [ Σ_{i=1}^{n} [hθ(xi) − yi] xij + λθj ]   (40)

data analysis and mining course @ Xuan–Hieu Phan data classification 95 / 159
Training model with stochastic gradient ascent/descent

Gradient ascent and descent are batch training algorithms. They would be slow if
the size of training data is large.
Stochastic gradient ascent and descent are online optimization algorithms that
consider one data example at a time to update the model parameters. They are
normally more efficient than batch algorithms.
Parameter update in stochastic gradient ascent:

θj ← θj + α[(yi − hθ (xi ))xij − λθj ] (41)

Parameter update in stochastic gradient descent:

θj ← θj − α[(hθ (xi ) − yi )xij + λθj ] (42)

data analysis and mining course @ Xuan–Hieu Phan data classification 96 / 159
Multinomial logistic regression

Multinomial logistic regression (MLR) is a classification method that


generalizes logistic regression to multiclass problems.
It is known by various names: multiclass logistic regression, softmax
regression, or conditional maximum entropy (maxent). It is a special form of
the log–linear model family.
MLR is commonly used in text mining and natural language processing (NLP)
where the data are normally (very) sparse. In the NLP community, it is commonly
known as (conditional) maximum entropy (maxent).
Maxent is based on the principle of maximum entropy. Interestingly, the models
estimated with the two approaches (i.e., maximum likelihood and
maximum entropy) agree with each other.
Given an input data instance, the output of a MLR model is a discrete probability
distribution. The most likely class of the data instance is the class with the highest
probability value.

data analysis and mining course @ Xuan–Hieu Phan data classification 97 / 159
Multinomial logistic regression model

In multinomial logistic regression, the probability that an input data instance x


belongs to a class cq ∈ C = {c1 , c2 , . . . , ck } has the following exponential form:
P(y = cq | x; Θ) = (1/ZΘ(x)) exp(θq^T x)   (43)
where ZΘ(x) is the normalization factor:

ZΘ(x) = Σ_{r=1}^{k} exp(θr^T x)   (44)

Then, the MLR can be rewritten as follows:

P(y = cq | x; Θ) = exp(θq^T x) / Σ_{r=1}^{k} exp(θr^T x) = exp(Σ_{j=0}^{d} θqj xj) / Σ_{r=1}^{k} exp(Σ_{j=0}^{d} θrj xj)   (45)

data analysis and mining course @ Xuan–Hieu Phan data classification 98 / 159
Classification with multinomial logistic regression
Once trained, the model for each class cq ∈ C = {c1, c2, . . . , ck} is:

P(y = cq | x; Θ̂) = exp(θ̂q^T x) / Σ_{r=1}^{k} exp(θ̂r^T x) = exp(Σ_{j=0}^{d} θ̂qj xj) / Σ_{r=1}^{k} exp(Σ_{j=0}^{d} θ̂rj xj)   (46)

Given an input data instance x, the most likely output class of x is ŷ:

ŷ = cq∗ = arg max_{q=1..k} P(y = cq | x; Θ̂)
        = arg max_{q=1..k} exp(θ̂q^T x) / Σ_{r=1}^{k} exp(θ̂r^T x) = arg max_{q=1..k} exp(θ̂q^T x)
        = arg max_{q=1..k} θ̂q^T x = arg max_{q=1..k} Σ_{j=0}^{d} θ̂qj xj   (47)
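A brief illustrative sketch of this rule (Theta_hat below is a hypothetical k × (d + 1) parameter matrix, one row per class; the instance x has the bias feature 1 prepended):

import numpy as np

def softmax_probs(Theta_hat, x):
    scores = Theta_hat @ x                  # theta_q^T x for every class q
    scores = scores - scores.max()          # subtract the max for numerical stability
    e = np.exp(scores)
    return e / e.sum()                      # P(y = c_q | x; Theta_hat), as in (46)

def predict_class(Theta_hat, x):
    return int(np.argmax(Theta_hat @ x))    # index q* = argmax_q theta_q^T x, as in (47)

Theta_hat = np.array([[ 0.1,  1.0, -0.5],   # hypothetical parameters for k = 3 classes
                      [ 0.0, -0.2,  0.8],
                      [-0.3,  0.4,  0.1]])
x = np.array([1.0, 0.5, 2.0])
print(softmax_probs(Theta_hat, x), predict_class(Theta_hat, x))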

data analysis and mining course @ Xuan–Hieu Phan data classification 99 / 159
MLR model parameters
Different from logistic regression for binary classification, in multiclass setting, each
class cq ∈ C (q = 1..k) has its own parameters θ q = (θq0 , θq1 , θq2 , . . . , θqd ).
And the parameters of the model, Θ, can be seen as a matrix of individual
parameters as follows:
    Θ = (θ1; θ2; . . . ; θq; . . . ; θk) =

        | θ10  θ11  θ12  · · ·  θ1d |
        | θ20  θ21  θ22  · · ·  θ2d |
        |  ·    ·    ·   · · ·   ·  |
        | θq0  θq1  θq2  · · ·  θqd |   (48)
        |  ·    ·    ·   · · ·   ·  |
        | θk0  θk1  θk2  · · ·  θkd |

When applying MLR (i.e., maxent) to natural language processing problems, the
parameters of the model are formulated as a flat list, and each parameter is
associated with a feature (a combination of a class and a characteristic of the data
instances).
data analysis and mining course @ Xuan–Hieu Phan data classification 100 / 159
Likelihood and log–likelihood functions

Likelihood function:

L(Θ, D) = Π_{i=1}^{n} P(yi | xi; Θ) = Π_{i=1}^{n} exp(θ_{yi}^T xi) / Σ_{r=1}^{k} exp(θr^T xi)   (49)

Log–likelihood function:

ℓ(Θ, D) = log L(Θ, D) = Σ_{i=1}^{n} log P(yi | xi; Θ) = Σ_{i=1}^{n} [ θ_{yi}^T xi − log( Σ_{r=1}^{k} exp(θr^T xi) ) ]   (50)

data analysis and mining course @ Xuan–Hieu Phan data classification 101 / 159
Partial derivative of the log–likelihood function
∂ℓ(Θ, D)/∂θqj = Σ_{i=1}^{n} [ ∂(θ_{yi}^T xi)/∂θqj − ∂/∂θqj log( Σ_{r=1}^{k} exp(θr^T xi) ) ]
              = Σ_{i=1}^{n} [ I(yi = cq) xij − (1 / Σ_{r=1}^{k} exp(θr^T xi)) · ∂/∂θqj Σ_{r=1}^{k} exp(θr^T xi) ]
              = Σ_{i=1}^{n} [ I(yi = cq) xij − (1 / Σ_{r=1}^{k} exp(θr^T xi)) · exp(θq^T xi) xij ]
              = Σ_{i=1}^{n} [ I(yi = cq) xij − (exp(θq^T xi) / Σ_{r=1}^{k} exp(θr^T xi)) xij ]
              = Σ_{i=1}^{n} I(yi = cq) xij − Σ_{i=1}^{n} P(y = cq | xi; Θ) xij   (51)

where I is the indicator function: I(yi = cq) = 1 if yi = cq and I(yi = cq) = 0 if yi ≠ cq.


data analysis and mining course @ Xuan–Hieu Phan data classification 102 / 159
Partial derivative of the log–likelihood function (cont’d)
If we take the average of the log–likelihood function over the training data D, we have:

ℓ(Θ, D) = (1/n) Σ_{i=1}^{n} log( exp(θ_{yi}^T xi) / Σ_{r=1}^{k} exp(θr^T xi) )   (52)

And the partial derivative becomes:

∂ℓ(Θ, D)/∂θqj = (1/n) Σ_{i=1}^{n} I(yi = cq) xij − (1/n) Σ_{i=1}^{n} P(y = cq | xi; Θ) xij   (53)

The first term, (1/n) Σ_{i=1}^{n} I(yi = cq) xij, is the empirical expectation of the feature or
dimension j regarding class cq over the (observed) training data.
The second term, (1/n) Σ_{i=1}^{n} P(y = cq | xi; Θ) xij, is the model expectation of the feature
or dimension j regarding the model P(y = cq | xi; Θ).
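A vectorized sketch of how the gradient in (53) could be computed (illustrative NumPy code; X is the n × (d+1) design matrix, Y_onehot the n × k one-hot matrix of true labels, and Theta the k × (d+1) parameter matrix):

import numpy as np

def mlr_gradient(X, Y_onehot, Theta):
    scores = X @ Theta.T                                 # theta_q^T x_i for all i, q
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    P = np.exp(scores)
    P = P / P.sum(axis=1, keepdims=True)                 # P(y = c_q | x_i; Theta)
    n = X.shape[0]
    empirical = Y_onehot.T @ X / n                       # (1/n) sum_i I(y_i = c_q) x_ij
    model = P.T @ X / n                                  # (1/n) sum_i P(c_q | x_i; Theta) x_ij
    return empirical - model                             # gradient of the averaged log-likelihood (53)

# toy usage: 2 instances, 2 classes, d = 1 (plus bias)
X = np.array([[1.0, 0.5], [1.0, -1.0]])
Y_onehot = np.array([[1.0, 0.0], [0.0, 1.0]])
print(mlr_gradient(X, Y_onehot, np.zeros((2, 2))))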

data analysis and mining course @ Xuan–Hieu Phan data classification 103 / 159
Training multinomial logistic regression model

We also add the regularization to the log–likelihood function.


The objective function is convex and this ensures to find the global optimum.
Training MLR model can be based on iterative or optimization methods.
Iterative methods: including generalized iterative scaling (GIS) or its improvement
called improved iterative scaling (IIS).
First–order optimization methods: like (stochastic) gradient ascent/descent,
conjugate gradient.
Second–order optimization methods: Newton and quasi–Newton methods.
Quasi–Newton methods like L-BFGS (limited–memory quasi–Newton method) avoid
the explicit estimation of the Hessian matrix by approximating it from the objective
value and the gradient at each step. Second–order methods like L–BFGS have been
shown to be more efficient than GIS, IIS, and first–order optimization techniques.

data analysis and mining course @ Xuan–Hieu Phan data classification 104 / 159
The advantages of logistic regression

Has a rigorous mathematical optimization: the objective function is convex,
ensuring that training reaches the global optimum.
Allows dependencies among features (no need for an independence assumption as in
naive Bayes); that is, logistic regression allows overlapping features.
Handles sparse data very well (e.g., texts and natural language data).
The output is a probability distribution (i.e., softmax form), making it easy to select
the second best, the third best, etc.
Efficient, in terms of classification performance as well as computation (e.g., using
online and incremental training with stochastic gradient descent).

data analysis and mining course @ Xuan–Hieu Phan data classification 105 / 159
Outline

1 Introduction

2 Bayes classification

3 Decision tree classification

4 Rule–based classification

5 Instance–based learning

6 Logistic regression

7 Classification model assessment

8 References and Summary

data analysis and mining course @ Xuan–Hieu Phan data classification 106 / 159
Classification model assessment

1 Classification performance measures


2 Model evaluation and selection

data analysis and mining course @ Xuan–Hieu Phan data classification 107 / 159
Classification performance measures

Error rate and accuracy


Confusion matrix
Precision, recall, F –measure
Performance measurement for binary classification
Receiver operating characteristic (ROC) curve
Area under the curve (AUC)
Precision–recall curve and imbalanced data
Speed, robustness, scalability, and interpretability

data analysis and mining course @ Xuan–Hieu Phan data classification 108 / 159
Classification problem revisited

Let X = (X1 , X2 , . . . , Xd ) be a d–dimensional space, where each attribute/variable


Xj is numeric or categorical.
Let C = {c1 , c2 , . . . , ck } be a set of k distinct class labels.
Let D = {(xi , yi )}m
i=1 be a training dataset consisting of m data examples (a.k.a
data instances, data observations, data points, data tuples) xi = (x1 , x2 , . . . , xd ) ∈ X
together with their true class label yi ∈ C.
Data classification, in the simplest form, is a supervised learning problem that learns
a classification model (i.e., a classifier ) f based on the training data D in order to
map any given data instance x into its most likely class c:

f : X −→ C (54)

Basically, f is an efficient and effective classification model if it is robust (i.e.,


accurate and reliable), compact, fast, and scalable.

data analysis and mining course @ Xuan–Hieu Phan data classification 109 / 159
Model evaluation on test data

Given the classifier f that has been trained on the training data D using a particular
classification method like decision tree, naive Bayes, logistic regression, etc.
Let De = {(xi , yi )}ni=1 be the test dataset including n data instances xi ∈ X together
with their true class labels yi .
The model f will be evaluated on the test data De in order to answer various
questions about the performance of f , such as, the error rate, accuracy, precision,
recall, etc.
For each data instance xi ∈ De , yi is its true class label, and let ŷi be the class label
predicted by the classifier f , that is, ŷi = f (xi ).
Then, the performance assessment of f on De is based on the comparison between
the true class labels y = {y1 , y2 , . . . , yn } and the corresponding predicted class labels
ŷ = {ŷ1 , ŷ2 , . . . , ŷn }.

data analysis and mining course @ Xuan–Hieu Phan data classification 110 / 159
Error rate and accuracy

The error rate or misclassification rate of f on the test set De is the fraction of
incorrect predictions over the size of De :
ErrorRate = (1/n) Σ_{i=1}^{n} I(ŷi ≠ yi)   (55)

where n = |De| and I is the indicator function, which has the value 1 when its argument is
true, and 0 otherwise. The lower the error rate, the better the classifier.
The accuracy of f is the fraction of correct predictions over the size of the test set:

Accuracy = (1/n) Σ_{i=1}^{n} I(ŷi = yi) = 1 − ErrorRate   (56)

Accuracy gives an estimate of the probability of a correct prediction, thus, the


higher the accuracy, the better the classifier.
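A one-liner-style sketch of (55)–(56) in NumPy (the label arrays are made up for illustration):

import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1])    # hypothetical true labels
y_pred = np.array([1, 0, 0, 1, 0, 1])    # hypothetical predicted labels
accuracy = np.mean(y_pred == y_true)      # eq. (56)
error_rate = 1.0 - accuracy               # eq. (55)
print(accuracy, error_rate)               # 0.8333... and 0.1666...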

data analysis and mining course @ Xuan–Hieu Phan data classification 111 / 159
Using accuracy or error rate?

Example: an accuracy increase from 55% (baseline) to 60%:
Accuracy increase: 5 percentage points, i.e., (60 − 55)/55 = 9.1%.
Error rate reduction: 5 percentage points, i.e., (45 − 40)/45 = 11.1%.

Example: an accuracy increase from 94% (baseline) to 95.5%:
Accuracy increase: 1.5 percentage points, i.e., (95.5 − 94)/94 = 1.6%.
Error rate reduction: 1.5 percentage points, i.e., (6 − 4.5)/6 = 25%.
data analysis and mining course @ Xuan–Hieu Phan data classification 112 / 159
Confusion matrix (or contingency table)

nii : the number of instances correctly classified into class ci .


nij (i ≠ j): the number of instances truly belonging to class ci but misclassified to
class cj by the model.
Row sum: ni = Σ_{j=1}^{k} nij; Column sum: mj = Σ_{i=1}^{k} nij

|De| = n = Σ_{i=1}^{k} ni = Σ_{j=1}^{k} mj = Σ_{i=1}^{k} Σ_{j=1}^{k} nij

Accuracy = (Σ_{i=1}^{k} nii) / n
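A short sketch of building such a k × k confusion matrix from true and predicted class indices (the label lists below are hypothetical):

import numpy as np

def confusion_matrix(y_true, y_pred, k):
    cm = np.zeros((k, k), dtype=int)   # cell [i, j]: true class i predicted as class j
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

y_true = [0, 0, 1, 2, 2, 2]
y_pred = [0, 1, 1, 2, 0, 2]
cm = confusion_matrix(y_true, y_pred, k=3)
print(cm)
print(np.trace(cm) / cm.sum())   # accuracy = sum of the main diagonal / n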
data analysis and mining course @ Xuan–Hieu Phan data classification 113 / 159
What can we observe from the confusion matrix?

Which classes have high accuracy and which have low accuracy.


Which classes are easily misclassified to each other.
A robust model means the cells on the main diagonal are large and all the other
cells are small.
A perfect model is that all the cells except those on the main diagonal are zero.

data analysis and mining course @ Xuan–Hieu Phan data classification 114 / 159
Example of confusion matrix

Observations:
Accuracy = (4 + 1 + 10 + 4)/30 = 0.633; ErrorRate = 1 − 0.633 = 0.367
From row c2 : 2 instances of class c2 are misclassified to c3 and c4 .
From column c3 : in total of 15 instances classified to class c3 , only 10 instances are
correct.
c1 and c2 are not ambiguous because n12 = 0 and n21 = 0.
c3 and c4 are more ambiguous because n34 and n43 are quite large.

data analysis and mining course @ Xuan–Hieu Phan data classification 115 / 159
Normalized confusion matrix

Row normalization:
Row sum: ni = Σ_{j=1}^{k} nij
Normalized value: xij = nij / ni; then {xij}, j = 1..k, is a distribution.

Column normalization:
Column sum: mj = Σ_{i=1}^{k} nij
Normalized value: yij = nij / mj; then {yij}, i = 1..k, is a distribution.

data analysis and mining course @ Xuan–Hieu Phan data classification 116 / 159
Precision and recall
Let Pci and Rci be the precision and recall of class ci .
Let:
#true(ci ) be the total number of instances truly belonging to ci .
#predicted(ci ) be the number of instances classified into ci .
#correct(ci ) be the number of instances correctly classified to ci .
Pci = #correct(ci) / #predicted(ci) = Σ_{j=1}^{n} I(ŷj = yj = ci) / Σ_{j=1}^{n} I(ŷj = ci) = nii / mi   (57)

Rci = #correct(ci) / #true(ci) = Σ_{j=1}^{n} I(ŷj = yj = ci) / Σ_{j=1}^{n} I(yj = ci) = nii / ni   (58)

where I is the indicator function, which has the value 1 when its argument is true, and 0
otherwise.
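From a confusion matrix, per-class precision and recall are the diagonal divided by the column sums and row sums, respectively; a small sketch with a hypothetical 3-class matrix:

import numpy as np

cm = np.array([[5, 1, 0],     # rows: true classes, columns: predicted classes (hypothetical counts)
               [2, 6, 1],
               [0, 2, 3]])
precision = np.diag(cm) / cm.sum(axis=0)   # P_ci = n_ii / m_i (column sums)
recall = np.diag(cm) / cm.sum(axis=1)      # R_ci = n_ii / n_i (row sums)
print(precision)   # [5/7, 6/9, 3/4]
print(recall)      # [5/6, 6/9, 3/5]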
data analysis and mining course @ Xuan–Hieu Phan data classification 117 / 159
Examples of precision and recall

The values on the main diagonal of the row–normalized confusion matrix are the
recall values of the classes:
Rc1 = 0.666; Rc2 = 0.333; Rc3 = 0.714; Rc4 = 0.571
The values on the main diagonal of the column–normalized confusion matrix are
the precision values of the classes:
Pc1 = 0.666; Pc2 = 0.5; Pc3 = 0.666; Pc4 = 0.571

data analysis and mining course @ Xuan–Hieu Phan data classification 118 / 159
F1 –measure and Fβ –measure
F1–measure is the harmonic mean of precision and recall. Let F1–measure_ci be the
F1–measure of class ci:

F1–measure_ci = 2 · (Pci · Rci) / (Pci + Rci)   (59)

Example: F1–measure_c3 = 2 · (0.666 · 0.714) / (0.666 + 0.714) = 0.689

The general F–measure is the Fβ–measure. Let Fβ–measure_ci be the Fβ–measure of class ci:

Fβ–measure_ci = (1 + β²) · (Pci · Rci) / (β² · Pci + Rci)   (60)

When β > 1, recall is more important than precision; when β < 1, precision is more
important than recall. Normally, β = 1 and precision and recall are treated equally.
data analysis and mining course @ Xuan–Hieu Phan data classification 119 / 159
Micro and macro average of precision and recall
There are two ways of calculating average of precision, recall, and F –measure. They
are micro and macro average.
Micro average:

microP = Σ_{ci∈C} #correct(ci) / Σ_{ci∈C} #predicted(ci) = (Σ_{i=1}^{k} nii) / n = Accuracy   (61)

microR = Σ_{ci∈C} #correct(ci) / Σ_{ci∈C} #true(ci) = (Σ_{i=1}^{k} nii) / n = Accuracy   (62)

Macro average:

macroP = (1/k) Σ_{i=1}^{k} Pci   (63)

macroR = (1/k) Σ_{i=1}^{k} Rci   (64)
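A quick sketch contrasting the two averages on a hypothetical confusion matrix:

import numpy as np

cm = np.array([[50, 5, 0],    # hypothetical confusion matrix (rows: true, columns: predicted)
               [ 4, 40, 6],
               [ 2, 3, 5]])
micro = np.trace(cm) / cm.sum()                    # = accuracy, as in (61)-(62)
macro_p = np.mean(np.diag(cm) / cm.sum(axis=0))    # average of per-class precisions, eq. (63)
macro_r = np.mean(np.diag(cm) / cm.sum(axis=1))    # average of per-class recalls, eq. (64)
print(micro, macro_p, macro_r)                     # the minor third class pulls the macro values down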

data analysis and mining course @ Xuan–Hieu Phan data classification 120 / 159
Example of micro and macro average

microP, microR, and microF–measure are all equal to Accuracy. They reflect the
classification performance on the test data as a whole, without looking at the
performance of individual classes.
macroP, macroR, and macroF–measure, on the other hand, are calculated based on
individual classes. Both major and minor classes are treated equally. A minor class
with poor performance can affect the macro average significantly.
For example, the minor class c3 in the table above makes the macro average less optimistic
than the micro average.
data analysis and mining course @ Xuan–Hieu Phan data classification 121 / 159
Performance measurement for binary classification

Positive class is the class of interest, e.g., Covid positive, spam mail, or churning
customers.
TP, TN, FP (error type I), FN (error type II):
True positive (TP): correctly classified as positive.
True negative (TN): correctly classified as negative.
False positive (FP): incorrectly classified as positive. Sometimes, FP is more
important, e.g., in spam filtering (spam is positive).
False negative (FN): incorrectly classified as negative. Sometimes, FN is more serious,
e.g., in HIV diagnosis.
Usually, both FP and FN errors are important.
data analysis and mining course @ Xuan–Hieu Phan data classification 122 / 159
Performance measurement for binary classification (cont’d)

accuracy, recognition rate: (TP + TN)/(TP + TN + FP + FN) = (TP + TN)/n
error rate, misclassification rate: (FP + FN)/(TP + TN + FP + FN) = (FP + FN)/n
sensitivity, true positive rate (TPR), recall_pos: TP/(TP + FN)
specificity, true negative rate (TNR), recall_neg: TN/(FP + TN)
precision_pos: TP/(TP + FP)
precision_neg: TN/(TN + FN)
false positive rate (FPR), false alarm rate: FP/(FP + TN) = 1 − specificity
false negative rate (FNR), miss detection rate: FN/(TP + FN) = 1 − sensitivity

data analysis and mining course @ Xuan–Hieu Phan data classification 123 / 159
ROC curve (receiver operating characteristic)

Receiver operating characteristic (ROC) curve was first used during World War II
for the analysis of radar signals, and then in signal detection theory.
The ROC curve is a popular tool to assess the performance of binary classification in a
number of ways:
Assess the model’s ability of discrimination or separation between the positive and
negative classes.
Adjust (increase/decrease) FPR and FNR by changing the separation threshold.
Identify an optimal separation threshold for the model.
How is a ROC curve drawn?
Given a binary classifier f, let spos(x) be the output score of the positive class for an
input instance x, and let α be the separation threshold for the positive class, i.e., x is
positive if spos(x) ≥ α.
The ROC curve is the graph of FPR (x–axis) against TPR (y–axis) as α varies over the
range [αmin, αmax].
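A plain sketch of how the (FPR, TPR) points can be computed by sweeping α over the observed scores (the score and label arrays are illustrative):

import numpy as np

def roc_points(scores, labels):
    # scores: s_pos(x) for each test instance; labels: 1 = positive, 0 = negative
    points = []
    for alpha in sorted(set(scores), reverse=True):
        pred = (scores >= alpha).astype(int)
        tp = np.sum((pred == 1) & (labels == 1))
        fp = np.sum((pred == 1) & (labels == 0))
        fn = np.sum((pred == 0) & (labels == 1))
        tn = np.sum((pred == 0) & (labels == 0))
        points.append((fp / (fp + tn), tp / (tp + fn)))  # (FPR, TPR) at this threshold
    return points

scores = np.array([0.9, 0.8, 0.7, 0.55, 0.4, 0.2])   # hypothetical s_pos(x) values
labels = np.array([1, 1, 0, 1, 0, 0])
print(roc_points(scores, labels))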

data analysis and mining course @ Xuan–Hieu Phan data classification 124 / 159
ROC curve (receiver operating characteristic) (cont’d)

Normally, spos(x) varies in [0, 1]; if not, we can normalize.

αmin = min{spos(x) | x ∈ De}; αmax = max{spos(x) | x ∈ De}
α^P_min = min{spos(x) | x ∈ De and x ∈ positive class}
α^P_max = max{spos(x) | x ∈ De and x ∈ positive class}
α^N_min = min{spos(x) | x ∈ De and x ∈ negative class}
α^N_max = max{spos(x) | x ∈ De and x ∈ negative class}
The case in the figure: αmin = α^N_min < α^P_min < α^N_max < α^P_max = αmax
data analysis and mining course @ Xuan–Hieu Phan data classification 125 / 159
ROC curve (receiver operating characteristic) (cont’d)

Red dashed line: α^N_min < α^N_max < α^P_min < α^P_max ← a perfect model!
Green curve: α^N_min < α^P_min < α^N_max < α^P_max ← an excellent model
Yellow curve: α^N_min < α^P_min < α^N_max < α^P_max ← a good model
Blue dashed line: α^N_min = α^P_min < α^N_max = α^P_max ← random guess!
data analysis and mining course @ Xuan–Hieu Phan data classification 126 / 159
ROC curve and AUC (area under the curve)

AUC is the area under the curve. Its maximum value is 1.0.
A model is better if its AUC is closer to 1.0. In the figure of the previous slide, AUC
of the yellow model is 0.915 and the AUC of the green model is 0.956.
The ROC curve and AUC value can be used to select an optimal α threshold for a
given model. However, it also depends on the trade–off between FP (error type I)
and FN (error type II).
ROC curve and AUC are normally used for binary classification. However, they can
be used for multi–class classification by choosing one class as positive and all the
others as negative (One vs. Rest).

data analysis and mining course @ Xuan–Hieu Phan data classification 127 / 159
Precision–recall curve

The precision–recall curve is also used to assess the model's classification performance.
It uses recall (TP/(TP + FN)) as the x–axis and precision (TP/(TP + FP)) as the y–axis.
Suppose the numbers of positive and negative instances in the test set De are the same.
The curve is also drawn as the positive threshold α changes from αmin to αmax.

data analysis and mining course @ Xuan–Hieu Phan data classification 128 / 159
Precision–recall curve (cont’d)

When α = αmin = α^N_min: precision = 0.5 and recall = 1
When α^N_min < α < α^P_min: precision > 0.5 and recall = 1
When α^P_min < α < α^N_max: precision > 0.5 and recall < 1
When α^N_max < α < α^P_max: precision = 1 and recall < 1
When α → α^P_max: precision = 1 and recall approaches 0

data analysis and mining course @ Xuan–Hieu Phan data classification 129 / 159
ROC curve or precision–recall curve?

If data are balanced, both types of curves are OK. What if data are (highly)
imbalanced? e.g., 5% positive and 95% negative.
TPR = TP/(TP+FN) will not decrease much since FN cannot be large (positive is minor).
FPR = FP/(FP+TN) will not increase much since TN is large (negative is major). Therefore,
ROC curve does not change much when the data become more imbalanced.
Precision = TP/(TP+FP) will drop quickly when α decreases since FP increases quickly
(many negative instances become FP as negative is major). Therefore, Precision–recall curve
reflects the performance better when data are imbalanced.
data analysis and mining course @ Xuan–Hieu Phan data classification 130 / 159
Speed, robustness, scalability, and interpretability

Speed: the computational costs involved in training and using the given classifier.
Robustness: the ability of the classifier to make correct predictions given noisy
data or data with missing values. Robustness is typically assessed with a series of
synthetic data sets representing increasing degrees of noise and missing values.
Scalability: the ability to construct the classifier efficiently given large amounts of
data. Scalability is typically assessed with a series of data sets of increasing size.
Interpretability and explainability: the level of understanding and insight that
is provided by the classifier or predictor. Interpretability is subjective and therefore
more difficult to assess. Decision trees and classification rules can be easy to
interpret, while SVMs or neural networks are black–box and more difficult to
explain.

data analysis and mining course @ Xuan–Hieu Phan data classification 131 / 159
Model evaluation and selection

Holdout
K–fold cross–validation
Stratified cross-validation
Leave–one–out cross–validation

5×2 cross–validation
Bootstrap resampling
Comparing classifiers: using ROC and precision–recall curves
Comparing classification methods: using hypothesis testing
McNemar’s test
The resampled paired t test
K–fold cross–validation paired t test
5×2 cross–validation paired t test

data analysis and mining course @ Xuan–Hieu Phan data classification 132 / 159
Holdout

Holdout evaluation method [1]


Holdout method randomly partitions the class-labeled data into two independent
sets, a training set and a test set.
Typically a majority (e.g., two-thirds or three-fourths) is used as the training data,
and the remaining is used as the test data.
The training set is used to train the model. The model’s accuracy is then estimated
with the test set.
The estimate is usually pessimistic because only a portion of the initial data is used
to learn the model.
data analysis and mining course @ Xuan–Hieu Phan data classification 133 / 159
K–fold cross–validation

The labeled data D are randomly divided into k equal partitions (or folds) D1 , D2 ,
. . . , Dk (remember this k is not the number of class labels).
For each foldi (i = 1..k):
Training the model fi on the training data Dti = D \ Di .
Evaluating the model fi on the test data Di ; and measuring the performance of fi like
accuracy i , error-rate i , precision i , recall i , F–measure i , etc.

Computing the mean and variance of those measures, we get: µ̂accuracy, σ̂²accuracy,
µ̂error−rate, σ̂²error−rate, µ̂precision, σ̂²precision, µ̂recall, σ̂²recall, µ̂F−measure, σ̂²F−measure.
The mean values indicate the average performance while the variance values
indicate the stability of the models (or the classification method as well as the data
quality).
This method is called k–fold cross–validation. k can be any integer number but
normally k = 5, 10, 15, 20, 30.
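A compact sketch of the procedure (illustrative; train_and_eval is a hypothetical callback that trains on the training folds and returns an accuracy on the held-out fold):

import numpy as np

def k_fold_cv(X, y, k, train_and_eval, seed=0):
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        scores.append(train_and_eval(X[train_idx], y[train_idx], X[test_idx], y[test_idx]))
    return np.mean(scores), np.var(scores, ddof=1)   # mean and sample variance over the k folds

# toy usage with a trivial majority-class "classifier"
X = np.arange(20, dtype=float).reshape(-1, 1)
y = np.array([0] * 12 + [1] * 8)
def majority_baseline(X_tr, y_tr, X_te, y_te):
    return np.mean(y_te == np.bincount(y_tr).argmax())
print(k_fold_cv(X, y, k=5, train_and_eval=majority_baseline))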

data analysis and mining course @ Xuan–Hieu Phan data classification 134 / 159
Stratified cross–validation

If the data are balanced among classes, the data partitioning can be performed using
simple random sampling.
If the data are (highly) imbalanced, the simple random partitioning may lead to
the lack of data instances of minor classes in the test set (of each fold). This affects
the evaluation results.
The solution is to use stratified random partitioning; that is, the data of each class
are partitioned individually and then combined to obtain the k partitions/folds
as needed. This helps to retain the class distribution of the whole data in
each partition.

data analysis and mining course @ Xuan–Hieu Phan data classification 135 / 159
Leave–one–out cross–validation

In k–fold cross–validation, if k = n (n is the size of data), then each fold has n − 1


data tuples for training and only 1 tuple for test.
Perform training and testing for each of the k folds and obtain the overall performance (precision,
recall, F–measure, etc.)
This method is feasible if the size of data is not large. If the data is very large,
this evaluation method is very expensive and not feasible.
This method is normally used for applications where labeled data is limited and
hard to find, such as medical diagnosis.

data analysis and mining course @ Xuan–Hieu Phan data classification 136 / 159
5×2 cross–validation
Dietterich (1998) proposed the 5×2 cross–validation, which uses training and
validation sets of equal size.
The labeled data D is randomly divided into two equal parts, say D1^(1) and D1^(2).
Then we train the model on D1^(1) and test on D1^(2); and then train the model on D1^(2)
and test on D1^(1).
Repeat the step above four times and we will get (D2^(1), D2^(2)), (D3^(1), D3^(2)),
(D4^(1), D4^(2)), and (D5^(1), D5^(2)). And perform 8 train/test runs on these pairs.
Finally, we have 10 models and 5 × 2 = 10 train/test runs on those pairs. Then,
calculate the mean and variance of the performance over these 10 runs to get the
final results.
Dietterich points out that after five folds (i.e., 10 times), the sets share many
instances, so that the statistics calculated from these sets do not add new
information.
data analysis and mining course @ Xuan–Hieu Phan data classification 137 / 159
Bootstrap resampling

Let D be the labeled dataset consisting of n data instances.


Create Di by sampling n times with replacement from D. Therefore, |Di | = |D|.
Repeating this sampling procedure k times, we will have k datasets: D1 , D2 , . . . , Di ,
. . . , Dk .
Using sampling with replacement, the probability of an instance being chosen (in one draw) is 1/n
and of not being chosen is 1 − 1/n.
Then, the probability that an instance xj ∈ D does not appear in Di (after n draws) is
P(xj ∉ Di) = (1 − 1/n)^n ≈ e^{−1} ≈ 0.368.
Therefore, the probability that an instance xj appears in Di after n draws is
P(xj ∈ Di) = 1 − P(xj ∉ Di) = 1 − 0.368 = 0.632.
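A tiny sketch that draws bootstrap samples and checks the ≈ 0.632 coverage empirically (the sizes are arbitrary illustrative choices):

import numpy as np

rng = np.random.default_rng(0)
n = 1000
D = np.arange(n)                          # indices of the labeled data instances
coverages = []
for _ in range(20):                       # k = 20 bootstrap datasets
    Di = rng.choice(D, size=n, replace=True)    # sample n times with replacement
    coverages.append(len(np.unique(Di)) / n)    # fraction of D that appears in Di
print(np.mean(coverages))                 # close to 1 - e^{-1} ≈ 0.632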

data analysis and mining course @ Xuan–Hieu Phan data classification 138 / 159
Bootstrap resampling (cont’d)

For i = 1..k:
Training a model fi on the training dataset Di .
Evaluating the model fi on the test dataset D; and measuring the performance of fi,
such as accuracy_i, error-rate_i, precision_i, recall_i, F–measure_i, etc.

Calculating the mean and variance values: µ̂accuracy, σ̂²accuracy, µ̂error−rate,
σ̂²error−rate, µ̂precision, σ̂²precision, µ̂recall, σ̂²recall, µ̂F−measure, σ̂²F−measure.
In terms of probability, each Di is likely to cover 63.2% instances in D. Therefore,
the classification performance on (the test data) D tends to be higher than that of
holdout and k–fold cross-validation.

data analysis and mining course @ Xuan–Hieu Phan data classification 139 / 159
Comparing classifiers: ROC and precision–recall curves

Both ROC and precision–recall curves can be used to compare two or more binary
classifiers.
Both curves are suitable for balanced data.
Precision–recall curve is more appropriate for (highly) imbalanced data.

data analysis and mining course @ Xuan–Hieu Phan data classification 140 / 159
Comparing methods: McNemar’s test
Given a training set and a validation set, we use two classification methods to train
two classifiers f1 and f2 on the training set and test them on the validation set, and
compute their errors and accuracy.
McNemar's test uses the following contingency table:
e00: the number of examples misclassified by both          | e01: the number of examples misclassified by f1 but not f2
e10: the number of examples misclassified by f2 but not f1 | e11: the number of examples correctly classified by both
Under the null hypothesis that f1 and f2 have the same error rate, we expect e01 = e10,
and both to be equal to (e01 + e10)/2. We have the chi–square statistic with
one degree of freedom:

χ² = (|e01 − e10| − 1)² / (e01 + e10)   (65)

McNemar's test rejects the null hypothesis at significance level α if this value is
greater than χ²_{α,1}. For α = 0.05, χ²_{0.05,1} = 3.84.
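The statistic in (65) is a one-liner; a sketch using the same counts as the example on the next slide (e01 = 25, e10 = 15):

def mcnemar_statistic(e01, e10):
    # chi-square statistic with one degree of freedom, eq. (65)
    return (abs(e01 - e10) - 1) ** 2 / (e01 + e10)

chi2 = mcnemar_statistic(e01=25, e10=15)
print(chi2, chi2 > 3.84)   # 2.025, False: cannot reject H0 at alpha = 0.05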
data analysis and mining course @ Xuan–Hieu Phan data classification 141 / 159
Comparing methods: McNemar’s test example

McNemar's test example [source: Internet]

χ² = (|25 − 15| − 1)² / (25 + 15) = 2.025

This value is smaller than 3.84 (at significance level α = 0.05).
The null hypothesis cannot be rejected. That is, there is no significant difference
between the two classifiers, or we cannot conclude that one is better than the other.
data analysis and mining course @ Xuan–Hieu Phan data classification 142 / 159
Comparing methods: the resampled paired t test
A series of k trials is conducted (normally k = 30). In trial i = 1..k, the labeled
data D is divided into a training set Di^t of a specified size (typically 2/3 of D)
and a validation set Di^v.
Two classification methods A and B are both trained on Di^t and the resulting
classifiers fi^A and fi^B are tested on Di^v. Let pi^A and pi^B be the resulting error rates of
fi^A and fi^B at trial i.
If we assume that the k differences pi = pi^A − pi^B were drawn independently from a
normal distribution, then we can apply Student's t test by computing the statistic:

t = (p̄ √k) / √( Σ_{i=1}^{k} (pi − p̄)² / (k − 1) )   (66)

where p̄ = (1/k) Σ_{i=1}^{k} pi.
Under the null hypothesis, this statistic has a t distribution with k − 1 degrees of
freedom. For k = 30, the null hypothesis can be rejected if |t| > t0.025,29 = 2.045.
data analysis and mining course @ Xuan–Hieu Phan data classification 143 / 159
Comparing methods: the resampled paired t test (cont’d)

There are many potential drawbacks of this approach.


First, as observed above, the individual differences pi will not have a normal
distribution because pi^A and pi^B are not independent.
Second, the pi themselves (i = 1..k) are not independent, because the test sets in the
trials overlap (and the training sets in the trials overlap as well).
These violations of the assumptions underlying the t test cause severe problems that
make this test unsafe to use.

data analysis and mining course @ Xuan–Hieu Phan data classification 144 / 159
Comparing methods: k–fold cross–validation paired t test
Use k–fold cross–validation to get k training/validation set pairs {(Di^t, Di^v)}, i = 1..k.
Use two classification methods A and B to train two classifiers fi^A and fi^B on the
training set Di^t and test on the validation set Di^v (i = 1..k). The error rates of the
two classifiers measured on the validation sets are pi^A and pi^B (i = 1..k), respectively.
Three test scenarios:
Two–tailed paired t test: If A and B have different error rates, then we expect them to
have different means, or, the difference of their means is not equal to 0.
Left–tailed paired t test: If A has less error rate than B, then the difference of their
means is less than 0.
Right–tailed paired t test: If A has greater error rate than B, then the difference of
their means is greater than 0.
The difference in error rates on fold i is pi = pi^A − pi^B. This is a paired test; that is,
for each i, both algorithms see the same training and validation sets.
When this is done k times, we have a distribution of pi containing k points. Given
that pi^A and pi^B are both (approximately) normal, their difference pi is also normal.
data analysis and mining course @ Xuan–Hieu Phan data classification 145 / 159
Comparing methods: k–fold CV two–tailed paired t test

The null hypothesis is that the distribution {p1 , p2 , . . . , pi , . . . , pk } has zero mean,
i.e., H0 : µ = 0. And the alternative hypothesis Ha : µ ̸= 0.
The sample mean and sample variance:

p̄ = (1/k) Σ_{i=1}^{k} pi   (67)

S² = Σ_{i=1}^{k} (pi − p̄)² / (k − 1)   (68)

By making the assumption that pi (i = 1..k) were independently drawn and follow
an approximately normal distribution, under the null hypothesis that µ = 0, the t
statistic with k − 1 degrees of freedom according to Student's t test is as follows:

tk−1 = (p̄ − µ) / (S/√k) = (p̄ − 0) / (S/√k) = (√k · p̄) / S   (69)

data analysis and mining course @ Xuan–Hieu Phan data classification 146 / 159
Comparing methods: k–fold CV two–tailed paired t test (cont’d)

Once tk−1 is computed, we can reject the null hypothesis that two methods A and B
have the same error rate at significance level α if tk−1 value is outside the interval
(−tα/2,k−1 , tα/2,k−1 ).
For example, for two–tailed t test, t0.05,4 = 2.776, t0.05,9 = 2.262, t0.05,19 = 2.093,
t0.05,29 = 2.045.
The problem with this method, and the reason why it is not recommended to be
used in practice, is that it violates an assumption of Student’s t test.
The differences between the classifiers' performances, pi = pi^A − pi^B, are not normally
distributed because pi^A and pi^B are not independent.
The pi (i = 1..k) themselves are not independent because the training sets overlap.

5×2 cross–validation paired t–test should be used.

data analysis and mining course @ Xuan–Hieu Phan data classification 147 / 159
Comparing methods: k–fold CV left– and right–tailed paired t test
Left–tailed: the null hypothesis is that the distribution {p1 , p2 , . . . , pi , . . . , pk } has
a zero or positive mean, i.e., H0 : µ ≥ 0. And the alternative hypothesis
Ha : µ < 0.
Right–tailed: the null hypothesis is that the distribution {p1 , p2 , . . . , pi , . . . , pk }
has a zero or negative mean, i.e., H0 : µ ≤ 0. And the alternative hypothesis
Ha : µ > 0.
The t statistic with k − 1 degrees of freedom is calculated as in two–tailed test:

tk−1 = (p̄ − µ) / (S/√k) = (p̄ − 0) / (S/√k) = (√k · p̄) / S   (70)

For left–tailed test: we can reject the null hypothesis if tk−1 < −tα,k−1 .
For right–tailed test: we can reject the null hypothesis if tk−1 > tα,k−1 .
For example, for one–tailed t test, some critical values are: t0.05,4 = 2.132,
t0.05,9 = 1.833, t0.05,19 = 1.729, t0.05,29 = 1.699.
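A small sketch of the statistic in (69)/(70); the fold-wise differences below are those of example 1 on the next slide, so it reproduces t9 = 2.079:

import numpy as np

def paired_t_statistic(p):
    # p: per-fold differences p_i = p_i^A - p_i^B
    p = np.asarray(p, dtype=float)
    k = len(p)
    p_bar = p.mean()
    s = np.sqrt(np.sum((p - p_bar) ** 2) / (k - 1))
    return np.sqrt(k) * p_bar / s

diffs = [.006, .011, -.002, .014, .013, .006, -.006, -.001, .007, -.002]
t9 = paired_t_statistic(diffs)
print(round(t9, 3), t9 > 1.833)   # 2.079, True: reject H0 in the right-tailed test at alpha = 0.05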
data analysis and mining course @ Xuan–Hieu Phan data classification 148 / 159
Comparing methods: k–fold CV right–tailed paired t test (example 1)
fold1 fold2 fold3 fold4 fold5 fold6 fold7 fold8 fold9 fold10
fiA error rate .054 .061 .047 .059 .065 .042 .039 .051 .064 .046
fiB error rate .048 .050 .049 .045 .052 .036 .045 .052 .057 .048
Difference .006 .011 -.002 .014 .013 .006 -.006 -.001 .007 -.002

p̄ = (.006 + .011 − .002 + .014 + .013 + .006 − .006 − .001 + .007 − .002) / 10 = 0.0046

S² = Σ_{i=1}^{k} (pi − p̄)² / (k − 1) = 0.0004404 / 9 = 0.000048933

t9 = (√10 · 0.0046) / 0.006995 = 2.079
With α = 0.05, t0.05,9 = 1.833. The t9 value is greater than 1.833. Therefore, we can
reject the null hypothesis. And we can state that the error rate of A is significantly
greater than that of B with a confidence level of 95%. In other words, the method B is
more accurate than A on this k–fold cross–validation.
data analysis and mining course @ Xuan–Hieu Phan data classification 149 / 159
Comparing methods: k–fold CV right–tailed paired t test (example 2)
fold1 fold2 fold3 fold4 fold5 fold6 fold7 fold8 fold9 fold10
fiA error rate .054 .061 .047 .062 .065 .042 .039 .051 .066 .046
fiB error rate .048 .050 .049 .055 .052 .036 .045 .052 .057 .051
Difference .006 .011 -.002 .007 .013 .006 -.006 -.001 .009 -.005

p̄ = (.006 + .011 − .002 + .007 + .013 + .006 − .006 − .001 + .009 − .005) / 10 = 0.0038

S² = Σ_{i=1}^{k} (pi − p̄)² / (k − 1) = 0.0004136 / 9 = 0.000045956

t9 = (√10 · 0.0038) / 0.006779 = 1.773
With α = 0.05, t0.05,9 = 1.833. The t9 value is less than 1.833. Therefore, we cannot
reject the null hypothesis. In other words, we cannot conclude that the method A has a
greater error rate than that of B.
data analysis and mining course @ Xuan–Hieu Phan data classification 150 / 159
The t distribution table (significance level α)

data analysis and mining course @ Xuan–Hieu Phan data classification 151 / 159
Comparing methods: 5×2 cross–validation paired t test
The 5×2 CV paired t test is a procedure for comparing the performance of two
models (classifiers or regressors) that was proposed by Dietterich to address
shortcomings in other methods such as the resampled paired t test and the k–fold
CV paired t test (above).
In 5×2 cross–validation, the labeled data D is randomly divided into two equal parts.
And this division is repeated 5 times to get 5 pairs of datasets: {(Di^(1), Di^(2))}, i = 1..5.
Given two classification methods A and B, for each dataset pair (Di^(1), Di^(2)) we
train classifiers and evaluate them as follows:
Train two classifiers fi^A1 and fi^B1 on Di^(1), test on Di^(2), and compute their
difference in error rates: pi^(1) (= error-rate_i^A1 − error-rate_i^B1).
Train two classifiers fi^A2 and fi^B2 on Di^(2), test on Di^(1), and compute their
difference in error rates: pi^(2) (= error-rate_i^A2 − error-rate_i^B2).

Finally, we have 5 pairs: {(pi^(1), pi^(2))}, i = 1..5.
data analysis and mining course @ Xuan–Hieu Phan data classification 152 / 159
Comparing methods: 5×2 cross–validation paired t test (cont’d)
For each pair (pi^(1), pi^(2)), we compute the mean p̄i and variance si² as follows:

p̄i = (pi^(1) + pi^(2)) / 2   (71)

si² = (pi^(1) − p̄i)² + (pi^(2) − p̄i)²   (72)

Under the null hypothesis that the two classification algorithms have the same error
rate, pi^(j) is the difference of two identically distributed proportions, and ignoring the
fact that these proportions are not independent, pi^(j) can be treated as
approximately normally distributed with mean 0 and unknown variance σ².
Then pi^(j)/σ is approximately unit normal. If we assume pi^(1) and pi^(2) are
independent normals (which is not strictly true because their training and test sets
are not drawn independently of each other), then si²/σ² has a chi–square
distribution with one degree of freedom.

data analysis and mining course @ Xuan–Hieu Phan data classification 153 / 159
Comparing methods: 5×2 cross–validation paired t test (cont’d)
If each of the si² is assumed to be independent (which is not true because they are
all computed from the same set of available data), then their sum is chi–square with
five degrees of freedom:

M = (Σ_{i=1}^{5} si²) / σ² ∼ χ²_5   (73)

and

t = (p1^(1)/σ) / √(M/5) = p1^(1) / √( Σ_{i=1}^{5} si² / 5 ) ∼ t5   (74)

This gives us a t statistic with five degrees of freedom. The 5×2 CV paired t test
rejects the null hypothesis that the two classification algorithms have the same error
rate at significance level α if this value is outside the interval (−tα/2,5, tα/2,5). By
default, α = 0.05 and t0.025,5 = 2.571.
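A sketch of the statistic in (71)–(74); p1 and p2 below hold the five pi^(1) and pi^(2) values from example 1 on the next slide:

import numpy as np

def five_by_two_cv_t(p1, p2):
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    p_bar = (p1 + p2) / 2.0                        # eq. (71)
    s2 = (p1 - p_bar) ** 2 + (p2 - p_bar) ** 2     # eq. (72)
    return p1[0] / np.sqrt(s2.sum() / 5.0)         # eq. (74)

p1 = [0.013, 0.014, 0.021, 0.020, -0.003]
p2 = [-0.005, 0.006, 0.006, 0.005, 0.003]
t = five_by_two_cv_t(p1, p2)
print(round(t, 3), abs(t) > 2.571)   # 1.391, False: cannot reject H0 at alpha = 0.05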

data analysis and mining course @ Xuan–Hieu Phan data classification 154 / 159
Comparing methods: 5×2 cross–validation paired t test (example 1)

                   i=1       i=2       i=3       i=4       i=5
fi^A1 error rate   0.156     0.163     0.167     0.162     0.158
fi^B1 error rate   0.143     0.149     0.146     0.142     0.161
fi^A2 error rate   0.148     0.150     0.171     0.156     0.175
fi^B2 error rate   0.153     0.144     0.165     0.151     0.172
pi^(1)             0.013     0.014     0.021     0.020     -0.003
pi^(2)             -0.005    0.006     0.006     0.005     0.003
p̄i                 0.004     0.010     0.0135    0.0125    0.0
si²                0.000162  0.000032  0.000113  0.000113  0.000018

t = p1^(1) / √( Σ_{i=1}^{5} si² / 5 ) = 0.013 / √0.0000874 = 1.391 −→ cannot reject the null hypothesis!

data analysis and mining course @ Xuan–Hieu Phan data classification 155 / 159
Comparing methods: 5×2 cross–validation paired t test (example 2)
                   i=1       i=2        i=3       i=4       i=5
fi^A1 error rate   0.172     0.167      0.178     0.171     0.155
fi^B1 error rate   0.141     0.153      0.145     0.140     0.158
fi^A2 error rate   0.149     0.166      0.173     0.176     0.173
fi^B2 error rate   0.152     0.141      0.146     0.153     0.175
pi^(1)             0.031     0.014      0.033     0.031     -0.003
pi^(2)             -0.003    0.025      0.027     0.023     -0.002
p̄i                 0.014     0.0195     0.03      0.027     -0.0025
si²                0.000578  0.0000605  0.000018  0.000032  0.0000005

t = p1^(1) / √( Σ_{i=1}^{5} si² / 5 ) = 0.031 / √0.0001378 = 2.641 −→ can reject the null hypothesis!

The error rates of the two methods A and B are different! Confidence level = 95%.
data analysis and mining course @ Xuan–Hieu Phan data classification 156 / 159
Outline

1 Introduction

2 Bayes classification

3 Decision tree classification

4 Rule–based classification

5 Instance–based learning

6 Logistic regression

7 Classification model assessment

8 References and Summary

data analysis and mining course @ Xuan–Hieu Phan data classification 157 / 159
References

[1] J. Han, M. Kamber, and J. Pei. Data Mining: Concepts and Techniques. Morgan
Kaufmann, Elsevier, 2012 [Book1].
[2] C. Aggarwal. Data Mining: The Textbook. Springer, 2015 [Book2].
[3] J. Leskovec, A. Rajaraman, and J. D. Ullman. Mining of Massive Datasets.
Cambridge University Press, 2014 [Book3].
[4] M. J. Zaki and W. M. Jr. Data Mining and Analysis: Fundamental Concepts and
Algorithms. Cambridge University Press, 2013 [Book4].
[5] D. Easley and J. Kleinberg. Networks, Crowds, and Markets: Reasoning About a
Highly Connected World. Cambridge University Press, 2010 [Book5].
[6] J. VanderPlas. Python Data Science Handbook: Essential Tools for Working with
Data. O’Reilly, 2017 [Book6].
[7] J. Grus. Data Science from Scratch: First Principles with Python. O’Reilly, 2015
[Book7].

data analysis and mining course @ Xuan–Hieu Phan data classification 158 / 159
Summary
Introducing the classification problem (supervised learning).
Bayes classification with naive Bayes classifier, model parameter estimation,
smoothing techniques.
Decision tree classification with ID3, C4.5 and CART algorithms together with
different attribute selection measures like information gain, gain ratio, Gini index.
Rule–based classification with rules from decision trees and rule learning using
sequential covering algorithm.
Instance-based learning with k-nearest neighbors algorithm.
Logistic regression (for binary classification) and multinomial logistic regression.
Model performance measures with accuracy, error rate, confusion matrix, precision,
recall, F-score, true positive (TP), true negative (TN), false positive (FP), false
negative (FN), ROC curve, precision-recall curve, AUC, etc.
Model evaluation and selection with holdout, cross–validation (k-fold, leave-one-out,
stratified), bootstrap resampling, as well as how to compare two classification methods.
data analysis and mining course @ Xuan–Hieu Phan data classification 159 / 159
