
Data Mining

Classification: Alternative Techniques

Lecture Notes for Chapter 4

Rule-Based

Introduction to Data Mining, 2nd Edition


by
Tan, Steinbach, Karpatne, Kumar
Rule-Based Classifier

Classify records by using a collection of "if…then…" rules

Rule: (Condition) → y
– where
◆ Condition is a conjunction of tests on attributes
◆ y is the class label
– Examples of classification rules:
◆ (Blood Type=Warm) ∧ (Lay Eggs=Yes) → Birds
◆ (Taxable Income < 50K) ∧ (Refund=Yes) → Evade=No

9/30/2020 Introduction to Data Mining, 2nd Edition 2


Rule-based Classifier (Example)
Name Blood Type Give Birth Can Fly Live in Water Class
human warm yes no no mammals
python cold no no no reptiles
salmon cold no no yes fishes
whale warm yes no yes mammals
frog cold no no sometimes amphibians
komodo cold no no no reptiles
bat warm yes yes no mammals
pigeon warm no yes no birds
cat warm yes no no mammals
leopard shark cold yes no yes fishes
turtle cold no no sometimes reptiles
penguin warm no no sometimes birds
porcupine warm yes no no mammals
eel cold no no yes fishes
salamander cold no no sometimes amphibians
gila monster cold no no no reptiles
platypus warm no no no mammals
owl warm no yes no birds
dolphin warm yes no yes mammals
eagle warm no yes no birds

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
9/30/2020 Introduction to Data Mining, 2nd Edition 3
Application of Rule-Based Classifier

A rule r covers an instance x if the attributes of the instance satisfy the condition of the rule

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Name Blood Type Give Birth Can Fly Live in Water Class
hawk warm no yes no ?
grizzly bear warm yes no no ?

The rule R1 covers a hawk => Bird


The rule R3 covers the grizzly bear => Mammal

9/30/2020 Introduction to Data Mining, 2nd Edition 4


Rule Coverage and Accuracy

Coverage of a rule:
– Fraction of records that satisfy the antecedent of a rule

Accuracy of a rule:
– Fraction of records that satisfy the antecedent that also satisfy the consequent of a rule

Tid Refund Marital Status Taxable Income Class
1 Yes Single 125K No
2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes

(Status=Single) → No
Coverage = 40%, Accuracy = 50%

9/30/2020 Introduction to Data Mining, 2nd Edition 5
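A minimal sketch, not from the slides, that computes the coverage and accuracy of the rule (Status=Single) → No on the ten training records above (plain Python, with the table hard-coded for illustration):

records = [
    {"Tid": 1,  "Refund": "Yes", "Status": "Single",   "Income": 125, "Class": "No"},
    {"Tid": 2,  "Refund": "No",  "Status": "Married",  "Income": 100, "Class": "No"},
    {"Tid": 3,  "Refund": "No",  "Status": "Single",   "Income": 70,  "Class": "No"},
    {"Tid": 4,  "Refund": "Yes", "Status": "Married",  "Income": 120, "Class": "No"},
    {"Tid": 5,  "Refund": "No",  "Status": "Divorced", "Income": 95,  "Class": "Yes"},
    {"Tid": 6,  "Refund": "No",  "Status": "Married",  "Income": 60,  "Class": "No"},
    {"Tid": 7,  "Refund": "Yes", "Status": "Divorced", "Income": 220, "Class": "No"},
    {"Tid": 8,  "Refund": "No",  "Status": "Single",   "Income": 85,  "Class": "Yes"},
    {"Tid": 9,  "Refund": "No",  "Status": "Married",  "Income": 75,  "Class": "No"},
    {"Tid": 10, "Refund": "No",  "Status": "Single",   "Income": 90,  "Class": "Yes"},
]

def antecedent(r):                    # condition part of the rule
    return r["Status"] == "Single"

consequent = "No"                     # class label part of the rule

covered = [r for r in records if antecedent(r)]
coverage = len(covered) / len(records)                                     # 4/10 = 40%
accuracy = sum(r["Class"] == consequent for r in covered) / len(covered)   # 2/4 = 50%
print(coverage, accuracy)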


How does Rule-based Classifier Work?

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Name Blood Type Give Birth Can Fly Live in Water Class
lemur warm yes no no ?
turtle cold no no sometimes ?
dogfish shark cold yes no yes ?

A lemur triggers rule R3, so it is classified as a mammal


A turtle triggers both R4 and R5
A dogfish shark triggers none of the rules

9/30/2020 Introduction to Data Mining, 2nd Edition 6


Characteristics of Rule Sets: Strategy 1

Mutually exclusive rules


– Classifier contains mutually exclusive rules if
the rules are independent of each other
– Every record is covered by at most one rule

Exhaustive rules
– Classifier has exhaustive coverage if it
accounts for every possible combination of
attribute values
– Each record is covered by at least one rule
9/30/2020 Introduction to Data Mining, 2nd Edition 7
Characteristics of Rule Sets: Strategy 2

Rules are not mutually exclusive


– A record may trigger more than one rule
– Solution?
◆ Ordered rule set
◆ Unordered rule set – use voting schemes

Rules are not exhaustive


– A record may not trigger any rules
– Solution?
◆ Use a default class
9/30/2020 Introduction to Data Mining, 2nd Edition 8
Ordered Rule Set

Rules are rank ordered according to their priority


– An ordered rule set is known as a decision list
When a test record is presented to the classifier
– It is assigned to the class label of the highest ranked rule it has
triggered
– If none of the rules fired, it is assigned to the default class

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Name Blood Type Give Birth Can Fly Live in Water Class
turtle cold no no sometimes ?
9/30/2020 Introduction to Data Mining, 2nd Edition 9
Rule Ordering Schemes

Rule-based ordering
– Individual rules are ranked based on their quality
Class-based ordering
– Rules that belong to the same class appear together

Rule-based Ordering:
(Refund=Yes) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) ==> Yes
(Refund=No, Marital Status={Married}) ==> No

Class-based Ordering:
(Refund=Yes) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) ==> No
(Refund=No, Marital Status={Married}) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) ==> Yes

9/30/2020 Introduction to Data Mining, 2nd Edition 10


Building Classification Rules

Direct Method:
◆ Extract rules directly from data
◆ Examples: RIPPER, CN2, Holte’s 1R

Indirect Method:
◆ Extract rules from other classification models (e.g.
decision trees, neural networks, etc).
◆ Examples: C4.5rules

9/30/2020 Introduction to Data Mining, 2nd Edition 11


Direct Method: Sequential Covering

1. Start from an empty rule


2. Grow a rule using the Learn-One-Rule function
3. Remove training records covered by the rule
4. Repeat Steps (2) and (3) until the stopping criterion
is met

9/30/2020 Introduction to Data Mining, 2nd Edition 12
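A minimal sketch, not from the slides, of the sequential covering loop; learn_one_rule and rule.covers(record) are hypothetical stand-ins for the Learn-One-Rule function and a rule-matching test:

def sequential_covering(records, target_class, learn_one_rule, min_covered=1):
    rules = []
    remaining = list(records)                           # step 1: start with all records
    while remaining:
        rule = learn_one_rule(remaining, target_class)  # step 2: grow one rule
        covered = [r for r in remaining if rule.covers(r)]
        if len(covered) < min_covered:                  # stopping criterion
            break
        rules.append(rule)
        # step 3: remove training records covered by the rule
        remaining = [r for r in remaining if not rule.covers(r)]
    return rules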


Example of Sequential Covering

[Figure: (i) Original Data; (ii) Step 1]

9/30/2020 Introduction to Data Mining, 2nd Edition 13


Example of Sequential Covering…

[Figure: (iii) Step 2 – rule R1 covers part of the data; (iv) Step 3 – rule R2 covers more of the remaining data]

9/30/2020 Introduction to Data Mining, 2nd Edition 14


Rule Growing

Two common strategies

[Figure:
(a) General-to-specific: start from the empty rule {} (Yes: 3, No: 4) and consider candidate conjuncts such as Refund=No (Yes: 3, No: 4), Status=Single (Yes: 2, No: 1), Status=Divorced (Yes: 1, No: 0), Status=Married (Yes: 0, No: 3), Income>80K (Yes: 3, No: 1), keeping the best one, e.g., growing toward (Refund=No, Status=Single) → Class=Yes.
(b) Specific-to-general: start from a specific rule such as (Refund=No, Status=Single, Income=85K) → Class=Yes or (Refund=No, Status=Single, Income=90K) → Class=Yes and generalize by removing conjuncts.]

9/30/2020 Introduction to Data Mining, 2nd Edition 15


Rule Evaluation

FOIL (First Order Inductive Learner) – an early rule-based learning algorithm

Foil's Information Gain
– R0: {} => class (initial rule)
– R1: {A} => class (rule after adding conjunct)

– Gain(R0, R1) = p1 × [ log2( p1 / (p1 + n1) ) − log2( p0 / (p0 + n0) ) ]

– p0: number of positive instances covered by R0
  n0: number of negative instances covered by R0
  p1: number of positive instances covered by R1
  n1: number of negative instances covered by R1

9/30/2020 Introduction to Data Mining, 2nd Edition 16
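A minimal sketch, not from the slides, of FOIL's information gain as defined above; the counts in the example call are made up for illustration:

import math

def foil_gain(p0, n0, p1, n1):
    # Gain(R0, R1) where R0 covers p0 positive / n0 negative instances
    # and R1 (R0 plus one conjunct) covers p1 positive / n1 negative instances.
    return p1 * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

# Example: R0 covers 10 positives and 10 negatives; the candidate conjunct A
# leaves 6 positives and 1 negative covered.
print(foil_gain(10, 10, 6, 1))   # ~4.67 bits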


Direct Method: RIPPER

For 2-class problem, choose one of the classes as


positive class, and the other as negative class
– Learn rules for positive class
– Negative class will be default class
For multi-class problem
– Order the classes according to increasing class
prevalence (fraction of instances that belong to a
particular class)
– Learn the rule set for smallest class first, treat the rest
as negative class
– Repeat with next smallest class as positive class

9/30/2020 Introduction to Data Mining, 2nd Edition 17


Direct Method: RIPPER

Growing a rule:
– Start from empty rule
– Add conjuncts as long as they improve FOIL’s
information gain
– Stop when rule no longer covers negative examples
– Prune the rule immediately using incremental reduced
error pruning
– Measure for pruning: v = (p-n)/(p+n)
◆ p: number of positive examples covered by the rule in
the validation set
◆ n: number of negative examples covered by the rule in
the validation set
– Pruning method: delete any final sequence of
conditions that maximizes v
9/30/2020 Introduction to Data Mining, 2nd Edition 18
Direct Method: RIPPER

Building a Rule Set:


– Use sequential covering algorithm
◆ Finds the best rule that covers the current set of
positive examples
◆ Eliminate both positive and negative examples
covered by the rule
– Each time a rule is added to the rule set,
compute the new description length
◆ Stop adding new rules when the new description
length is d bits longer than the smallest description
length obtained so far

9/30/2020 Introduction to Data Mining, 2nd Edition 19


Direct Method: RIPPER

Optimize the rule set:


– For each rule r in the rule set R
◆ Consider 2 alternative rules:
– Replacement rule (r*): grow a new rule from scratch
– Revised rule (r′): add conjuncts to extend the rule r
◆ Compare the rule set for r against the rule sets for r* and r′
◆ Choose the rule set that minimizes the description length (MDL principle)

– Repeat rule generation and rule optimization


for the remaining positive examples

9/30/2020 Introduction to Data Mining, 2nd Edition 20


Indirect Methods

[Figure: decision tree with root P; the P=No branch splits on Q, the P=Yes branch splits on R, and the R=Yes branch splits again on Q]

Rule Set:
r1: (P=No, Q=No) ==> -
r2: (P=No, Q=Yes) ==> +
r3: (P=Yes, R=No) ==> +
r4: (P=Yes, R=Yes, Q=No) ==> -
r5: (P=Yes, R=Yes, Q=Yes) ==> +

9/30/2020 Introduction to Data Mining, 2nd Edition 21


Indirect Method: C4.5rules

Extract rules from an unpruned decision tree


For each rule r: A → y,
– consider an alternative rule r′: A′ → y where A′ is obtained by removing one of the conjuncts in A
– Compare the pessimistic error rate for r against all alternatives r′
– Prune if one of the alternative rules has a lower pessimistic error rate
– Repeat until we can no longer improve the generalization error

9/30/2020 Introduction to Data Mining, 2nd Edition 22


Indirect Method: C4.5rules

Instead of ordering the rules, order subsets of


rules (class ordering)
– Each subset is a collection of rules with the
same rule consequent (class)
– Compute description length of each subset
◆ Description length = L(error) + g L(model)
◆ g is a parameter that takes into account the
presence of redundant attributes in a rule set
(default value = 0.5)

9/30/2020 Introduction to Data Mining, 2nd Edition 23


Example
Name Give Birth Lay Eggs Can Fly Live in Water Have Legs Class
human yes no no no yes mammals
python no yes no no no reptiles
salmon no yes no yes no fishes
whale yes no no yes no mammals
frog no yes no sometimes yes amphibians
komodo no yes no no yes reptiles
bat yes no yes no yes mammals
pigeon no yes yes no yes birds
cat yes no no no yes mammals
leopard shark yes no no yes no fishes
turtle no yes no sometimes yes reptiles
penguin no yes no sometimes yes birds
porcupine yes no no no yes mammals
eel no yes no yes no fishes
salamander no yes no sometimes yes amphibians
gila monster no yes no no yes reptiles
platypus no yes no no yes mammals
owl no yes yes no yes birds
dolphin yes no no yes no mammals
eagle no yes yes no yes birds

9/30/2020 Introduction to Data Mining, 2nd Edition 24


C4.5 versus C4.5rules versus RIPPER

C4.5rules:
(Give Birth=No, Can Fly=Yes) → Birds
(Give Birth=No, Live in Water=Yes) → Fishes
(Give Birth=Yes) → Mammals
(Give Birth=No, Can Fly=No, Live in Water=No) → Reptiles
( ) → Amphibians

RIPPER:
(Live in Water=Yes) → Fishes
(Have Legs=No) → Reptiles
(Give Birth=No, Can Fly=No, Live In Water=No) → Reptiles
(Can Fly=Yes, Give Birth=No) → Birds
() → Mammals

[Figure: the corresponding C4.5 decision tree – root "Give Birth?": Yes → Mammals; No → "Live In Water?": Yes → Fishes, Sometimes → Amphibians, No → "Can Fly?": Yes → Birds, No → Reptiles]

9/30/2020 Introduction to Data Mining, 2nd Edition 25


C4.5 versus C4.5rules versus RIPPER

C4.5 and C4.5rules:


PREDICTED CLASS
Amphibians Fishes Reptiles Birds Mammals
ACTUAL Amphibians 2 0 0 0 0
CLASS Fishes 0 2 0 0 1
Reptiles 1 0 3 0 0
Birds 1 0 0 3 0
Mammals 0 0 1 0 6
RIPPER:
PREDICTED CLASS
Amphibians Fishes Reptiles Birds Mammals
ACTUAL Amphibians 0 0 0 0 2
CLASS Fishes 0 3 0 0 0
Reptiles 0 0 3 0 1
Birds 0 0 1 2 1
Mammals 0 2 1 0 4

9/30/2020 Introduction to Data Mining, 2nd Edition 26


Advantages of Rule-Based Classifiers

Has characteristics quite similar to decision trees
– As highly expressive as decision trees
– Easy to interpret (if rules are ordered by class)
– Performance comparable to decision trees
◆ Can handle redundant and irrelevant attributes
◆ Variable interaction can cause issues (e.g., X-OR problem)

Better suited for handling imbalanced classes

Harder to handle missing values in the test set

9/30/2020 Introduction to Data Mining, 2nd Edition 27


Data Mining
Classification: Alternative Techniques

Lecture Notes for Chapter 4

Instance-Based Learning

Introduction to Data Mining, 2nd Edition


by
Tan, Steinbach, Karpatne, Kumar
Nearest Neighbor Classifiers

Basic idea:
– If it walks like a duck, quacks like a duck, then
it’s probably a duck

[Figure: compute the distance from the test record to the training records, then choose the k "nearest" records]

2/10/2021 Introduction to Data Mining, 2nd Edition 2


Nearest-Neighbor Classifiers
Requires the following:
– A set of labeled records
– A proximity metric to compute the distance/similarity between a pair of records (e.g., Euclidean distance)
– The value of k, the number of nearest neighbors to retrieve
– A method for using the class labels of the k nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)

2/10/2021 Introduction to Data Mining, 2nd Edition 3


How to Determine the class label of a Test Sample?

Take the majority vote of class labels among the k nearest neighbors

Weight the vote according to distance
– weight factor, w = 1/d²

2/10/2021 Introduction to Data Mining, 2nd Edition 4
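A minimal sketch, not from the slides, of k-nearest-neighbor classification with distance-weighted voting (weight w = 1/d²); the tiny training set is made up for illustration:

import math
from collections import defaultdict

def knn_predict(train, test_point, k=3):
    # train: list of (feature_vector, class_label) pairs
    nearest = sorted((math.dist(x, test_point), y) for x, y in train)[:k]
    votes = defaultdict(float)
    for d, y in nearest:
        votes[y] += 1.0 / (d * d + 1e-12)   # small constant guards against d = 0
    return max(votes, key=votes.get)

train = [((1.0, 1.0), "+"), ((1.2, 0.8), "+"), ((3.0, 3.2), "-"), ((2.9, 3.1), "-")]
print(knn_predict(train, (1.1, 1.0), k=3))   # "+"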


Choice of proximity measure matters

For documents, cosine is better than correlation or Euclidean distance

Pair 1: 111111111110 vs 011111111111
Pair 2: 000000000001 vs 100000000000

The Euclidean distance is 1.4142 for both pairs, but the cosine similarity measure has different values for these pairs (about 0.91 for the first pair and 0 for the second).

2/10/2021 Introduction to Data Mining, 2nd Edition 5
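A minimal sketch, not from the slides, that reproduces the comparison above: the two pairs have the same Euclidean distance but very different cosine similarity:

import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

pair1 = ([1] * 11 + [0], [0] + [1] * 11)   # 111111111110 vs 011111111111
pair2 = ([0] * 11 + [1], [1] + [0] * 11)   # 000000000001 vs 100000000000

print(euclidean(*pair1), cosine(*pair1))   # 1.4142..., 0.909...
print(euclidean(*pair2), cosine(*pair2))   # 1.4142..., 0.0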


Nearest Neighbor Classification…

Data preprocessing is often required
– Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes
◆ Example:
  – height of a person may vary from 1.5m to 1.8m
  – weight of a person may vary from 90lb to 300lb
  – income of a person may vary from $10K to $1M
– Time series are often standardized to have zero mean and a standard deviation of 1

2/10/2021 Introduction to Data Mining, 2nd Edition 6


Nearest Neighbor Classification…

Choosing the value of k:


– If k is too small, sensitive to noise points
– If k is too large, neighborhood may include points from
other classes

2/10/2021 Introduction to Data Mining, 2nd Edition 7


Nearest-neighbor classifiers

Nearest neighbor classifiers are local classifiers

They can produce decision boundaries of arbitrary shapes.

The 1-NN decision boundary is a Voronoi diagram.

2/10/2021 Introduction to Data Mining, 2nd Edition 8


Nearest Neighbor Classification…

How to handle missing values in training and


test sets?
– Proximity computations normally require the
presence of all attributes
– Some approaches use the subset of attributes
present in two instances
◆ This may not produce good results since it
effectively uses different proximity measures for
each pair of instances
◆ Thus, proximities are not comparable

2/10/2021 Introduction to Data Mining, 2nd Edition 9


K-NN Classifiers…
Handling Irrelevant and Redundant Attributes

– Irrelevant attributes add noise to the proximity measure


– Redundant attributes bias the proximity measure towards certain
attributes

2/10/2021 Introduction to Data Mining, 2nd Edition 10


K-NN Classifiers: Handling attributes that are interacting

2/10/2021 Introduction to Data Mining, 2nd Edition 11


Handling attributes that are interacting

2/10/2021 Introduction to Data Mining, 2nd Edition 12


Improving KNN Efficiency

Avoid having to compute distance to all objects in


the training set
– Multi-dimensional access methods (k-d trees)
– Fast approximate similarity search
– Locality Sensitive Hashing (LSH)
Condensing
– Determine a smaller set of objects that give
the same performance
Editing
– Remove objects to improve efficiency
2/10/2021 Introduction to Data Mining, 2nd Edition 13
Data Mining
Classification: Alternative Techniques

Bayesian Classifiers

Introduction to Data Mining, 2nd Edition


1

by
Tan, Steinbach, Karpatne, Kumar
Bayes Classifier

• A probabilistic framework for solving classification


problems
• Conditional Probability:
  P(Y | X) = P(X, Y) / P(X)
  P(X | Y) = P(X, Y) / P(Y)

• Bayes theorem:
  P(Y | X) = P(X | Y) P(Y) / P(X)

2/08/2021 Introduction to Data Mining, 2nd Edition 2


Using Bayes Theorem for Classification

• Consider each attribute and class label as random variables

• Given a record with attributes (X1, X2, …, Xd), the goal is to predict class Y
  – Specifically, we want to find the value of Y that maximizes P(Y | X1, X2, …, Xd)

• Can we estimate P(Y | X1, X2, …, Xd) directly from data?

Tid Refund Marital Status Taxable Income Evade
1 Yes Single 125K No
2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes

2/08/2021 Introduction to Data Mining, 2nd Edition 3


Using Bayes Theorem for Classification

• Approach:
  – compute posterior probability P(Y | X1, X2, …, Xd) using the Bayes theorem

    P(Y | X1 X2 … Xd) = P(X1 X2 … Xd | Y) P(Y) / P(X1 X2 … Xd)

– Maximum a-posteriori: Choose Y that maximizes


P(Y | X1, X2, …, Xd)

– Equivalent to choosing value of Y that maximizes


P(X1, X2, …, Xd|Y) P(Y)

• How to estimate P(X1, X2, …, Xd | Y )?


2/08/2021 Introduction to Data Mining, 2nd Edition 4
Example Data

Given a Test Record:
X = (Refund = No, Divorced, Income = 120K)

• We need to estimate P(Evade = Yes | X) and P(Evade = No | X)

• In the following we will replace Evade = Yes by Yes, and Evade = No by No

Tid Refund Marital Status Taxable Income Evade
1 Yes Single 125K No
2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes

2/08/2021 Introduction to Data Mining, 2nd Edition 5




Conditional Independence

• X and Y are conditionally independent given Z if


P(X|YZ) = P(X|Z)

• Example: Arm length and reading skills


– Young child has shorter arm length and
limited reading skills, compared to adults
– If age is fixed, no apparent relationship
between arm length and reading skills
– Arm length and reading skills are conditionally
independent given age

2/08/2021 Introduction to Data Mining, 2nd Edition 7


Naïve Bayes Classifier

• Assume independence among attributes Xi when class is given:
  – P(X1, X2, …, Xd | Yj) = P(X1 | Yj) P(X2 | Yj) … P(Xd | Yj)

  – Now we can estimate P(Xi | Yj) for all Xi and Yj combinations from the training data

  – New point is classified to Yj if P(Yj) Π P(Xi | Yj) is maximal.

2/08/2021 Introduction to Data Mining, 2nd Edition 8


Naïve Bayes on Example Data

Given a Test Record:
X = (Refund = No, Divorced, Income = 120K)

P(X | Yes) = P(Refund = No | Yes) x P(Divorced | Yes) x P(Income = 120K | Yes)

P(X | No) = P(Refund = No | No) x P(Divorced | No) x P(Income = 120K | No)

(Estimated from the training data table shown earlier.)

2/08/2021 Introduction to Data Mining, 2nd Edition 9


Estimate Probabilities from Data

• P(y) = fraction of instances of class y
  – e.g., P(No) = 7/10, P(Yes) = 3/10

• For categorical attributes:
  P(Xi = c | y) = nc / n
  – where nc is the number of instances having attribute value Xi = c and belonging to class y, and n is the number of instances of class y
  – Examples (from the training data table shown earlier):
    P(Status=Married | No) = 4/7
    P(Refund=Yes | Yes) = 0

2/08/2021 Introduction to Data Mining, 2nd Edition 10


Estimate Probabilities from Data

• For continuous attributes:


– Discretization: Partition the range into bins:
◆ Replace continuous value with bin value
– Attribute changed from continuous to ordinal

– Probability density estimation:


◆ Assume attribute follows a normal distribution
◆ Use data to estimate parameters of distribution
(e.g., mean and standard deviation)
◆ Once probability distribution is known, use it to
estimate the conditional probability P(Xi|Y)

2/08/2021 Introduction to Data Mining, 2nd Edition 11


Estimate Probabilities from Data

• Normal distribution:

  P(Xi | Yj) = 1 / sqrt(2 π σij²) × exp( −(Xi − μij)² / (2 σij²) )

  – One for each (Xi, Yj) pair

• For (Income, Class=No) in the training data shown earlier:
  – If Class=No
    ◆ sample mean = 110
    ◆ sample variance = 2975

  P(Income = 120 | No) = 1 / (sqrt(2 π) × 54.54) × exp( −(120 − 110)² / (2 × 2975) ) = 0.0072
2/08/2021 Introduction to Data Mining, 2nd Edition 12
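A minimal sketch, not from the slides, that checks the normal-density estimate P(Income = 120 | No) = 0.0072 using the sample mean and variance above:

import math

def normal_pdf(x, mean, variance):
    return math.exp(-(x - mean) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)

print(normal_pdf(120, 110, 2975))   # ~0.0072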
Example of Naïve Bayes Classifier

Given a Test Record:
X = (Refund = No, Divorced, Income = 120K)

Naïve Bayes Classifier:
P(Refund = Yes | No) = 3/7
P(Refund = No | No) = 4/7
P(Refund = Yes | Yes) = 0
P(Refund = No | Yes) = 1
P(Marital Status = Single | No) = 2/7
P(Marital Status = Divorced | No) = 1/7
P(Marital Status = Married | No) = 4/7
P(Marital Status = Single | Yes) = 2/3
P(Marital Status = Divorced | Yes) = 1/3
P(Marital Status = Married | Yes) = 0

For Taxable Income:
If class = No: sample mean = 110, sample variance = 2975
If class = Yes: sample mean = 90, sample variance = 25

• P(X | No) = P(Refund=No | No) x P(Divorced | No) x P(Income=120K | No)
            = 4/7 x 1/7 x 0.0072 = 0.0006

• P(X | Yes) = P(Refund=No | Yes) x P(Divorced | Yes) x P(Income=120K | Yes)
             = 1 x 1/3 x 1.2 x 10^-9 = 4 x 10^-10

Since P(X|No)P(No) > P(X|Yes)P(Yes), therefore P(No|X) > P(Yes|X)
=> Class = No

2/08/2021 Introduction to Data Mining, 2nd Edition 13
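A minimal sketch, not from the slides, that plugs the estimated conditional probabilities above into the final naive Bayes comparison for X = (Refund = No, Divorced, Income = 120K):

p_no, p_yes = 7 / 10, 3 / 10                  # class priors
px_no = (4 / 7) * (1 / 7) * 0.0072            # P(X | No)  ~ 0.0006
px_yes = 1 * (1 / 3) * 1.2e-9                 # P(X | Yes) ~ 4e-10

print(px_no * p_no, px_yes * p_yes)           # ~0.0004 vs ~1.2e-10
print("No" if px_no * p_no > px_yes * p_yes else "Yes")   # predicted class: No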


Naïve Bayes Classifier can make decisions with partial information about attributes in the test record

Even in absence of information about any attributes, we can use the a priori probabilities of the class variable:
P(Yes) = 3/10
P(No) = 7/10

If we only know that marital status is Divorced, then:
P(Yes | Divorced) = 1/3 x 3/10 / P(Divorced)
P(No | Divorced) = 1/7 x 7/10 / P(Divorced)

If we also know that Refund = No, then:
P(Yes | Refund = No, Divorced) = 1 x 1/3 x 3/10 / P(Divorced, Refund = No)
P(No | Refund = No, Divorced) = 4/7 x 1/7 x 7/10 / P(Divorced, Refund = No)

If we also know that Taxable Income = 120, then:
P(Yes | Refund = No, Divorced, Income = 120) = 1.2 x 10^-9 x 1 x 1/3 x 3/10 / P(Divorced, Refund = No, Income = 120)
P(No | Refund = No, Divorced, Income = 120) = 0.0072 x 4/7 x 1/7 x 7/10 / P(Divorced, Refund = No, Income = 120)

(Conditional probability estimates as on the previous slide.)
2/08/2021 Introduction to Data Mining, 2nd Edition 14
Issues with Naïve Bayes Classifier

Given a Test Record:
X = (Married)

P(Yes) = 3/10
P(No) = 7/10

P(Yes | Married) = 0 x 3/10 / P(Married)
P(No | Married) = 4/7 x 7/10 / P(Married)

(Uses P(Marital Status = Married | Yes) = 0 and P(Marital Status = Married | No) = 4/7 from the estimates on the previous slides.)

2/08/2021 Introduction to Data Mining, 2nd Edition 15


Issues with Naïve Bayes Classifier

Consider the table with Tid = 7 deleted

Naïve Bayes Classifier:
P(Refund = Yes | No) = 2/6
P(Refund = No | No) = 4/6
P(Refund = Yes | Yes) = 0
P(Refund = No | Yes) = 1
P(Marital Status = Single | No) = 2/6
P(Marital Status = Divorced | No) = 0
P(Marital Status = Married | No) = 4/6
P(Marital Status = Single | Yes) = 2/3
P(Marital Status = Divorced | Yes) = 1/3
P(Marital Status = Married | Yes) = 0/3

For Taxable Income:
If class = No: sample mean = 91, sample variance = 685
If class = Yes: sample mean = 90, sample variance = 25

Given X = (Refund = Yes, Divorced, 120K)

P(X | No) = 2/6 x 0 x 0.0083 = 0
P(X | Yes) = 0 x 1/3 x 1.2 x 10^-9 = 0

Naïve Bayes will not be able to classify X as Yes or No!

2/08/2021 Introduction to Data Mining, 2nd Edition 16


Issues with Naïve Bayes Classifier

• If one of the conditional probabilities is zero, then the entire expression becomes zero
• Need to use other estimates of conditional probabilities than simple fractions

• Probability estimation:

  original:          P(Xi = c | y) = nc / n
  Laplace Estimate:  P(Xi = c | y) = (nc + 1) / (n + v)
  m-estimate:        P(Xi = c | y) = (nc + m p) / (n + m)

  n: number of training instances belonging to class y
  nc: number of instances with Xi = c and Y = y
  v: total number of attribute values that Xi can take
  p: initial estimate of P(Xi = c | y), known a priori
  m: hyper-parameter for our confidence in p

2/08/2021 Introduction to Data Mining, 2nd Edition 17
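A minimal sketch, not from the slides, of the Laplace and m-estimate corrections above, applied to P(Marital Status = Divorced | No) from the Tid-7-deleted example (nc = 0, n = 6, v = 3; the values of p and m in the call are chosen only for illustration):

def laplace_estimate(nc, n, v):
    # nc: count of class-y instances with Xi = c; n: class size; v: number of values of Xi
    return (nc + 1) / (n + v)

def m_estimate(nc, n, p, m):
    # p: prior estimate of P(Xi = c | y); m: confidence placed in that prior
    return (nc + m * p) / (n + m)

print(laplace_estimate(0, 6, 3))        # 1/9 instead of 0
print(m_estimate(0, 6, p=1/3, m=3))     # 1/9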


Example of Naïve Bayes Classifier

A: attributes, M: mammals, N: non-mammals

Name Give Birth Can Fly Live in Water Have Legs Class
human yes no no yes mammals
python no no no no non-mammals
salmon no no yes no non-mammals
whale yes no yes no mammals
frog no no sometimes yes non-mammals
komodo no no no yes non-mammals
bat yes yes no yes mammals
pigeon no yes no yes non-mammals
cat yes no no yes mammals
leopard shark yes no yes no non-mammals
turtle no no sometimes yes non-mammals
penguin no no sometimes yes non-mammals
porcupine yes no no yes mammals
eel no no yes no non-mammals
salamander no no sometimes yes non-mammals
gila monster no no no yes non-mammals
platypus no no no yes mammals
owl no yes no yes non-mammals
dolphin yes no yes no mammals
eagle no yes no yes non-mammals

Test record:
Give Birth Can Fly Live in Water Have Legs Class
yes no yes no ?

P(A | M) = 6/7 x 6/7 x 2/7 x 2/7 = 0.06
P(A | N) = 1/13 x 10/13 x 3/13 x 4/13 = 0.0042

P(A | M) P(M) = 0.06 x 7/20 = 0.021
P(A | N) P(N) = 0.004 x 13/20 = 0.0027

P(A|M)P(M) > P(A|N)P(N) => Mammals

2/08/2021 Introduction to Data Mining, 2nd Edition 18


Naïve Bayes (Summary)

• Robust to isolated noise points

• Handle missing values by ignoring the instance


during probability estimate calculations

• Robust to irrelevant attributes

• Redundant and correlated attributes will violate


class conditional assumption
– Use other techniques such as Bayesian Belief Networks (BBN)

2/08/2021 Introduction to Data Mining, 2nd Edition 19


Naïve Bayes

• How does Naïve Bayes perform on the following dataset?

Conditional independence of attributes is violated

2/08/2021 Introduction to Data Mining, 2nd Edition 20


Bayesian Belief Networks

• Provides graphical representation of probabilistic


relationships among a set of random variables
• Consists of:
  – A directed acyclic graph (DAG)
    ◆ Node corresponds to a variable
    ◆ Arc corresponds to a dependence relationship between a pair of variables
  – A probability table associating each node with its immediate parents

[Figure: a small DAG over nodes A, B and C]

2/08/2021 Introduction to Data Mining, 2nd Edition 21


Conditional Independence

[Figure: a DAG in which D is the parent of C, and A and B are children of C]

D is parent of C
A is child of C
B is descendant of D
D is ancestor of A

• A node in a Bayesian network is conditionally


independent of all of its nondescendants, if its
parents are known
2/08/2021 Introduction to Data Mining, 2nd Edition 22
Conditional Independence

• Naïve Bayes assumption:

[Figure: the class node Y is the parent of every attribute node X1, X2, X3, X4, …, Xd]

2/08/2021 Introduction to Data Mining, 2nd Edition 23


Probability Tables

• If X does not have any parents, its table contains the prior probability P(X)

• If X has only one parent (Y), its table contains the conditional probability P(X|Y)

• If X has multiple parents (Y1, Y2, …, Yk), its table contains the conditional probability P(X|Y1, Y2, …, Yk)

2/08/2021 Introduction to Data Mining, 2nd Edition 24


Example of Bayesian Belief Network

[Figure: BBN with Exercise and Diet as parents of Heart Disease, and Heart Disease as parent of Chest Pain and Blood Pressure]

P(Exercise=Yes) = 0.7    P(Diet=Healthy) = 0.25
P(Exercise=No) = 0.3     P(Diet=Unhealthy) = 0.75

Heart Disease:
          D=Healthy  D=Healthy  D=Unhealthy  D=Unhealthy
          E=Yes      E=No       E=Yes        E=No
HD=Yes    0.25       0.45       0.55         0.75
HD=No     0.75       0.55       0.45         0.25

Chest Pain:                 Blood Pressure:
          HD=Yes  HD=No                HD=Yes  HD=No
CP=Yes    0.8     0.01       BP=High   0.85    0.2
CP=No     0.2     0.99       BP=Low    0.15    0.8

2/08/2021 Introduction to Data Mining, 2nd Edition 25


Example of Inferencing using BBN

• Given: X = (E=No, D=Yes, CP=Yes, BP=High)
  – Compute P(HD | E, D, CP, BP)?

• P(HD=Yes | E=No, D=Yes) = 0.55
  P(CP=Yes | HD=Yes) = 0.8
  P(BP=High | HD=Yes) = 0.85
  – P(HD=Yes | E=No, D=Yes, CP=Yes, BP=High)
    ∝ 0.55 × 0.8 × 0.85 = 0.374

• P(HD=No | E=No, D=Yes) = 0.45
  P(CP=Yes | HD=No) = 0.01
  P(BP=High | HD=No) = 0.2
  – P(HD=No | E=No, D=Yes, CP=Yes, BP=High)
    ∝ 0.45 × 0.01 × 0.2 = 0.0009

=> Classify X as Yes

2/08/2021 Introduction to Data Mining, 2nd Edition 26
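A minimal sketch, not from the slides, that reproduces the unnormalized comparison above using the conditional probabilities quoted on this slide:

p_hd_given_e_d = {"Yes": 0.55, "No": 0.45}       # P(HD | E=No, D=Yes)
p_cp_yes_given_hd = {"Yes": 0.8, "No": 0.01}     # P(CP=Yes | HD)
p_bp_high_given_hd = {"Yes": 0.85, "No": 0.2}    # P(BP=High | HD)

scores = {hd: p_hd_given_e_d[hd] * p_cp_yes_given_hd[hd] * p_bp_high_given_hd[hd]
          for hd in ("Yes", "No")}
print(scores)                        # {'Yes': 0.374, 'No': 0.0009}
print(max(scores, key=scores.get))   # classify X as HD = Yes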


Data Mining

Support Vector Machines

Introduction to Data Mining, 2nd Edition


by
Tan, Steinbach, Karpatne, Kumar

10/11/2021 Introduction to Data Mining, 2nd Edition 1


Support Vector Machines

• Find a linear hyperplane (decision boundary) that will separate the data
10/11/2021 Introduction to Data Mining, 2nd Edition 2
Support Vector Machines

B1

• One Possible Solution


10/11/2021 Introduction to Data Mining, 2nd Edition 3
Support Vector Machines

B2

• Another possible solution


10/11/2021 Introduction to Data Mining, 2nd Edition 4
Support Vector Machines

B2

• Other possible solutions


10/11/2021 Introduction to Data Mining, 2nd Edition 5
Support Vector Machines

B1

B2

• Which one is better? B1 or B2?


• How do you define better?
10/11/2021 Introduction to Data Mining, 2nd Edition 6
Support Vector Machines

[Figure: two separating hyperplanes B1 and B2, with margin boundaries b11, b12 and b21, b22; B1 has the larger margin]

• Find the hyperplane that maximizes the margin => B1 is better than B2


10/11/2021 Introduction to Data Mining, 2nd Edition 7
Support Vector Machines

[Figure: hyperplane B1, given by w · x + b = 0, with margin boundaries w · x + b = +1 (b11) and w · x + b = −1 (b12)]

f(x) =  1 if w · x + b ≥ 1
       −1 if w · x + b ≤ −1

Margin = 2 / ||w||
10/11/2021 Introduction to Data Mining, 2nd Edition 8
Linear SVM

• Linear model:

    f(x) =  1 if w · x + b ≥ 1
           −1 if w · x + b ≤ −1

• Learning the model is equivalent to determining the values of w and b

  – How to find w and b from training data?

10/11/2021 Introduction to Data Mining, 2nd Edition 9


Learning Linear SVM
• Objective is to maximize:  Margin = 2 / ||w||

  – Which is equivalent to minimizing:  L(w) = ||w||² / 2

  – Subject to the following constraints:
      w · xi + b ≥ 1   if yi = 1
      w · xi + b ≤ −1  if yi = −1
    or
      yi (w · xi + b) ≥ 1,  i = 1, 2, …, N

◆ This is a constrained optimization problem


– Solve it using Lagrange multiplier method

10/11/2021 Introduction to Data Mining, 2nd Edition 10


Example of Linear SVM

Support vectors

x1 x2 y λ (Lagrange multiplier)
0.3858 0.4687 1 65.5261
0.4871 0.611 -1 65.5261
0.9218 0.4103 -1 0
0.7382 0.8936 -1 0
0.1763 0.0579 1 0
0.4057 0.3529 1 0
0.9355 0.8132 -1 0
0.2146 0.0099 1 0

10/11/2021 Introduction to Data Mining, 2nd Edition 11


Learning Linear SVM

• Decision boundary depends only on support


vectors
– If you have a data set with the same support vectors, the decision boundary will not change

– How to classify using SVM once w and b are found? Given a test record xi:

    f(xi) =  1 if w · xi + b ≥ 1
            −1 if w · xi + b ≤ −1

10/11/2021 Introduction to Data Mining, 2nd Edition 12


Support Vector Machines

• What if the problem is not linearly separable?

10/11/2021 Introduction to Data Mining, 2nd Edition 13


Support Vector Machines

• What if the problem is not linearly separable?


– Introduce slack variables ξi
  ◆ Need to minimize:
      L(w) = ||w||² / 2 + C ( Σ i=1..N ξi^k )
  ◆ Subject to:
      w · xi + b ≥ 1 − ξi   if yi = 1
      w · xi + b ≤ −1 + ξi  if yi = −1
  ◆ If k is 1 or 2, this leads to a similar objective function as linear SVM but with different constraints (see textbook)

10/11/2021 Introduction to Data Mining, 2nd Edition 14


Support Vector Machines

[Figure: hyperplanes B1 and B2 with margin boundaries b11, b12, b21, b22 and their margins]

• Find the hyperplane that optimizes both factors


10/11/2021 Introduction to Data Mining, 2nd Edition 15
Nonlinear Support Vector Machines

• What if decision boundary is not linear?

10/11/2021 Introduction to Data Mining, 2nd Edition 16


Nonlinear Support Vector Machines

• Transform data into higher dimensional space

Decision boundary:
w · Φ(x) + b = 0
10/11/2021 Introduction to Data Mining, 2nd Edition 17
Learning Nonlinear SVM

• Optimization problem:

• Which leads to the same set of equations (but


involving Φ(x) instead of x)

10/11/2021 Introduction to Data Mining, 2nd Edition 18


Learning NonLinear SVM

• Issues:
– What type of mapping function Φ should be
used?
– How to do the computation in high
dimensional space?
◆ Most computations involve the dot product Φ(xi) · Φ(xj)
◆ Curse of dimensionality?

10/11/2021 Introduction to Data Mining, 2nd Edition 19


Learning Nonlinear SVM

• Kernel Trick:
  – Φ(xi) · Φ(xj) = K(xi, xj)
  – K(xi, xj) is a kernel function (expressed in terms of the coordinates in the original space)
    ◆ Examples: polynomial kernel K(x, y) = (x · y + 1)^p, Gaussian (RBF) kernel K(x, y) = exp(−||x − y||² / (2σ²))

10/11/2021 Introduction to Data Mining, 2nd Edition 20
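A minimal sketch, not from the slides, illustrating the kernel trick for the degree-2 polynomial kernel K(x, y) = (x · y)²; the explicit mapping phi used here is one standard choice for 2-D inputs, and the input vectors are made up for illustration:

import math

def phi(x):
    # explicit degree-2 feature map for 2-D input: (x1^2, x2^2, sqrt(2)*x1*x2)
    x1, x2 = x
    return (x1 * x1, x2 * x2, math.sqrt(2) * x1 * x2)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

x, y = (1.0, 2.0), (3.0, -1.0)
print(dot(phi(x), phi(y)))   # 1.0  (dot product in the transformed space)
print(dot(x, y) ** 2)        # 1.0  (same value, computed in the original space)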


Example of Nonlinear SVM

SVM with polynomial


degree 2 kernel

10/11/2021 Introduction to Data Mining, 2nd Edition 21


Learning Nonlinear SVM

• Advantages of using kernel:


– Don’t have to know the mapping function 
– Computing dot product (xi)• (xj) in the
original space avoids curse of dimensionality

• Not all functions can be kernels


– Must make sure there is a corresponding Φ in
some high-dimensional space
– Mercer’s theorem (see textbook)

10/11/2021 Introduction to Data Mining, 2nd Edition 22


Characteristics of SVM

• The learning problem is formulated as a convex optimization problem


– Efficient algorithms are available to find the global minima
– Many of the other methods use greedy approaches and find locally
optimal solutions
– High computational complexity for building the model

• Robust to noise
• Overfitting is handled by maximizing the margin of the decision boundary
• SVM can handle irrelevant and redundant attributes better than many
other techniques
• The user needs to provide the type of kernel function and cost function
• Difficult to handle missing values

• What about categorical variables?

10/11/2021 Introduction to Data Mining, 2nd Edition 23


Data Mining

Ensemble Techniques

Introduction to Data Mining, 2nd Edition


by
Tan, Steinbach, Karpatne, Kumar

10/11/2021 Introduction to Data Mining, 2nd Edition 1


Ensemble Methods

Construct a set of base classifiers learned from


the training data

Predict class label of test records by combining


the predictions made by multiple classifiers (e.g.,
by taking majority vote)

10/11/2021 Introduction to Data Mining, 2nd Edition 2


Example: Why Do Ensemble Methods Work?

10/11/2021 Introduction to Data Mining, 2nd Edition 3


Necessary Conditions for Ensemble Methods

Ensemble Methods work better than a single base classifier if:


1. All base classifiers are independent of each other
2. All base classifiers perform better than random guessing
(error rate < 0.5 for binary classification)

Classification error for an


ensemble of 25 base classifiers,
assuming their errors are
uncorrelated.

10/11/2021 Introduction to Data Mining, 2nd Edition 4
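A minimal sketch, not from the slides, that computes the majority-vote error of an ensemble of 25 independent base classifiers as a function of their common error rate, matching the setting described above:

from math import comb

def ensemble_error(eps, n=25):
    # the ensemble errs when a majority (>= 13 of 25) of the base classifiers err
    return sum(comb(n, i) * eps**i * (1 - eps)**(n - i) for i in range(n // 2 + 1, n + 1))

print(ensemble_error(0.35))   # ~0.06, far below the base error rate of 0.35
print(ensemble_error(0.5))    # 0.5, no gain once base classifiers are random guessers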


Rationale for Ensemble Learning

Ensemble Methods work best with unstable


base classifiers
– Classifiers that are sensitive to minor perturbations in
training set, due to high model complexity
– Examples: Unpruned decision trees, ANNs, …

10/11/2021 Introduction to Data Mining, 2nd Edition 5


Bias-Variance Decomposition

Analogous problem of reaching a target y by firing


projectiles from x (regression problem)

For classification, the generalization error of model 𝑚 can


be given by:

𝑔𝑒𝑛. 𝑒𝑟𝑟𝑜𝑟 𝑚 = 𝑐1 + 𝑏𝑖𝑎𝑠 𝑚 + 𝑐2 × 𝑣𝑎𝑟𝑖𝑎𝑛𝑐𝑒(𝑚)


10/11/2021 Introduction to Data Mining, 2nd Edition 6
Bias-Variance Trade-off and Overfitting

Overfitting

Underfitting

Ensemble methods try to reduce the variance of complex


models (with low bias) by aggregating responses of
multiple base classifiers
10/11/2021 Introduction to Data Mining, 2nd Edition 7
General Approach of Ensemble Learning

Using majority vote or


weighted majority vote
(weighted according to their
accuracy or relevance)

10/11/2021 Introduction to Data Mining, 2nd Edition 8


Constructing Ensemble Classifiers

By manipulating training set


– Example: bagging, boosting, random forests

By manipulating input features


– Example: random forests

By manipulating class labels


– Example: error-correcting output coding

By manipulating learning algorithm


– Example: injecting randomness in the initial weights of ANN

10/11/2021 Introduction to Data Mining, 2nd Edition 9


Bagging (Bootstrap AGGregatING)

Bootstrap sampling: sampling with replacement

Original Data 1 2 3 4 5 6 7 8 9 10
Bagging (Round 1) 7 8 10 8 2 5 10 10 5 9
Bagging (Round 2) 1 4 9 1 2 3 2 7 3 2
Bagging (Round 3) 1 8 5 10 5 5 9 6 3 7

Build classifier on each bootstrap sample

Probability of a training instance being selected in


a bootstrap sample is:
➢ 1 − (1 − 1/n)^n (n: number of training instances)
➢ ~0.632 when n is large
10/11/2021 Introduction to Data Mining, 2nd Edition 10
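A minimal sketch, not from the slides, evaluating the bootstrap selection probability above for several training-set sizes n:

for n in (10, 100, 1000, 100000):
    # probability that a given instance appears at least once in a bootstrap sample of size n
    print(n, 1 - (1 - 1 / n) ** n)   # approaches 1 - 1/e ~ 0.632 as n grows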
Bagging Algorithm

10/11/2021 Introduction to Data Mining, 2nd Edition 11


Bagging Example

Consider 1-dimensional data set:


Original Data:
x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
y 1 1 1 -1 -1 -1 -1 1 1 1

Classifier is a decision stump (decision tree of size 1)


– Decision rule: x ≤ k versus x > k
– Split point k is chosen based on entropy

        x ≤ k
   True      False
  y_left     y_right
10/11/2021 Introduction to Data Mining, 2nd Edition 12
Bagging Example

Bagging Round 1:
x 0.1 0.2 0.2 0.3 0.4 0.4 0.5 0.6 0.9 0.9     x <= 0.35 → y = 1
y 1 1 1 1 -1 -1 -1 -1 1 1                     x > 0.35 → y = -1

Bagging Round 2:
x 0.1 0.2 0.3 0.4 0.5 0.5 0.9 1 1 1
y 1 1 1 -1 -1 -1 1 1 1 1

Bagging Round 3:
x 0.1 0.2 0.3 0.4 0.4 0.5 0.7 0.7 0.8 0.9
y 1 1 1 -1 -1 -1 -1 -1 1 1

Bagging Round 4:
x 0.1 0.1 0.2 0.4 0.4 0.5 0.5 0.7 0.8 0.9
y 1 1 1 -1 -1 -1 -1 -1 1 1

Bagging Round 5:
x 0.1 0.1 0.2 0.5 0.6 0.6 0.6 1 1 1
y 1 1 1 -1 -1 -1 -1 1 1 1

10/11/2021 Introduction to Data Mining, 2nd Edition 13


Bagging Example

Bagging Round 1:
x 0.1 0.2 0.2 0.3 0.4 0.4 0.5 0.6 0.9 0.9     x <= 0.35 → y = 1
y 1 1 1 1 -1 -1 -1 -1 1 1                     x > 0.35 → y = -1

Bagging Round 2:
x 0.1 0.2 0.3 0.4 0.5 0.5 0.9 1 1 1           x <= 0.7 → y = 1
y 1 1 1 -1 -1 -1 1 1 1 1                      x > 0.7 → y = 1

Bagging Round 3:
x 0.1 0.2 0.3 0.4 0.4 0.5 0.7 0.7 0.8 0.9     x <= 0.35 → y = 1
y 1 1 1 -1 -1 -1 -1 -1 1 1                    x > 0.35 → y = -1

Bagging Round 4:
x 0.1 0.1 0.2 0.4 0.4 0.5 0.5 0.7 0.8 0.9     x <= 0.3 → y = 1
y 1 1 1 -1 -1 -1 -1 -1 1 1                    x > 0.3 → y = -1

Bagging Round 5:
x 0.1 0.1 0.2 0.5 0.6 0.6 0.6 1 1 1           x <= 0.35 → y = 1
y 1 1 1 -1 -1 -1 -1 1 1 1                     x > 0.35 → y = -1

10/11/2021 Introduction to Data Mining, 2nd Edition 14


Bagging Example

Bagging Round 6:
x 0.2 0.4 0.5 0.6 0.7 0.7 0.7 0.8 0.9 1       x <= 0.75 → y = -1
y 1 -1 -1 -1 -1 -1 -1 1 1 1                   x > 0.75 → y = 1

Bagging Round 7:
x 0.1 0.4 0.4 0.6 0.7 0.8 0.9 0.9 0.9 1       x <= 0.75 → y = -1
y 1 -1 -1 -1 -1 1 1 1 1 1                     x > 0.75 → y = 1

Bagging Round 8:
x 0.1 0.2 0.5 0.5 0.5 0.7 0.7 0.8 0.9 1       x <= 0.75 → y = -1
y 1 1 -1 -1 -1 -1 -1 1 1 1                    x > 0.75 → y = 1

Bagging Round 9:
x 0.1 0.3 0.4 0.4 0.6 0.7 0.7 0.8 1 1         x <= 0.75 → y = -1
y 1 1 -1 -1 -1 -1 -1 1 1 1                    x > 0.75 → y = 1

Bagging Round 10:
x 0.1 0.1 0.1 0.1 0.3 0.3 0.8 0.8 0.9 0.9     x <= 0.05 → y = 1
y 1 1 1 1 1 1 1 1 1 1                         x > 0.05 → y = 1

10/11/2021 Introduction to Data Mining, 2nd Edition 15


Bagging Example

Summary of Trained Decision Stumps:

Round Split Point Left Class Right Class


1 0.35 1 -1
2 0.7 1 1
3 0.35 1 -1
4 0.3 1 -1
5 0.35 1 -1
6 0.75 -1 1
7 0.75 -1 1
8 0.75 -1 1
9 0.75 -1 1
10 0.05 1 1

10/11/2021 Introduction to Data Mining, 2nd Edition 16


Bagging Example
Use majority vote (sign of sum of predictions) to
determine class of ensemble classifier
Round x=0.1 x=0.2 x=0.3 x=0.4 x=0.5 x=0.6 x=0.7 x=0.8 x=0.9 x=1.0
1 1 1 1 -1 -1 -1 -1 -1 -1 -1
2 1 1 1 1 1 1 1 1 1 1
3 1 1 1 -1 -1 -1 -1 -1 -1 -1
4 1 1 1 -1 -1 -1 -1 -1 -1 -1
5 1 1 1 -1 -1 -1 -1 -1 -1 -1
6 -1 -1 -1 -1 -1 -1 -1 1 1 1
7 -1 -1 -1 -1 -1 -1 -1 1 1 1
8 -1 -1 -1 -1 -1 -1 -1 1 1 1
9 -1 -1 -1 -1 -1 -1 -1 1 1 1
10 1 1 1 1 1 1 1 1 1 1
Sum 2 2 2 -6 -6 -6 -6 2 2 2
Predicted Class (Sign) 1 1 1 -1 -1 -1 -1 1 1 1

Bagging can also increase the complexity (representation


capacity) of simple classifiers such as decision stumps
10/11/2021 Introduction to Data Mining, 2nd Edition 17
Boosting

An iterative procedure to adaptively change


distribution of training data by focusing more on
previously misclassified records
– Initially, all N records are assigned equal
weights (for being selected for training)
– Unlike bagging, weights may change at the
end of each boosting round

10/11/2021 Introduction to Data Mining, 2nd Edition 18


Boosting

Records that are wrongly classified will have their


weights increased in the next round
Records that are classified correctly will have
their weights decreased in the next round

Original Data 1 2 3 4 5 6 7 8 9 10
Boosting (Round 1) 7 3 2 8 7 9 4 10 6 3
Boosting (Round 2) 5 4 9 4 2 5 1 7 4 2
Boosting (Round 3) 4 4 8 10 4 5 4 6 3 4

• Example 4 is hard to classify


• Its weight is increased, therefore it is more
likely to be chosen again in subsequent rounds

10/11/2021 Introduction to Data Mining, 2nd Edition 19


AdaBoost

Base classifiers: C1, C2, …, CT

Error rate of a base classifier: the weighted fraction of training records that it misclassifies

Importance of a classifier:

αi = (1/2) × ln( (1 − εi) / εi )

10/11/2021 Introduction to Data Mining, 2nd Edition 20


AdaBoost Algorithm

Weight update:

If any intermediate rounds produce error rate


higher than 50%, the weights are reverted back
to 1/n and the resampling procedure is repeated
Classification:

10/11/2021 Introduction to Data Mining, 2nd Edition 21
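A hedged sketch, not from the slides, of the standard AdaBoost weight update referenced above: records misclassified in round i are up-weighted by exp(αi), correctly classified records are down-weighted by exp(−αi), and the weights are then renormalized. The function interface is assumed for illustration:

import math

def adaboost_weight_update(weights, correct, eps):
    # weights: current record weights; correct: per-record booleans for round i;
    # eps: weighted error rate of the round-i base classifier
    alpha = 0.5 * math.log((1 - eps) / eps)
    new_w = [w * math.exp(-alpha if c else alpha) for w, c in zip(weights, correct)]
    z = sum(new_w)                     # normalization constant
    return [w / z for w in new_w], alpha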


AdaBoost Algorithm

10/11/2021 Introduction to Data Mining, 2nd Edition 22


AdaBoost Example

Consider 1-dimensional data set:


Original Data:
x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
y 1 1 1 -1 -1 -1 -1 1 1 1

Classifier is a decision stump


– Decision rule: x ≤ k versus x > k
– Split point k is chosen based on entropy

        x ≤ k
   True      False
  y_left     y_right
10/11/2021 Introduction to Data Mining, 2nd Edition 23
AdaBoost Example

Training sets for the first 3 boosting rounds:


Boosting Round 1:
x 0.1 0.4 0.5 0.6 0.6 0.7 0.7 0.7 0.8 1
y 1 -1 -1 -1 -1 -1 -1 -1 1 1

Boosting Round 2:
x 0.1 0.1 0.2 0.2 0.2 0.2 0.3 0.3 0.3 0.3
y 1 1 1 1 1 1 1 1 1 1

Boosting Round 3:
x 0.2 0.2 0.4 0.4 0.4 0.4 0.5 0.6 0.6 0.7
y 1 1 -1 -1 -1 -1 -1 -1 -1 -1

Summary:
Round Split Point Left Class Right Class alpha
1 0.75 -1 1 1.738
2 0.05 1 1 2.7784
3 0.3 1 -1 4.1195
10/11/2021 Introduction to Data Mining, 2nd Edition 24
AdaBoost Example

Weights
Round x=0.1 x=0.2 x=0.3 x=0.4 x=0.5 x=0.6 x=0.7 x=0.8 x=0.9 x=1.0
1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
2 0.311 0.311 0.311 0.01 0.01 0.01 0.01 0.01 0.01 0.01
3 0.029 0.029 0.029 0.228 0.228 0.228 0.228 0.009 0.009 0.009

Classification
Round x=0.1 x=0.2 x=0.3 x=0.4 x=0.5 x=0.6 x=0.7 x=0.8 x=0.9 x=1.0
1 -1 -1 -1 -1 -1 -1 -1 1 1 1
2 1 1 1 1 1 1 1 1 1 1
3 1 1 1 -1 -1 -1 -1 -1 -1 -1
Sum 5.16 5.16 5.16 -3.08 -3.08 -3.08 -3.08 0.397 0.397 0.397
Predicted Class (Sign) 1 1 1 -1 -1 -1 -1 1 1 1

10/11/2021 Introduction to Data Mining, 2nd Edition 25


Random Forest Algorithm

Construct an ensemble of decision trees by


manipulating training set as well as features

– Use bootstrap sample to train every decision


tree (similar to Bagging)
– Use the following tree induction algorithm:
◆ At every internal node of decision tree, randomly
sample p attributes for selecting split criterion
◆ Repeat this procedure until all leaves are pure
(unpruned tree)

10/11/2021 Introduction to Data Mining, 2nd Edition 26


Characteristics of Random Forest

10/11/2021 Introduction to Data Mining, 2nd Edition 27


Gradient Boosting

Constructs a series of models


– Models can be any predictive model that has
a differentiable loss function
– Commonly, trees are the chosen model
◆ XGBoost (extreme gradient boosting) is a popular
package because of its impressive performance
Boosting can be viewed as optimizing the loss
function by iterative functional gradient descent.
Implementations of various boosted algorithms
are available in Python, R, Matlab, and more.

10/11/2021 Introduction to Data Mining, 2nd Edition 28


Data Mining
Classification: Alternative Techniques

Imbalanced Class Problem

Introduction to Data Mining, 2nd Edition


by
Tan, Steinbach, Karpatne, Kumar
Class Imbalance Problem

Lots of classification problems where the classes


are skewed (more records from one class than
another)
– Credit card fraud
– Intrusion detection
– Defective products in manufacturing assembly line
– COVID-19 test results on a random sample

Key Challenge:
– Evaluation measures such as accuracy are not well-
suited for imbalanced class

2/15/2021 Introduction to Data Mining, 2nd Edition 2


Confusion Matrix

Confusion Matrix:

PREDICTED CLASS

Class=Yes Class=No

Class=Yes a b
ACTUAL
CLASS Class=No c d

a: TP (true positive)
b: FN (false negative)
c: FP (false positive)
d: TN (true negative)

2/15/2021 Introduction to Data Mining, 2nd Edition 3


Accuracy

PREDICTED CLASS

Class=Yes Class=No

Class=Yes a b
ACTUAL (TP) (FN)
CLASS
Class=No c d
(FP) (TN)

Most widely-used metric:

a+d TP + TN
Accuracy = =
a + b + c + d TP + TN + FP + FN
2/15/2021 Introduction to Data Mining, 2nd Edition 4
Problem with Accuracy
Consider a 2-class problem
– Number of Class NO examples = 990
– Number of Class YES examples = 10
If a model predicts everything to be class NO, accuracy is
990/1000 = 99 %
– This is misleading because this trivial model does not detect any class
YES example
– Detecting the rare class is usually more interesting (e.g., frauds,
intrusions, defects, etc)

PREDICTED CLASS
Class=Yes Class=No

Class=Yes 0 10
ACTUAL
CLASS Class=No 0 990
2/15/2021 Introduction to Data Mining, 2nd Edition 5
Which model is better?

PREDICTED
Class=Yes Class=No
A ACTUAL Class=Yes 0 10
Class=No 0 990

Accuracy: 99%

PREDICTED
B Class=Yes Class=No
ACTUAL Class=Yes 10 0
Class=No 500 490

Accuracy: 50%
2/15/2021 Introduction to Data Mining, 2nd Edition 6
Which model is better?

PREDICTED
A Class=Yes Class=No
ACTUAL Class=Yes 5 5
Class=No 0 990

PREDICTED
B Class=Yes Class=No
ACTUAL Class=Yes 10 0
Class=No 500 490

2/15/2021 Introduction to Data Mining, 2nd Edition 7


Alternative Measures

PREDICTED CLASS
Class=Yes Class=No

Class=Yes a b
ACTUAL
CLASS Class=No c d

a
Precision (p) =
a+c
a
Recall (r) =
a+b
2rp 2a
F - measure (F) = =
r + p 2a + b + c
2/15/2021 Introduction to Data Mining, 2nd Edition 8
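A minimal sketch, not from the slides, computing precision, recall, F-measure and accuracy from the confusion-matrix counts a (TP), b (FN), c (FP), d (TN) defined above; the call uses the counts from the next slide's example:

def metrics(a, b, c, d):
    precision = a / (a + c)
    recall = a / (a + b)
    f_measure = 2 * a / (2 * a + b + c)
    accuracy = (a + d) / (a + b + c + d)
    return precision, recall, f_measure, accuracy

print(metrics(10, 0, 10, 980))   # (0.5, 1.0, 0.667, 0.99)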
Alternative Measures

PREDICTED CLASS
                    Class=Yes  Class=No
ACTUAL  Class=Yes      10         0
CLASS   Class=No       10        980

Precision (p) = 10 / (10 + 10) = 0.5
Recall (r) = 10 / (10 + 0) = 1
F-measure (F) = (2 × 1 × 0.5) / (1 + 0.5) = 0.67
Accuracy = 990 / 1000 = 0.99

2/15/2021 Introduction to Data Mining, 2nd Edition 9


Alternative Measures

PREDICTED CLASS
                    Class=Yes  Class=No
ACTUAL  Class=Yes      10         0
CLASS   Class=No       10        980

Precision (p) = 10 / (10 + 10) = 0.5
Recall (r) = 10 / (10 + 0) = 1
F-measure (F) = (2 × 1 × 0.5) / (1 + 0.5) = 0.67
Accuracy = 990 / 1000 = 0.99

PREDICTED CLASS
                    Class=Yes  Class=No
ACTUAL  Class=Yes       1          9
CLASS   Class=No        0         990

Precision (p) = 1 / (1 + 0) = 1
Recall (r) = 1 / (1 + 9) = 0.1
F-measure (F) = (2 × 0.1 × 1) / (1 + 0.1) = 0.18
Accuracy = 991 / 1000 = 0.991
2/15/2021 Introduction to Data Mining, 2nd Edition 10
Which of these classifiers is better?

PREDICTED CLASS
Precision (p) = 0.8
Class=Yes Class=No
Recall (r) = 0.8
A Class=Yes 40 10 F - measure (F) = 0.8
ACTUAL
CLASS Class=No 10 40 Accuracy = 0.8

PREDICTED CLASS
B Class=Yes Class=No Precision (p) =~ 0.04
Class=Yes 40 10 Recall (r) = 0.8
ACTUAL F - measure (F) =~ 0.08
CLASS Class=No 1000 4000
Accuracy =~ 0.8

2/15/2021 Introduction to Data Mining, 2nd Edition 11


Measures of Classification Performance

PREDICTED CLASS
Yes No
ACTUAL
Yes TP FN
CLASS
No FP TN

α is the probability that we reject the null hypothesis when it is true. This is a Type I error or a false positive (FP).

β is the probability that we accept the null hypothesis when it is false. This is a Type II error or a false negative (FN).

2/15/2021 Introduction to Data Mining, 2nd Edition 12


Alternative Measures

A PREDICTED CLASS Precision (p) = 0.8


TPR = Recall (r) = 0.8
Class=Yes Class=No FPR = 0.2
F−measure (F) = 0.8
Class=Yes 40 10 Accuracy = 0.8
ACTUAL
CLASS Class=No 10 40
TPR
=4
FPR

B PREDICTED CLASS Precision (p) = 0.038


TPR = Recall (r) = 0.8
Class=Yes Class=No
FPR = 0.2
Class=Yes 40 10 F−measure (F) = 0.07
ACTUAL Accuracy = 0.8
CLASS Class=No 1000 4000
TPR
=4
FPR

2/15/2021 Introduction to Data Mining, 2nd Edition 13


Which of these classifiers is better?

A PREDICTED CLASS
Class=Yes Class=No
Precision (p) = 0.5
Class=Yes 10 40
TPR = Recall (r) = 0.2
ACTUAL
Class=No 10 40
FPR = 0.2
CLASS F − measure = 0.28

B PREDICTED CLASS
Precision (p) = 0.5
Class=Yes Class=No
TPR = Recall (r) = 0.5
Class=Yes 25 25
ACTUAL Class=No 25 25
FPR = 0.5
CLASS F − measure = 0.5

C PREDICTED CLASS Precision (p) = 0.5


Class=Yes Class=No
TPR = Recall (r) = 0.8
Class=Yes 40 10
ACTUAL FPR = 0.8
Class=No 40 10
CLASS
F − measure = 0.61
2/15/2021 Introduction to Data Mining, 2nd Edition 14
ROC (Receiver Operating Characteristic)

A graphical approach for displaying trade-off


between detection rate and false alarm rate
Developed in 1950s for signal detection theory to
analyze noisy signals
ROC curve plots TPR against FPR
– Performance of a model represented as a point in an
ROC curve

2/15/2021 Introduction to Data Mining, 2nd Edition 15


ROC Curve

(TPR,FPR):
(0,0): declare everything
to be negative class
(1,1): declare everything
to be positive class
(1,0): ideal

Diagonal line:
– Random guessing
– Below diagonal line:
◆ prediction is opposite
of the true class

2/15/2021 Introduction to Data Mining, 2nd Edition 16


ROC (Receiver Operating Characteristic)

To draw ROC curve, classifier must produce


continuous-valued output
– Outputs are used to rank test records, from the most likely
positive class record to the least likely positive class record
– By using different thresholds on this value, we can create
different variations of the classifier with TPR/FPR tradeoffs
Many classifiers produce only discrete outputs (i.e.,
predicted class)
– How to get continuous-valued outputs?
◆ Decision trees, rule-based classifiers, neural networks,
Bayesian classifiers, k-nearest neighbors, SVM

2/15/2021 Introduction to Data Mining, 2nd Edition 17


Example: Decision Trees
Decision Tree

[Figure: decision tree on attributes x1 and x2, with splits such as x2 < 12.63, x1 < 13.29, x2 < 17.35, x1 < 6.56, x1 < 2.15, x1 < 7.24, x2 < 8.64, x1 < 12.11, x2 < 1.38, x1 < 18.88]

Continuous-valued outputs

[Figure: the same tree with each leaf annotated with a continuous-valued score (0.059, 0.220, 0.071, 0.107, 0.727, 0.164, 0.143, 0.669, 0.271, 0.654, 0), e.g., the fraction of positive training records at the leaf]

2/15/2021 Introduction to Data Mining, 2nd Edition 18


ROC Curve Example

[Figure: the decision tree from the previous slide together with the ROC curve obtained from its leaf scores (0.059, 0.220, 0.071, 0.107, 0.727, 0.164, 0.143, 0.669, 0.271, 0.654, 0)]

2/15/2021 Introduction to Data Mining, 2nd Edition 19


ROC Curve Example
- 1-dimensional data set containing 2 classes (positive and negative)
- Any point located at x > t is classified as positive

At threshold t:
TPR=0.5, FNR=0.5, FPR=0.12, TNR=0.88
2/15/2021 Introduction to Data Mining, 2nd Edition 20
How to Construct an ROC curve

• Use a classifier that produces a continuous-valued score for each instance
• The more likely it is for the instance to be in the + class, the higher the score
• Sort the instances in decreasing order according to the score
• Apply a threshold at each unique value of the score
• Count the number of TP, FP, TN, FN at each threshold
  • TPR = TP / (TP + FN)
  • FPR = FP / (FP + TN)

Instance  Score  True Class
1         0.95   +
2         0.93   +
3         0.87   -
4         0.85   -
5         0.85   -
6         0.85   +
7         0.76   -
8         0.53   +
9         0.43   -
10        0.25   +

2/15/2021 Introduction to Data Mining, 2nd Edition 21


How to construct an ROC curve
Class + - + - - - + - + +
P
Threshold >= 0.25 0.43 0.53 0.76 0.85 0.85 0.85 0.87 0.93 0.95 1.00

TP 5 4 4 3 3 3 3 2 2 1 0

FP 5 5 4 4 3 2 1 1 0 0 0

TN 0 0 1 1 2 3 4 4 5 5 5

FN 0 1 1 2 2 2 2 3 3 4 5

TPR 1 0.8 0.8 0.6 0.6 0.6 0.6 0.4 0.4 0.2 0

FPR 1 1 0.8 0.8 0.6 0.4 0.2 0.2 0 0 0

ROC Curve:

2/15/2021 Introduction to Data Mining, 2nd Edition 22
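A minimal sketch, not from the slides, that sweeps a threshold over the ten scored instances above and reproduces the TPR/FPR values in this table:

scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
labels = ["+", "+", "-", "-", "-", "+", "-", "+", "-", "+"]
P, N = labels.count("+"), labels.count("-")

for t in sorted(set(scores)) + [1.00]:
    tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == "+")
    fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == "-")
    print(f"threshold >= {t:.2f}: TPR = {tp / P:.1f}, FPR = {fp / N:.1f}")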


Using ROC for Model Comparison

No model consistently
outperforms the other
M1 is better for
small FPR
M2 is better for
large FPR

Area Under the ROC


curve (AUC)
Ideal:
▪ Area =1
Random guess:
▪ Area = 0.5

2/15/2021 Introduction to Data Mining, 2nd Edition 23


Dealing with Imbalanced Classes - Summary

Many measures exists, but none of them may be ideal in


all situations
– Random classifiers can have high value for many of these measures
– TPR/FPR provides important information but may not be sufficient by
itself in many practical scenarios
– Given two classifiers, sometimes you can tell that one of them is
strictly better than the other
◆C1 is strictly better than C2 if C1 has strictly better TPR and FPR relative to C2 (or same
TPR and better FPR, and vice versa)
– Even if C1 is strictly better than C2, C1’s F-value can be worse than
C2’s if they are evaluated on data sets with different imbalances
– Classifier C1 can be better or worse than C2 depending on the scenario
at hand (class imbalance, importance of TP vs FP, cost/time tradeoffs)

2/15/2021 Introduction to Data Mining, 2nd Edition 24


Which Classifier is better?
Precision (p) = 0.98
T1 PREDICTED CLASS TPR = Recall (r) = 0.5
Class=Yes Class=No
FPR = 0.01
Class=Yes 50 50 TPR/FPR = 50
ACTUAL
CLASS Class=No 1 99
F − measure = 0.66

Precision (p) = 0.9


T2 PREDICTED CLASS
TPR = Recall (r) = 0.99
Class=Yes Class=No
FPR = 0.1
ACTUAL
Class=Yes 99 1
TPR/FPR = 9.9
Class=No 10 90
CLASS
F − measure = 0.94

T3 PREDICTED CLASS Precision (p) = 0.99


Class=Yes Class=No TPR = Recall (r) = 0.99
Class=Yes 99 1 FPR = 0.01
ACTUAL
CLASS Class=No 1 99 TPR/FPR = 99

F − measure = 0.99

2/15/2021 Introduction to Data Mining, 2nd Edition 25
Which Classifier is better? Medium Skew case
Precision (p) = 0.83
T1 PREDICTED CLASS TPR = Recall (r) = 0.5
Class=Yes Class=No
FPR = 0.01
Class=Yes 50 50 TPR/FPR = 50
ACTUAL
CLASS Class=No 10 990
F − measure = 0.62

Precision (p) = 0.5


T2 PREDICTED CLASS
TPR = Recall (r) = 0.99
Class=Yes Class=No
FPR = 0.1
ACTUAL
Class=Yes 99 1
TPR/FPR = 9.9
Class=No 100 900
CLASS
F − measure = 0.66

T3 PREDICTED CLASS Precision (p) = 0.9


Class=Yes Class=No TPR = Recall (r) = 0.99
Class=Yes 99 1 FPR = 0.01
ACTUAL
CLASS Class=No 10 990 TPR/FPR = 99

F − measure = 0.94

2/15/2021 Introduction to Data Mining, 2nd Edition 26
Which Classifier is better? High Skew case
Precision (p) = 0.3
T1 PREDICTED CLASS TPR = Recall (r) = 0.5
Class=Yes Class=No
FPR = 0.01
Class=Yes 50 50 TPR/FPR = 50
ACTUAL
CLASS Class=No 100 9900
F − measure = 0.375

Precision (p) = 0.09


T2 PREDICTED CLASS
TPR = Recall (r) = 0.99
Class=Yes Class=No
FPR = 0.1
ACTUAL
Class=Yes 99 1
TPR/FPR = 9.9
Class=No 1000 9000
CLASS
F − measure = 0.165

T3 PREDICTED CLASS Precision (p) = 0.5


Class=Yes Class=No TPR = Recall (r) = 0.99
Class=Yes 99 1 FPR = 0.01
ACTUAL
CLASS Class=No 100 9900 TPR/FPR = 99

F − measure = 0.66

2/15/2021 Introduction to Data Mining, 2nd Edition 27
Building Classifiers with Imbalanced Training Set

Modify the distribution of training data so that rare


class is well-represented in training set
– Undersample the majority class
– Oversample the rare class

2/15/2021 Introduction to Data Mining, 2nd Edition 28
