
RV College of Engineering
Go, change the world

Artificial Intelligence and Machine Learning


(IS353IA)

Unit-IV
Contents

❑ Nearest Neighbor Classifiers


❑ Naive Bayes Classifier
❑ Logistic Regression
❑ Ensemble Methods



Nearest Neighbor Classifiers

Basic idea:
– If it walks like a duck, quacks like a duck, then it’s
probably a duck

(Figure: compute the distance from the test record to the training records, then choose the k "nearest" training records.)



Nearest-Neighbor Classifiers
Requires the following:
– A set of labeled records
– A proximity metric to compute the distance/similarity between a pair of records (e.g., Euclidean distance)
– The value of k, the number of nearest neighbors to retrieve
– A method for using the class labels of the k nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)



How to Determine the Class Label of a Test Sample?

Take the majority vote of class labels among the k-nearest neighbors
Weight the vote according to distance
– weight factor, w = 1/d²
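A minimal Python sketch of the two voting schemes described above: a plain majority vote and a distance-weighted vote with weight factor w = 1/d². The helper name knn_predict and the toy data are made up for illustration.

import numpy as np
from collections import defaultdict

def knn_predict(X_train, y_train, x_test, k=3, weighted=True):
    """Classify x_test from its k nearest training records.

    If weighted=True, each neighbor's vote is weighted by w = 1/d^2;
    otherwise a plain majority vote is taken.
    """
    # Euclidean distance from the test record to every training record
    dists = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]   # indices of the k closest records

    votes = defaultdict(float)
    for i in nearest:
        w = 1.0 / (dists[i] ** 2 + 1e-12) if weighted else 1.0  # small constant avoids division by zero
        votes[y_train[i]] += w
    return max(votes, key=votes.get)

# Toy usage with made-up data
X_train = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.1], [4.8, 5.3]])
y_train = np.array(["duck", "duck", "goose", "goose"])
print(knn_predict(X_train, y_train, np.array([1.1, 1.0]), k=3))  # -> 'duck'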



Choice of proximity measure matters

For documents, cosine is better than correlation or Euclidean distance.

Example: compare the two pairs of binary vectors
  111111111110 vs 011111111111
  000000000001 vs 100000000000

Euclidean distance = 1.4142 for both pairs, but the cosine similarity measure has different values for these pairs.
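A short Python check of this example; the vector pairs are taken from the slide and the helper names are illustrative.

import numpy as np

def euclidean(a, b):
    return np.sqrt(((a - b) ** 2).sum())

def cosine(a, b):
    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

# The two document-vector pairs from the slide
a1 = np.array([int(c) for c in "111111111110"]); b1 = np.array([int(c) for c in "011111111111"])
a2 = np.array([int(c) for c in "000000000001"]); b2 = np.array([int(c) for c in "100000000000"])

print(euclidean(a1, b1), euclidean(a2, b2))  # 1.4142 and 1.4142 -- identical distances
print(cosine(a1, b1), cosine(a2, b2))        # ~0.909 vs 0.0     -- very different similarities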



Nearest Neighbor Classification…

Data preprocessing is often required
– Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes
  ◆ Example:
    – height of a person may vary from 1.5 m to 1.8 m
    – weight of a person may vary from 90 lb to 300 lb
    – income of a person may vary from $10K to $1M
– Time series are often standardized to have mean 0 and standard deviation 1
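A minimal sketch of such scaling in Python, assuming z-score standardization is the chosen preprocessing step; the sample values are made up to mirror the ranges above.

import numpy as np

# Hypothetical records: [height (m), weight (lb), income ($)]
X = np.array([[1.5,  90.0,    10_000.0],
              [1.8, 300.0, 1_000_000.0],
              [1.7, 180.0,    50_000.0]])

# z-score standardization: each attribute gets mean 0 and standard deviation 1,
# so no single attribute (e.g., income) dominates the Euclidean distance
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_scaled)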



Nearest Neighbor Classification…

Choosing the value of k:


– If k is too small, sensitive to noise points
– If k is too large, neighborhood may include points from
other classes



Nearest-neighbor classifiers
Nearest neighbor classifiers are local classifiers.
They can produce decision boundaries of arbitrary shapes.
The 1-NN decision boundary is a Voronoi diagram.



Nearest Neighbor Classification…

How to handle missing values in training and test sets?
– Proximity computations normally require the presence of all attributes
– Some approaches use the subset of attributes present in two instances
  ◆ This may not produce good results since it effectively uses different proximity measures for each pair of instances
  ◆ Thus, proximities are not comparable



K-NN Classifiers…
Handling Irrelevant and Redundant Attributes
– Irrelevant attributes add noise to the proximity measure
– Redundant attributes bias the proximity measure towards certain attributes



K-NN Classifiers: Handling attributes that are interacting





Improving KNN Efficiency
Avoid having to compute distance to all objects in the
training set
– Multi-dimensional access methods (k-d trees); see the sketch after this list
– Fast approximate similarity search
– Locality Sensitive Hashing (LSH)
Condensing
– Determine a smaller set of objects that give the
same performance
Editing
– Remove objects to improve efficiency
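A small sketch of the first idea, assuming scikit-learn is available: KNeighborsClassifier can be asked to build a k-d tree index instead of scanning every training object. The data here is synthetic.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(10_000, 3))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)

# algorithm="kd_tree" builds a multi-dimensional index, avoiding a brute-force
# distance computation against all 10,000 training objects at query time
knn = KNeighborsClassifier(n_neighbors=5, algorithm="kd_tree")
knn.fit(X_train, y_train)
print(knn.predict(rng.normal(size=(3, 3))))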



Naive Bayes Classifier



Bayes Classifier

A probabilistic framework for solving classification problems

Conditional Probability:
  P(Y | X) = P(X, Y) / P(X)
  P(X | Y) = P(X, Y) / P(Y)

Bayes theorem:
  P(Y | X) = P(X | Y) P(Y) / P(X)



Using Bayes Theorem for Classification

Consider each attribute and class label as random variables.

Given a record with attributes (X1, X2, …, Xd), the goal is to predict class Y
– Specifically, we want to find the value of Y that maximizes P(Y | X1, X2, …, Xd)

Can we estimate P(Y | X1, X2, …, Xd) directly from data?

Training data (categorical: Refund, Marital Status; continuous: Taxable Income; class: Evade):

Tid  Refund  Marital Status  Taxable Income  Evade
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes



Using Bayes Theorem for Classification

Approach:
– Compute the posterior probability P(Y | X1, X2, …, Xd) using the Bayes theorem:

  P(Y | X1, X2, …, Xd) = P(X1, X2, …, Xd | Y) P(Y) / P(X1, X2, …, Xd)

– Maximum a-posteriori: choose the Y that maximizes P(Y | X1, X2, …, Xd)

– Equivalent to choosing the value of Y that maximizes P(X1, X2, …, Xd | Y) P(Y)

How to estimate P(X1, X2, …, Xd | Y)?


Example Data
Given a test record:
  X = (Refund = No, Divorced, Income = 120K)

• We need to estimate P(Evade = Yes | X) and P(Evade = No | X)

In the following we will replace Evade = Yes by Yes, and Evade = No by No.
(Training data as in the table above.)





Conditional Independence

X and Y are conditionally independent given Z if


P(X|YZ) = P(X|Z)

Example: Arm length and reading skills


– Young child has shorter arm length and limited
reading skills, compared to adults
– If age is fixed, no apparent relationship between
arm length and reading skills
– Arm length and reading skills are conditionally
independent given age



Naïve Bayes Classifier
Assume independence among attributes Xi when class is given:
– P(X1, X2, …, Xd |Yj) = P(X1| Yj) P(X2| Yj)… P(Xd| Yj)

– Now we can estimate P(Xi| Yj) for all Xi and Yj


combinations from the training data

– A new point is classified to Yj if P(Yj) ∏ P(Xi | Yj) is maximal.



Naïve Bayes on Example Data
Given a test record:
  X = (Refund = No, Divorced, Income = 120K)

P(X | Yes) = P(Refund = No | Yes) × P(Divorced | Yes) × P(Income = 120K | Yes)

P(X | No) = P(Refund = No | No) × P(Divorced | No) × P(Income = 120K | No)



Estimate Probabilities from Data

P(y) = fraction of instances of class y
– e.g., P(No) = 7/10, P(Yes) = 3/10

For categorical attributes:
  P(Xi = c | y) = nc / n
– where nc is the number of instances having attribute value Xi = c and belonging to class y, and n is the number of instances belonging to class y
– Examples:
  P(Status = Married | No) = 4/7
  P(Refund = Yes | Yes) = 0


Estimate Probabilities from Data

For continuous attributes:


– Discretization: Partition the range into bins:
◆ Replace continuous value with bin value
– Attribute changed from continuous to ordinal

– Probability density estimation:


◆ Assume attribute follows a normal distribution
◆ Use data to estimate parameters of distribution
(e.g., mean and standard deviation)
◆ Once probability distribution is known, use it to
estimate the conditional probability P(Xi|Y)



Estimate Probabilities from Data

Normal distribution:

  P(Xi | Yj) = 1 / sqrt(2π σij²) · exp( −(Xi − μij)² / (2 σij²) )

– One for each (Xi, Yj) pair

For (Income, Class = No):
– sample mean = 110
– sample variance = 2975

  P(Income = 120 | No) = 1 / (sqrt(2π) · 54.54) · exp( −(120 − 110)² / (2 · 2975) ) = 0.0072
Example of Naïve Bayes Classifier
Given a test record:
  X = (Refund = No, Divorced, Income = 120K)

Naïve Bayes Classifier (estimated from the training data):
  P(Refund = Yes | No) = 3/7          P(Refund = No | No) = 4/7
  P(Refund = Yes | Yes) = 0           P(Refund = No | Yes) = 1
  P(Marital Status = Single | No) = 2/7
  P(Marital Status = Divorced | No) = 1/7
  P(Marital Status = Married | No) = 4/7
  P(Marital Status = Single | Yes) = 2/3
  P(Marital Status = Divorced | Yes) = 1/3
  P(Marital Status = Married | Yes) = 0

For Taxable Income:
  If class = No: sample mean = 110, sample variance = 2975
  If class = Yes: sample mean = 90, sample variance = 25

• P(X | No) = P(Refund = No | No) × P(Divorced | No) × P(Income = 120K | No)
            = 4/7 × 1/7 × 0.0072 = 0.0006

• P(X | Yes) = P(Refund = No | Yes) × P(Divorced | Yes) × P(Income = 120K | Yes)
             = 1 × 1/3 × 1.2 × 10⁻⁹ = 4 × 10⁻¹⁰

Since P(X | No) P(No) > P(X | Yes) P(Yes), P(No | X) > P(Yes | X)
=> Class = No
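The same computation can be checked with a few lines of Python. The dictionaries below simply hard-code the estimates listed on this slide, and normal_pdf is an illustrative helper.

import math

# Estimates from the 10-record training set (as listed above)
p_refund_no  = {"No": 4/7, "Yes": 1.0}                        # P(Refund = No | class)
p_divorced   = {"No": 1/7, "Yes": 1/3}                        # P(Marital Status = Divorced | class)
income_stats = {"No": (110.0, 2975.0), "Yes": (90.0, 25.0)}   # (mean, variance)
prior        = {"No": 7/10, "Yes": 3/10}

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

scores = {}
for c in ("No", "Yes"):
    likelihood = p_refund_no[c] * p_divorced[c] * normal_pdf(120, *income_stats[c])
    scores[c] = likelihood * prior[c]          # P(X | class) * P(class)

print(scores)                        # {'No': ~4.1e-4, 'Yes': ~1.2e-10}
print(max(scores, key=scores.get))   # -> 'No'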



Naïve Bayes Classifier can make decisions with partial information about the attributes in the test record

Even in the absence of information about any attribute, we can use the a priori probabilities of the class variable:
  P(Yes) = 3/10, P(No) = 7/10

If we only know that Marital Status = Divorced, then:
  P(Yes | Divorced) = 1/3 × 3/10 / P(Divorced)
  P(No | Divorced) = 1/7 × 7/10 / P(Divorced)

If we also know that Refund = No, then:
  P(Yes | Refund = No, Divorced) = 1 × 1/3 × 3/10 / P(Divorced, Refund = No)
  P(No | Refund = No, Divorced) = 4/7 × 1/7 × 7/10 / P(Divorced, Refund = No)

If we also know that Taxable Income = 120, then:
  P(Yes | Refund = No, Divorced, Income = 120) = 1.2 × 10⁻⁹ × 1 × 1/3 × 3/10 / P(Divorced, Refund = No, Income = 120)
  P(No | Refund = No, Divorced, Income = 120) = 0.0072 × 4/7 × 1/7 × 7/10 / P(Divorced, Refund = No, Income = 120)

(Conditional probability estimates as on the previous slide.)
Issues with Naïve Bayes Classifier
Given a test record:
  X = (Married)

Using P(Yes) = 3/10, P(No) = 7/10 and the conditional probabilities estimated earlier (in particular, P(Marital Status = Married | Yes) = 0 and P(Marital Status = Married | No) = 4/7):

  P(Yes | Married) = 0 × 3/10 / P(Married)
  P(No | Married) = 4/7 × 7/10 / P(Married)



Issues with Naïve Bayes Classifier
Consider the table with Tid = 7 deleted.

Naïve Bayes Classifier (re-estimated):
  P(Refund = Yes | No) = 2/6          P(Refund = No | No) = 4/6
  P(Refund = Yes | Yes) = 0           P(Refund = No | Yes) = 1
  P(Marital Status = Single | No) = 2/6
  P(Marital Status = Divorced | No) = 0
  P(Marital Status = Married | No) = 4/6
  P(Marital Status = Single | Yes) = 2/3
  P(Marital Status = Divorced | Yes) = 1/3
  P(Marital Status = Married | Yes) = 0/3

For Taxable Income:
  If class = No: sample mean = 91, sample variance = 685
  If class = Yes: sample mean = 90, sample variance = 25

Given X = (Refund = Yes, Divorced, 120K):
  P(X | No) = 2/6 × 0 × 0.0083 = 0
  P(X | Yes) = 0 × 1/3 × 1.2 × 10⁻⁹ = 0

Naïve Bayes will not be able to classify X as Yes or No!



Issues with Naïve Bayes Classifier

If one of the conditional probabilities is zero, then the entire expression becomes zero.
Need to use other estimates of conditional probabilities than simple fractions.

Probability estimation:

  original:          P(Xi = c | y) = nc / n
  Laplace estimate:  P(Xi = c | y) = (nc + 1) / (n + v)
  m-estimate:        P(Xi = c | y) = (nc + m·p) / (n + m)

where
  n:  number of training instances belonging to class y
  nc: number of instances with Xi = c and Y = y
  v:  total number of attribute values that Xi can take
  p:  initial estimate of P(Xi = c | y), known a priori
  m:  hyper-parameter for our confidence in p



Example of Naïve Bayes Classifier

A: the attribute values of the test record, M: mammals, N: non-mammals

Name           Give Birth  Can Fly  Live in Water  Have Legs  Class
human          yes         no       no             yes        mammals
python         no          no       no             no         non-mammals
salmon         no          no       yes            no         non-mammals
whale          yes         no       yes            no         mammals
frog           no          no       sometimes      yes        non-mammals
komodo         no          no       no             yes        non-mammals
bat            yes         yes      no             yes        mammals
pigeon         no          yes      no             yes        non-mammals
cat            yes         no       no             yes        mammals
leopard shark  yes         no       yes            no         non-mammals
turtle         no          no       sometimes      yes        non-mammals
penguin        no          no       sometimes      yes        non-mammals
porcupine      yes         no       no             yes        mammals
eel            no          no       yes            no         non-mammals
salamander     no          no       sometimes      yes        non-mammals
gila monster   no          no       no             yes        non-mammals
platypus       no          no       no             yes        mammals
owl            no          yes      no             yes        non-mammals
dolphin        yes         no       yes            no         mammals
eagle          no          yes      no             yes        non-mammals

Test record:
Give Birth  Can Fly  Live in Water  Have Legs  Class
yes         no       yes            no         ?

P(A | M) = 6/7 × 6/7 × 2/7 × 2/7 = 0.06
P(A | N) = 1/13 × 10/13 × 3/13 × 4/13 = 0.0042

P(A | M) P(M) = 0.06 × 7/20 = 0.021
P(A | N) P(N) = 0.0042 × 13/20 = 0.0027

P(A | M) P(M) > P(A | N) P(N)  =>  Mammals



Naïve Bayes (Summary)

Robust to isolated noise points

Handle missing values by ignoring the instance during


probability estimate calculations

Robust to irrelevant attributes

Redundant and correlated attributes will violate class


conditional assumption
– Use other techniques such as Bayesian Belief Networks (BBN)



Naïve Bayes
How does Naïve Bayes perform on the following dataset?

Conditional independence of attributes is violated



Bayesian Belief Networks

Provides graphical representation of probabilistic


relationships among a set of random variables
Consists of:
– A directed acyclic graph (DAG)
  ◆ A node corresponds to a variable
  ◆ An arc corresponds to a dependence relationship between a pair of variables
– A probability table associating each node with its immediate parents



Conditional Independence

(Figure: a small DAG over the nodes D, C, A, and B, in which:)
– D is a parent of C
– A is a child of C
– B is a descendant of D
– D is an ancestor of A

A node in a Bayesian network is conditionally independent of all of its nondescendants, if its parents are known.
Conditional Independence

Naïve Bayes assumption (as a Bayesian network): the class variable y is the sole parent of every attribute node X1, X2, X3, X4, …, Xd.



Probability Tables

If X does not have any parents, the table contains the prior probability P(X).

If X has only one parent (Y), the table contains the conditional probability P(X | Y).

If X has multiple parents (Y1, Y2, …, Yk), the table contains the conditional probability P(X | Y1, Y2, …, Yk).



Example of Bayesian Belief Network

Structure: Exercise → Heart Disease ← Diet; Heart Disease → Chest Pain, Heart Disease → Blood Pressure

P(Exercise = Yes) = 0.7        P(Exercise = No) = 0.3
P(Diet = Healthy) = 0.25       P(Diet = Unhealthy) = 0.75

P(Heart Disease | Exercise, Diet):
              E=Yes, D=Healthy   E=Yes, D=Unhealthy   E=No, D=Healthy   E=No, D=Unhealthy
  HD = Yes    0.25               0.45                 0.55              0.75
  HD = No     0.75               0.55                 0.45              0.25

P(Chest Pain | Heart Disease):            P(Blood Pressure | Heart Disease):
              HD=Yes   HD=No                            HD=Yes   HD=No
  CP = Yes    0.8      0.01                 BP = High   0.85     0.2
  CP = No     0.2      0.99                 BP = Low    0.15     0.8


Example of Inferencing using BBN
Given: X = (E=No, D=Yes, CP=Yes, BP=High)
– Compute P(HD | E, D, CP, BP)

P(HD=Yes | E=No, D=Yes) = 0.55
P(CP=Yes | HD=Yes) = 0.8
P(BP=High | HD=Yes) = 0.85
– P(HD=Yes | E=No, D=Yes, CP=Yes, BP=High) ∝ 0.55 × 0.8 × 0.85 = 0.374

P(HD=No | E=No, D=Yes) = 0.45
P(CP=Yes | HD=No) = 0.01
P(BP=High | HD=No) = 0.2
– P(HD=No | E=No, D=Yes, CP=Yes, BP=High) ∝ 0.45 × 0.01 × 0.2 = 0.0009

=> Classify X as Yes



Logistic Regression



Logistic Model

Logistic model (or logit model) is a statistical


model that models the probability of an event taking
place by having the log-odds for the event be a linear
combination of one or more independent variables.



An example curve

Example graph of a logistic regression curve fitted to data. The curve shows the probability of
passing an exam (binary dependent variable) versus hours studying (scalar independent variable).



Logistic Regression

➢ Logistic regression is a probabilistic discriminative model that directly estimates the odds of a data instance x using its attribute values.
➢ The basic idea is to use a linear predictor, z = wᵀx + b, for representing the odds of x as follows:

    P(y = 1 | x) / P(y = 0 | x) = e^z = e^(wᵀx + b)

where w and b are the parameters of the model and aᵀ denotes the transpose of a vector a. Note that if wᵀx + b > 0, then x belongs to class 1 since its odds are greater than 1. Otherwise, x belongs to class 0.



Cont…
Since P(y = 0 | x) + P(y = 1 | x) = 1, we can re-write the odds relation as

    P(y = 1 | x) / (1 − P(y = 1 | x)) = e^z

This can be further simplified to express P(y = 1 | x) as a function of z:

    P(y = 1 | x) = e^z / (1 + e^z) = 1 / (1 + e^(−z)) = σ(z)

where the function σ(·) is known as the logistic or sigmoid function.
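A minimal Python sketch of the resulting classifier: compute z = wᵀx + b, pass it through the sigmoid, and threshold the probability. The parameter values here are made up.

import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def predict(X, w, b, threshold=0.5):
    """Return P(y=1 | x) = sigmoid(w.x + b) and the thresholded class labels."""
    p = sigmoid(X @ w + b)
    return p, (p >= threshold).astype(int)

# Toy usage with made-up parameters
w = np.array([1.2, -0.7]); b = 0.1
X = np.array([[2.0, 1.0], [-1.0, 3.0]])
print(predict(X, w, b))   # probabilities and hard class labels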



Logistic Regression as a Generalized Linear Model

➢ Logistic regression belongs to a broader family of


statistical regression models, known as generalized
linear models (GLM).

➢ Even though logistic regression has relationships with regression


models, it is a classification model since the computed posterior
probabilities are eventually used to determine the class label of a
data instance.
Learning Model Parameters

➢ The parameters of logistic regression, (w, b), are


estimated during training using a statistical approach
known as the maximum likelihood estimation (MLE)
method.



Characteristics of Logistic Regression

➢ Discriminative model for classification.


➢ The learned parameters of logistic regression can be
analyzed to understand the relationships between
attributes and class labels.
➢ Can work more robustly even in high-dimensional
settings
➢ Can handle irrelevant attributes
➢ Cannot handle data instances with missing values



Ensemble Techniques

Ensemble Methods

There are techniques for improving classification


accuracy by aggregating the predictions of multiple
classifiers.
– known as ensemble or classifier combination
methods.
An ensemble method constructs a set of base
classifiers from training data and performs
classification
– by taking a vote on the predictions made by each
base classifier.

Ensemble Methods

Construct a set of base classifiers learned from the


training data

Predict class label of test records by combining the


predictions made by multiple classifiers (e.g., by
taking majority vote)

Example: Why Do Ensemble Methods Work?

Necessary Conditions for Ensemble Methods

Ensemble Methods work better than a single base classifier if:


1. All base classifiers are independent of each other
2. All base classifiers perform better than random guessing
(error rate < 0.5 for binary classification)

(Figure: classification error for an ensemble of 25 base classifiers, assuming their errors are uncorrelated; a comparison between the errors of the base classifiers and the error of the ensemble classifier.)
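Under the independence assumption, the majority-vote error of the ensemble is a binomial tail probability; the short Python sketch below reproduces the qualitative behaviour shown in the figure (the epsilon values are illustrative).

from math import comb

def ensemble_error(eps, n=25):
    """Error of a majority vote over n independent base classifiers, each with
    error rate eps: the probability that more than half of them are wrong."""
    k = n // 2 + 1
    return sum(comb(n, i) * eps**i * (1 - eps)**(n - i) for i in range(k, n + 1))

for eps in (0.35, 0.45, 0.5):
    print(eps, round(ensemble_error(eps), 4))
# eps = 0.35 -> ~0.06 (much better than the base classifiers)
# eps = 0.5  -> 0.5   (no gain once base error reaches random guessing)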
Rationale for Ensemble Learning

Ensemble Methods work best with unstable base


classifiers
– Classifiers that are sensitive to minor perturbations in
training set, due to high model complexity
– Examples: Unpruned decision trees, ANNs, …

Bias-Variance Decomposition
Analogous problem of reaching a target y by firing projectiles
from x (regression problem)

For classification, the generalization error of model 𝑚 can be


given by:

gen. error(m) = c1 + bias(m) + c2 × variance(m)

Bias-Variance Trade-off and Overfitting

(Figure: model complexity versus error, showing the underfitting and overfitting regimes.)

Ensemble methods try to reduce the variance of complex models (with low bias) by aggregating the responses of multiple base classifiers.
General Approach of Ensemble Learning

Combine the base classifiers' predictions using a majority vote or a weighted majority vote (weighted according to their accuracy or relevance).

(Figure: a logical view of the ensemble learning method.)

Constructing Ensemble Classifiers

By manipulating training set


– Example: bagging, boosting

By manipulating input features


– Example: random forests

By manipulating class labels


– Example: error-correcting output coding

By manipulating learning algorithm


– Example: injecting randomness in the initial weights of ANN

Constructing Ensemble Classifiers

By manipulating training set


o multiple training sets are created by resampling the original data according to some
sampling distribution and constructing a classifier from each training set.
o The sampling distribution determines how likely it is that an example will be selected
for training, and it may vary from one trial to another.
o Example: bagging, boosting

By manipulating input features


o a subset of input features is chosen to form each training set. The subset can be either
chosen randomly or based on the recommendation of domain experts.
o this approach works very well with data sets that contain highly redundant features
o Example: random forests --ensemble method that manipulates its input features and
uses decision trees as its base classifiers

Constructing Ensemble Classifiers

By manipulating class labels


o Example: error-correcting output coding
o can be used when the number of classes is sufficiently large.
o The training data is transformed into a binary class problem by randomly partitioning
the class labels into two disjoint subsets, A0 and A1.
o Training examples whose class label belongs to the subset A0 are assigned to class 0,
while those that belong to the subset A1 are assigned to class 1.
o The relabeled examples are then used to train a base classifier.
o By repeating this process multiple times, an ensemble of base classifiers is obtained.
When a test example is presented, each base classifier Ci is used to predict its class
label.
o If the test example is predicted as class 0, then all the classes that belong to A0 will receive a vote. Conversely, if it is predicted to be class 1, then all the classes that belong to A1 will receive a vote.
o The votes are tallied and the class that receives the highest vote is assigned to the test example.

Constructing Ensemble Classifiers

By manipulating learning algorithm


– Example: injecting randomness in the initial weights of ANN
o Many learning algorithms can be manipulated in such a way that applying the
algorithm several times on the same training data will result in the construction of
different classifiers.
o For example, an artificial neural network can change its network topology or the initial
weights of the links between neurons.
o Similarly, an ensemble of decision trees can be constructed by injecting randomness
into the tree-growing procedure. For example, instead of choosing the best splitting
attribute at each node, we can randomly choose one of the top k attributes for splitting.

The first three approaches are generic methods that are applicable to any classifier,
whereas the fourth approach depends on the type of classifier used.
The base classifiers for most of these approaches can be generated sequentially (one
after another) or in parallel (all at once).
Bagging (Bootstrap AGGregatING)

Bootstrap sampling: sampling with replacement

Original Data 1 2 3 4 5 6 7 8 9 10
Bagging (Round 1) 7 8 10 8 2 5 10 10 5 9
Bagging (Round 2) 1 4 9 1 2 3 2 7 3 2
Bagging (Round 3) 1 8 5 10 5 5 9 6 3 7

Build classifier on each bootstrap sample

Probability of a training instance being selected in a bootstrap sample:
  ➢ 1 − (1 − 1/n)ⁿ   (n: number of training instances)
  ➢ ≈ 0.632 when n is large
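A quick check of this probability in Python:

def p_selected(n):
    """Probability that a given training instance appears in a bootstrap sample of size n."""
    return 1 - (1 - 1 / n) ** n

for n in (10, 100, 1000, 10_000):
    print(n, round(p_selected(n), 4))
# tends to 1 - 1/e ~ 0.632 as n grows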
Bagging Algorithm

Bagging Example
Consider 1-dimensional data set:

Original Data:
x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
y 1 1 1 -1 -1 -1 -1 1 1 1

Classifier is a decision stump (decision tree of size 1)


– Decision rule: x ≤ k versus x > k
– Split point k is chosen based on entropy

(Decision stump: test x ≤ k; predict y_left if true, y_right if false.)


Bagging Example

Bagging Round 1:
x   0.1  0.2  0.2  0.3  0.4  0.4  0.5  0.6  0.9  0.9      x <= 0.35 ⇒ y = 1
y   1    1    1    1    -1   -1   -1   -1   1    1         x > 0.35 ⇒ y = -1

Bagging Round 2:
x   0.1  0.2  0.3  0.4  0.5  0.5  0.9  1    1    1         x <= 0.7 ⇒ y = 1
y   1    1    1    -1   -1   -1   1    1    1    1          x > 0.7 ⇒ y = 1

Bagging Round 3:
x   0.1  0.2  0.3  0.4  0.4  0.5  0.7  0.7  0.8  0.9      x <= 0.35 ⇒ y = 1
y   1    1    1    -1   -1   -1   -1   -1   1    1          x > 0.35 ⇒ y = -1

Bagging Round 4:
x   0.1  0.1  0.2  0.4  0.4  0.5  0.5  0.7  0.8  0.9      x <= 0.3 ⇒ y = 1
y   1    1    1    -1   -1   -1   -1   -1   1    1          x > 0.3 ⇒ y = -1

Bagging Round 5:
x   0.1  0.1  0.2  0.5  0.6  0.6  0.6  1    1    1         x <= 0.35 ⇒ y = 1
y   1    1    1    -1   -1   -1   -1   1    1    1          x > 0.35 ⇒ y = -1



Bagging Example

Bagging Round 6:
x   0.2  0.4  0.5  0.6  0.7  0.7  0.7  0.8  0.9  1         x <= 0.75 ⇒ y = -1
y   1    -1   -1   -1   -1   -1   -1   1    1    1          x > 0.75 ⇒ y = 1

Bagging Round 7:
x   0.1  0.4  0.4  0.6  0.7  0.8  0.9  0.9  0.9  1         x <= 0.75 ⇒ y = -1
y   1    -1   -1   -1   -1   1    1    1    1    1          x > 0.75 ⇒ y = 1

Bagging Round 8:
x   0.1  0.2  0.5  0.5  0.5  0.7  0.7  0.8  0.9  1         x <= 0.75 ⇒ y = -1
y   1    1    -1   -1   -1   -1   -1   1    1    1          x > 0.75 ⇒ y = 1

Bagging Round 9:
x   0.1  0.3  0.4  0.4  0.6  0.7  0.7  0.8  1    1         x <= 0.75 ⇒ y = -1
y   1    1    -1   -1   -1   -1   -1   1    1    1          x > 0.75 ⇒ y = 1

Bagging Round 10:
x   0.1  0.1  0.1  0.1  0.3  0.3  0.8  0.8  0.9  0.9      x <= 0.05 ⇒ y = 1
y   1    1    1    1    1    1    1    1    1    1          x > 0.05 ⇒ y = 1



Bagging Example

Summary of Trained Decision Stumps:

Round Split Point Left Class Right Class


1 0.35 1 -1
2 0.7 1 1
3 0.35 1 -1
4 0.3 1 -1
5 0.35 1 -1
6 0.75 -1 1
7 0.75 -1 1
8 0.75 -1 1
9 0.75 -1 1
10 0.05 1 1

Bagging Example
Use majority vote (sign of sum of predictions) to determine
class of ensemble classifier
Round x=0.1 x=0.2 x=0.3 x=0.4 x=0.5 x=0.6 x=0.7 x=0.8 x=0.9 x=1.0
1 1 1 1 -1 -1 -1 -1 -1 -1 -1
2 1 1 1 1 1 1 1 1 1 1
3 1 1 1 -1 -1 -1 -1 -1 -1 -1
4 1 1 1 -1 -1 -1 -1 -1 -1 -1
5 1 1 1 -1 -1 -1 -1 -1 -1 -1
6 -1 -1 -1 -1 -1 -1 -1 1 1 1
7 -1 -1 -1 -1 -1 -1 -1 1 1 1
8 -1 -1 -1 -1 -1 -1 -1 1 1 1
9 -1 -1 -1 -1 -1 -1 -1 1 1 1
10 1 1 1 1 1 1 1 1 1 1
Sum 2 2 2 -6 -6 -6 -6 2 2 2
Predicted class (sign)   1   1   1   -1   -1   -1   -1   1   1   1

Bagging can also increase the complexity (representation


capacity) of simple classifiers such as decision stumps
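A sketch of bagging decision stumps on this data set, assuming scikit-learn is available. Because the bootstrap samples are drawn randomly, the individual stumps will generally differ from the rounds shown above.

import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

x = np.arange(0.1, 1.05, 0.1).reshape(-1, 1)
y = np.array([1, 1, 1, -1, -1, -1, -1, 1, 1, 1])

# 10 bootstrap rounds, each training a decision stump (a tree of depth 1)
bag = BaggingClassifier(DecisionTreeClassifier(max_depth=1),
                        n_estimators=10, random_state=0)
bag.fit(x, y)
print(bag.predict(x))   # with enough rounds the aggregated vote can represent
                        # the 1 / -1 / 1 pattern that no single stump can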
Boosting

An iterative procedure to adaptively change


distribution of training data by focusing more on
previously misclassified records
– Initially, all N records are assigned equal weights
(for being selected for training)
– Unlike bagging, weights may change at the end of
each boosting round

Boosting
Records that are wrongly classified will have their
weights increased in the next round
Records that are classified correctly will have their
weights decreased in the next round

Original Data 1 2 3 4 5 6 7 8 9 10
Boosting (Round 1) 7 3 2 8 7 9 4 10 6 3
Boosting (Round 2) 5 4 9 4 2 5 1 7 4 2
Boosting (Round 3) 4 4 8 10 4 5 4 6 3 4

• Example 4 is hard to classify


• Its weight is increased, therefore it is more
likely to be chosen again in subsequent rounds

AdaBoost

Base classifiers: C1, C2, …, CT

Error rate of a base classifier Ci (its weighted error on the training set, with the instance weights wj summing to 1):

  εi = Σj wj δ( Ci(xj) ≠ yj )

Importance of a classifier:

  αi = (1/2) ln( (1 − εi) / εi )

AdaBoost Algorithm
Weight update:

  wj(i+1) = ( wj(i) / Zi ) × exp(−αi)  if Ci(xj) = yj
  wj(i+1) = ( wj(i) / Zi ) × exp(+αi)  if Ci(xj) ≠ yj

  where Zi is a normalization factor chosen so that the updated weights sum to 1.

If any intermediate round produces an error rate higher than 50%, the weights are reverted back to 1/n and the resampling procedure is repeated.

Classification:

  C*(x) = argmax_y Σi αi δ( Ci(x) = y )
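A minimal sketch of one round of the weight update in Python. This is the reweighting form of AdaBoost (the slides above describe a variant that also resamples the training set), and the helper name is illustrative.

import numpy as np

def adaboost_round(w, y_true, y_pred):
    """One AdaBoost weight update: w sums to 1; labels are +1/-1 arrays.
    Returns the classifier importance alpha and the renormalized weights."""
    miss = (y_true != y_pred)
    eps = np.sum(w[miss])                    # weighted error of the base classifier
    alpha = 0.5 * np.log((1 - eps) / eps)    # importance of the classifier
    w_new = w * np.exp(np.where(miss, alpha, -alpha))
    return alpha, w_new / w_new.sum()        # division by Z_i (normalization)

# Toy usage: 10 instances, the base classifier gets instance 4 wrong
w = np.full(10, 0.1)
y_true = np.array([1, 1, 1, -1, -1, -1, -1, 1, 1, 1])
y_pred = np.array([1, 1, 1,  1, -1, -1, -1, 1, 1, 1])
alpha, w = adaboost_round(w, y_true, y_pred)
print(alpha, w)   # the misclassified instance's weight grows, the others shrink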



AdaBoost Example
Consider 1-dimensional data set:

Original Data:
x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
y 1 1 1 -1 -1 -1 -1 1 1 1

Classifier is a decision stump


– Decision rule: x ≤ k versus x > k
– Split point k is chosen based on entropy

(Decision stump: test x ≤ k; predict y_left if true, y_right if false.)
AdaBoost Example
Training sets for the first 3 boosting rounds:
Boosting Round 1:
x 0.1 0.4 0.5 0.6 0.6 0.7 0.7 0.7 0.8 1
y 1 -1 -1 -1 -1 -1 -1 -1 1 1

Boosting Round 2:
x 0.1 0.1 0.2 0.2 0.2 0.2 0.3 0.3 0.3 0.3
y 1 1 1 1 1 1 1 1 1 1

Boosting Round 3:
x 0.2 0.2 0.4 0.4 0.4 0.4 0.5 0.6 0.6 0.7
y 1 1 -1 -1 -1 -1 -1 -1 -1 -1

Summary:
Round Split Point Left Class Right Class alpha
1 0.75 -1 1 1.738
2 0.05 1 1 2.7784
3 0.3 1 -1 4.1195
AdaBoost Example

Weights
Round x=0.1 x=0.2 x=0.3 x=0.4 x=0.5 x=0.6 x=0.7 x=0.8 x=0.9 x=1.0
1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
2 0.311 0.311 0.311 0.01 0.01 0.01 0.01 0.01 0.01 0.01
3 0.029 0.029 0.029 0.228 0.228 0.228 0.228 0.009 0.009 0.009

Classification

Round x=0.1 x=0.2 x=0.3 x=0.4 x=0.5 x=0.6 x=0.7 x=0.8 x=0.9 x=1.0
1 -1 -1 -1 -1 -1 -1 -1 1 1 1
2 1 1 1 1 1 1 1 1 1 1
3 1 1 1 -1 -1 -1 -1 -1 -1 -1
Sum 5.16 5.16 5.16 -3.08 -3.08 -3.08 -3.08 0.397 0.397 0.397
Predicted class (sign)   1   1   1   -1   -1   -1   -1   1   1   1

Random Forest Algorithm
Construct an ensemble of decision trees by
manipulating training set as well as features

– Use bootstrap sample to train every decision tree


(similar to Bagging)
– Use the following tree induction algorithm:
◆ At every internal node of decision tree, randomly
sample p attributes for selecting split criterion
◆ Repeat this procedure until all leaves are pure (unpruned
tree)
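A short sketch of the algorithm in practice, assuming scikit-learn is available; max_features plays the role of p, the number of attributes sampled at each internal node. The data is synthetic.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 2] > 0).astype(int)

# Each tree is grown on a bootstrap sample; at every split only a random
# subset of attributes (here sqrt(d)) is considered.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X, y)
print(forest.predict(X[:5]))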

Characteristics of Random Forest

Gradient Boosting
Constructs a series of models
– Models can be any predictive model that has a
differentiable loss function
– Commonly, trees are the chosen model
◆ XGBoost (extreme gradient boosting) is a popular
package because of its impressive performance
Boosting can be viewed as optimizing the loss
function by iterative functional gradient descent.
Implementations of various boosted algorithms are
available in Python, R, Matlab, and more.
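A short sketch using scikit-learn's GradientBoostingClassifier on synthetic data, assuming scikit-learn is available (the xgboost package exposes a similar interface through its XGBClassifier class).

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

# Each stage fits a small regression tree to the gradient of the loss function
gbm = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, max_depth=2)
gbm.fit(X, y)
print(gbm.score(X, y))   # training accuracy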

References:
1. Pang-Ning Tan, Michael Steinbach, Vipin Kumar, Introduction to Data Mining, 2nd Edition, Pearson, 2019. ISBN-10: 9332571406, ISBN-13: 978-9332571402.
2. Tom M. Mitchell, Machine Learning, Indian Edition, McGraw Hill Education, 2013. ISBN-10: 1259096955.
3. Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 2nd Edition, Morgan Kaufmann, 2006. ISBN: 1-55860-901-6.

