Classification Methods
Lecture 2: Methods
Jing Gao
SUNY Buffalo
Outline
• Basics
– Problem, goal, evaluation
• Methods
– Nearest Neighbor
– Decision Tree
– Naïve Bayes
– Rule-based Classification
– Logistic Regression
– Support Vector Machines
– Ensemble methods
– ………
• Advanced topics
– Multi-view Learning
– Semi-supervised Learning
– Transfer Learning
– ……
Bayesian Classification
Posterior Probability
• Let X be a data sample whose class label is unknown
• Let Hi be the hypothesis that X belongs to a particular
class Ci
• P(Hi|X) is the posterior probability of Hi conditioned on X
– Probability that data example X belongs to class Ci
given the attribute values of X
– e.g., given X=(age:31…40, income: medium,
student: yes, credit: fair), what is the probability X
buys computer?
Bayes Theorem
• To classify means to determine the highest P(Hi|X)
among all classes C1,…Cm
– If P(H1|X)>P(H0|X), then X buys computer
– If P(H0|X)>P(H1|X), then X does not buy computer
– Calculate P(Hi|X) using the Bayes theorem
P(Hi|X) = P(X|Hi) P(Hi) / P(X)
Age Income Student Credit Buys_computer
P1 31…40 high no fair no
P2 <=30 high no excellent no
P3 31…40 high no fair yes
P4 >40 medium no fair yes
P5 >40 low yes fair yes
P6 >40 low yes excellent no
P7 31…40 low yes excellent yes
P8 <=30 medium no fair no
P9 <=30 low yes fair yes
P10 >40 medium yes fair yes
H1: Buys_computer=yes
H0: Buys_computer=no
P(H1) = 6/10 = 0.6
P(H0) = 4/10 = 0.4
P(Hi|X) = P(X|Hi) P(Hi) / P(X)
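To make the calculation concrete, here is a minimal Python sketch (not part of the original slides; the helper names are my own) that estimates the class priors and descriptor conditionals from the ten records above and scores the slide's example X = (age: 31…40, income: medium, student: yes, credit: fair). P(X) is dropped because it is the same for every class.

```python
from collections import Counter, defaultdict

# The ten training records from the table: (age, income, student, credit, buys_computer)
records = [
    ("31...40", "high",   "no",  "fair",      "no"),
    ("<=30",    "high",   "no",  "excellent", "no"),
    ("31...40", "high",   "no",  "fair",      "yes"),
    (">40",     "medium", "no",  "fair",      "yes"),
    (">40",     "low",    "yes", "fair",      "yes"),
    (">40",     "low",    "yes", "excellent", "no"),
    ("31...40", "low",    "yes", "excellent", "yes"),
    ("<=30",    "medium", "no",  "fair",      "no"),
    ("<=30",    "low",    "yes", "fair",      "yes"),
    (">40",     "medium", "yes", "fair",      "yes"),
]

n = len(records)
class_counts = Counter(r[-1] for r in records)
priors = {c: cnt / n for c, cnt in class_counts.items()}     # P(H1)=0.6, P(H0)=0.4

# Per-class counts of each attribute value: cond[class][attribute index][value]
cond = defaultdict(lambda: defaultdict(Counter))
for *attrs, label in records:
    for j, value in enumerate(attrs):
        cond[label][j][value] += 1

def posterior_score(x, label):
    """P(Hi) * prod_j P(x_j | Hi), proportional to P(Hi | X)."""
    score = priors[label]
    for j, value in enumerate(x):
        score *= cond[label][j][value] / class_counts[label]
    return score

x = ("31...40", "medium", "yes", "fair")
print({c: posterior_score(x, c) for c in priors})   # pick the class with the larger score
```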
Descriptor Prior Probability
Age Income Student Credit Buys_computer
P1 31…40 high no fair no
P2 <=30 high no excellent no
P3 31…40 high no fair yes
P4 >40 medium no fair yes
P5 >40 low yes fair yes
P6 >40 low yes excellent no
P7 31…40 low yes excellent yes
P8 <=30 medium no fair no
P9 <=30 low yes fair yes
P10 >40 medium yes fair yes
P(Hi|X) = P(X|Hi) P(Hi) / P(X)
Avoiding the Zero-Probability Problem
• The descriptor probability P(X|Hi), and hence the posterior, becomes 0 if any single conditional probability is 0:
P(X|Hi) = ∏ j=1..d P(xj|Hi)
– A common remedy is the Laplacian (add-one) correction: add 1 to every count so that no estimated probability is exactly 0
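As one illustration of the correction (a sketch only; the function and argument names are hypothetical, and it reuses `cond` and `class_counts` from the earlier Naïve Bayes sketch), the add-one estimate keeps every conditional probability strictly positive:

```python
def smoothed_conditional(label, j, value, n_values):
    """Laplace (add-one) estimate of P(x_j = value | Hi).

    n_values is the number of distinct values attribute j can take;
    adding 1 to every count prevents any estimate from being exactly 0.
    """
    return (cond[label][j][value] + 1) / (class_counts[label] + n_values)
```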
Logistic Regression Classifier
• Input distribution
– X is n-dimensional feature vector
– Y is 0 or 1
– X|Y ~ Gaussian distribution
– Y ~ Bernoulli distribution
• Model P(Y|X)
– What does P(Y|X) look like?
– What does P(Y=0|X)/P(Y=1|X) look like?
Log ratio: log [ P(Y=0|X) / P(Y=1|X) ]
If the log ratio is positive, predict Class 0; if it is negative, predict Class 1
Logistic Function
P(Y=1|X) = 1 / (1 + exp(-(wX + b)))
Training set: for examples with Y=1 the model should give P(Y=1|X) close to 1; for examples with Y=0 it should give P(Y=1|X) close to 0
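A minimal NumPy sketch of the logistic function (illustrative only; `w` and `b` are assumed to be a learned weight vector and bias):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real-valued score into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def p_y1_given_x(x, w, b):
    """P(Y=1 | X) = 1 / (1 + exp(-(w·x + b)))."""
    return sigmoid(np.dot(w, x) + b)
```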
Maximizing Conditional Likelihood
• Training Set:
• Find W that maximizes conditional likelihood:
• The conditional log-likelihood is a concave function of W
• Solved by gradient ascent (equivalently, gradient descent on the negative log-likelihood)
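A rough sketch of the optimization (assumptions: 0/1 labels, a fixed learning rate, and plain batch gradient ascent on the conditional log-likelihood; not the exact procedure from the slides):

```python
import numpy as np

def fit_logistic_regression(X, y, lr=0.1, n_iters=1000):
    """Maximize sum_i log P(y_i | x_i; w, b) by gradient ascent.
    X: (n, d) feature matrix, y: (n,) array of 0/1 labels."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # P(Y=1 | x_i) for every example
        # Gradient of the conditional log-likelihood: X^T (y - p)
        w += lr * X.T @ (y - p) / n
        b += lr * np.mean(y - p)
    return w, b
```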
Rule-Based Classifier
Rule-based Classifier (Example)
Name Blood Type Give Birth Can Fly Live in Water Class
human warm yes no no mammals
python cold no no no reptiles
salmon cold no no yes fishes
whale warm yes no yes mammals
frog cold no no sometimes amphibians
komodo cold no no no reptiles
bat warm yes yes no mammals
pigeon warm no yes no birds
cat warm yes no no mammals
leopard shark cold yes no yes fishes
turtle cold no no sometimes reptiles
penguin warm no no sometimes birds
porcupine warm yes no no mammals
eel cold no no yes fishes
salamander cold no no sometimes amphibians
gila monster cold no no no reptiles
platypus warm no no no mammals
owl warm no yes no birds
dolphin warm yes no yes mammals
eagle warm no yes no birds
Name Blood Type Give Birth Can Fly Live in Water Class
hawk warm no yes no ?
grizzly bear warm yes no no ?
Rule Coverage and Accuracy
• Coverage of a rule:
– Fraction of records that satisfy the antecedent of the rule
• Accuracy of a rule:
– Fraction of the covered records that also satisfy the rule's consequent
• Example on the ten-record Tid / Refund / Marital Status / Taxable Income / Class table:
(Status=Single) ==> No
Coverage = 40%, Accuracy = 50%
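A small sketch of the two measures (the record format and the encoding of a rule antecedent as a predicate function are my own choices):

```python
def coverage_and_accuracy(antecedent, rule_class, records):
    """Coverage = fraction of records satisfying the antecedent;
    Accuracy = fraction of those covered records having the rule's class.
    antecedent: function mapping an attribute dict to True/False;
    records: list of (attribute_dict, class_label) pairs."""
    covered = [(attrs, label) for attrs, label in records if antecedent(attrs)]
    coverage = len(covered) / len(records)
    accuracy = sum(label == rule_class for _, label in covered) / len(covered)
    return coverage, accuracy

# e.g., the slide's rule (Status=Single) ==> No:
# coverage_and_accuracy(lambda r: r["Marital Status"] == "Single", "No", records)
```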
Characteristics of Rule-Based Classifier
• Exhaustive rules
– Classifier has exhaustive coverage if it accounts for
every possible combination of attribute values
– Each record is covered by at least one rule
From Decision Trees To Rules
Classification Rules (one per path of the Refund / Marital Status / Taxable Income decision tree):
(Refund=Yes) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) ==> Yes
(Refund=No, Marital Status={Married}) ==> No
• Each path in the tree forms a rule
• Rules are mutually exclusive and exhaustive
• Rule set contains as much information as the tree
Rules Can Be Simplified
[Same Refund / Marital Status / Taxable Income decision tree as above]
Tid Refund Marital Status Taxable Income Cheat
1 Yes Single 125K No
2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
Learn Rules from Data: Sequential Covering
• Learn rules one at a time
– Learn a rule that covers as many positive and as few negative examples as possible
– Remove the examples covered by the rule from the training set
– Repeat until no examples remain or the quality of the best new rule falls below a threshold
Example of Sequential Covering
[Figure: rule R1 is learned first; rule R2 is then learned on the examples left uncovered by R1]
How to Learn-One-Rule?
• Start with the most general rule possible: condition = empty
• Add new attributes by adopting a greedy depth-first strategy
– Pick the attribute that most improves the rule quality
• Rule-quality measures: consider both coverage and accuracy
– FOIL-gain: assesses the information gained by extending the condition
FOIL_Gain = pos' × ( log2( pos' / (pos' + neg') ) − log2( pos / (pos + neg) ) )
– It favors rules that have high accuracy and cover many positive tuples
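The FOIL gain formula translated directly into a small helper (a sketch; the guard for zero coverage is my own addition):

```python
from math import log2

def foil_gain(pos, neg, pos_new, neg_new):
    """Gain of extending a rule's condition.

    pos, neg:          positive/negative tuples covered before the extension
    pos_new, neg_new:  positive/negative tuples covered after the extension
    """
    if pos_new == 0:
        return float("-inf")      # the extension covers no positive tuples
    return pos_new * (log2(pos_new / (pos_new + neg_new)) - log2(pos / (pos + neg)))
```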
Rule Generation
• To generate a rule:
while (true):
    find the best predicate p
    if foil_gain(p) > threshold:
        add p to the current rule
    else:
        break
[Figure: the current rule grows one predicate at a time, e.g. A3=1, then A3=1 && A1=2, then A3=1 && A1=2 && A8=5, separating positive from negative examples]
Associative Classification
• Associative classification: Major steps
– Mine data to find strong associations between
frequent patterns (conjunctions of attribute-value
pairs) and class labels
– Association rules are generated in the form of
p1 ∧ p2 ∧ … ∧ pl ⇒ "class = C" (confidence, support)
– Organize the rules to form a rule-based classifier
Associative Classification
• Why effective?
– It explores highly confident associations among
multiple attributes and may overcome some
constraints introduced by decision-tree induction,
which considers only one attribute at a time
– Associative classification has been found to be often
more accurate than some traditional classification
methods, such as C4.5
Associative Classification
• Basic idea
– Mine possible association rules in the form of
• Cond-set (a set of attribute-value pairs) ⇒ class label
– Pattern-based approach
• Mine frequent patterns as candidate condition sets
• Choose a subset of frequent patterns based on
discriminativeness and redundancy
Frequent Pattern vs. Single Feature
The discriminative power of some frequent patterns is
higher than that of single features.
Two Problems
• Mine step
– combinatorial explosion
[Figure: mining the dataset produces a combinatorially large set of frequent patterns]
Two Problems
• Select step
– Issue of discriminative power
• InfoGain is evaluated against the complete dataset, NOT on a subset of examples
• Correlation among patterns is not directly evaluated on their joint predictability
Direct Mining & Selection via Model-based Search Tree
• Basic Flow
Support Vector Machines—An Example
• Find a linear hyperplane (decision boundary) that will separate the data
Example
[Figure: two candidate decision boundaries B1 and B2 with their margin hyperplanes (b11, b12) and (b21, b22); B1 has the larger margin]
Decision boundary: w · x + b = 0
Margin hyperplanes: w · x + b = 1 and w · x + b = −1
y = 1 if w · x + b ≥ 1
y = −1 if w · x + b ≤ −1
Margin = 2 / ||w||
Support Vector Machines
• We want to maximize: Margin = 2 / ||w||
– This is equivalent to minimizing: L(w) = ||w||² / 2
– But subject to the following constraints:
w · xi + b ≥ 1 if yi = 1
w · xi + b ≤ −1 if yi = −1
• This is a constrained optimization problem
– Numerical approaches to solve it (e.g., quadratic programming)
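As a hedged illustration of solving the constrained problem: the slides only say a numerical solver such as quadratic programming is used, so the sketch below delegates to scikit-learn's linear SVC (assumed available; the toy data is made up) and then checks the margin and the constraints.

```python
import numpy as np
from sklearn.svm import SVC   # assumed available; solves the QP internally

# Tiny linearly separable toy data (made up for illustration)
X = np.array([[2.0, 2.0], [3.0, 3.0], [3.5, 1.0],
              [-1.0, -1.5], [-2.0, -1.0], [-1.5, -3.0]])
y = np.array([1, 1, 1, -1, -1, -1])

# A very large C approximates the hard-margin problem:
# minimize ||w||^2 / 2 subject to y_i (w·x_i + b) >= 1
clf = SVC(kernel="linear", C=1e6).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

print("Margin = 2/||w|| =", 2.0 / np.linalg.norm(w))
print("Constraints hold:", bool(np.all(y * (X @ w + b) >= 1 - 1e-6)))
```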
Noisy Data
• What if the problem is not linearly separable?
Slack Variables
• Introduce slack variables ξi ≥ 0 so that the constraints become yi(w · xi + b) ≥ 1 − ξi, and penalize Σ ξi in the objective to tolerate noisy (misclassified) points
Non-linear SVMs: Feature spaces
• General idea: the original feature space can always be
mapped to some higher-dimensional feature space where the
training set is linearly separable:
Φ: x → φ(x)
Ensemble Learning
• Problem
– Given a data set D={x1,x2,…,xn} and their
corresponding labels L={l1,l2,…,ln}
– An ensemble approach computes:
• A set of classifiers {f1,f2,…,fk}, each of which maps data to a
class label: fj(x)=l
• A combination of classifiers f* which minimizes
generalization error: f*(x)= w1f1(x)+ w2f2(x)+…+ wkfk(x)
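A minimal sketch of the weighted combination f*(x) (assuming, for simplicity, base classifiers that output labels in {-1, +1}; the final label is the sign of the weighted sum above):

```python
def ensemble_predict(classifiers, weights, x):
    """f*(x) = w1*f1(x) + w2*f2(x) + ... + wk*fk(x);
    each fj maps x to a label in {-1, +1}, and the sign of the
    weighted sum is returned as the ensemble's label."""
    score = sum(w_j * f_j(x) for f_j, w_j in zip(classifiers, weights))
    return 1 if score >= 0 else -1
```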
Generating Base Classifiers
Bagging (1)
• Bootstrap
– Sampling with replacement
– Contains around 63.2% original records in each
sample
• Bootstrap Aggregation
– Train a classifier on each bootstrap sample
– Use majority voting to determine the class label of
ensemble classifier
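A short sketch of bagging (assumption: `train_base` is a hypothetical helper that fits one base classifier and returns a prediction function):

```python
import numpy as np

def bagging(X, y, train_base, n_classifiers, seed=0):
    """Train one base classifier per bootstrap sample (sampling with
    replacement, so each sample contains ~63.2% of the original records)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    classifiers = []
    for _ in range(n_classifiers):
        idx = rng.integers(0, n, size=n)          # bootstrap sample indices
        classifiers.append(train_base(X[idx], y[idx]))
    return classifiers

def bagging_predict(classifiers, x):
    """Majority vote over the base classifiers."""
    votes = [clf(x) for clf in classifiers]
    return max(set(votes), key=votes.count)
```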
Bagging (2)
Boosting (1)
• Example
– Record 4 is hard to classify
– Its weight is increased, therefore it is more likely to be chosen again in subsequent boosting rounds
Boosting (2)
• AdaBoost
– Initially, set uniform weights on all the records
– At each round
• Create a bootstrap sample based on the weights
• Train a classifier on the sample and apply it on the original training
set
• Records that are wrongly classified will have their weights increased
• Records that are classified correctly will have their weights
decreased
• If the error rate is higher than 50%, start over
– Final prediction is weighted average of all the classifiers
with weight representing the training accuracy
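A rough sketch of the procedure above (assumptions: labels in {-1, +1}, a hypothetical `train_base` helper that returns a vectorized prediction function, and the usual log-odds classifier weight; it follows the bullet list, not a specific AdaBoost variant):

```python
import numpy as np

def adaboost(X, y, train_base, n_rounds, seed=0):
    """y must be in {-1, +1}; train_base(Xs, ys) -> function mapping X -> predictions."""
    rng = np.random.default_rng(seed)
    n = len(X)
    w = np.full(n, 1.0 / n)                      # uniform record weights initially
    classifiers, alphas = [], []
    for _ in range(n_rounds):
        idx = rng.choice(n, size=n, replace=True, p=w)   # weighted bootstrap sample
        clf = train_base(X[idx], y[idx])
        pred = clf(X)                                    # apply to the original training set
        err = w[pred != y].sum()
        if err >= 0.5:                                   # error rate over 50%: start over
            w = np.full(n, 1.0 / n)
            continue
        alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-10))
        w *= np.exp(-alpha * y * pred)   # raise weights of misclassified records, lower the rest
        w /= w.sum()
        classifiers.append(clf)
        alphas.append(alpha)
    return classifiers, alphas

def adaboost_predict(classifiers, alphas, X):
    """Weighted vote of the base classifiers."""
    return np.sign(sum(a * clf(X) for clf, a in zip(classifiers, alphas)))
```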
Boosting (3)
[Figure: classifications (colors) and record weights (point sizes) after 1, 3, and 20 iterations of AdaBoost]
Boosting (4)
• Explanation
– Among the classifiers of the form:
f(x) = Σ i=1..K αi Ci(x)
– AdaBoost finds the one that minimizes the exponential loss on the training set:
Σ j=1..N exp( −yj f(xj) )
Random Forests (1)
• Algorithm
– Choose T—number of trees to grow
– Choose m<M (M is the number of total features) —number of
features used to calculate the best split at each node
(typically 20%)
– For each tree
• Choose a training set by choosing N times (N is the number of training
examples) with replacement from the training set
• For each node, randomly choose m features and calculate the best
split
• Fully grown and not pruned
– Use majority voting among all the trees
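As an illustration, the same recipe expressed with scikit-learn's RandomForestClassifier (the parameter names are scikit-learn's and the toy dataset is made up; this assumes scikit-learn is installed):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data for illustration only
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,     # T: number of trees to grow
    max_features=0.2,     # m: ~20% of the M features considered at each split
    bootstrap=True,       # each tree sees N records drawn with replacement
    random_state=0,
).fit(X, y)               # trees are fully grown (no pruning) by default

print(forest.predict(X[:5]))   # class decided by majority voting among the trees
```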
Random Forests (2)
• Discussions
– Bagging+random features
– Improve accuracy
• Incorporate more diversity and reduce variance
– Improve efficiency
• Searching among subsets of features is much faster than
searching among the complete set
Random Decision Tree (1)
• Single-model learning algorithms
– Fix the structure of the model and minimize some form of error, or maximize data likelihood (e.g., logistic regression, Naïve Bayes, etc.)
– Use some "free-form" functions to match the data given some "preference criteria" such as information gain, Gini index, and MDL (e.g., decision trees, rule-based classifiers, etc.)
• Learning as Encoding
– Make no assumption about the true model, neither parametric form nor
free form
– Do not prefer one base model over the other, just average them
Random Decision Tree (2)
• Algorithm
– At each node, an un-used feature is chosen randomly
• A discrete feature is un-used if it has never been chosen previously on
a given decision path starting from the root to the current node.
• A continuous feature can be chosen multiple times on the same
decision path, but each time a different threshold value is chosen
– We stop when one of the following happens:
• A node becomes too small (<= 3 examples).
• Or the total height of the tree exceeds some limits, such as the total
number of features.
– Prediction
• Simple averaging over multiple trees
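A simplified sketch of the algorithm for discrete features only (continuous features with random thresholds are omitted; the data layout as a list of feature dictionaries and all function names are my own):

```python
import random
from collections import Counter

def build_rdt(X, y, used=frozenset(), depth=0):
    """X: list of dicts mapping feature name -> value; y: list of class labels."""
    unused = [f for f in X[0] if f not in used]
    # stop when the node is too small, no unused features remain,
    # or the height reaches the total number of features
    if len(y) <= 3 or not unused or depth >= len(X[0]):
        return ("leaf", Counter(y))
    feat = random.choice(unused)                 # pick an unused feature at random
    children = {}
    for v in set(x[feat] for x in X):
        idx = [i for i, x in enumerate(X) if x[feat] == v]
        children[v] = build_rdt([X[i] for i in idx], [y[i] for i in idx],
                                used | {feat}, depth + 1)
    return ("node", feat, children, Counter(y))

def tree_proba(tree, x):
    """Class distribution at the leaf that x reaches."""
    while tree[0] == "node":
        _, feat, children, counts = tree
        tree = children.get(x[feat], ("leaf", counts))   # fall back to the node's counts
    counts = tree[1]
    total = sum(counts.values())
    return {c: cnt / total for c, cnt in counts.items()}

def rdt_predict_proba(trees, x):
    """Prediction = simple averaging over multiple trees."""
    classes = {c for t in trees for c in tree_proba(t, x)}
    return {c: sum(tree_proba(t, x).get(c, 0.0) for t in trees) / len(trees)
            for c in classes}
```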
Random Decision Tree (3)
[Figure: an example random decision tree; B3 is a continuous feature]
Random Decision Tree (4)
• Advantages
– Training can be very efficient, particularly for very large datasets
• No cross-validation-based estimation of parameters, as required by some parametric methods
– Natural multi-class probability estimates
– Imposes very few assumptions about the structure of the model
[Figure: the RDT decision boundary looks like the optimal boundary]
Ensemble Learning--Stories of Success
• Million-dollar prize
– Improve the baseline movie
recommendation approach of Netflix
by 10% in accuracy
– The top submissions all combine
several teams and algorithms as an
ensemble
Netflix Prize
• Supervised learning task
– Training data is a set of users and ratings (1,2,3,4,5 stars)
those users have given to movies.
– Construct a classifier that given a user and an unrated
movie, correctly classifies that movie as either 1, 2, 3, 4,
or 5 stars
– $1 million prize for a 10% improvement over Netflix’s
current movie recommender
• Competition
– At first, single-model methods were developed and performance improved
– However, the improvement slowed down
– Later, individuals and teams merged their results, and significant improvement was observed
Leaderboard
Take-away Message
• Various classification approaches
– How they work
– Their strengths and weaknesses
• Algorithms
– Decision tree
– K nearest neighbors
– Naïve Bayes
– Logistic regression
– Rule-based classifier
– SVM
– Ensemble methods