DATA MINING LECTURE 7
Classification
k-nearest neighbor classifier
Naïve Bayes
Logistic Regression
Support Vector Machines
NEAREST NEIGHBOR CLASSIFICATION
Illustrating Classification Task

Training Set:

Tid   Attrib1   Attrib2   Attrib3   Class
1     Yes       Large     125K      No
2     No        Medium    100K      No
3     No        Small     70K       No
6     No        Medium    60K       No

A learning algorithm is applied to the Training Set to build a model; the model is then applied to the Test Set to predict the unknown class labels (deduction).

Test Set:

Tid   Attrib1   Attrib2   Attrib3   Class
11    No        Small     55K       ?
12    Yes       Medium    80K       ?
13    Yes       Large     110K      ?
14    No        Small     95K       ?
15    No        Large     67K       ?
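A minimal sketch of this learn/apply flow, assuming scikit-learn and a crude numeric encoding of the attributes; the decision tree is only a placeholder learning algorithm, and since the four training rows shown above all have class "No", the predicted labels are trivially "No":

```python
from sklearn.tree import DecisionTreeClassifier

# Training Set rows from the slide, with Attrib1/Attrib2 encoded numerically
# (Yes=1 / No=0; Small=0 / Medium=1 / Large=2) and Attrib3 in thousands
X_train = [[1, 2, 125], [0, 1, 100], [0, 0, 70], [0, 1, 60]]
y_train = ["No", "No", "No", "No"]

model = DecisionTreeClassifier().fit(X_train, y_train)   # learning algorithm -> model

# Test Set records 11 and 12, encoded the same way; applying the model predicts
# their unknown class labels (deduction)
X_test = [[0, 0, 55], [1, 1, 80]]
print(model.predict(X_test))
```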
Instance-Based Classifiers
• Store the training records as a set of stored cases, each with attributes Atr1, …, AtrN and a class label
• Use the stored training records to predict the class label of unseen cases (records with the same attributes but no label)
[Figure: a table of stored cases with attributes Atr1 … AtrN and class labels, plus an unseen case with the same attributes and an unknown class]
Instance-Based Classifiers
• Examples:
• Rote-learner
• Memorizes the entire training data and performs classification only if the attributes of a record match one of the training examples exactly
• Nearest-neighbor classifier
• Uses the k "closest" training points (nearest neighbors) to predict the class label of a record
[Figure: a test record among the training records; classification computes the distance from the test record to the training records and uses the nearest neighbors]
• Euclidean distance between two points p and q:
  $d(p, q) = \sqrt{\sum_i (p_i - q_i)^2}$
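A minimal sketch of a k-nearest-neighbor classifier built on this Euclidean distance (NumPy; the training points, labels, and k are hypothetical):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training records."""
    # Euclidean distance from x to every training record
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]               # indices of the k closest records
    votes = Counter(y_train[i] for i in nearest)  # count class labels among them
    return votes.most_common(1)[0][0]

# Hypothetical 2-attribute training data
X_train = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
y_train = ["A", "A", "B", "B"]
print(knn_predict(X_train, y_train, np.array([1.2, 1.9]), k=3))  # -> "A"
```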
Nearest Neighbor Classification…
• Scaling issues
• Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes (see the sketch after this list)
• Example:
• height of a person may vary from 1.5m to 1.8m
• weight of a person may vary from 90lb to 300lb
• income of a person may vary from $10K to $1M
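A minimal sketch of min-max scaling the attributes to [0, 1] before computing distances (NumPy; the three records are hypothetical, with ranges taken from the example above):

```python
import numpy as np

# Hypothetical records: [height (m), weight (lb), income ($)]
X = np.array([[1.5,  90.0,    10_000.0],
              [1.8, 300.0, 1_000_000.0],
              [1.7, 160.0,    50_000.0]])

# Min-max scale each attribute so no single attribute dominates the distance
X_min, X_max = X.min(axis=0), X.max(axis=0)
X_scaled = (X - X_min) / (X_max - X_min)

print(X_scaled)
```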
Nearest Neighbor Classification…
• Problem with Euclidean measure:
• High dimensional data
• curse of dimensionality
• Can produce counter-intuitive results, e.g. for binary vectors:
  111111111110 vs 011111111111: d = 1.4142
  100000000000 vs 000000000001: d = 1.4142
  The first pair shares eleven 1s and the second pair shares none, yet both pairs are at the same Euclidean distance.
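A quick NumPy check of the two distances quoted above:

```python
import numpy as np

a = np.array([int(c) for c in "111111111110"])
b = np.array([int(c) for c in "011111111111"])
c = np.array([int(c) for c in "100000000000"])
d = np.array([int(c) for c in "000000000001"])

print(np.linalg.norm(a - b))  # 1.4142... (the vectors differ in two positions)
print(np.linalg.norm(c - d))  # 1.4142... (same distance, although they share no 1s)
```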
Nearest Neighbor Search
2-dimensional kd-trees
[Figure: a kd-tree recursively splitting the plane; region(u) – all the black points in the subtree of u]
A binary tree:
• Size O(n)
• Depth O(log n)
• Construction time O(n log n)
• Query time: worst case O(n), but for many cases O(log n)
• Generalizes to d dimensions
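A minimal sketch of building a 2-dimensional kd-tree and answering a nearest-neighbor query (plain Python; the point set is hypothetical and degenerate cases such as duplicate coordinates are ignored):

```python
import math

def dist(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def build_kdtree(points, depth=0):
    """Recursively split the points on alternating coordinates (x, then y, ...)."""
    if not points:
        return None
    axis = depth % 2                               # 2-dimensional: alternate x / y
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                         # median point becomes the node
    return {"point": points[mid],
            "left":  build_kdtree(points[:mid], depth + 1),
            "right": build_kdtree(points[mid + 1:], depth + 1)}

def nearest(node, query, depth=0, best=None):
    """Return the stored point closest to `query`, pruning far subtrees."""
    if node is None:
        return best
    point = node["point"]
    if best is None or dist(query, point) < dist(query, best):
        best = point
    axis = depth % 2
    near, far = (node["left"], node["right"]) if query[axis] < point[axis] else (node["right"], node["left"])
    best = nearest(near, query, depth + 1, best)
    # Search the far side only if the splitting line is closer than the best distance so far
    if abs(query[axis] - point[axis]) < dist(query, best):
        best = nearest(far, query, depth + 1, best)
    return best

pts = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(pts)
print(nearest(tree, (9, 2)))   # -> (8, 1)
```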
Support Vector Machines
• Find a linear hyperplane (decision boundary) that will separate the data
[Figure: two candidate decision boundaries B1 and B2 with their margin hyperplanes (b11, b12 for B1 and b21, b22 for B2); the margin is the distance between a boundary's two margin hyperplanes]
• The decision boundary is $\mathbf{w} \cdot \mathbf{x} + b = 0$; the margin hyperplanes are $\mathbf{w} \cdot \mathbf{x} + b = 1$ and $\mathbf{w} \cdot \mathbf{x} + b = -1$
• Decision function: $f(\mathbf{x}) = \begin{cases} 1 & \text{if } \mathbf{w} \cdot \mathbf{x} + b \ge 1 \\ -1 & \text{if } \mathbf{w} \cdot \mathbf{x} + b \le -1 \end{cases}$
• $\text{Margin} = \dfrac{2}{\lVert \mathbf{w} \rVert}$
Support Vector Machines
• We want to maximize: $\text{Margin} = \dfrac{2}{\lVert \mathbf{w} \rVert}$
• Which is equivalent to minimizing: $L(\mathbf{w}) = \dfrac{\lVert \mathbf{w} \rVert^2}{2}$
• But subject to the following constraints: $\mathbf{w} \cdot \mathbf{x}_i + b \ge 1$ if $y_i = 1$ and $\mathbf{w} \cdot \mathbf{x}_i + b \le -1$ if $y_i = -1$; equivalently, $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1$ for every training record $(\mathbf{x}_i, y_i)$
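A minimal sketch of the resulting maximum-margin classifier, fit with scikit-learn's linear SVC on hypothetical toy data; a large C approximates the hard-margin problem above, and the margin is read off as 2/||w||:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical linearly separable 2-d data with labels +1 / -1
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [4.0, 4.5], [5.0, 4.0], [4.5, 5.5]])
y = np.array([-1, -1, -1, 1, 1, 1])

# Large C ~ hard margin: minimize ||w||^2 / 2  subject to  y_i (w . x_i + b) >= 1
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]                       # learned weight vector w
b = clf.intercept_[0]                  # learned bias b
print("margin =", 2 / np.linalg.norm(w))
print("f(x) for [2, 2]:", np.sign(w @ [2.0, 2.0] + b))
```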
• Bayes Theorem: $P(C \mid A) = \dfrac{P(A \mid C)\,P(C)}{P(A)}$
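A quick numeric illustration of the theorem, with hypothetical values for P(A|C), P(C), and P(A):

```python
# Hypothetical probabilities
p_a_given_c = 0.8
p_c = 0.3
p_a = 0.5

p_c_given_a = p_a_given_c * p_c / p_a   # Bayes theorem
print(p_c_given_a)                       # 0.48
```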
Bayesian Classifiers
• Consider each attribute and the class label as random variables

Tid   Refund   Marital Status   Taxable Income   Evade
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes

• Evade (class C): event space {Yes, No}, P(C) = (0.3, 0.7)
• Refund (A1): event space {Yes, No}, P(A1) = (0.3, 0.7)
• Marital Status (A2): event space {Single, Married, Divorced}, P(A2) = (0.4, 0.4, 0.2)
• Taxable Income (A3): event space ℝ, P(A3) ~ Normal(μ, σ)
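As a check, the priors quoted above can be recomputed from the ten records (a pandas sketch):

```python
import pandas as pd

df = pd.DataFrame({
    "Refund":         ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"],
    "Marital Status": ["Single", "Married", "Single", "Married", "Divorced",
                       "Married", "Divorced", "Single", "Married", "Single"],
    "Taxable Income": [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],
    "Evade":          ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
})

print(df["Evade"].value_counts(normalize=True))           # P(C)  -> No 0.7, Yes 0.3
print(df["Refund"].value_counts(normalize=True))          # P(A1) -> No 0.7, Yes 0.3
print(df["Marital Status"].value_counts(normalize=True))  # P(A2) -> 0.4, 0.4, 0.2
```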
Bayesian Classifiers
• Given a record X over attributes (A1, A2,…,An)
• E.g., X = (‘Yes’, ‘Single’, 125K)
• Goal: predict the class C that maximizes the posterior $P(C \mid A_1, A_2, \ldots, A_n) = \dfrac{P(A_1, A_2, \ldots, A_n \mid C)\,P(C)}{P(A_1, A_2, \ldots, A_n)}$
• Naïve Bayes assumption: conditional independence given C, so $P(A_1, A_2, \ldots, A_n \mid C) = P(A_1 \mid C)\,P(A_2 \mid C) \cdots P(A_n \mid C)$
• A continuous attribute such as Taxable Income is modelled with a normal density, $P(A_i = a \mid C) = \dfrac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(a - \mu)^2}{2\sigma^2}}$; e.g., evaluating it for income 120K with a class mean of 110K puts $(120 - 110)^2$ in the exponent
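A minimal sketch of the naïve Bayes computation for a record such as X above, assuming the class-conditional probabilities have already been estimated; every number below is a hypothetical placeholder, not a value derived from the table:

```python
import math

def gaussian(x, mu, sigma):
    """Normal density used for a continuous attribute such as Taxable Income."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# Hypothetical estimates of P(Ai | C) for the two classes (Evade = Yes / No)
cond = {
    "Yes": {"Refund=Yes": 0.1, "Marital=Single": 0.5, "Income": (90.0, 25.0)},   # (mu, sigma)
    "No":  {"Refund=Yes": 0.3, "Marital=Single": 0.3, "Income": (110.0, 50.0)},
}
prior = {"Yes": 0.3, "No": 0.7}

# Record X = ('Yes', 'Single', 125K): multiply the conditionals under the independence assumption
scores = {}
for c in ("Yes", "No"):
    mu, sigma = cond[c]["Income"]
    scores[c] = prior[c] * cond[c]["Refund=Yes"] * cond[c]["Marital=Single"] * gaussian(125.0, mu, sigma)

print(max(scores, key=scores.get), scores)   # class with the largest score (posterior up to P(A))
```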