DHSCH 10
Classification
Introduction
• Previously, all our training samples were labeled: these
samples were said to be “supervised”
• Assume:
• functional forms for underlying probability densities are known
• value of an unknown parameter vector must be learned
• i.e., like chapter 3 but without class labels
• Specific assumptions:
• The samples come from a known number c of classes
• The prior probabilities P(ω_j) for each class are known, j = 1, …, c
• The forms of the class-conditional densities p(x | ω_j, θ_j) are known, where θ = (θ_1, θ_2, …, θ_c)^t, but the values of the parameter vectors θ_j are unknown
• Definition: Identifiability
A density p(x | θ) is said to be identifiable if θ ≠ θ’ implies that there exists an x such that:
  p(x | θ) ≠ p(x | θ’)
• Example (a mixture that is not identifiable): let x be binary and
  P(x | θ) = ½ θ_1^x (1 − θ_1)^{1−x} + ½ θ_2^x (1 − θ_2)^{1−x}
Assume that:
  P(x = 1 | θ) = 0.6   P(x = 0 | θ) = 0.4
We know P(x | θ) but not θ.
We can say: θ_1 + θ_2 = 1.2, but not what θ_1 and θ_2 individually are.
• Similarly, for the continuous mixture with equal priors P(ω_1) = P(ω_2) = ½,
  p(x | θ) = 1/(2√(2π)) exp[−½(x − θ_1)²] + 1/(2√(2π)) exp[−½(x − θ_2)²],
θ_1 and θ_2 can be interchanged without changing p(x | θ).
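• As a quick numerical check (the parameter pairs below are chosen purely for illustration, not from the text), any pair with θ_1 + θ_2 = 1.2 yields exactly the same distribution over x:

    # Minimal sketch: different theta vectors give the same P(x | theta)
    # for the equal-weight Bernoulli mixture above (illustrative values).
    def mixture_pmf(x, theta1, theta2):
        """P(x | theta) = 0.5*Bern(x; theta1) + 0.5*Bern(x; theta2), x in {0, 1}."""
        bern = lambda t: t if x == 1 else (1.0 - t)
        return 0.5 * bern(theta1) + 0.5 * bern(theta2)

    for theta in [(0.5, 0.7), (0.4, 0.8), (0.2, 1.0)]:   # all satisfy theta1 + theta2 = 1.2
        print(theta, mixture_pmf(1, *theta), mixture_pmf(0, *theta))
    # Every pair prints P(x=1) = 0.6 and P(x=0) = 0.4 -> theta is not identifiable.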
ML Estimates
Suppose that we have a set D = {x1, …, xn} of n
unlabeled samples drawn independently from
the mixture density:
  p(x | θ) = Σ_{j=1}^{c} p(x | ω_j, θ_j) P(ω_j)
The MLE is: θ̂ = arg max_θ p(D | θ), with p(D | θ) = Π_{k=1}^{n} p(x_k | θ)
ML Estimates
Then the log-likelihood is:
  l = Σ_{k=1}^{n} ln p(x_k | θ)
The ML estimate θ̂_i must satisfy the conditions:
  P̂(ω_i) = (1/n) Σ_{k=1}^{n} P̂(ω_i | x_k, θ̂)
and
  Σ_{k=1}^{n} P̂(ω_i | x_k, θ̂) ∇_{θ_i} ln p(x_k | ω_i, θ̂_i) = 0
where:
  P̂(ω_i | x_k, θ̂) = p(x_k | ω_i, θ̂_i) P̂(ω_i) / Σ_{j=1}^{c} p(x_k | ω_j, θ̂_j) P̂(ω_j)
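• A minimal sketch of these quantities, assuming one-dimensional Gaussian components with unit variance (an illustrative choice, not from the text), computing the log-likelihood l and the posteriors P̂(ω_i | x_k, θ̂):

    import numpy as np

    # Sketch: mixture log-likelihood and posteriors P(omega_i | x_k, theta_hat)
    # for 1-D Gaussian components with unit variance (illustrative assumption).
    def gaussian(x, mu):
        return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

    def posteriors_and_loglik(x, mus, priors):
        """x: (n,) samples; mus: (c,) component means; priors: (c,) values of P(omega_j)."""
        # joint[k, j] = p(x_k | omega_j, mu_j) * P(omega_j)
        joint = gaussian(x[:, None], mus[None, :]) * priors[None, :]
        mix = joint.sum(axis=1)                    # p(x_k | theta)
        loglik = np.log(mix).sum()                 # l = sum_k ln p(x_k | theta)
        post = joint / mix[:, None]                # P(omega_i | x_k, theta_hat)
        return post, loglik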
Cases for normal mixtures (? = unknown, × = known):
  Case    μ_i    Σ_i    P(ω_i)    c
   1       ?      ×       ×       ×
   2       ?      ?       ?       ×
   3       ?      ?       ?       ?
• Case 1: unknown mean vectors μ_i. The ML condition gives
  μ̂_i = Σ_{k=1}^{n} P(ω_i | x_k, μ̂) x_k / Σ_{k=1}^{n} P(ω_i | x_k, μ̂)    (1)
where P(ω_i | x_k, μ̂) is the fraction of those samples having value x_k that come from the i-th class, and μ̂_i is the average of the samples coming from the i-th class.
• Since equation (1) cannot be solved explicitly, we use the iterative scheme
  μ̂_i(j+1) = Σ_{k=1}^{n} P(ω_i | x_k, μ̂(j)) x_k / Σ_{k=1}^{n} P(ω_i | x_k, μ̂(j))
• Example:
Consider the simple two-component one-dimensional
normal mixture
  p(x | μ_1, μ_2) = 1/(3√(2π)) exp[−½(x − μ_1)²] + 2/(3√(2π)) exp[−½(x − μ_2)²]
(2 clusters!)
Let’s set μ_1 = −2, μ_2 = 2 and draw n = 25 samples sequentially from this mixture. The log-likelihood function is:
  l(μ_1, μ_2) = Σ_{k=1}^{n} ln p(x_k | μ_1, μ_2)
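• A small sketch of this example (the 1/3–2/3 mixing weights and unit variances follow the density above; the random seed and starting means are arbitrary illustrative choices):

    import numpy as np

    rng = np.random.default_rng(0)          # seed chosen arbitrarily for illustration
    n, mu_true = 25, np.array([-2.0, 2.0])
    priors = np.array([1/3, 2/3])

    # Draw 25 samples from the two-component mixture above
    labels = rng.choice(2, size=n, p=priors)
    x = rng.normal(mu_true[labels], 1.0)

    # Iterative scheme for the unknown means (equation (1)), from a rough starting guess
    mu = np.array([-0.1, 0.1])
    for _ in range(50):
        # posteriors, up to a common 1/sqrt(2*pi) factor that cancels in the normalization
        joint = np.exp(-0.5 * (x[:, None] - mu[None, :]) ** 2) * priors
        post = joint / joint.sum(axis=1, keepdims=True)       # P(omega_i | x_k, mu_hat)
        mu = (post * x[:, None]).sum(axis=0) / post.sum(axis=0)  # weighted sample means

    print(mu)   # typically near (-2, 2), up to estimation error from 25 samples
                # (the two means may also come out swapped)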
• Caveat (unknown mean and variance): for the mixture
  p(x | μ, σ) = 1/(2√(2π) σ) exp[−½((x − μ)/σ)²] + 1/(2√(2π)) exp[−½ x²]
setting μ = x_1 and letting σ → 0 makes the first term → ∞ at x_1 while this term → 0 at every other sample, so the likelihood is unbounded and the ML solution is singular; we therefore seek the largest of the finite local maxima.
Iterative scheme:
  μ̂_i = Σ_{k=1}^{n} P̂(ω_i | x_k, θ̂) x_k / Σ_{k=1}^{n} P̂(ω_i | x_k, θ̂)

  Σ̂_i = Σ_{k=1}^{n} P̂(ω_i | x_k, θ̂) (x_k − μ̂_i)(x_k − μ̂_i)^t / Σ_{k=1}^{n} P̂(ω_i | x_k, θ̂)
Where:
  P̂(ω_i | x_k, θ̂) = |Σ̂_i|^{−1/2} exp[−½(x_k − μ̂_i)^t Σ̂_i^{−1}(x_k − μ̂_i)] P̂(ω_i) / Σ_{j=1}^{c} |Σ̂_j|^{−1/2} exp[−½(x_k − μ̂_j)^t Σ̂_j^{−1}(x_k − μ̂_j)] P̂(ω_j)
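• A compact sketch of this iterative scheme for d-dimensional data, together with the earlier condition P̂(ω_i) = (1/n) Σ_k P̂(ω_i | x_k, θ̂); the initialization, iteration count, and small regularization term are illustrative choices, not part of the text:

    import numpy as np

    def gaussian_mixture_iterate(X, c, n_iter=100, seed=0):
        """Iterative scheme above for X of shape (n, d); returns P_hat, mu_hat, Sigma_hat."""
        n, d = X.shape
        rng = np.random.default_rng(seed)                    # arbitrary initialization
        mu = X[rng.choice(n, size=c, replace=False)].astype(float)
        Sigma = np.stack([np.eye(d) for _ in range(c)])
        prior = np.full(c, 1.0 / c)

        for _ in range(n_iter):
            # P_hat(omega_i | x_k, theta_hat); the common (2*pi)^(-d/2) factor cancels below
            post = np.empty((n, c))
            for i in range(c):
                diff = X - mu[i]
                quad = np.einsum('nd,de,ne->n', diff, np.linalg.inv(Sigma[i]), diff)
                post[:, i] = np.linalg.det(Sigma[i]) ** -0.5 * np.exp(-0.5 * quad) * prior[i]
            post /= post.sum(axis=1, keepdims=True)

            # Re-estimate priors, means, and covariances with the posteriors as weights
            w = post.sum(axis=0)
            prior = w / n
            mu = (post.T @ X) / w[:, None]
            for i in range(c):
                diff = X - mu[i]
                Sigma[i] = (post[:, i, None] * diff).T @ diff / w[i] + 1e-6 * np.eye(d)
        return prior, mu, Sigma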
K-Means Clustering
• Find the mean μ̂_m nearest to x_k and approximate P̂(ω_i | x_k, θ̂) as:
  P̂(ω_i | x_k, θ̂) ≈ 1 if i = m, 0 otherwise
• This leads to the following procedure for finding μ̂_1, μ̂_2, …, μ̂_c:
  Begin
    initialize n, c, μ_1, μ_2, …, μ_c (randomly selected)
    do
      classify the n samples according to the nearest μ_i
      recompute μ_i
    until no change in μ_i
    return μ_1, μ_2, …, μ_c
  End
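• A direct sketch of the pseudocode above, using the squared Euclidean distance for the “nearest mean” test:

    import numpy as np

    def k_means(X, c, seed=0):
        """K-means as in the pseudocode: X is (n, d), c is the number of clusters."""
        rng = np.random.default_rng(seed)
        mu = X[rng.choice(len(X), size=c, replace=False)]   # randomly selected initial means
        while True:
            # classify the n samples according to the nearest mu_i
            labels = np.argmin(((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1), axis=1)
            # recompute mu_i (keep the old mean if a cluster received no samples)
            new_mu = np.array([X[labels == i].mean(axis=0) if np.any(labels == i) else mu[i]
                               for i in range(c)])
            if np.allclose(new_mu, mu):                     # until no change in mu_i
                return new_mu, labels
            mu = new_mu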
• (Bayesian vs. ML) If the posterior p(θ | D) peaks sharply at θ = θ̂, then
  p(x | D) ≈ Σ_{j=1}^{c} p(x | ω_j, θ̂_j) P(ω_j)
• Therefore, the ML estimate is justified as an approximation.
• Both approaches coincide if large amounts of data are available.
• In small sample size problems they can agree or not, depending on the form of the distributions.
• The ML method is typically easier to implement than the Bayesian one.
• Recursive Bayesian learning (equations from the previous slides):
  • Unsupervised learning (UL):
    p(θ | D^n) = Σ_{j=1}^{c} p(x_n | ω_j, θ_j) P(ω_j) · p(θ | D^{n−1}) / ∫ Σ_{j=1}^{c} p(x_n | ω_j, θ_j) P(ω_j) · p(θ | D^{n−1}) dθ
  • Supervised learning (SL), where the sample is known to come from ω_1:
    p(θ | D^n) = p(x_n | ω_1, θ_1) p(θ | D^{n−1}) / ∫ p(x_n | ω_1, θ_1) p(θ | D^{n−1}) dθ
• Comparing the two equations, we see that observing an additional sample changes the estimate of θ.
• Ignoring the denominator, which is independent of θ, the only significant difference is that:
  • in the SL case, we multiply the “prior” density for θ by the component density p(x_n | ω_1, θ_1)
  • in the UL case, we multiply the “prior” density by the whole mixture Σ_{j=1}^{c} p(x_n | ω_j, θ_j) P(ω_j)
• Assuming that the sample did come from class ω_1, the effect of not knowing this category is to diminish the influence of x_n in changing θ for category ω_1.
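• A minimal sketch contrasting the two updates on a discretized grid over a single unknown mean θ_1; the second component’s fixed mean, the ½–½ mixing weights, the flat prior, and the sample value are illustrative assumptions:

    import numpy as np

    # Discretized posterior over an unknown mean theta_1; the second component is
    # fixed at mean 2 and the mixing weights are 1/2 each (illustrative choices).
    theta_grid = np.linspace(-6, 6, 601)
    prior = np.ones_like(theta_grid) / len(theta_grid)     # p(theta_1 | D^{n-1}), flat

    def normal(x, mu):
        return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

    x_n = -1.5                                             # new sample, assumed from omega_1

    # SL update: multiply by the component density p(x_n | omega_1, theta_1)
    post_sl = normal(x_n, theta_grid) * prior
    post_sl /= post_sl.sum()

    # UL update: multiply by the whole mixture sum_j p(x_n | omega_j, theta_j) P(omega_j)
    post_ul = (0.5 * normal(x_n, theta_grid) + 0.5 * normal(x_n, 2.0)) * prior
    post_ul /= post_ul.sum()

    # The UL posterior is flatter: not knowing the category diminishes x_n's influence.
    print(post_sl.max(), post_ul.max())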
Data Clustering
• Structures of multidimensional patterns are important
for clustering
• If we know that data come from a specific distribution,
such data can be represented by a compact set of
parameters (sufficient statistics)
Caveat
• If little prior knowledge can be assumed, the
assumption of a parametric form is meaningless:
• Issue: imposing structure vs finding structure
Similarity measures
• What do we mean by similarity?
• Two issues:
• How to measure the similarity between samples?
• How to evaluate a partitioning of a set into clusters?
• Achieving invariance:
• normalize the data, e.g., such that they all have zero
means and unit variance,
• or use principal components for invariance to rotation
• A broad class of metrics is the Minkowski metric
  d(x, x’) = ( Σ_{k=1}^{d} |x_k − x’_k|^q )^{1/q}
where q ≥ 1 is a selectable parameter:
  q = 1: Manhattan or city-block metric
  q = 2: Euclidean metric
• One can also use a nonmetric similarity function s(x, x’) to compare two vectors.
• In the case of binary-valued features we have, e.g.:
  s(x, x’) = x^t x’ / d
  s(x, x’) = x^t x’ / (x^t x + x’^t x’ − x^t x’)    (Tanimoto distance)
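• A short sketch of these measures (the test vectors are made up for illustration; the Tanimoto measure assumes binary-valued features, as above):

    import numpy as np

    def minkowski(x, xp, q=2):
        """Minkowski metric; q=1 gives the city-block metric, q=2 the Euclidean metric."""
        return np.sum(np.abs(x - xp) ** q) ** (1.0 / q)

    def tanimoto(x, xp):
        """Tanimoto similarity for binary-valued feature vectors."""
        shared = x @ xp
        return shared / (x @ x + xp @ xp - shared)

    x, xp = np.array([1, 0, 1, 1]), np.array([1, 1, 0, 1])
    print(minkowski(x, xp, q=1), minkowski(x, xp, q=2), tanimoto(x, xp))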
Clustering as optimization
• The second issue: how to evaluate a partitioning of a set into clusters?
• Sum-of-squared-error criterion:
  J_e = Σ_{i=1}^{c} Σ_{x∈D_i} ||x − m_i||²
where m_i is the mean of the samples in D_i.
• Results:
  • Good when clusters form well-separated compact clouds
  • Bad with large differences in the number of samples in different clusters
• Scatter criteria
  • Use the scatter matrices from multiple discriminant analysis, i.e., the within-cluster scatter matrix S_W and the between-cluster scatter matrix S_B:
    S_W = Σ_{i=1}^{c} Σ_{x∈D_i} (x − m_i)(x − m_i)^t,   S_B = Σ_{i=1}^{c} n_i (m_i − m)(m_i − m)^t,   S_T = S_B + S_W
  • Note: S_T does not depend on the partitioning
  • Two approaches:
    • minimize the within-cluster scatter tr[S_W]
    • maximize the between-cluster scatter tr[S_B]
  • Since tr[S_T] = tr[S_W] + tr[S_B] and tr[S_T] is independent of the partitioning, no new results are obtained by maximizing tr[S_B]; it is equivalent to minimizing tr[S_W]
where m is the total mean vector:
  m = (1/n) Σ_{x∈D} x = (1/n) Σ_{i=1}^{c} n_i m_i
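• A brief sketch computing S_W, S_B, and S_T for a labeled partition and checking S_T = S_W + S_B (random data used purely for illustration):

    import numpy as np

    def scatter_matrices(X, labels):
        """Within-cluster (S_W), between-cluster (S_B), and total (S_T) scatter matrices."""
        m = X.mean(axis=0)                                 # total mean vector
        S_W = np.zeros((X.shape[1], X.shape[1]))
        S_B = np.zeros_like(S_W)
        for i in np.unique(labels):
            Xi = X[labels == i]
            mi = Xi.mean(axis=0)
            S_W += (Xi - mi).T @ (Xi - mi)
            S_B += len(Xi) * np.outer(mi - m, mi - m)
        S_T = (X - m).T @ (X - m)
        return S_W, S_B, S_T

    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 2))
    labels = rng.integers(0, 3, size=60)
    S_W, S_B, S_T = scatter_matrices(X, labels)
    print(np.allclose(S_T, S_W + S_B))                     # True: S_T = S_W + S_B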
Iterative optimization
• Clustering can be posed as a discrete optimization problem: find the partition of the n samples into c clusters that minimizes the chosen criterion, e.g., J_e = Σ_{i=1}^{c} Σ_{x∈D_i} ||x − m_i||²; in practice the criterion is improved iteratively, by moving samples between clusters, rather than by exhaustive search.
Hierarchical Clustering
• Many times, clusters are not disjoint, but a
cluster may have subclusters, in turn having sub-
subclusters, etc.
• Consider a sequence of partitions of the n
samples into c clusters
• The first is a partition into n clusters, each one
containing exactly one sample
• The second is a partition into n-1 clusters, the third
into n-2, and so on, until the n-th in which there is only
one cluster containing all of the samples
• At the level k in the sequence, c = n-k+1.
• Distance measures between clusters, e.g.:
  d_avg(D_i, D_j) = (1/(n_i n_j)) Σ_{x∈D_i} Σ_{x’∈D_j} ||x − x’||
  d_mean(D_i, D_j) = ||m_i − m_j||
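• A short sketch of these two inter-cluster distances (the cluster contents are made up for illustration):

    import numpy as np

    def d_avg(Di, Dj):
        """Average pairwise distance between samples of the two clusters."""
        diffs = Di[:, None, :] - Dj[None, :, :]
        return np.linalg.norm(diffs, axis=-1).mean()

    def d_mean(Di, Dj):
        """Distance between the cluster means."""
        return np.linalg.norm(Di.mean(axis=0) - Dj.mean(axis=0))

    Di = np.array([[0.0, 0.0], [1.0, 0.0]])
    Dj = np.array([[4.0, 3.0], [5.0, 3.0], [4.5, 4.0]])
    print(d_avg(Di, Dj), d_mean(Di, Dj))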
Graph-theoretic methods
• Caveat: no uniform way of posing clustering as a
graph theoretic problem
• Generalize from a threshold distance to arbitrary
similarity measures.
• If s0 is a threshold value, we can say that xi is
similar to xj if s(xi, xj) > s0.
• We can define a similarity matrix S = [s_ij], where
  s_ij = 1 if s(x_i, x_j) > s_0, and 0 otherwise
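• A final sketch building such a thresholded similarity matrix; the similarity function used here (negative Euclidean distance) is an illustrative choice:

    import numpy as np

    def similarity_matrix(X, s0, sim=None):
        """S = [s_ij] with s_ij = 1 if s(x_i, x_j) > s0, else 0."""
        if sim is None:
            # Illustrative similarity: negative Euclidean distance
            sim = lambda a, b: -np.linalg.norm(a - b)
        n = len(X)
        S = np.zeros((n, n), dtype=int)
        for i in range(n):
            for j in range(n):
                S[i, j] = 1 if sim(X[i], X[j]) > s0 else 0
        return S

    X = np.array([[0.0, 0.0], [0.5, 0.0], [5.0, 5.0]])
    print(similarity_matrix(X, s0=-1.0))   # links the two nearby points (and each point to itself)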