DHSCH 10
Classification
Introduction
• Previously, all our training samples were labeled: these
samples were said to be “supervised”
• Assume:
• functional forms for underlying probability densities are known
• value of an unknown parameter vector must be learned
• i.e., like chapter 3 but without class labels
• Specific assumptions:
• The samples come from a known number c of classes
• The prior probabilities P(ω_j) for each class are known, j = 1, …, c
• The forms of the class-conditional densities p(x | ω_j, θ_j) are known, where θ = (θ_1, θ_2, …, θ_c)^t, but the values of the parameter vectors θ_j are unknown
• Definition: Identifiability
A density p(x | θ) is said to be identifiable if θ ≠ θ’ implies that there exists an x such that:
  p(x | θ) ≠ p(x | θ’)
• Example (a mixture that is not identifiable): let x be binary and
  P(x | θ) = ½ θ_1^x (1 − θ_1)^{1−x} + ½ θ_2^x (1 − θ_2)^{1−x}
Assume that:
  P(x = 1 | θ) = 0.6   P(x = 0 | θ) = 0.4
We know P(x | θ) but not θ.
We can say: θ_1 + θ_2 = 1.2, but not what θ_1 and θ_2 individually are.
• Similarly, for the continuous mixture with equal priors P(ω_1) = P(ω_2) = ½,
  p(x | θ) = 1/(2√(2π)) exp[−½(x − θ_1)²] + 1/(2√(2π)) exp[−½(x − θ_2)²],
θ_1 and θ_2 can be interchanged without changing p(x | θ).
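• As a quick numerical check (the parameter pairs below are chosen purely for illustration, not from the text), any pair with θ_1 + θ_2 = 1.2 yields exactly the same distribution over x:

    # Minimal sketch: different theta vectors give the same P(x | theta)
    # for the equal-weight Bernoulli mixture above (illustrative values).
    def mixture_pmf(x, theta1, theta2):
        """P(x | theta) = 0.5*Bern(x; theta1) + 0.5*Bern(x; theta2), x in {0, 1}."""
        bern = lambda t: t if x == 1 else (1.0 - t)
        return 0.5 * bern(theta1) + 0.5 * bern(theta2)

    for theta in [(0.5, 0.7), (0.4, 0.8), (0.2, 1.0)]:   # all satisfy theta1 + theta2 = 1.2
        print(theta, mixture_pmf(1, *theta), mixture_pmf(0, *theta))
    # Every pair prints P(x=1) = 0.6 and P(x=0) = 0.4 -> theta is not identifiable.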
ML Estimates
Suppose that we have a set D = {x1, …, xn} of n
unlabeled samples drawn independently from
the mixture density:
  p(x | θ) = Σ_{j=1}^{c} p(x | ω_j, θ_j) P(ω_j)
The MLE is: θ̂ = arg max_θ p(D | θ), with p(D | θ) = Π_{k=1}^{n} p(x_k | θ)
ML Estimates
Then the log-likelihood is:
  l = Σ_{k=1}^{n} ln p(x_k | θ)
The ML estimate θ̂_i must satisfy the conditions:
  P̂(ω_i) = (1/n) Σ_{k=1}^{n} P̂(ω_i | x_k, θ̂)
and
  Σ_{k=1}^{n} P̂(ω_i | x_k, θ̂) ∇_{θ_i} ln p(x_k | ω_i, θ̂_i) = 0
where:
  P̂(ω_i | x_k, θ̂) = p(x_k | ω_i, θ̂_i) P̂(ω_i) / Σ_{j=1}^{c} p(x_k | ω_j, θ̂_j) P̂(ω_j)
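• A minimal sketch of these quantities, assuming one-dimensional Gaussian components with unit variance (an illustrative choice, not from the text), computing the log-likelihood l and the posteriors P̂(ω_i | x_k, θ̂):

    import numpy as np

    # Sketch: mixture log-likelihood and posteriors P(omega_i | x_k, theta_hat)
    # for 1-D Gaussian components with unit variance (illustrative assumption).
    def gaussian(x, mu):
        return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

    def posteriors_and_loglik(x, mus, priors):
        """x: (n,) samples; mus: (c,) component means; priors: (c,) values of P(omega_j)."""
        # joint[k, j] = p(x_k | omega_j, mu_j) * P(omega_j)
        joint = gaussian(x[:, None], mus[None, :]) * priors[None, :]
        mix = joint.sum(axis=1)                    # p(x_k | theta)
        loglik = np.log(mix).sum()                 # l = sum_k ln p(x_k | theta)
        post = joint / mix[:, None]                # P(omega_i | x_k, theta_hat)
        return post, loglik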
Cases for normal mixtures (? = unknown, × = known):
  Case    μ_i    Σ_i    P(ω_i)    c
   1       ?      ×       ×       ×
   2       ?      ?       ?       ×
   3       ?      ?       ?       ?
• Case 1: unknown mean vectors μ_i. The ML condition gives
  μ̂_i = Σ_{k=1}^{n} P(ω_i | x_k, μ̂) x_k / Σ_{k=1}^{n} P(ω_i | x_k, μ̂)    (1)
where P(ω_i | x_k, μ̂) is the fraction of those samples having value x_k that come from the i-th class, and μ̂_i is the average of the samples coming from the i-th class.
• Since equation (1) cannot be solved explicitly, we use the iterative scheme
  μ̂_i(j+1) = Σ_{k=1}^{n} P(ω_i | x_k, μ̂(j)) x_k / Σ_{k=1}^{n} P(ω_i | x_k, μ̂(j))
• Example:
Consider the simple two-component one-dimensional
normal mixture
  p(x | μ_1, μ_2) = 1/(3√(2π)) exp[−½(x − μ_1)²] + 2/(3√(2π)) exp[−½(x − μ_2)²]
(2 clusters!)
Let’s set μ_1 = −2, μ_2 = 2 and draw n = 25 samples sequentially from this mixture. The log-likelihood function is:
  l(μ_1, μ_2) = Σ_{k=1}^{n} ln p(x_k | μ_1, μ_2)
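• A small sketch of this example (the 1/3–2/3 mixing weights and unit variances follow the density above; the random seed and starting means are arbitrary illustrative choices):

    import numpy as np

    rng = np.random.default_rng(0)          # seed chosen arbitrarily for illustration
    n, mu_true = 25, np.array([-2.0, 2.0])
    priors = np.array([1/3, 2/3])

    # Draw 25 samples from the two-component mixture above
    labels = rng.choice(2, size=n, p=priors)
    x = rng.normal(mu_true[labels], 1.0)

    # Iterative scheme for the unknown means (equation (1)), from a rough starting guess
    mu = np.array([-0.1, 0.1])
    for _ in range(50):
        # posteriors, up to a common 1/sqrt(2*pi) factor that cancels in the normalization
        joint = np.exp(-0.5 * (x[:, None] - mu[None, :]) ** 2) * priors
        post = joint / joint.sum(axis=1, keepdims=True)       # P(omega_i | x_k, mu_hat)
        mu = (post * x[:, None]).sum(axis=0) / post.sum(axis=0)  # weighted sample means

    print(mu)   # typically near (-2, 2), up to estimation error from 25 samples
                # (the two means may also come out swapped)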
• Caveat (unknown mean and variance): for the mixture
  p(x | μ, σ) = 1/(2√(2π) σ) exp[−½((x − μ)/σ)²] + 1/(2√(2π)) exp[−½ x²]
setting μ = x_1 and letting σ → 0 makes the first term → ∞ at x_1 while this term → 0 at every other sample, so the likelihood is unbounded and the ML solution is singular; we therefore seek the largest of the finite local maxima.
Iterative scheme:
  μ̂_i = Σ_{k=1}^{n} P̂(ω_i | x_k, θ̂) x_k / Σ_{k=1}^{n} P̂(ω_i | x_k, θ̂)

  Σ̂_i = Σ_{k=1}^{n} P̂(ω_i | x_k, θ̂) (x_k − μ̂_i)(x_k − μ̂_i)^t / Σ_{k=1}^{n} P̂(ω_i | x_k, θ̂)
Where:
  P̂(ω_i | x_k, θ̂) = |Σ̂_i|^{−1/2} exp[−½(x_k − μ̂_i)^t Σ̂_i^{−1}(x_k − μ̂_i)] P̂(ω_i) / Σ_{j=1}^{c} |Σ̂_j|^{−1/2} exp[−½(x_k − μ̂_j)^t Σ̂_j^{−1}(x_k − μ̂_j)] P̂(ω_j)
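• A compact sketch of this iterative scheme for d-dimensional data, together with the earlier condition P̂(ω_i) = (1/n) Σ_k P̂(ω_i | x_k, θ̂); the initialization, iteration count, and small regularization term are illustrative choices, not part of the text:

    import numpy as np

    def gaussian_mixture_iterate(X, c, n_iter=100, seed=0):
        """Iterative scheme above for X of shape (n, d); returns P_hat, mu_hat, Sigma_hat."""
        n, d = X.shape
        rng = np.random.default_rng(seed)                    # arbitrary initialization
        mu = X[rng.choice(n, size=c, replace=False)].astype(float)
        Sigma = np.stack([np.eye(d) for _ in range(c)])
        prior = np.full(c, 1.0 / c)

        for _ in range(n_iter):
            # P_hat(omega_i | x_k, theta_hat); the common (2*pi)^(-d/2) factor cancels below
            post = np.empty((n, c))
            for i in range(c):
                diff = X - mu[i]
                quad = np.einsum('nd,de,ne->n', diff, np.linalg.inv(Sigma[i]), diff)
                post[:, i] = np.linalg.det(Sigma[i]) ** -0.5 * np.exp(-0.5 * quad) * prior[i]
            post /= post.sum(axis=1, keepdims=True)

            # Re-estimate priors, means, and covariances with the posteriors as weights
            w = post.sum(axis=0)
            prior = w / n
            mu = (post.T @ X) / w[:, None]
            for i in range(c):
                diff = X - mu[i]
                Sigma[i] = (post[:, i, None] * diff).T @ diff / w[i] + 1e-6 * np.eye(d)
        return prior, mu, Sigma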
K-Means Clustering
• Find the mean μ̂_m nearest to x_k and approximate P̂(ω_i | x_k, θ̂) as:
  P̂(ω_i | x_k, θ̂) ≈ 1 if i = m, 0 otherwise
• This leads to the following procedure for finding μ̂_1, μ̂_2, …, μ̂_c:
  Begin
    initialize n, c, μ_1, μ_2, …, μ_c (randomly selected)
    do
      classify the n samples according to the nearest μ_i
      recompute μ_i
    until no change in μ_i
    return μ_1, μ_2, …, μ_c
  End
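• A direct sketch of the pseudocode above, using the squared Euclidean distance for the “nearest mean” test:

    import numpy as np

    def k_means(X, c, seed=0):
        """K-means as in the pseudocode: X is (n, d), c is the number of clusters."""
        rng = np.random.default_rng(seed)
        mu = X[rng.choice(len(X), size=c, replace=False)]   # randomly selected initial means
        while True:
            # classify the n samples according to the nearest mu_i
            labels = np.argmin(((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1), axis=1)
            # recompute mu_i (keep the old mean if a cluster received no samples)
            new_mu = np.array([X[labels == i].mean(axis=0) if np.any(labels == i) else mu[i]
                               for i in range(c)])
            if np.allclose(new_mu, mu):                     # until no change in mu_i
                return new_mu, labels
            mu = new_mu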
• (Bayesian vs. ML) If the posterior p(θ | D) peaks sharply at θ = θ̂, then
  p(x | D) ≈ Σ_{j=1}^{c} p(x | ω_j, θ̂_j) P(ω_j)
• Therefore, the ML estimate is justified as an approximation.
• Both approaches coincide if large amounts of data are available.
• In small sample size problems they can agree or not, depending on the form of the distributions.
• The ML method is typically easier to implement than the Bayesian one.
• Recursive Bayesian learning (equations from the previous slides):
  • Unsupervised learning (UL):
    p(θ | D^n) = Σ_{j=1}^{c} p(x_n | ω_j, θ_j) P(ω_j) · p(θ | D^{n−1}) / ∫ Σ_{j=1}^{c} p(x_n | ω_j, θ_j) P(ω_j) · p(θ | D^{n−1}) dθ
  • Supervised learning (SL), where the sample is known to come from ω_1:
    p(θ | D^n) = p(x_n | ω_1, θ_1) p(θ | D^{n−1}) / ∫ p(x_n | ω_1, θ_1) p(θ | D^{n−1}) dθ
• Comparing the two equations, we see that observing an additional sample changes the estimate of θ.
• Ignoring the denominator, which is independent of θ, the only significant difference is that:
  • in the SL case, we multiply the “prior” density for θ by the component density p(x_n | ω_1, θ_1)
  • in the UL case, we multiply the “prior” density by the whole mixture Σ_{j=1}^{c} p(x_n | ω_j, θ_j) P(ω_j)
• Assuming that the sample did come from class ω_1, the effect of not knowing this category is to diminish the influence of x_n in changing θ for category ω_1.
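• A minimal sketch contrasting the two updates on a discretized grid over a single unknown mean θ_1; the second component’s fixed mean, the ½–½ mixing weights, the flat prior, and the sample value are illustrative assumptions:

    import numpy as np

    # Discretized posterior over an unknown mean theta_1; the second component is
    # fixed at mean 2 and the mixing weights are 1/2 each (illustrative choices).
    theta_grid = np.linspace(-6, 6, 601)
    prior = np.ones_like(theta_grid) / len(theta_grid)     # p(theta_1 | D^{n-1}), flat

    def normal(x, mu):
        return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

    x_n = -1.5                                             # new sample, assumed from omega_1

    # SL update: multiply by the component density p(x_n | omega_1, theta_1)
    post_sl = normal(x_n, theta_grid) * prior
    post_sl /= post_sl.sum()

    # UL update: multiply by the whole mixture sum_j p(x_n | omega_j, theta_j) P(omega_j)
    post_ul = (0.5 * normal(x_n, theta_grid) + 0.5 * normal(x_n, 2.0)) * prior
    post_ul /= post_ul.sum()

    # The UL posterior is flatter: not knowing the category diminishes x_n's influence.
    print(post_sl.max(), post_ul.max())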
Data Clustering
• Structures of multidimensional patterns are important
for clustering
• If we know that data come from a specific distribution,
such data can be represented by a compact set of
parameters (sufficient statistics)
Caveat
• If little prior knowledge can be assumed, the
assumption of a parametric form is meaningless:
• Issue: imposing structure vs finding structure
Similarity measures
• What do we mean by similarity?
• Two issues:
• How to measure the similarity between samples?
• How to evaluate a partitioning of a set into clusters?
• Achieving invariance:
• normalize the data, e.g., such that they all have zero
means and unit variance,
• or use principal components for invariance to rotation
• A broad class of metrics is the Minkowski metric
  d(x, x’) = ( Σ_{k=1}^{d} |x_k − x’_k|^q )^{1/q}
where q ≥ 1 is a selectable parameter:
  q = 1: Manhattan or city-block metric
  q = 2: Euclidean metric
• One can also use a nonmetric similarity function s(x, x’) to compare two vectors.
• In the case of binary-valued features we have, e.g.:
  s(x, x’) = x^t x’ / d
  s(x, x’) = x^t x’ / (x^t x + x’^t x’ − x^t x’)    (Tanimoto distance)
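• A short sketch of these measures (the test vectors are made up for illustration; the Tanimoto measure assumes binary-valued features, as above):

    import numpy as np

    def minkowski(x, xp, q=2):
        """Minkowski metric; q=1 gives the city-block metric, q=2 the Euclidean metric."""
        return np.sum(np.abs(x - xp) ** q) ** (1.0 / q)

    def tanimoto(x, xp):
        """Tanimoto similarity for binary-valued feature vectors."""
        shared = x @ xp
        return shared / (x @ x + xp @ xp - shared)

    x, xp = np.array([1, 0, 1, 1]), np.array([1, 1, 0, 1])
    print(minkowski(x, xp, q=1), minkowski(x, xp, q=2), tanimoto(x, xp))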
Clustering as optimization
• The second issue: how to evaluate a partitioning of a set into clusters?
• Sum-of-squared-error criterion:
  J_e = Σ_{i=1}^{c} Σ_{x∈D_i} ||x − m_i||²
where m_i is the mean of the samples in D_i.
• Results:
  • Good when clusters form well-separated compact clouds
  • Bad with large differences in the number of samples in different clusters
• Scatter criteria
  • Use the scatter matrices from multiple discriminant analysis, i.e., the within-cluster scatter matrix S_W and the between-cluster scatter matrix S_B:
    S_W = Σ_{i=1}^{c} Σ_{x∈D_i} (x − m_i)(x − m_i)^t,   S_B = Σ_{i=1}^{c} n_i (m_i − m)(m_i − m)^t,   S_T = S_B + S_W
  • Note: S_T does not depend on the partitioning
  • Two approaches:
    • minimize the within-cluster scatter tr[S_W]
    • maximize the between-cluster scatter tr[S_B]
  • Since tr[S_T] = tr[S_W] + tr[S_B] and tr[S_T] is independent of the partitioning, no new results are obtained by maximizing tr[S_B]; it is equivalent to minimizing tr[S_W]
where m is the total mean vector:
  m = (1/n) Σ_{x∈D} x = (1/n) Σ_{i=1}^{c} n_i m_i
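• A brief sketch computing S_W, S_B, and S_T for a labeled partition and checking S_T = S_W + S_B (random data used purely for illustration):

    import numpy as np

    def scatter_matrices(X, labels):
        """Within-cluster (S_W), between-cluster (S_B), and total (S_T) scatter matrices."""
        m = X.mean(axis=0)                                 # total mean vector
        S_W = np.zeros((X.shape[1], X.shape[1]))
        S_B = np.zeros_like(S_W)
        for i in np.unique(labels):
            Xi = X[labels == i]
            mi = Xi.mean(axis=0)
            S_W += (Xi - mi).T @ (Xi - mi)
            S_B += len(Xi) * np.outer(mi - m, mi - m)
        S_T = (X - m).T @ (X - m)
        return S_W, S_B, S_T

    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 2))
    labels = rng.integers(0, 3, size=60)
    S_W, S_B, S_T = scatter_matrices(X, labels)
    print(np.allclose(S_T, S_W + S_B))                     # True: S_T = S_W + S_B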
Iterative optimization
• Clustering can be posed as a discrete optimization problem: find the partition of the n samples into c clusters that minimizes the chosen criterion, e.g., J_e = Σ_{i=1}^{c} Σ_{x∈D_i} ||x − m_i||²; in practice the criterion is improved iteratively, by moving samples between clusters, rather than by exhaustive search.
Hierarchical Clustering
• Many times, clusters are not disjoint, but a
cluster may have subclusters, in turn having sub-
subclusters, etc.
• Consider a sequence of partitions of the n
samples into c clusters
• The first is a partition into n clusters, each one
containing exactly one sample
• The second is a partition into n-1 clusters, the third
into n-2, and so on, until the n-th in which there is only
one cluster containing all of the samples
• At the level k in the sequence, c = n-k+1.
• Distance measures between clusters, e.g.:
  d_avg(D_i, D_j) = (1/(n_i n_j)) Σ_{x∈D_i} Σ_{x’∈D_j} ||x − x’||
  d_mean(D_i, D_j) = ||m_i − m_j||
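• A short sketch of these two inter-cluster distances (the cluster contents are made up for illustration):

    import numpy as np

    def d_avg(Di, Dj):
        """Average pairwise distance between samples of the two clusters."""
        diffs = Di[:, None, :] - Dj[None, :, :]
        return np.linalg.norm(diffs, axis=-1).mean()

    def d_mean(Di, Dj):
        """Distance between the cluster means."""
        return np.linalg.norm(Di.mean(axis=0) - Dj.mean(axis=0))

    Di = np.array([[0.0, 0.0], [1.0, 0.0]])
    Dj = np.array([[4.0, 3.0], [5.0, 3.0], [4.5, 4.0]])
    print(d_avg(Di, Dj), d_mean(Di, Dj))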
Graph-theoretic methods
• Caveat: no uniform way of posing clustering as a
graph theoretic problem
• Generalize from a threshold distance to arbitrary
similarity measures.
• If s0 is a threshold value, we can say that xi is
similar to xj if s(xi, xj) > s0.
• We can define a similarity matrix S = [s_ij], where
  s_ij = 1 if s(x_i, x_j) > s_0, and 0 otherwise
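• A final sketch building such a thresholded similarity matrix; the similarity function used here (negative Euclidean distance) is an illustrative choice:

    import numpy as np

    def similarity_matrix(X, s0, sim=None):
        """S = [s_ij] with s_ij = 1 if s(x_i, x_j) > s0, else 0."""
        if sim is None:
            # Illustrative similarity: negative Euclidean distance
            sim = lambda a, b: -np.linalg.norm(a - b)
        n = len(X)
        S = np.zeros((n, n), dtype=int)
        for i in range(n):
            for j in range(n):
                S[i, j] = 1 if sim(X[i], X[j]) > s0 else 0
        return S

    X = np.array([[0.0, 0.0], [0.5, 0.0], [5.0, 5.0]])
    print(similarity_matrix(X, s0=-1.0))   # links the two nearby points (and each point to itself)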