LecN10 R
CLUSTERING
• Details on clustering
• K-means

ä Problem: we are given n data items: x1, x2, · · · , xn. We would like to ‘cluster’ them, i.e., group them so that each group or cluster contains items that are similar in some sense.
[Figure: 2-D scatter plot of data items grouped into clusters labeled Superconductors, Catalytic, Multi-ferroics, and Thermo-electric]
10-2 – Clustering
ä A basic algorithm that uses Euclidean distance:

1. Select p initial centers: c1, c2, ..., cp for classes 1, 2, · · · , p
2. For each xi: determine the class of xi as argmin_k ‖xi − ck‖
3. Redefine each ck to be the centroid of class k
4. Repeat until convergence

[Figure: data points grouped around three centers c1, c2, c3]

ä Simple algorithm
ä Works well (gives good results) but can be slow
ä Performance depends on initialization

ä Class of methods that perform clustering by exploiting a graph that describes the similarities between any two items in the data.
ä Need to:
1. decide what nodes are in the neighborhood of a given node;
2. quantify their similarities by assigning a weight to any pair of nodes.
ä Example: For text data, one can decide that any columns i and j with a cosine greater than 0.95 are ‘similar’ and assign that cosine value to wij.
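The four K-means steps above can be sketched in a few lines of NumPy. This is a minimal illustration only (initial centers taken as data items, fixed iteration cap), not an optimized implementation:

```python
import numpy as np

def kmeans(X, p, init=None, n_iter=100, seed=0):
    """Basic K-means with Euclidean distance.
    X: (n, d) array of data items; p: number of classes;
    init: optional indices of data items to use as initial centers."""
    rng = np.random.default_rng(seed)
    # 1. Select p initial centers c1, ..., cp
    if init is None:
        init = rng.choice(len(X), size=p, replace=False)
    centers = X[np.asarray(init)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # 2. Determine the class of each x_i as argmin_k ||x_i - c_k||
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Redefine each c_k as the centroid of class k
        new_centers = np.array([X[labels == k].mean(axis=0)
                                if np.any(labels == k) else centers[k]
                                for k in range(p)])
        # 4. Repeat until convergence (centers stop moving)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

As noted above, the result depends on the initialization; passing different `init` indices can give different clusterings.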
ä Goal: to build a similarity graph, i.e., a graph that captures similarity between any two items
ä Given: a set of n data points X = {x1, . . . , xn} → vertices
ä Given: a proximity measure between two data points xi and xj – as measured by a quantity dist(xi, xj)

Nearest neighbor graphs

ä For each node, get a few of the nearest neighbors → Graph
ä Two types of nearest neighbor graph often used:
ε-graph: Edges consist of pairs (xi, xj) such that ρ(xi, xj) ≤ ε
ä Define weight between i and j as:

wij = fij × e^(−‖xi − xj‖² / σX²)   if ‖xi − xj‖ < r
wij = 0                              if not

ä Note: ‖xi − xj‖ could be any measure of distance...

ä Assume now that we have built a ‘similarity graph’
ä Setting is identical with that of graph partitioning.
ä Need a Graph Laplacean: L = D − W with wii = 0, wij ≥ 0, and D = diag(W ∗ ones(n, 1)) [in matlab notation]
ä Partition vertex set V into two sets A and B
ä First (naive) approach: use this measure to partition the graph, i.e., find A and B that minimize cut(A, B).
ä Issue: small sets, isolated nodes, big imbalances

[Figure: example graph with two minimum cuts (Min−cut 1, Min−cut 2) illustrating how minimizing cut(A, B) alone favors small, isolated sets]

Ratio-cuts

ä Standard Graph Partitioning approach: find A, B by solving

Minimize cut(A, B) subject to |A| = |B|

ä Condition |A| = |B| not too meaningful in some applications - too restrictive in others.
ä Minimum Ratio Cut approach. Find A, B by solving:

Minimize cut(A, B) / (|A| · |B|)
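A minimal sketch of building W and L = D − W from data points, assuming the heat-kernel weights with fij = 1 and the thresholding by r described above; `sigma` and `r` are illustrative parameters, not prescribed values:

```python
import numpy as np

def laplacian_from_points(X, sigma=1.0, r=1.5):
    """Build W with w_ij = exp(-||xi - xj||^2 / sigma^2) when
    ||xi - xj|| < r (0 otherwise, and w_ii = 0), then L = D - W."""
    n = len(X)
    # all pairwise squared distances ||xi - xj||^2
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-D2 / sigma**2)
    W[np.sqrt(D2) >= r] = 0.0      # sparsify: drop far-apart pairs
    np.fill_diagonal(W, 0.0)       # w_ii = 0
    D = np.diag(W @ np.ones(n))    # D = diag(W * ones(n,1)), as in the text
    return D - W
```

By construction every row of L sums to zero and L is symmetric, which is what makes L a graph Laplacean usable for the partitioning formulations above.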
ä Just like graph partitioning we can:
1. Apply the method recursively [repeat clustering on the resulting parts]
2. or compute a few eigenvectors and run K-means clustering on these eigenvectors to get the clustering.

ä First task: obtain a graph from pixels.
ä Common idea: use “Heat kernels”
ä Let Fj = feature value (e.g., brightness), and let Xj = spatial position. Then define

wij = e^(−‖Fi − Fj‖² / σI²) × e^(−‖Xi − Xj‖² / σX²)   if ‖Xi − Xj‖ < r
wij = 0                                                 else

ä Sparsity depends on the parameters

[Figure: sample graph built from data points]

1. Given: Collection of data samples {x1, x2, · · · , xn}
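The second option above (a few eigenvectors of L, then clustering) can be sketched for the two-cluster case, where splitting the eigenvector of the second-smallest eigenvalue (the "Fiedler vector") by sign stands in for K-means; this is a minimal dense-matrix sketch:

```python
import numpy as np

def spectral_bipartition(W):
    """2-way spectral clustering: build L = D - W, take the eigenvector
    of the second-smallest eigenvalue, and split its entries by sign.
    (For p > 2 clusters, keep several eigenvectors and run K-means on
    their rows, as described in the text.)"""
    D = np.diag(W.sum(axis=1))
    L = D - W
    vals, vecs = np.linalg.eigh(L)       # eigenvalues in ascending order
    fiedler = vecs[:, 1]                 # second-smallest eigenpair
    return (fiedler >= 0).astype(int)    # A = {i : v_i >= 0}, B = the rest
```

On a graph made of two tight groups joined by a weak edge, the sign pattern of the Fiedler vector recovers the two groups.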
Building a nearest neighbor graph

ä Question: How to build a nearest-neighbor graph from given data?

ä Recall: Two common types of nearest neighbor graphs:

ε-graph: Edges consist of pairs (xi, xj) such that ρ(xi, xj) ≤ ε
kNN graph: Nodes adjacent to xi are the k nodes xℓ with the smallest distances ρ(xi, xℓ).

[Figure: a set of data points (left) converted into a nearest-neighbor graph (right)]

ä The ε-graph is undirected and is geometrically motivated. Issues: 1) may result in disconnected components; 2) what ε?
ä kNN graphs are directed in general (can be trivially fixed).
ä kNN graphs are especially useful in practice.
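A brute-force construction of the kNN graph makes both points concrete: the adjacency is directed in general, and symmetrizing it is the "trivial fix". (This costs O(n²) distance evaluations, which is exactly what the divide and conquer approach is designed to avoid.)

```python
import numpy as np

def knn_graph(X, k):
    """Brute-force kNN graph: A[i, j] = 1 when x_j is among the k
    nearest neighbors of x_i (a directed relation in general)."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)          # exclude self-loops
    A = np.zeros((n, n), dtype=int)
    for i in range(n):
        nbrs = np.argsort(D[i])[:k]      # k smallest distances rho(x_i, x_l)
        A[i, nbrs] = 1
    # trivial fix: keep an edge if it appears in either direction
    A_sym = np.maximum(A, A.T)
    return A, A_sym
```

Note the asymmetry: x_j can be among the k nearest neighbors of x_i without the reverse holding, e.g., for an outlier point whose nearest neighbor sits inside a dense group.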
ä Will demonstrate the power of a divide and conquer approach combined with the Lanczos algorithm.
ä Note: The Lanczos algorithm will be covered in detail later.

uᵀX̂ = σvᵀ.

ä Note that uᵀx̂i = uᵀX̂ei = σvᵀei = σvi →

X+ = {xi | vi ≥ 0} and X− = {xi | vi < 0}

ä In practice: replace the above criterion by

X+ = {xi | vi ≥ med(v)} and X− = {xi | vi < med(v)}

[Figure: data points separated by a hyperplane]

Two divide and conquer algorithms

Approximate kNN Graph Construction: The Overlap Method

ä Divide the current set X into two overlapping subsets X1 and X2:

X1 = {xi | vi ≥ −hα(Sv)} and X2 = {xi | vi < hα(Sv)},

• where Sv = {|vi| | i = 1, 2, . . . , n},
• and hα(·) is a function that returns an element larger than (100α)% of those in Sv.

ä Rationale: to ensure that the two subsets overlap in (100α)% of the data, i.e.,

|X1 ∩ X2| = ⌈α|X|⌉.

Approximate kNN Graph Construction: The Glue Method

ä Divide the set X into two disjoint subsets X1 and X2 with a gluing subset X3:

X1 ∪ X2 = X,  X1 ∩ X2 = ∅,  X1 ∩ X3 ≠ ∅,  X2 ∩ X3 ≠ ∅.

ä Criterion used for splitting:

X1 = {xi | vi ≥ 0},  X2 = {xi | vi < 0},
X3 = {xi | −hα(Sv) ≤ vi < hα(Sv)}.

ä Note: the gluing subset X3 here is just the intersection of the sets X1, X2 of the overlap method.
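The two splitting criteria can be sketched as follows. Here hα(·) is approximated by an empirical quantile of Sv = {|vi|}, which is one plausible choice of a function returning an element larger than (100α)% of the set (an assumption, not the lecture's prescribed definition):

```python
import numpy as np

def h_alpha(S_v, alpha):
    """Approximate h_alpha: a value larger than (100*alpha)% of the
    entries of S_v, taken here as an empirical quantile."""
    return np.quantile(S_v, alpha)

def overlap_split(v, alpha):
    """Overlap method: two overlapping index subsets X1, X2 built from
    the entries of the splitting vector v."""
    h = h_alpha(np.abs(v), alpha)
    X1 = np.where(v >= -h)[0]
    X2 = np.where(v < h)[0]
    return X1, X2

def glue_split(v, alpha):
    """Glue method: disjoint X1, X2 plus the gluing set X3, which
    equals the intersection X1 ∩ X2 of the overlap method."""
    h = h_alpha(np.abs(v), alpha)
    X1 = np.where(v >= 0)[0]
    X2 = np.where(v < 0)[0]
    X3 = np.where((-h <= v) & (v < h))[0]
    return X1, X2, X3
```

With the quantile approximation the overlap |X1 ∩ X2| is only approximately ⌈α|X|⌉, but the structural relations hold exactly: the overlap sets cover X, and the glue set coincides with their intersection.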