Graph-Based Clustering and Data Visualization Algorithms
Ágnes Vathy-Fogarassy
János Abonyi
SpringerBriefs in Computer Science
Series Editors
Stan Zdonik
Peng Ning
Shashi Shekhar
Jonathan Katz
Xindong Wu
Lakhmi C. Jain
David Padua
Xuemin Shen
Borko Furht
V. S. Subrahmanian
Martial Hebert
Katsushi Ikeuchi
Bruno Siciliano
Ágnes Vathy-Fogarassy
Computer Science and Systems Technology
University of Pannonia
Veszprém, Hungary

János Abonyi
Department of Process Engineering
University of Pannonia
Veszprém, Hungary
Preface

Clustering, as a special area of data mining, is one of the most commonly used
methods for discovering the hidden structure of data. Clustering algorithms group a set
of objects in such a way that objects in the same cluster are more similar to each
other than to those in other clusters. Cluster analysis can be used to quantize data,
extract cluster prototypes for the compact representation of the data set, select
relevant features, segment data into homogeneous subsets, and to initialize
regression and classification models.
Graph-based clustering algorithms are powerful in giving results close to the
human intuition [1]. The common characteristic of graph-based clustering methods
developed in recent years is that they build a graph on the set of data and then use
the constructed graph during the clustering process [2–9]. In graph-based clustering
methods objects are considered as vertices of a graph, while edges between
them are treated differently by the various approaches. In the simplest case, the
graph is a complete graph, where all vertices are connected to each other, and the
edges are labeled according to the degree of the similarity of the objects. Consequently,
in this case the graph is a weighted complete graph.
In case of large data sets the computation of the complete weighted graph
requires too much time and storage space. To reduce complexity many algorithms
work only with sparse matrices and do not utilize the complete graph. Sparse
similarity matrices contain information only about a small subset of the edges,
mostly those corresponding to higher similarity values. These sparse matrices
encode the most relevant similarity values and graphs based on these matrices
visualize these similarities in a graphical way.
Another way to reduce the time and space complexity is the application of a
vector quantization (VQ) method (e.g. k-means [10], neural gas (NG) [11], Self-
Organizing Map (SOM) [12]). The main goal of the VQ is to represent the entire
set of objects by a set of representatives (codebook vectors), whose cardinality is
much lower than the cardinality of the original data set. If a VQ method is used to
reduce the time and space complexity, and the clustering method is based on
graph-theory, vertices of the graph represent the codebook vectors and the edges
denote the connectivity between them.
Weights assigned to the edges express similarity of pairs of objects. In this book
we will show that similarity can be calculated based on distances or based on
structural information. Structural information about the edges expresses the degree
of the connectivity of the vertices (e.g. number of common neighbors).
The key idea of graph-based clustering is extremely simple: compute a graph of
the original objects or their codebook vectors, then delete edges according to some
criteria. This procedure results in an unconnected graph where each subgraph
represents a cluster. Finding edges whose elimination leads to good clustering is a
challenging problem. In this book a new approach will be proposed to eliminate
these inconsistent edges.
Clustering algorithms in many cases are confronted with manifolds, where low-
dimensional data structure is embedded in a high-dimensional vector space. In
these cases classical distance measures are not applicable. To solve this problem it
is necessary to draw a network of the objects to represent the manifold and
compute distances along the established graph. A similarity measure computed in
such a way (graph distance, curvilinear or geodesic distance [13]) approximates
the distances along the manifold. Graph-based distances are calculated as the
shortest path along the graph for each pair of points. As a result, the computed
distance depends on the curvature of the manifold, thus it takes the intrinsic
geometrical structure of the data into account. In this book we propose a novel
graph-based clustering algorithm to cluster and visualize data sets containing
nonlinearly embedded manifolds.
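As an illustration of this idea, the following minimal sketch (Python, using scipy and scikit-learn; all names and parameter values are illustrative, not the book's code) builds a k-nearest-neighbour graph of the data and approximates the geodesic distances by shortest paths along that graph.

```python
# Sketch: approximate geodesic (graph) distances along a manifold.
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def geodesic_distances(X, k=10):
    """Approximate pairwise geodesic distances of the data set X (N x D)."""
    # Connect every point to its k nearest neighbours; edge weight = Euclidean distance.
    knn = kneighbors_graph(X, n_neighbors=k, mode='distance')
    # Symmetrise so the graph is undirected.
    knn = knn.maximum(knn.T)
    # Shortest paths along the graph approximate distances along the manifold.
    # (Unreachable pairs, i.e. a disconnected graph, yield infinite distances.)
    return shortest_path(knn, method='D', directed=False)
```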
Visualization of complex data in a low-dimensional vector space plays an
important role in knowledge discovery. We present a data visualization technique
that combines graph-based topology representation and dimensionality reduction
methods to visualize the intrinsic data structure in a low-dimensional vector space.
Application of graphs in clustering and visualization has several advantages.
Edges characterize relations, weights represent similarities or distances. A graph
of important edges gives a compact representation of the whole complex data set. In
this book we present clustering and visualization methods that are able to utilize
information hidden in these graphs based on the synergistic combination of
classical tools of clustering, graph-theory, neural networks, data visualization,
dimensionality reduction, fuzzy methods, and topology learning.
The understanding of the proposed algorithms is supported by
• figures (over 110);
• references (170) which give a good overview of the current state of clustering,
vector quantizing and visualization methods, and suggest further reading
material for students and researchers interested in the details of the discussed
algorithms;
• algorithms (17) which help the reader to understand the methods in detail and to
implement them;
• examples (over 30);
• software packages which incorporate the introduced algorithms. These Matlab
files are downloadable from the website of the author (www.abonyilab.com).
References
1. Jaromczyk, J.W., Toussaint, G.T.: Relative neighborhood graphs and their relatives. Proc.
IEEE 80(9), 1502–1517 (1992)
2. Anand, R., Reddy, C.K.: Graph-based clustering with constraints. PAKDD 2011, Part II,
LNAI 6635, 51–62 (2011)
3. Chen, N., Chen, A., Zhou, L., Lu, L.: A graph-based clustering algorithm in large transaction.
Intell. Data Anal. 5(4), 327–338 (2001)
4. Guha, S., Rastogi, R., Shim, K.: ROCK: A robust clustering algorithm for categorical
attributes. In: Proceedings of the 15th International Conference on Data Engineering,
pp. 512–521 (1999)
5. Huang, X., Lai, W.: Clustering graphs for visualization via node similarities. J. Vis. Lang.
Comput. 17, 225–253 (2006)
6. Karypis, G., Han, E.-H., Kumar, V.: Chameleon: Hierarchical clustering using dynamic
modeling. IEEE Comput. 32(8), 68–75 (1999)
7. Kawaji, H., Takenaka, Y., Matsuda, H.: Graph-based clustering for finding distant
relationships in a large set of protein sequences. Bioinformatics 20(2), 243–252 (2004)
8. Novák, P., Neumann, P., Macas, J.: Graph-based clustering and characterization of repetitive
sequences in next-generation sequencing data. BMC Bioinformatics 11, 378 (2010)
9. Zaki, M.J., Peters, M., Assent, I., Seidl, T.: CLICKS: An effective algorithm for mining
subspace clusters in categorical datasets. Data Knowl. Eng. 60, 51–70 (2007)
10. MacQueen, J.: Some methods for classification and analysis of multivariate observations. In:
Proceedings of Fifth Berkeley Symposium on Mathematical Statistics and Probability,
pp. 281–297 (1967)
11. Martinetz, T.M., Schulten, K.J.: A neural-gas network learns topologies. In Kohonen, T.,
Mäkisara, K., Simula, O., Kangas, J. (eds): Artificial Neural Networks, pp. 397–402 (1991)
12. Kohonen, T.: Self-Organizing Maps, 3rd edn. Springer, New York (2001)
13. Bernstein, M., de Silva, V., Langford, J.C., Tenenbaum, J.B.: Graph approximations to
geodesics on embedded manifolds. Stanford University (2000)
Chapter 1
Vector Quantisation and Topology Based Graph
Representation
Abstract Compact graph-based representation of complex data can be used for
clustering and visualisation. In this chapter we introduce basic concepts of graph
theory and present approaches which may generate graphs from data. The computational
complexity of clustering and visualisation algorithms can be reduced by replacing
the original objects with their representative elements (code vectors or fingerprints)
by vector quantisation. We introduce widespread vector quantisation methods, the
k-means and the neural gas algorithms. Topology representing networks, obtained by
a modification of the neural gas algorithm, create graphs useful for the low-dimensional
visualisation of the data set. In this chapter the basic algorithm of the topology
representing networks and its variants (Dynamic Topology Representing Network and
Weighted Incremental Neural Network) are presented in detail.
A graph G is a pair (V, E), where V is a finite set of elements, called vertices
or nodes, and E is a collection of pairs of V. An element of E, called an edge, is
e_{i,j} = (v_i, v_j), where v_i, v_j ∈ V. If {u, v} ∈ E, we say that u and v are neighbors.
The set of the neighbors of a given vertex is the neighborhood of that vertex.
The complete graph K_N on a set of N vertices is the graph that has all of the
N(N − 1)/2 possible edges. In a weighted graph a weight function w : E → R is defined, which
determines a weight wi, j for each edge ei, j . A graph may be undirected, meaning
that there is no distinction between the two vertices associated with each edge. On
the other hand, a graph may be directed, when its edges are directed from one vertex
to another. A graph is connected if there is a path (i.e. a sequence of edges) from any
vertex to any other vertex in the graph. A graph that is not connected is said to be
disconnected. A graph is finite if V and E are finite sets. A tree is a graph in which
any two vertices are connected by exactly one path. A forest is a disjoint union of
trees.
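To make these notions concrete, a tiny illustrative sketch follows (plain Python; the example graph and helper name are hypothetical): an undirected weighted graph stored as an adjacency map, together with a breadth-first check of connectedness.

```python
# Tiny illustration of the graph notions above (undirected, weighted graph).
from collections import deque

# Adjacency map: vertex -> {neighbour: weight}.
G = {
    'a': {'b': 1.0, 'c': 2.5},
    'b': {'a': 1.0},
    'c': {'a': 2.5, 'd': 0.7},
    'd': {'c': 0.7},
}

def is_connected(graph):
    """True if there is a path between every pair of vertices (BFS from one vertex)."""
    start = next(iter(graph))
    seen, queue = {start}, deque([start])
    while queue:
        v = queue.popleft()
        for u in graph[v]:
            if u not in seen:
                seen.add(u)
                queue.append(u)
    return len(seen) == len(graph)

print(is_connected(G))   # True: the example graph is connected
```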
In practical data mining, data often contain a large number of observations. In the case of
large data sets the computation of the complete weighted graph requires too much
time and storage space. Data reduction methods may provide a solution for this
problem. Data reduction can be achieved in such a way that the original objects are
replaced with their representative elements. Naturally, the number of the representative
elements is considerably less than the number of the original observations. This
form of data reduction is called vector quantisation (VQ).
The k-means algorithm [12] is the simplest and most commonly used vector quantisation
method. k-means clustering partitions the data into clusters and minimises the distance
between the cluster centres (code vectors) and the data related to the clusters:

J(X, V) = Σ_{i=1}^{c} Σ_{x_k ∈ C_i} ‖x_k − v_i‖²,    (1.1)

where C_i denotes the ith cluster, and ‖x_k − v_i‖ is a chosen distance measure between
the data point x_k and the cluster centre v_i. The whole procedure can be found in
Algorithm 1. The iteration steps are repeated until there is no reassignment of patterns
to new cluster centres or there is no significant decrease in the squared error.
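A minimal numpy sketch of this quantisation loop is given below; it follows the assignment/update iteration and the stopping rule described above, with an illustrative random initialisation (the function name and parameter values are assumptions of this sketch, not the book's Algorithm 1).

```python
import numpy as np

def kmeans_vq(X, c, max_iter=100, tol=1e-6, seed=0):
    """Lloyd-type k-means used as a vector quantiser: returns c code vectors."""
    rng = np.random.default_rng(seed)
    V = X[rng.choice(len(X), size=c, replace=False)].astype(float)  # initial centres
    prev_J = np.inf
    for _ in range(max_iter):
        # Assignment step: nearest centre for every data point.
        d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: centres become the mean of their assigned points.
        for i in range(c):
            if np.any(labels == i):
                V[i] = X[labels == i].mean(axis=0)
        J = d2[np.arange(len(X)), labels].sum()   # squared error of Eq. (1.1)
        if prev_J - J < tol:                      # no significant decrease: stop
            break
        prev_J = J
    return V, labels
```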
The k-means algorithm is very popular because it is easy to implement, and its
time complexity is O(N ), where N is the number of objects. The main drawback of
this algorithm is that it is sensitive to the selection of the initial partition and may
converge to a local minimum of the criterion function. As its implementation is very
easy, this algorithm is frequently used for vector quantisation. Cluster centres can be
seen as the reduced representation (representative elements) of the data. The number
of the cluster centres and so the number of the representative elements (codebook
vectors) is given by the user a priori. The Linde-Buzo-Gray algorithm (LBG) [13]
works similarly to the k-means vector quantisation method, but it starts with only one
representative element (the cluster centre or centroid of the entire data set) and
in each iteration dynamically doubles the number of the representative elements
and reassigns the objects to be analysed among the cluster centres. The algorithm
stops when the desired number of centroids is obtained.
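A hedged sketch of the LBG splitting idea under the description above: start from the centroid of the entire data set, double the codebook by perturbing each code vector, and refine the enlarged codebook with k-means style assignment/update steps. The perturbation size eps and the helper name are illustrative.

```python
import numpy as np

def lbg(X, n_codes, n_refine=20, eps=1e-3):
    """Linde-Buzo-Gray style quantiser: doubles the codebook until n_codes is reached."""
    codebook = X.mean(axis=0, keepdims=True)            # start with one representative
    while len(codebook) < n_codes:
        # Duplicate every code vector with a small +/- multiplicative perturbation.
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        for _ in range(n_refine):                       # k-means style refinement
            d2 = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            for i in range(len(codebook)):
                if np.any(labels == i):
                    codebook[i] = X[labels == i].mean(axis=0)
    return codebook[:n_codes]   # truncate if n_codes is not a power of two
```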
Partitional clustering is closely related to the concept of the Voronoi diagram. A set of
representative elements (cluster centres) decomposes the space into subspaces called
Voronoi cells. These Voronoi cells are drawn in such a way that all data points in a
given Voronoi cell are closer to their own representative element than to the other
representative elements. The Delaunay triangulation (DT) is the dual graph of the
Voronoi diagram for the same representatives. The Delaunay triangulation [14] is a
subdivision of the space into triangles in such a way that no other representative
element is inside the circumcircle of any triangle. As a result the DT divides the plane into a number of
triangles. Figure 1.1 shows a small example of the Voronoi diagram and Delaunay
triangulation. In this figure blue dots represent the representative objects, the
Voronoi cells are drawn with red lines, and black lines form the Delaunay
triangulation of the representative elements. In this approach the representative elements
can be seen as a compressed representation of the space in such a way that data points
placed in a Voronoi cell are replaced with their representative data point in the same
Voronoi cell.
The induced Delaunay triangulation is a subset of the Delaunay triangulation, and
it can be obtained by masking the Delaunay triangulation with the data distribution.
Therefore the induced Delaunay triangulation reflects the structure of the data more
precisely and does not contain edges that pass through areas where no data points
are found. The detailed description of the induced Delaunay triangulation and the
related concept of the masked Voronoi polyhedron can be found in [15].
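For a set of representatives in the plane, the Delaunay edges (and the dual Voronoi diagram) can be obtained directly with scipy; the short sketch below collects the undirected Delaunay edge list from the triangles. The induced Delaunay triangulation discussed above would additionally require masking these edges with the data distribution, which is not shown here.

```python
import numpy as np
from scipy.spatial import Delaunay, Voronoi

points = np.random.rand(30, 2)        # representative elements in the plane
tri = Delaunay(points)                # Delaunay triangulation
vor = Voronoi(points)                 # the dual Voronoi diagram of the same points

# Collect the undirected Delaunay edges from the triangles.
edges = set()
for a, b, c in tri.simplices:
    edges.update({tuple(sorted((a, b))), tuple(sorted((b, c))), tuple(sorted((a, c)))})
print(len(edges), "Delaunay edges for", len(points), "representatives")
```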
The neural gas algorithm (NG) [16] gives an informative reduced data representation
of a given data set. The name ‘neural gas’ comes from the operation of the
algorithm, since the representative data points distribute themselves in the vector space
like a gas. The algorithm first initialises the code vectors randomly. Then it repeats
iteration steps in which the following operations are performed: the algorithm randomly
chooses a data point from the data objects to be visualised, calculates the distance
order of the representatives to the randomly chosen data point, and in the course of
the adaptation step moves all representatives closer to the randomly
chosen data point. The detailed algorithm is given in Algorithm 2.
The ε and λ parameters decrease with time t. The adaptation step (Step 4)
corresponds to a stochastic gradient descent on a given cost function. As a result
the algorithm presents n D-dimensional output vectors which distribute themselves
homogeneously in the input ‘data cloud’.
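A compact numpy sketch of the adaptation loop described above, using the usual rank-based neural gas update w_i ← w_i + ε·exp(−k_i/λ)·(x − w_i), where k_i is the distance rank of representative i; ε and λ decay exponentially over the iterations. All parameter values and names are illustrative.

```python
import numpy as np

def neural_gas(X, n=300, t_max=50000, eps=(0.3, 0.05), lam=(10.0, 0.01), seed=0):
    """Neural gas vector quantisation: n code vectors adapt to the data cloud X (N x D)."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(X.min(axis=0), X.max(axis=0), size=(n, X.shape[1]))  # random init
    for t in range(t_max):
        frac = t / t_max
        eps_t = eps[0] * (eps[1] / eps[0]) ** frac      # decaying step size
        lam_t = lam[0] * (lam[1] / lam[0]) ** frac      # decaying neighbourhood range
        x = X[rng.integers(len(X))]                     # randomly chosen data point
        # Distance order (rank) of every representative with respect to x.
        ranks = np.argsort(np.argsort(((W - x) ** 2).sum(axis=1)))
        # Move every representative towards x, weighted by its rank.
        W += eps_t * np.exp(-ranks / lam_t)[:, None] * (x - W)
    return W
```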
Figure 1.2 shows a synthetic data set (‘boxlinecircle’) and the run of the neural
gas algorithm on this data set. The original data set contains 7,100 sample points
(N = 7100) placed in a cube, in a refracted line and in a circle (Fig. 1.2a). Data
points placed in the cube contain random errors (noise). In this figure the original
data points are shown as blue points and the borders of the points are illustrated
with red lines. Figure 1.2b shows the initialisation of the neural gas algorithm, where
the neurons were initialised in the range of the variables randomly. The number of
the representative elements was chosen to be n = 300. Figure 1.2c–f show different
states of the neural gas algorithm. The representative elements distribute themselves
homogeneously and learn the form of the original data set (Fig. 1.2f).

Fig. 1.2 A synthetic data set and different states of the neural gas algorithm. a The synthetic
‘boxlinecircle’ data set (N = 7100). b Neural gas initialization (n = 300). c NG, number of
iterations: 100 (n = 300). d NG, number of iterations: 1000 (n = 300). e NG, number of iterations:
10000 (n = 300). f NG, number of iterations: 50000 (n = 300)
Figure 1.3 shows another application example. The analysed data set contains
5,000 sample points placed on a 3-dimensional S curve. The number of the
representative elements in this small example was chosen to be n = 200, and the neurons
were initialised as data points characterised by small initial values. Results at
different states of the run are shown in Fig. 1.3b–d.
It should be noted that the neural gas algorithm has much more robust convergence
properties than k-means vector quantisation.
In most cases the distribution of high-dimensional data is not known. In these
cases the initialisation of the k-means and the neural gas algorithms is not easy,
since it is hard to determine the number of the representative elements (clusters).
Fig. 1.3 The S curve data set and different states of the neural gas algorithm. a The ‘S curve’ data
set (N = 5000). b NG, number of iterations: 200 (n = 200). c NG, number of iterations: 1000
(n = 200). d NG, number of iterations: 10000 (n = 200)
The Growing neural gas (GNG) [17] algorithm provides a fairly good solution to
this problem, since it adds and removes representative elements dynamically.
The other main benefit of this algorithm is that it creates a graph of representatives,
therefore it can be used for exploring the topological structure of the data as well. The GNG
algorithm starts with two random representatives in the vector space. After this
initialisation step the growing neural gas algorithm iteratively selects an input vector
randomly, locates the two nearest nodes (representative elements) to this selected
input vector, moves the nearest representative closer to the selected input vector,
updates some edges, and in certain cases creates a new representative element as
well. The algorithm is detailed in Algorithm 3 [17]. As we can see, the network
topology is generated incrementally during the whole process. The termination criterion
might be, for example, the evaluation of a quality measure (or reaching a maximum
number of nodes). The GNG algorithm has several important parameters,
including the maximum age of a connection before it is deleted (amax), scaling
factors for the reduction of the error of representatives (α, d), and the degrees (εb, εa) of
movement of the selected representative elements in the adaptation step (Step 6).
As these parameters are constant in time and since the algorithm is incremental,
there is no need to determine the number of representatives a priori. One of the
main benefits of the growing neural gas algorithm is that it generates a graph as a result.
Nodes of this graph are representative elements which represent the distribution of the
original objects, and edges give information about the neighbourhood relations of the
representatives.
w_s = (w_q + w_r) / 2    (1.6)

• Create edges between the representatives w_s and w_q, and w_s and w_r. If there was an edge
between w_q and w_r, then delete it.
• Decrease the error variables of the representatives w_q and w_r, and initialize the error variable
of the new node w_s with the new value of the error variable of w_q, in that order.
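The listing above is only a fragment of Algorithm 3. The sketch below illustrates just this node-insertion step around Eq. (1.6), assuming accumulated error variables err, a code-vector array W and an edge set with ages stored in a dict; the names and the scaling factor alpha are illustrative, not the book's notation.

```python
import numpy as np

def insert_node(W, err, edges, alpha=0.5):
    """One GNG-style insertion: add a node halfway between the worst node and its worst neighbour.

    W     : (n, D) array of code vectors
    err   : (n,) array of accumulated error variables
    edges : dict mapping frozenset({i, j}) -> age of the edge
    Assumes the worst node has at least one neighbour.
    """
    q = int(np.argmax(err))                                   # node with largest error
    neighbours = [j for e in edges for j in e if q in e and j != q]
    r = max(neighbours, key=lambda j: err[j])                 # its worst neighbour
    ws = (W[q] + W[r]) / 2.0                                  # Eq. (1.6)
    s = len(W)
    W = np.vstack([W, ws[None, :]])
    edges.pop(frozenset({q, r}), None)                        # replace edge (q, r) ...
    edges[frozenset({q, s})] = 0                              # ... by (q, s) and (r, s)
    edges[frozenset({r, s})] = 0
    err[q] *= alpha                                           # decrease the error variables
    err[r] *= alpha
    err = np.append(err, err[q])                              # new node inherits q's new error
    return W, err, edges
```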
The topology representing network (TRN) algorithm [15, 16] is one of the best known
neural network based vector quantisation methods. The TRN algorithm works as follows.
Given a set of data (X = {x_1, x_2, . . . , x_N}, x_i ∈ R^D, i = 1, . . . , N) and a set
of codebook vectors (W = {w_1, w_2, . . . , w_n}, w_i ∈ R^D, i = 1, . . . , n) (N > n),
the algorithm distributes the pointers w_i between the data objects by the neural gas
algorithm (steps 1–4, without setting the connection strengths c_{i,j} to zero) [16],
and forms connections between them by applying the competitive Hebbian rule [18].
The run of the algorithm results in a Topology Representing Network, that is, a
graph G = (W, C), where W denotes the nodes (codebook vectors, neural units,
representatives) and C denotes the set of edges between them. The detailed description
of the TRN algorithm is given in Algorithm 4.
The algorithm has many parameters. In contrast to the growing neural gas algorithm,
the topology representing network requires the number of the representative elements a
priori. The number of iterations (t_max) and the number of codebook vectors (n)
are determined by the user. Parameter λ, step size ε and lifetime T depend on
the number of iterations. This time dependence can be expressed by the following
general form:
g(t) = g_i (g_f / g_i)^{t/t_max},    (1.12)

where g_i denotes the initial value of the variable, g_f denotes the final value of the
variable, t denotes the iteration number, and t_max denotes the maximum number of
iterations. (For example, for parameter λ it means: λ(t) = λ_i (λ_f / λ_i)^{t/t_max}.) Paper
[15] gives good suggestions to tune these parameters.

Fig. 1.4 The swiss roll data set and a possible topology representing network of it. a Original swiss
roll data set (N = 5000). b TRN of the swiss roll data set (n = 200)
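Putting the pieces together, here is a hedged numpy sketch of the TRN loop: neural gas adaptation of all code vectors, competitive Hebbian creation of an edge between the two code vectors closest to the presented input, and ageing/removal of the other edges of the winner. All schedules follow the decay of Eq. (1.12); the default parameter values mirror the settings quoted below but are only illustrative.

```python
import numpy as np

def decay(gi, gf, t, t_max):
    """Parameter schedule of Eq. (1.12): g(t) = gi * (gf / gi) ** (t / t_max)."""
    return gi * (gf / gi) ** (t / t_max)

def trn(X, n=200, t_max=None, eps=(0.3, 0.05), lam=None, T=None, seed=0):
    """Topology representing network sketch: returns code vectors W and an edge->age dict C."""
    rng = np.random.default_rng(seed)
    t_max = t_max or 200 * n
    lam = lam or (0.2 * n, 0.01)
    T = T or (0.1 * n, 0.05 * n)        # lifetime of edges
    W = X[rng.choice(len(X), size=n, replace=False)].astype(float)
    C = {}                              # frozenset({i, j}) -> age
    for t in range(t_max):
        x = X[rng.integers(len(X))]
        order = np.argsort(((W - x) ** 2).sum(axis=1))
        ranks = np.argsort(order)
        # Neural gas adaptation of all code vectors.
        W += decay(*eps, t, t_max) * np.exp(-ranks / decay(*lam, t, t_max))[:, None] * (x - W)
        # Competitive Hebbian rule: connect the two closest code vectors (age reset to 0).
        i0, i1 = int(order[0]), int(order[1])
        C[frozenset({i0, i1})] = 0
        # Age all other edges of the winner and remove the too old ones.
        for e in list(C):
            if i0 in e and e != frozenset({i0, i1}):
                C[e] += 1
                if C[e] > decay(*T, t, t_max):
                    del C[e]
    return W, C
```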
To demonstrate the operation of the TRN algorithm, two synthetic data sets were chosen:
the swiss roll and the S curve data sets. The number of original objects was N = 5000
in both cases. The swiss roll data set and its topology representing network with
n = 200 quantised objects are shown in Fig. 1.4a and b.
Figure 1.5 shows two possible topology representing networks of the S curve
data set. In Fig. 1.5a, a possible TRN graph of the S curve data set with n = 100
representative elements is shown. In the second case (Fig. 1.5b) the number of the
representative elements was chosen to be twice as many as in the first case. As can
be seen, the greater the number of the representative elements, the more accurate the
approximation is.

Fig. 1.5 Different topology representing networks of the S curve data set. a TRN of the S curve
data set (n = 100). b TRN of the S curve data set (n = 200)
Parameters in both cases were set as follows: the number of iterations was set
to t_max = 200n, where n is the number of representative elements. The initial and final
values of the λ, ε and T parameters were: ε_i = 0.3, ε_f = 0.05, λ_i = 0.2n, λ_f = 0.01,
T_i = 0.1n and T_f = 0.05n. Although the modification of these parameters may
somewhat change the resulting graph, the number of the representative elements has
a more significant effect on the structure of the resulting network.
The main disadvantage of the TRN algorithm is that the number of the representa-
tives must be given a priori. The Dynamic topology representing network (DTRN)
introduced by Si et al. in 2000 [19] eliminates this drawback. In this method the graph
incrementally changes by adding and removing edges and vertices. The algorithm
starts with only one node, and it examines a vigilance test in each iteration. If the
nearest (winner) node to the randomly selected input pattern fails this test, a new
node is created and this new node is connected to the winner. If the winner passes
the vigilance test, the winner and its adjacent neighbors are moved closer to the
selected input pattern. In this second case, if the winner and the second closest node
are not connected, the algorithm creates an edge between them. Similarly to the TRN
algorithm, DTRN also removes those connections whose age reaches a predefined
threshold. The most important input parameter of the DTRN algorithm is the vigilance
threshold. This vigilance threshold gradually decreases from an initial value to a final
value. The detailed algorithm is given in Algorithm 5.
The termination criterion of the algorithm can be given by a maximum number
of iterations or can be controlled with the vigilance threshold. The output of the
algorithm is a D-dimensional graph.
As it can be seen DTRN and TRN algorithms are very similar to each other,
but there are some significant differences between them. While TRN starts with n
randomly generated codebook vectors, DTRN step by step builds up the set of the
representative data elements, and the final number of the codebook vectors can be
determined by the vigilance threshold as well. While during the adaptation process
the TRN moves the representative elements based on their ranking order closer to the
selected input object, DTRN performs this adaptation step based on the Euclidean
distances of the representatives and the selected input element. Furthermore, TRN
moves all representative elements closer to the selected input object, but DTRN
method applies the adaptation rule only to the winner and its direct topological
neighbors. The vigilance threshold is an additional parameter of the DTRN algorithm.
The tuning of this parameter is based on the formula introduced for the TRN algorithm:
in accordance with Eq. (1.12), the vigilance threshold ρ gradually decreases from
ρ_i to ρ_f during the run of the algorithm.
Step 1 Initialization: Start with only one representative element (node) wi . To represent this node
select one input object randomly.
Step 2 Select randomly an element x from the input data objects. Find the nearest representative
element (the winner) (wc ) and its direct neighbor (wd ) from:
‖x − w_c‖ = min_i ‖x − w_i‖    (1.13)

‖x − w_d‖ = min_{i≠c} ‖x − w_i‖    (1.14)
The similarity between a data point and a representative element is measured by the Euclidean
distance.
Step 3 Perform a vigilance test based on the following formula:
‖x − w_c‖ < ρ    (1.15)
where ρ is a vigilance threshold.
Step 4 If the winner representative element fails the vigilance test: create a new codebook vector
with wg = x. Connect the new codebook vector to the winner representative element by setting
sc,g = 1, and set other possible connections of wg to zero. Set tg, j = 0 if j = c and tg, j = ∞
otherwise. Go to Step 6.
Step 5 If the winner representative element passes the vigilance test:
Step 5.1: Update the coordinates of the winner node and its adjacent neighbors based on the
following formula:

w_i(k + 1) = w_i(k) + α(k) e^{−β‖x(k) − w_i(k)‖²} (x(k) − w_i(k)),

where k = 0, 1, . . . is a discrete time variable, α(k) is the learning rate factor, and β is an
annealing parameter.
Step 5.2: Update the connections between the representative elements. If the winner and its
closest representative are connected (sc,d = 1) set tc,d = 0. If they are not connected with an
edge, connect them by setting sc,d = 1 and set tc,d = 0.
Step 6 Increase the age of all connections of the winner representative element by setting tc, j = tc, j + 1.
If the age of a connection exceeds a time limit T (tc, j > T ), delete this edge by setting sc, j = 0.
Step 7 Remove the node wi if si, j = 0 for all j ≠ i, and there exists more than one representative
element. That is, if there is more than one representative element, remove all representatives
which do not have any connection to the other codebook vectors.
Step 8 If a termination criterion is not met continue the iteration and go back to Step 2.
Fig. 1.6 Different DTRN graphs of the S curve data set with the same parameter settings.
a A possible DTRN of S curve data set (n = 362). b Another possible DTRN of S curve data
set (n = 370)
Figure 1.6 shows two different DTRN graphs of the S curve data set generated with the
same parameter settings. The algorithm in these two cases was parameterised as follows: the
vigilance threshold decreased from the average deviation of the dimensions to a constant
0.1, the learning rate factor decreased from 0.05 to 0.0005, the number of iterations
was chosen to be 1,000 and the maximum age of connections was set to 5. DTRN
results in different topology based networks arising from the random initialisation
of the neurons. As DTRN dynamically adds and removes nodes the number of the
representative elements differs in the two examples.
Figure 1.7 shows the influence of the number of iterations (tmax ) and the maxi-
mum age (T ) of edges. When the number of the iterations increases the number of
representative elements increases as well. Furthermore, the increase of the maximum
age of edges results in additional links between slightly more distant nodes (see Fig. 1.7b and d).
Fig. 1.7 DTRN graphs of the swiss roll data set with different parameter settings. a DTRN of swiss
roll data set tmax = 500, T = 5 (n = 370). b DTRN of swiss roll data set tmax = 500, T = 10
(n = 383). c DTRN of swiss roll data set tmax = 1000, T = 5 (n = 631). d DTRN of swiss roll
data set tmax = 1000, T = 10 (n = 345)
Fig. 1.8 Weighted Incremental Networks of the swiss roll data set. a WINN of swiss roll data set
applying the suggested amax = N /10 parameter setting. b WINN of swiss roll data set with T = 3
parameter setting
Following the instructions of [20], we set parameter amax to N/10, i.e., amax = 500. The
resulting graph contains some unnecessary links. Setting this parameter to a lower
value, these superfluous connections do not appear in the graph. Figure 1.8b shows
this reduced parameter setting, where amax was set to 3. The number of
representative elements in both cases was n = 200.
References
1. Yao, A.: On constructing minimum spanning trees in k-dimensional spaces and related prob-
lems. SIAM J. Comput. 721–736 (1982)
2. Boopathy, G., Arockiasamy, S.: Implementation of vector quantization for image
compression—a survey. Global J. Comput. Sci. Technol. 10(3), 22–28 (2010)
3. Domingo, F., Saloma, C.A.: Image compression by vector quantization with noniterative deriva-
tion of a codebook: applications to video and confocal images. Appl. Opt. 38(17), 3735–3744
(1999)
4. Garcia, C., Tziritas, G.: Face detection using quantized skin color regions merging and wavelet
packet analysis. IEEE Trans. Multimedia 1(3), 264–277 (1999)
5. Biatov, K.: A high speed unsupervised speaker retrieval using vector quantization and second-
order statistics. CoRR Vol. abs/1008.4658 (2010)
6. Chu, W.C.: Vector quantization of harmonic magnitudes in speech coding applications a survey
and new technique. EURASIP J. App. Sig. Proces. 17, 2601–2613 (2004)
7. Kekre, H.B., Kulkarni, V.: Speaker identification by using vector quantization. Int. J. Eng. Sci.
Technol. 2(5), 1325–1331 (2010)
8. Abdelwahab, A.A., Muharram, N.S.: A fast codebook design algorithm based on a fuzzy
clustering methodology. Int. J. Image Graph. 7(2), 291–302 (2007)
9. Kohonen, T.: Self-Organizing Maps, 3rd edn. Springer, New York (2001)
10. Kurasova, O., Molyte, A.: Combination of vector quantization and visualization. Lect. Notes
Artif. Intell. 5632, 29–43 (2009)
11. Vathy-Fogarassy, A., Kiss, A., Abonyi, J.: Topology representing network map—a new tool
for visualization of high-dimensional data. Trans. Comput. Sci. I 4750, 61–84 (2008)
12. MacQueen, J.: Some methods for classification and analysis of multivariate observations. In:
Proceedings of Fifth Berkeley Symposium on Mathematical Statistics and Probability, pp.
281–297 (1967)
13. Linde, Y., Buzo, A., Gray, R.M.: An algorithm for vector quantizer design. IEEE Trans. Com-
mun. 28, 84–94 (1980)
14. Delaunay, B.: Sur la sphere vide. Izvestia Akademii Nauk SSSR, Otdelenie Matematicheskikh
i Estestvennykh Nauk 7, 793–800 (1934)
15. Martinetz, T.M., Schulten, K.J.: Topology representing networks. Neural Netw. 7(3), 507–522
(1994)
16. Martinetz, T.M., Schulten, K.J.: A neural-gas network learns topologies. In Kohonen, T., Mäk-
isara, K., Simula, O., Kangas, J. (eds.) Artificial Neural Networks, pp. 397–402, Elsevier
Science Publishers B.V, North-Holland (1991)
17. Fritzke, B.: A growing neural gas network learns topologies. Adv. Neural Inf. Proces. Syst. 7,
625–632 (1995)
18. Hebb, D.O.: The Organization of Behavior. John Wiley & Sons, New York (1949)
19. Si, J., Lin, S., Vuong, M.-A.: Dynamic topology representing networks. Neural Netw. 13,
617–627 (2000)
20. Muhammed, H.H.: Unsupervised fuzzy clustering using weighted incremental neural networks.
Int. J. Neural Syst. 14(6), 355–371 (2004)
Chapter 2
Graph-Based Clustering Algorithms
Abstract Graph-based clustering algorithms utilize graphs for partitioning data in a
variety of ways. In this chapter, two approaches are presented.
The first hierarchical clustering algorithm combines minimal spanning trees and
Gath-Geva fuzzy clustering. The second algorithm utilizes a neighborhood-based
fuzzy similarity measure to improve k-nearest neighbor graph based Jarvis-Patrick
clustering.
Since clustering groups neighboring objects into the same cluster, neighborhood graphs
are ideal for cluster analysis. A general introduction to the neighborhood graphs
is given in [18]. Different interpretations of concepts ‘near’ or ‘neighbour’ lead
to a variety of related graphs. The Nearest Neighbor Graph (NNG) [9] links each
vertex to its nearest neighbor. The Minimal Spanning Tree (MST) [29] of a weighted
graph is a spanning tree where the sum of the edge weights is minimal. The Relative
Neighborhood Graph (RNG) [25] connects two objects if and only if there is no other
object that is closer to both objects than they are to each other. In the Gabriel Graph
(GabG) [12] two objects, p and q, are connected by an edge if and only if the circle
with diameter pq does not contain any other object in its interior. All these graphs
are subgraphs of the well-known Delaunay triangulation (DT) [11] as follows:

NNG ⊆ MST ⊆ RNG ⊆ GabG ⊆ DT.    (2.1)
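As an illustration, the Gabriel graph condition can be checked directly from the pairwise distances: p and q are connected if and only if no third point r satisfies d(p,r)² + d(q,r)² < d(p,q)² (such an r would lie inside the circle with diameter pq). The brute-force numpy sketch below builds the edge list this way; replacing the test with max(d(p,r), d(q,r)) < d(p,q) would yield the Relative Neighborhood Graph instead. Names are illustrative.

```python
import numpy as np
from scipy.spatial import distance_matrix

def gabriel_graph(X):
    """Return the Gabriel graph edges of the data set X (N x D) as a list of index pairs."""
    D2 = distance_matrix(X, X) ** 2
    N = len(X)
    edges = []
    for p in range(N):
        for q in range(p + 1, N):
            # No other point may lie inside the circle with diameter pq.
            others = [r for r in range(N) if r != p and r != q]
            if np.all(D2[p, others] + D2[q, others] >= D2[p, q]):
                edges.append((p, q))
    return edges
```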
There are many graph-based clustering algorithms that utilize neighborhood rela-
tionships. Most widely known graph-theory based clustering algorithms (ROCK [16]
and Chameleon [20]) also utilize these concepts. The use of minimal spanning trees [29] for
clustering was initially proposed by Zahn [30]. Clusters arising from single linkage
hierarchical clustering methods are subgraphs of the minimum spanning tree of the
data [15]. Clusters arising from complete linkage hierarchical clustering methods are
maximal complete subgraphs, and are related to the node colorability of graphs [3]. In
[2, 24], the maximal complete subgraph was considered to be the strictest definition
of the clusters. Several graph-based divisive clustering algorithms are based on MST
[4, 10, 14, 22, 26]. The approach presented in [1] utilizes several neighborhood
graphs to find the groups of objects. Jarvis and Patrick [19] extended the nearest
neighbor graph with the concept of the shared nearest neighbors. In [7] Doman et al.
iteratively utilize Jarvis-Patrick algorithm for creating crisp clusters and then they
fuzzify the previously calculated clusters. In [17], a node structural metric has been
chosen making use of the number of shared edges.
In the following, we introduce the details and improvements of MST and Jarvis-
Patrick clustering algorithms.
The minimal spanning tree of a weighted connected graph is a spanning tree in which
the sum of the edge weights is minimal. Denote by G = (V, E) a graph. Creating the
minimal spanning tree means that we are searching for G′ = (V, E′), the connected
subgraph of G, where E′ ⊂ E and the cost is minimal. The cost is computed in the
following way:

Σ_{e∈E′} w(e),    (2.2)

where w(e) denotes the weight of the edge e ∈ E′. In a graph G, where the number
of the vertices is N, the MST has exactly N − 1 edges.
A minimal spanning tree can be efficiently computed in O(N²) time using either
Prim’s [23] or Kruskal’s [21] algorithm. Prim’s algorithm starts with an arbitrary
vertex as the root of a partial tree. In each step of the algorithm, the partial tree grows
by iteratively adding an unconnected vertex to it using the lowest cost edge, until
no unconnected vertex remains. Kruskal’s algorithm begins with the connection of
the two nearest objects. In each step, the minimal pairwise distance that connects
separate trees is selected, and these two trees are connected along these objects. So
Kruskal's algorithm iteratively merges two trees (or a tree with a single object) in
the current forest into a new tree. The algorithm continues until only a single tree
remains, connecting all points. Detailed descriptions of these algorithms are given in
Appendices A.1.1.1 and A.1.1.2.
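A compact numpy sketch of Prim's procedure on a full pairwise distance matrix, grown exactly as described above: start from an arbitrary root and repeatedly attach the cheapest unconnected vertex (scipy.sparse.csgraph.minimum_spanning_tree offers a ready-made alternative). The function name is illustrative.

```python
import numpy as np

def prim_mst(D):
    """Prim's algorithm on a full pairwise distance matrix D (N x N); returns MST edges."""
    N = len(D)
    in_tree = np.zeros(N, dtype=bool)
    in_tree[0] = True                       # arbitrary root of the partial tree
    best_dist = D[0].copy()                 # cheapest connection of each vertex to the tree
    best_from = np.zeros(N, dtype=int)
    edges = []
    for _ in range(N - 1):
        best_dist[in_tree] = np.inf         # ignore vertices that are already connected
        v = int(np.argmin(best_dist))       # lowest cost edge to an unconnected vertex
        edges.append((int(best_from[v]), v, float(D[best_from[v], v])))
        in_tree[v] = True
        closer = D[v] < best_dist           # update cheapest connections via v
        best_dist[closer] = D[v][closer]
        best_from[closer] = v
    return edges                            # exactly N - 1 edges
```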
Clustering based on minimal spanning tree is a hierarchical divisive procedure.
Removing edges from the MST leads to a collection of connected subgraphs of G,
which can be considered as clusters. Since MST has only N −1 edges, we can choose
inconsistent edges by revising only N − 1 values. Using the MST for clustering,
we are interested in finding edges whose elimination leads to the best clustering result.
Such edges are called inconsistent edges.
The basic idea of Zahn’s algorithm [30] is to detect inherent separations in the
data by deleting edges from the MST which are significantly longer than other edges.
Step 1 Construct the minimal spanning tree so that the edge weights are the distances between
the data points.
Step 2 Remove the inconsistent edges to get a set of connected components (clusters).
Step 3 Repeat Step 2 until a terminating criterion is satisfied.
Zahn proposed the following criterion to determine the inconsistent edges: an edge
is inconsistent if its length is more than f times the average length of the edges, or
more than f times the average of the length of nearby edges. This algorithm is able
to detect clusters of various shapes and sizes; however, the algorithm cannot detect
clusters with different densities.
Identification of inconsistent edges causes problems in the MST based clustering
algorithms. Elimination of k edges from a minimal spanning tree results in k + 1
disconnected subtrees. In the simplest recursive theories k = 1. Denote δ the length of
the deleted edge, and let V1 , V2 be the sets of the points in the resulting two clusters. In
the set of clusters, we can state that there are no pairs of points (x1 , x2 ), x1 ∈ V1 , x2 ∈
V2 such that d(x1 , x2 ) < δ. There are several ways to define the distance between two
disconnected groups of individual objects (minimum distance, maximum distance,
average distance, distance of centroids, etc.). Defining the separation between V1
and V2 , we have the result that the separation is at least δ. The determination of the
value of δ is very difficult because data can contain clusters with different densities,
shapes, volumes, and furthermore they can also contain bridges (chain links) between
the clusters. A terminating criterion determining when the algorithm stops should
also be defined.
The simplest way to delete edges from MST is based on distances between ver-
tices. By deleting the longest edge in each iteration step we get a nested sequence of
subgraphs. Several ways are known to stop the algorithm, for example the user can
define the number of clusters or give a threshold value on the length, as well. Zahn
suggested a global threshold value for the cutting, which considers the distribution
of the data in the feature space. In [30], this threshold (δ) is based on the average
weight (distance) of the edges of the MST (Criterion-1):

δ = λ (1/(N − 1)) Σ_{e∈E} w(e),    (2.3)

where λ is a user-defined parameter, N is the number of the objects, and E denotes
the set of the edges of the MST. Of course, λ can be defined in several ways.
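A sketch of the resulting MST-based clustering pipeline with the global threshold of Eq. (2.3): build the MST with scipy, delete every edge longer than λ times the average MST edge weight, and read the clusters off as the connected components of what remains. The function name and the default λ = 2 are illustrative.

```python
import numpy as np
from scipy.spatial import distance_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def mst_cluster(X, lam=2.0):
    """Zahn-style MST clustering with the global threshold of Eq. (2.3)."""
    D = distance_matrix(X, X)
    mst = minimum_spanning_tree(D).toarray()          # non-zero entries = MST edges
    weights = mst[mst > 0]                            # the N - 1 edge weights
    delta = lam * weights.mean()                      # Criterion-1 threshold
    mst[mst > delta] = 0                              # remove the inconsistent edges
    n_clusters, labels = connected_components(mst, directed=False)
    return n_clusters, labels
```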
Long edges of MST do not always indicate outliers or cluster separation. In case of
clusters with different densities, recursive cutting of longest edges does not give the
expected clustering result (see Fig. 2.1). To solve this problem, Zahn [30] suggested
that an edge is inconsistent if its length is at least f times as long as the average of
the length of nearby edges (Criterion-2). Another usage of Criterion-2 based MST
clustering is finding dense clusters embedded in a sparse set of points.
The first two splitting criteria are based on the distance between the resulting
clusters. Clusters chained by a bridge of a small set of data points cannot be separated by
(i) its considerable time complexity, (ii) its sensitivity to the presence of noise in
data. Three indices are proposed in the literature that are more robust to the presence
of noise. These Dunn-like indices are based on the following concepts: minimum
spanning tree, Relative Neighborhood Graph, and Gabriel Graph.
One of the three Dunn-like indices [6] is defined using the concept of the MST.
Let Ci be a cluster and Gi = (Vi, Ei) the complete graph whose vertices correspond
to the objects of Ci. Denote by w(e) the weight of an edge e of the graph. Let E_i^{MST}
be the set of edges of the MST of the graph Gi, and e_i^{MST} the continuous sequence
of the edges in E_i^{MST} whose total edge weight is the largest. Then, the diameter of
the cluster Ci is defined as the weight of e_i^{MST}. With the use of this notation the
Dunn-like index based on the concept of the MST is given by the equation:

D_{n_c} = min_{i=1,…,n_c} { min_{j=i+1,…,n_c} [ δ(Ci, Cj) / max_{k=1,…,n_c} diam(Ck) ] }    (2.4)
where n_c denotes the number of the clusters, δ(Ci, Cj) is the dissimilarity function
between two clusters Ci and Cj defined as min_{x_l∈Ci, x_m∈Cj} d(x_l, x_m), and diam(Ck)
is the diameter of the cluster Ck, which may be considered as a measure of cluster
dispersion. The number of clusters at which D_{n_c} takes its maximum value indicates
the number of clusters in the underlying data.
Varma and Simon [26] used the Fukuyama-Sugeno clustering measure for deleting
edges from the MST. In this validity measure, the weighted membership value of an object
is multiplied by the difference between the distance between the object and its cluster
center, and the distance between the cluster center and the center of the whole data
set. The Fukuyama-Sugeno clustering measure is defined in the following way:

FS_m = Σ_{j=1}^{N} Σ_{i=1}^{n_c} μ_{i,j}^m ( ‖x_j − v_i‖_A² − ‖v_i − v‖_A² )    (2.5)
where μ_{i,j} is the degree of membership of data point x_j in the ith cluster,
m is a weighting parameter, v denotes the global mean of all objects, v_i denotes
the mean of the objects in the ith cluster, A is a symmetric and positive definite
matrix, and n_c denotes the number of the clusters. The first term inside the brackets
measures the compactness of the clusters, while the second one measures the distances
of the cluster representatives. A small FS value indicates tight clusters with large
separations between them. Varma and Simon found that the Fukuyama-Sugeno measure
gives the best performance on data sets with a large number of noisy features.
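A small numpy sketch of Eq. (2.5) for the Euclidean case A = I; U is the fuzzy partition matrix, V the matrix of cluster centres, and m the weighting exponent (names are illustrative).

```python
import numpy as np

def fukuyama_sugeno(X, U, V, m=2.0):
    """Fukuyama-Sugeno measure of Eq. (2.5) with A = I (Euclidean norm).

    X: (N, D) data, U: (n_c, N) fuzzy memberships, V: (n_c, D) cluster centres.
    """
    v_bar = X.mean(axis=0)                                         # global mean of all objects
    compact = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)   # ||x_j - v_i||^2
    separate = ((V - v_bar) ** 2).sum(axis=1)[:, None]             # ||v_i - v||^2
    return float(((U ** m) * (compact - separate)).sum())
```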
FHV = Σ_{i=1}^{c} V_i    (2.8)

where c denotes the number of clusters and V_i is the fuzzy hyper volume of the ith cluster.
Based on this measure, the proposed Hybrid Minimal Spanning Tree—Gath-Geva (MST-GG)
algorithm compares the volumes of the clusters. Bad clusters with large volumes are
further partitioned as long as there are ‘bad’ clusters.
In the first step, the algorithm creates the minimal spanning tree of the normalized
data that will be partitioned based on the following steps:
• classical cutting criteria of the MST (Criterion-1 and Criterion-2),
• the application of fuzzy hyper volume validity measure to eliminate edges from
the MST (Criterion-3).
The proposed Hybrid MST-GG algorithm iteratively builds the possible clusters.
First all objects form a single cluster, and then in each iteration step a binary splitting
is performed. The use of the cutting criteria results in a hierarchical tree of clusters,
in which the nodes denote partitions of the objects. To refine the partitions evolved in
the previous step, we need to calculate the volumes of the obtained clusters. In each
iteration step, the cluster (a leaf of the binary tree) having the largest hyper volume
is selected for the cutting. For the elimination of edges from the selected cluster, first
the cutting conditions Criterion-1 and Criterion-2 are applied, which were previously
introduced (see Sect. 2.2). The use of the classical MST based clustering methods
detects well-separated clusters, but does not solve the typical problem of the graph-
based clustering algorithms (chaining affect). To dissolve this discrepancy, the fuzzy
hyper volume measure is applied. If the cutting of the partition having the largest
hyper volume cannot be executed based on Criterion-1 or Criterion-2, then the cut is
performed based on the measure of the total fuzzy hyper volume. If this partition has
N objects, then N − 1 possible cuts must be checked. Each of the N − 1 possibilities
results in a binary split, hereby the objects placed in the cluster with the largest
hyper volume are distributed into two subclusters. The algorithm chooses the binary
split that results in the least total fuzzy hyper volume. The whole process is carried
out until a termination criterion is satisfied (e.g., the predefined number of clusters,
and/or the minimal number of objects in each partition is reached). As the number
of the clusters is not known beforehand, it is suggested to give a relatively large
threshold for it and then to draw the single linkage based dendrogram of the clusters
to determine the proper number of them.
The application of this hybrid cutting criterion can be seen as a divisive hierar-
chical method. Following a depth-first tree-growing process, cuttings are iteratively
performed. The final outcome is a hierarchical clustering tree, where the termination
nodes are the final clusters. Figure 2.2 demonstrates a possible result after applying
the different cutting methods on the MST. The partitions marked by the solid lines
result from applying the classical MST-based clustering methods (Criterion-1
or Criterion-2), and the partitions marked with gray dotted lines arise from the
application of the fuzzy hyper volume criterion (Criterion-3).
When a compact parametric representation of the clusters is needed, a Gaussian
mixture model-based clustering should be performed, where the number of Gaussians
is equal to the number of termination nodes, and the iterative Gath-Geva algorithm is
initialized based on the partition obtained from the cut MST. This approach is really
fruitful, since it is well known that the Gath-Geva algorithm is sensitive to the
initialization of the partitions. The previously obtained clusters give an appropriate
starting point for the GG algorithm. Hereby, the iterative application of the Gath-Geva
algorithm results in a good and compact representation of the clusters. The whole Hybrid
MST-GG algorithm is described in Algorithm 7.
The Hybrid MST-GG clustering method has the following four parameters: (i)
cutting condition for the classical splitting of the MST (Criterion-1 and Criterion-2);
(ii) terminating criterion for stopping the iterative cutting process; (iii) weighting
exponent m of the fuzzy membership values (see GG algorithm in Appendix A.5),
and (iv) termination tolerance ε of the GG algorithm.
The previously introduced Hybrid MST-GG algorithm involves two major parts: (1)
creating a clustering result based on the cluster volume based splitting extension of
the basic MST-based clustering algorithm, and (2) utilizing this clustering output as
the initialization of the iterative Gath-Geva clustering.
The first example is intended to illustrate that the proposed cluster volume based split-
ting extension of the basic MST-based clustering algorithm is able to handle (avoid)
the chaining phenomena of the classical single linkage scheme. Figure 2.3 presents
the minimal spanning tree of the normalized ChainLink data set (see Appendix A.6.9)
and the result of the classical MST based clustering method. The value of parame-
ter λ in this example was chosen to be 2. It means that based on Criterion-1 and
Criterion-2, those edges are removed from the MST that are 2 times longer than the
average length of the edges of the MST or 2 times longer than the average length of
nearby (connected) edges. Parameter settings λ = 2 . . . 57 give the same results. As
Fig. 2.3b illustrates, the classical MST based algorithm detects only two clusters. If
parameter λ is set to a smaller value, the algorithm cuts up the spherical clusters into
more subclusters, but it does not unfold the chain link. If parameter λ is very large
(λ = 58, 59, . . .), the classical MST-based algorithm cannot separate the data set.

Fig. 2.3 Classical MST based clustering of the ChainLink data set. a MST of the ChainLink data
set. b Clusters obtained by the classical MST based clustering algorithm
Figure 2.4 shows the results of the Hybrid MST-GG algorithm running on the
normalized ChainLink data set. Parameters were set as follows: cmax = 4, λ = 2,
m = 2, ε = 0.0001. Figure 2.4a shows the fuzzy sets that are the results of the Hybrid
MST-GG algorithm. In this figure, the dots represent the data points and the ‘o’ mark-
ers are the cluster centers. The membership values are also shown, since the curves
represent the isosurfaces of the membership values that are inversely proportional to
the distances. It can be seen that the Hybrid MST-GG algorithm partitions the data set
adequately, and it also unfolds the data chain between the clusters. Figure 2.4b shows
the hard clustering result of the Hybrid MST-GG algorithm. Objects belonging to
different clusters are marked with different notations. It is obtained by assigning the
objects to the cluster characterized by the largest fuzzy membership value. It can be
seen that the clustering rate is 100 %.
This short example illustrates the main benefit of the incorporation of the clus-
ter validity based criterion into the classical MST based clustering algorithm. In
the following, it will be shown how the resulting nonparametric clusters can be
approximated by a mixture of Gaussians, and how this approach is beneficial for the
initialization of these iterative partitional algorithms.
Let us consider a more complex clustering problem with clusters of convex shape.
This example is based on the Curves data set (see Appendix A.6.10). For the analysis,
the maximum number of the clusters was chosen to be cmax = 10, and parameter λ
was set to λ = 2.5. As Fig. 2.5 shows, the cutting of the MST based on the hybrid
cutting criterion is able to detect the clusters properly, because there is no partition
containing data points from different curves. The partitioning of the clusters has
not been stopped at the detection of the well-separated clusters (Criterion-1 and
Criterion-2), but the resulting clusters have been further split to get clusters with
small volumes, (Criterion-3). The main benefit of the resulted partitioning is that it
can be easily approximated by a mixture of multivariate Gaussians (ellipsoids). This
approximation is useful since the obtained Gaussians give a compact and parametric
description of the clusters.
Figure 2.6a shows the final result of the Hybrid MST-GG clustering. The notation
of this figure is the same as in Fig. 2.4. As can be seen, the clusters provide an
excellent description of the distribution of the data. The clusters with complex shape
are approximated by a set of ellipsoids. It is interesting to note that this clustering
step only slightly modifies the placement of the clusters (see Figs. 2.5 and 2.6a).

To determine the adequate number of the clusters, the single linkage dendrogram
has also been drawn based on the similarities of the clusters. Figure 2.6b shows
that it is worth merging clusters ‘7’ and ‘8’, then clusters ‘9’ and ‘10’; following
this, the merging of clusters {7, 8} and {5} is suggested, then follows the merging of
clusters {6} and {9, 10}. After this merging, the clusters {5, 7, 8} and {6, 9, 10} are
merged, hereby all objects placed in the long curve belong to a single cluster. The
merging process can be continued based on the dendrogram. Halting this iterative
process at the similarity level 0.995, the resulting clusters meet the users’ expectations
(clustering rate is 100 %).
Fig. 2.7 Result of the Gath-Geva clustering initialized by fuzzy c-means (Curves data set). a Result
of the GG clustering initialized by FCM. b Dendrogram based on the result of FCM-GG
For testing the effect of the parameters, we have performed several runs with
different values of parameters λ and cmax .1 It is not advisable to select parameter λ
to be smaller than 2, because the data set is then cut up into many small subclusters.
Choosing parameter λ to be greater than 2, on the other hand, does not have an effect on the final
result. If cmax is chosen to be smaller than 10, the algorithm is not able to cut up
the large (‘S’) curve. If parameter cmax is chosen to be larger than 10, the Hybrid
MST-GG algorithm discloses the structure of the data set well.
In order to demonstrate the effectiveness of the proposed initialization scheme,
Fig. 2.7 illustrates the result of the Gath-Geva clustering, where the clustering was
initialized by the classical fuzzy c-means algorithm. As can be seen, this widely
applied approach failed to find the proper clustering of the data set, only a sub-
optimal solution has been found. The main difference between these two approaches
can be seen in the dendrograms (see Figs. 2.6b and 2.7b).
The previous example showed that it is possible to obtain a properly clustered rep-
resentation by the proposed mapping algorithm. However, the real advantage of the
algorithm was not shown. This will be done by the clustering of the well-known
Iris data set (see Appendix A.6.1). The parameters were set as follows: cmax = 3,
λ = 2.5, m = 2 and ε = 0.0001.
The basic MST based clustering method (Criterion-1 and Criterion-2) detects
only two clusters. In this case, the third cluster is formed only after the application of
the cluster volume based splitting criterion (Criterion-3). The resulted three clusters
correspond to the three classes of the Iris flowers. At the analysis of the distribution
of the classes in the clusters, we found only three misclassification errors. The mix-
ture of Gaussians density model is able to approximate this cluster arrangement. The
1 The effect of parameters m and ε was not tested, because these parameters has effects only on the
0.2
−0.2
−0.4
−0.6
−0.8
−1 −0.5 0 0.5 1
fuzzy clusters resulted by the Hybrid MST-GG algorithm were converted to a hard
clustering by assigning each pattern to the cluster with the largest measure of mem-
bership. After this fine-tuning clustering step, we found only five misclassifications.
This means 96.67 % classification correctness, that is a quite good result for this
classification problem. Figure 2.8 shows the two-dimensional mapped visualization
of the classified Iris data set based on the Hybrid MST-GG algorithm completed with
the fuzzy-hard conversion. The two-dimensional mapping was made by the classical
multidimensional scaling.
While most similarity measures are based on distances defined in the n-dimensional
vector space (e.g. Manhattan distance, Mahalanobis distance), similarity measures
useful for topology-based clustering utilize neighborhood relations (e.g., mutual
neighbor distance).
Jarvis-Patrick clustering (JP) [19] is a very simple clustering method. The algo-
rithm first finds k nearest neighbors (knn) of all the objects. Two objects are placed
in the same cluster whenever they fulfill the following two conditions:
• they must be each other’s k-nearest neighbors, and
• they must have at least l nearest neighbors in common.
The algorithm has two parameters:
• parameter k, the number of the nearest neighbors to be taken into consideration, and
• parameter l, which determines the number of common neighbors necessary to classify
two objects into the same cluster.
A small illustrative sketch of this rule is given after this list.
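The following minimal Python sketch illustrates this shared nearest neighbour rule. It is our own illustration, not the authors' implementation; the function name jarvis_patrick, the union-find grouping, and the toy data are assumptions made only for the example. Two objects end up in the same cluster exactly when they are each other's k-nearest neighbours and share at least l common neighbours.

import numpy as np
from scipy.spatial.distance import cdist

def jarvis_patrick(X, k, l):
    """Toy Jarvis-Patrick clustering: mutual kNN membership + at least l shared neighbours."""
    n = len(X)
    d = cdist(X, X)
    np.fill_diagonal(d, np.inf)                      # exclude self-distances
    knn = [set(np.argsort(d[i])[:k]) for i in range(n)]   # k nearest neighbours of each object

    # union-find structure to merge objects fulfilling both conditions
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    def union(i, j):
        parent[find(i)] = find(j)

    for i in range(n):
        for j in range(i + 1, n):
            mutual = (j in knn[i]) and (i in knn[j])
            if mutual and len(knn[i] & knn[j]) >= l:
                union(i, j)

    roots = [find(i) for i in range(n)]
    relabel = {r: c for c, r in enumerate(sorted(set(roots)))}
    return np.array([relabel[r] for r in roots])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 0.1, (30, 2)), rng.normal(1, 0.1, (30, 2))])
    print(jarvis_patrick(X, k=8, l=3))               # two groups of 30 labels each

The grouping is obtained by transitively merging all object pairs that satisfy the two conditions, which mirrors the graph-based interpretation of the rule.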
The main drawback of this algorithm is that the determination of the parameters k
and l influences the output of the algorithm significantly. Other drawbacks are:
• the decision criterion is very rigid (the value of l), and
• this decision is constrained by the local k-nearest neighbors.
To avoid these disadvantages we suggested an extension of the similarity measure of
the Jarvis-Patrick algorithm. The suggested fuzzy neighborhood similarity measure
takes not only the k nearest neighbors into account, and it also gives a convenient tool to tune
parameter l based on visualization and hierarchical clustering methods that utilize
the proposed fuzzy neighborhood similarity. The proposed extension is carried out
in the following two ways:
• fuzzification of parameter l, and
• spreading of the scope of parameter k.
The suggested fuzzy neighborhood similarity measure can be applied in various
forms, in different clustering and visualization techniques (e.g. hierarchical cluster-
ing, MDS, VAT). In this chapter, some application examples are also introduced to
illustrate the efficiency of the use of the proposed fuzzy neighborhood similarity
measure in clustering. These examples show that the fuzzy neighborhood similarity
measure based clustering techniques are able to detect clusters with different sizes,
shapes, and densities. It is also shown that outliers are also detectable by the proposed
measure.
Let X = {x1 , x2 , . . . , x N } be the set of data. Denote xi the ith object, which
consists of D measured variables, grouped into a D-dimensional column vector
xi = [x1,i , x2,i , . . . , x D,i ]T , xi ∈ R D . Denote m i, j the number of common k-
nearest neighbors of xi and x j . Furthermore, denote set Ai the k-nearest neighbors
of xi , and A j , respectively, for x j . The Jarvis-Patrick clustering groups xi and x j in
the same cluster, if Eq. (2.9) holds.
s_{i,j}^{(r)} = \frac{|A_i^{(r)} \cap A_j^{(r)}|}{|A_i^{(r)} \cup A_j^{(r)}|},   (2.11)

where the set A_i^{(r)} denotes the r-order k-nearest neighbors of object x_i, and A_j^{(r)}, respec-
tively, those of x_j. In each iteration step, the pairwise calculated fuzzy neighborhood
similarity measures are updated based on the following first-order filtering formula:

\tilde{s}_{i,j}^{(r)} = (1 - \alpha)\,\tilde{s}_{i,j}^{(r-1)} + \alpha\, s_{i,j}^{(r)},

where α is the first-order filter parameter. The iteration process proceeds until r
reaches the predefined value (r_max). The whole procedure is given in Algorithm 8.
As a result of the whole process, a fuzzy neighborhood similarity matrix (S) is
obtained, containing the pairwise fuzzy neighborhood similarities. The fuzzy neighborhood
distance matrix (D) of the objects is obtained by the formula D = 1 − S. Both the
similarity and the distance matrices are symmetric, S^T = S and D^T = D.
The computation of the proposed transitive fuzzy neighborhood similarity/distance
measure includes the proper setting of three parameters: k, r_max, and α. A lower k (e.g.,
k = 3) separates the clusters better; by increasing the value of k the clusters start to
overlap in similar objects. The higher r_max is, the higher the similarity measure becomes,
and an increase of r_max results in more compact clusters. The lower the value of α, the
smaller the effect of far neighbors becomes.
As the fuzzy neighborhood similarity measure is a special case of the transitive
fuzzy neighborhood similarity measure, in the following these terms will be used
interchangeably.
where the set A_i^{(r)} denotes the r-order k-nearest neighbors of object x_i ∈ X, and A_j^{(r)}, respec-
tively, those of x_j ∈ X.
Step 2 Update the fuzzy neighborhood similarity measures based on the following formula:

\tilde{s}_{i,j}^{(r)} = (1 - \alpha)\,\tilde{s}_{i,j}^{(r-1)} + \alpha\, s_{i,j}^{(r)},   (2.14)

Finally, \tilde{s}_{i,j}^{(r_{max})} yields the fuzzy neighborhood similarities of the objects.
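A compact Python sketch of this iterative computation is given below. It is illustrative only: we interpret the r-order k-nearest neighbours A_i^{(r)} as the objects reachable in at most r steps along the k-nearest-neighbour relation, and the zero initialisation of the filtered similarities is our own choice, since Algorithm 8 is not reproduced here in full.

import numpy as np
from scipy.spatial.distance import cdist

def fuzzy_neighborhood_similarity(X, k=6, r_max=3, alpha=0.2):
    """Sketch of the transitive fuzzy neighborhood similarity matrix S (our interpretation)."""
    n = len(X)
    d = cdist(X, X)
    np.fill_diagonal(d, np.inf)
    knn = [set(np.argsort(d[i])[:k]) for i in range(n)]   # first-order k-nearest neighbours

    A = [set(s) for s in knn]          # r-order neighbourhoods, starting with r = 1
    S_tilde = np.zeros((n, n))         # assumed zero initialisation of the filtered similarities
    for r in range(1, r_max + 1):
        if r > 1:
            # extend the neighbourhoods by one more kNN step
            A = [set().union(a, *(knn[j] for j in a)) for a in A]
        # Jaccard-type similarity of the r-order neighbourhoods, cf. Eq. (2.11)
        S_r = np.zeros((n, n))
        for i in range(n):
            for j in range(i + 1, n):
                union = len(A[i] | A[j])
                S_r[i, j] = S_r[j, i] = len(A[i] & A[j]) / union if union else 0.0
        # first-order filtering, cf. Eq. (2.14)
        S_tilde = (1 - alpha) * S_tilde + alpha * S_r
    return S_tilde          # the distance matrix is then D = 1 - S_tilde

The parameters k, r_max and α play the roles discussed above; the sketch returns the similarity matrix S from which D = 1 − S can be formed.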
There are several ways to apply the previously introduced fuzzy neighborhood
similarity/distance matrix. For example, hierarchical clustering methods work on
similarity or distance matrices. Generally, these matrices are obtained from the
Euclidean distances of pairs of objects; instead, the hierarchical methods can also
utilize the fuzzy neighborhood similarity/distance matrix. The dendrogram not only
shows the whole iteration process, but it can also be a useful tool to determine the
number of the data groups and the threshold of the separation of the clusters. To
separate the clusters, we suggest drawing the fuzzy neighborhood similarity based
dendrogram of the data, on which the long branches indicate the proper threshold
to separate the clusters.
The visualization of the objects may significantly assist in revealing the clusters.
Many visualization techniques are based on the pairwise distance of the data. Because
multidimensional scaling methods (MDS) (see Sect. 3.3.3) work on dissimilarity
matrices, this method can also be based on the fuzzy neighborhood distance matrix.
Furthermore, the VAT is also an effective tool to determine the number of the clusters.
Because VAT works with the dissimilarities of the data, it can also be based on the
fuzzy neighborhood distance matrix.
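As an illustration of these possibilities, the sketch below feeds a precomputed fuzzy neighbourhood distance matrix D = 1 − S into single linkage hierarchical clustering and into metric MDS. The SciPy and scikit-learn calls, as well as the 0.75 cutting threshold, are only example choices and are not prescribed by the method.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from sklearn.manifold import MDS

def analyse_fuzzy_distances(D, cut=0.75):
    """D: symmetric fuzzy neighborhood distance matrix with zeros on the diagonal."""
    # single linkage hierarchy built directly on the precomputed distances
    Z = linkage(squareform(D, checks=False), method="single")
    labels = fcluster(Z, t=cut, criterion="distance")      # cut the dendrogram at d_{i,j} = cut

    # 2-dimensional MDS embedding based on the same dissimilarities
    mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
    Y = mds.fit_transform(D)
    return labels, Y, Z

The linkage matrix Z can also be passed to scipy.cluster.hierarchy.dendrogram to inspect the separation threshold visually, in the spirit of Fig. 2.12.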
In the following section, some examples are presented to show the application of
the fuzzy neighborhood similarity/distance measure. The first example is based on
a synthetic data set, and the second and third examples deal with visualization and
clustering of the well-known Iris and Wine data sets.
The variety data set is a synthetic data set which contains 100 two-dimensional
data objects. 99 objects are partitioned in 3 clusters with different sizes (22, 26, and 51
objects), shapes, and densities, and it also contains an outlier (see Appendix A.6.8).
Fig. 2.9 Results of the Jarvis-Patrick clustering on the normalized Variety data set. a k = 8, l = 3.
b k = 8, l = 4. c k = 8, l = 5
Figure 2.9 shows some results of Jarvis-Patrick clustering applied on the normalized
data set. The objects belonging to different clusters are marked with different markers.
In these cases, the value of parameter k was fixed to 8, and the value of parameter l
was changed from 2 to 5. (The parameter settings k = 8, l = 2 gives the same result
as k = 8 and l = 3.) It can be seen that the Jarvis-Patrick algorithm was not able to
identify the clusters in any of the cases. The cluster placed in the upper right corner in
all cases is split into subclusters. When parameter l is low (l = 2, 3, 4), the algorithm
is not able to detect the outlier. When parameter l is higher, the algorithm detects the
outlier, but the other clusters are split into more subclusters. After multiple runs of
the JP algorithm, a clustering result appeared in which all objects were clustered
according to the expectations. This parameter setting was: k = 10 and l = 5. To show
the complexity of this data set, the result of the well-known k-means clustering
is also presented in Fig. 2.10 (the number of the clusters is 4). This algorithm is not
able to disclose the outlier; thereby the cluster with small density is split into two
subclusters.
Table 2.1 summarizes the clustering rates of the previously presented algorithms.
The clustering rate was calculated as the fraction of the number of well-clustered
objects and the total number of objects.
The proposed fuzzy neighborhood similarity measure was calculated with differ-
ent k, rmax and α parameters. Different runs with parameters k = 3 . . . 25, rmax =
2 . . . 5 and α = 0.1 . . . 0.4 resulted in good clustering outcomes. If a large
value is chosen for parameter k, it is necessary to keep parameter rmax on a small
value to avoid merging the outlier object with one of the clusters.
To show the fuzzy neighborhood distances of the data, the objects are visualized
by multidimensional scaling and VAT. Figure 2.11a shows the MDS mapping of the
fuzzy neighborhood distances with the parameter settings: k = 6, rmax = 3 and
α = 0.2. Other parameter settings have also been tried, and they show similar results
to Fig. 2.11a. It can be seen that the calculated pairwise fuzzy neighborhood similarity
measure separates the three clusters and the outlier well. Figure 2.11b shows the
VAT representation of the data set based on the single linkage fuzzy neighborhood
distances. The three clusters and the outlier are also easily separable in this figure.
To find the proper similarity threshold to separate the clusters and the outlier,
the dendrogram based on the single linkage connections of the fuzzy neighborhood
distances of the objects (Fig. 2.12) has also been drawn. The dendrogram shows that
the value di, j = 0.75 (di, j = 1 − si, j ) is a suitable choice to separate the clusters and
Fig. 2.11 Different graphical representations of the fuzzy neighborhood distances (Variety data
set). a MDS based on the fuzzy neighborhood distance matrix. b VAT based on the single linkage
fuzzy neighborhood distances
Fig. 2.12 Single linkage dendrogram based on the fuzzy neighborhood distances (Variety data set)
the outlier from each other (k = 6, rmax = 3 and α = 0.2). Applying a single linkage
agglomerative hierarchical algorithm based on the fuzzy neighborhood distances and
halting this algorithm at the threshold d_{i,j} = 0.75, the clustering rate is 100 %. In
the other cases (k = 3 . . . 25, rmax = 2 . . . 5 and α = 0.1 . . . 0.4, where for large
values of parameter k the parameter rmax was kept at low values), the clusters were
also easily separable and the clustering rate obtained was 99–100 %.
This simple example illustrates that the proposed fuzzy neighborhood similar-
ity measure is able to separate clusters with different sizes, shapes, and densities,
furthermore it is able to identify outliers. The Wine database (see Appendix A.6.3)
consists of the chemical analysis of 178 wines from 3 different cultivars in the same
Italian region. Each wine is characterized by 13 attributes, and there are 3 classes
distinguished. Figure 2.13 shows the MDS projections based on the Euclidian and
the fuzzy neighborhood distances (k = 6, rmax = 3, α = 0.2). The figures illustrate
that the fuzzy neighborhood distance based MDS separates the three clusters better.
Fig. 2.13 Different MDS representations of the Wine data set. a MDS based on the Euclidean
distances. b MDS based on the fuzzy neighborhood distance matrix
Fig. 2.14 Average linkage based dendrogram of fuzzy neighborhood distances (Wine data set)
To separate the clusters, we have drawn dendrograms based on the single, average,
and the complete linkage distances. Using these parameters, the best result (cluster-
ing rate 96.62 %) is given by the average linkage based dendrogram, on which the
clusters are uniquely separable. In Fig. 2.14, the average linkage based dendrogram
of the fuzzy neighborhood distances is shown. Figure 2.15 shows the VAT represen-
tation of the Wine data set based on the average linkage based relations of the fuzzy
neighborhood distances. It can be seen that the VAT representation also suggests
three clusters.
For the comparison the Jarvis-Patrick algorithm was also tested with different set-
tings on this data set. Running results of this algorithm show very diverse clustering
rates (see Table 2.2). The fuzzy neighborhood similarity was also tested on the Iris
data set. This data set contains data about three types of iris flowers (see Appendix
A.6.1). Iris setosa is easily distinguishable from the other two types, but the Iris
versicolor and the Iris virginica are very similar to each other. Figure 2.16a shows
the MDS mapping of this data set based on the fuzzy neighborhood distances. This
visualization distinguishes the Iris setosa from the other two types, but the individ-
Fig. 2.15 Average linkage based VAT of fuzzy neighborhood distances (Wine data set)
Fig. 2.16 MDS and VAT representations of the Iris data set based on the fuzzy neighborhood
distances. a MDS based on the fuzzy neighborhood distance matrix. b VAT based on the fuzzy
neighborhood distance matrix
uals of the Iris versicolor and virginica overlap each other. Figure 2.16b shows the
VAT visualization of the fuzzy neighborhood distances based on the single linkage
relations of the objects. The VAT visualization also suggests a well-separated and
two overlapping clusters. The parameter settings in both cases were: k = 5, rmax = 3
and α = 0.2. Different runs of the original Jarvis-Patrick clustering have not given
an acceptable result.
The fuzzy neighborhood similarity/distance measure can thus discover clusters with
arbitrary shapes, sizes, and densities. Furthermore, it is able to identify outliers,
as well.
References
1. Anders, K.H.: A hierarchical graph-clustering approach to find groups of objects. In: Pro-
ceedings 5’th ICA workshop on progress in automated map generalization, IGN, pp. 28–30
(2003)
2. Augustson, J.G., Minker, J.: An analysis of some graph theoretical clustering techniques. J.
ACM 17, 571–588 (1970)
3. Backer, F.B., Hubert, L.J.: A graph-theoretic approach to goodness-of-fit in complete-link
hierarchical clustering. J. Am. Stat. Assoc. 71, 870–878 (1976)
4. Barrow, J.D., Bhavsar, S.P., Sonoda, D.H.: Minimal spanning trees, filaments and galaxy clus-
tering. Mon. Not. R. Astron. Soc. 216, 17–35 (1985)
5. Bezdek, J.C., Clarke, L.P., Silbiger, M.L., Arrington, J.A., Bensaid, A.M., Hall, L.O., Murtagh,
R.F.: Validity-guided (re)clustering with applications to image segmentation. IEEE Trans.
Fuzzy Syst. 4, 112–123 (1996)
6. Bezdek, J., Pal, N.: Some new indexes of cluster validity. IEEE Trans. Syst. Man Cybern. 28,
301–315 (1998)
7. Doman, T.N., Cibulskis, J.M., Cibulskis, M.J., McCray, P.D., Spangler, D.P.: Algorithm5: a
technique for fuzzy similarity clustering of chemical inventories. J. Chem. Inf. Comput. Sci.
36, 1195–1204 (1996)
8. Dunn, C.: Well separated clusters and optimal fuzzy partitions. J. Cybern. 4, 95–104 (1974)
9. Eppstein, D., Paterson, M.S., Yao, F.F.: On nearest-neighbor graphs. Discrete Comput. Geom.
17, 263–282 (1997)
10. Forina, M., Oliveros, C., Concepción, M., Casolino, C., Casale, M.: Minimum spanning tree:
ordering edges to identify clustering structure. Anal. Chim. Acta 515, 43–53 (2004)
11. Fortune, S.: Voronoi diagrams and delaunay triangulations. In: Du, D.-Z., Hwang, F.K. (eds.),
Computing in Euclidean Geometry, pp. 193–223. World Scientific, Singapore (1992)
12. Gabriel, K., Sokal, R.: A new statistical approach to geographic variation analysis. Syst. Zool.
18, 259–278 (1969)
13. Gath, I., Geva, A.B.: Unsupervised optimal fuzzy clustering. IEEE Trans. Pattern Anal. Mach.
Intell. 11, 773–781 (1989)
14. Gonzáles-Barrios, J.M., Quiroz, A.J.: A clustering procedure based on the comparsion between
the k nearest neighbors graph and the minimal spanning tree. Stat. Probab. Lett. 62, 23–34
(2003)
15. Gower, J.C., Ross, G.J.S.: Minimal spanning trees and single linkage cluster analysis. Appl.
Stat. 18, 54–64 (1969)
16. Guha, S., Rastogi, R., Shim, K.: ROCK: a robust clustering algorithm for categorical attributes.
In: Proceedings of the 15th international conference on data engineering, pp. 512–521 (1999)
17. Huang, X., Lai, W.: Clustering graphs for visualization via node similarities. J. Vis. Lang.
Comput. 17, 225–253 (2006)
18. Jaromczyk, J.W., Toussaint, G.T.: Relative neighborhood graphs and their relatives. Proc. IEEE
80(9), 1502–1517 (1992)
19. Jarvis, R.A., Patrick, E.A.: Clustering using a similarity measure based on shared near neigh-
bors. IEEE Trans. Comput. C22, 1025–1034 (1973)
20. Karypis, G., Han, E.-H., Kumar, V.: Chameleon: hierarchical clustering using dynamic mod-
eling. IEEE Comput. 32(8), 68–75 (1999)
21. Kruskal, J.B.: On the shortest spanning subtree of a graph and the traveling salesman problem.
Proc. Am. Math. Soc. 7(1), 48–50 (1956)
22. Päivinen, N.: Clustering with a minimum spanning tree of scale-free-like structure. Pattern
Recog. Lett. 26, 921–930 (2005)
23. Prim, R.C.: Shortest connection networks and some generalizations. Bell Syst. Tech. J. 36,
1389–1401 (1957)
24. Raghavan, V.V., Yu, C.T.: A comparison of the stability characteristics of some graph theoretic
clustering methods. IEEE Trans. Pattern Anal. Mach. Intell. 3, 393–402 (1980)
25. Toussaint, G.T.: The relative neighborhood graph of a finite planar set. Pattern Recogn. 12,
261–268 (1980)
26. Varma, S., Simon, R.: Iterative class discovery and feature selection using Minimal Spanning
Trees. BMC Bioinform. 5, 126–134 (2004)
27. Vathy-Fogarassy, A., Kiss, A., Abonyi, J.: Hybrid minimal spanning tree and mixture of Gaus-
sians based clustering algorithm. In: Lecture Notes in Computer Science: Foundations of Infor-
mation and Knowledge Systems vol. 3861, pp. 313–330. Springer, Heidelberg (2006)
28. Vathy-Fogarassy, A., Kiss, A., Abonyi, J.: Improvement of Jarvis-Patrick clustering based on
fuzzy similarity. In: Masulli, F., Mitra, S., Pasi, G. (eds.) Applications of Fuzzy Sets Theory,
LNCS, vol. 4578, pp. 195–202. Springer, Heidelberg (2007)
29. Yao, A.: On constructing minimum spanning trees in k-dimensional spaces and related prob-
lems. SIAM J. Comput. 11, 721–736 (1982)
30. Zahn, C.T.: Graph-theoretical methods for detecting and describing gestalt clusters. IEEE Trans.
Comput. C20, 68–86 (1971)
Chapter 3
Graph-Based Visualisation of High Dimensional
Data
Since data visualisation and dimensionality reduction form an active research area,
lots of research papers introduce new algorithms or utilise them in different scientific
fields (e.g. [4–9]).
Feature selection methods keep the most important dimensions of the data and elimi-
nate unimportant or noisy factors. Forward selection methods start with an empty set
and add variables to this set one by one by optimizing an error criterion. Backward
selection methods start with all variables in the selected set and remove them one by
one, in each step removing the one that decreases the error the most.
In the literature there are many approaches (e.g. [10–13]) described to select the
proper subset of the attributes. The well known exhaustive search method [12] examines
all \binom{D}{d} possible subsets and selects the subset with the largest feature selection
criterion as the solution. This method guarantees to find the optimum solution, but if
the number of the possible subsets is large, it becomes impractical. There have been
many methods proposed to avoid the enormous computational cost (e.g. branch and
bound search [14], floating search [15], Monte Carlo algorithms).
Feature extraction methods do not select the most relevant attributes but they com-
bine them into some new attributes. The number of these new attributes is generally
much smaller than the number of the original attributes. So feature extraction methods
take all attributes into account and provide a reduced representation by feature
combination and/or transformation. The resulting representation provides relevant
information about the data. There are several dimensionality reduction methods pro-
posed in the literature based on the feature extraction approach, for example the well
known Principal Component Analysis (PCA) [16, 17], Sammon mapping (SM) [18],
or the Isomap [19] algorithm.
Data sets to be analysed often contain lower dimensional manifolds embedded
in higher dimensional space. If these manifolds are linearly embedded into high-
dimensional vector space the classical linear dimensionality reduction methods pro-
vide a fairly good low-dimensional representation of data. These methods assume
that data lie on a linear or on a near linear subspace of the high-dimensional space
and they calculate the new coordinates of data as the linear combination of the orig-
inal variables. The most commonly used linear dimensionality reduction methods
are for example the Principal Component Analysis (PCA) [17], the Independent
Component Analysis (ICA) [20] or the Linear Discriminant Analysis (LDA) [21].
However if the manifolds are nonlinearly embedded into the higher dimensional
space linear methods provide unsatisfactory representation of data. In these cases
the nonlinear dimensionality reduction methods may outperform the traditional lin-
ear techniques and they are able to give a good representation of data set in the
low-dimensional data space. To unfold these nonlinearly embedded manifolds many
nonlinear dimensionality reduction methods are based on the concept of geodesic
distance and they build up graphs to carry out the visualisation process (e.g. Isomap,
Isotop, TRNMap). The best known nonlinear dimensionality reduction methods are
Kohonen's Self-Organizing Maps (SOM) [22, 23], Sammon mapping [18], Locally
Linear Embedding (LLE) [24, 25], Laplacian Eigenmaps [26] and Isomap [19].
Dimensionality reduction methods approximate the high-dimensional data distribution
in a low-dimensional vector space. Different dimensionality reduction approaches
E_{metric\_MDS} = \frac{1}{\sum_{i<j}^{N} d_{i,j}^{*2}} \sum_{i<j}^{N} \left( d_{i,j}^{*} - d_{i,j} \right)^{2},   (3.2)

In both equations d^{*}_{i,j} denotes the distance between the ith and jth original objects,
and d_{i,j} yields the distance of the corresponding mapped data points in the reduced
vector space. Variable N yields the number of the objects to be mapped.
The error measure is based on the residual variance defined as:

1 - R^{2}(D_{X}^{*}, D_{Y}),   (3.3)

M_{1}(k) = 1 - \frac{2}{Nk(2N - 3k - 1)} \sum_{i=1}^{N} \sum_{j \in U_k(i)} \left( r(i,j) - k \right),   (3.4)

M_{2}(k) = 1 - \frac{2}{Nk(2N - 3k - 1)} \sum_{i=1}^{N} \sum_{j \in V_k(i)} \left( s(i,j) - k \right),   (3.5)

where s(i, j) is the rank of the data sample i from j in the output space, and V_k(i)
denotes the set of those data points that belong to the k-neighbours of data sample i
in the original space, but not in the mapped space used for visualisation.
In this book when mappings are based on geodesic distances, the ranking values
of the objects in both cases (trustworthiness and continuity) are calculated based on
the geodesic distances.
The mapping quality of the applied methods in local and in global areas can be
expressed by trustworthiness and continuity. Both measures are functions of the
number of neighbours k. Usually, trustworthiness and continuity are calculated for
k = 1, 2, . . . , kmax , where kmax denotes the maximum number of the objects to be
taken into account. At small values of parameter k the local reconstruction perfor-
mance of the model can be tested, while at larger values of parameter k the global
reconstruction is measured.
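For a quick numerical check, both measures can be approximated with off-the-shelf tools. The sketch below uses scikit-learn's trustworthiness function and the observation that continuity corresponds to trustworthiness computed with the roles of the original and the mapped space exchanged; this shortcut, as well as the PCA embedding of the iris data, is our own illustration and not taken from the book.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.manifold import trustworthiness

X = load_iris().data
Y = PCA(n_components=2).fit_transform(X)      # any 2-dimensional embedding would do

for k in (5, 10, 20):
    t = trustworthiness(X, Y, n_neighbors=k)  # corresponds to M1(k)
    c = trustworthiness(Y, X, n_neighbors=k)  # continuity: spaces swapped (our shortcut)
    print(f"k={k:2d}  trustworthiness={t:.3f}  continuity={c:.3f}")

Small values of k probe the local reconstruction, larger values the global behaviour, exactly as described above.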
Topographic error and topographic product quality measures may also be used
to give information about the neighbourhood preservation of mapping algorithms.
Topographic error [32] takes only the first and second neighbours of each data point
into account and it analyzes whether the nearest and the second nearest neighbours
remain neighbours of the object in the mapped space or not. If these data points are
not adjacent in the mapped graph the quality measure considers this a mapping error.
The sum of errors is normalized to a range from 0 to 1, where 0 means the perfect
topology preservation.
The topographic product introduced by Bauer in 1992 [33] was developed for qual-
ifying the mapping result of the SOM. This measure has an input parameter k and it
takes not only the two nearest neighbours into account. The topographic product com-
pares the neighbourhood relationship between each pair of data points with respect
to both their position in the resulting map and their original reference vectors in the
input space.
One of the most widely applied dimensionality reduction methods is Principal
Component Analysis (PCA) [16, 17]. The PCA algorithm is also known as the Hotelling
or the Karhunen–Loève transform [16, 17]. PCA differs from the metric and non-
metric dimensionality reduction methods, because instead of the preservation of the
distances or the global ordering relations of the objects it tries to preserve the variance
of the data. PCA represents the data as linear combinations of a small number of basis
vectors. This method finds the projection that stores the largest variance possible in
the original data and rotates the set of the objects such that the maximum variability
becomes visible. Geometrically, PCA transforms the data into a new coordinate
system such that the greatest variance by any projection of the data comes to lie on
the first coordinate, the second greatest variance on the second coordinate, and so
on. If the data set (X) is characterised with D dimensions and the aim of the PCA is
to find the d-dimensional reduced representation of the data set, the PCA works as
follows: the corresponding d-dimensional output is found by the linear transformation
Y = QX, where Q is the d × D matrix of the linear transformation composed of the
eigenvectors belonging to the d largest eigenvalues of the covariance matrix, and Y is
the d × N matrix of the projected data set.
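The linear transformation described above can be sketched directly via an eigendecomposition of the covariance matrix. In the illustration below the data matrix is arranged with observations in rows, so the projection reads X_c Q rather than QX; the function name and the toy data are our own choices.

import numpy as np

def pca_project(X, d=2):
    """Project the rows of X (N x D) onto the d leading principal components."""
    Xc = X - X.mean(axis=0)                          # centre the data
    C = np.cov(Xc, rowvar=False)                     # D x D covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)             # eigenvalues in ascending order
    Q = eigvecs[:, np.argsort(eigvals)[::-1][:d]]    # D x d matrix of leading eigenvectors
    return Xc @ Q                                    # N x d projected data

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(150, 4))
    print(pca_project(X, d=2).shape)                 # (150, 2)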
To illustrate PCA-based dimensionality reduction visualisation we have chosen
2 well known data sets. Figure 3.1 shows the PCA-based mapping of the widely
Fig. 3.1 PCA-based visualisations of iris data set. a 3-dimensional PCA-based visualisation of iris
data set. b 2-dimensional PCA-based visualisation of iris data set
used iris data set (see Appendix A.6.1). In the original data set each sample flower is
characterised with 4 attributes (sepal length and width and petal length and width).
As each sample flower is characterised with 4 numeric values, the original data set
is placed in a 4-dimensional vector space, which is not visible for human eyes. By
the use of PCA this dimensionality can be reduced. In the first subfigure the first
three principal components are shown on the axes (PC1, PC2, PC3), so it provides
a 3-dimensional presentation of the iris data set. In the second subfigure only the
first and the second principal components are shown on the axes, therefore in this case
a 2-dimensional visualisation is presented. In these figures each colored point corresponds
to a flower of the original data set. Red points indicate iris flowers from
class iris setosa, blue points indicate sample flowers from class iris versicolor and
magenta points indicate flowers from class iris virginica.
Figure 3.2 presents the colored S curve data set with 5000 sample points and
its 2-dimensional PCA-based mapping result. The second part of this figure (Fig.
3.2b) demonstrates the main drawback of the Principal Component Analysis based
mapping. As this subfigure neither starts with the dark blue points nor ends with the
dark red data points, this method is not able to unfold manifolds that are nonlinearly
embedded in a higher dimensional space.
As PCA is a linear dimensionality reduction method, it can not unfold nonlinearly
embedded low-dimensional manifolds in the high-dimensional vector space. Kernel
PCA [35, 36] extends the power of the PCA algorithm by applying the kernel trick.
Fig. 3.2 Colored S curve data set and its PCA based mapping. a Original S curve data set with
N = 5000 points. b PCA-based visualisation of S curve data set
First it transforms the data into a higher-dimensional feature space, and the principal
components are then extracted in this feature space.
Sammon mapping (SM) [18] is one of the well known metric, nonlinear dimension-
ality reduction methods. While PCA attempts to preserve the variance of the data
during the mapping, Sammon's mapping tries to preserve the interpattern distances
[37, 38] as it tries to optimise a cost function that describes how well the pairwise
distances in a data set are preserved. The aim of the mapping process is to min-
imise this cost function step by step. The Sammon stress function (distortion of the
Sammon projection) can be written as:
E_{SM} = \frac{1}{\sum_{i<j}^{N} d_{i,j}^{*}} \sum_{i<j}^{N} \frac{\left( d_{i,j}^{*} - d_{i,j} \right)^{2}}{d_{i,j}^{*}},   (3.6)

where d^{*}_{i,j} denotes the distance between the vectors x_i and x_j, and d_{i,j}, respectively,
the distance between y_i and y_j.
The minimisation of the Sammon stress is an optimisation problem. When the
gradient-descent method is applied to search for the minimum of Sammon stress, a
local minimum can be reached. Therefore a significant number of runs with different
random initialisations may be necessary.
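A bare-bones sketch of such a minimisation is shown below. It applies a plain first-order gradient step with a fixed learning rate to the Sammon stress, which is a simplification of Sammon's original second-order update; the step size, the number of iterations and the random initialisation are arbitrary example choices.

import numpy as np
from scipy.spatial.distance import cdist

def sammon(X, d=2, iters=500, lr=0.3, seed=0):
    """Plain gradient descent on the Sammon stress (illustrative first-order variant)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    Dx = cdist(X, X) + np.eye(n)                 # input distances (eye avoids division by zero)
    c = Dx[np.triu_indices(n, 1)].sum()          # normalising constant of the stress
    Y = rng.normal(scale=X.std(), size=(n, d))   # random initial configuration
    for _ in range(iters):
        Dy = cdist(Y, Y) + np.eye(n)
        W = (Dx - Dy) / (Dx * Dy)                # weights of the stress gradient
        np.fill_diagonal(W, 0.0)
        grad = -2.0 / c * (W.sum(1)[:, None] * Y - W @ Y)
        Y -= lr * grad                           # fixed-step descent on E_SM
    return Y

Because only a local minimum is reached, in practice the function would be called with several different seeds and the configuration with the lowest stress kept.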
Figure 3.3 shows the 2-dimensional visualisation of the S curve data set resulting
from the Sammon mapping. Similarly to the previously presented PCA-based map-
ping, the Sammon mapping can not unfold the nonlinearly embedded 2-dimensional
manifold.
The metric MDS stress function can be written as:

E_{metric\_MDS} = \frac{1}{\sum_{i<j}^{N} d_{i,j}^{*2}} \sum_{i<j}^{N} \left( d_{i,j}^{*} - d_{i,j} \right)^{2},   (3.7)

where d^{*}_{i,j} denotes the distance between the vectors x_i and x_j, and d_{i,j} the distance between y_i
and y_j respectively. The only difference between the stress functions of the Sam-
mon mapping (see 3.6) and the metric MDS (see 3.7) is that the errors in distance
preservation in the case of Sammon mapping are normalized by the distances of the
input data objects. Because of this normalisation the Sammon mapping emphasises
the preservation of small distances.
Classical MDS is an algebraic method that rests on the fact that the matrix Y con-
taining the output coordinates can be derived by eigenvalue decomposition from the
scalar product matrix B = YY^T. Matrix B can be found from the known distances
using the Young–Householder process [39]. The detailed metric MDS algorithm is the
following:
To illustrate the method, the visualisation of the iris data set and of points lying on an S
curve was chosen (see Fig. 3.4).
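A compact sketch of this algebraic route, double centering of the squared distances followed by an eigendecomposition in the spirit of the Young–Householder construction, is given below; it is an illustration of the idea rather than a reproduction of the algorithm box referred to above.

import numpy as np

def classical_mds(D, d=2):
    """Classical (metric) MDS from a symmetric N x N distance matrix D."""
    N = D.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centred scalar products, B = Y Y^T
    eigvals, eigvecs = np.linalg.eigh(B)
    idx = np.argsort(eigvals)[::-1][:d]        # d largest eigenvalues
    L = np.sqrt(np.clip(eigvals[idx], 0, None))
    return eigvecs[:, idx] * L                 # N x d output coordinates

Feeding the routine with a Euclidean distance matrix reproduces the mappings of Fig. 3.4; any other dissimilarity matrix (e.g. geodesic distances) can be used in the same way.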
In contrast with metric multidimensional scaling, in non-metric MDS only the
ordinal information of the proximities is used for constructing the spatial config-
uration. Thereby non-metric MDS attempts to preserve the rank order among the
dissimilarities. The non-metric MDS finds a configuration of points whose pairwise
Euclidean distances have approximately the same rank order as the corresponding
dissimilarities of the objects. Equivalently, the non-metric MDS finds a configuration
of points, whose pairwise Euclidean distances approximate a monotonic transforma-
Fig. 3.4 MDS mappings of iris and S curve data sets. a MDS mapping of iris data set. b MDS
mapping of S curve data set
tion of the dissimilarities. These transformed values are known as the disparities.
The non-metric MDS stress can be formulated as follows:
E_{nonmetric\_MDS} = \sqrt{ \sum_{i<j}^{N} \left( d_{i,j} - \hat{d}_{i,j} \right)^{2} \Big/ \sum_{i<j}^{N} d_{i,j}^{2} },   (3.8)

where \hat{d}_{i,j} yields the disparity of objects x_i and x_j, and d_{i,j} denotes the distance
between the vectors y_i and y_j. Traditionally, the non-metric MDS stress is often
called Stress-1 due to Kruskal [41].
The main steps of the non-metric MDS algorithm are given in Algorithm 10.
It can be shown that metric and non-metric MDS mappings are substantially dif-
ferent methods. On the one hand, while the metric MDS algorithm is an algebraic method,
the non-metric MDS is an iterative mapping process. On the other hand, the main goal
of the optimisation differs significantly, too. While metric multidimensional scaling
methods attempt to maintain the degree of the pairwise dissimilarities of data
points, the non-metric multidimensional scaling methods focus on the preservation
of the order of the neighbourhood relations of the objects.
Heat kernel: w_{ij} = e^{-\frac{\|x_i - x_j\|^{2}}{t}},   (3.9)

where t is an input parameter.
Simple-minded: w_{ij} = 1 if objects x_i and x_j are connected by an edge, otherwise w_{ij} = 0.
Step 3 In the third step the algorithm computes the eigenvectors and eigenvalues for the following
generalized eigenvector problem:
Fig. 3.5 LPP mappings of S curve and iris data sets. a LPP mapping of S curve data set. b LPP
mapping of iris data set
The Semeion data set contains handwritten digits from around 80 persons. Each person
wrote on a paper all the digits from 0 to 9, twice: the first time in the normal way, as
accurately as they could, and the second time in a fast way. The digits were scanned and stretched into a rectangular box including
16 × 16 cells in a grey scale of 256 values. Then each pixel of each image was scaled
into a boolean value using a fixed threshold. As a result the data set contains 1593
sample digits and each digit is characterised with 256 boolean variables. LPP of
the Semeion data set is shown in Fig. 3.6. The resulted 2-dimensional visualisation
shows interesting correlations between the digits.
The weight vectors of the SOM are updated according to the rule

w_i(t+1) = w_i(t) + h_{c,i}(t)\,\left[ x(t) - w_i(t) \right],

where t denotes time, w_i denotes the neurons in the grid, x(t) is the random sample
object at time t and h_{c,i}(t) yields the neighbourhood function around the winner unit
(BMU) at time t.
The training quality of the Self-Organizing Map may be evaluated by the following
formula:

E_{SOM} = \frac{1}{N} \sum_{i=1}^{N} \left\| x_i - w_{BMU_i} \right\|,   (3.13)

where N is the number of the objects to be mapped and w_{BMU_i} yields the best matching
unit corresponding to the vector x_i.
When SOM has been trained, it is ready to map any new input vector into a
low-dimensional vector space. During the mapping process a new input vector may
quickly be classified or categorized, based on the location of the closest neuron on
the grid.
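Given a trained codebook, the mapping of a new input vector to its best matching unit and the error of Eq. (3.13) can be sketched as follows; this is a toy illustration with a randomly generated codebook, and the function names are our own.

import numpy as np

def best_matching_unit(x, W):
    """Index of the SOM neuron (row of the codebook W) closest to the input vector x."""
    return int(np.argmin(np.linalg.norm(W - x, axis=1)))

def som_quantisation_error(X, W):
    """E_SOM of Eq. (3.13): mean distance of each x_i to its best matching unit."""
    return float(np.mean([np.linalg.norm(x - W[best_matching_unit(x, W)]) for x in X]))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X, W = rng.normal(size=(100, 4)), rng.normal(size=(25, 4))   # data and a 5x5 codebook
    print(som_quantisation_error(X, W))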
There is a variety of different kinds of visualisation techniques available for the
SOM. (e.g. U-matrix, component planes). The Unified distance matrix (U-matrix)
[45] makes the 2D visualisation of multi-variate data. In the U-matrix the average
distances between the neighbouring neurons are represented by shades in a grey scale.
If the average distance of neighbouring neurons is short, a light shade is used; dark
shades represent long distances. Thereby, dark shades indicate a cluster border, and
light shades represent clusters themselves. Component planes [23] are visualised by
taking from each weight vector the value of the component (attribute) and depicting
this as a color or as a height on the grid. Figure 3.8 illustrates the U-matrix and the component planes of the iris data set.
Fig. 3.8 The U-matrix and the component planes of the iris data set
where t refers to the current number of iterations, w_i is the winner boundary node,
and x is an arbitrarily selected input object. The algorithm adds new nodes to the net
if too many of the analysed data points are mapped onto one low-dimensional output node,
and therefore the cumulative error of this node is too large. The boundary node with
the largest cumulative error (error node) is selected as the most inadequate node to
represent the data structure. This node grows new neighbouring node or nodes. The
new nodes are placed in all possible position in the grid which are adjacent to the
winner boundary node and they are not yet occupied by a node. Figure 3.9 visualises
two kinds of growing possibilities.
Weight vectors of the new nodes are initialised based on the neighbouring nodes
and the new nodes are connected to the error node. Finally, during the adaptation
of connections phase the algorithm evaluates two parameters for each connection. If
the Euclidean distance between two neighbouring unconnected nodes are less than a
connect threshold parameter, the algorithm creates a new edge between these nodes.
On the other hand, if the distance between two nodes connected in the grid exceeds
a disconnect threshold parameter, the algorithm deletes it.
To summarise, we can see that the Incremental Grid Growing algorithm analo-
gously to the SOM method utilises a predefined 2-dimensional structure of repre-
sentative elements, but in the case of the IGG algorithm the number of these nodes
is not a predefined parameter. As a consequence of the deletion of edges the IGG
algorithm may provide unconnected subgrids, which can be seen as a representation
of different clusters of the original objects.
The Adaptive Hierarchical Incremental Grid Growing (AHIGG) method pro-
posed by Merkl and coworkers in 2003 [47] extends the Incremental Grid Growing
approach. In this article the authors combine the IGG method with the hierarchical
clustering approach. The main difference between the IGG and the AHIGG method is
that in the course of the AHIGG algorithm the network representing the data grows
incrementally, but there are different levels of the growing state distinguished. The
Adaptive Hierarchical Incremental Grid Growing algorithm utilises the SOM algo-
rithm to train the net as well, but the initialisation of the net differs from the method
proposed in the IGG algorithm. The training process involves a fine tuning phase as
well, when only the winner node adapts to the selected data point, and no further
nodes are added to the graph. After the fine tuning phase the algorithm searches for
the possible extensions in the graph. For this extension the algorithm calculates an
error value (mean quantisation error) for each node, and nodes with too high error
value are expanded on the next level of the presentation. As a result the algorithm
creates a hierarchical architecture of different visualisation levels. Each level of the
hierarchy involves a number of independent clusters presented by 2-dimensional grid
structures.
The OVI-NG method adjusts the codebook positions in a continuous output
space by using an adaptation rule that minimises a cost function that favors the local
distance preservation. As OVI-NG utilises Euclidean distances to map the data set it
is not able to disclose the nonlinearly embedded data structures. The Geodesic Non-
linear Projection Neural Gas (GNLP-NG) [50] algorithm is an extension of OVI-NG,
which uses geodesic distances instead of the Euclidean ones. The TRNMap algo-
rithm was developed recently, and it combines the TRN-based geodesic distances
with the multidimensional scaling method. In Sects. 3.5.4–3.5.6 these algorithms
are introduced.
3.5.1 Isomap
The Isomap algorithm proposed by Tenenbaum et al. in 2000 [19] is based on the
geodesic distance measure. Isomap deals with a finite number of points in a data set
in R^D which are assumed to lie on a smooth submanifold M^d (d ≪ D). The aim of
this method is to preserve the intrinsic geometry of the data set and visualise the data
in a low-dimensional feature space. For this purpose Isomap calculates the geodesic
distances between all data points and then projects them into a low-dimensional
vector space. In this way the Isomap algorithm consists of three major steps:
Step 1 : Constructing the neighbourhood graph of the data by using the k-neighbour-
ing or ε-neighbouring approach.
Step 2 : Computing the geodesic distances between every pair of objects.
Step 3 : Constructing a d-dimensional embedding of the data points.
For the low-dimensional (generally d = 2) visualisation Isomap utilises the MDS
method. In this case the multidimensional scaling is not based on the Euclidean
distances, but it utilises the previously computed geodesic distances. As Isomap uses
a non-Euclidean metric for mapping, a nonlinear projection is obtained as a result.
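The three steps can be sketched with standard tools: a k-nearest-neighbour graph, graph shortest paths for the geodesic distances, and classical MDS via double centering. The fragment below is only an illustration of this idea under these choices, not the reference implementation of [19].

import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def isomap(X, k=10, d=2):
    """Sketch of Isomap: kNN graph -> geodesic distances -> classical MDS."""
    # Step 1: symmetric k-neighbouring graph weighted by Euclidean distances
    G = kneighbors_graph(X, n_neighbors=k, mode="distance")
    G = G.maximum(G.T)
    # Step 2: geodesic distances as shortest paths in the graph
    geo = shortest_path(G, directed=False)
    if np.isinf(geo).any():
        raise ValueError("neighbourhood graph is disconnected; increase k")
    # Step 3: classical MDS (double centering) on the geodesic distance matrix
    N = len(geo)
    J = np.eye(N) - np.ones((N, N)) / N
    B = -0.5 * J @ (geo ** 2) @ J
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:d]
    return V[:, idx] * np.sqrt(np.clip(w[idx], 0, None))

The raised error corresponds to the disconnected-subgraph problem discussed in the next paragraph.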
However, when the first step of the Isomap algorithm is applied to a multi-class
data set, several disconnected subgraphs can be formed, thus the MDS can not be
performed on the whole data set. Wu and Chan [51] give an extension of the Isomap
solving this problem. In their proposal unconnected subgraphs are connected with an
edge between the two nearest nodes. In this manner the Euclidean distance is used to
approximate the geodesic distances of data objects lying on different disconnected
subgraphs. Furthermore, applying Isomap to noisy data also shows some limitations.
Figures 3.10 and 3.11 illustrate two possible 2-dimensional Isomap mappings of
the S curve set. It can be seen that due to the calculation of the geodesic distances, the
Isomap algorithm is able to unfold the 2-dimensional manifold nonlinearly embed-
ded into the 3-dimensional vector space. Additionally, in this figures two different
mapping results can be seen, which demonstrate the effect of the parametrisation of
Isomap algorithm. To calculate geodesic distances the k-neighbouring approach was
chosen in both cases. In the first case k was chosen to be k = 5, in the second case
k was set to be k = 10.
As a second example the iris data set was chosen to demonstrate the Isomap map-
ping. In this case the creation of the neighbouring graph resulted in two unconnected
subgraphs and so the original algorithm was not able to calculate the geodesic dis-
tances between all pairs of data. As a consequence, the MDS mapping can not be
performed on the whole data set. Therefore it can be established that the original
Isomap method (without any extensions) is not able to visualise the 2-dimensional
representation of the whole iris data set.
3.5.2 Isotop
The main limitation of SOM is that it transforms the high-dimensional pattern into
a low-dimensional discrete map in a topologically ordered grid (see Sect. 3.4.2).
Thereby, SOM is not able to preserve the topology of the input data structure. The
method Isotop [52] can be seen as a variation of SOM with a data-driven topology
grid. Contrary to the SOM’s rectangular or hexagonal lattice, Isotop creates a graph
that tries to capture the neighbourhood relationships in the manifold, and therefore
the resulting network reflects more accurately the hidden structure of the representatives
or data elements.
The algorithm consists of three major steps: (1) vector quantisation; (2) building a graph
from the representative elements; (3) mapping the graph onto a low-dimensional
vector space.
In the first phase Isotop performs a vector quantisation step in order to reduce
the number of data points. So the objects are replaced by their representatives. This
optional step can be achieved with simple methods, like Competitive Learning or
k-means clustering.
In the second step Isotop builds a graph structure to calculate the geodesic dis-
tances of the objects. This network is created based on the k-neighbouring or ε-
neighbouring approaches. Parameters k or ε are determined by the analyser. In the
network the edges are characterised by the Euclidean distances of the objects, and
the geodesic distances are calculated as sums of these Euclidean distances.
Finally, in the last step Isotop performs a non-metric dimensionality reduction.
This mapping process uses graph distances defined by the previously calculated
neighbourhood connections. Up to this point the analysed objects are represented
with representative elements in the high-dimensional vector space. In this step the
algorithm replaces the coordinates of representatives by low-dimensional ones, ini-
tialised randomly around zero. Then Isotop iteratively draws an object (g) randomly,
and moves all representatives closer to the randomly selected point. The movement
of each mapped representative becomes smaller and smaller as its neighbourhood
distance from the closest representative to the selected point (BMU) grows. Formally,
at time t all representatives yi in the low-dimensional space are updated according
to the rule:
where α(t) is a time-decreasing learning rate with values taken from between 1 and 0.
The neighbourhood factor h i is defined as:
h_i(t) = \exp\left( -\frac{1}{2}\, \frac{\delta_{i,j}^{2}}{\lambda(t)\, E_{j \in N(i)}\left[ \| x_i - x_j \|^{2} \right]} \right),   (3.16)

E_{CDA} = \sum_{i<j}^{n} \left( \delta_{i,j} - d_{i,j} \right)^{2} F\left( d_{i,j}, \lambda \right),   (3.17)

where δ_{i,j} denotes the geodesic distance between the objects x_i and x_j in the
high-dimensional input space, d_{i,j} denotes the Euclidean distance for the mapped
objects y_i and y_j in the low-dimensional output space, and n is the number of the
representatives.
Fig. 3.12 Nearest neighbours graphs and CDA mappings of quantised S curve data set. a Nearest
neighbours graph of S curve data set n = 50, k = 3. b CDA mapping of S curve data set n = 50.
c Nearest neighbours graph of S curve data set n = 500, k = 3. d CDA mapping of S curve data
set n = 500
The application of factor F has the effect that the CDA algorithm emphasises the preservation
of small distances rather than of large ones. Curvilinear Distance Analysis applies a
stochastic gradient descent algorithm to minimise the topology error function E_CDA.
Figure 3.12 demonstrates some CDA mappings of the S curve data set. In both cases
the vector quantisation was performed with the k-means algorithm,
where the number of the representatives in the first case was chosen to be n = 100
and in the second case n = 500. The number of the original data points in all cases
was N = 2000. In the left column the k nearest neighbours graphs are shown, where
the number of the nearest neighbours in all cases was chosen to be k = 3. The right
column contains the CDA mappings of the vector quantised data.
Comparing the Isomap and CDA methods it can be seen that CDA applies more com-
plicated techniques than Isomap. However, when the parametrisation is adequate,
CDA may give a better visualisation result, which better emphasises some characteris-
tics of the projected data sets [55].
E_{OVI\text{-}NG} = \frac{1}{2} \sum_{j=1}^{n} \sum_{k \neq j} \left( d_{j,k}^{*} - d_{j,k} \right)^{2} F(s_{j,k}),   (3.19)

where d^{*}_{j,k} defines the Euclidean distance in the input space between the codebook
vectors w_j and w_k, d_{j,k} yields the Euclidean distance of the codebook positions y_j
and y_k in the output space, and s_{j,k} denotes the rank of the k-th codebook position
(y_k) with respect to the j-th output vector (y_j) in the output space. The function F
is defined as:

F(f) = e^{-\frac{f}{\sigma(t)}},   (3.20)
where σ (t) is the width of the neighbourhood that decreases with the number of
iterations in the same way as Eq. 1.12.
The OVI-NG method performs 10 steps separately. As some of them are equivalent
with steps of TRN algorithm, in the following only the additional steps are discussed
in detail. Steps 1–7 in the OVI-NG method are the same as Steps 1–7 in the TRN
algorithm (see Sect. 1.2.4), except that in the first step beside the random initiali-
sation of the codebook vectors w j the OVI-NG also initialises codebook positions
y j randomly. In each iteration step after creating new edges and removing the ‘old’
edges (Step 5–7), the OVI-NG moves the codebook positions closer to the codebook
position associated with the winner codebook vector (w j0 ). This adaptation rule is
carried out by the following two steps:
Step 8 Generate the ranking in output space s( j0 , j) = s(y j0 (t), y j (t)) ∈ {1, . . . ,
n −1} for each codebook position y j (t) with respect to the codebook position
associated with the winner unit y j0 (t), j = j0 .
where α is the learning rate, which typically decreases with the number of
iterations t, in the same form as Eq. 1.12.
Step 10 of the OVI-NG is the same as Step 8 in the TRN algorithm.
To sum up, we can say that OVI-NG is a nonlinear projection method, in which the
codebook positions are adjusted in a continuous output space by using an adaptation
rule that minimises a cost function that favors the local distance preservation. As
OVI-NG utilises Euclidean distances to map the data set it is not able to disclose the
nonlinearly embedded data structures.
where r j,k = r (x j , wk ) ∈ {0, 1, ..., n − 1} denotes the rank of the k-th codebook
vector with respect to the x j using geodesic distances, and σ is a width of the neigh-
bourhood surround. d j,k denotes the Euclidean distance of the codebook positions
y j and yk defined in the output space, δ j,k yields the geodesic distance between
codebook vectors w j and wk measured in the input space.
According to the previously presented overview, the GNLP-NG first determines
the topology of the data set by the modified TRN algorithm and then maps this topol-
ogy based on the graph distances. The whole process is summarised in Algorithm 12.
Parameter α is the learning rate, σ is the width of the neighbourhood, and they
typically decrease with the number of iterations t, in the same way as Eq. 1.12.
Paper [50] also gives an extension to the GNLP-NG to tear or cut the graphs with
non-contractible cycles.
Figure 3.13 visualises the 2-dimensional GNLP-NG mapping of the S curve data
set. In this small example the original S curve data set contained 2000 3-dimensional
data points, the number of the representatives was chosen to be n = 200 and they
were initialised randomly. As it can be seen the GNLP-NG method is able to unfold
the real 2-dimensional structure of the S curve data set.
Summarising the previously introduced methods we can say that all of them seem
to be a good choice for topology based dimensionality reduction, but each of
them has some disadvantages. Isomap can not model multi-class problems and it is not
efficient on large and noisy data sets. The main disadvantage of the OVI-NG and GNLP-
NG methods is that they use a non-metric mapping method and thereby only the
rank ordering of the representatives is preserved during the mapping process. Isotop
can fall into local minima and requires some care in its parametrisation [53].
Although CDA is a more complicated technique, it needs to be well parameterized
[56]. Furthermore, the OVI-NG and CCA methods are not able to uncover the non-
linearly embedded manifolds.
The structure of the graph does
not depend on the density of the objects or the selected number of the neighbours. If
the resulting graph is unconnected, the TRNMap algorithm connects the subgraphs by
linking the closest elements (Step 2). Then the pairwise graph distances are calculated
between every pair of representatives (Step 3). In the following, the original topology
representing network is mapped into a 2-dimensional graph (Step 4). The mapping
method utilises the similarity of the data points provided by the previously calculated
graph distances. This mapping process can be carried out by the use of either metric or
non-metric multidimensional scaling. For expressive visualisation, component planes
are also created from the D-dimensional representatives (Step 5).
Step 2 While there are unconnected subgraphs (m_i^{(D)} ⊂ M^{(D)}, i = 1, 2, ...):
  (a) Choose a subgraph m_i^{(D)}.
  (b) Let the terminal node t_1 ∈ m_i^{(D)} and its closest neighbour t_2 ∉ m_i^{(D)} be found from:
      ‖t_1 − t_2‖ = min ‖w_j − w_k‖,  t_1, w_j ∈ m_i^{(D)},  t_2, w_k ∉ m_i^{(D)}.
  (c) Set c_{t_1,t_2} = 1.
  End while
  Yield M^{*(D)}, the modified M^{(D)}.
Step 3 Calculate the geodesic distances between all w_i, w_j ∈ M^{*(D)}.
Step 4 Map the graph M^{(D)} into a 2-dimensional vector space by metric or non-metric MDS
  based on the graph distances of M^{*(D)}.
Step 5 Create component planes for the resulting Topology Representing Network Map based
  on the values of w_i ∈ M^{(D)}.
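Step 2 above can be illustrated with the small sketch below, which repeatedly links nodes lying in different connected components until the graph becomes connected. As a simplification it links the globally closest pair of codebook vectors in different subgraphs instead of starting from a terminal node, and the adjacency-matrix representation and SciPy calls are our own choices.

import numpy as np
from scipy.sparse.csgraph import connected_components
from scipy.spatial.distance import cdist

def connect_subgraphs(W, C):
    """W: n x D codebook vectors, C: n x n 0/1 adjacency matrix of the TRN edges."""
    C = np.asarray(C, dtype=float).copy()
    while True:
        n_comp, labels = connected_components(C, directed=False)
        if n_comp == 1:
            return C                                 # the connected graph M*(D)
        # closest pair of nodes belonging to different subgraphs
        D = cdist(W, W)
        D[labels[:, None] == labels[None, :]] = np.inf
        t1, t2 = np.unravel_index(np.argmin(D), D.shape)
        C[t1, t2] = C[t2, t1] = 1                    # set c_{t1,t2} = 1

The returned adjacency matrix plays the role of M^{*(D)}; geodesic distances for Step 3 can then be obtained with scipy.sparse.csgraph.shortest_path on the weighted version of this graph.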
The parameters of the TRNMap algorithm are the same as those of the Topology
Representing Networks algorithm. The number of the nodes of the output graph (n)
is determined by the user. The bigger n is, the more detailed the output map will
be. The suggested choice is n = 0.2N, where N yields the number of the original
objects. If the number of the input data elements is high, this can result in numerous
nodes. In these cases it is practical to decrease the number of the representatives and
iteratively run the algorithm to capture the structure more precisely. Values of the
other parameters of TRN (λ, the step size ε, and the threshold value of the edges' ages
T) can be the same as proposed by Martinetz and Schulten [60].
Figure 3.14 shows the 2-dimensional structure of the S curve data set created by
the TRNMap method. As TRNMap algorithm utilises geodesic distances to calculate
Fig. 3.15 TRNMap component planes of S curve data set. a Dimension X. b Dimension Y. c
Dimension Z
the pairwise dissimilarities of the quantised data, this method is able to unfold the
real 2-dimensional structure of the S curve data set.
Besides the visualisation of the data structure, the nodes of TRNMap also visualise
high-dimensional information by the use of the component plane representation.
Component planes of the 3-dimensional S curve data set resulting from the TRNMap
are shown in Fig. 3.15. A component plane displays the value of one component of
each node. If the input data set has D attributes, the Topology Representing Network
Map yields D component planes.
In this section a comparative analysis of the previously introduced methods is given
through some examples. The analysis is based on the evaluation of mapping results
of the following examples: Swiss roll data set (see Appendix A.6.5), Wine data set
(see Appendix A.6.3) and Wisconsin breast cancer data set (see Appendix A.6.4).
The mapping qualities of the algorithms are analysed based on the following two
aspects:
• preservation of distance and neighbourhood relations of data, and
• preservation of local and global geometry of data.
In our analysis the distance preservation of the methods is measured by the classi-
cal MDS stress function, Sammon stress function and residual variance. The neigh-
bourhood preservation and the local and global mapping qualities are measured by
functions of trustworthiness and continuity.
All analysed visualisation methods require the setting of some parameters. In the
following, one principle is applied: identical input parameters of the different
mapping methods are set in the same way. The common parameters of OVI-NG,
GNLP-NG and TRNMap algorithms were in all simulations set as follows: tmax =
200n, εi = 0.3, ε f = 0.05, λi = 0.2n, λ f = 0.01, Ti = 0.1n. If the influence
of the deletion of edges was not analysed, the value of parameter T f was set to
T f = 0.5n. The auxiliary parameters of the OVI-NG and GNLP-NG algorithms
were set as αi = 0.3, α f = 0.01, σi = 0.7n, and σ f = 0.1. The value of parameter
K in the GNLP-NG method in all cases was set to K = 2.
Table 3.1 Values of Sammon stress, metric MDS stress and residual variance of different algorithms
on the Swiss roll data set
Algorithm Sammon stress MDS stress Res. var.
kmeans+Eu+mMDS 0.05088 0.20743 0.22891
kmeans+Eu+nmMDS 0.05961 0.21156 0.22263
kmeans+Eu+Sammon 0.05084 0.21320 0.24200
kmeans+Eu+Sammon_mMDS 0.04997 0.20931 0.23268
kmeans+knn+mMDS 0.00212 0.00091 0.00326
kmeans+knn+nmMDS 0.00216 0.00091 0.00324
kmeans+knn+Sammon 0.00771 0.00440 0.01575
kmeans+knn+Sammon_mMDS 0.00198 0.00097 0.00348
NG+Eu+mMDS 0.05826 0.04941 0.26781
NG+Eu+nmMDS 0.06659 0.05792 0.26382
NG+Eu+Sammon 0.05758 0.05104 0.27613
NG+Eu+Sammon_mMDS 0.05716 0.05024 0.27169
NG+knn+mMDS 0.00208 0.00086 0.00307
NG+knn+nmMDS 0.00206 0.00087 0.00299
NG+knn+Sammon 0.00398 0.00242 0.00916
NG+knn+Sammon_mMDS 0.00392 0.00237 0.00892
TRN+mMDS 0.00145 0.00063 0.00224
TRN+nmMDS 0.00187 0.00064 0.00221
TRN+Sammon 0.01049 0.00493 0.01586
TRN+Sammon_mMDS 0.00134 0.00068 0.00235
The Swiss roll data set (Fig. 3.16, Appendix A.6.5) is a typical example of nonlinearly
embedded manifolds. In this example the number of the representatives was chosen
to be n = 200 in all cases. Linear mapping algorithms, such as Principal
Component Analysis, do not give a proper result (see Fig. 3.17a) because of
the 2-dimensional nonlinear embedding. As can be seen in Fig. 3.17b and c, the CCA
and OVI-NG methods are also unable to uncover the real structure of the data, as
they utilise Euclidean distances to calculate the pairwise object dissimilarities.
Figure 3.18 shows the Isotop, CDA and GNLP-NG visualisations of the Swiss roll
data set. As the CDA and Isotop methods can be based on different vector quantisation
methods, both were calculated based on the results of the k-means and the neural
gas vector quantisation methods as well. Isotop and CDA require the construction of a graph
to calculate geodesic distances. In these cases the graphs were built based on the k-
neighbouring approach, and parameter k was set to k = 3. It can be seen that
the Isotop, CDA and GNLP-NG algorithms can essentially uncover the structure of the data,
but the Isotop method shows the manifold with some distortions.
Fig. 3.16 The 3-dimensional Swiss roll data set
Fig. 3.17 PCA, CCA and OVI-NG projections of the Swiss roll data set. a 2-dimensional PCA projection. b 2-dimensional CCA projection. c 2-dimensional OVI-NG projection
Fig. 3.18 Isotop, CDA and GNLP-NG projections of the Swiss roll data set. a 2-dimensional Isotop projection with k-means VQ. b 2-dimensional Isotop projection with NG VQ. c 2-dimensional CDA projection with k-means VQ. d 2-dimensional CDA projection with NG VQ. e 2-dimensional GNLP-NG projection
Fig. 3.19 Topology Representing Network and metric MDS based TRNMap visualisation of the Swiss roll data set. a TRN. b DP_TRNMap
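The k-neighbouring graph and the resulting geodesic (graph) distances used by these methods can be illustrated with the following sketch, built with SciPy and scikit-learn; the helper geodesic_distances and its parameters are an illustration of the k-neighbouring approach with k = 3, not the implementation used for the figures.

import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def geodesic_distances(codebook, k=3):
    # weighted k-nn graph on the representatives (edge weight = Euclidean distance)
    W = kneighbors_graph(codebook, n_neighbors=k, mode="distance")
    W = W.maximum(W.T)          # symmetrise the neighbourhood relation
    # all-pairs shortest paths in the graph approximate the geodesic distances;
    # disconnected pairs are returned as np.inf
    return shortest_path(W, method="D", directed=False)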
In the following let us have a closer look at the results of Topology Representing
Network Map algorithm. As TRNMap is based on the creation of the Topology Rep-
resenting Network, Fig. 3.19a shows the TRN of the Swiss roll data set. In Fig. 3.19b
the 2-dimensional metric MDS based TRNMap visualisation of the Swiss roll data
set is shown (DP_TRNMap, where DP stands for distance preservation). As the metric MDS
and the non-metric MDS based mappings of the resulting TRN give very similar results
for the mapped prototypes in this case, the resulting TRNMap visualisations are practically
indistinguishable by the human eye. Thereby Fig. 3.19b can also be seen as the result of the
non-metric MDS based TRNMap algorithm. In Fig. 3.19b it can be seen that
the TRNMap methods are able to uncover the embedded 2-dimensional manifold
without any distortion.
Visualisation of the Topology Representing Network Map also includes the con-
struction of the component planes. The component planes arising from the metric
MDS based TRNMap are shown in Fig. 3.20. The largest value of an attribute of
the representatives corresponds to black, and the smallest value to a white dot
surrounded by a grey circle. Figure 3.20a shows that along the manifold the
value of the first attribute (first dimension) initially grows to its highest value, then
decreases to its smallest value, after that it grows again, and finally it decreases a little.
The second attribute is invariant along the manifold, but across the manifold it
changes uniformly. The third component starts from its highest value, then falls
to its smallest value, following this it increases to a middle value, and finally it
decreases a little.
Table 3.2 shows the error values of distance preservation of the different mappings.
The notation DP_TRNMap denotes the metric MDS based TRNMap algorithm, and
the notation NP_TRNMap denotes the non-metric MDS based TRNMap algorithm
(DP comes from distance preservation and NP from neighbourhood preservation).
Table 3.2 shows that the GNLP-NG and TRNMap methods outperform the Isotop and
Fig. 3.20 Component planes of the metric MDS based Topology Representing Network Map of the Swiss roll data set. a Dimension 1. b Dimension 2. c Dimension 3
Table 3.2 Values of Sammon stress, metric MDS stress and residual variance of Isotop, CDA,
GNLP-NG and TRNMap algorithms on the Swiss roll data set
Algorithm Sammon stress Metric MDS stress Residual variance
kmeans_Isotop 0.54040 0.57870 0.41947
NG_Isotop 0.52286 0.53851 0.15176
kmeans_CDA 0.01252 0.00974 0.01547
NG_CDA 0.01951 0.01478 0.02524
GNLP-NG 0.00103 0.00055 0.00170
DP_TRNMap 0.00096 0.00043 0.00156
NP_TRNMap 0.00095 0.00045 0.00155
CDA methods. Although the GNLP-NG and TRNMap methods show similar performance
in distance preservation, the TRNMap methods perform somewhat better.
Figure 3.21 shows the neighbourhood preservation qualities of the mappings of the
analysed methods. It can be seen that the different variations of Isotop and CDA
show lower performance in neighbourhood preservation than the GNLP-NG and TRNMap
methods. The continuity and the trustworthiness of the GNLP-NG and TRNMap
mappings do not show a substantive difference; the qualitative indicators move in a
similar range (see Fig. 3.22).
Fig. 3.21 Trustworthiness and continuity as a function of the number of neighbours k, for the Swiss
roll data set
Fig. 3.22 Trustworthiness and continuity of GNLP-NG and TRNMap methods as a function of the
number of neighbours k, for the Swiss roll data set
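Curves of this kind can be computed, for example, with the trustworthiness measure available in scikit-learn; continuity is obtained here by exchanging the roles of the input and the output space, which is a commonly used identity and an assumption of this sketch rather than the code used for the figures.

import numpy as np
from sklearn.manifold import trustworthiness

def trust_cont_curves(X, Y, k_values):
    # X: original (high-dimensional) data or representatives, Y: their projection
    trust = [trustworthiness(X, Y, n_neighbors=k) for k in k_values]
    # continuity of the mapping X -> Y equals trustworthiness with the spaces swapped
    cont = [trustworthiness(Y, X, n_neighbors=k) for k in k_values]
    return np.array(trust), np.array(cont)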
In this subsection a real problem is considered. The Wine database (see Appen-
dix A.6.3) contains the chemical analysis of 178 wines, each characterised by 13
continuous attributes, and three classes are distinguished.
Table 3.3 Values of Sammon stress, metric MDS stress and residual variance of GNLP-NG and
TRNMap algorithms on the Wine data set
Algorithm Sammon stress Metric MDS stress Residual variance
GNLP-NG T f = 0.5n 0.04625 0.03821 0.13926
GNLP-NG T f = 0.3n 0.04982 0.04339 0.15735
GNLP-NG T f = 0.05n 0.02632 0.02420 0.07742
DP_TRNMap T f = 0.5n 0.01427 0.00829 0.03336
DP_TRNMap T f = 0.3n 0.01152 0.00647 0.02483
DP_TRNMap T f = 0.05n 0.01181 0.00595 0.02161
NP_TRNMap T f = 0.5n 0.03754 0.02608 0.07630
NP_TRNMap T f = 0.3n 0.05728 0.04585 0.09243
NP_TRNMap T f = 0.05n 0.03071 0.01984 0.04647
In the visualisation results presented in the following, the class labels are also shown.
The representatives are labeled based on the principle of majority voting:
(1) each data point is assigned to the closest representative; (2) each representative
is labeled with the class label that occurs most often among its assigned data points.
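This labelling rule can be sketched in a few lines; label_representatives is a hypothetical helper that assumes Euclidean assignment of the data points and non-negative integer class labels.

import numpy as np
from scipy.spatial.distance import cdist

def label_representatives(codebook, X, y):
    # (1) assign every data point to its closest representative
    closest = np.argmin(cdist(X, codebook), axis=1)
    labels = np.full(len(codebook), -1, dtype=int)
    for i in range(len(codebook)):
        assigned = y[closest == i]
        # (2) label the representative with the majority class of its assigned points
        if assigned.size:
            labels[i] = np.bincount(assigned).argmax()
    return labels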
In this example the tuning of parameter Tf of the TRN algorithm is also tested.
Parameters Ti and Tf have an effect on the linkage of the prototypes, thereby they
also influence the geodesic distances of the representatives. As parameter Tf denotes
the final threshold of the age of the edges, this parameter has the greater influence on
the resulting graph. The other parameters of the TRN algorithm (λ, ε and tmax) were set to
the values presented in Sect. 3.6.2. The tuning of the edge-age threshold is
shown in the following. The number of the representatives was chosen in all cases
to be n = 0.2N, which means 35 nodes in this case.
As parameter T f has an effect on the creation of the edges of TRN, it influences
the results of the GNLP-NG and TRNMap algorithms. Table 3.3 shows the error val-
ues of the distance preservation of the GNLP-NG, DP_TRNMap and NP_TRNMap
methods. In these simulations parameter Ti was chosen to be Ti = 0.1n, and T f was
set to T f = 0.5n, T f = 0.3n and T f = 0.05n, where n denotes the number of the
representatives. It can be seen that the best distance preservation quality is obtained
with the parameter setting Tf = 0.05n. With the parameters Tf = 0.5n and Tf = 0.3n the
non-metric MDS based TRNMap and the GNLP-NG methods seem to fall into local
minima. This may be caused by their iterative minimisation processes. On the contrary, the
metric MDS based TRNMap finds the coordinates of the low-dimensional represen-
tatives in a single step by eigenvalue decomposition, and thereby it seems to
be a more robust process. This is confirmed by the good error values of the DP_TRNMap
in all three cases.
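The single-step solution referred to above is classical (metric) MDS, which obtains the coordinates from the eigenvalue decomposition of the double-centred squared distance matrix; the following is a generic sketch of that step, not the DP_TRNMap implementation itself.

import numpy as np

def classical_mds(D, dim=2):
    # D: n x n matrix of pairwise (e.g. geodesic) distances
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    B = -0.5 * J @ (D ** 2) @ J               # double-centred Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)      # eigenvalues in ascending order
    idx = np.argsort(eigvals)[::-1][:dim]     # keep the largest 'dim' eigenvalues
    L = np.diag(np.sqrt(np.maximum(eigvals[idx], 0.0)))
    return eigvecs[:, idx] @ L                # n x dim low-dimensional coordinates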
The effect of the change of the parameter Tf on the visual presentation is shown
in Fig. 3.23. It can be seen that the deletion of edges produces smoother graphs.
Based on the previous experimental results, the parameter setting Tf = 0.05n
has been chosen for the further analysis. In the following let us have a look at the
comparison of the TRNMap methods with the other visualisation methods. Table 3.4
Fig. 3.23 GNLP-NG and TRNMap projections of the Wine data set at different settings of parameter Tf. a DP_TRNMap Tf = 0.3n. b DP_TRNMap Tf = 0.05n. c NP_TRNMap Tf = 0.3n. d NP_TRNMap Tf = 0.05n. e GNLP-NG Tf = 0.3n. f GNLP-NG Tf = 0.05n
shows the error values of the distance preservation of the analysed methods. The
parameter k for the k-neighbouring was chosen to be k = 3. It can be seen that
the TRNMap method based on metric MDS mapping shows the best mapping quality.
Table 3.4 Values of Sammon stress, metric MDS stress and residual variance of Isotop, CDA,
GNLP-NG and TRNMap algorithms on the wine data set (T f = 0.05)
Algorithm Sammon stress Metric MDS stress Residual variance
kmeans_Isotop 0.59233 0.59797 0.54959
NG_Isotop 0.60600 0.61030 0.46479
kmeans_CDA 0.93706 0.30726 0.63422
NG_CDA 0.82031 0.27629 0.66418
GNLP-NG 0.02632 0.02420 0.07742
DP_TRNMap 0.01181 0.00595 0.02161
NP_TRNMap 0.03071 0.01984 0.04647
Fig. 3.24 Trustworthiness and continuity as a function of the number of neighbours k, for the wine data set
Fig. 3.25 Trustworthiness and continuity of GNLP-NG and TRNMap methods as a function of the number of neighbours k, for the wine data set
Table 3.5 The values of the Sammon stress, MDS stress and residual variance of different mapping
algorithms on the Wisconsin breast cancer data set (n = 35)
Algorithm Sammon stress MDS stress Residual variance
GNLP-NG 0.02996 0.02764 0.09733
DP_TRNMap 0.01726 0.01075 0.04272
NP_TRNMap 0.01822 0.01077 0.03790
Figure 3.25 shows the trustworthiness and continuity of the GNLP-NG and of the metric
and non-metric MDS based TRNMap mappings. This figure shows that the NP_TRNMap method has
not found the optimal mapping, because the characteristics of the functions of the
NP_TRNMap algorithm differ from the characteristics of the functions of the DP_TRNMap
algorithm. Comparing the GNLP-NG and DP_TRNMap methods we can see that
the DP_TRNMap method gives better performance at larger k-nn values. In contrast,
the GNLP-NG technique gives better performance in the local reconstruction.
(At small k-nn values the local reconstruction performance of the model is tested, while
at larger k-nn values the global reconstruction is measured.)
The Wisconsin breast cancer database is a well-known diagnostic data set for breast
cancer (see Appendix A.6.4). This data set contains 9 attributes and class labels for
683 instances, of which 444 are benign and 239 are malignant. It has been shown
in the previous examples that the GNLP-NG and TRNMap methods outperform the
CDA and Isotop methods in both distance and neighbourhood preservation. Therefore,
in this example only the qualities of the GNLP-NG and TRNMap methods
are examined. The number of the nodes in this case was reduced to n = 35 and
n = 70. The parameter Tf was chosen to be Tf = 0.05n, following the previously
presented approach.
To get a compact representation of the data set to be analysed, the number of the
neurons was chosen to be n = 35 in the beginning. Table 3.5 shows the numerical
evaluation of the distance preservation capabilities of the mappings. The efficiency
of the TRNMap algorithm in this case is also confirmed by the error values.
The TRNMap and GNLP-NG visualisations of the Wisconsin breast cancer data set
are shown in Fig. 3.26. The results of several runs consistently show a fairly
wide group and another, compact group. In these figures the representatives of
the benign class are labeled with square markers and those of the malignant class
with circle markers.
The quality of the neighbourhood preservation of the mappings is shown in
Fig. 3.27. The figures illustrate that the MDS-based techniques show better global map-
ping quality than the GNLP-NG method, but in the local neighbourhood of the data points
the GNLP-NG method exceeds the TRNMap methods.
To examine the robustness of the TRNMap methods, a different number of repre-
sentatives has also been tried. In the second case the number of the representatives
Fig. 3.26 GNLP-NG and TRNMap visualisations of the Wisconsin breast cancer data set. a GNLP-NG. b DP_TRNMap. c NP_TRNMap
Fig. 3.27 Trustworthiness and continuity as a function of the number of neighbours k, for the Wisconsin breast cancer data set (n = 35)
Table 3.6 Values of Sammon stress, MDS stress and residual variance of different mapping algo-
rithms on the Wisconsin breast cancer data set (n = 70)
Algorithm Sammon stress MDS stress Residual variance
GNLP-NG 0.06293 0.05859 0.22249
DP_TRNMap 0.01544 0.00908 0.03370
NP_TRNMap 0.02279 0.01253 0.02887
Fig. 3.28 Trustworthiness and continuity as a function of the number of neighbours k, for the Wisconsin breast cancer data set (n = 70)
was chosen to be n = 70. Table 3.6 and Fig. 3.28 show the numerical evaluations of
the methods in this case (the other parameters were not changed). Both the error values
and the functions show that the GNLP-NG method has again fallen into a local minimum.
(This occurs in many other cases as well.) On the other hand, the TRNMap
algorithms are robust to the parameter settings in these cases as well.
It is also interesting to compare the error values of the methods for the mappings of
the different data sets (see Tables 3.2, 3.4 and 3.5). The error values of the mappings
for the Swiss roll data set are smaller by an order of magnitude than the error values
in the other two examples. This means that the stress functions of distance preservation
are also able to indicate the presence of manifolds that can be described by graphs.
References
3. Martinetz, T.M., Schulten, K.J.: A neural-gas network learns topologies. In: Kohonen, T., Mäkisara, K., Simula, O., Kangas, J. (eds.) Artificial Neural Networks, pp. 397–402. Elsevier Science Publishers B.V., North-Holland (1991)
4. Johannes, M., Brase, J.C., Fröhlich, H., Gade, S., Gehrmann, M., Fälth, M., Sültmann, H.,
Beißbarth, T.: Integration of pathway knowledge into a reweighted recursive feature elimination
approach for risk stratification of cancer patients. Bioinformatics 26(17), 2136–2144 (2010)
5. Lai, C., Reinders, M.J.T., Wessels, L.: Random subspace method for multivariate feature selec-
tion. Pattern Recognit. Lett. 27(10), 1067–1076 (2006)
6. Nguyen, M.H., de la Torre, F.: Optimal feature selection for support vector machines. Pattern
Recognit. 43(3), 584–591 (2010)
7. Rong, J., Li, G., Chen, Y.P.P.: Acoustic feature selection for automatic emotion recognition
from speech. Inf. Process. Manag. 45(3), 315–328 (2009)
8. Tsang, I.W., Kocsor, A., Kwok, J.T.: Efficient kernel feature extraction for massive data sets.
Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and
data mining, pp. 724–729 (2006)
9. Wang, J., Zhang, B., Wang, S., Qi, M., Kong, J.: An adaptively weighted sub-pattern locality
preserving projection for face recognition. J. Netw. Comput. Appl. 332(3), 323–332 (2010)
10. Blum, A.L., Langley, P.: Selection of relevant features and examples in machine learning. Artif. Intell. 97(1–2), 245–271 (1997)
11. Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. J. Mach. Learn. Res.
3, 1157–1182 (2003)
12. Jain, A., Zongker, D.: Feature selection: evaluation, application, and small sample performance. IEEE Trans. Pattern Anal. Mach. Intell. 19(2), 153–158 (1997)
13. Weston, J., et al.: Feature selection for SVMs. In: Leen, T.K., Dietterich, T.G., Tresp, V. (eds.) Advances in Neural Information Processing Systems, vol. 13, pp. 668–674. The MIT Press, Cambridge (2001)
14. Narendra, P., Fukunaga, K.: A branch and bound algorithm for feature subset selection. IEEE
Trans. Comput. C–26(9), 917–922 (1977)
15. Pudil, P., Novovičová, J., Kittler, J.: Floating search methods in feature selection. Pattern
Recognit. Lett. 15(1), 1119–1125 (1994)
16. Hotelling, H.: Analysis of a complex of statistical variables into principal components. J. Educ.
Psychol. 24, 417–441 (1933)
17. Jolliffe, T.: Principal Component Analysis. Springer, New York (1996)
18. Sammon, J.W.: A non-linear mapping for data structure analysis. IEEE Trans. Comput. 18(5),
401–409 (1969)
19. Tenenbaum, J.B., Silva, V., Langford, J.C.: A global geometric framework for nonlinear dimen-
sionality reduction. Science 290, 2319–2323 (2000)
20. Comon, P.: Independent component analysis: a new concept? Signal Process. 36(3), 287–317
(1994)
21. Fisher, R.A.: The use of multiple measurements in taxonomic problems. Ann. Eugen. 7, 179–
188 (1936)
22. Kohonen, T.: Self-organized formation of topologically correct feature maps. Biol. Cybern. 43,
59–69 (1982)
23. Kohonen, T.: Self-Organizing maps, 3rd edn. Springer, New York (2001)
24. Roweis, S.T., Saul, L.K.: Nonlinear dimensionality reduction by locally linear embedding.
Science 290, 2323–2326 (2000)
25. Saul, L.K., Roweis, S.T.: Think globally, fit locally: unsupervised learning of low dimensional
manifolds. J. Mach. Learn. Res. 4, 119–155 (2003)
26. Belkin, M., Niyogi, P.: Laplacian eigenmaps for dimensionality reduction and data represen-
tation. Neural Comput. 15(6), 1373–1396 (2003)
27. Borg, I.: Modern multidimensional scaling: theory and applications. Springer, New York (1977)
28. Kruskal, J.B., Carroll, J.D.: Geometrical models and badness-of-fit functions. In: Krishnaiah, P.R. (ed.) Multivariate Analysis II, vol. 2, pp. 639–671. Academic Press, New York (1969)
29. Kaski, S., Nikkilä, J., Oja, M., Venna, J., Törönen, J., Castrén, E.: Trustworthiness and metrics
in visualizing similarity of gene expression. BMC Bioinformatics 4, 48 (2003)
30. Venna, J., Kaski, S.: Local multidimensional scaling with controlled tradeoff between trustwor-
thiness and continuity, In: Proceedings of the workshop on self-organizing maps, pp. 695–702
(2005)
31. Venna, J., Kaski, S.: Local multidimensional scaling. Neural Netw. 19(6), 889–899 (2006)
32. Kiviluoto, K.: Topology preservation in self-organizing maps. Proceedings of IEEE interna-
tional conference on neural networks, pp. 294–299 (1996)
33. Bauer, H.U., Pawelzik, K.R.: Quantifying the neighborhood preservation of self-organizing feature maps. IEEE Trans. Neural Netw. 3(4), 570–579 (1992)
34. Duda, R.O., Hart, P.E., Stork, D.: Pattern classification. Wiley, New York (2000)
35. Mika, S., Schölkopf, B., Smola, A.J., Müller, K.-R., Scholz, M., Rätsch, G.: Kernel PCA and
de-noising in feature spaces. In: Advances in neural information processing systems, vol. 11,
Cambridge, USA (1999)
36. Schölkopf, B., Smola, A.J., Müller, K.-R.: Nonlinear component analysis as a kernel eigenvalue
problem. Neural Comput. 10(5), 1299–1319 (1998)
37. Mao, J., Jain, A.K.: Artificial neural networks for feature extraction and multivariate data projection. IEEE Trans. Neural Netw. 6(2), 629–637 (1995)
38. Pal, N.R., Eluri, V.K.: Two efficient connectionist schemes for structure preserving dimension-
ality reduction. IEEE Trans. Neural Netw. 9, 1143–1153 (1998)
39. Young, G., Householder, A.S.: Discussion of a set of points in terms of their mutual distances.
Psychometrika 3(1), 19–22 (1938)
40. Naud, A.: Neural and statistical methods for the visualization of multidimensional data. Technical Science, Katedra Metod Komputerowych, Uniwersytet Mikołaja Kopernika w Toruniu (2001)
41. Kruskal, J.B.: Multidimensional scaling by optimizing goodness-of-fit to a nonmetric hypoth-
esis. Psychometrika 29, 1–29 (1964)
42. He, X., Niyogi, P.: Locality preserving projections. In: Lawrence, K., Saul, Weiss, Y., Bottou,
L., (eds.) Advances in Neural Information Processing Systems 17. Proceedings of the 2004
Conference, MIT Press, vol. 16, p. 37 (2004) http://mitpress.mit.edu/books/advances-neural-
information-processing-systems-17
43. UC Irvine Machine Learning Repository www.ics.uci.edu/ mlearn/ Cited 15 Oct 2012
44. Haykin, S.: Neural Networks: A Comprehensive Foundation. Prentice Hall, Upper Saddle River
(1999)
45. Ultsch, A.: Self-organization neural networks for visualization and classification. In: Opitz,
O., Lausen, B., Klar, R. (eds.) Information and Classification, pp. 307–313. Springer, Berlin
(1993)
46. Blackmore, J., Miikkulainen, R.: Incremental grid growing: encoding high-dimensional structure into a two-dimensional feature map. In: Proceedings of the IEEE international conference on neural networks, vol. 1, pp. 450–455 (1993)
47. Merkl, D., He, S.H., Dittenbach, M., Rauber, A.: Adaptive hierarchical incremental grid grow-
ing: an architecture for high-dimensional data visualization. In: Proceeding of the workshop
on SOM, Advances in SOM, pp. 293–298 (2003)
48. Lee, J.A., Lendasse, A., Donckers, N., Verleysen, M.: A robust nonlinear projection method. In: Proceedings of ESANN'2000, 8th European symposium on artificial neural networks, pp. 13–20 (2000)
49. Estévez, P.A., Figueroa, C.J.: Online data visualization using the neural gas network. Neural
Netw. 19, 923–934 (2006)
50. Estévez, P.A., Chong, A.M., Held, C.M., Perez, C.A.: Nonlinear projection using geodesic
distances and the neural gas network. Lect. Notes Comput. Sci. 4131, 464–473 (2006)
51. Wu, Y., Chan, K.L.: An extended isomap algorithm for learning multi-class manifold. In: Proceedings of the IEEE international conference on machine learning and cybernetics (ICMLC2004), vol. 6, pp. 3429–3433 (2004)
52. Lee, J.A., Verleysen, M.: Nonlinear projection with the isotop method. In: Proceedings of ICANN'2002, international conference on artificial neural networks, pp. 933–938 (2002)
53. Lee, J.A., Archambeau, C., Verleysen, M.: Locally linear embedding versus isotop. In:
ESANN’2003 proceedings: European symposium on artificial neural networks Bruges (Bel-
gium), pp. 527–534 (2003)
54. Demartines, P., Herault, J.: Curvilinear component analysis: a self-organizing neural network
for nonlinear mapping of data sets. IEEE Trans. Neural Netw. 8, 148–154 (1997)
55. Lee, J.A., Lendasse, A., Verleysen, M.: Curvilinear distance analysis versus isomap. In: Proceedings of ESANN'2002, 10th European symposium on artificial neural networks, pp. 185–192 (2002)
56. Lee, J.A., Lendasse, A., Verleysen, M.: Nonlinear projection with curvilinear distances: isomap
versus curvilinear distance analysis. Neurocomputing 57, 49–76 (2004)
57. Vathy-Fogarassy, A., Kiss, A., Abonyi, J.: Topology representing network map—a new tool for
visualization of high-dimensional data. Trans. Comput. Sci. I. 4750, 61–84 (2008) (Springer)
58. Vathy-Fogarassy, A., Abonyi, J.: Local and global mappings of topology representing networks.
Inf. Sci. 179, 3791–3803 (2009)
59. Dijkstra, E.W.: A note on two problems in connection with graphs. Numer. Math. 1, 269–271
(1959)
60. Martinetz, T.M., Schulten, K.J.: Topology representing networks. Neural Netw. 7(3), 507–522 (1994)
Appendix
Dijkstra's algorithm calculates the shortest path from a selected vertex to every other
vertex in a weighted graph in which the weights of the edges are non-negative numbers.
Like Prim's and Kruskal's algorithms, it is a greedy algorithm. It starts from
a selected node s and iteratively adds the node closest to the set of nodes visited so far.
The whole algorithm is described in Algorithm 16.
At the end of the algorithm the improved tentative distances of the nodes give
their distances to vertex s.
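Algorithm 16 itself is not reproduced here, but the idea can be sketched as follows; the heap-based variant and the adjacency-dictionary representation are assumptions of this illustration.

import heapq

def dijkstra(graph, s):
    # graph: {u: {v: w, ...}, ...}, every node is a key, all weights w >= 0
    dist = {u: float("inf") for u in graph}
    dist[s] = 0.0
    heap = [(0.0, s)]                       # (tentative distance, node)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:                     # skip outdated heap entries
            continue
        for v, w in graph[u].items():       # relax the edges of the closest node
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (dist[v], v))
    return dist                             # shortest distances from s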
Given a weighted graph G = (V, E), the Floyd-Warshall algorithm (or Floyd's
algorithm) computes the shortest paths between all pairs of vertices of G. The
algorithm operates on an n × n matrix representing the costs of the edges between the vertices,
where n is the number of vertices in G (|V| = n). The elements of the matrix
are initialised and updated step by step as described in Algorithm 17.
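A corresponding matrix-based sketch (again an illustration, not Algorithm 17 itself); infinity marks a missing edge.

import numpy as np

def floyd_warshall(C):
    # C: n x n matrix of edge costs, C[i, i] = 0, np.inf where there is no edge
    D = np.array(C, dtype=float, copy=True)
    n = D.shape[0]
    for k in range(n):                      # allow vertex k as an intermediate point
        D = np.minimum(D, D[:, k:k + 1] + D[k:k + 1, :])
    return D                                # shortest path costs between all pairs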
Hierarchical clustering algorithms may be divided into two main groups: (i) agglom-
erative methods and (ii) divisive methods. The agglomerative hierarchical methods
start with N clusters, where each cluster contains only a single object, and they recursively
merge the two most similar groups into a single cluster. At the end of the process the
objects form only a single cluster. The divisive hierarchical methods begin with
all objects in a single cluster and perform splitting until all objects form a discrete
partition.
All hierarchical clustering methods work on similarity or distance matrices. The
agglomerative algorithms merge step by step those clusters that are the most similar,
and the divisive methods split those clusters that are the most dissimilar. The similarity
or distance matrices are updated step by step through the iteration process. Although
the similarity or dissimilarity matrices are generally obtained from the Euclidean
distances of pairs of objects, the pairwise similarities of the clusters can be defined in
numerous other ways.
The agglomerative hierarchical methods utilize most commonly the following
approaches to determine the distances between the clusters: (i) single linkage method,
(ii) complete linkage method and (iii) average linkage method. The single linkage
method [4] is also known as the nearest neighbor technique. Using this similarity
measure the agglomerative hierarchical algorithms join together the two clusters
whose two closest members have the smallest distance. The single linkage clustering
methods are also often utilized in the graph theoretical algorithms, however these
methods suffer from the chaining effect [5]. The complete linkage methods [6] (also
known as the furthest neighbor methods) calculate the pairwise cluster similarities
based on the furthest elements of the clusters. These methods merge the two clus-
ters with the smallest maximum pairwise distance in each step. Algorithms based
on complete linkage methods produce tightly bound or compact clusters [7]. The
average linkage methods [8] consider the distance between two clusters to be equal
to the average distance from any member of one cluster to any member of the other
cluster. These methods merge those clusters for which this average distance is minimal.
Naturally, there are other ways to determine the merging condition, e.g. the
Ward method, in which the merging of two clusters is based on the size of an error
Ward method, in which the merging of two clusters is based on the size of an error
sum of squares criterion [9].
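These linkage criteria are available, for instance, in SciPy; the following sketch only illustrates the corresponding calls on toy data (the array names are arbitrary), not an experiment of the book.

import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(30, 2)                  # toy data
d = pdist(X)                               # condensed Euclidean distance matrix
for method in ("single", "complete", "average"):
    Z = linkage(d, method=method)          # agglomerative merge tree (dendrogram data)
    labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 clusters
    print(method, np.bincount(labels)[1:])            # cluster sizes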
The divisive hierarchical methods are computationally demanding. If the number
of the objects to be clustered is N, there are 2^(N-1) - 1 possible divisions to form
the next stage of the clustering procedure. The division criterion may be based on a
single variable (monothetic divisive methods), or the split can also be decided by the
use of all the variables simultaneously (polythetic divisive methods).
The nested grouping of the objects and the similarity levels are usually displayed in
a dendrogram. The dendrogram is a tree-like diagram, in which the nodes represent
the clusters and the lengths of the stems represent the distances of the clusters to be
merged or split. Figure A.1 shows a typical dendrogram representation. It can be seen
that, in the case of the application of an agglomerative hierarchical method, objects a and
b are merged first, then objects c and d coalesce into a group, following
this the algorithm merges the clusters containing the objects {e} and {c, d}, and finally
all objects belong to a single cluster.
One of the main advantages of the hierarchical algorithms is that the number of
the clusters need not be specified a priori. There are several possibilities to choose the
proper result from the nested series of clusters. On the one hand, it is possible to
stop the running of a hierarchical algorithm when the distance between the nearest
clusters exceeds a predefined threshold.
Fig. A.1 A typical dendrogram representation of the objects a, b, c, d and e
The VAT (visual assessment of cluster tendency) method displays the pairwise dissimilarity
matrix D of the objects as an intensity image after a suitable reordering. It works in two steps:
• Step 1: reorder the dissimilarity data to get D̃, in which adjacent points are
members of a possible cluster;
• Step 2: display the dissimilarity image based on D̃, where the gray level of a pixel
is related to the dissimilarity of the corresponding pair of points.
The key step of this procedure is the reordering of D. For that purpose, Bezdek
used Prim’s algorithm [2] (see Appendix A.1.1) for finding a minimal spanning tree.
The undirected, fully connected and weighted graph analysed here contains the data
points or samples as nodes (vertices) and the edge lengths or weights of the edges are
the values in D, the pairwise distances between the samples. There are two differences
between Prim's algorithm and VAT: (1) VAT does not need the minimal spanning
tree itself (it determines the edges as well but does not store them), just the
order in which the vertices (samples or objects xi) are added to the tree; and (2) it
applies a special initialization. The minimal spanning tree contains all of the vertices of the
fully connected, weighted graph of the samples, therefore any point could be selected
as the initial vertex. However, to help ensure the best chance of display success, Bezdek
proposed a special initialization: the initial vertex is either of the two samples that are
the farthest from each other in the data set (xi, where i is the row or column index
of max(D)). The first row and column of D̃ will be the ith row and column of D. After
the initialization, the two methods are exactly the same. Namely, D is reordered so
that the second row and column correspond to the sample closest to the first sample,
the third row and column correspond to the sample closest to either one of the first two
samples, and so on.
This procedure is similar to the single-linkage algorithm, which corresponds to
Kruskal's minimal spanning tree algorithm [3] (see Appendix A.1.2) and is basically
a greedy approach to find a minimal spanning tree. With hierarchical clustering
algorithms (such as the single-linkage, complete-linkage or average-linkage methods)
the results are displayed as a dendrogram, which is a nested structure of clusters.
(Hierarchical clustering methods are not described here; the interested reader can
refer to, e.g., [8].) Bezdek et al. followed another way and displayed the results as
an intensity image I(D̃) of size N × N. The approach was presented in [13]
as follows. Let G = {gm, . . . , gM} be the set of gray levels used for image displays.
In the following, G = {0, . . . , 255}, so gm = 0 (black) and gM = 255 (white).
Calculate
\[
(I(\tilde{D}))_{i,j} = \frac{g_M}{\max(\tilde{D})}\, \tilde{D}_{i,j}. \qquad (A.2)
\]
Convert (I(D̃))i,j to its nearest integer. These values will be the intensities displayed
for the pixels (i, j) of I(D̃). In this form of display, 'white' corresponds to the maximal
distance between the data (and there will always be two white pixels), and the darker the
pixel, the closer the two data points are. (For large data sets, the image can easily exceed
the resolution of the display. To solve that problem, Huband, Bezdek and Hathaway
have proposed variations of VAT [13].) This image contains information about
cluster tendency. Dark blocks along the diagonal indicate possible clusters, and if
the image exhibits many variations in gray levels with faint or indistinct dark blocks
along the diagonal, then the data set “[. . .] does not contain distinct clusters; or the
clustering scheme implicitly imbedded in the reordering strategy fails to detect the
clusters (there are cluster types for which single-linkage fails famously [. . .]).”
Figure A.2 gives a small example for the VAT representation. In this example
the number of the objects is 40, thereby VAT represents the data dissimilarities in a
square image with 40 × 40 pixels. The figure shows how the well-separated cluster
structure is indicated by dark diagonal blocks in the intensity image. Although VAT
becomes intractable for large data sets, the bigVAT [13] as a modification of VAT
allows the visualization for larger data sets, too.
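The reordering and the gray-level scaling of (A.2) can be sketched compactly as follows; vat_image is a hypothetical helper illustrating the idea, not Bezdek's original code.

import numpy as np

def vat_image(D, g_max=255):
    # D: symmetric N x N dissimilarity matrix
    N = D.shape[0]
    # start from one of the two samples that are farthest from each other
    order = [int(np.unravel_index(np.argmax(D), D.shape)[0])]
    remaining = set(range(N)) - set(order)
    while remaining:
        rem = sorted(remaining)
        sub = D[np.ix_(order, rem)]
        # add the remaining sample that is closest to any already ordered sample
        j = rem[int(np.argmin(np.min(sub, axis=0)))]
        order.append(j)
        remaining.remove(j)
    D_tilde = D[np.ix_(order, order)]                 # reordered dissimilarities
    return np.rint(g_max * D_tilde / D_tilde.max()).astype(int)   # Eq. (A.2)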
One of the main advantages of hierarchical clusterings is that they are able to detect
non-convex clusters, e.g. an 'S'-like cluster in two dimensions, where it can be
the case that two data points which clearly belong to the same cluster are relatively
far from each other. In this case the dendrogram generated by single-linkage clearly
indicates the distinct clusters, but there will be no dark block in the intensity image of
VAT. Certainly, single-linkage does have drawbacks, e.g. it suffers from the chaining
effect, but a question naturally arises: how much additional information can be given
by VAT? This is because it roughly performs a hierarchical clustering, but the result is
displayed not as a dendrogram but based on the pairwise distances of the data samples, and
it works well only if the data in the same cluster are relatively close to each other
based on the original distance norm. (This problem arises not only with clusters of
non-convex shape, but with very elongated ellipsoids as well.) Therefore, one advantage
of hierarchical clustering is lost.
Fig. A.2 The VAT representation. a The original data set. b VAT
Fig. A.3 Result of the (left) single-linkage algorithm and (right) VAT on synthetic data
In Fig. A.3 results of the single-linkage algorithm and VAT can be seen on the
synthetic data. The clusters are well-separated but non-convex, and single-linkage
clearly identifies them as can be seen from the dendrogram. However, the VAT image
is not as clear as the dendrogram in this case because there are data in the ‘S’ shaped
cluster that are far from each other based on the Euclidean distance norm (see the
top and left corner of the image).
The meaning of the relations written above is that the degree of membership
is a real number in [0, 1] (A.3); the sum of the membership values of an object is
exactly one (A.4); and each cluster must contain at least one object with a membership
value larger than zero, while the sum of its membership values cannot
exceed the number of elements considered (A.5).
Based on the previous statements the fuzzy partitioning space can be formulated as
follows: Let X = {x1 , x2 , . . . , x N } be a finite set of the observed data, let 2 ≤ c ≤ N
be an integer. The fuzzy partitioning space for X is the set
\[
M_{fc} = \left\{ \mathbf{U} \in \mathbb{R}^{c \times N} \,\middle|\, \mu_{i,k} \in [0,1],\ \forall i,k;\ \sum_{i=1}^{c} \mu_{i,k} = 1,\ \forall k;\ 0 < \sum_{k=1}^{N} \mu_{i,k} < N,\ \forall i \right\} \qquad (A.6)
\]
where A is a symmetric and positive definite norm-inducing matrix (the distance between
data point xk and cluster prototype vi is measured as D²(i,k,A) = (xk − vi)^T A (xk − vi)).
Different distance norms can be induced by the choice of the matrix A. The Euclidean
distance arises with the choice A = I, where I is the identity matrix. The Mahalanobis
norm is induced when A = F^(−1), where F is the covariance matrix of the objects. It
can be seen that both the Euclidean and the Mahalanobis distances are based on fixed
distance norms. The Euclidean norm based methods find only hyperspherical clusters, and
the Mahalanobis norm based methods find only hyperellipsoidal ones (see Fig. A.4), even
if those shapes of clusters are not present in the data set. The norm-inducing matrix
of the cluster prototypes can be adapted by using estimates of the data covariance,
and can be used to estimate the statistical dependence of the data in each cluster.
The Gustafson-Kessel algorithm (GK) [15] and the Gaussian mixture based fuzzy
maximum likelihood estimation algorithm (the Gath-Geva algorithm, GG [16]) are
based on such an adaptive distance measure: they can adapt the distance norm to
the underlying distribution of the data, which is reflected in the different sizes of the
clusters, and hence they are able to detect clusters with different orientation and volume.
In the GG algorithm the fuzzy covariance matrix of the ith cluster is computed as
\[
\mathbf{F}_i^{(t)} = \frac{\sum_{k=1}^{N} \left(\mu_{i,k}^{(t-1)}\right)^m \left(\mathbf{x}_k - \mathbf{v}_i^{(t)}\right)\left(\mathbf{x}_k - \mathbf{v}_i^{(t)}\right)^T}{\sum_{k=1}^{N} \left(\mu_{i,k}^{(t-1)}\right)^m}, \quad 1 \le i \le c. \qquad (A.8)
\]
The distance function is chosen as
\[
D_{i,k}^{2}(\mathbf{x}_k, \mathbf{v}_i) = \frac{(2\pi)^{N/2} \sqrt{\det(\mathbf{F}_i)}}{\alpha_i} \exp\left( \frac{1}{2} \left(\mathbf{x}_k - \mathbf{v}_i^{(t)}\right)^T \mathbf{F}_i^{-1} \left(\mathbf{x}_k - \mathbf{v}_i^{(t)}\right) \right) \qquad (A.9)
\]
with the a priori probability \(\alpha_i = \frac{1}{N} \sum_{k=1}^{N} \mu_{i,k}\). The membership values are updated as
\[
\mu_{i,k}^{(t)} = \frac{1}{\sum_{j=1}^{c} \left( D_{i,k}(\mathbf{x}_k, \mathbf{v}_i) / D_{j,k}(\mathbf{x}_k, \mathbf{v}_j) \right)^{2/(m-1)}}, \quad 1 \le i \le c,\ 1 \le k \le N. \qquad (A.10)
\]
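One iteration of these updates can be sketched as follows; this is a simplified illustration of (A.8)-(A.10) with fuzzifier m, and the fuzzy-mean prototype update for vi is an assumption of the sketch, since it is not repeated in this appendix.

import numpy as np

def gath_geva_step(X, U, m=2.0):
    # X: N x dim data matrix, U: c x N fuzzy membership matrix from the previous step
    N, dim = X.shape
    c = U.shape[0]
    Um = U ** m
    V = (Um @ X) / Um.sum(axis=1, keepdims=True)     # prototypes (assumed fuzzy means)
    alpha = U.sum(axis=1) / N                        # a priori probabilities
    D2 = np.empty((c, N))                            # squared GG distances, Eq. (A.9)
    for i in range(c):
        diff = X - V[i]
        F = (Um[i, :, None] * diff).T @ diff / Um[i].sum()        # Eq. (A.8)
        expo = 0.5 * np.einsum("nd,de,ne->n", diff, np.linalg.inv(F), diff)
        # Gaussian-type normalisation of Eq. (A.9); the exponent uses the data dimension
        D2[i] = (2 * np.pi) ** (dim / 2) * np.sqrt(np.linalg.det(F)) / alpha[i] * np.exp(expo)
    # Eq. (A.10): D2 stores squared distances, so the exponent 2/(m-1) becomes 1/(m-1)
    ratio = (D2[:, None, :] / D2[None, :, :]) ** (1.0 / (m - 1))
    U_new = 1.0 / ratio.sum(axis=1)
    return V, U_new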
The Iris data set [17] (http://www.ics.uci.edu) contains measurements on three class-
es of iris flowers. The data set was made by measurements of sepal length and width
and petal length and width for a collection of 150 irises. The analysed data set
contains 50 samples from each class of iris flowers (Iris setosa, Iris versicolor and
Iris virginica). The problem is to distinguish the three different types of the iris flower.
Iris setosa is easily distinguishable from the other two types, but Iris versicolor and
Iris virginica are very similar to each other. This data set has been analysed many
times to illustrate various clustering methods.
The semeion data set contains 1593 handwritten digits from around 80 persons. Each
person wrote on a paper all the digits from 0 to 9, twice. First time in the normal way
as accurate as they can and the second time in a fast way. The digits were scanned and
stretched in a rectangular box including 16 × 16 cells in a grey scale of 256 values.
Then each pixel of each image was scaled into a boolean value using a fixed threshold.
As a result the data set contains 1593 sample digits and each digit is characterised
by 256 boolean variables. The data set is available from the UCI Machine Learning
Repository [18]. The version in the UCI Machine Learning Repository contains 266
attributes for each sample digit, where the last 10 attributes describe the classification
of the digits.
The Wisconsin breast cancer database is a well-known diagnostic data set for breast
cancer. The data set contains 9 attributes and 683 instances (16 records with missing
values were deleted), of which 444 are benign and 239 are malignant.
The Swiss roll data set is a 3-dimensional data set with a 2-dimensional nonlinearly
embedded manifold. The 3-dimensional visualization of the Swiss roll data set is
shown in Fig. A.5.
The S curve data set is a 3-dimensional synthetic data set, in which data points are
placed on a 3-dimensional ‘S’ curve. The 3-dimensional visualization of the S curve
data set is shown in Fig. A.6.
The synthetic data set ‘boxlinecircle’ was made by the authors of the book. The data
set contains 7100 sample data placed in a cube, in a refracted line and in a circle. As
this data set contains shapes with different dimensions, it is useful to demonstrate
the various selected methods. The data points placed in the cube also contain random errors
(noise). In Fig. A.7 the data points are shown as blue points and the borders of the
point sets are illustrated with red lines.
The Variety data set is a synthetic data set which contains 100 2-dimensional data
objects. 99 objects are partitioned in 3 clusters with different sizes (22, 26 and 51
objects), shapes and densities, and it also contains an outlier. Figure A.8 shows the
normalized data set.
The ChainLink data set is a synthetic data set which contains 75 2-dimensional data
objects. The objects can be partitioned into 3 clusters and a chain link which connects
2 of the clusters. Since linkage based methods often suffer from the chaining effect, this
example tends to illustrate this problem. Figure A.9 shows the normalised data set.
The Curves data set is a synthetic data set which contains 267 2-dimensional data
objects. The objects can be partitioned into 4 clusters. What makes this data set inter-
esting is that the objects form clusters with arbitrary shapes and sizes, furthermore
these clusters lie very near to each other. Figure A.10 shows the normalised data set.
References
1. Jarník, V.: O jistém problému minimálním [About a certain minimal problem]. Práce Moravské
Přírodovědecké Společnosti 6, 57–63 (1930)
2. Prim, R.C.: Shortest connection networks and some generalizations. Bell System Technical Journal 36, 1389–1401 (1957)
3. Kruskal, J.B.: On the shortest spanning subtree of a graph and the traveling salesman problem. Proceedings of the American Mathematical Society 7(1), 48–50 (1956)
4. Sneath, P.H.A., Sokal, R.R.: Numerical Taxonomy. Freeman, San Francisco (1973)
5. Nagy, G.: State of the art in pattern recognition. Proceedings of the IEEE 56(5), 836–862 (1968)
6. King, B.: Step-wise clustering procedures. Journal of the American Statistical Association 69,
86–101 (1967)
7. Baeza-Yates, R.A.: Introduction to data structures and algorithms related to information retrieval. In: Frakes, W.B., Baeza-Yates, R.A. (eds.) Information Retrieval: Data Structures and Algorithms, pp. 13–27. Prentice-Hall (1992)
8. Jain, A., Dubes, R.: Algorithms for Clustering Data. Prentice-Hall (1988).
9. Ward, J.H.: Hierarchical grouping to optimize an objective function. Journal of the American
Statistical Association 58, 236–244 (1963)
10. Zahn, C.T.: Graph-theoretical methods for detecting and describing gestalt clusters. IEEE Transactions on Computers C-20, 68–86 (1971)
11. Bezdek, J.C., Hathaway, R.J.: VAT: A Tool for Visual Assessment of (Cluster) Tendency. IJCNN
2002, 2225–2230 (2002)
12. Huband, J., Bezdek, J., Hathaway, R.: Revised Visual Assessment of (Cluster) Tendency
(reVAT). Proceedings of the North American Fuzzy Information Processing Society (NAFIPS),
101–104 (2004).
13. Huband, J., Bezdek, J., Hathaway, R.: bigVAT: Visual assessment of cluster tendency for large
data sets. Pattern Recognition 38(11), 1875–1886 (2005)
14. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the
EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological) 39, 1–38
(1977)
15. Gustafson, D.E., Kessel, W.C.: Fuzzy clustering with fuzzy covariance matrix. Proceedings of
the IEEE CDC, 761–766 (1979).
16. Gath, I., Geva, A.B.: Unsupervised Optimal Fuzzy Clustering. IEEE Transactions on Pattern
Analysis and Machine Intelligence 11, 773–781 (1989)
17. Fisher, R.A.: The Use of Multiple Measurements in Taxonomic Problems. Annals of Eugenics
7, 179–188 (1936)
18. UC Irvine Machine Learning Repository www.ics.uci.edu/ mlearn/ Cited 15 Oct 2012.
19. Mangasarian, O.L., Wolberg, W.H.: Cancer diagnosis via linear programming. Society for
Industrial and Applied Mathematics News 23(5), 1–18 (1990)