Importance of Clustering in Data Mining
Abstract— Cluster analysis groups objects (observations, events) based on the information found in the data describing the objects or their relationships. The goal is that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups. The greater the similarity (or homogeneity) within a group, and the greater the difference between groups, the "better" or more distinct the clustering. Data mining is the process of analysing data from different viewpoints and summarising it into useful information. Data mining has been one of the most active research areas in recent years. Cluster analysis is an important research field within data mining and holds a unique position in a large number of data analysis and processing tasks.

clusters. Clustering is important in data analysis and data mining applications [1]. It is the task of grouping a set of objects so that objects in the same group are more similar to each other than to those in other groups. A good clustering algorithm is able to identify clusters irrespective of their shapes. The stages involved in a clustering algorithm are as follows:

Raw data -> Clustering algorithms -> Clusters of data
IJSER © 2016
http://www.ijser.org
International Journal of Scientific & Engineering Research, Volume 7, Issue 2, February-2016
ISSN 2229-5518 248
modify data preprocessing and model parameters until the result achieves the desired properties...

Besides the term clustering, there are a number of terms with similar meanings, including automatic classification and numerical taxonomy. The subtle differences are often in the usage of the results: while in data mining the resulting groups are the matter of interest, in automatic classification the resulting discriminative power is of interest. This often leads to misunderstandings between researchers coming from the fields of data mining and machine learning, since they use the same terms and often the same algorithms, but have different goals.

Why clustering?

• Organizing data into clusters shows the internal structure of the data
– Ex. Clusty and clustering genes above
• Sometimes the partitioning is the goal
– Ex. Market segmentation
• Prepare for other AI techniques
– Ex. Summarize news (cluster and then find centroid)
• Techniques for clustering are useful in knowledge discovery in data
– Ex. Underlying rules, reoccurring patterns, topics, etc.

Methods:

Basic Agglomerative Hierarchical Clustering Algorithm
1) Compute the proximity graph, if necessary. (Sometimes the proximity graph is all that is available.)
2) Merge the closest (most similar) two clusters.
3) Update the proximity matrix to reflect the proximity between the new cluster and the original clusters.
4) Repeat steps 2 and 3 until only a single cluster remains.

The key step of the previous algorithm is the calculation of the proximity between two clusters, and this is where the various agglomerative hierarchical techniques differ.

Any of the cluster proximities that we discuss in this section can be viewed as a choice of different parameters (in the Lance-Williams formula) for the proximity between clusters Q and R, where R is formed by merging clusters A and B.

$$p(R, Q) = \alpha_A \, p(A, Q) + \alpha_B \, p(B, Q) + \beta \, p(A, B) + \gamma \, |p(A, Q) - p(B, Q)|$$

In words, this formula says that after you merge clusters A and B to form cluster R, the distance of the new cluster, R, to an existing cluster, Q, is a linear function of the distances of Q from the original clusters A and B. Any hierarchical technique that can be phrased in this way does not need the original points, only the proximity matrix, which is updated as clustering occurs.

However, while a general formula is nice, it is often easier to understand the different hierarchical methods by looking directly at the definition of cluster distance that each method uses, and that is the approach that we shall take here. [DJ88] and [KR90] both give a table that describes each method in terms of the Lance-Williams formula.

Mutual Nearest Neighbor Clustering

Mutual nearest neighbor clustering is described in [GK77]. It is based on the idea of the "mutual neighborhood value (mnv)" of two points, which is the sum of the ranks of the two points in each other's sorted nearest-neighbor lists. Two points are then said to be mutual nearest neighbors if they are the closest pair of points with that mnv.

Clusters are built up by starting with points as singleton clusters and then merging the closest pair of clusters, where close is defined in terms of the mnv. The mnv between two clusters is the maximum mnv between any pair of points in the combined cluster. If there are ties in mnv between pairs of clusters, they are resolved by looking at the original distances between points. Thus, the algorithm for mutual nearest neighbor clustering works in the following way.
a) First the k-nearest neighbors of all points are found. In graph terms this can be regarded as breaking all but the k strongest links from a point to other points in the proximity graph.
b) For each of the k points in a particular point's k-nearest neighbor list, calculate the mnv value for the two points. It can happen that a point is in one point's k-nearest neighbor list, but not vice-versa. In that case, set the mnv value to some value larger than 2k.
c) Merge the pair of clusters having the lowest mnv (and the lowest distance in case of ties).
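Steps a) to c) of mutual nearest neighbor clustering can be sketched in Python as follows. This is a minimal illustration, not the implementation from [GK77]: the function names (`knn_lists`, `mutual_neighborhood_values`, `closest_pair`) are hypothetical, the input is assumed to be a full symmetric distance matrix, and the value used for one-sided neighbors is assumed to be 2k + 1, which is one possible choice of "some value larger than 2k". Only the first merge is shown.

```python
def knn_lists(dist, k):
    """Step a): for each point, indices of its k nearest neighbors, closest first."""
    n = len(dist)
    return [
        sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k]
        for i in range(n)
    ]


def mutual_neighborhood_values(dist, k):
    """Step b): return {(i, j): mnv} for every pair linked by a k-NN list (i < j)."""
    neighbors = knn_lists(dist, k)
    penalty = 2 * k + 1  # assumption: any value larger than 2k would do
    mnv = {}
    for i, nbrs in enumerate(neighbors):
        for rank_of_j, j in enumerate(nbrs, start=1):
            pair = (min(i, j), max(i, j))
            if i in neighbors[j]:
                # Both points list each other: mnv is the sum of the two ranks.
                rank_of_i = neighbors[j].index(i) + 1
                mnv[pair] = rank_of_j + rank_of_i
            else:
                # j is a neighbor of i, but not vice versa.
                mnv[pair] = penalty
    return mnv


def closest_pair(dist, k):
    """Step c): the pair with the lowest mnv, ties broken by original distance."""
    mnv = mutual_neighborhood_values(dist, k)
    return min(mnv, key=lambda p: (mnv[p], dist[p[0]][p[1]]))


# Four points on a line, used only as toy data.
points = [0.0, 1.0, 3.0, 7.0]
dist = [[abs(p - q) for q in points] for p in points]
print(mutual_neighborhood_values(dist, k=2))
print(closest_pair(dist, k=2))
```

In this toy data the outlying point at 7.0 appears in no other point's 2-nearest-neighbor list, so every pair involving it receives the penalty value and it would be merged last.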
algorithms do not provide a single partitioning of the data set, but instead provide an extensive hierarchy of clusters that merge with each other at certain distances. In a dendrogram, the y-axis marks the distance at which the clusters merge, while the objects are placed along the x-axis such that the clusters don't mix.

Connectivity-based clustering is a whole family of methods that differ by the way distances are computed. Apart from the usual choice of distance functions, the user also needs to decide on the linkage criterion to use (since a cluster consists of multiple objects, there are multiple candidates to compute the distance to). Popular choices are known as single-linkage clustering (the minimum of object distances), complete-linkage clustering (the maximum of object distances) or UPGMA ("Unweighted Pair Group Method with Arithmetic Mean", also known as average-linkage clustering). Furthermore, hierarchical clustering can be agglomerative (starting with single elements and aggregating them into clusters) or divisive (starting with the complete data set and dividing it into partitions).

These methods will not produce a unique partitioning of the data set, but a hierarchy from which the user still needs to choose appropriate clusters. They are not very robust towards outliers, which will either show up as additional clusters or even cause other clusters to merge (known as the "chaining phenomenon", in particular with single-linkage clustering). In the general case, the

Figure: Single-linkage on Gaussian data. At 35 clusters, the biggest cluster starts fragmenting into smaller parts, while before it was still connected to the second largest due to the single-link effect.

Figure: Single-linkage on density-based clusters. 20 clusters extracted, most of which contain single elements, since linkage clustering does not have a notion of "noise".
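The three linkage criteria named above can each be expressed as a Lance-Williams update, so agglomerative clustering needs only the proximity matrix. The following Python sketch illustrates this under stated assumptions: the function names (`lance_williams_coeffs`, `agglomerate`) are hypothetical, the coefficient values are the standard ones for single, complete and average (UPGMA) linkage rather than values given in this paper, and the input is a full symmetric distance matrix. It is meant as an illustration, not an efficient implementation.

```python
from itertools import combinations


def lance_williams_coeffs(method, n_a, n_b):
    """Return (alpha_a, alpha_b, beta, gamma) for the chosen linkage criterion."""
    if method == "single":      # minimum of object distances
        return 0.5, 0.5, 0.0, -0.5
    if method == "complete":    # maximum of object distances
        return 0.5, 0.5, 0.0, 0.5
    if method == "average":     # UPGMA: mean of object distances
        total = n_a + n_b
        return n_a / total, n_b / total, 0.0, 0.0
    raise ValueError(f"unknown linkage: {method}")


def agglomerate(dist, method="single"):
    """Merge clusters until one remains.

    Returns the merge history as (members_a, members_b, merge_distance) tuples.
    """
    # Each cluster id maps to the original point indices it contains.
    members = {i: (i,) for i in range(len(dist))}
    # Proximity between clusters, keyed by the unordered pair of cluster ids.
    prox = {frozenset((i, j)): dist[i][j] for i, j in combinations(members, 2)}
    next_id = len(dist)
    history = []
    while len(members) > 1:
        # Find and merge the closest pair of clusters.
        pair = min(prox, key=prox.get)
        a, b = sorted(pair)
        d_ab = prox.pop(pair)
        alpha_a, alpha_b, beta, gamma = lance_williams_coeffs(
            method, len(members[a]), len(members[b])
        )
        # Update proximities to every other cluster using only old proximities.
        for q in members:
            if q in (a, b):
                continue
            d_aq = prox.pop(frozenset((a, q)))
            d_bq = prox.pop(frozenset((b, q)))
            prox[frozenset((next_id, q))] = (
                alpha_a * d_aq + alpha_b * d_bq
                + beta * d_ab + gamma * abs(d_aq - d_bq)
            )
        history.append((members[a], members[b], d_ab))
        members[next_id] = members.pop(a) + members.pop(b)
        next_id += 1
    return history


# Four points on a line, used only as toy data.
points = [0.0, 1.0, 3.0, 7.0]
dist = [[abs(p - q) for q in points] for p in points]
for method in ("single", "complete", "average"):
    print(method, [d for _, _, d in agglomerate(dist, method)])
```

The merge distances in the history are the heights at which clusters would join in a dendrogram. Cutting the history at a chosen distance gives one partitioning out of the hierarchy.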
represent a particular concept.

III. Analysis of Clustering Algorithm

insurance company industry risk; in the Internet, cluster analysis was used for document classification and information retrieval etc.
[3] http://www.academia.edu/7764213/Analysis_and_Application_of_Clustering_Techniques_in_Data_Mining