
A Parallel Study on Clustering Algorithms in Data Mining

Abstract – Data mining is the process of extracting information from a data set and transforming it into a comprehensible structure for further use. Clustering is one of the unsupervised techniques in data mining: it places data elements into their related groups. Each of the subclasses produced by partitioning the data objects is called a 'cluster'. A cluster consists of data objects with high intra-cluster similarity and low inter-cluster similarity. The quality of a cluster depends on the method used. Clustering, also called data segmentation, partitions large data sets into groups according to their similarity. This paper surveys various clustering techniques and compares them on key issues, pros, and cons, providing guidance for choosing a clustering algorithm for a specific application. The comparison is based on computing performance and clustering accuracy.

Keywords – Data Mining, Clustering, Partitioning, Segmentation

Introduction:

In computer science, Big Data refers to extremely large data sets, from a few hundred terabytes to a few petabytes or even more. How is such an extreme amount of data generated? Almost everything we do in the modern world generates data, and all of those small pieces of data accumulate into the large collections termed Big Data. Although Big Data is a relatively new field of study, the amount of data generated in today's world is monumental and increasing exponentially. The reason for storing these data sets is to analyse them and retrieve information for processing queries that may arise in the near future. For example, an e-commerce shop holds the data of all past orders of customers from different parts of the world. Using this stored data, it can later determine which products the customers of a particular area prefer over other products, and what they are most likely to buy next.

Big data is any data, regardless of its form and generation source. Data are classified into three types:
1. Structured data: data that is easily organized and stored in databases. For example, the data stored in an RDBMS.
2. Semi-structured data: data that is unorganized but has some internal links within it. For example, log files and XML files.
3. Unstructured data: data that has no clear storage format. For example, image, video, and audio files.
Big data analytics is the technique of examining stored data to identify hidden patterns and interdependence among the data. It can be applied in various fields where a huge amount of data is generated. Big data is generally characterised by the three V's: Velocity, the rate at which data comes into an organization; Variety, the different types of data; and Volume, the size of the data flowing into an organization.

Data mining is a technique used in big data analytics for discovering hidden correlations and patterns in data from data warehouses that cannot be obtained using traditional techniques. It is designed to explore giant amounts of information in search of consistent patterns and to validate the results by applying the detected patterns to new subsets of the information. One of the data mining techniques is clustering.

Clustering originated in anthropology with Driver and Kroeber in 1932, was introduced to psychology by Zubin in 1938 and Robert Tryon in 1939, and was famously used by Cattell beginning in 1943 for trait theory classification in personality psychology.

Clustering is organising data into groups called clusters, such that there is high intra-cluster similarity and low inter-cluster similarity. The basic concept of cluster analysis is partitioning a set of data objects or observations into subsets. Each subset is distinct, such that objects in one cluster are similar to one another yet dissimilar to objects in other clusters. It is a common technique for statistical data analysis and is used in many fields, including machine learning, data mining, pattern recognition, image analysis, and bioinformatics.

Why Clustering?
1. It helps in organizing huge, voluminous data into clusters that reveal the internal structure of the data.
2. Sometimes the goal of clustering is simply a partitioning of the data.
3. After clustering, the data is ready to be used by other AI techniques.
4. Clustering techniques are useful for knowledge discovery in data.
5. It is used either as a stand-alone tool to gain insight into the data distribution or as a pre-processing step for other algorithms.

Types of clustering:

Clustering is defined as the grouping of similar text documents into clusters such that the documents within a cluster have high similarity to one another but are dissimilar to documents in other clusters. As thousands of electronic documents are added to the World Wide Web, it becomes very important to browse or search the relevant data effectively. To identify suitable clustering algorithms that produce the best clustering solutions, a method for comparing the results of different clustering algorithms is needed. Many different clustering techniques have been defined to solve the problem from different perspectives; these are:
1. Partitional Clustering
2. Density based Clustering
3. Hierarchical clustering
Table for classification of clustering algorithms:

Type                Algorithm      Author                Year  Key Idea           Type of data
Partition based     k-Means        MacQueen              1967  Mean centroid      numerical
                    k-Medoids      Kaufman & Rousseeuw   1987  Medoid centroid
Hierarchical based  Agglomerative  S.C. Johnson          1967
                    Divisive       Guha, Rastogi & Shim  1998  Partition samples  numerical
Density based       DBSCAN         Ester et al.          1996  Fixed size         numerical
                    OPTICS                                     Variable size      numerical

Partitional Clustering:

Partitioning methods obtain a single-level partition of objects. Given n objects, these methods make k < n clusters of data and use an iterative relocation method. The data is grouped into k groups provided the following requirements are satisfied:
1. Each group contains at least one object.
2. Each object belongs to exactly one group.
The algorithms used in the partitioning method are:
a. k-means algorithm
b. k-medoids algorithm

a. k-means algorithm:
It is one of the simplest clustering algorithms, solving the clustering problem by forming clusters iteratively. It is a numerical, unsupervised, iterative and evolutionary algorithm whose name comes from its method of operation. It aims to find positions of the cluster centroids that minimise the distance from the data points to their cluster. The algorithm partitions the n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster. k-means is also known as Lloyd's algorithm.
Algorithm:
Step 1: Define the number of clusters (k) and select that many data points as the initial centroids.
Step 2: Calculate the distance of a point from every centroid and assign the point to the cluster with the nearest centroid.
Step 3: Repeat step 2 for all data points.
Step 4: Calculate the mean of all points in a cluster and assign it as the new centroid for that cluster.
Step 5: Repeat from step 2 until the desired clusters are obtained or some stopping criterion is satisfied.
Since the initial centroids are selected randomly, the resulting clusters depend on the initial centroids.
The complexity of the k-means algorithm is O(tkn), where k is the number of clusters, t is the number of iterations and n is the number of data points.
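As an illustration, the steps above translate almost directly into code. The following is a minimal NumPy sketch (function and variable names are ours, not the paper's); a production implementation would add a smarter initialisation such as k-means++.

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means sketch: X is an (n, d) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Step 1: select k distinct data points at random as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Steps 2-3: assign every point to the cluster with the nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its cluster
        # (an empty cluster keeps its old centroid).
        centroids_new = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids no longer move.
        if np.allclose(centroids_new, centroids):
            break
        centroids = centroids_new
    return labels, centroids

# Example: two well-separated Gaussian blobs.
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5.0])
labels, centroids = kmeans(X, k=2)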

Advantages:
1. It is simple to implement.
2. It is suitable for very large databases.
3. It produces denser clusters than the hierarchical method, especially when the clusters are spherical.

Disadvantages:
1. It does not work well with clusters of different sizes and densities.
2. It does not produce the same result on each run.
3. Euclidean distance measures can unequally weight underlying factors.
4. It fails for categorical data and non-linear data sets.
5. It has difficulty handling noisy data and outliers.

b. k-medoids algorithm:

In the k-medoids algorithm, each cluster is represented by one of the objects located near the centre of the cluster. The iterative process of replacing representative objects by non-representative objects continues as long as the quality of the resulting clustering improves. This quality is estimated using a cost function that measures the average dissimilarity between an object and the representative object of its cluster.
The algorithm proceeds in two steps:
1. BUILD step: sequentially select k "centrally located" objects to be used as the initial medoids.
2. SWAP step: if the objective function can be reduced by interchanging (swapping) a selected object with an unselected object, carry out the swap. This is continued until the objective function can no longer be decreased.
The algorithm is as follows:
Step 1: Initially select k random points as the medoids from the given n data points of the data set.
Step 2: Associate each data point with the closest medoid, using any of the common distance metrics.
Step 3: For each pair of a non-selected object h and a selected object i, calculate the total swapping cost TC_ih.
Step 4: If TC_ih < 0, i is replaced by h.
Step 5: Repeat steps 2-4 until there is no change in the medoids.
There are four cases to be considered in this process:
a. Shift-out membership: an object p_i may need to be shifted from the currently considered cluster of O_j to another cluster.
b. Update the current medoid: a new medoid O_c is found to replace the current medoid O_j.
c. No change: objects in the current cluster result have the same or an even smaller square error criterion (SEC) measure for all the possible redistributions considered.
d. Shift-in membership: an outside object p_i is assigned to the current cluster with the new (replaced) medoid O_c.
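As an illustration, the swap logic can be sketched compactly in NumPy. In this sketch the medoids are initialised at random (standing in for the BUILD step, an assumption made here for brevity), and any swap with negative total swapping cost TC_ih is applied greedily.

import numpy as np

def k_medoids(X, k, max_iter=100, seed=0):
    """PAM-style sketch: swap medoids while the total dissimilarity decreases."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # Pairwise dissimilarity matrix (Euclidean here, but any metric works).
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    medoids = list(rng.choice(n, size=k, replace=False))

    def cost(meds):
        # Sum of dissimilarities of every point to its nearest medoid.
        return D[:, meds].min(axis=1).sum()

    best = cost(medoids)
    for _ in range(max_iter):
        improved = False
        for i in range(k):            # slot of a selected object (medoid)
            for h in range(n):        # candidate non-selected object
                if h in medoids:
                    continue
                candidate = medoids.copy()
                candidate[i] = h
                tc = cost(candidate) - best    # total swapping cost TC_ih
                if tc < 0:                     # Step 4: swap if TC_ih < 0
                    medoids, best, improved = candidate, best + tc, True
        if not improved:              # Step 5: no medoid changed
            break
    labels = D[:, medoids].argmin(axis=1)
    return labels, medoids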
Advantages:
1. Simple to understand and implement.
2. Fast, and convergent in a finite number of steps.
3. Allows using general dissimilarities between objects.
4. It is usually less sensitive to outliers and more robust than k-means, because it minimises a sum of pairwise dissimilarities instead of a sum of squared Euclidean distances.

Disadvantages:
1. Different initial sets of medoids can lead to different final clusterings. It is thus advisable to run the procedure several times with different initial sets of medoids.
2. The resulting clustering depends on the units of measurement. If the variables are of different natures or differ greatly in magnitude, it is advisable to standardize them.

Hierarchical based clustering:

Hierarchical methods try to decompose the dataset of n objects into a hierarchy of groups. This hierarchical decomposition can be represented by a tree structure diagram called a dendrogram, whose root node represents the whole dataset and whose leaf nodes each represent a single object of the dataset. The clustering results can be obtained by cutting the dendrogram at different levels.
There are two general approaches for the hierarchical method:
a. Agglomerative ( Bottom-up )
b. Divisive ( Top-down)

a. Agglomerative clustering algorithm:


The bottom-up approach begins with each element as a separate cluster and merges them into successively larger clusters.
Algorithm:
Step 1: Start by assigning each item to its own cluster, so that if you have N items you now have N clusters, each containing just one item. Let the distances (similarities) between the clusters equal the distances (similarities) between the items they contain.
Step 2: Find the closest (most similar) pair of clusters and merge them into a single cluster, so that you now have one cluster fewer, with the help of _____________
Step 3: Compute the distances (similarities) between the new cluster and each of the old clusters.
Step 4: Repeat steps 2 and 3 until all N items are clustered into a single cluster.
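A naive sketch of these steps follows (O(n^3) per run, which is fine for illustration). The merge criterion left blank in step 2 is assumed here to be single linkage, i.e. the distance between two clusters is the smallest pairwise distance between their points.

import numpy as np

def agglomerative(X, num_clusters=1):
    """Naive single-linkage agglomerative clustering sketch."""
    # Step 1: every item starts in its own cluster.
    clusters = [[i] for i in range(len(X))]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    while len(clusters) > num_clusters:
        # Step 2: find the closest pair of clusters under single linkage.
        best = (np.inf, 0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = D[np.ix_(clusters[a], clusters[b])].min()
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Steps 3-4: merge the pair; inter-cluster distances are recomputed
        # from D on the next pass.
        clusters[a] += clusters.pop(b)
    return clusters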
Advantages:
1. Capable of identifying nested clusters.
2. Easy to implement, and gives the best results in some cases.
3. Suitable for automation.
4. Reduces the effect of initial cluster values on the clustering results.
5. The method can shorten computing time, reduce space complexity, and improve the clustering results.

Disadvantages:
1. It can never undo what was done previously.
2. Depending on the distance metric chosen for merging, different algorithms can suffer from one or more of the following:
a. Sensitivity to noise and outliers
b. Breaking large clusters
c. Difficulty handling clusters of different sizes and convex shapes
3. No objective function is directly minimized.
4. Sometimes it is difficult to identify the correct number of clusters from the dendrogram.
5. At least O(n² log n) time is required, where n is the number of data points.

b. Divisive Clustering algorithm:

This approach is also known as the top-down approach. We start with all of the objects in the same cluster. In each successive iteration, a cluster is split into smaller clusters. This continues until each object is in its own cluster or the termination condition holds. This method is rigid, i.e., once a split is done, it can never be undone.
Algorithm:
Step 1: Transform the original categorical data into an indicator matrix Z.
Step 2: Initialize a binary tree with a single root holding all the objects.
Step 3: Choose one leaf cluster C_p to split into two clusters C_pL and C_pR.
Step 4: Repeat step 3 until no leaf cluster can be split to improve the clustering quality.
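The algorithm above is specific to categorical data (via the indicator matrix Z). As a generic illustration of the top-down scheme only, and not the paper's method, the sketch below repeatedly splits the largest leaf cluster with 2-means, reusing the kmeans function from the k-means section.

import numpy as np

def divisive(X, num_clusters, seed=0):
    """Generic top-down sketch: bisect leaves until num_clusters is reached."""
    # Step 2: a binary-tree root holding all objects; we track only the leaves.
    leaves = [np.arange(len(X))]
    while len(leaves) < num_clusters:
        # Step 3: choose a leaf cluster to split (here simply the largest one).
        idx = max(range(len(leaves)), key=lambda i: len(leaves[i]))
        members = leaves.pop(idx)
        # Split with 2-means (the kmeans sketch from the k-means section).
        labels, _ = kmeans(X[members], k=2, seed=seed)
        left, right = members[labels == 0], members[labels == 1]
        if len(left) == 0 or len(right) == 0:   # degenerate split: stop early
            leaves.append(members)
            break
        leaves.extend([left, right])
    return leaves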

Advantages:

Disadvantages:
1. No provision can be made for relocating objects that may have been incorrectly grouped at an early stage; the result should be examined closely to ensure it makes sense.
2. The use of different distance metrics for measuring the distance between clusters may generate different results. Performing multiple experiments and comparing the results is recommended to support the veracity of the original results.
Density based clustering:

Density based clustering was developed to discover clusters with arbitrary shape. As the name suggests, this technique deals with the density of clusters: clusters are separated from one another on the basis of varying densities, so a cluster of a certain density is surrounded by points of low density. The basic idea is to check whether there are sufficient data points in the neighbourhood of a point to meet a minimum-number-of-points criterion. This minimum number of points is a user-defined threshold: if a point does not have at least that many data points in its neighbourhood, it is not considered part of a cluster. There are two types of points in any cluster, namely core points and border points. The neighbourhood is defined by a user-selected distance function, and the shape of the neighbourhood depends on that function. The shape of the cluster itself is arbitrary, since the cluster grows in any direction where the density is adequate. The threshold values for core points and border points may differ from each other. Outliers are discarded, since they do not have sufficient data points in their neighbourhood to form a cluster.
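These ideas are what the DBSCAN algorithm from the classification table implements. A compact sketch follows, where eps is the neighbourhood radius and min_pts is the least-number-of-points threshold (both parameter names are ours); points left with label -1 are the discarded outliers.

import numpy as np

def dbscan(X, eps, min_pts):
    """Compact DBSCAN sketch; returns a label per point, -1 meaning outlier."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Epsilon-neighbourhood of each point (includes the point itself).
    neighbours = [np.flatnonzero(D[i] <= eps) for i in range(n)]
    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        # Skip points already clustered, and non-core points.
        if labels[i] != -1 or len(neighbours[i]) < min_pts:
            continue
        # Grow a new cluster outward from core point i, in any direction
        # where the density is adequate.
        labels[i] = cluster
        frontier = list(neighbours[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster
                if len(neighbours[j]) >= min_pts:   # j is itself a core point
                    frontier.extend(neighbours[j])  # expand the cluster from j
        cluster += 1
    return labels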
