
A Parallel Study on Clustering Algorithms in Data Mining

Abstract – Data mining is the process of extracting information from a data set and transforming it into a comprehensible structure for further use. Clustering is one of the unsupervised techniques in data mining: it places data elements into their related groups. Each of the subclasses produced by partitioning the data objects is called a 'cluster'. A cluster consists of data objects with high intra-cluster similarity and low inter-cluster similarity. The quality of a cluster depends on the method used. Clustering, also called data segmentation, partitions large data sets into groups according to their similarity. This paper surveys various clustering techniques and compares them on key issues, pros, and cons, providing guidance for choosing a clustering algorithm for a specific application. The comparison is based on computing performance and clustering accuracy.

Keywords – Data Mining, Clustering, Partitioning, Segmentation

Introduction:

In computer science, Big Data refers to extremely large data sets, from a few hundred terabytes to a few petabytes or even more. How is such an extreme amount of data generated? Almost everything we do in the modern world generates data, and all of those small pieces of data accumulate into the large collections termed Big Data. Although Big Data is a relatively new field of study, the amount of data generated in today's world is monumental and increasing exponentially. The reason for storing these data sets is to analyse them and retrieve information for processing queries that may arise in the near future. For example, an e-commerce shop holds the data of all past orders of customers from different parts of the world. Using this stored data, it can later determine which products the customers of a particular area prefer over other products, and what they are most likely to buy next.

Big data is any data, regardless of its form and generation source. Data are classified into three types:
1. Structured data: data that is easily organized and stored in databases. For example, the data stored in an RDBMS.
2. Semi-structured data: data that is unorganized but has some internal links within it. For example, log files and XML files.
3. Unstructured data: data that has no clear storage format. For example, image, video, and audio files.
Big data analytics is the technique of examining stored data to identify hidden patterns and interdependence among the data. It can be applied in various fields where a huge amount of data is generated. Big data is generally characterised by the three V's: Velocity, the rate at which data comes into an organization; Variety, the different types of data; and Volume, the size of the data flowing into an organization.

Data mining is a technique used in big data analytics for discovering hidden correlations and patterns in data from data warehouses that cannot be obtained using traditional techniques. It is designed to explore giant amounts of information in search of consistent patterns and to validate the results by applying the detected patterns to new subsets of the information. One of the data mining techniques is clustering.

Clustering originated in anthropology with Driver and Kroeber in 1932, was introduced to psychology by Zubin in 1938 and Robert Tryon in 1939, and was famously used by Cattell beginning in 1943 for trait theory classification in personality psychology.

Clustering is organising data into groups called clusters, such that there is high intra-cluster similarity and low inter-cluster similarity. The basic concept of cluster analysis is partitioning a set of data objects or observations into subsets. Each subset is distinct, such that objects in one cluster are similar to one another yet dissimilar to objects in other clusters. It is a common technique for statistical data analysis and is used in many fields, including machine learning, data mining, pattern recognition, image analysis, and bioinformatics.

Why Clustering?
1. It helps in organizing huge, voluminous data into clusters that reveal the internal structure of the data.
2. Sometimes the goal of clustering is simply a partitioning of the data.
3. After clustering, the data is ready to be used by other AI techniques.
4. Clustering techniques are useful for knowledge discovery in data.
5. It is used either as a stand-alone tool to gain insight into the data distribution or as a pre-processing step for other algorithms.

Types of clustering:

Clustering is defined as the grouping of similar text documents into clusters such that the documents within a cluster have high similarity to one another but are dissimilar to documents in other clusters. As thousands of electronic documents are added to the World Wide Web, it becomes very important to browse or search the relevant data effectively. To identify suitable clustering algorithms that produce the best clustering solutions, a method for comparing the results of different clustering algorithms is needed. Many different clustering techniques have been defined to solve the problem from different perspectives; these are:
1. Partitional Clustering
2. Density based Clustering
3. Hierarchical clustering
Table for classification of clustering algorithms:

Type                Algorithm      Author                Year  Key Idea           Type of data
Partition based     k-Means        MacQueen              1967  Mean centroid      numerical
                    k-Medoids      Kaufman & Rousseeuw   1987  Medoid centroid
Hierarchical based  Agglomerative  S.C. Johnson          1967
                    Divisive       Guha, Rastogi & Shim  1998  Partition samples  numerical
Density based       DBSCAN         Ester et al.          1996  Fixed size         numerical
                    OPTICS                                     Variable size      numerical

Partitional Clustering:

Partitioning methods obtain a single-level partition of objects. Given n objects, these methods make k < n clusters of data and use an iterative relocation method. The data is grouped into k groups provided the following requirements are satisfied:
1. Each group contains at least one object.
2. Each object belongs to exactly one group.
The algorithms used in the partitioning method are:
a. k-means algorithm
b. k-medoids algorithm

a. k-means algorithm:
It is one of the simplest clustering algorithms, solving the clustering problem by forming clusters iteratively. It is a numerical, unsupervised, iterative and evolutionary algorithm whose name comes from its method of operation. It aims to find positions of the cluster centroids that minimise the distance from the data points to their cluster. The algorithm partitions the n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster. k-means is also known as Lloyd's algorithm.
Algorithm:
Step 1: Define the number of clusters (k) and select that many data points as the initial centroids.
Step 2: Calculate the distance of a point from every centroid and assign the point to the cluster with the nearest centroid.
Step 3: Repeat step 2 for all data points.
Step 4: Calculate the mean of all points in a cluster and assign it as the new centroid for that cluster.
Step 5: Repeat from step 2 until the desired clusters are obtained or some stopping criterion is satisfied.
Since the initial centroids are selected randomly, the resulting clusters depend on the initial centroids.
The complexity of the k-means algorithm is O(tkn), where k is the number of clusters, t is the number of iterations and n is the number of data points.
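As an illustration, the steps above translate almost directly into code. The following is a minimal NumPy sketch (function and variable names are ours, not the paper's); a production implementation would add a smarter initialisation such as k-means++.

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means sketch: X is an (n, d) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Step 1: select k distinct data points at random as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Steps 2-3: assign every point to the cluster with the nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its cluster
        # (an empty cluster keeps its old centroid).
        centroids_new = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids no longer move.
        if np.allclose(centroids_new, centroids):
            break
        centroids = centroids_new
    return labels, centroids

# Example: two well-separated Gaussian blobs.
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5.0])
labels, centroids = kmeans(X, k=2)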

Advantages:
1. It is simple to implement.
2. It is suitable for very large databases.
3. It produces denser clusters than the hierarchical method, especially when the clusters are spherical.

Disadvantages:
1. It does not work well with clusters of different sizes and densities.
2. It does not produce the same result on each run.
3. Euclidean distance measures can unequally weight underlying factors.
4. It fails for categorical data and non-linear data sets.
5. It has difficulty handling noisy data and outliers.

b. k-medoids algorithm:

In the k-medoids algorithm, each cluster is represented by one of the objects located near the centre of the cluster. The iterative process of replacing representative objects by non-representative objects continues as long as the quality of the resulting clustering improves. This quality is estimated using a cost function that measures the average dissimilarity between an object and the representative object of its cluster.
The algorithm proceeds in two steps:
1. BUILD step: sequentially select k "centrally located" objects to be used as the initial medoids.
2. SWAP step: if the objective function can be reduced by interchanging (swapping) a selected object with an unselected object, carry out the swap. This is continued until the objective function can no longer be decreased.
The algorithm is as follows:
Step 1: Initially select k random points as the medoids from the given n data points of the data set.
Step 2: Associate each data point with the closest medoid, using any of the common distance metrics.
Step 3: For each pair of a non-selected object h and a selected object i, calculate the total swapping cost TC_ih.
Step 4: If TC_ih < 0, i is replaced by h.
Step 5: Repeat steps 2-4 until there is no change in the medoids.
There are four cases to be considered in this process:
a. Shift-out membership: an object p_i may need to be shifted from the currently considered cluster of O_j to another cluster.
b. Update the current medoid: a new medoid O_c is found to replace the current medoid O_j.
c. No change: objects in the current cluster result have the same or an even smaller square error criterion (SEC) measure for all the possible redistributions considered.
d. Shift-in membership: an outside object p_i is assigned to the current cluster with the new (replaced) medoid O_c.
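As an illustration, the swap logic can be sketched compactly in NumPy. In this sketch the medoids are initialised at random (standing in for the BUILD step, an assumption made here for brevity), and any swap with negative total swapping cost TC_ih is applied greedily.

import numpy as np

def k_medoids(X, k, max_iter=100, seed=0):
    """PAM-style sketch: swap medoids while the total dissimilarity decreases."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # Pairwise dissimilarity matrix (Euclidean here, but any metric works).
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    medoids = list(rng.choice(n, size=k, replace=False))

    def cost(meds):
        # Sum of dissimilarities of every point to its nearest medoid.
        return D[:, meds].min(axis=1).sum()

    best = cost(medoids)
    for _ in range(max_iter):
        improved = False
        for i in range(k):            # slot of a selected object (medoid)
            for h in range(n):        # candidate non-selected object
                if h in medoids:
                    continue
                candidate = medoids.copy()
                candidate[i] = h
                tc = cost(candidate) - best    # total swapping cost TC_ih
                if tc < 0:                     # Step 4: swap if TC_ih < 0
                    medoids, best, improved = candidate, best + tc, True
        if not improved:              # Step 5: no medoid changed
            break
    labels = D[:, medoids].argmin(axis=1)
    return labels, medoids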
Advantages:
1. Simple to understand and implement.
2. Fast, and convergent in a finite number of steps.
3. Allows using general dissimilarities between objects.
4. It is usually less sensitive to outliers and more robust than k-means, because it minimises a sum of pairwise dissimilarities instead of a sum of squared Euclidean distances.

Disadvantages:
1. Different initial sets of medoids can lead to different final clusterings. It is thus advisable to run the procedure several times with different initial sets of medoids.
2. The resulting clustering depends on the units of measurement. If the variables are of different natures or differ greatly in magnitude, it is advisable to standardize them.

Hierarchical based clustering:

Hierarchical methods try to decompose the dataset of n objects into a hierarchy of groups. This hierarchical decomposition can be represented by a tree structure diagram called a dendrogram, whose root node represents the whole dataset and whose leaf nodes each represent a single object of the dataset. The clustering results can be obtained by cutting the dendrogram at different levels.
There are two general approaches for the hierarchical method:
a. Agglomerative ( Bottom-up )
b. Divisive ( Top-down)

a. Agglomerative clustering algorithm:


The bottom-up approach begins with each element as a separate cluster and merges them into successively larger clusters.
Algorithm:
Step 1: Start by assigning each item to its own cluster, so that if you have N items you now have N clusters, each containing just one item. Let the distances (similarities) between the clusters equal the distances (similarities) between the items they contain.
Step 2: Find the closest (most similar) pair of clusters and merge them into a single cluster, so that you now have one cluster fewer, with the help of _____________
Step 3: Compute the distances (similarities) between the new cluster and each of the old clusters.
Step 4: Repeat steps 2 and 3 until all N items are clustered into a single cluster.
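A naive sketch of these steps follows (O(n^3) per run, which is fine for illustration). The merge criterion left blank in step 2 is assumed here to be single linkage, i.e. the distance between two clusters is the smallest pairwise distance between their points.

import numpy as np

def agglomerative(X, num_clusters=1):
    """Naive single-linkage agglomerative clustering sketch."""
    # Step 1: every item starts in its own cluster.
    clusters = [[i] for i in range(len(X))]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    while len(clusters) > num_clusters:
        # Step 2: find the closest pair of clusters under single linkage.
        best = (np.inf, 0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = D[np.ix_(clusters[a], clusters[b])].min()
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Steps 3-4: merge the pair; inter-cluster distances are recomputed
        # from D on the next pass.
        clusters[a] += clusters.pop(b)
    return clusters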
Advantages:
1. Capable of identifying nested clusters.
2. Easy to implement, and gives the best results in some cases.
3. Suitable for automation.
4. Reduces the effect of initial cluster values on the clustering results.
5. The method can shorten computing time, reduce space complexity, and improve the clustering results.

Disadvantages:
1. It can never undo what was done previously.
2. Depending on the distance metric chosen for merging, different algorithms can suffer from one or more of the following:
a. Sensitivity to noise and outliers
b. Breaking large clusters
c. Difficulty handling clusters of different sizes and convex shapes
3. No objective function is directly minimized.
4. Sometimes it is difficult to identify the correct number of clusters from the dendrogram.
5. At least O(n² log n) time is required, where n is the number of data points.

b. Divisive Clustering algorithm:

This approach is also known as the top-down approach. We start with all of the objects in the same cluster. In each successive iteration, a cluster is split into smaller clusters. This continues until each object is in its own cluster or the termination condition holds. This method is rigid, i.e., once a split is done, it can never be undone.
Algorithm:
Step 1: Transform the original categorical data into an indicator matrix Z.
Step 2: Initialize a binary tree with a single root holding all the objects.
Step 3: Choose one leaf cluster C_p to split into two clusters C_pL and C_pR.
Step 4: Repeat step 3 until no leaf cluster can be split to improve the clustering quality.
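The algorithm above is specific to categorical data (via the indicator matrix Z). As a generic illustration of the top-down scheme only, and not the paper's method, the sketch below repeatedly splits the largest leaf cluster with 2-means, reusing the kmeans function from the k-means section.

import numpy as np

def divisive(X, num_clusters, seed=0):
    """Generic top-down sketch: bisect leaves until num_clusters is reached."""
    # Step 2: a binary-tree root holding all objects; we track only the leaves.
    leaves = [np.arange(len(X))]
    while len(leaves) < num_clusters:
        # Step 3: choose a leaf cluster to split (here simply the largest one).
        idx = max(range(len(leaves)), key=lambda i: len(leaves[i]))
        members = leaves.pop(idx)
        # Split with 2-means (the kmeans sketch from the k-means section).
        labels, _ = kmeans(X[members], k=2, seed=seed)
        left, right = members[labels == 0], members[labels == 1]
        if len(left) == 0 or len(right) == 0:   # degenerate split: stop early
            leaves.append(members)
            break
        leaves.extend([left, right])
    return leaves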

Advantages:

Disadvantages:
1. No provision can be made for relocating objects that may have been incorrectly grouped at an early stage; the result should be examined closely to ensure it makes sense.
2. The use of different distance metrics for measuring the distance between clusters may generate different results. Performing multiple experiments and comparing the results is recommended to support the veracity of the original results.
Density based clustering:

Density based clustering was developed to discover clusters with arbitrary shape. As the name suggests, this technique deals with the density of clusters: clusters are separated from one another on the basis of varying densities, so a cluster of a certain density is surrounded by points of low density. The basic idea is to check whether there are sufficient data points in the neighbourhood of a point to meet a minimum-number-of-points criterion. This minimum number of points is a user-defined threshold: if a point does not have at least that many data points in its neighbourhood, it is not considered part of a cluster. There are two types of points in any cluster, namely core points and border points. The neighbourhood is defined by a user-selected distance function, and the shape of the neighbourhood depends on that function. The shape of the cluster itself is arbitrary, since the cluster grows in any direction where the density is adequate. The threshold values for core points and border points may differ from each other. Outliers are discarded, since they do not have sufficient data points in their neighbourhood to form a cluster.
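These ideas are what the DBSCAN algorithm from the classification table implements. A compact sketch follows, where eps is the neighbourhood radius and min_pts is the least-number-of-points threshold (both parameter names are ours); points left with label -1 are the discarded outliers.

import numpy as np

def dbscan(X, eps, min_pts):
    """Compact DBSCAN sketch; returns a label per point, -1 meaning outlier."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Epsilon-neighbourhood of each point (includes the point itself).
    neighbours = [np.flatnonzero(D[i] <= eps) for i in range(n)]
    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        # Skip points already clustered, and non-core points.
        if labels[i] != -1 or len(neighbours[i]) < min_pts:
            continue
        # Grow a new cluster outward from core point i, in any direction
        # where the density is adequate.
        labels[i] = cluster
        frontier = list(neighbours[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster
                if len(neighbours[j]) >= min_pts:   # j is itself a core point
                    frontier.extend(neighbours[j])  # expand the cluster from j
        cluster += 1
    return labels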
