An Improved K-Means Clustering Algorithm Using MapReduce Techniques to Mine Inter- and Intra-Cluster Data in Big Data Analytics
Abstract
K-means is one of the simplest unsupervised learning algorithms for solving the well-known clustering problem. The procedure follows a simple and easy way to classify a given data set into a certain number of clusters fixed a priori. The main idea is to define k centers, one for each cluster. These centers should be placed carefully, because different locations lead to different results. In this research work, the proposed algorithm performs better when handling clusters of circularly distributed data points and slightly overlapping clusters.
Keywords: K-means algorithm, cluster, big data, Hadoop, MapReduce, web logs
Apache Hadoop is an open-source software framework that supports data-intensive distributed applications, licensed under the Apache v2 license. It supports running applications on large clusters of commodity hardware. Hadoop was derived from Google's MapReduce and Google File System (GFS) papers. The Hadoop framework transparently provides both reliability and data motion to applications. Hadoop implements a computational paradigm named MapReduce, in which the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both MapReduce and the distributed file system are designed so that node failures are handled automatically by the framework.

II. Related Work

To obtain more efficient and effective results from the K-means algorithm, a great deal of research has been carried out in the past. Researchers have approached the problem from different viewpoints and with different ideas. Krishna and Murty [4] proposed the genetic K-means algorithm (GKA), which integrates a genetic algorithm with K-means in order to achieve a global search and fast convergence. Likas et al. [5] proposed a global K-means algorithm consisting of a series of K-means clustering procedures with the number of clusters varying from 1 to K. One disadvantage of the algorithm lies in the requirement of executing K-means N times for each value of K, which causes a high computational burden for large data sets.

Bradley and Fayyad [3] presented a refined algorithm that applies K-means M times to M random subsets sampled from the original data. The most common initialization was proposed by Pena, Lozano et al. [6]. This method randomly selects K points from the data set as centroids. The main advantages of the method are its simplicity and the opportunity to cover the solution space rather well through multiple initializations of the algorithm. Ball and Hall proposed the ISODATA algorithm [7], which estimates K dynamically. For the selection of a proper K, a sequence of clustering structures can be obtained by running K-means several times from the possible minimum Kmin to the maximum Kmax [12]. These structures are then evaluated based on constructed indices, and the expected clustering solution is determined by choosing the one with the best index [8]. The popular approach for evaluating the number of clusters in K-means is the Cubic Clustering Criterion [9] used in SAS Enterprise Miner.
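As a concrete illustration of the selection strategy just described (running K-means for every candidate K between Kmin and Kmax and keeping the clustering with the best index), the sketch below uses scikit-learn's KMeans together with the silhouette score as one possible evaluation index. The data X, the K range and the choice of index are illustrative assumptions, not the setup used in the cited works.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def select_k(X, k_min=2, k_max=10, random_state=0):
    """Run K-means for each candidate K and keep the clustering
    with the best index value (here: silhouette score)."""
    best_k, best_score, best_model = None, -np.inf, None
    for k in range(k_min, k_max + 1):
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X)
        score = silhouette_score(X, model.labels_)  # evaluation index for this K
        if score > best_score:
            best_k, best_score, best_model = k, score, model
    return best_k, best_model

# Example usage on synthetic data (placeholder for real web-log features)
X = np.random.rand(500, 2)
k, model = select_k(X, k_min=2, k_max=8)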
… and AMR. Arbitrarily shaped clusters are formed by the grid cells.

IV. Methodology

Big Data Analytics
Big data analytics is the process of examining big data to discover hidden patterns, unknown correlations and other useful information that can be used to make better decisions. To perform any kind of analysis on such large and complicated data, scaling up the hardware platform becomes necessary, and choosing the right platform becomes a crucial decision for satisfying the user's requirements in less time. There are various big data platforms available with different characteristics. To choose the right platform for a specific application, one should know the advantages and limitations of all these platforms. The chosen platform must be able to cater to increased data processing demands if analytics-based solutions are to be built on it.

This data comes from many different sources: smart phones and the data they generate and consume; sensors embedded into everyday objects, which produce billions of new and constantly updated data feeds containing location, climate and other information; posts to social media sites; digital photos and videos; and purchase transaction records. This data is called big data. The first organizations to exploit it were online and startup firms; firms such as Facebook, Google and LinkedIn were built around big data from the beginning.

"Big Data" refers to data sets that are too large and complicated, containing structured, semi-structured and unstructured data, which are very difficult to handle with traditional software tools. In many organizations the volume of data is bigger, or it moves faster, or it exceeds current processing capacity. An example of big data might be petabytes (1,024 terabytes) or exabytes (1,024 petabytes) of data containing billions to trillions of records of millions of users, all from different sources such as social media, banking, web, mobile, employee and customer data. These types of data are typically loosely structured, often incomplete and inaccessible.

K-means algorithm
Lloyd's algorithm, mostly known as the k-means algorithm, is used to solve the k-means clustering problem and works as follows. First, decide the number of clusters k. Then repeatedly assign each data point to its nearest center and recompute each center as the mean of the points assigned to it, until the assignments no longer change.

Clustering is the process of partitioning a group of data points into a small number of clusters. For instance, the items in a supermarket are clustered into categories (butter, cheese and milk are grouped in dairy products). Of course this is a qualitative kind of partitioning. A quantitative approach would be to measure certain features of the products, say the percentage of milk and others, and products with a high percentage of milk would be grouped together. In general, we have n data points x_i, i = 1, ..., n that have to be partitioned into k clusters. The goal is to assign a cluster to each data point. K-means is a clustering method that aims to find the positions \mu_i, i = 1, ..., k of the cluster centers that minimize the distance from the data points to the clusters. K-means clustering solves

\arg\min_{c} \sum_{i=1}^{k} \sum_{x \in c_i} d(x, \mu_i) = \arg\min_{c} \sum_{i=1}^{k} \sum_{x \in c_i} \lVert x - \mu_i \rVert_2^2,

where c_i is the set of points that belong to cluster i. K-means clustering uses the square of the Euclidean distance, d(x, \mu_i) = \lVert x - \mu_i \rVert_2^2.
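The objective above is what Lloyd's iteration minimizes locally. A minimal NumPy sketch of the baseline procedure is given below, assuming a data matrix X and a chosen k; it alternates the assignment and update steps until the centers stop moving. It is only the standard baseline, not the improved algorithm proposed in this paper.

import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Baseline Lloyd's algorithm: minimizes the sum of squared
    distances between points and their assigned cluster centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random initial centers
    for _ in range(n_iter):
        # Assignment step: nearest center for every point
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center becomes the mean of its assigned points
        new_centers = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centers[i]
            for i in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Example: cluster 300 random 2-D points into k = 3 groups
X = np.random.rand(300, 2)
centers, labels = lloyd_kmeans(X, k=3)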
Outliers: Many clustering algorithms are capable of handling outliers; noise data should not be grouped together with the other data points.

Variety:
Variety refers to the ability of a clustering algorithm to handle different types of data sets, such as numerical, categorical, nominal and ordinal. The criteria for clustering algorithms are (a) the type of data set and (b) the cluster shape.
Type of data set: The data set may be small or large, but for big data mining a clustering algorithm must support large data sets.
Cluster shape: The shape of the clusters formed depends on the size and type of the data set.

Velocity:
Velocity refers to the computational behaviour of a clustering algorithm, judged by the criterion of (a) the running time complexity of the algorithm.
Time complexity: If an algorithm requires very few computations, its run time is short. The run time of the algorithms is analysed using Big O notation.

Value:
For a clustering algorithm to process the data accurately and to form clusters with less computation, the input parameters play a key role.

VIII. MapReduce Processing Model
Hadoop MapReduce processes big data in parallel and delivers output with efficient performance. MapReduce consists of a Map function and a Reduce function. The Map function performs filtering and sorting of large data sets. The Reduce function performs the summary operation, which combines the results and produces the final output. Hadoop HDFS and MapReduce are modelled on the Google file system: the Google File System (GFS), developed by Google, is a distributed file system that provides organized and adequate access to data using large clusters of commodity servers.

Map phase: The master node accepts the input and divides a large problem into smaller sub-problems. It then distributes these sub-problems among worker nodes in a multi-level tree structure. The worker nodes process the sub-problems and send the results back to the master node.

Reduce phase: The Reduce function combines the output of all sub-problems, collects it at the master node and produces the final output. Each map function is associated with a reduce function.

Figure 3: MapReduce Programming Model

The operation mechanism of MapReduce is as follows:
(1) Input: The MapReduce framework based on Hadoop requires a pair of Map and Reduce functions implementing the appropriate interface or abstract class; the input and output locations and other operating parameters should also be specified. In this stage, the large data in the input directory is divided into several independent data blocks for parallel processing by the Map function.
(2) The MapReduce framework treats the application input as a set of key-value pairs.
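To relate the Map and Reduce phases above to the clustering task, the following is a minimal sketch of one k-means iteration expressed as map and reduce functions, simulated in plain Python rather than on an actual Hadoop cluster. The map step emits (cluster id, point) pairs for the nearest current centroid; the reduce step averages the points grouped under each cluster id to obtain the new centroids. The function names and the in-memory grouping are illustrative assumptions, not the paper's implementation.

from collections import defaultdict
import numpy as np

def map_phase(points, centroids):
    """Map: emit (cluster_id, point) for the nearest centroid of each point."""
    for p in points:
        cluster_id = int(np.argmin([np.linalg.norm(p - c) for c in centroids]))
        yield cluster_id, p

def reduce_phase(pairs, old_centroids):
    """Reduce: average all points emitted under each cluster_id."""
    groups = defaultdict(list)
    for cluster_id, p in pairs:  # shuffle/group by key (done by Hadoop in practice)
        groups[cluster_id].append(p)
    return np.array([
        np.mean(groups[i], axis=0) if i in groups else old_centroids[i]
        for i in range(len(old_centroids))
    ])

# One iteration on synthetic data (placeholder for web-log features)
points = np.random.rand(1000, 2)
centroids = points[:3].copy()
centroids = reduce_phase(map_phase(points, centroids), centroids)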
[Figure: results plot, vertical axis: Density]

References

[1] Anil K. Jain and Richard C. Dubes, Michigan State University; Algorithms for Clustering Data.