illustrated in Section 3. A small selection of applications of soft clustering is discussed in Section 4. The paper is concluded in Section 5.

2. Overview of Soft Clustering

2.1 General Information of Soft Clustering

Soft clustering is one of the most fundamental tasks in exploratory data analysis: it groups similar data points in an unsupervised process. The main task of a clustering algorithm is to divide a set of unlabeled data objects into different groups, where cluster membership is based on a similarity measure. To obtain a high-quality partition, the similarity between data objects in the same group should be maximized, while the similarity between data objects from different groups should be minimized [7]. Most clustering tasks use an iterative process to find locally or globally optimal solutions in high-dimensional data sets. In addition, there is no unique clustering solution for real-life data, and it is also hard to interpret the 'cluster' representations [8]. The clustering task therefore requires considerable experimentation with different algorithms or with different features of the same data set, so reducing computational cost is a significant issue for clustering algorithms. Moreover, clustering very large data sets that contain large numbers of high-dimensional records is a particularly important problem nowadays. Most conventional clustering algorithms do not scale to larger data sets, and most of them are computationally expensive in terms of both memory space and time. For these reasons, parallelization is a way to overcome these problems, and the parallel implementation of clustering algorithms is inevitable.

More importantly, clustering analysis is unsupervised, 'nonpredictive' learning: it divides a data set into several clusters based on a subjective measure and, unlike supervised learning, is not based on a 'trained characterization'. In general, there is a set of desirable features for a clustering algorithm [9]: scalability, robustness, order insensitivity, minimal user-specified input, support for arbitrary-shaped clusters, and point proportion admissibility. Thus, a clustering algorithm should be chosen such that duplicating the data set and re-clustering it does not change the clustering results.

Two general methods of soft clustering are identified: discrete and continuous methods, specifically rough clustering and fuzzy clustering. Hard clustering can be considered a special case of soft clustering in which the membership values are discrete and restricted to either 0 or 1 (see Fig. 1), whereas fuzzy clustering provides continuous membership degrees ranging from 0 to 1. The objective of fuzzy clustering is to minimize a weighted sum of Euclidean distances between the objects and the cluster centers; it allows one piece of data to belong to two or more clusters (see Fig. 2). The Fuzzy C-Means (FCM) algorithm is an iterative partition clustering technique that was first introduced by Dunn [10] and later extended by Bezdek [11]. FCM uses a standard least-squared-error model that generalizes the earlier and very popular non-fuzzy c-means model, which produces hard clusters of the data.
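To make the discrete/continuous distinction concrete, here is a minimal Python sketch (NumPy assumed; the matrices are illustrative toy values, not from the paper) contrasting the two kinds of membership:

    import numpy as np

    # Hard clustering: discrete memberships, each object in exactly one cluster.
    hard_U = np.array([[1.0, 0.0, 0.0],    # object 1 belongs fully to cluster 1
                       [0.0, 0.0, 1.0]])   # object 2 belongs fully to cluster 3

    # Fuzzy clustering: continuous memberships in [0, 1] that sum to 1 per object.
    fuzzy_U = np.array([[0.7, 0.2, 0.1],
                        [0.1, 0.3, 0.6]])

    assert np.allclose(hard_U.sum(axis=1), 1.0)   # rows (objects) sum to 1 in both cases
    assert np.allclose(fuzzy_U.sum(axis=1), 1.0)

Hard clustering is recovered from the fuzzy case by restricting every entry to the set {0, 1}.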
Rough clustering extends the theory of rough, or approximation, sets. Rough k-means was first introduced by Lingras [12]. Each cluster has a lower and an upper approximation, and the lower approximation is a subset of the upper approximation (see Fig. 3); the difference between the two forms a boundary region. Members of the lower approximation certainly belong to the cluster and cannot belong to any other cluster. Data objects in the boundary region of an upper approximation may or may not belong to the cluster; since their membership is uncertain, they must be members of the upper approximation of at least one other cluster. Hence, an object relates to a cluster through two membership degrees: one for the lower approximation and one for the upper approximation.
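A minimal Python sketch of this assignment rule follows, in the spirit of rough k-means; the function name rough_assign and the distance-ratio threshold are illustrative assumptions rather than Lingras's exact formulation:

    import numpy as np

    def rough_assign(X, centers, threshold=1.3):
        """Assign objects to lower/upper approximations, rough-k-means style.

        If a second center is almost as close as the closest one (distance
        ratio below `threshold`), the object goes only into the upper
        approximations of both clusters; otherwise it goes into the lower
        (and hence also the upper) approximation of the closest cluster.
        """
        c = len(centers)
        lower = [set() for _ in range(c)]
        upper = [set() for _ in range(c)]
        for k, x in enumerate(X):
            d = np.linalg.norm(centers - x, axis=1)   # distances to all centers
            i = int(np.argmin(d))                     # closest cluster
            close = [j for j in range(c)
                     if j != i and d[j] / max(d[i], 1e-12) <= threshold]
            if close:                                 # uncertain (boundary) object
                upper[i].add(k)
                for j in close:
                    upper[j].add(k)
            else:                                     # certain member
                lower[i].add(k)
                upper[i].add(k)
        return lower, upper

Objects whose second-closest center is nearly as close as their closest one are placed only in upper approximations, which is exactly how the boundary region arises.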
2.2 Fuzzy Clustering

Fuzzy clustering is a method of clustering which allows one piece of data to belong to two or more clusters. The fuzzy c-means algorithm is a standard least-squared-error model that generalizes the earlier and very popular non-fuzzy c-means model, which produces hard clusters of the data. An optimal partition is produced iteratively by minimizing the weighted within-group sum of squared errors objective function [13]:

    J_m(U, V) = \sum_{k=1}^{n} \sum_{i=1}^{c} u_{ik}^{m} d^2(x_k, v_i)        (1)

where X = \{x_1, x_2, \ldots, x_n\} is the data set in a d-dimensional vector space, n is the number of data items, c is the number of clusters defined by the user with 2 \le c \le n, u_{ik} is the degree of membership of x_k in the i-th cluster, m is a weighting exponent on each fuzzy membership, v_i is the center of cluster i, and d^2(x_k, v_i) is a distance measure between object x_k and cluster center v_i.
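As a sketch of how Eq. (1) is evaluated, the following Python function (the name fcm_objective is an illustrative assumption) computes the objective for a given data set, set of centers, and membership matrix:

    import numpy as np

    def fcm_objective(X, V, U, m=2.0):
        """Weighted within-group sum of squared errors, Eq. (1).

        X: (n, d) data, V: (c, d) cluster centers,
        U: (c, n) memberships whose columns sum to 1, with m > 1.
        """
        d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)  # (c, n) squared distances
        return float(((U ** m) * d2).sum())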
The FCM algorithm proceeds as follows (a runnable sketch is given after the steps):

1. Input c, m, the stopping tolerance ε, and the data.
2. Initialize the fuzzy partition matrix U = [u_{ik}].
3. Start the iteration and set t = 1.
4. Calculate the cluster centers v_i^{(t)} with U^{(t)}:

       v_i^{(t)} = \frac{\sum_{k=1}^{n} (u_{ik}^{(t)})^m x_k}{\sum_{k=1}^{n} (u_{ik}^{(t)})^m}        (2)

5. Calculate the membership matrix U^{(t+1)} using:

       u_{ik}^{(t+1)} = \left[ \sum_{j=1}^{c} \left( \frac{d(x_k, v_i^{(t)})}{d(x_k, v_j^{(t)})} \right)^{2/(m-1)} \right]^{-1}        (3)

6. If the stopping criterion is not met, set t = t + 1 and go to Step 4.
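Putting Steps 1-6 together, a compact self-contained Python sketch (the function and parameter names fuzzy_c_means, eps, max_iter, and seed are illustrative assumptions) might look like this:

    import numpy as np

    def fuzzy_c_means(X, c, m=2.0, eps=1e-5, max_iter=100, seed=0):
        """Minimal FCM sketch following Steps 1-6 and Eqs. (2)-(3)."""
        rng = np.random.default_rng(seed)
        n = len(X)
        U = rng.random((c, n))
        U /= U.sum(axis=0)                                   # Step 2: random fuzzy partition
        for _ in range(max_iter):                            # Step 3: iterate over t
            um = U ** m
            V = (um @ X) / um.sum(axis=1, keepdims=True)     # Step 4: centers, Eq. (2)
            d = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2)  # (c, n) distances
            d = np.fmax(d, 1e-12)                            # guard against division by zero
            # Step 5: memberships, Eq. (3); sum over j of (d_ik / d_jk)^(2/(m-1))
            U_new = 1.0 / ((d[:, None, :] / d[None, :, :]) ** (2.0 / (m - 1))).sum(axis=1)
            if np.abs(U_new - U).max() < eps:                # Step 6: stopping criterion
                return V, U_new
            U = U_new                                        # t = t + 1
        return V, U

For example, V, U = fuzzy_c_means(np.random.rand(200, 2), c=3) partitions 200 random two-dimensional points into three fuzzy clusters; column k of U holds the memberships of point k across the three clusters.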
4. Applications