
IJCSNS International Journal of Computer Science and Network Security, VOL.17 No.1, January 2017

Soft Clustering for Very Large Data Sets

Min Chen

State University of New York, New Paltz, NY, USA


Summary
Clustering is regarded as one of the most significant tasks in data mining and has been widely used on very large data sets. Unlike traditional hard clustering, soft clustering allows one data object to belong to two or more clusters. Soft clustering algorithms such as fuzzy c-means and rough k-means have been proposed and successfully applied to deal with uncertainty and vagueness. However, the influx of very large amounts of noisy and blurred data increases the difficulty of parallelizing soft clustering techniques. The question is how to deploy clustering algorithms on this tremendous amount of data so as to obtain the clustering result within a reasonable time. This paper provides an overview of the mainstream clustering techniques proposed over the past decade and the trend and progress of clustering algorithms applied to big data. Moreover, improvements of clustering algorithms for big data are introduced and analyzed. The possible future of more advanced clustering techniques is illuminated based on today's information era.
Key words:
Soft clustering, big data, parallel computing

1. Introduction

Massive volumes of structured, unstructured or heterogeneous data have been agglomerated because of the growth of the web, the rise of social media, the use of mobile devices, and the information of the Internet of Things (IoT) by and about people, things, and their interactions [1]. Due to the maturity of database technologies, how to store these massive amounts of data is no longer a problem. The major challenge is how to handle these very large data sets and, further, how to find solutions to understand them and dig out useful information that can be turned into data products.

Clustering [2] is one of the most fundamental tasks in exploratory data analysis that groups similar data points in an unsupervised process. Clustering techniques have been exploited in many areas, such as data mining, pattern recognition, machine learning, biochemistry and bioinformatics [3]. The main process of clustering algorithms is to divide a set of unlabeled data objects into different groups. The cluster membership is based on a similarity measure. In order to obtain a high-quality partition, the similarity between the data objects in the same group is to be maximized, and the similarity between the data objects from different groups is to be minimized. Most clustering tasks use an iterative process to find locally or globally optimal solutions from a high-dimensional data set. Partitioning algorithms include two main clustering strategies [4]: hard clustering and soft clustering. Conventional hard clustering methods classify each object to only one cluster; as a consequence, the results are crisp. On the other hand, soft clustering allows objects to belong to two or more clusters with varying degrees of membership. Soft clustering plays a significant role in various problems such as feature analysis, system identification, and classifier design [5]. Soft clustering is more realistic than hard clustering due to its ability to handle impreciseness, uncertainty, and vagueness in real-world problems.

In addition, tremendous amounts of data are being accumulated at high speed at the beginning of this new century. This data is potentially contaminated with fuzziness due to imprecision, uncertainty and vagueness. The problem becomes how we can analyze and reveal the valuable knowledge hidden within the data in an efficient and effective way. Because of their high complexity and computational cost, traditional soft clustering techniques are limited in handling very large volumes of data with fuzziness.

Moreover, conventional clustering techniques cannot cope with this huge amount of data because of their high complexity and computational cost [6]. The question for big data clustering is how to scale up and speed up clustering algorithms with minimum sacrifice of clustering quality. Therefore, an efficient processing model with a reasonable computational cost for this huge, complex, dynamic and heterogeneous data is needed in order to exploit it. There have already been some comparative studies on conventional soft clustering algorithms. However, a current survey of soft clustering algorithms for very large data sets is most desirable in the big data era.

The rest of the paper is organized as follows: an overview of soft clustering, including general information on soft clustering, mainstream soft clustering algorithms and soft clustering validation indices, is given in Section 2. The key technologies using soft clustering in big data are

illustrated in Section 3. A small selection of applications of soft clustering is discussed in Section 4. The paper is concluded in Section 5.

2. Overview of Soft Clustering

2.1 General information of soft clustering

Soft clustering is one of the most fundamental tasks in exploratory data analysis that groups similar data points in an unsupervised process. The main process of clustering algorithms is to divide a set of unlabeled data objects into different groups. The cluster membership is based on a similarity measure. In order to obtain a high-quality partition, the similarity between the data objects in the same group is to be maximized, and the similarity between the data objects from different groups is to be minimized [7]. Most clustering tasks use an iterative process to find locally or globally optimal solutions from a high-dimensional data set. In addition, there is no unique clustering solution for real-life data, and it is also hard to interpret the "cluster" representations [8]. Therefore, the clustering task requires much experimentation with different algorithms or with different features of the same data set. Hence, how to reduce computational complexity is a significant issue for clustering algorithms. Moreover, clustering very large data sets that contain large numbers of records with high dimensions is considered a very important issue nowadays.

Most conventional clustering algorithms suffer from the problem that they do not scale with larger data sets, and most of them are computationally expensive with regard to memory space and time complexity. For these reasons, the parallelization of clustering algorithms is a solution to overcome these problems, and the parallel implementation of clustering algorithms is inevitable.

More importantly, clustering analysis is unsupervised "nonpredictive" learning: it divides the data sets into several clusters based on a subjective measurement and, unlike supervised learning, it is not based on a "trained characterization". In general, there is a set of desirable features for a clustering algorithm [9]: scalability, robustness, order insensitivity, minimum user-specified input, arbitrary-shaped clusters, and point proportion admissibility. Thus, a clustering algorithm should be chosen such that duplicating the data set and re-clustering does not change the clustering results.

Depending on how the membership of an instance to a cluster is defined, two groups of soft clustering algorithms are identified: discrete and continuous methods, specifically rough clustering and fuzzy clustering. Hard clustering can be considered a special case of soft clustering in which the membership values are discrete and restricted to either 0 or 1 (see Fig. 1). Fuzzy clustering provides continuous membership degrees which range from 0 to 1. The objective of fuzzy clustering is to minimize the weighted sum of Euclidean distances between the objects and the cluster centers. Fuzzy clustering is a method of clustering that allows one piece of data to belong to two or more clusters (see Fig. 2). The Fuzzy C-Means (FCM) algorithm is an iterative partition clustering technique that was first introduced by Dunn [10], and was then extended by Bezdek [11]. FCM uses a standard least squared error model that generalizes an earlier and very popular non-fuzzy c-means model that produces hard clusters of the data.

Rough clustering extends the theory of rough or approximation sets. Rough k-means was first introduced by Lingras [12]. Each cluster has a lower and an upper approximation, and the lower approximation is a subset of the upper approximation (see Fig. 3); the difference between them forms a boundary region. The members of the lower approximation belong to the cluster certainly and do not belong to any other cluster. The data objects in the boundary region may or may not belong to the cluster; since their membership is uncertain, they must be members of the upper approximation of at least one other cluster. Hence, an object has two membership degrees to a cluster, one for its lower approximation and one for its upper approximation.

2.2 Fuzzy Clustering

Fuzzy clustering is a method of clustering which allows one piece of data to belong to two or more clusters. The fuzzy c-means algorithm is a standard least squared error model that generalizes an earlier and very popular non-fuzzy c-means model that produces hard clusters of the data. An optimal partition is produced iteratively by minimizing the weighted within-group sum of squared error objective function [13]:

J_m(U, V) = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{m} \, d^2(x_j, v_i)    (1)

where X = {x_1, x_2, ..., x_n} is the data set in a d-dimensional vector space, n is the number of data items, c is the number of clusters defined by the user where 2 ≤ c ≤ n, u_{ij} is the degree of membership of x_j in the i-th cluster, m is a weighted exponent on each fuzzy membership, v_i is the center of cluster i, and d^2(x_j, v_i) is a squared distance measure between object x_j and cluster center v_i.

An optimal solution with c partitions can be obtained via an iterative process which is as follows:
1. Input c, m, the tolerance ε and the data set.
2. Initialize the fuzzy partition matrix U = [u_{ij}].
3. Start the iteration and set t = 1.
4. Calculate the cluster centers V^{(t)} with U^{(t-1)}:

   v_i = \frac{\sum_{j=1}^{n} u_{ij}^{m} x_j}{\sum_{j=1}^{n} u_{ij}^{m}}    (2)

5. Calculate the membership matrix U^{(t+1)} using:

   u_{ij} = \left[ \sum_{k=1}^{c} \left( \frac{d(x_j, v_i)}{d(x_j, v_k)} \right)^{2/(m-1)} \right]^{-1}    (3)

6. If the stopping criterion is not met, set t = t + 1 and go to Step 4.
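For concreteness, the following NumPy sketch implements the iteration of Eqs. (1)-(3). It is an illustrative implementation rather than the paper's own code; the random initialization, the tolerance tol and the cap max_iter are assumptions.

```python
import numpy as np

def fcm(data, n_clusters, m=2.0, tol=1e-5, max_iter=100, seed=0):
    """Minimal fuzzy c-means following Eqs. (1)-(3)."""
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    # Step 2: initialize the fuzzy partition matrix U; columns sum to 1
    u = rng.random((n_clusters, n))
    u /= u.sum(axis=0)
    for _ in range(max_iter):                    # Steps 3-6
        um = u ** m
        # Eq. (2): centers are membership-weighted means of the data
        centers = (um @ data) / um.sum(axis=1, keepdims=True)
        # Squared Euclidean distances d^2(x_j, v_i), shape (c, n)
        d2 = ((data[None, :, :] - centers[:, None, :]) ** 2).sum(axis=2)
        d2 = np.fmax(d2, 1e-12)                  # guard against division by zero
        # Eq. (3): new memberships from inverse relative distances
        inv = d2 ** (-1.0 / (m - 1.0))
        u_new = inv / inv.sum(axis=0)
        if np.abs(u_new - u).max() < tol:        # stopping criterion
            u = u_new
            break
        u = u_new
    return centers, u
```

For example, fcm(np.random.rand(500, 4), n_clusters=3) returns three centers and a 3 x 500 membership matrix whose columns sum to 1.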

Fig. 2 Fuzzy clustering with overlapping region.

2.3 Rough Clustering

Rough clustering is a partitioning algorithm which divides a set of objects into several rough clusters. A rough cluster is described by a lower approximation and an upper approximation. Data points in the lower approximation belong to the corresponding cluster only. Data points in the upper approximation can be members of the upper approximations of other clusters. Hence, a data point has two membership degrees to a cluster, one for its lower approximation and one for its upper approximation (Eqs. (4) and (5)) [14].

Rough k-means clustering uses the squared Euclidean distance to measure the dissimilarity between a vector and the cluster centroids:

d(x_j, v_k) = \| x_j - v_k \|^2    (6)

Fig. 3 Rough clustering with a lower and an upper approximation.

Let v_k be the centroid of cluster C_k and d(x_j, v_k) be the squared Euclidean distance between the data point x_j and v_k. The new centroids are updated as follows:

v_k = w_l \frac{\sum_{x_j \in \underline{C}_k} x_j}{|\underline{C}_k|} + w_u \frac{\sum_{x_j \in \overline{C}_k} x_j}{|\overline{C}_k|}    (7)

where w_l is the weight for the lower approximation, w_u is the weight for the upper approximation, and |\underline{C}_k| and |\overline{C}_k| are the numbers of objects in the lower approximation and the upper approximation, respectively.

The procedure of the rough k-means algorithm is as follows (a sketch of one iteration is given after Fig. 1):

1. Randomly assign each data point of the data set S to the lower and upper approximations of one of the k predefined clusters.
2. Update the cluster centroids using Eq. (7).
3. For each data object:
   a. Find its nearest cluster centroid and update that cluster's upper approximation.
   b. Check whether any other centroids are not significantly farther away than the closest one; if so, add the object to their upper approximations as well, otherwise update the lower approximation of the nearest cluster.
4. Recalculate the cluster centroids using Eq. (7).
5. Repeat steps 3 to 5 until the stopping criteria have been met.

Fig. 1 Hard clustering with crisp membership.
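The following NumPy sketch illustrates one iteration of this procedure under stated assumptions: the relative threshold epsilon that decides when a second centroid is "not significantly farther away", the weights w_l and w_u, and the fallbacks for empty approximations are illustrative defaults, not values prescribed by the paper.

```python
import numpy as np

def rough_kmeans_step(data, centers, w_l=0.7, w_u=0.3, epsilon=1.2):
    """One rough k-means iteration: assign approximations, update centroids."""
    k = centers.shape[0]
    lower = [[] for _ in range(k)]
    upper = [[] for _ in range(k)]
    for x in data:
        d = ((centers - x) ** 2).sum(axis=1)        # squared Euclidean, Eq. (6)
        nearest = int(d.argmin())
        # clusters whose centroid is not significantly farther than the nearest
        close = np.flatnonzero(d <= epsilon * d[nearest])
        if len(close) > 1:                           # ambiguous: boundary object
            for i in close:
                upper[i].append(x)
        else:                                        # certain: lower approximation
            lower[nearest].append(x)
            upper[nearest].append(x)                 # lower is a subset of upper
    new_centers = centers.copy()
    for i in range(k):
        lo = np.array(lower[i]) if lower[i] else None
        up = np.array(upper[i]) if upper[i] else None
        if lo is not None and up is not None:        # Eq. (7)
            new_centers[i] = w_l * lo.mean(axis=0) + w_u * up.mean(axis=0)
        elif up is not None:                         # empty lower approximation
            new_centers[i] = up.mean(axis=0)
    return new_centers, lower, upper
```

Iterating rough_kmeans_step until the centroids stop moving implements steps 2-5; with epsilon = 1 and the default weights, every object falls into a lower approximation and the procedure degenerates to ordinary k-means.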
2.4 Mainstream Soft Clustering Algorithms

Table 1 lists some extensions and derivatives of fuzzy clustering and rough clustering algorithms. Rough clustering is relatively new; the number of its extensions and derivatives is smaller than that of fuzzy clustering.

Table 1: Extensions and derivatives of fuzzy clustering and rough clustering.

List of soft clustering algorithms:
Fuzzy C-Means [10, 11]
Possibilistic c-means [15]
Possibilistic fuzzy c-means [16]
Gustafson-Kessel (GK) algorithm [17]
Fuzzy C-Varieties [18]
Fuzzy C-Regression [19]
Evidential c-means [20]
Rough k-means [12]
Rough-fuzzy clustering [21]
Rough-fuzzy possibilistic clustering [22]
Shadowed set clustering [23]

2.5 Soft Clustering Validation Index

Clustering validation aims to evaluate the clustering results in order to find the partition that best fits the underlying data. Thus, cluster validity is used to quantitatively evaluate the result of clustering algorithms. Compactness and separation are the two most widely used criteria in measuring the quality of partitions. Traditional approaches run the algorithm iteratively with different input values and select the best validity measure to determine the "optimum" number of clusters. A collection of validity indices for fuzzy clustering is listed below [24].

1) Least Squared Error (SE) Index: the weighted within-cluster sum of squared error function is used:

SE = J_m = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{m} \| x_j - v_i \|^2    (8)

where x_j is the j-th data point, v_i is the center of the i-th cluster, and \| x_j - v_i \| is the Euclidean distance between x_j and v_i. SE takes its minimum value when the cluster structure is best.

2) Partition Coefficient (PC) Index: the partition coefficient (PC) is defined as:

PC = \frac{1}{n} \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{2}    (9)

PC obtains its maximum value when the cluster structure is optimal.

3) Partition Entropy (PE) Index: the partition entropy was defined as:

PE = -\frac{1}{n} \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij} \log_a u_{ij}    (10)

where a is the logarithmic base. PE gets its minimum value when the cluster structure is optimal.

4) Modified Partition Coefficient (MPC) Index: a modification of the PC index, which can reduce its monotonic tendency, was proposed by Dave in 1996:

MPC = 1 - \frac{c}{c-1} (1 - PC)    (11)

where c is the number of clusters. An optimal cluster number is found by maximizing MPC, producing the best clustering performance for a data set.

5) Fukuyama and Sugeno (FS) Index: Fukuyama and Sugeno proposed a validity function in 1989, defined as:

FS = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{m} \| x_j - v_i \|^2 - \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{m} \| v_i - \bar{v} \|^2    (12)

where \bar{v} is the mean of the cluster centers. The first term equals J_m, the least squared error, and measures the compactness; the second term measures the separation. The best clustering performance for a data set is found by minimizing the value of FS.

6) Xie-Beni (XB) Index: Xie and Beni proposed a validity function in 1991, and it was later modified by Bezdek in 1995:

XB = \frac{\sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{m} \| x_j - v_i \|^2}{n \cdot \min_{i \neq k} \| v_i - v_k \|^2}    (13)

XB reaches its minimum value when the cluster structure is optimal.

7) Partition Coefficient and Exponential Separation (PCAES) Index: the partition coefficient and exponential separation (PCAES) index is defined as:

PCAES = \sum_{i=1}^{c} \sum_{j=1}^{n} \frac{u_{ij}^{2}}{u_M} - \sum_{i=1}^{c} \exp\left( -\frac{\min_{k \neq i} \| v_i - v_k \|^2}{\beta_T} \right)    (14)

where u_M = \min_{1 \le i \le c} \sum_{j=1}^{n} u_{ij}^{2} and \beta_T = \frac{1}{c} \sum_{i=1}^{c} \| v_i - \bar{v} \|^2. PCAES takes its maximum value when the cluster structure is optimal.
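As a worked illustration, the PC, PE, MPC and XB indices of Eqs. (9)-(11) and (13) can be computed directly from a fuzzy partition. The sketch below assumes a membership matrix u of shape c x n, as returned by the fcm sketch in Section 2.2, and uses natural logarithms for PE.

```python
import numpy as np

def validity_indices(data, centers, u, m=2.0):
    """PC, PE, MPC and XB for a fuzzy partition (Eqs. (9)-(11), (13))."""
    c, n = u.shape
    pc = (u ** 2).sum() / n                           # Eq. (9): maximize
    pe = -(u * np.log(np.fmax(u, 1e-12))).sum() / n   # Eq. (10): minimize
    mpc = 1.0 - c / (c - 1.0) * (1.0 - pc)            # Eq. (11): maximize
    # Compactness: weighted within-cluster squared distances
    d2 = ((data[None, :, :] - centers[:, None, :]) ** 2).sum(axis=2)
    compactness = ((u ** m) * d2).sum()
    # Separation: minimum squared distance between two distinct centers
    sep = ((centers[None, :, :] - centers[:, None, :]) ** 2).sum(axis=2)
    min_sep = sep[~np.eye(c, dtype=bool)].min()
    xb = compactness / (n * min_sep)                  # Eq. (13): minimize
    return {"PC": pc, "PE": pe, "MPC": mpc, "XB": xb}
```

Running FCM for a range of cluster numbers c and selecting the c that maximizes PC and MPC or minimizes PE and XB is the traditional model-selection loop described above.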


3. Key Technologies Using Soft Clustering in Big Data

3.1 Big data problem for soft clustering

Conventional soft clustering algorithms only deal with structured data of limited size. However, due to the growth of the web, the rise of social media, the use of mobile devices, and the information of the Internet of Things (IoT) by and about people, things, and their interactions, huge volumes of structured, unstructured or heterogeneous data have been agglomerated. Because of this large volume of data, substantial changes in the architecture of the storage system are necessary for soft clustering.

Thus, the issue of soft clustering for very large data sets is how to speed up and scale up the clustering algorithms with the minimum sacrifice of clustering quality. Generally, there are three approaches to speeding up and scaling up soft clustering techniques [25]. The most basic way to address big data is to use sampling-based techniques to reduce the iterative process: a sample of the data set, instead of the whole data set, is used to perform the clustering. CLARA, CURE and the core set algorithms are popular sampling-based techniques [26]. Another approach uses randomized algorithms to reduce the data dimension: the data sets are projected from a high-dimensional space to a lower-dimensional space. BIRCH and CLARANS are two well-known algorithms of this type. The most common approach is to apply parallel and distributed algorithms, which use multiple machines to speed up the computation and increase scalability.

Parallel processing is essential for processing a huge volume of data in a timely manner in the big data era. It uses a divide-and-conquer approach to split the huge amount of data into small chunks. These small data chunks can be loaded and handled on different machines, and the partial solutions are combined to solve the overall problem. A general framework for both parallel and MapReduce clustering algorithms is illustrated in Fig. 4. The most common parallel processing models for computing data-intensive applications are OpenMP, MPI [27], and MapReduce. Here, we only discuss the conventional parallel and MapReduce clustering algorithms.

Fig. 4 A general framework of soft clustering for very large data sets.

By virtue of its simplicity, scalability and fault-tolerance, MapReduce [30] is one of the most efficient big data solutions; it enables the processing of a massive volume of data in parallel with many low-end computing nodes. This programming paradigm is a scalable and fault-tolerant data processing technique that was developed to provide significant improvements in large-scale data-intensive applications in clusters.

3.2 Parallel Soft Clustering Algorithms

Fuzzy c-means is the most popular algorithm for fuzzy clustering, and a parallel fuzzy c-means is proposed in [28]. The data is partitioned equally among the available processors. The initial centers are set and broadcast to all the processors. Each processor computes the geometrical center of its local data and communicates its centers to the other processors in order to compute the global centers. The procedure is repeated until convergence is achieved. Bit-reduced fuzzy c-means is designed to handle the clustering of large images [29]. Moreover, the kernel fuzzy c-means algorithm is another approach to addressing very large data. A high-speed rough clustering has also been proposed to deal with very large document collections [34]. A sketch of the parallel fuzzy c-means scheme is given below.
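A minimal sketch of this message-passing scheme using mpi4py follows. It illustrates the idea of [28] rather than reproducing its code: each rank holds a chunk local_data, rank 0 broadcasts the initial centers, and the numerator and denominator of Eq. (2) are summed across ranks so that every processor obtains the same global centers. All names and defaults are assumptions.

```python
import numpy as np
from mpi4py import MPI

def parallel_fcm(local_data, n_clusters, m=2.0, n_iter=50):
    """Parallel fuzzy c-means: each MPI rank owns one slice of the data."""
    comm = MPI.COMM_WORLD
    d = local_data.shape[1]
    # Rank 0 sets the initial centers and broadcasts them to all processors
    centers = np.random.rand(n_clusters, d) if comm.rank == 0 else None
    centers = comm.bcast(centers, root=0)
    for _ in range(n_iter):  # fixed count here; [28] iterates to convergence
        # Local memberships against the current global centers (Eq. (3))
        d2 = ((local_data[None, :, :] - centers[:, None, :]) ** 2).sum(axis=2)
        inv = np.fmax(d2, 1e-12) ** (-1.0 / (m - 1.0))
        um = (inv / inv.sum(axis=0)) ** m
        # Local partial sums for Eq. (2), then a global reduction across ranks
        local_num = um @ local_data                  # shape (c, d)
        local_den = um.sum(axis=1)                   # shape (c,)
        num = np.empty_like(local_num)
        den = np.empty_like(local_den)
        comm.Allreduce(local_num, num, op=MPI.SUM)
        comm.Allreduce(local_den, den, op=MPI.SUM)
        centers = num / den[:, None]
    return centers
```

Launched with, for example, mpiexec -n 8 python script.py, each process clusters only its own slice of the data while the reductions keep the centers globally consistent.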
3.3 MapReduce based Soft Clustering

Fig. 5 The procedure of a MapReduce system.

The MapReduce model hides the details of the parallel execution, which allows users to focus only on data processing strategies. The MapReduce model consists of mappers and reducers. The main aspect of the MapReduce algorithm is that if every map and reduce operation is independent of all other ongoing maps and reduces, then the operations can be run in parallel on different keys and lists of data.

The process of the MapReduce approach can be decomposed as follows (see Fig. 5): (1) Data preparation: the underlying storage layer for reading input and storing output is a Distributed File System (DFS) [31]. GFS and HDFS are the most common such systems; they are chunk-based distributed file systems that support fault-tolerance by data partitioning and replication. The input data is divided into small chunks on different slave machines. (2) Map step: the map function of each node is applied to its local data, and the output is written to a temporary storage space. (3) Sort and combine step: the output from step (2) is sorted and shuffled by key such that all data belonging to one key are located on the same node. The sorted results are emitted to the reducers. (4) Reduce step: each group of output data (per key) is processed in parallel on a reduce node. The user-provided reduce function is executed once for each key value produced by the map step. (5) Final output: the final output is produced by the reducers of the MapReduce system and is stored in the DFS.
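To make these five steps concrete, the toy sketch below simulates one FCM center-update iteration in the MapReduce style; the in-memory dictionary stands in for the DFS and the sort-and-combine machinery, and all names are illustrative rather than taken from [32].

```python
import numpy as np
from collections import defaultdict

M = 2.0  # fuzzifier

def fcm_map(chunk, centers):
    """Map step: emit per-cluster partial sums for one local data chunk."""
    d2 = ((chunk[None, :, :] - centers[:, None, :]) ** 2).sum(axis=2)
    inv = np.fmax(d2, 1e-12) ** (-1.0 / (M - 1.0))
    um = (inv / inv.sum(axis=0)) ** M
    for i in range(centers.shape[0]):              # key = cluster index
        yield i, (um[i] @ chunk, um[i].sum())      # value = (numerator, denominator)

def fcm_reduce(key, values):
    """Reduce step: combine the partial sums into the new center for one key."""
    num = sum(v[0] for v in values)
    den = sum(v[1] for v in values)
    return key, num / den                          # Eq. (2)

def mapreduce_iteration(chunks, centers):
    shuffle = defaultdict(list)                    # sort-and-combine step
    for chunk in chunks:                           # map calls could run in parallel
        for key, value in fcm_map(chunk, centers):
            shuffle[key].append(value)
    new = dict(fcm_reduce(k, v) for k, v in shuffle.items())
    return np.array([new[i] for i in sorted(new)])
```

Each mapper emits the cluster index as the key and partial sums as the value, so the reducers can rebuild Eq. (2) exactly; iterating such rounds until the centers stabilize yields the full clustering.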

A MapReduce-based fuzzy c-means clustering algorithm has been implemented to explore the parallelization and scalability [32]. A parallel method for computing rough set approximations, also based on the MapReduce technique, has been put forward to deal with massive data [33].

4. Applications

Soft clustering has been proved to perform better for noisy data and has been used in numerous real-life applications. The sources of big data are mainly social media, mobile devices and the Internet of Things. Soft clustering is well suited to this huge volume of structured, unstructured or heterogeneous data due to its capability of handling uncertainty and vagueness. A small selection of applications of soft clustering for very large data sets is provided in Table 2 [14, 26].

Table 2: Selection of applications of soft clustering.

Applications of Fuzzy Clustering    Applications of Rough Clustering
Community detection                 Patterns of gene expression
Image segmentation                  Speech recognition
Pattern recognition                 Retail data clustering
Metabolomics in bioinformatics      Path profiles on a website
Market segmentation                 Traffic monitoring

5. Concluding Remarks

Soft clustering for very large data and the corresponding soft clustering algorithms have been reviewed. Due to its ability to handle impreciseness, uncertainty, and vagueness in real-world problems, soft clustering is more realistic than hard clustering. Soft clustering as a partitioning approach suits big data well due to the heterogeneous structure of very large data. Parallelism in soft clustering is potentially useful for big data, and MapReduce in particular has gained significant momentum from industry and academia in recent years.

References
[1] B. Mirkin, "Clustering: A Data Recovery Approach," Second Edition (Chapman & Hall/CRC Computer Science & Data Analysis).
[2] H. A. Edelstein, Introduction to Data Mining and Knowledge Discovery (3rd ed.), Potomac, MD: Two Crows Corp., 1999.
[3] Chen, Min, and Simone A. Ludwig, "Fuzzy clustering using automatic particle swarm optimization," 2014 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), IEEE, 2014.
[4] V. P. Guerrero-Bote, et al., "Comparison of neural models for document clustering," Int. Journal of Approximate Reasoning, vol. 34, pp. 287-305, 2003.
[5] B. Mirkin, "Clustering: A Data Recovery Approach," Second Edition (Chapman & Hall/CRC Computer Science & Data Analysis).
[6] Z. Xu and Y. Shi, "Exploring Big Data Analysis: Fundamental Scientific Problems," Annals of Data Science, 2(4), 363-372, 2015.
[7] A. S. Shirkhorshidi, S. Aghabozorgi, T. Y. Wah, and T. Herawan, "Big data clustering: a review," in International Conference on Computational Science and Its Applications, pp. 707-720, Springer International Publishing, 2014.
[8] B. L. Kaufman, P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis (Vol. 344), John Wiley & Sons, 2009.
[9] G. L. Carl, "A fuzzy clustering and fuzzy merging algorithm," Technical Report CS-UNR-101, 1999.
[10] J. C. Dunn, "A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters," Journal of Cybernetics 3: 32-57, 1973.
[11] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, ISBN 0-306-40671-3, 1981.
[12] Lingras, Pawan, and Georg Peters, "Applying rough set concepts to clustering," Rough Sets: Selected Methods and Applications in Management and Engineering, Springer London, 2012, 23-37.
[13] J. C. Bezdek, "Cluster validity with fuzzy sets," (1973): 58-73.
[14] Peters, Georg, et al., "Soft clustering - fuzzy and rough approaches and their extensions and derivatives," International Journal of Approximate Reasoning 54.2 (2013): 307-322.
[15] Krishnapuram, Raghuram, and James M. Keller, "The possibilistic c-means algorithm: insights and recommendations," IEEE Transactions on Fuzzy Systems 4.3 (1996): 385-393.
[16] Pal, Nikhil R., et al., "A possibilistic fuzzy c-means clustering algorithm," IEEE Transactions on Fuzzy Systems 13.4 (2005): 517-530.
[17] V. P. Guerrero-Bote, et al., "Comparison of neural models for document clustering," Int. Journal of Approximate Reasoning, vol. 34, pp. 287-305, 2003.
[18] Łęski, Jacek M., "Fuzzy c-varieties/elliptotypes clustering in reproducing kernel Hilbert space," Fuzzy Sets and Systems 141.2 (2004): 259-280.
[19] Hathaway, Richard J., and James C. Bezdek, "Switching regression models and fuzzy clustering," IEEE Transactions on Fuzzy Systems 1.3 (1993): 195-204.
[20] Masson, Marie-Hélène, and Thierry Denoeux, "ECM: An evidential version of the fuzzy c-means algorithm," Pattern Recognition 41.4 (2008): 1384-1397.
[21] Dubois, Didier, and Henri Prade, "Rough fuzzy sets and fuzzy rough sets," International Journal of General Systems 17.2-3 (1990): 191-209.
[22] P. Maji, S. K. Pal, "Rough set based generalized fuzzy c-means algorithm and quantitative indices," IEEE Transactions on Systems, Man, and Cybernetics - Part B: Cybernetics 37(6) (2007): 1529-1540.
[23] S. Mitra, W. Pedrycz, B. Barman, "Shadowed c-means: Integrating fuzzy and rough clustering," Pattern Recognition 43 (2010): 1282-1291.
[24] Chen, Min, and Simone A. Ludwig, "Particle swarm optimization based fuzzy clustering approach to identify optimal number of clusters," Journal of Artificial Intelligence and Soft Computing Research 4.1 (2014): 43-56.
[25] C. C. Aggarwal, C. K. Reddy, Data Clustering: Algorithms and Applications, CRC Press, 2013.
[26] Min Chen, Simone A. Ludwig and Keqin Li, "Clustering in Big Data," in Big Data: Management, Architecture, and Processing, Ch. 16, pp. 331-346, CRC Press, Taylor & Francis Group, 2017.
[27] J. A. Zhang, "Parallel Clustering Algorithm with MPI-Kmeans," Journal of Computers 8.1 (2013): 10-17.
[28] Kwok, Terence, et al., "Parallel fuzzy c-means clustering for large data sets," European Conference on Parallel Processing, Springer Berlin Heidelberg, 2002.
[29] Havens, Timothy C., et al., "Fuzzy c-means algorithms for very large data," IEEE Transactions on Fuzzy Systems 20.6 (2012): 1130-1146.
[30] X. Wu, X. Zhu, G. Q. Wu, and W. Ding, "Data mining with big data," IEEE Transactions on Knowledge and Data Engineering, 26(1), 97-107, 2014.
[31] K. Shim, "MapReduce algorithms for big data analysis," Proceedings of the VLDB Endowment, 5(12), 2016-2017, 2012.
[32] Ludwig, Simone A., "MapReduce-based fuzzy c-means clustering algorithm: implementation and scalability," International Journal of Machine Learning and Cybernetics 6.6 (2015): 923-934.
[33] Zhang, Junbo, Tianrui Li, and Yi Pan, "Parallel rough set based knowledge acquisition using MapReduce from big data," Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, ACM, 2012.
[34] Kishida, Kazuaki, "High-speed rough clustering for very large document collections," Journal of the American Society for Information Science and Technology 61.6 (2010): 1092-1104.
Min Chen is now an Assistant Professor at the State University of New York at New Paltz. She received her bachelor's degree in mathematics and physics from the College of St. Benedict in 2009, and earned her master's degree in computer science and doctoral degree in software engineering at North Dakota State University in 2011 and 2015, respectively. Her current research interests include artificial intelligence, machine learning and big data computing.
