clustering

This paper reviews recent advances in clustering algorithms for data streams, highlighting the challenges posed by high-speed data and limited memory. It classifies existing approaches into centroid-based and density-based methods, discussing their performance, scalability, and robustness to noise. The study emphasizes the need for efficient algorithms suitable for real-time applications and identifies future research directions in this active field.

Uploaded by

sugikrish

Abstract: Clustering is an important task in data mining that aims to group data instances together
based on their similarities. However, traditional clustering algorithms are designed for static datasets
and are not well suited to streaming data, which is generated rapidly and arrives in a continuous
flow. Clustering data streams is challenging because of the high speed of the data and the
constraint of limited memory. In this paper, we review recent advances in clustering on data streams
and classify them based on their characteristics, performance, and scalability. We also discuss the
challenges and open research issues in this field and highlight future directions for research.

Keywords: Clustering, Data Stream, Stream Processing, Streaming Algorithms

Introduction:

Clustering is an unsupervised machine learning task that groups data points into clusters based on
their similarity. Clustering has many practical applications, such as image segmentation, customer
segmentation, and anomaly detection. Clustering on data streams is challenging because
streaming data is a continuous flow that is generated rapidly and is unbounded in size.
In contrast to static data, where we can store the entire dataset in memory and
process it offline, streaming data must be processed in real time as it arrives, using
limited resources.

In this paper, we will review the recent advances in clustering on data streams. We will first provide
an overview of the challenges and requirements of clustering on data streams. We will then review
the existing approaches for clustering on data streams and classify them based on their
characteristics, performance, and scalability. We will also discuss the open research issues and
challenges in this field and highlight future directions for research.

Challenges and Requirements:

Clustering on data streams poses several challenges, such as handling the high-speed nature of data,
constraints of limited memory and computational resources, and preserving the clustering quality
over time. Additionally, the data quality may also vary over time, requiring the clustering algorithms
to be robust to outliers and noise. The following are the primary requirements of clustering on data
streams:

Incremental Processing: The algorithms should be able to process the incoming data instances in a
single pass and update the clustering model incrementally in real time.

Limited Memory: The algorithms should consume a bounded amount of memory, since data streams
are unbounded in size and cannot be stored in full.

Scalability: The algorithms should be scalable, meaning they should handle large data streams
efficiently and be able to process each data instance in constant or sub-linear time.

Robustness: The algorithms should be robust to noise and outliers, since data streams may contain
irrelevant or noisy data that would affect clustering quality.

Adaptability: The algorithms should adapt to data distributions that change over time and be able
to update and restructure the clustering model to reflect the changing data properties.
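
As a minimal illustration of the incremental-processing, limited-memory, and adaptability requirements, the sketch below keeps an exponentially decayed mean of a one-dimensional stream. It is not taken from any particular algorithm reviewed here; the class name `DecayedMean`, the `decay` parameter, and the example stream are assumptions made for illustration.

```python
class DecayedMean:
    """Single-pass, constant-memory mean that favors recent instances.

    Illustrative only: `decay` (0 < decay <= 1) is an assumed parameter;
    decay = 1 reduces to the ordinary running mean.
    """

    def __init__(self, decay=0.9):
        self.decay = decay
        self.weight = 0.0        # decayed count of instances seen so far
        self.weighted_sum = 0.0  # decayed sum of instance values

    def update(self, x):
        # Age the existing summary, then add the new instance with weight 1.
        self.weight = self.decay * self.weight + 1.0
        self.weighted_sum = self.decay * self.weighted_sum + x

    @property
    def value(self):
        return self.weighted_sum / self.weight


# Hypothetical stream whose distribution shifts from 0 to 10 halfway through.
stream = [0.0] * 50 + [10.0] * 50
summary = DecayedMean(decay=0.9)
for x in stream:
    summary.update(x)  # each instance is seen once and then discarded
```

After the shift, `summary.value` lies close to 10, whereas the plain mean of the whole stream is 5. Only two numbers are stored, however long the stream runs.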

Approaches for Clustering on Data Streams:

Several approaches have been proposed for clustering on data streams. In general, these
approaches can be classified into two categories: centroid-based and density-based clustering.

Centroid-based Clustering:

Centroid-based clustering algorithms partition the data space into non-overlapping clusters based on
the distance between each data instance and the centroids of the existing clusters. In the streaming
setting, the algorithms update the centroids of the clusters after processing each new data instance.
K-means is a popular centroid-based clustering algorithm in the batch setting, but it is not well
suited to streaming data: it makes repeated passes over the full dataset, which conflicts with the
limited-memory constraint, and it requires the number of clusters to be fixed in advance.
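
The per-instance centroid update can be sketched with the sequential (online) form of k-means, which keeps only the centroids and one counter per cluster. This is a simplified illustration, not one of the streaming algorithms reviewed here: the class name `OnlineKMeans`, the choice to seed the centroids with the first k instances, and the example points are assumptions, and k is still fixed in advance.

```python
import math


class OnlineKMeans:
    """Sequential k-means: one pass, memory proportional to k (illustrative)."""

    def __init__(self, k):
        self.k = k
        self.centroids = []  # one coordinate list per cluster
        self.counts = []     # number of instances assigned to each cluster

    def learn_one(self, x):
        x = list(x)
        # Assumption: the first k instances seed the centroids.
        if len(self.centroids) < self.k:
            self.centroids.append(x)
            self.counts.append(1)
            return len(self.centroids) - 1
        # Assign the instance to the nearest centroid (Euclidean distance).
        j = min(range(self.k), key=lambda i: math.dist(x, self.centroids[i]))
        self.counts[j] += 1
        # Move that centroid toward x so it stays the mean of its members.
        step = 1.0 / self.counts[j]
        self.centroids[j] = [c + step * (xi - c)
                             for c, xi in zip(self.centroids[j], x)]
        return j


# Hypothetical two-dimensional stream with two well-separated groups.
model = OnlineKMeans(k=2)
for point in [(0, 0), (10, 10), (0, 2), (10, 12), (2, 0)]:
    model.learn_one(point)
```

Each instance is discarded after its update, so memory does not grow with the stream. The sketch inherits the fixed-k limitation noted above.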

To overcome the limitations of K-means, several centroid-based algorithms have been proposed for
streaming data. CluStream is a popular example. Its online component summarizes the statistical
properties of the data instances in the stream as micro-clusters, and its offline component derives
the cluster centroids from these micro-clusters on demand. Because the final clusters are computed
from the summaries, the number of clusters can be chosen when the offline step is run, and CluStream
can adapt to changes in the data distribution over time.
However, CluStream may suffer from loss of accuracy when the number of clusters is large.
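
A micro-cluster of this kind can be sketched as a small additive summary: a count, a per-dimension linear sum, and a per-dimension sum of squares. The sketch below is a simplified illustration rather than CluStream itself; it leaves out the temporal statistics that CluStream also stores, and the class and method names are assumptions.

```python
import math


class MicroCluster:
    """Additive summary of a group of instances (illustrative sketch)."""

    def __init__(self, dim):
        self.n = 0                     # number of instances absorbed
        self.linear_sum = [0.0] * dim  # per-dimension sum of values
        self.square_sum = [0.0] * dim  # per-dimension sum of squared values

    def insert(self, x):
        self.n += 1
        for i, xi in enumerate(x):
            self.linear_sum[i] += xi
            self.square_sum[i] += xi * xi

    def merge(self, other):
        # The summaries are additive, so merging needs no raw data.
        self.n += other.n
        for i in range(len(self.linear_sum)):
            self.linear_sum[i] += other.linear_sum[i]
            self.square_sum[i] += other.square_sum[i]

    def centroid(self):
        return [s / self.n for s in self.linear_sum]

    def radius(self):
        # Root-mean-square deviation of the members from the centroid.
        variance = sum(max(sq / self.n - (ls / self.n) ** 2, 0.0)
                       for ls, sq in zip(self.linear_sum, self.square_sum))
        return math.sqrt(variance)


a = MicroCluster(dim=2)
a.insert((0.0, 0.0))
a.insert((2.0, 0.0))
centroid_before_merge = a.centroid()
radius_before_merge = a.radius()

b = MicroCluster(dim=2)
b.insert((4.0, 0.0))
a.merge(b)  # a now summarizes all three instances
```

The memory used depends on the number of micro-clusters and the dimensionality, not on how many instances have been absorbed.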

Density-based Clustering:

Density-based clustering algorithms form clusters from regions in which data points lie densely
together, separated by regions of low density. Density-based clustering is well suited to data streams
since it does not require the number of clusters to be defined explicitly.

DBSCAN is a popular density-based clustering algorithm in the batch setting. DBSTREAM is a
streaming algorithm that brings the density-based approach to data streams.
DBSTREAM maintains a set of representative points (micro-clusters) and a set of density-connected
clusters based on these points. The representative points are updated as new instances arrive, with
older data gradually down-weighted so that the model follows the recent stream, and the
clusters are updated incrementally based on the updated representative points.
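
The idea of keeping representative points and linking them into density-connected clusters can be sketched as follows. This is a deliberately simplified illustration, not the DBSTREAM algorithm: representatives never move, nothing is ever forgotten, and the class name, the `radius` and `min_weight` parameters, and the linking rule (dense representatives within twice the radius are connected) are all assumptions.

```python
import math


class DensitySketch:
    """Representative points linked into density-connected clusters."""

    def __init__(self, radius, min_weight):
        self.radius = radius
        self.min_weight = min_weight
        self.reps = []  # each entry is [center, weight]

    def learn_one(self, x):
        x = tuple(x)
        if self.reps:
            nearest = min(self.reps, key=lambda rep: math.dist(x, rep[0]))
            if math.dist(x, nearest[0]) <= self.radius:
                nearest[1] += 1  # absorb x into the nearest representative
                return
        self.reps.append([x, 1])  # x starts a new representative

    def clusters(self):
        # Sparse representatives are treated as noise.
        dense = [c for c, w in self.reps if w >= self.min_weight]
        unvisited = set(range(len(dense)))
        result = []
        while unvisited:
            start = unvisited.pop()
            component, frontier = [dense[start]], [start]
            while frontier:
                i = frontier.pop()
                linked = [j for j in unvisited
                          if math.dist(dense[i], dense[j]) <= 2 * self.radius]
                for j in linked:
                    unvisited.remove(j)
                    component.append(dense[j])
                    frontier.append(j)
            result.append(component)
        return result


# Hypothetical stream: two dense groups and one isolated point.
model = DensitySketch(radius=1.0, min_weight=2)
for p in [(0.0, 0.0), (0.5, 0.0), (1.5, 0.0), (1.8, 0.0),
          (10.0, 10.0), (10.2, 10.0), (5.0, 5.0)]:
    model.learn_one(p)
found = model.clusters()
```

The number of clusters is never specified: it follows from which dense representatives end up connected, and the isolated point is left out as noise.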

Conclusion:

Clustering data streams is challenging because of the high speed of the data and the
constraint of limited memory. In this paper, we have reviewed recent advances in clustering on data
streams and classified them based on their characteristics, performance, and scalability. We have
also discussed the challenges and open research issues in this field and highlighted future directions
for research. Clustering on data streams is an active research area that requires new and innovative
approaches to handle large

Introduction:

Data stream clustering is an important technique that is used to analyze data in real-time. With the
increasing amount of data generated by different sources, the traditional clustering techniques that
require random access to the entire dataset become inefficient. Therefore, developing efficient
algorithms that can analyze data continuously and in real-time is essential. In this survey paper, we
aim to evaluate the efficiency of different data stream clustering algorithms based on various
characteristics such as their suitability for different applications, scalability, and robustness to noise.

Methods:

We conducted an extensive survey of the research literature related to data stream clustering. We
searched for relevant papers published in various scientific databases, including IEEE Xplore, ACM
Digital Library, and ScienceDirect. We also consulted conference proceedings and books related to
the topic. We selected articles that discussed different data stream clustering algorithms and
evaluated their efficiency based on various parameters.

Results:

Our analysis of the literature revealed that several clustering algorithms are discussed in the
context of data streams, and that they differ in their efficiency. Some of the popular algorithms are
CluStream, DenStream, BIRCH, and DBSCAN, the latter two originating in the batch setting.
CluStream is suitable for applications that require real-time
clustering and online processing. DenStream, on the other hand, is appropriate for datasets with
varying densities and changing cluster shapes. BIRCH is suitable for large datasets, whereas DBSCAN
is useful for datasets with complex, arbitrarily shaped clusters, although its single global density
threshold makes widely varying densities harder to handle.

We also found that the scalability of the algorithms is an essential factor that influences their
efficiency. Several algorithms such as CluStream, DenStream, and BIRCH can handle large datasets
efficiently. However, other algorithms such as DBSCAN and OPTICS may be affected by the curse of
dimensionality and become inefficient for high-dimensional datasets.
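
One aspect of the curse of dimensionality that affects distance-based and density-based methods is that distances to different points become harder to tell apart as the dimension grows. The sketch below illustrates this on uniformly random data; the function name `relative_contrast`, the sample size, and the seed are assumptions made for this example, and the exact values depend on them.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(max distance - min distance) / min distance from a random query point.

    Illustrative measure on points drawn uniformly from the unit hypercube.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [math.dist(query, [rng.random() for _ in range(dim)])
             for _ in range(n_points)]
    return (max(dists) - min(dists)) / min(dists)


low_dim = relative_contrast(dim=2)
high_dim = relative_contrast(dim=500)
```

In this setup `low_dim` is much larger than `high_dim`: in high dimensions the nearest and farthest points lie at almost the same distance, so a fixed neighborhood radius separates dense from sparse regions less reliably.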

Furthermore, robustness to noise was found to be an important
characteristic. While CluStream and DenStream can handle noisy data with varying densities, other
algorithms such as BIRCH and DBSCAN may be affected by the presence of noise. DBSCAN labels
low-density points as noise explicitly, but the outcome depends on how its density parameters are chosen.

Conclusion:

In conclusion, the design of efficient data stream clustering algorithms is essential for real-time
applications. Our study showed that there are several algorithms available that offer different
features based on the characteristics of the datasets. Therefore, choosing the appropriate algorithm
based on the requirements of the application is crucial. Moreover, there is still scope for further
research in this area to develop more efficient algorithms that can handle noisy and high-
dimensional datasets.
