Data stream clustering is the clustering of data that arrive continuously, such as telephone records, multimedia data, and financial transactions. It is generally treated as a streaming algorithm: given a sequence of points, the objective is to produce a good clustering of the stream while using a small amount of memory and time.
Some applications need such data to be clustered automatically into groups based on their similarities. Examples include Web intrusion detection, analysis of Web clickstreams, and stock market analysis.
There are several well-established methods for clustering static data sets, but clustering data streams places additional constraints on such algorithms. The data stream model of computation requires algorithms to make a single pass over the data, with bounded memory and limited processing time, even though the stream may be highly dynamic and evolve over time.
There are several methodologies for data stream clustering, which are as follows −
Compute and store summaries of past data − Because of limited memory space and fast response requirements, compute summaries of the previously seen data, store the relevant results, and use such summaries to compute important statistics when needed.
Apply a divide-and-conquer strategy − Divide the data stream into chunks based on order of arrival, compute summaries for these chunks, and then merge the summaries. In this way, larger models can be built out of smaller building blocks.
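The two ideas above, storing summaries and merging per-chunk summaries, can be sketched as follows. This is a minimal illustration, not a full clustering algorithm. The names `Summary`, `chunks`, and `summarize_stream` are hypothetical, and the summary used here (count, linear sum, and squared sum per dimension) is one common additive choice assumed for the example.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Sequence


class Summary:
    """Additive summary of a set of points (hypothetical helper)."""

    def __init__(self, dim: int) -> None:
        self.n = 0                 # number of points seen
        self.ls = [0.0] * dim      # per-dimension linear sum
        self.ss = [0.0] * dim      # per-dimension sum of squares

    def add(self, point: Sequence[float]) -> None:
        """Fold one point into the summary; the point itself is not kept."""
        self.n += 1
        for i, x in enumerate(point):
            self.ls[i] += x
            self.ss[i] += x * x

    def merge(self, other: "Summary") -> None:
        """Combine two summaries by adding their components."""
        self.n += other.n
        for i in range(len(self.ls)):
            self.ls[i] += other.ls[i]
            self.ss[i] += other.ss[i]

    def mean(self) -> List[float]:
        return [s / self.n for s in self.ls]

    def variance(self) -> List[float]:
        return [sq / self.n - (s / self.n) ** 2
                for s, sq in zip(self.ls, self.ss)]


def chunks(stream: Iterable[Sequence[float]], size: int) -> Iterator[list]:
    """Split the stream into chunks by order of arrival."""
    it = iter(stream)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def summarize_stream(stream: Iterable[Sequence[float]],
                     dim: int, chunk_size: int) -> Summary:
    """Summarize each chunk separately, then merge the chunk summaries."""
    total = Summary(dim)
    for chunk in chunks(stream, chunk_size):
        part = Summary(dim)
        for point in chunk:
            part.add(point)
        total.merge(part)
    return total
```

Because the summary is additive, merging chunk summaries gives the same statistics as summarizing the whole stream at once, while only one chunk is held in memory at a time.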
Incremental clustering of incoming data streams − Because stream data enter the system continuously and incrementally, the derived clusters should be incrementally refined.
Perform microclustering as well as macroclustering analysis − Stream clusters can be computed in two steps, which are as follows −
Compute and store summaries at the microcluster level, where microclusters are formed by applying a hierarchical bottom-up clustering algorithm.
Compute macroclusters (for example, by using another clustering algorithm to group the microclusters) at the user-specified level. This two-step computation effectively compresses the data and often yields results with a smaller margin of error.
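The two-step idea can be sketched as below. This is a simplified illustration under stated assumptions, not the procedure of any specific published algorithm. For brevity the micro step uses a nearest-centroid rule with a hypothetical `radius` threshold instead of hierarchical bottom-up clustering, and the macro step merges the closest microclusters until a user-specified `k` remain. All class and function names are invented for the example.

```python
import math
from typing import Iterable, List, Sequence


class MicroCluster:
    """Compact summary of nearby points: count and linear sum."""

    def __init__(self, point: Sequence[float]) -> None:
        self.n = 1
        self.ls = list(point)

    def centroid(self) -> List[float]:
        return [s / self.n for s in self.ls]

    def absorb(self, point: Sequence[float]) -> None:
        self.n += 1
        self.ls = [s + x for s, x in zip(self.ls, point)]

    def merge(self, other: "MicroCluster") -> None:
        self.n += other.n
        self.ls = [a + b for a, b in zip(self.ls, other.ls)]

    def copy(self) -> "MicroCluster":
        clone = MicroCluster(self.ls)
        clone.n = self.n
        return clone


def micro_step(stream: Iterable[Sequence[float]],
               radius: float) -> List[MicroCluster]:
    """Step 1: one pass over the stream, keeping only microcluster summaries."""
    micro: List[MicroCluster] = []
    for point in stream:
        nearest = min(micro,
                      key=lambda m: math.dist(m.centroid(), point),
                      default=None)
        if nearest is not None and math.dist(nearest.centroid(), point) <= radius:
            nearest.absorb(point)
        else:
            micro.append(MicroCluster(point))
    return micro


def macro_step(micro: List[MicroCluster], k: int) -> List[MicroCluster]:
    """Step 2: merge the closest summaries until k macroclusters remain."""
    groups = [m.copy() for m in micro]  # leave the stored microclusters intact
    while len(groups) > k:
        pairs = [(i, j) for i in range(len(groups))
                 for j in range(i + 1, len(groups))]
        i, j = min(pairs, key=lambda p: math.dist(groups[p[0]].centroid(),
                                                  groups[p[1]].centroid()))
        groups[i].merge(groups[j])
        del groups[j]
    return groups
```

The macro step works only on the microcluster summaries, so it can be rerun with a different `k` without revisiting the original points.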
Explore multiple time granularities for the analysis of cluster evolution − Because more recent data often play a different role from that of older (remote) data in stream data analysis, use a tilted time frame model to store snapshots of summarized data at different points in time.
Divide stream clustering into on-line and off-line processes − While data are streaming in, basic summaries of data snapshots should be computed, stored, and incrementally updated.
An on-line process is therefore needed to maintain such dynamically changing clusters. Meanwhile, a user may pose queries about past, current, or evolving clusters. Such analysis can be performed off-line or as a process independent of on-line cluster maintenance.
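The split between on-line maintenance and off-line querying, together with incremental refinement, can be sketched as follows. This is a minimal sketch under stated assumptions. The on-line part uses a sequential k-means style update, chosen here only for illustration. Snapshots are taken at a fixed interval, which is a simplification: a tilted time frame would keep finer snapshots for recent times and coarser ones for older times. `OnlineClusterer` and `query_at` are hypothetical names.

```python
import bisect
import math
from typing import List, Sequence


class OnlineClusterer:
    """On-line part: incremental centroid updates plus periodic snapshots."""

    def __init__(self, k: int, snapshot_every: int) -> None:
        self.k = k
        self.snapshot_every = snapshot_every
        self.centroids: List[List[float]] = []
        self.counts: List[int] = []
        self.t = 0                                   # points seen so far
        self.snapshot_times: List[int] = []
        self.snapshots: List[List[List[float]]] = []

    def update(self, point: Sequence[float]) -> None:
        """Process one arriving point without storing it."""
        self.t += 1
        if len(self.centroids) < self.k:
            # Assumption: the first k points seed the clusters.
            self.centroids.append(list(point))
            self.counts.append(1)
        else:
            j = min(range(self.k),
                    key=lambda i: math.dist(self.centroids[i], point))
            self.counts[j] += 1
            step = 1.0 / self.counts[j]
            # Move the nearest centroid toward the point (running mean).
            self.centroids[j] = [c + step * (x - c)
                                 for c, x in zip(self.centroids[j], point)]
        if self.t % self.snapshot_every == 0:
            self.snapshot_times.append(self.t)
            self.snapshots.append([list(c) for c in self.centroids])


def query_at(clusterer: OnlineClusterer, t: int) -> List[List[float]]:
    """Off-line part: return the latest stored snapshot at or before time t."""
    i = bisect.bisect_right(clusterer.snapshot_times, t)
    if i == 0:
        raise ValueError("no snapshot at or before time %d" % t)
    return clusterer.snapshots[i - 1]
```

A query about a past state reads only the stored snapshots, so it does not interfere with the on-line updates:

```python
clusterer = OnlineClusterer(k=2, snapshot_every=2)
for point in [(0.0, 0.0), (10.0, 10.0), (2.0, 2.0), (8.0, 8.0)]:
    clusterer.update(point)
past = query_at(clusterer, 3)      # state as of the snapshot taken at t = 2
current = query_at(clusterer, 4)   # state as of the snapshot taken at t = 4
```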