Unit VII

Cluster analysis is a method of grouping similar objects into clusters, utilized in various applications such as market research and image processing. It involves different types of data structures, including data matrices and dissimilarity matrices, and employs various clustering methods like partitioning, hierarchical, density-based, grid-based, and model-based methods. Each method has its own advantages and challenges, particularly in handling different data types, scalability, and the ability to manage noise and outliers.


3.7 Cluster Analysis: Types of Data in Cluster Analysis

What is Cluster Analysis?


The process of grouping a set of physical objects into classes of similar
objects is called clustering.

Cluster – a collection of data objects
– Objects within a cluster are similar to one another and dissimilar to objects in other clusters.

Cluster applications – pattern recognition, image processing and market


research.
- helps marketers to discover the characterization of customer
groups based on purchasing patterns
- Categorize genes in plant and animal taxonomies
- Identify groups of houses in a city according to house type, value and geographical location
- Classify documents on WWW for information discovery

Clustering is a preprocessing step for other data mining steps


like classification, characterization.
Clustering – Unsupervised learning – does not rely on predefined classes
with class labels.

Typical requirements of clustering in data mining:


1. Scalability – Clustering algorithms should work for huge databases
2. Ability to deal with different types of attributes – Clustering
algorithms should work not only for numeric data, but also for
other data types.
3. Discovery of clusters with arbitrary shape – Clustering algorithms
(based on distance measures) should work for clusters of any
shape.
4. Minimal requirements for domain knowledge to determine input
parameters – Clustering results are sensitive to input parameters to
a clustering algorithm (example
– number of desired clusters). Determining the value of these
parameters is difficult and requires some domain knowledge.
5. Ability to deal with noisy data – Real-world data contain outliers and missing, unknown or erroneous values; algorithms that are sensitive to such noise may produce clusters of poor quality.
6. Insensitivity to the order of input records – Clustering algorithms should produce the same results even if the order of the input records is changed.
7. High dimensionality – Data in high dimensional space can be
sparse and highly skewed, hence it is challenging for a clustering
algorithm to cluster data objects in high dimensional space.
8. Constraint-based clustering – In real-world scenarios, clustering may need to be performed under various constraints. It is a challenging task to find groups of data with good clustering behavior that also satisfy the given constraints.
9. Interpretability and usability – Clustering results should be
interpretable, comprehensible and usable. So we should
study how an application goal may influence the selection of
clustering methods.

Types of data in Clustering Analysis


1. Data Matrix: (object-by-variable structure)
Represents n objects (such as persons) with p variables or attributes (such as age, height, weight, gender, race and so on). The structure is a relational table, or an n x p matrix:

   [ x11 ... x1f ... x1p ]
   [  ⋮        ⋮       ⋮  ]
   [ xi1 ... xif ... xip ]
   [  ⋮        ⋮       ⋮  ]
   [ xn1 ... xnf ... xnp ]

 called a "two-mode" matrix (rows and columns represent different entities)

2. Dissimilarity Matrix: (object-by-object structure)

This stores a collection of proximities (closeness or distance) that are available for all pairs of the n objects. It is represented by an n-by-n table:

   [   0                              ]
   [ d(2,1)   0                       ]
   [ d(3,1)  d(3,2)   0               ]
   [   ⋮        ⋮       ⋮              ]
   [ d(n,1)  d(n,2)  ...  ...    0    ]

 called a "one-mode" matrix, where d(i, j) is the dissimilarity between objects i and j; d(i, j) = d(j, i) and d(i, i) = 0.

Many clustering algorithms operate on a dissimilarity matrix, so data represented as a data matrix are converted into a dissimilarity matrix before applying such algorithms (a minimal sketch of this conversion is given below).

Clustering of objects is done based on their similarities or dissimilarities. Similarity or dissimilarity coefficients may be derived, for example, from correlation coefficients or from distance measures.
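A minimal sketch of this conversion, assuming purely numeric attributes and Euclidean distance (the function name dissimilarity_matrix is illustrative):

import numpy as np

def dissimilarity_matrix(X):
    """Return the n x n matrix D with D[i, j] = Euclidean distance between
    objects i and j. D is symmetric and its diagonal is zero."""
    X = np.asarray(X, dtype=float)
    # Pairwise differences via broadcasting: shape (n, n, p)
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Example: 4 objects described by 2 numeric variables
X = [[1.0, 2.0], [2.0, 2.0], [8.0, 9.0], [9.0, 9.0]]
D = dissimilarity_matrix(X)
print(np.round(D, 2))   # d(i, j) = d(j, i), d(i, i) = 0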
3.8 Categorization of Major Clustering Methods
The choice among the many available clustering algorithms depends on the type of data available and on the intended application.

Major Categories are:

1. Partitioning Methods:
- Construct k partitions of the n data objects, where each partition is a cluster and k ≤ n.
- Each partition must contain at least one object, and each object must belong to exactly one partition.
- Iterative Relocation Technique – attempts to improve
partitioning by moving objects from one group to another.
- Good Partitioning – Objects in the same cluster are “close” /
related and objects in the different clusters are “far apart” / very
different.
- Uses the algorithms:
o k-means algorithm: each cluster is represented by the mean value of the objects in the cluster.
o k-medoids algorithm: each cluster is represented by one of the objects located near the center of the cluster.
o These work well on small to medium-sized databases.
2. Hierarchical Methods:
- Creates hierarchical decomposition of the given set of data
objects.
- Two types – Agglomerative and Divisive
- Agglomerative Approach: (Bottom-Up Approach):
o Each object forms a separate group
o Successively merges groups close to one another
(based on distance between clusters)
o Done until all the groups are merged to one or until a
termination condition holds. (Termination condition can be
desired number of clusters)
- Divisive Approach: (Top-Down Approach):
o Starts with all the objects in the same cluster
o Successively clusters are split into smaller clusters
o Done until each object is in one cluster or until a
termination condition holds (Termination condition can
be desired number of clusters)
- Disadvantage – Once a merge or split is done, it cannot be undone.
- Advantage – lower computational cost.
- Combining the two approaches gives additional benefit.
- Clustering algorithms with this integrated approach are BIRCH and
CURE.

3. Density-Based Methods:


- The methods above tend to produce spherical-shaped clusters.
- To discover clusters of arbitrary shape, clustering is done based on the notion of density.
- Density-based methods can also filter out noise (outliers).

- Continue growing a cluster so long as the density in the


neighborhood exceeds some threshold.
- Here density means the number of objects (data points) in a neighborhood.
- That is, for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points.
- Uses the algorithms: DBSCAN and OPTICS

4. Grid-Based Methods:
- Divides the object space into a finite number of cells to form a grid structure.
- Performs clustering operations on the grid structure.
- Advantage – fast processing time, which is independent of the number of data objects and depends only on the number of cells in the grid.
- STING is a typical grid-based method.
- CLIQUE and WaveCluster are both grid-based and density-based clustering algorithms.

5. Model-Based Methods:
- Hypothesizes a model for each of the clusters and finds a best fit
of the data to the model.
- Forms clusters by constructing a density function that
reflects the spatial distribution of the data points.
- Robust clustering methods
- Detects noise / outliers.

Many algorithms combine several clustering methods.

3.9 Partitioning Methods

Partitioning Methods
Database has n objects and k partitions where k<=n; each partition is a
cluster.

Partitioning criterion = Similarity function:


Objects within a cluster are similar; objects of different clusters are
dissimilar.

Classical Partitioning Methods: k-means and k-medoids:

(A) Centroid-based technique: The k-means method:


- Cluster similarity is measured using mean value of objects
in the cluster (or clusters center of gravity)
- Randomly select k objects. Each object is a cluster mean or center.
- Each of the remaining objects is assigned to the most similar
cluster – based on the distance between the object and the
cluster mean.
- Compute new mean for each cluster.
- This process iterates until all the objects are assigned to
a cluster and the partitioning criterion is met.
- This algorithm determines k partitions that minimize the squared-error function, defined as

    E = Σ_{i=1..k} Σ_{x ∈ Ci} |x − m_i|²

  where x is the point representing an object and m_i is the mean of cluster Ci.
Algorithm (outline): (1) arbitrarily choose k objects as the initial cluster centers; (2) (re)assign each object to the cluster whose mean is nearest; (3) update the cluster means; (4) repeat steps 2–3 until the assignments no longer change.
- Advantages: Scalable; efficient in large databases
- Computational Complexity of this algorithm:
o O(nkt); n = number of objects, k number of partitions,
t = number of iterations
o k << n and t << n
- Disadvantage:
o Cannot be applied for categorical data – as mean cannot be
calculated.
o Need to specify the number of partitions – k
o Does not work well for clusters of very different sizes.
o Sensitive to noise and outliers, which can distort the cluster means.
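A minimal sketch of the basic k-means loop described above, assuming numeric data; the initialization, convergence test and variable names are illustrative choices, not part of the notes:

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: arbitrarily choose k objects as the initial cluster means
    means = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each object to the cluster with the nearest mean
        dist = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=-1)
        labels = dist.argmin(axis=1)
        # Step 3: recompute the mean of each cluster
        new_means = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else means[j] for j in range(k)])
        if np.allclose(new_means, means):   # means no longer change -> stop
            break
        means = new_means
    sse = ((X - means[labels]) ** 2).sum()  # squared-error criterion E
    return labels, means, sse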

(B) Representative object-based technique: The k-medoids method:

- Medoid – the most centrally located object in a cluster, used as a reference point.
- Partitioning is based on the principle of minimizing the sum of the dissimilarities between each object and its corresponding reference point.

- PAM – Partitioning Around Medoids – a k-medoids-type clustering algorithm.
- Finds k clusters in n objects by finding a medoid for each cluster.
- An initial set of k medoids is arbitrarily selected.
- Iteratively replaces one of the medoids with one of the non-medoids whenever doing so improves the total distance of the resulting clustering.
- After the initial selection of k medoids, the algorithm repeatedly tries to make a better choice of medoids by analyzing all possible pairs of objects such that one object is a medoid and the other is not.
- The measure of clustering quality is calculated for each such combination, and the best choice of points in one iteration is used as the medoids for the next iteration.
- The cost of a single iteration is O(k(n − k)²).
- For large values of n and k, the cost of such computation can be high.

- Advantage: the k-medoids method is more robust to noise and outliers than the k-means method.
- Disadvantage: the k-medoids method is more costly than the k-means method.
- The user needs to specify k, the number of clusters, in both methods.
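A minimal, unoptimized sketch of the PAM idea (try medoid/non-medoid swaps and keep any swap that lowers the total dissimilarity), assuming a precomputed dissimilarity matrix D; the function names are illustrative:

import numpy as np

def total_cost(D, medoids):
    # Sum over all objects of the distance to the nearest medoid
    return D[:, medoids].min(axis=1).sum()

def pam(D, k, seed=0):
    D = np.asarray(D, dtype=float)       # n x n dissimilarity matrix
    n = len(D)
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(n, size=k, replace=False))
    improved = True
    while improved:
        improved = False
        for mi in range(k):
            for h in range(n):           # candidate non-medoid
                if h in medoids:
                    continue
                candidate = medoids[:mi] + [h] + medoids[mi + 1:]
                if total_cost(D, candidate) < total_cost(D, medoids):
                    medoids = candidate  # keep the improving swap
                    improved = True
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels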

(C) Partitioning methods in large databases: from k-medoids to CLARANS:
- (i) CLARA – Clustering LARge Applications – a sampling-based method.
- Instead of the whole data set, only a randomly selected sample is considered, and the medoids are selected from this sample using PAM.
- CLARA draws multiple samples of the data set, applies PAM on
each sample and gives the best clustering as the output.
Classifies the entire dataset to the resulting clusters.
- The complexity of each iteration in this case is O(kS² + k(n − k)), where S = size of the sample, k = number of clusters and n = total number of objects.
- Effectiveness of CLARA depends on sample size.
- Good clustering of samples does not imply good clustering
of the dataset if the sample is biased.
- (ii) CLARANS – Clustering LARge Applications based upon RANdomized Search – improves the quality and scalability of CLARA.
- It is similar to PAM and CLARA, but it neither restricts itself to a fixed sample nor examines the entire database at every step.
- It begins like PAM by selecting k medoids and then applies a randomized iterative optimization: it randomly selects up to "maxneighbor" pairs (medoid, non-medoid) as candidate swaps.
- If a pair with lower cost is found, the medoid set is updated and the search continues; otherwise the current selection of medoids is taken as a local optimum.
- The process then repeats with a new random selection of medoids to search for another local optimum.
- It stops after finding "numlocal" local optimum sets and returns the best of them.
- CLARANS enables the detection of outliers and is among the best medoid-based methods.
- Drawbacks – it assumes that the objects fit into main memory, and the result can depend on the input order.
3.10 Hierarchical Methods

Hierarchical Methods
This works by grouping data objects into a tree of clusters. Two types –
Agglomerative and Divisive.
Clustering algorithms with integrated approach of these two types are
BIRCH, CURE, ROCK and CHAMELEON.

BIRCH – Balanced Iterative Reducing and Clustering using Hierarchies:


- Integrated Hierarchical Clustering algorithm.
- Introduces two concepts – Clustering Feature and CF tree
(Clustering Feature Tree)
- CF Trees – Summarized Cluster Representation – Helps to
achieve good speed & clustering scalability
- Good for incremental and dynamical clustering of incoming data
points.
- The Clustering Feature CF is the summary statistic for a subcluster, defined as the triple

    CF = (N, LS, SS)

  where N is the number of points in the subcluster (each point represented as a vector X_i), LS = Σ_{i=1..N} X_i is the linear sum of the N points, and SS = Σ_{i=1..N} X_i² is the square sum of the data points. (A minimal sketch of the CF summary and its additivity appears at the end of this subsection.)

- CF Tree – a height-balanced tree that stores the clustering features.
- It has two parameters – branching factor B and threshold T.
- The branching factor specifies the maximum number of children per non-leaf node.
- Changing the threshold value changes the size of the tree.
- The non-leaf nodes store the sums of their children's CFs, summarizing information about their children.

- BIRCH algorithm has the following two phases:


o Phase 1: Scan database to build an initial in-memory CF
tree – Multi-level compression of the data – Preserves the
inherent clustering structure of the data.

 CF tree is built dynamically as data points are


inserted to the closest leaf entry.
 If the diameter of the subcluster in the leaf node
after insertion becomes larger than the threshold
then the leaf node and possibly other nodes are
split.
 After a new point is inserted, the information
about it is passed towards the root of the tree.
- Threshold parameter T = the maximum diameter of the subclusters stored at the leaf nodes.
 If the CF tree grows larger than the available main memory, a larger threshold value is specified and the CF tree is rebuilt.
 The rebuild process reuses the leaf entries of the old tree, so the data have to be read from the database only once.

o Phase 2: Apply a clustering algorithm to cluster the leaf nodes of the CF tree.
- Advantages:
o Produces best clusters with available resources.
o Minimizes the I/O time
- Computational complexity of this algorithm is – O(N) – N is
the number of objects to be clustered.
- Disadvantage:
o Not a natural way of clustering;
o Does not work for non-spherical shaped clusters.
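A minimal sketch of the clustering feature CF = (N, LS, SS) and its additivity, from which a subcluster's centroid and radius can be derived without storing its points; the class name and radius helper are illustrative:

import numpy as np

class CF:
    def __init__(self, point=None):
        if point is None:
            self.N, self.LS, self.SS = 0, 0.0, 0.0
        else:
            x = np.asarray(point, dtype=float)
            self.N, self.LS, self.SS = 1, x.copy(), (x ** 2).sum()

    def merge(self, other):
        # Additivity: CFs of two disjoint subclusters simply add up
        out = CF()
        out.N = self.N + other.N
        out.LS = self.LS + other.LS
        out.SS = self.SS + other.SS
        return out

    def centroid(self):
        return self.LS / self.N

    def radius(self):
        # Average distance of member points from the centroid, from the CF alone
        return np.sqrt(max(self.SS / self.N - (self.centroid() ** 2).sum(), 0.0))

# Usage: absorb points one at a time without storing them
cf = CF([1.0, 2.0]).merge(CF([2.0, 3.0])).merge(CF([1.5, 2.5]))
print(cf.N, cf.centroid(), cf.radius())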

CURE – Clustering Using Representatives:


- Integrates hierarchical and partitioning algorithms.
- Handles clusters of different shapes and sizes; Handles outliers
separately.
- Here a set of representative centroid points are used to represent
a cluster.
- These points are generated by first selecting well scattered
points in a cluster and shrinking them towards the center of the
cluster by a specified fraction (shrinking factor)
- Closest pair of clusters are merged at each step of the algorithm.
- Having more than one representative point per cluster allows CURE to handle clusters of non-spherical shape.
- Shrinking helps to mitigate the effect of outliers.
- To handle large databases – CURE employs a combination of
random sampling and partitioning.
- The resulting clusters from these samples are again merged to get
the final cluster.

- CURE algorithm:
o Draw a random sample s.
o Partition the sample s into p partitions, each of size s/p.
o Partially cluster each partition into s/(pq) clusters, where q > 1.
o Eliminate outliers – random sampling filters out most outliers, and clusters that grow too slowly are eliminated.
o Cluster the partial clusters.
o Label the remaining data with the corresponding cluster labels.
(A minimal sketch of the representative-point shrinking step is given after this subsection.)
- Advantage:
o High quality clusters
o Removes outliers
o Produces clusters of different shapes & sizes
o Scales to large databases
- Disadvantage:
o Needs parameters – Size of the random sample; Number
of Clusters and Shrinking factor
o These parameter settings have significant effect on the
results.
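A minimal sketch of CURE's representative points: pick a few well-scattered points of a cluster and shrink them toward the centroid; the parameter names c and alpha (shrinking factor) are illustrative:

import numpy as np

def representative_points(cluster, c=4, alpha=0.5):
    cluster = np.asarray(cluster, dtype=float)
    centroid = cluster.mean(axis=0)
    # Greedily pick well-scattered points: farthest from the centroid first,
    # then farthest from the points already chosen.
    chosen = [cluster[np.linalg.norm(cluster - centroid, axis=1).argmax()]]
    while len(chosen) < min(c, len(cluster)):
        d = np.min([np.linalg.norm(cluster - p, axis=1) for p in chosen], axis=0)
        chosen.append(cluster[d.argmax()])
    reps = np.array(chosen)
    # Shrink toward the centroid by the shrinking factor alpha
    return reps + alpha * (centroid - reps)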

ROCK:
- Agglomerative hierarchical clustering algorithm.
- Suitable for clustering categorical attributes.
- It measures the similarity of two clusters by comparing their aggregate interconnectivity against a user-specified static interconnectivity model.
- The interconnectivity of two clusters C1 and C2 is defined by the number of cross links between the two clusters.
- link(pi, pj) = the number of common neighbors of the two points pi and pj.

- Two steps:
o First, construct a sparse graph from a given data similarity matrix using a similarity threshold and the concept of shared neighbors.
o Then, perform an agglomerative hierarchical clustering algorithm on the sparse graph.
(A minimal sketch of the link computation is given below.)
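A minimal sketch of the link computation, assuming a precomputed similarity matrix and a similarity threshold theta:

import numpy as np

def links(S, theta):
    """S is an n x n similarity matrix; returns the n x n matrix of link counts."""
    S = np.asarray(S, dtype=float)
    A = (S >= theta).astype(int)     # adjacency (neighbour) matrix
    np.fill_diagonal(A, 0)           # a point is not its own neighbour
    return A @ A                     # (A @ A)[i, j] = number of common neighbours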
CHAMELEON – A hierarchical clustering algorithm using dynamic modeling:
- In this clustering process, two clusters are merged if the interconnectivity and closeness (proximity) between the two clusters are highly related to the internal interconnectivity and closeness of the objects within the clusters.
- This merge process produces natural and homogeneous clusters.
- Applies to all types of data as long as the similarity function is
specified.

- This first uses a graph partitioning algorithm to cluster the


data items into large number of small sub clusters.
- Then it uses an agglomerative hierarchical clustering algorithm
to find the genuine clusters by repeatedly combining the sub
clusters created by the graph partitioning algorithm.
- To determine the pairs of most similar sub clusters,
it considers the interconnectivity as well as the
closeness of the clusters.

- Objects are represented using a k-nearest-neighbor graph: each vertex represents an object, and an edge exists between two vertices if one of the objects is among the k most similar objects of the other.
- The graph is partitioned by removing edges in sparse regions while keeping edges in dense regions; each partitioned subgraph forms an initial subcluster.
- Then form the final clusters by iteratively merging the
clusters from the previous cycle based on their
interconnectivity and closeness.

- CHAMELEON determines the similarity between each pair of clusters Ci and Cj according to their relative interconnectivity RI(Ci, Cj) and their relative closeness RC(Ci, Cj):

    RI(Ci, Cj) = |EC(Ci, Cj)| / ( (|EC(Ci)| + |EC(Cj)|) / 2 )

  where EC(Ci, Cj) is the edge-cut between Ci and Cj in the cluster containing both, and |EC(Ci)| is the size of the min-cut bisector of Ci.

    RC(Ci, Cj) = S_EC(Ci, Cj) / ( (|Ci| / (|Ci| + |Cj|)) · S_EC(Ci) + (|Cj| / (|Ci| + |Cj|)) · S_EC(Cj) )

  where S_EC(Ci, Cj) is the average weight of the edges that connect vertices in Ci to vertices in Cj, and S_EC(Ci) is the average weight of the edges that belong to the min-cut bisector of cluster Ci.
- Advantages:
o More powerful than BIRCH and CURE.
o Produces arbitrary shaped clusters
- Processing cost: may require O(n²) time in the worst case for high-dimensional data, where n = number of objects.

Cluster analysis is a primary method for database mining. It is either


used as a stand-alone tool to get insight into the distribution of a data
set, e.g. to focus further analysis and data processing, or as a
preprocessing step for other algorithms operating on the detected
clusters. Density-based approaches apply a local cluster criterion.
Clusters are regarded as regions in the data space in which the objects
are dense, and which are separated by regions of low object density
(noise). These regions may have an arbitrary shape and the points inside
a region may be arbitrarily distributed.
For other KDD applications, finding the outliers, i.e. the rare events, is
more interesting and useful than finding the common cases, e.g.
detecting criminal activities in E-commerce.

DBSCAN (Density-
Based Spatial Clustering
of Applications with Noise)
The algorithm DBSCAN, based on the formal
notion of density-reachability for k-
dimensional points, is designed to discover
clusters of arbitrary shape. The runtime of
the algorithm is of the order O(n log n) if
region queries are efficiently supported by
spatial index structures, i.e. at least in
moderately dimensional spaces.
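An illustrative usage sketch of density-based clustering, assuming the scikit-learn library is available; eps (neighbourhood radius) and min_samples correspond to the radius and minimum-number-of-points parameters discussed earlier:

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two dense blobs plus a few scattered noise points
X = np.vstack([rng.normal(0, 0.3, (50, 2)),
               rng.normal(5, 0.3, (50, 2)),
               rng.uniform(-2, 7, (10, 2))])

labels = DBSCAN(eps=0.6, min_samples=5).fit_predict(X)
print(set(labels))   # cluster ids; -1 marks points labelled as noise/outliers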

GDBSCAN (Generalized Density-


Based Spatial Clustering
of Applications with Noise)
GDBSCAN generalizes the notion of point density and can therefore be applied to objects of arbitrary data type, e.g. 2-dimensional polygons.
The code for GDBSCAN (including the
specialization to DBSCAN) is available by e-
mail from Dr. Jörg Sander.

OPTICS (Ordering Points To Identify


the Clustering Structure)
While DBSCAN computes a single-level clustering, i.e. clusters of a single, user-defined density, the algorithm OPTICS represents the intrinsic, hierarchical structure of the data by a (one-dimensional) ordering of the points. The resulting graph (called a reachability plot) visualizes clusters of different densities as well as hierarchical clusters.
The code for OPTICS is available by e-mail from Dr. Jörg Sander or Markus Breunig.

LOF (Local Outlier Factors)


Based on the same theoretical foundation as
OPTICS, LOF computes the local outliers of a
dataset, i.e. objects that are outliers relative
to their surrounding space, by assigning an
outlier factor to each object. This outlier
factor can be used to rank the objects
regarding their outlier-ness. The outlier
factors can be computed very efficiently if
OPTICS is used to analyze the clustering
structure.

Grid-Based Clustering Method

 Using multi-resolution grid data structure

 Several interesting methods

 STING (a STatistical INformation Grid approach) by Wang, Yang

and Muntz (1997)

 WaveCluster by Sheikholeslami, Chatterjee, and Zhang

(VLDB’98)

 A multi-resolution clustering approach using wavelet method

 CLIQUE: Agrawal, et al. (SIGMOD’98)

 Both grid-based and subspace clustering

CLIQUE: The Major Steps

 Partition the data space and find the number of points that lie inside

each cell of the partition.

 Identify the subspaces that contain clusters using the Apriori

principle

 Identify clusters
 Determine dense units in all subspaces of interest
 Determine connected dense units in all subspaces of interest (see the sketch after this list)

 Generate minimal description for the clusters

 Determine maximal regions that cover a cluster of connected

dense units for each cluster

 Determination of minimal cover for each cluster
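A simplified sketch of the first two steps, assuming equal-width partitioning into xi intervals per dimension and a density threshold tau (both names are illustrative); only 1-D and 2-D units are shown:

import numpy as np
from itertools import combinations

def dense_units(X, xi=10, tau=5):
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    # Map every value to its interval index in [0, xi)
    lo, hi = X.min(axis=0), X.max(axis=0)
    cells = np.minimum(((X - lo) / (hi - lo + 1e-12) * xi).astype(int), xi - 1)

    # Dense 1-D units: (dimension, interval) pairs holding >= tau points
    dense1 = {(d, u) for d in range(p) for u in range(xi)
              if np.sum(cells[:, d] == u) >= tau}

    # Candidate 2-D units: join dense 1-D units from different dimensions
    # (Apriori principle: a dense 2-D unit must project onto dense 1-D units)
    dense2 = set()
    for (d1, u1), (d2, u2) in combinations(sorted(dense1), 2):
        if d1 != d2:
            mask = (cells[:, d1] == u1) & (cells[:, d2] == u2)
            if mask.sum() >= tau:
                dense2.add(((d1, u1), (d2, u2)))
    return dense1, dense2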


Model-based clustering

• One disadvantage of hierarchical clustering algorithms, k-means


algorithms and others is that they are largely heuristic and not
based on formal models. Formal inference is not possible.

• Not necessarily a disadvantage since clustering is largely exploratory.

• Model-based clustering is an alternative. Banfield and Raftery

(1993, Biometrics) is the classic reference. A more comprehensive

and up-to- date reference is Melnykov and Maitra (2010, Statistics

Surveys) also available on Professor Maitra’s “Manuscripts Online”

link.

• SAS does not implement model-based clustering algorithms.

• With R, you need to load a package called mclust and accept the
terms of the (free) license. mclust is a very good package, but it
can have issues with initialization.
Basic idea behind model-based clustering:

• Sample observations arise from a distribution that is a mixture of two or more components.

• Each component is described by a density function and has an associated probability or "weight" in the mixture.

• In principle, we can adopt any probability model for the components, but typically we will assume that components are p-variate normal distributions. (This does not necessarily mean things are easy; inference is tractable, however.)

• Thus, the probability model for clustering will often be a mixture of multivariate normal distributions.

• Each component in the mixture is what we call a cluster.

(An illustrative code sketch of fitting such a mixture is given below.)
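An illustrative sketch of fitting such a mixture in Python (the notes use R's mclust; here scikit-learn's GaussianMixture is assumed to be available, purely for illustration). Each mixture component plays the role of a cluster, and observations are assigned to the component with the highest posterior probability:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Sample from a mixture of two bivariate normal components
X = np.vstack([rng.multivariate_normal([0, 0], np.eye(2), 150),
               rng.multivariate_normal([4, 4], [[1.0, 0.5], [0.5, 1.0]], 100)])

gm = GaussianMixture(n_components=2, covariance_type='full', random_state=0).fit(X)
labels = gm.predict(X)              # hard cluster assignments
print(gm.weights_)                  # estimated mixing proportions (the pi_k)
print(gm.means_)                    # estimated component means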
Clustering high-dimensional data :

Clustering high-dimensional data is the cluster analysis of data with


anywhere from a few dozen to many thousands of dimensions. Such high-
dimensional data spaces are often encountered in areas such as medicine,
where DNA microarray technology can produce a large number of measurements
at once, and the clustering of text documents, where, if a word-frequency vector
is used, the number of dimensions equals the size of the vocabulary.

Four problems need to be overcome for clustering in high-dimensional data:

 Multiple dimensions are hard to think in, impossible to visualize, and, due to
the exponential growth of the number of possible values with each dimension,
complete enumeration of all subspaces becomes intractable with increasing
dimensionality. This problem is known as the curse of dimensionality.
 The concept of distance becomes less precise as the number of dimensions grows, since the distance between any two points in a given dataset converges. The discrimination of the nearest and farthest point in particular becomes meaningless:

    lim_{d→∞} (dist_max − dist_min) / dist_min → 0

 A cluster is intended to group objects that are related, based on


observations of their attributes' values. However, given a large number of
attributes some of the attributes will usually not be meaningful for a given
cluster. For example, in newborn screening a cluster of samples might
identify newborns that share similar blood values, which might lead to
insights about the relevance of certain blood values for a disease. But for
different diseases, different blood values might form a cluster, and other
values might be uncorrelated. This is known as the local feature
relevance problem: different clusters might be found in different
subspaces, so a global filtering of attributes is not sufficient.
 Given a large number of attributes, it is likely that some attributes
are correlated. Hence, clusters might exist in arbitrarily oriented affine
subspaces.
Recent research indicates that the discrimination problems only occur when there is a high number of irrelevant dimensions, and that shared-nearest-neighbor approaches can improve results.

Mining Data Streams


Tremendous and potentially infinite volumes of data streams are often generated by real-time surveillance systems, communication networks, Internet traffic, online transactions in the financial market or retail industry, electric power grids, industry production processes, scientific and engineering experiments, remote sensors, and other dynamic environments.
a computer system continuously and with varying update rates. They are
temporally ordered, fast changing, massive, and potentially infinite. It may be
impossible to store an entire data stream or to scan through it multiple times
due to its tremendous volume. Moreover, stream data tend to be of a rather low level of abstraction, whereas most analysts are interested in relatively high-level dynamic changes, such as trends and deviations. To
discover knowledge or patterns from data streams, it is necessary to
develop single-scan, on-line, multilevel, multidimensional stream
processing and analysis methods.

Such single-scan, on-line data analysis methodology should not be


confined to only stream data. It is also critically important for processing nonstream data that are massive. With data volumes mounting by terabytes or even petabytes, stream data nicely capture our data processing needs of today: even when the complete set of data is collected and can be stored in massive data storage devices, single scan (as
in data stream systems) instead of random access (as in database
systems) may still be the most realistic processing mode, because it is
often too expensive to scan such a data set multiple times. In this section,
we introduce several on-line stream data analysis and mining methods.
Methodologies for Stream Data Processing and Stream Data
Systems
As seen from the previous discussion, it is impractical to scan through an
entire data stream more than once. Sometimes we cannot even “look” at
every element of a stream because the stream flows in so fast and
changes so quickly. The gigantic size of such data sets also implies that
we generally cannot store the entire stream data set in main memory or
even on disk. The problem is not just that there is a lot of data, it is that
the universes that we are keeping track of are relatively large, where a
universe is the domain of possible values for an attribute. For example, if
we were tracking the ages of millions of people, our universe would be
relatively small, perhaps between zero and one hundred and twenty. We
could easily maintain exact summaries of such data. In contrast, the
universe corresponding to the set of all pairs of IP addresses on the
Internet is very large, which makes exact storage intractable. A
reasonable way of thinking about data streams is to actually think of a
physical stream of water. Heraclitus once said that you can never step in the same stream twice, and so it is with stream data.
From the algorithmic point of view, we want our algorithms to be
efficient in both space and time. Instead of storing all or most elements seen so far, using O(N) space, we often want to use polylogarithmic space, O(log^k N), where N is the number of elements in the stream data. We may relax the requirement that our answers are exact, and ask for approximate answers within a small error range with high probability. That is, many data stream–based algorithms compute an approximate answer within a factor ε of the actual answer, with high probability. Generally, as the approximation factor (1 + ε) goes down, the space requirements go up. In this section, we examine some common synopsis data structures and techniques.

Random Sampling
Rather than deal with an entire data stream, we can think of sampling the stream at periodic intervals. “To obtain an unbiased sampling of the data, we need to know the length of the stream in advance. But what can we do if we do not know this length in advance?” In this case, we need to modify our approach. A technique called reservoir sampling can be used to select an unbiased random sample of s elements without replacement. The idea behind reservoir sampling is relatively simple. We maintain a sample of size at least s, called the “reservoir,” from which a random sample of size s can be generated. However, generating this sample from the reservoir can be costly, especially when the reservoir is large.
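A minimal sketch of reservoir sampling (algorithm R), which keeps an unbiased sample of s elements from a stream whose length is not known in advance:

import random

def reservoir_sample(stream, s, seed=0):
    rnd = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):           # i+1 = number of items seen so far
        if i < s:
            reservoir.append(item)              # fill the reservoir first
        else:
            j = rnd.randint(0, i)               # item kept with probability s/(i+1)
            if j < s:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1_000_000), s=10))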
Sliding Windows
Instead of sampling the data stream randomly, we can use the sliding window
model to analyze stream data. The basic idea is that rather than running
computations on all of the data seen so far, or on some sample, we can
make decisions based only on recent data. More formally, at every time t, a new
data element arrives. This element “expires” at time t + w, where w is the
window “size” or length. The sliding window model is useful for stocks or sensor
networks, where only recent events may be important. It also reduces memory
requirements because only a small window of data is stored.
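A minimal sketch of the sliding window model: only elements that arrived within the last w time units are retained, and queries (here an average) are answered over that window alone; the class and method names are illustrative:

from collections import deque

class SlidingWindow:
    def __init__(self, w):
        self.w = w
        self.items = deque()                 # (timestamp, value) pairs

    def add(self, t, value):
        self.items.append((t, value))
        while self.items and self.items[0][0] <= t - self.w:
            self.items.popleft()             # element has "expired"

    def average(self):
        return sum(v for _, v in self.items) / len(self.items)

win = SlidingWindow(w=60)                    # e.g. a 60-second window of sensor readings
for t, v in [(0, 10.0), (30, 12.0), (65, 11.0), (90, 9.0)]:
    win.add(t, v)
print(win.average())                         # average over readings newer than t - 60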
Histograms
The histogram is a synopsis data structure that can be used to approximate the
frequency distribution of element values in a data stream. A histogram
partitions the data into a set of contiguous buckets. Depending on the
partitioning rule used, the width (bucket value range) and depth (number of elements per bucket) can vary. The equal-width partitioning rule is a simple way to construct histograms, where the range of each bucket is the same.
Although easy to implement, this may not sample the probability distribution
function well.
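A minimal sketch of an equal-width histogram as a synopsis, using fixed bucket boundaries and one counter per bucket:

import numpy as np

values = np.random.default_rng(0).exponential(scale=2.0, size=10_000)
counts, edges = np.histogram(values, bins=10, range=(0.0, 20.0))  # equal-width buckets
for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"[{lo:5.1f}, {hi:5.1f}) : {c}")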

Multiresolution Methods
A common way to deal with a large amount of data is through the use of data reduction methods. A popular data reduction method is the use of divide-and-conquer strategies such as multiresolution data structures. These allow a program to trade off between accuracy and storage, but also offer the ability to understand a data stream at multiple levels of detail.

Data Stream Management Systems and Stream Queries


In traditional database systems, data are stored in finite and persistent databases. However, stream data are infinite and impossible to store fully in a database. In a Data Stream Management System (DSMS), there may be multiple data streams. They arrive on-line and are continuous, temporally ordered, and potentially infinite. Once an element from a data stream has been processed, it is discarded or archived, and it cannot be easily retrieved unless it is explicitly stored in memory.
A stream data query processing architecture includes three parts: end user, query processor, and scratch space (which may consist of main memory and disks). An end user issues a query to the DSMS, and the query processor takes the query, processes it using the information stored in the scratch space, and returns the results to the user.
Queries can be either one-time queries or continuous queries. A one-time query is evaluated once over a point-in-time snapshot of the data set, with the answer returned to the user. A continuous query is evaluated continuously as data streams continue to arrive. The answer to a continuous query is produced over time, always reflecting the stream data seen so far. A continuous query can act as a watchdog, as in “sound the alarm if the power consumption for Block 25 exceeds a certain threshold.” Moreover, a query can be predefined (i.e., supplied to the data stream management system before any relevant data have arrived) or ad hoc (i.e., issued on-line after the data streams have already begun).
Stream OLAP and Stream Data Cubes
Stream data are generated continuously in a dynamic environment, with
huge volume, infinite flow, and fast-changing behavior. It is impossible to store such data streams completely in a data warehouse. Most stream data
represent low-level information, consisting of various kinds of detailed
temporal and other features. To find interesting or unusual patterns, it is
essential to perform multidimensional analysis on aggregate measures (such as
sum and average). This would facilitate the discovery of critical changes in
the data at higher levels of abstraction, from which users can drill down to
examine more detailed levels, when needed.

Mining Time-Series Data


“What is a time-series database?” A time-series database consists of sequences of values or events obtained over repeated measurements of time. The values are typically measured at equal time intervals (e.g., hourly, daily, weekly). Time-series databases are popular in many applications, such as stock market analysis, economic and sales forecasting, budgetary analysis, utility studies, inventory studies, yield projections, workload projections, process and quality control, observation of natural phenomena (such as atmosphere, temperature, wind, earthquake), scientific and engineering experiments, and medical treatments. A time-series database is also a sequence database.
However, a sequence database is any database that consists of
sequences of ordered events, with or without concrete notions of
time. For example, Web page traversal sequences and customer
shopping transaction sequences are sequence data, but they may
not be time-series data.
With the growing deployment of a large number of sensors,
telemetry devices, and other on-line data collection tools, the
amount of time-series data is increasing rapidly, often in the
order of gigabytes per day (such as in stock trading) or even per
minute (such as from NASA space programs). How can we find
correlation relationships within time-series data? How can we
analyze such huge numbers of time series to find similar or
regular patterns, trends, bursts (such as sudden sharp changes),
and outliers, with
fast or even on-line real-time response? This has become an increasingly important and challenging problem. In this section, we examine several aspects of mining time-series databases, with a focus on trend analysis and similarity search.

Similarity Search in Time-Series Analysis
“What is a similarity search?” Unlike normal database queries, which find data
that match the given query exactly, a similarity search finds data sequences
that differ only slightly from the given query sequence. Given a set of time-series
sequences, S, there are two types of similarity searches: subsequence matching and
whole sequence matching. Subse- quence matching finds the sequences in S that
contain subsequences that are similar to a given query sequence x, while
whole sequence matching finds a set of sequences in S that are similar to
each other (as a whole). Subsequence matching is a more fre- quently
encountered problem in applications. Similarity search in time-series analysis is
useful for financial market analysis (e.g., stock data analysis), medical diagnosis
(e.g., car- diogram analysis), and in scientific or engineering databases (e.g.,
power consumption analysis).

Query Languages for Time Sequences


“How can I specify the similarity search to be performed?” We need to
design and develop powerful query languages to facilitate the
specification of similarity searches in time sequences. A time-
sequence query language should be able to specify not only
simple similarity queries like “Find all of the sequences similar to a given
subsequence Q,” but also sophisticated queries like “Find all of the
sequences that are similar to some sequence in class C1, but not similar to
any sequence in class C2.” Moreover, it should be able to support
various kinds of queries, such as range queries and nearest-neighbor
queries.
An interesting kind of time-sequence query language is a shape definition language. It allows users to define and query the overall shape of time sequences using human-readable series of sequence transitions or macros, while ignoring the specific details.

Mining Sequence Patterns in Transactional Databases


A sequence database consists of sequences of ordered elements or
events, recorded with or without a concrete notion of time. There
are many applications involving sequence data. Typical examples
include customer shopping sequences, Web clickstreams, biological sequences, sequences of events in science and engineering, and in natural and social developments. In this section, we study
sequential pattern mining in transactional databases. In particular,
we start with the basic concepts of sequential pattern mining in
Section 8.3.1. Section 8.3.2 presents several scalable methods for
such mining. Constraint-based sequential pattern mining is
described in Section 8.3.3. Periodicity analysis for sequence data is
discussed in Section 8.3.4. Specific methods for mining sequence
patterns in biological data are addressed in Section 8.4.

Sequential Pattern Mining: Concepts and Primitives


“What is sequential pattern mining?” Sequential pattern mining is the mining of frequently occurring ordered events or subsequences as patterns. An example of a sequential pattern is “Customers who buy a Canon digital camera are likely to buy an HP color printer within a month.” For retail data, sequential patterns are useful for shelf placement and promotions. This industry, as well as telecommunications and other businesses, may also use sequential patterns for targeted marketing, customer retention, and many other tasks. Other areas in which sequential patterns can be applied include Web access pattern analysis, weather prediction, production processes, and network intrusion detection. Notice that most studies of sequential pattern mining concentrate on categorical (or symbolic) patterns, whereas numerical curve analysis usually belongs to the scope of trend analysis and forecasting in statistical time-series analysis, as discussed in Section 8.2.
The sequential pattern mining problem was first introduced by
Agrawal and Srikant in 1995 [AS95] based on their study of
customer purchase sequences, as follows: “Given a set of sequences,
where each sequence consists of a list of events (or elements) and each event
consists of a set of items, and given a user-specified minimum support threshold of
min sup, sequential pattern mining finds all frequent subsequences, that is, the
subsequences whose occurrence frequency in the set of sequences is no less than
min sup.”
Let’s establish some vocabulary for our discussion of sequential pattern mining. Let I = {I1, I2, . . . , Ip} be the set of all items. An itemset is a nonempty set of items. A sequence is an ordered list of events. A sequence s is denoted ⟨e1 e2 e3 · · · el⟩, where event e1 occurs before e2, which occurs before e3, and so on. Event ej is also called an element of s. In the case of customer purchase data, an event refers to a shopping trip in which a customer bought items at a certain store. The event is thus an itemset, that is, an unordered list of items that the customer purchased during the trip. The itemset (or event) is denoted (x1 x2 · · · xq), where xk is an item. For brevity, the brackets are omitted if an element has only one item, that is, element (x) is written as x. Suppose that a customer made several shopping trips to the store. These ordered events form a sequence for the customer: the customer first bought the items in s1, then later bought the items in s2, and so on. An item can occur at most once in an event of a sequence, but can occur multiple times in different events of a sequence. The number of instances of items in a sequence is called the length of the sequence. A sequence with length l is called an l-sequence. A sequence α = ⟨a1 a2 · · · an⟩ is called a subsequence of another sequence β = ⟨b1 b2 · · · bm⟩, and β is a supersequence of α, denoted as α ⊑ β, if there exist integers 1 ≤ j1 < j2 < · · · < jn ≤ m such that a1 ⊆ bj1, a2 ⊆ bj2, . . . , an ⊆ bjn. For example, if α = ⟨(ab), d⟩ and β = ⟨(abc), (de)⟩, where a, b, c, d, and e are items, then α is a subsequence of β and β is a supersequence of α.
A sequence database, S, is a set of tuples, (SID, s), where SID is a sequence ID and s is a sequence. For our example, S contains sequences for all customers of the store. A tuple (SID, s) is said to contain a sequence α if α is a subsequence of s. The support of a sequence α in a sequence database S is the number of tuples in the database containing α, that is, support_S(α) = |{(SID, s) | ((SID, s) ∈ S) ∧ (α ⊑ s)}|. It can be denoted as support(α) if the sequence database is clear from the context. Given a positive integer min_sup as the minimum support threshold, a sequence α is frequent in sequence database S if support_S(α) ≥ min_sup. That is, for sequence α to be frequent, it must occur at least min_sup times in S. A frequent sequence is called a sequential pattern. A sequential pattern with length l is called an l-pattern. The following example illustrates these concepts.
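A small illustration of these definitions (events as item sets, subsequence containment by ordered set inclusion, and support counting over a sequence database); the function names are illustrative:

def is_subsequence(alpha, s):
    """alpha and s are lists of events, each event a frozenset of items."""
    j = 0
    for event in s:
        if j < len(alpha) and alpha[j] <= event:   # a_j ⊆ some later event, order preserved
            j += 1
    return j == len(alpha)

def support(alpha, database):
    return sum(is_subsequence(alpha, s) for _, s in database)

# Example: alpha = <(ab), d> is contained in beta = <(abc), (de)>
alpha = [frozenset('ab'), frozenset('d')]
S = [('S1', [frozenset('abc'), frozenset('de')]),
     ('S2', [frozenset('ad'), frozenset('c')])]
print(is_subsequence(alpha, S[0][1]))   # True
print(support(alpha, S))                # 1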
Scalable Methods for Mining Sequential Patterns
Sequential pattern mining is computationally challenging because such mining may generate and/or test a combinatorially explosive number of intermediate subsequences.
“How can we develop efficient and scalable methods for sequential pattern
mining?” Recent developments have made progress in two
directions: (1) efficient methods for mining the full set of
sequential patterns, and (2) efficient methods for mining only
the set of closed sequential patterns, where a sequential pattern s is closed if there exists no sequential pattern s′ such that s′ is a proper supersequence of s and s′ has the same (frequency) support as s. Because all of the subsequences of a frequent sequence are also frequent, mining the set of closed sequential patterns may avoid the generation of unnecessary subsequences and thus lead to more compact results as well as more efficient methods than mining the full set. We will first
examine methods for mining the full set and then study how
they can be extended for mining the closed set. In addition, we
discuss modifications for mining multilevel, multidimensional
sequential patterns (i.e., with multiple levels of granularity).

GSP: A Sequential Pattern Mining Algorithm Based on


Candidate Generate-and-Test
GSP (Generalized Sequential Patterns) is a sequential pattern mining method
that was developed by Srikant and Agrawal in 1996. It is an extension of
their seminal algorithm for frequent itemset mining, known as Apriori (Section
5.2). GSP uses the downward-closure property of sequential patterns and adopts a multiple-pass, candidate generate-and-test approach. The algorithm is outlined as follows. In the first scan of the database, it finds all of the frequent items, that is, those with minimum support. Each such item
yields a 1-event frequent sequence consisting of that item. Each subsequent
pass starts with a seed set of sequential patterns—the set of sequential
patterns found in the previous pass. This seed set is used to generate new
potentially frequent patterns, called candidate sequences. Each candidate
sequence contains one more item than the seed sequential pattern from
which it was generated (where each event in the pattern may contain one or
multiple items). Recall that the number of instances of items in a sequence is
the length of the sequence. So, all of the candidate sequences in a given pass
will have the same length. We refer to a sequence with length k as a k-sequence. Let Ck denote the set of candidate k-sequences. A pass over the
database finds the support for each candidate k-sequence. The candidates
in Ck with at least min sup form Lk, the set of all frequent k-sequences. This
set then becomes the seed set for the next pass, k + 1. The algorithm
terminates when no new sequential pattern is found in a pass, or no
candidate sequence can be generated.

SPADE: An Apriori-Based Vertical Data Format


Sequential Pattern Mining Algorithm
The Apriori-like sequential pattern mining approach (based on candidate generate-and-test) can also be explored by mapping a sequence database into vertical data format. In vertical data format, the database becomes a set of tuples of the form (itemset :
(sequence ID, event ID)). That is, for a given itemset, we record the
sequence identifier
and corresponding event identifier for which the itemset occurs. The
event identifier
serves as a timestamp within a sequence. The event ID of the ith itemset (or event) in a sequence is i. Note that an itemset can occur in more than one sequence. The set of (sequence ID, event
ID) pairs for a given itemset forms the ID list of the itemset. The
mapping from horizontal to vertical format requires one scan of the
database. A major advantage of using this format is that we can
determine the support of any k-sequence
by simply joining the ID lists of any two of its (k − 1)-length
subsequences. The length
of the resulting ID list (i.e., unique sequence ID values) is equal to
the support of the
k-sequence, which tells us whether the sequence is frequent.
SPADE (Sequential PAttern Discovery using Equivalent classes) is an Apriori-based sequential pattern mining algorithm that uses vertical data format. As with GSP, SPADE requires one scan to find the frequent 1-sequences. To find candidate 2-sequences, we join all pairs of single items if they are frequent (therein, it applies the Apriori property).
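A minimal sketch of the vertical format and the temporal ID-list join described above; the helper names are illustrative:

from collections import defaultdict

def vertical_format(database):
    """database: list of (sid, sequence) with sequence = list of item sets."""
    idlists = defaultdict(list)
    for sid, seq in database:
        for eid, event in enumerate(seq, start=1):
            for item in event:
                idlists[item].append((sid, eid))
    return idlists

def temporal_join(idlist_a, idlist_b):
    # (sid, eid_b) pairs where item a occurs in an earlier event of the same sequence
    return [(sid_b, eid_b) for sid_a, eid_a in idlist_a
            for sid_b, eid_b in idlist_b
            if sid_a == sid_b and eid_a < eid_b]

db = [('S1', [{'a'}, {'b'}, {'a', 'c'}]),
      ('S2', [{'b'}, {'a'}]),
      ('S3', [{'a'}, {'b'}])]
v = vertical_format(db)
ab = temporal_join(v['a'], v['b'])
print(ab, 'support =', len({sid for sid, _ in ab}))   # <a, b> occurs in S1 and S3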

PrefixSpan: Prefix-Projected Sequential Pattern Growth


Pattern growth is a method of frequent-pattern mining that does not require candidate generation. The technique originated in the FP-growth algorithm for transaction databases, presented in Section 5.2.4. The general idea of this approach is as follows: it finds the frequent single items, then compresses this information into a frequent-pattern tree, or FP-tree. The FP-tree is used to generate a set of projected databases, each associated with one frequent item. Each of these databases is mined separately. The algorithm builds prefix patterns, which it concatenates with suffix patterns to find frequent patterns, avoiding candidate generation. Here, we look at PrefixSpan, which extends the pattern-growth approach to instead mine sequential patterns.
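A minimal sketch of PrefixSpan's prefix projection for a single-item prefix; event-internal extensions are omitted to keep the sketch short, and the helper names are illustrative:

from collections import Counter

def project(database, item):
    projected = []
    for seq in database:
        for i, event in enumerate(seq):
            if item in event:
                suffix = seq[i + 1:]        # suffix starts after the first match
                if suffix:
                    projected.append(suffix)
                break
    return projected

db = [[{'a'}, {'b', 'c'}, {'d'}],
      [{'a'}, {'c'}, {'d'}],
      [{'b'}, {'a'}, {'d'}]]
proj_a = project(db, 'a')                    # the <a>-projected database
# Count per-sequence occurrences of items in the projected database
item_counts = Counter(x for seq in proj_a for x in set().union(*seq))
print(proj_a)
print(item_counts)                           # frequent items here extend prefix <a>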
(Model-based clustering, continued.) In the mixture density f(x) = Σ_k π_k f_k(x), π_k is the probability that observation x_i belongs to the kth component (0 < π_k < 1, Σ_k π_k = 1).

• It is difficult to find the MLEs of these parameters directly, so the EM algorithm is used.

