
Data Mining Unit-4

Cluster analysis is the process of partitioning data objects into subsets called clusters, where objects in a cluster are similar to each other but dissimilar to those in other clusters. Effective clustering requires scalability, the ability to handle various data types and shapes, robustness to noise, and interpretability. Applications of clustering include business intelligence, image recognition, web search, biology, and security.

Uploaded by anitha

UNIT-IV

CLUSTERING AND APPLICATIONS


Cluster Analysis:
• Cluster analysis or simply clustering is the process of partitioning a set of data objects (or observations) into subsets.
• Each subset is a cluster, such that objects in a cluster are similar to one another, yet dissimilar to objects in other
clusters. The set of clusters resulting from a cluster analysis can be referred to as a clustering.
• Clustering is also called data segmentation in some applications because clustering partitions large data sets into
groups according to their similarity.
Cluster Analysis Requirements
Scalability: Many clustering algorithms work well on small data sets containing fewer than several hundred data
objects; however, a large database may contain millions or even billions of objects, particularly in Web search scenarios.
Clustering on only a sample of a given large data set may lead to biased results. Therefore, highly scalable clustering
algorithms are needed.
Ability to deal with different types of attributes: Many algorithms are designed to cluster numeric (interval-based)
data. However, applications may require clustering other data types, such as binary, nominal (categorical), and ordinal
data, or mixtures of these data types. Recently, more and more applications need clustering techniques for complex data
types such as graphs, sequences, images, and documents.
Discovery of clusters with arbitrary shape: Many clustering algorithms determine clusters based on Euclidean or
Manhattan distance measures. Algorithms based on such distance measures tend to find spherical clusters with similar
size and density. However, a cluster could be of any shape. Consider sensors, for example, which are often deployed for
environment surveillance. Cluster analysis on sensor readings can detect interesting phenomena. We may want to use
clustering to find the frontier of a running forest fire, which is often not spherical. It is important to develop algorithms
that can detect clusters of arbitrary shape.
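
The two distance measures named above can be sketched as follows. This is a minimal illustration; the function names and the sample points are assumptions, not part of the original notes.

```python
import math

def euclidean(p, q):
    """Straight-line (L2) distance between two equal-length numeric points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """City-block (L1) distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

# Made-up sample points for illustration.
p, q = (1.0, 2.0), (4.0, 6.0)
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7.0
```

All points at a fixed Euclidean distance from a center lie on a circle (a sphere in higher dimensions), which is one way to see why methods that assign objects to the nearest center tend to produce spherical clusters.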
Requirements for domain knowledge to determine input parameters: Many clustering algorithms require users
to provide domain knowledge in the form of input parameters such as the desired number of clusters. Consequently, the
clustering results may be sensitive to such parameters. Parameters are often hard to determine, especially for
high-dimensionality data sets and when users do not yet have a deep understanding of their data. Requiring the
specification of domain knowledge not only burdens users, but also makes the quality of clustering difficult to control.
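
To illustrate how the result depends on a user-supplied number of clusters, here is a small sketch using one-dimensional k-means. k-means is chosen here only as an example of a method that needs this parameter; the function name `kmeans_1d`, the way the start centers are picked, and the readings are all assumptions made for the illustration.

```python
def kmeans_1d(values, k, iterations=20):
    """Tiny 1-D k-means (Lloyd's algorithm) with deterministic start centers."""
    data = sorted(values)
    # Assumption: start centers are spread evenly over the sorted data.
    centers = [data[(2 * i + 1) * len(data) // (2 * k)] for i in range(k)]
    groups = [[] for _ in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in data:
            # Assign each value to its nearest current center.
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Recompute centers as group means; keep the old center if a group is empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return groups

# Made-up readings with three visible groups.
readings = [1.0, 1.5, 2.0, 6.0, 6.5, 7.0, 20.0, 20.5, 21.0]
for k in (2, 3):
    print(k, kmeans_1d(readings, k))
```

With k = 2 the two lower groups are merged into one cluster, while k = 3 separates them, so the same data yields different clusterings depending only on the parameter the user supplies.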
Ability to deal with noisy data: Most real-world data sets contain outliers and/or missing, unknown, or erroneous
data. Sensor readings, for example, are often noisy—some readings may be inaccurate due to the sensing mechanisms,
and some readings may be erroneous due to interferences from surrounding transient objects. Clustering algorithms can
be sensitive to such noise and may produce poor-quality clusters. Therefore, we need clustering methods that are robust
to noise.
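
A small sketch of why noise matters: if a cluster center is computed as the mean of its members, a single erroneous reading can pull it far from the bulk of the data, whereas the median barely moves. The readings below are made-up values used only for illustration.

```python
from statistics import mean, median

# Made-up sensor readings; the last value in `noisy` is one erroneous reading.
clean = [20.1, 19.8, 20.3, 20.0, 19.9]
noisy = clean + [95.0]

mean_shift = abs(mean(noisy) - mean(clean))        # mean-based center moves a lot
median_shift = abs(median(noisy) - median(clean))  # median-based center barely moves

print(round(mean_shift, 2), round(median_shift, 2))
```

This is only one simple illustration of robustness; it does not describe any particular clustering algorithm from the notes.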
Incremental clustering and insensitivity to input order: In many applications, incremental updates (representing
newer data) may arrive at any time. Some clustering algorithms cannot incorporate incremental updates into existing
clustering structures and, instead, have to recompute a new clustering from scratch. Clustering algorithms may also be
sensitive to the input data order. That is, given a set of data objects, clustering algorithms may return dramatically
different clusterings depending on the order in which the objects are presented. Incremental clustering algorithms and
algorithms that are insensitive to the input order are needed.
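
The sketch below shows a simple single-pass ("leader") clustering scheme. It is incremental, since each new object is placed without recomputing earlier assignments, but it is also sensitive to input order. The function name, the threshold, and the data are assumptions chosen for illustration.

```python
def leader_clustering(points, threshold):
    """Single-pass clustering: a point joins the first cluster whose leader
    is within `threshold`; otherwise it starts a new cluster as its leader."""
    clusters = []  # each cluster is a list; its first element is the leader
    for p in points:
        for cluster in clusters:
            if abs(p - cluster[0]) <= threshold:
                cluster.append(p)
                break
        else:
            # No existing leader was close enough.
            clusters.append([p])
    return clusters

print(leader_clustering([1, 2, 3, 4], threshold=1.5))  # [[1, 2], [3, 4]]
print(leader_clustering([2, 1, 3, 4], threshold=1.5))  # [[2, 1, 3], [4]]
```

The same four objects produce two different clusterings because the first object seen becomes the leader that later objects are compared against.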
Capability of clustering high-dimensionality data: A data set can contain numerous dimensions or attributes. When
clustering documents, for example, each keyword can be regarded as a dimension, and there are often thousands of
keywords. Most clustering algorithms are good at handling low-dimensional data such as data sets involving only two or
three dimensions. Finding clusters of data objects in a high-dimensional space is challenging, especially considering that
such data can be very sparse and highly skewed.
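
The document example above can be sketched as follows: each distinct keyword becomes one dimension, and most entries of each document vector are zero. The three short documents are made up for illustration; real collections would have thousands of dimensions.

```python
# Made-up toy documents.
docs = [
    "forest fire sensor reading",
    "web search keyword index",
    "sensor noise reading error",
]

# Each distinct keyword becomes one dimension of the vector space.
vocabulary = sorted({word for doc in docs for word in doc.split()})

def to_vector(doc):
    """Keyword-count vector of `doc` over the shared vocabulary."""
    words = doc.split()
    return [words.count(term) for term in vocabulary]

vectors = [to_vector(doc) for doc in docs]

# Fraction of zero entries across all document vectors.
zeros = sum(v.count(0) for v in vectors)
sparsity = zeros / (len(vectors) * len(vocabulary))
print(len(vocabulary), sparsity)
```

Even with three tiny documents, more than half of the vector entries are zero; the proportion grows as the vocabulary grows.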
Constraint-based clustering: Real-world applications may need to perform clustering under various kinds of
constraints. Suppose that your job is to choose the locations for a given number of new automatic teller machines (ATMs)
in a city. To decide upon this, you may cluster households while considering constraints such as the city’s rivers and
highway networks and the types and number of customers per cluster. A challenging task is to find data groups with
good clustering behavior that satisfy specified constraints.
Interpretability and usability: Users want clustering results to be interpretable, comprehensible, and usable. That is,
clustering may need to be tied in with specific semantic interpretations and applications. It is important to study how an
application goal may influence the selection of clustering features and clustering methods.
Cluster Analysis Applications
1. Business Intelligence
2. Image Recognition
3. Web Search
4. Biology
5. Security
Types of Data in Cluster Analysis
• Data Structures
• Interval-Valued (Numeric) Variables
• Binary Variables
• Categorical Variables
• Ordinal Variables
• Variables of Mixed Types
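
As a hedged sketch of how objects described by variables of mixed types might be compared, the function below averages a per-attribute dissimilarity in [0, 1]: range-normalized difference for numeric values, simple mismatch for categorical or binary values, and normalized rank difference for ordinal values. This is one common approach, not necessarily the one these notes go on to use; the schema format, attribute names, and records are hypothetical.

```python
def mixed_dissimilarity(a, b, schema):
    """Average per-attribute dissimilarity in [0, 1] for two records.

    `schema` maps attribute name -> ("numeric", (lo, hi)),
    ("nominal", None), or ("ordinal", [levels in increasing order]).
    """
    total = 0.0
    for name, (kind, info) in schema.items():
        if kind == "numeric":
            lo, hi = info
            d = abs(a[name] - b[name]) / (hi - lo)
        elif kind == "ordinal":
            # Map ranks onto [0, 1], then treat them like numeric values.
            top = len(info) - 1
            d = abs(info.index(a[name]) - info.index(b[name])) / top
        else:  # nominal or binary: simple mismatch
            d = 0.0 if a[name] == b[name] else 1.0
        total += d
    return total / len(schema)

# Hypothetical schema and records.
schema = {
    "age": ("numeric", (20, 60)),
    "smoker": ("nominal", None),
    "grade": ("ordinal", ["low", "medium", "high"]),
}
x = {"age": 30, "smoker": "yes", "grade": "low"}
y = {"age": 50, "smoker": "no", "grade": "high"}
print(mixed_dissimilarity(x, y, schema))  # (0.5 + 1.0 + 1.0) / 3
```

Scaling every attribute to the same [0, 1] range keeps any single variable type from dominating the combined measure.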
Basic Clustering Methods
