0% found this document useful (0 votes)

61 views49 pages

DM Mod5

This document discusses cluster analysis techniques for data mining. It begins by defining cluster analysis as finding groups of similar objects within data. Various applications of cluster analysis are described, including in biology, information retrieval, climate studies, psychology, medicine, and business. Common types of clusterings like hierarchical and partitional are introduced. Specific clustering algorithms covered include k-means clustering, which groups data into k clusters based on minimizing distance to cluster centroids. Issues with k-means and an extension called bisecting k-means are also summarized. Finally, agglomerative hierarchical clustering is briefly mentioned.

Uploaded by

Srushti PS

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

61 views49 pages

DM Mod5

Uploaded by

Srushti PS

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 49

DATA MINING AND DATA WARE HOUSING (15CS651) VI SEM CSE

MODULE-5

Cluster Analysis
Finding groups of objects such that the objects in a group will be similar (or related) to one
another and different from (or unrelated to) the objects in other groups.
The greater the similarity within a group and the greater the difference between groups,the better
or more distinct the clustering.

Cluster analysis divides data into groups (clusters) that are meaningful, useful, or both.

In the context of understanding data, clusters are potential classes and cluster analysis is the
study of techniques for automatically finding classes.

Clustering: Applications

Biology: biologists have applied clustering to analyze the large amounts of genetic information
that are now available.
For example, clustering has been used to find groups of genes that have similar functions.

Information Retrieval:. The World Wide Web consists of billions of Web pages, and the results
of a query to a search engine can return thousands of pages. Clustering can be used to group
these search results into a small number of clusters, each of which captures a particular aspect of
the query.

Climate: Understanding the Earth's climate requires finding patterns in the atmosphere and
ocean. To that end, cluster analysis has been applied to find patterns in the atmospheric pressure
of polar regions and are as of the ocean that have a significant impact on land climate.

Psychology and Medicine: An illness or condition frequently has a number of variations, and
cluster analysis can be used to identify these different subcategories.

Page 1
DATA MINING AND DATA WARE HOUSING (15CS651) VI SEM CSE

Business: Businesses collect large amounts of information on current and potential customers.
Clustering can be used to segment customers into a small number of groups for additional
analysis and marketing activities.

Types of Clusterings

A clustering is a set of clusters

Important distinction between hierarchical and partitional sets of clusters

Partitional Clustering

- A division data objects into non-overlapping subsets (clusters) such that each data
object is in exactly one subset

- Hierarchical clustering

A set of nested clusters organized as a hierarchical tree

Types of Clusters

➢ Well-separated clusters
➢ Center-based clusters
➢ Contiguous clusters
➢ Density-based clusters
➢ Property or Conceptual

Well-Separated Clusters:
A cluster is a set of points such that any point in a cluster is closer (or more similar) to every
other point in the cluster than to any point not in the cluster

Page 2
DATA MINING AND DATA WARE HOUSING (15CS651) VI SEM CSE

Center-based (proto type based)

- A cluster is a set of objects such that an object in a cluster is closer (more similar)
to the “center” of a cluster, than to the center of any other cluster
- The center of a cluster is often a centroid, the average of all the points in the
cluster, or a medoid, the most “representative” point of a cluster

Contiguous Cluster (Graph based)

- A cluster is a set of points such that a point in a cluster is closer (or more similar)
to one or more other points in the cluster than to any point not in the cluster.

Page 3
DATA MINING AND DATA WARE HOUSING (15CS651) VI SEM CSE

Density-based

- A cluster is a dense region of points, which is separated by low-density regions,

from other regions of high density.
- Used when the clusters are irregular or intertwined, and when noise and outliers
are present.

Shared Property or Conceptual Clusters

- Finds clusters that share some common property or represent a particular concept.

K-means Clustering
➢ Partitional clustering approach
➢ Each cluster is associated with a centroid (center point)
➢ Each point is assigned to the cluster with the closest centroid
➢ Number of clusters, K, must be specified

Page 4
DATA MINING AND DATA WARE HOUSING (15CS651) VI SEM CSE

➢ Initial centroids are often chosen randomly.

➢ Clusters produced vary from one run to another.
➢ The centroid is (typically) the mean of the points in the cluster. ➢
‘Closeness’ is measured by Euclidean distance, cosine similarity,
correlation, etc.
➢ K-means will converge for common similarity measures mentioned above.
➢ Most of the convergence happens in the first little iteration.
➢ Often the stopping condition is changed to ‘Until relatively few points
change clusters’
➢ Complexity is O( n * K * I * d )
➢ n = number of points, K = number of clusters,
I = number of iterations, d = number of attributes

Evaluating K-means Clusters

Most common measure is Sum of Squared Error (SSE).

For each point, the error is the distance to the nearest cluster.

To get SSE, we square these errors and sum them.

x is a data point in cluster Ci and mi is the representative point for cluster Ci can show that mi
corresponds to the center (mean) of the cluster.

Given two clusters, we can choose the one with the smallest error.

Page 5
DATA MINING AND DATA WARE HOUSING (15CS651) VI SEM CSE

One easy way to reduce SSE is to increase K, the number of clusters.

A good clustering with smaller K can have a lower SSE than a poor clustering with higher K.

Problems with Selecting Initial Points

If there are K ‘real’ clusters then the chance of selecting one centroid from each cluster is small.

Chance is relatively small when K is large

If clusters are the same size, n, then

For example, if K = 10, then probability = 10!/1010 = 0.00036

Sometimes the initial centroids will readjust themselves in ‘right’ way, and sometimes they don’t

K-means: Additional Issues

Handling Empty Clusters
Basic K-means algorithm can yield empty clusters
Several strategies to address this..
➢ Choose the point that contributes most to SSE
➢ Choose a point from the cluster with the highest SSE
➢ If there are several empty clusters, the above can be repeated several times
Updating Centers Incrementally

In the basic K-means algorithm, centroids are updated after all points are assigned to a centroid

An alternative is to update the centroids after each assignment (incremental approach)

- Each assignment updates zero or two centroids

- More expensive

- Introduces an order dependency

- Never get an empty cluster

- Can use “weights” to change the impact

Page 6
DATA MINING AND DATA WARE HOUSING (15CS651) VI SEM CSE

Pre-processing and Post-processing

Pre-processing
➢ Normalize the data
➢ Eliminate outliers
Post-processing
➢ Eliminate small clusters that may represent outliers
➢ Split ‘loose’ clusters, i.e., clusters with relatively high SSE
➢ Merge clusters that are ‘close’ and that have relatively low SSE
➢ Can use these steps during the clustering process

Strengths and Weaknesses of K-means (Limitations)

➢ K-means is simple and can be used for a wide variety of data types.
➢ It is also quite efficient, even though multiple runs are often performed.
➢ K-means is not suitable for all types of data.
➢ K-means has problems when clusters are of differing
o Sizes
o Densities
o Non-globular shapes
➢ K-means has problems when the data contains outliers.

BisectingK-means

The bisecting K-means algorithm is a straightforward extension of the basic K-means algorithm
that is based on a simple idea: to obtain K clusters, split the set of all points into two clusters,
select one of these clusters to split, and so on, until K clusters have been produced.

Page 7
DATA MINING AND DATA WARE HOUSING (15CS651) VI SEM CSE

There are a number of different ways to choose which cluster to split. We can choose the largest
cluster at each step, choose the one with the largest SSE, or use a criterion based on both size and
SSE. Different choices result in different clusters.

Page 8
DATA MINING AND DATA WARE HOUSING (15CS651) VI SEM CSE

Page 9
DATA MINING AND DATA WARE HOUSING (15CS651) VI SEM CSE

Agglomerative Hierarchical Clustering:

➢ More popular hierarchical clustering technique.
➢ Produces a set of nested clusters organized as a hierarchical tree.
➢ Can be visualized as a dendrogram.
➢ A tree like diagram that records the sequences of merges or splits.
6 5

4
3 4
0.2 2
5
0.15 2

0.1
1
3 1
0.05

0
1 3 2 5 4 6

Do not have to assume any particular number of clusters

- Any desired number of clusters can be obtained by ‘cutting’ the dendogram at the
proper level

Page 10
DATA MINING AND DATA WARE HOUSING (15CS651) VI SEM CSE

Basic algorithm is

How to Define Inter-Cluster Similarity (Proximity of two clusters):

MIN
MAX
Group Average

1) Single Link or MIN

For the single link or MIN version of hierarchical clustering, the proximity of two clusters is
defined as the minimum of the distance (maximum of the similarity) between any two points in
the two different clusters.

we shall use sample data that consists of 6 two-dimensional points, which are shown in Figure
8.15. The r and g coordinates of the points and the Euclidean distances between them are shown
in Tables 8.3 and 8.4. respectively.

Page 11
DATA MINING AND DATA WARE HOUSING (15CS651) VI SEM CSE

Figure 8.16 shows the result of applying the single link technique to our example data set of six
points. Figure 8.16(a) shows the nested clusters as a sequence of nested ellipses, where the
numbers associated with the ellipses indicate the order of the clustering. Figure S.16(b) shows
the same information, but as a dendrogram.

The height at which two clusters are merged in the dendrogram reflects the distance of the two
clusters.

For instance, from Table 8.4, we see that the distance between points 3 and 6 is 0.11, and that is
the height at which they are joined into one cluster in the dendrogram. As another example, the
distance between clusters {3,6} and {2,5} is given by

Page 12
DATA MINING AND DATA WARE HOUSING (15CS651) VI SEM CSE

2) Complete Link or MAX or CLIQUE

For the complete link or MAX version of hierarchical clustering, the proximity of two clusters is
defined as the maximum of the distance (minimum of the similarity) between any two points in
the two different clusters. Using graph terminology, if you start with all points as singleton
clusters and add links between points one at a time, shortest links first, then a group of points is
not a cluster until all the points in it are completely linked, i.e., form a clique.

Example 8.5 (Complete Link). Figure 8.17 shows the results of applying MAX to the sample
data set of six points. As with single link, points 3 and 6 are merged first. However, {3,6} is
merged with {4}, instead of {2,5} or {1}.

Page 13
DATA MINING AND DATA WARE HOUSING (15CS651) VI SEM CSE

3) Group Average
For the group average version of hierarchical clustering, the proximity of two clusters is defined
as the average pairwise proximity among all pairs of points in the different clusters.

This is an intermediate approach between the single and complete link approaches. Thus, for
group average, the cluster proximity proximity(Ci,Cj) of clusters C,i and Cj', which are of size
mi and mj, respectively, is expressed by the following equation

Page 14
DATA MINING AND DATA WARE HOUSING (15CS651) VI SEM CSE

Figure 8.18 shows the results of applying the group average approach to the sample data set of
six points. To illustrate how group average works, we calculate the distance between some
clusters.

Key Issues in Agglomerative Hierarchical Clustering (Strengths and Weaknesses)

➢ Once a decision is made to combine two clusters, it cannot be undone.
➢ No objective function is directly minimized.
➢ Different schemes have problems with one or more of the following:
- Sensitivity to noise and outliers
- Difficulty handling different sized clusters and convex shapes
- Breaking large clusters

Page 15
DATA MINING AND DATA WARE HOUSING (15CS651) VI SEM CSE

Page 16
DATA MINING AND DATA WARE HOUSING (15CS651) VI SEM CSE

Page 17
DATA MINING AND DATA WARE HOUSING (15CS651) VI SEM CSE

Page 18
DATA MINING AND DATA WARE HOUSING (15CS651) VI SEM CSE

Page 19
DATA MINING AND DATA WARE HOUSING (15CS651) VI SEM CSE

The DBSCAN Algorithm

DBSCAN is a density-based algorithm.
➢ Density = number of points within a specified radius (Eps)
➢A point is a core point if it has more than a specified number of
points (MinPts) within Eps
➢ These are points that are at the interior of a cluster
➢A border point has fewer than MinPts within Eps, but is in the
neighborhood of a core point
➢ A noise point is any point that is not a core point or a border point.

Page 20
DATA MINING AND DATA WARE HOUSING (15CS651) VI SEM CSE

Eliminate noise points

Perform clustering on the remaining points

Strengths and weaknesses of DBSCAN

➢ It is relatively Resistant to Noise.
➢ It can Can handle clusters of different shapes and sizes
➢ Does NOT Work Well when the clusters having Varying densities
➢ Does NOT Work Well With High-dimensional data.

Page 21
DATA MINING AND DATA WARE HOUSING (15CS651) VI SEM CSE

Page 22
DATA MINING AND DATA WARE HOUSING (15CS651) VI SEM CSE

Cluster Evaluation:
Why do we want to evaluate them?
➢ To avoid finding patterns in noise.
➢ To compare clustering algorithms.
➢ To compare two sets of clusters. ➢
To compare two clusters.
Different Aspects of Cluster Validation:
1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-
random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally
given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without reference to
external information.
- Use only the data
1. Comparing the results of two different sets of cluster analyses to determine which is
better.
2. Determining the ‘correct’ number of clusters.
For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire
clustering or just individual clusters.

Supervised Cluster Evaluation Using Cohesion and Separation

Cluster cohesion (compactness, tightness), which determine how closely related the objects in a
cluster are.

cluster separation (isolation), which determine how distinct or well separated a cluster is from
other clusters.

Graph-Based View of Cohesion and Separation:

For graph-based clusters, the cohesion of a cluster can be defined as the sum of the weights of
the links in the proximity graph that connect points within the cluster.

Page 23
DATA MINING AND DATA WARE HOUSING (15CS651) VI SEM CSE

The separation between two clusters can be measured by the sum of the weights of the links from
points in one cluster to points in the other cluster.

Mathematically, cohesion and separation for a graph-based cluster can be expressed using
Equations 8.9 and 8.10, respectively.

Prototype-Based View of Cohesion and Separation

For prototype-based clusters, the cohesion of a cluster can be defined as the sum of the
proximities with respect to the prototype (centroid or medoid) of the cluster. Similarly, the
separation between two clusters can be measured
by the proximity of the two cluster prototypes

Page 24
DATA MINING AND DATA WARE HOUSING (15CS651) VI SEM CSE

Measuring Cluster Validity Via Correlation

• Two matrices
• Similarity or Distance Matrix
• One row and one column for each data point
• An entry is the similarity or distance of the associated pair of points
• “Incidence” Matrix
• One row and one column for each data point
• An entry is 1 if the associated pair of points belong to the same cluster
• An entry is 0 if the associated pair of points belongs to different clusters
• Compute the correlation between the two matrices
• Since the matrices are symmetric, only the correlation between
n(n-1) / 2 entries needs to be calculated.
• High correlation (positive for similarity, negative for distance) indicates that points that
belong to the same cluster are close to each other.
• Not a good measure for some density or contiguity based clusters.
Supervised Measures of Cluster Validity (External Measures for Clustering Validity)
Two different kinds of approaches.

The first set of techniques use measures from classification, such as entropy, purity, and the F-
measure. These measures evaluate the extent to which a cluster contains objects of a single
class.

Page 25
DATA MINING AND DATA WARE HOUSING (15CS651) VI SEM CSE

The second group of methods is related to the similarity measures for binary data, such as the
Jaccard measure . These approaches measure the extent to which two objects that are in the same
class are in the same cluster and vice versa.

Classification-Oriented Measures of Cluster Validity

• Assume that the data is labeled with some class labels
• E.g., documents are classified into topics, people classified according to their
income, politicians classified according to the political party.
• This is called the “ground truth”
• In this case we want the clusters to be homogeneous with respect to classes
• Each cluster should contain elements of mostly one class
Each class should ideally be assigned to a single cluster
Confusion matrix

Measures:

Page 26
DATA MINING AND DATA WARE HOUSING (15CS651) VI SEM CSE

Example:

Page 27
DATA MINING AND DATA WARE HOUSING (15CS651) VI SEM CSE

Page 28
DATA MINING AND DATA WARE HOUSING (15CS651) VI SEM CSE

Page 29
DATA MINING AND DATA WARE HOUSING (15CS651) VI SEM CSE

Page 30
DATA MINING AND DATA WARE HOUSING (15CS651) VI SEM CSE

Page 31
DATA MINING AND DATA WARE HOUSING (15CS651) VI SEM CSE

Similarity-Oriented Measures of Cluster Validity

The measures that we discuss in this section are all based on the premise that any two objects
that are in the same cluster should be in the same class and vice versa.
We can view this approach to cluster validity as involving the comparison of two matrices:
(1)The ideal cluster similarity matrix discussed previously, which has a 1 in the (i,j)th entry if
two objects, i. and j, are in the same cluster and 0, otherwise.

Page 32
DATA MINING AND DATA WARE HOUSING (15CS651) VI SEM CSE

(2) An ideal class similarity matrix defined with respect to class labels, which has a 1 in the (i,j)
th entry if two objects, i and j, belong to the same class, and a 0 otherwise.

Page 33
DATA MINING AND DATA WARE HOUSING (15CS651) VI SEM CSE

Page 34
DATA MINING AND DATA WARE HOUSING (15CS651) VI SEM CSE

Page 35
DATA MINING AND DATA WARE HOUSING (15CS651) VI SEM CSE

In particular, the simple matching coefficient, which is known as the Rand statistic in this
context, and the Jaccard coefficient are two of the most frequently used cluster validity measures.

Page 36
DATA MINING AND DATA WARE HOUSING (15CS651) VI SEM CSE

Page 37
DATA MINING AND DATA WARE HOUSING (15CS651) VI SEM CSE

Page 38
DATA MINING AND DATA WARE HOUSING (15CS651) VI SEM CSE

Density-Based Clustering
• Grid-Based Clustering
• Subspace Clustering
• CLIQUE
• DENCLUE: A Kernel-Based Scheme for Density-Based
• Clustering

Grid-Based Clustering
The idea is to split the possible values of each attribute into a number of contiguous intervals,
creating a set of grid cells.

Objects can be assigned to grid cells in one pass through the data, and information about each
cell, such as the number of points in the cell, can also be gathered at the same time.

Defining Grid Cells: This is a key step in the process, but also the least well defined, as there

Page 39
DATA MINING AND DATA WARE HOUSING (15CS651) VI SEM CSE

are many ways to split the possible values of each attribute into a number of contiguous
intervals.

For continuous attributes, one common approach is to split the values into equal width intervals.
If this approach is applied to each attribute, then the resulting grid cells all have the same olume,
and the density of a cell is conveniently defined as the number of points in the cell.

The Density of Grid Cells: A natural way to define the density of a grid cell (or a more generally
shaped region) is as the number of points divided by the volume of the region. In other words,
density is the number of points per amount of space, regardless of the dimensionality of that
space

Example: Figure 9.10 shows two sets of two dimensional points divided into 49 cells using a 7-
by-7 grid. The first set contains 200 points generated from a uniform distribution over a circle
centered at (2, 3) of radius 2, while the second set has 100 points generated from a uniform
distribution over a circle centered at (6, 3) of radius 1. The counts for the grid cells are shown in
Table 9.2.

Page 40
DATA MINING AND DATA WARE HOUSING (15CS651) VI SEM CSE

CLIQUE
CLIQUE (Clustering In QUEst) is a grid-based clustering algorithm that methodically finds
subspace clusters. It is impractical to check each subspace for clusters since the number of such
subspaces is exponential in the number of dimensions. Instead, CLIQUE relies on the following
property;
Monotonicity property of density-based clusters If a set of points forms a density-based cluster in
k dimensions (attributes), then the same set of points is also part of a density-based cluster in all
possible subsets of those dimensions.

DENCLUE: A Kernel-Based Scheme for Density-Based Clustering

DENCLUE (DENsity ClUstEring) is a density-based clustering approach that models the overall
density of a set of points as the sum of influence functions associated with each point. The
resulting overall density function will have local peaks, i.e., local density maxima, and these
local peaks can be used to define clusters in a natural way. Specifically, for each data point, a hill
climbing procedure finds the nearest peak associated with that point, and the set of all data points
associated with a particular peak (called a local density attractor) becomes a cluster.

Page 41
DATA MINING AND DATA WARE HOUSING (15CS651) VI SEM CSE

Graph-Based Clustering
Graph-Based clustering uses the proximity graph
➢ Start with the proximity matrix
➢ Consider each point as a node in a graph
➢Each edge between two nodes has a weight which is the proximity between the
two points
➢ Initially the proximity graph is fully connected
➢ MIN (single-link) and MAX (complete-link) can be viewed as starting with this
graph
In the simplest case, clusters are connected components in the graph.
Sparsification
The amount of data that needs to be processed is drastically reduced
➢Sparsification can eliminate more than 99% of the entries in a proximity
matrix
➢ The amount of time required to cluster the data is drastically reduced
➢ The size of the problems that can be handled is increased. Clustering
may work better
➢Sparsification techniques keep the connections to the most similar (nearest)
neighbors of a point while breaking the connections to less similar points. ➢ The
nearest neighbors of a point tend to belong to the same class as the point
itself.

Page 42
DATA MINING AND DATA WARE HOUSING (15CS651) VI SEM CSE

➢ This reduces the impact of noise and outliers and sharpens the distinction between
clusters.
Sparsification facilitates the use of graph partitioning algorithms (or algorithms based on graph
partitioning algorithms.
- Chameleon and Hypergraph-based Clustering

Limitations of Current Merging Schemes

Existing merging schemes in hierarchical clustering algorithms are static in nature
- MIN or CURE:
merge two clusters based on their closeness (or minimum distance)
- GROUP-AVERAGE:
merge two clusters based on their average connectivity
Chameleon: Clustering Using Dynamic Modeling

Adapt to the characteristics of the data set to find the natural clusters
Use a dynamic model to measure the similarity between clusters
➢Main property is the relative closeness and relative inter-connectivity of the
cluster
➢ Two clusters are combined if the resulting cluster shares certain properties with
the constituent clusters
➢ The merging scheme preserves self-similarity
Steps
Preprocessing Step:
Represent the Data by a Graph
➢Given a set of points, construct the k-nearest-neighbor (k-NN) graph to capture
the relationship between a point and its k nearest neighbors

Page 43
DATA MINING AND DATA WARE HOUSING (15CS651) VI SEM CSE

➢ Concept of neighborhood is captured dynamically (even if region is sparse)

Phase 1: Use a multilevel graph partitioning algorithm on the graph to find a large number of
clusters of well-connected vertices
➢Each cluster should contain mostly points from one “true” cluster, i.e., is a sub-
cluster of a “real” cluster.
Phase 2: Use Hierarchical Agglomerative Clustering to merge sub-clusters
➢ Two clusters are combined if the resulting cluster shares certain properties with
the constituent clusters
➢ Two key properties used to model cluster similarity:
Relative Interconnectivity: Absolute interconnectivity of two clusters normalized by the internal
connectivity of the clusters
Relative Closeness: Absolute closeness of two clusters normalized by the internal closeness of
the clusters

SNN Clustering Algorithm:

1)Compute the similarity matrix
This corresponds to a similarity graph with data points for nodes and edges whose weights
are the similarities between data points
2)Sparsify the similarity matrix by keeping only the k most similar neighbors
This corresponds to only keeping the k strongest links of the similarity graph
3)Construct the shared nearest neighbor graph from the sparsified similarity matrix.
At this point, we could apply a similarity threshold and find the connected components to
obtain the clusters (Jarvis-Patrick algorithm)
4)Find the SNN density of each Point.
Using a user specified parameters, Eps, find the number points that have an SNN imilarity
of Eps or greater to each point. This is the SNN density of the point.
5)Find the core points
Using a user specified parameter, MinPts, find the core points, i.e., all points that have an
SNN density greater than MinPts.
6)Form clusters from the core points.
If two core points are within a radius, Eps, of each other they are place in the same cluster

Page 44
DATA MINING AND DATA WARE HOUSING (15CS651) VI SEM CSE

7)Discard all noise points.

All non-core points that are not within a radius of Eps of a core point are discarded .
8)Assign all non-noise, non-core points to clusters.
This can be done by assigning such points to the nearest core point

Limitations of SNN Clustering

Complexity of SNN Clustering is high
- O( n * time to find numbers of neighbor within Eps)
- In worst case, this is O(n2)

Scalable Clustering Algorithms

BIRCH
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is a highly efficient
clustering technique for data in Euclidean vector spaces, i'e., data for which averages make
sense. BIRCH can efficiently cluster such data with one pass and can improve that clustering
with additional passes. BIRCH can also deal effectively with outliers.

Page 45
DATA MINING AND DATA WARE HOUSING (15CS651) VI SEM CSE

CURE
CURE (Clustering Using REpresentatives) is a clustering algorithm that uses a variety of
different techniques to create an approach that can handle large data sets, outliers, and clusters
with non-spherical shapes and non-uniform sizes. CURE represents a cluster by using multiple
representative points from the cluster. These points will, in theory, capture the geometry and
shape of the cluster. The first representative point is chosen to be the point farthest from
the center of the cluster, while the remaining points are chosen so that they are farthest from all
the previously chosen points. In this way, the representative points are naturally relatively well
distributed.

Page 46
DATA MINING AND DATA WARE HOUSING (15CS651) VI SEM CSE

Question Bank: Clustering Analysis

1. Explain desired features of cluster analysis.

2. Explain how distance between a pair of points can be computed.
3. Write a short note on density-based methods.
4. Write and explain basic K-Means algorithm.
5. Explain DBSCAN clustering algorithm.
6. What are the limitations of K Means algorithm.
7. Explain cluster analysis methods briefly.
8. Explain agglomerative hierarchical clustering.
9. Explain bisecting K Means algorithm.
10. Distinguish between various types of clustering.
11. What are unsupervised, supervised and relative evaluation measures that are applied to
judge various aspects of cluster validity.
12. Explain different types of defining proximity between clusters.
13. Differentiate between exclusive and overlapping clustering.
14. What are the various issues considered for cluster validation? Explain different
evaluation measures used for cluster validity.
15. Explain unsupervised cluster evaluation using cohesion and separation.

Page 47
DATA MINING AND DATA WARE HOUSING (15CS651) VI SEM CSE

16. Explain unsupervised cluster evaluation using proximity matrix.

17. List and explain classification-oriented measures of cluster validity.
18. Explain similarity - oriented measures of cluster validity.
19. Explain grid-based clustering algorithm.
20. Explain subspace clustering.
21. Write and explain CLIQUE algorithm.
22. Write and explain DENCLUE algorithm.
23. Explain different graph-based clustering.

Page 48
DATA MINING AND DATA WARE HOUSING (15CS651) VI SEM CSE

DWDM Unit 5
No ratings yet
DWDM Unit 5
43 pages
Chapter 3 Unsupervised Learning
No ratings yet
Chapter 3 Unsupervised Learning
45 pages
Clustering
No ratings yet
Clustering
80 pages
Data Mining Lecture Notes-1: Bsc. (H) Computer Science: Vi Semester Teacher: Ms. Sonal Linda
No ratings yet
Data Mining Lecture Notes-1: Bsc. (H) Computer Science: Vi Semester Teacher: Ms. Sonal Linda
40 pages
Grouping
No ratings yet
Grouping
98 pages
Clustering Analysis
No ratings yet
Clustering Analysis
102 pages
BIS 541 Ch04 20-21 S
No ratings yet
BIS 541 Ch04 20-21 S
82 pages
cz4041 10 Clustering
No ratings yet
cz4041 10 Clustering
67 pages
Module 5
No ratings yet
Module 5
91 pages
Survey of Clustering Data Mining Techniques: Pavel Berkhin
100% (1)
Survey of Clustering Data Mining Techniques: Pavel Berkhin
56 pages
DEU CSC5045 Intelligent System Applications Using Fuzzy - 4+clustering
No ratings yet
DEU CSC5045 Intelligent System Applications Using Fuzzy - 4+clustering
61 pages
CLUSTRING
No ratings yet
CLUSTRING
13 pages
ML - 8
No ratings yet
ML - 8
70 pages
Final ML Unit3 May24
No ratings yet
Final ML Unit3 May24
154 pages
DWDM Unit-5
No ratings yet
DWDM Unit-5
52 pages
Datamining-Lect5 - Clustering. The K-Means Algorithm. Hierarchical Clustering. The DBSCAN Algorithm. Clustering Evaluation
No ratings yet
Datamining-Lect5 - Clustering. The K-Means Algorithm. Hierarchical Clustering. The DBSCAN Algorithm. Clustering Evaluation
110 pages
DM 4
No ratings yet
DM 4
76 pages
Unit - 4 DM
No ratings yet
Unit - 4 DM
24 pages
Unit 5
No ratings yet
Unit 5
85 pages
Lecture 01 - Unsupervised Learning (Optional)
No ratings yet
Lecture 01 - Unsupervised Learning (Optional)
57 pages
Cluster Analysis
No ratings yet
Cluster Analysis
26 pages
BMW M-5
No ratings yet
BMW M-5
48 pages
Datamining Lect8
No ratings yet
Datamining Lect8
79 pages
Module5 QB 1
No ratings yet
Module5 QB 1
21 pages
UNIT5
No ratings yet
UNIT5
60 pages
Chapter 5 Clustering
No ratings yet
Chapter 5 Clustering
40 pages
AIMLB PGP 2024 Session 12
No ratings yet
AIMLB PGP 2024 Session 12
46 pages
Clustering
No ratings yet
Clustering
65 pages
22AIP3101A Session 9
No ratings yet
22AIP3101A Session 9
38 pages
Lecture 9 Clustering
No ratings yet
Lecture 9 Clustering
36 pages
DMDWUNITV
No ratings yet
DMDWUNITV
72 pages
DMDW 5th Module
No ratings yet
DMDW 5th Module
28 pages
Clustering
No ratings yet
Clustering
38 pages
Week 9 Part 1 Clustering
No ratings yet
Week 9 Part 1 Clustering
44 pages
Cluster Analysis
No ratings yet
Cluster Analysis
36 pages
Unit 4 Descriptive Modeling
No ratings yet
Unit 4 Descriptive Modeling
18 pages
Unit 5
No ratings yet
Unit 5
63 pages
Source: Diginotes - In: Cluster Analysis
No ratings yet
Source: Diginotes - In: Cluster Analysis
48 pages
DW & DM Unit 4 Notes
No ratings yet
DW & DM Unit 4 Notes
40 pages
ML 07 Clustering
No ratings yet
ML 07 Clustering
56 pages
Fds Unit03
No ratings yet
Fds Unit03
11 pages
Assignment 4
No ratings yet
Assignment 4
40 pages
ML Unit Iii
No ratings yet
ML Unit Iii
12 pages
Clustering
No ratings yet
Clustering
34 pages
Clustering Algorithm
No ratings yet
Clustering Algorithm
17 pages
Cluster Analysis: G Sreenivas
No ratings yet
Cluster Analysis: G Sreenivas
29 pages
Sathyabama Institute of Science and Technology SIT1301-Data Mining and Warehousing
No ratings yet
Sathyabama Institute of Science and Technology SIT1301-Data Mining and Warehousing
22 pages
Comparison of Different Clustering Algorithms Using WEKA Tool
No ratings yet
Comparison of Different Clustering Algorithms Using WEKA Tool
3 pages
Clustering Notes
No ratings yet
Clustering Notes
17 pages
Fundamentals of Data Science Unit 3
No ratings yet
Fundamentals of Data Science Unit 3
15 pages
Unit 15
No ratings yet
Unit 15
26 pages
Graph Partitioning Advance Clustering Technique
No ratings yet
Graph Partitioning Advance Clustering Technique
14 pages
Clustering
No ratings yet
Clustering
11 pages
Cluster
100% (1)
Cluster
72 pages
Unit Iv
No ratings yet
Unit Iv
14 pages
An Enhanced Clustering Algorithm To Analyze Spatial Data: Dr. Mahesh Kumar, Mr. Sachin Yadav
No ratings yet
An Enhanced Clustering Algorithm To Analyze Spatial Data: Dr. Mahesh Kumar, Mr. Sachin Yadav
3 pages
Unit 4
No ratings yet
Unit 4
4 pages
Byt BZB 1 8 Tsfi Eng PDF
No ratings yet
Byt BZB 1 8 Tsfi Eng PDF
51 pages
514 614 L28 32H Fuel Oil System
100% (1)
514 614 L28 32H Fuel Oil System
30 pages
ASTM A249 Stainless Steel Tubes
No ratings yet
ASTM A249 Stainless Steel Tubes
10 pages
Grade 8 Pretechnical
No ratings yet
Grade 8 Pretechnical
8 pages
TMS374 Family In-Circuit Programming: Users Manual Rev. 1.3 2005.05.11
100% (1)
TMS374 Family In-Circuit Programming: Users Manual Rev. 1.3 2005.05.11
10 pages
Technology and Livelihood Education: Module 5 & 6
No ratings yet
Technology and Livelihood Education: Module 5 & 6
18 pages
Data Structure Course
No ratings yet
Data Structure Course
48 pages
Axis Mobile Features
No ratings yet
Axis Mobile Features
34 pages
Unit 1
No ratings yet
Unit 1
23 pages
K039-Pic-Checklist For JCB
No ratings yet
K039-Pic-Checklist For JCB
1 page
Data Quality Model
No ratings yet
Data Quality Model
107 pages
DAA Presentation Greedy Aproch of Coloring
No ratings yet
DAA Presentation Greedy Aproch of Coloring
11 pages
Nexans NYY 80-0-6 1 KV Single Core
No ratings yet
Nexans NYY 80-0-6 1 KV Single Core
6 pages
PPS Unit 3
No ratings yet
PPS Unit 3
16 pages
Unit 8
No ratings yet
Unit 8
4 pages
? Excel VLOOKUP - Massive Guide With 8 Examples
No ratings yet
? Excel VLOOKUP - Massive Guide With 8 Examples
19 pages
Agri-Fishery LAS 5
No ratings yet
Agri-Fishery LAS 5
5 pages
7 PPT
No ratings yet
7 PPT
21 pages
Solar Photovoltaic Glint and Glare Guidance First Edition
No ratings yet
Solar Photovoltaic Glint and Glare Guidance First Edition
55 pages
IHHA sts2011 - Turner
No ratings yet
IHHA sts2011 - Turner
9 pages
RIL - List of Subsidiaries
No ratings yet
RIL - List of Subsidiaries
7 pages
Form B Level 200
No ratings yet
Form B Level 200
1 page
Ul-1 13
No ratings yet
Ul-1 13
13 pages
Printer Friendly View
No ratings yet
Printer Friendly View
4 pages
Oil Seals Met
No ratings yet
Oil Seals Met
22 pages
The Basic Concepts of Information Systems: July 2021
No ratings yet
The Basic Concepts of Information Systems: July 2021
19 pages
Radware DefensePro Imp Points
No ratings yet
Radware DefensePro Imp Points
3 pages
JP-Finance Officer
No ratings yet
JP-Finance Officer
2 pages
3.1 Critical Thinking Rubric
No ratings yet
3.1 Critical Thinking Rubric
1 page
Department of Civil Engineering (Bbit B.Tech Wing) A.Y. 2019-2020 Even Semester Faculty Database For Online Internal Exam in May 2020
No ratings yet
Department of Civil Engineering (Bbit B.Tech Wing) A.Y. 2019-2020 Even Semester Faculty Database For Online Internal Exam in May 2020
1 page
Alternating Decision Tree: Fundamentals and Applications
From Everand
Alternating Decision Tree: Fundamentals and Applications
Fouad Sabry
No ratings yet
Decision Tree Pruning: Fundamentals and Applications
From Everand
Decision Tree Pruning: Fundamentals and Applications
Fouad Sabry
No ratings yet
Competitive Learning: Fundamentals and Applications for Reinforcement Learning through Competition
From Everand
Competitive Learning: Fundamentals and Applications for Reinforcement Learning through Competition
Fouad Sabry
No ratings yet

DM Mod5

Uploaded by

DM Mod5

Uploaded by

DATA MINING AND DATA WARE HOUSING (15CS651) VI SEM CSE

A clustering is a set of clusters

Important distinction between hierarchical and partitional sets of clusters

A set of nested clusters organized as a hierarchical tree

Center-based (proto type based)

Contiguous Cluster (Graph based)

- A cluster is a dense region of points, which is separated by low-density regions,

Shared Property or Conceptual Clusters

➢ Initial centroids are often chosen randomly.

Evaluating K-means Clusters

Most common measure is Sum of Squared Error (SSE).

To get SSE, we square these errors and sum them.

One easy way to reduce SSE is to increase K, the number of clusters.

Problems with Selecting Initial Points

Chance is relatively small when K is large

If clusters are the same size, n, then

For example, if K = 10, then probability = 10!/1010 = 0.00036

K-means: Additional Issues

An alternative is to update the centroids after each assignment (incremental approach)

- Each assignment updates zero or two centroids

- Introduces an order dependency

- Never get an empty cluster

- Can use “weights” to change the impact

Pre-processing and Post-processing

Strengths and Weaknesses of K-means (Limitations)

Agglomerative Hierarchical Clustering:

Do not have to assume any particular number of clusters

How to Define Inter-Cluster Similarity (Proximity of two clusters):

1) Single Link or MIN

2) Complete Link or MAX or CLIQUE

Key Issues in Agglomerative Hierarchical Clustering (Strengths and Weaknesses)

The DBSCAN Algorithm

Eliminate noise points

Strengths and weaknesses of DBSCAN

Supervised Cluster Evaluation Using Cohesion and Separation

Graph-Based View of Cohesion and Separation:

Prototype-Based View of Cohesion and Separation

Measuring Cluster Validity Via Correlation

Classification-Oriented Measures of Cluster Validity

Similarity-Oriented Measures of Cluster Validity

DENCLUE: A Kernel-Based Scheme for Density-Based Clustering

Limitations of Current Merging Schemes

➢ Concept of neighborhood is captured dynamically (even if region is sparse)

SNN Clustering Algorithm:

7)Discard all noise points.

Limitations of SNN Clustering

Scalable Clustering Algorithms

Question Bank: Clustering Analysis

1. Explain desired features of cluster analysis.

16. Explain unsupervised cluster evaluation using proximity matrix.

You might also like