DWDM Unit 5
Unit V
Definition:
“Cluster analysis or clustering is the task of grouping a set of objects or data points in such a way
that objects in the same group (called a cluster) are more similar (in some sense or another) to each other
than to those in other groups (clusters).”
Or
“Clustering is the process of grouping objects into different groups which are meaningful, useful or both”.
Cluster analysis is related to other mining tasks that divide data objects into groups. For instance,
clustering can be regarded as a form of classification in that it creates a labeling of objects with class
(cluster) labels. However, these labels are derived only from the data itself; for this reason Classification is
known as "Supervised Learning" and Clustering is known as "Unsupervised Classification".
Segmentation and Partitioning are sometimes used as synonyms for Clustering; these terms are
frequently used for approaches outside the traditional bounds of cluster analysis. For example, the term
partitioning is often used in connection with techniques that divide graphs into subgraphs and that are not
strongly connected to clustering. Segmentation often refers to the division of data into groups using simple
techniques; e.g., an image can be split into segments based only on pixel intensity and color.
Biology
Biologists have spent many years creating a taxonomy (hierarchical classification) of
all living things: kingdom, phylum, class, order, family, genus, and species. Biologists
have applied clustering to analyze the large amounts of genetic information. For
example, clustering has been used to find groups of genes that have similar
functions.
Information Retrieval
The World Wide Web consists of billions of Web pages, and the results of a query
to a search engine can return thousands of pages. Clustering can be used to group
these search results into a small number of clusters, each of which captures a
particular aspect of the query.
Climate
Understanding the Earth’s climate requires finding patterns in the atmosphere and
ocean. Cluster analysis has been applied to find patterns in the atmospheric
pressure of Polar Regions and areas of the ocean that have a significant impact on
land climate.
Business
Businesses collect large amounts of information on current and potential
customers. Clustering can be used to segment customers into a small number of
groups for additional analysis and marketing activities.
Summarization
Many data analysis techniques, such as regression or PCA, have a time or space
complexity of O(m²) or higher (where m is the number of objects), and thus are
not practical for large data sets. However, instead of applying the algorithm to the
entire data set, it can be applied to a reduced data set consisting only of cluster
prototypes.
Compression
Cluster prototypes can also be used for data compression. In particular, a table is
created that consists of the prototypes for each cluster; i.e., each prototype is
assigned an integer value that is its position (index) in the table. Each object is
represented by the index of the prototype associated with its cluster. This type of
compression is known as vector quantization and is often applied to image, sound,
and video data, where (1) many of the data objects are highly similar to one
another, (2) some loss of information is acceptable, and (3) a substantial reduction
in the data size is desired.
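To make the idea concrete, here is a small sketch in Python/NumPy (the prototype table and data values are invented for illustration and are not part of the notes):

```python
import numpy as np

# Prototype table: one row per cluster (e.g., centroids found by K-means).
prototypes = np.array([[10.0, 12.0],
                       [40.0, 43.0],
                       [80.0, 78.0]])

# Original data objects.
data = np.array([[11.0, 13.0], [39.0, 44.0], [79.0, 77.0], [9.5, 12.5]])

# Encode: replace each object by the index of its closest prototype.
dists = np.linalg.norm(data[:, None, :] - prototypes[None, :, :], axis=2)
codes = dists.argmin(axis=1)        # e.g. array([0, 1, 2, 0])

# Decode (lossy): look the prototypes back up by index.
reconstructed = prototypes[codes]
```

Storing one small integer per object plus the prototype table is what gives the compression; the reconstruction is lossy, which is acceptable by assumption (2) above.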
CLASSIFICATION vs CLUSTERING
Classification: We have a training set containing data that have been previously categorized.
Clustering: We do not know the characteristics of similarity of the data in advance.

Classification: Based on the training set, the algorithm finds the category that a new data point belongs to.
Clustering: Using statistical concepts, we split the data set into sub-datasets such that the sub-datasets contain "similar" data.

Classification: Since a training set exists, we describe this technique as supervised learning.
Clustering: Since no training set is used, we describe this technique as unsupervised learning.

Classification example: We use a training data set in which customers have been categorized as loyal or not; based on this training set, we can classify whether a new customer will be loyal to our shop or not.
Clustering example: We use a data set of customers and split it into sub-datasets of customers with "similar" characteristics; this information can then be used to market a product to the specific segment of customers identified by the clustering algorithm.
Types of Clustering:
Hierarchical clustering Vs Partitional Clustering
Exclusive Clustering Vs Overlapping Clustering
Fuzzy Clustering
Complete Clustering Vs Partial Clustering
Partitional Clustering:
A division of data objects into non-overlapping subsets (clusters) such that each data object is in
exactly one subset.
Overlapping Clustering:
In overlapping (non-exclusive) clustering, an object can simultaneously belong to more than one cluster.
Fuzzy Clustering:
In fuzzy clustering, every object belongs to every cluster with a membership weight that is between
0 (absolutely doesn't belong) and 1 (absolutely belongs). In fuzzy clustering we often require that the
membership weights of each object sum to 1.
Partial Clustering:
In partial clustering, some data points are left unassigned because they are outliers or noise. This type
of clustering is useful for outlier analysis.
Types of Clusters:
Well-separated clusters
Center-based clusters
Contiguous clusters
Density-based clusters
Property or Conceptual
Center-based Clusters:
A cluster is a set of objects such that an object in a cluster is closer (more similar) to the
“center” of a cluster, than to the center of any other cluster. The center of a cluster is often a
centroid, the average of all the points in the cluster, or a medoid, the most “representative”
point of a cluster.
Contiguous Clusters (Nearest neighbor or Transitive):
A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or
more other points in the cluster than to any point not in the cluster.
Density-based Clusters:
A cluster is a dense region of points, which is separated by low-density regions, from other
regions of high density. Used when the clusters are irregular or intertwined, and when noise
and outliers are present.
Road Map:
K-means:
This is a prototype-based, partitional clustering technique that attempts to find a user-specified
number of clusters (K), which are represented by their centroids.
DBSCAN:
This is a density-based clustering algorithm that produces a partitional clustering, in which the
number of clusters is automatically determined by the algorithm. Points in low-density regions
are classified as noise and omitted; thus, DBSCAN does not produce a complete clustering.
K-Means:
A prototype-based, one-level partitional clustering technique that attempts to find a user-specified number
of clusters (K).
Choose K initial centroids, where K is a user-specified parameter that indicates the number of clusters desired.
Each point is then assigned to the closest centroid, and each collection of points assigned to
a centroid is a cluster.
The centroid of each cluster is then updated based on the points assigned to the cluster.
Repeat the assignment and update steps until no point changes clusters.
Algorithm:
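The pseudocode figure is not reproduced here; the following is a minimal sketch of the basic K-means loop in Python/NumPy (function and variable names are illustrative only, not from the notes):

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Basic K-means: assign points to the nearest centroid, then recompute centroids."""
    rng = np.random.default_rng(seed)
    # Step 1: choose K initial centroids (here: K random points from the data).
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its closest centroid (Euclidean distance).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: update each centroid to the mean of the points assigned to it.
        new_centroids = np.array([points[labels == i].mean(axis=0) if np.any(labels == i)
                                  else centroids[i] for i in range(k)])
        # Step 4: stop when no centroid changes (no point changes clusters).
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```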
K-means converges when no points shift from one cluster to another and hence the centroids no longer change.
To assign a point to a cluster, a proximity measure is needed to determine which centroid is closest
to the point.
To measure the quality of a clustering, we use an objective function: the sum of the squared
error (SSE), which is also known as scatter.
The error of each data point is its Euclidean distance to the closest centroid; the total SSE is the
sum of these squared errors.
Given two different sets of clusters that are produced by two different runs of K-means, we
prefer the one with the smallest squared error.
SSE is formally defined as follows:
SSE = Σ_{i=1..K} Σ_{x ∈ Ci} dist(ci, x)²
where dist is the standard Euclidean (L2) distance between two objects in Euclidean space, Ci is the ith cluster, ci is its centroid and K is the number of clusters.
The centroid that minimizes the SSE of a cluster is the mean. The centroid (mean) of the ith
cluster, which contains mi objects, is defined by:
ci = (1/mi) Σ_{x ∈ Ci} x
To illustrate, the centroid of a cluster containing the three two-dimensional points (1,1), (2,3)
and (6,2) is ((1 + 2 + 6)/3, (1 + 3 + 2)/3) = (3, 2).
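As a quick check of the SSE and centroid formulas on the same three points (a hypothetical snippet, not part of the notes):

```python
import numpy as np

points = np.array([[1, 1], [2, 3], [6, 2]], dtype=float)
centroid = points.mean(axis=0)                              # -> array([3., 2.]), as computed above
sse = np.sum(np.linalg.norm(points - centroid, axis=1) ** 2)  # -> 16.0 for this single cluster
# With more than one cluster, the total SSE is the sum of each cluster's SSE.
```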
Document data:
K-means is also used for document data; for document data, the cosine similarity measure is used instead of Euclidean distance.
Document data is represented as a document-term matrix.
The objective is to maximize the similarity of the documents in a cluster to the cluster centroid;
this quantity is called the cohesion of the cluster.
PROBLEM
Use the k-means algorithm and Euclidean distance to cluster the following 8 examples into 3 clusters:
A1=(2,10), A2=(2,5), A3=(8,4), A4=(5,8), A5=(7,5), A6=(6,4), A7=(1,2), A8=(4,9). The distance matrix
based on the Euclidean distance is given below:
Solution
d(a, b) denotes the Euclidean distance between a and b. It is obtained directly from the distance matrix
or calculated as follows:
d(a, b) = √((xa − xb)² + (ya − yb)²)
Select three centroids. A1 (C1 - Cluster 1), A4 (C2 - Cluster 2) and A7 (C3 - Cluster 3)
ITERATION 1
With the initial centroids C1 = A1 = (2, 10), C2 = A4 = (5, 8) and C3 = A7 = (1, 2), calculate the Euclidean
distance of every point to each centroid and assign each point to its nearest centroid.
Newly formed clusters are:
Cluster 1 (C1) : {A1}
Cluster 2 (C2) : {A3, A4, A5, A6, A8}
Cluster 3 (C3) : {A2, A7}
The new centroids are C1 = (2, 10), C2 = (6, 6) and C3 = (1.5, 3.5).
ITERATION 2
With these new centroids, again calculate the Euclidean distance for points A1, A2, A3, A4, A5, A6, A7 and
A8.
Newly formed clusters are:
Cluster 1 (C1) : {A1, A8}
Cluster 2 (C2) : {A3, A4, A5, A6}
Cluster 3 (C3) : {A2, A7}
The new centroids are C1 = (3, 9.5), C2 = (6.5, 5.25) and C3 = (1.5, 3.5).
ITERATION 3
With these new centroids, again calculate the Euclidean distance for points A1, A2, A3, A4, A5, A6, A7 and
A8.
Newly formed clusters are:
Cluster 1 (C1) : {A1, A4, A8}
Cluster 2 (C2) : {A3, A5, A6}
Cluster 3 (C3) : {A2, A7}
With these new centroids, the entire process is repeated until the centroids don’t change.
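If scikit-learn is available, the hand calculation above can be cross-checked by passing A1, A4 and A7 as explicit initial centroids (this snippet is only a verification, not part of the original solution):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[2, 10], [2, 5], [8, 4], [5, 8], [7, 5], [6, 4], [1, 2], [4, 9]], dtype=float)
init = np.array([[2, 10], [5, 8], [1, 2]], dtype=float)   # A1, A4, A7

km = KMeans(n_clusters=3, init=init, n_init=1).fit(X)
print(km.labels_)           # cluster index of A1..A8
print(km.cluster_centers_)  # final centroids
```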
The space requirements for K-means are modest, because only the data points and centroids are stored,
i.e., O((m + K)n), where m is the number of points and n is the number of attributes.
The time required is O(I * K * m * n), where I is the number of iterations, K the number of clusters,
m the number of points and n the number of attributes.
Outliers:
When outliers are present, the resulting cluster centroids (prototypes) are less representative and the
SSE is higher. Because of this, it is often useful to discover outliers and eliminate them
beforehand.
There are also strategies that decrease the number of clusters while trying to minimize the increase in
total SSE:
Disperse a cluster:
Remove the centroid that corresponds to the cluster and reassign its points to
other clusters.
Merge two clusters:
The two clusters with the closest centroids are merged; this typically results in the smallest increase in total
SSE.
When centroids are updated incrementally (after each point is assigned, rather than after a full pass over
the data), a point either moves to a new cluster (two centroid updates) or stays in its current cluster (zero
updates).
Bisecting K-means:
Idea:
To obtain k clusters, split the set of all points into two clusters, select one of those clusters to
split, and so on, until k clusters have been produced.
Algorithm:
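The pseudocode figure is not reproduced here; below is a minimal sketch that reuses the kmeans function from the earlier sketch. Choosing the cluster with the largest SSE to split is one common choice and is an assumption here:

```python
import numpy as np

def sse(points, centroid):
    return np.sum(np.linalg.norm(points - centroid, axis=1) ** 2)

def bisecting_kmeans(points, k, trials=10):
    clusters = [points]                      # start with one all-inclusive cluster
    while len(clusters) < k:
        # Pick the cluster to split (here: the one with the largest SSE).
        idx = max(range(len(clusters)), key=lambda i: sse(clusters[i], clusters[i].mean(axis=0)))
        to_split = clusters.pop(idx)
        best = None
        for t in range(trials):              # several trial bisections with K = 2
            cents, labels = kmeans(to_split, 2, seed=t)
            pair = [to_split[labels == 0], to_split[labels == 1]]
            total = sum(sse(c, cent) for c, cent in zip(pair, cents))
            if best is None or total < best[0]:
                best = (total, pair)
        clusters.extend(best[1])             # keep the bisection with the lowest SSE
    return clusters
```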
Bisecting K-means has less trouble with initialization because it performs several trial bisections
and takes the one with the lowest SSE, and because there are only two centroids at each step.
The best centroid for minimizing the SSE of a cluster is the mean of the points in the cluster.
Divisive:
Start with one, all-inclusive cluster and, at each step, split a cluster until only singleton clusters
of individual points remain. In this case, we need to decide which cluster to split at each step
and how to do the splitting.
In this approach, data objects are grouped in a top-down manner:
Initially, all objects are in one cluster.
Then the cluster is subdivided into smaller and smaller pieces, until each object forms a
cluster on its own or until certain termination conditions are satisfied, such as the desired number
of clusters being obtained.
Agglomerative vs Divisive
A hierarchical clustering is often displayed graphically using a tree-like diagram called a dendrogram,
which displays both the cluster-subcluster relationships and the order in which the clusters were
merged (agglomerative view) or split (divisive view). A hierarchical clustering can also be graphically
represented using a nested cluster diagram.
EXAMPLE
Given a data set of five objects characterized by a single feature, assume that there are two
clusters: C1: {a, b} and C2: {c, d, e}.
The distance matrix is as follows:
a b c d e
a 0 1 3 4 5
b 1 0 2 3 4
c 3 2 0 1 2
d 4 3 1 0 1
e 5 4 2 1 0
Calculation of the distance between C1: {a, b} and C2: {c, d, e} is as follows:
Single link:
dist(C1, C2) = min {d(a, c), d(a, d), d(a, e), d(b, c), d(b, d), d(b, e)}
= min {3, 4, 5, 2, 3, 4}
=2
Complete Link:
dist(C1, C2) = max {d(a, c), d(a, d), d(a, e), d(b, c), d(b, d), d(b, e)}
= max {3, 4, 5, 2, 3, 4}
=5
Average Link:
dist(C1, C2) = avg {d(a, c), d(a, d), d(a, e), d(b, c), d(b, d), d(b, e)}
= avg {3, 4, 5, 2, 3, 4}
= (3+4+5+2+3+4)/6
= 3.5
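The same three calculations in a few lines of Python/NumPy (indices 0 to 4 stand for a to e, taken from the distance matrix above):

```python
import numpy as np

D = np.array([[0, 1, 3, 4, 5],
              [1, 0, 2, 3, 4],
              [3, 2, 0, 1, 2],
              [4, 3, 1, 0, 1],
              [5, 4, 2, 1, 0]], dtype=float)

C1, C2 = [0, 1], [2, 3, 4]        # indices of {a, b} and {c, d, e}
between = D[np.ix_(C1, C2)]       # all pairwise distances between the two clusters

print(between.min())              # single link   -> 2.0
print(between.max())              # complete link -> 5.0
print(between.mean())             # average link  -> 3.5
```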
EXAMPLE 1:
Consider the following distance matrix for five points and generate clusters using Agglomerative
Hierarchical Clustering
a b c d e
a 0 1 3 4 5
b 1 0 2 3 4
c 3 2 0 1 2
d 4 3 1 0 1
e 5 4 2 1 0
The smallest entry in the distance matrix is 1, so the pair {a, b} is merged into cluster C1 and the pair
{c, d} (equally, {d, e} could have been chosen) is merged into cluster C2. The distance matrix is then
recomputed for C1, C2 and e under each linkage criterion.

Single link (MIN):
            C1:{a,b}   C2:{c,d}   e
C1:{a,b}       0          2       4
C2:{c,d}       2          0       1
e              4          1       0
The minimum distance is 1, so C2 and e are merged: {{c, d}, e}.

Complete link (MAX):
            C1:{a,b}   C2:{c,d}   e
C1:{a,b}       0          4       5
C2:{c,d}       4          0       2
e              5          2       0
The minimum distance is 2, so C2 and e are merged: {{c, d}, e}.

Average link:
            C1:{a,b}   C2:{c,d}   e
C1:{a,b}       0          3       4.5
C2:{c,d}       3          0       1.5
e              4.5        1.5     0
The minimum distance is 1.5, so C2 and e are merged: {{c, d}, e}.

In each case, the final step merges {a, b} with {{c, d}, e} into a single all-inclusive cluster.
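If SciPy is available, the complete agglomerative clustering of this matrix under the three criteria can be reproduced with scipy.cluster.hierarchy.linkage (a verification sketch, not part of the original example):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

D = np.array([[0, 1, 3, 4, 5],
              [1, 0, 2, 3, 4],
              [3, 2, 0, 1, 2],
              [4, 3, 1, 0, 1],
              [5, 4, 2, 1, 0]], dtype=float)

condensed = squareform(D)                     # condensed form required by linkage
for method in ("single", "complete", "average"):
    Z = linkage(condensed, method=method)     # each row: (cluster i, cluster j, distance, size)
    print(method, Z)
```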
EXAMPLE 2
The given proximity matrix, in which D and F have already been merged into a single cluster, can be
represented under each of the three linkage criteria as follows:

Single link (MIN):
        A      B      C      (D, F)   E
A       0.0    0.71   5.66   3.20     4.24
B       0.71   0.0    4.95   2.50     3.54
C       5.66   4.95   0.0    2.24     1.41
(D, F)  3.20   2.50   2.24   0.0      1.00

Complete link (MAX):
        A      B      C      (D, F)   E
A       0.0    0.71   5.66   3.61     4.24
B       0.71   0.0    4.95   2.92     3.54
C       5.66   4.95   0.0    2.50     1.41
(D, F)  3.61   2.92   2.50   0.0      1.12

Group average:
        A      B      C      (D, F)   E
A       0.0    0.71   5.66   3.405    4.24
B       0.71   0.0    4.95   2.71     3.54
C       5.66   4.95   0.0    2.37     1.41
(D, F)  3.405  2.71   2.37   0.0      1.06

In each case, the smallest distance involving the cluster (D, F) is its distance to E (1.00, 1.12 and 1.06,
respectively), so (D, F) and E are merged into ((D, F), E). A and B, at distance 0.71, are likewise merged,
and the distance matrix is then recalculated for the clusters (A, B), C and ((D, F), E).
DBSCAN
Density-based clustering locates regions of high density that are separated from one another by
regions of low density. DBSCAN is a simple and effective density-based clustering algorithm that
illustrates a number of concepts that are important for any density-based clustering
approach.
In the center-based approach, the density at a particular point is estimated by counting the number of
points within a specified radius (Eps) of that point. This method is simple to implement, but the density of
any point depends on the specified radius. For instance, if the radius is large enough, then all points will
have a density of m, the number of points in the data set; likewise, if the radius is too small, then all points
will have a density of 1. Selecting an appropriate radius is therefore crucial.
Core points:
These points are in the interior of a density-based cluster. A point is a core point if the
number of points within a given neighborhood around the point as determined by the
distance function and a user specified distance parameter, Eps, exceeds a certain
threshold, MinPts, which is also a user-specified parameter. In the above figure, point A
is a core point for the indicated radius (Eps) if MinPts ≤ 7.
Border points:
A border point is not a core point, but falls within the neighborhood of a core point. In
the above figure, point B is a border point. A border point can fall within the
neighborhoods of several core points.
Noise points:
A noise point is any point that is neither a core point nor a border point. In the above
figure, point C is a noise point.
NOTE:
DBSCAN Clustering is based on density (local cluster criterion)
It is used to discover clusters of arbitrary shape
DBSCAN handles noise very well.
DBSCAN needs only one scan to cluster the data points
The density parameters (Eps and MinPts) serve as the termination condition for growing clusters.
The Two parameters that we can specify are:
Eps: Maximum radius of neighborhood
MinPts: Minimum number of points in an Eps-neighborhood of a point
The Eps-neighborhood of a data point p is defined as:
NEps(p) = {q ∈ D | dist(p, q) <= Eps}
That is, if the distance between q and p is less than or equal to the given radius Eps, then q belongs
to the Eps-neighborhood of p.
Directly density-reachable:
A point p is directly density-reachable from a point q (with respect to Eps and MinPts) if:
1) p belongs to NEps(q), and
2) q is a core point, i.e., |NEps(q)| >= MinPts.
Density-reachable:
A point p is density-reachable from a point q if there is a chain of points p1, ..., pn with p1 = q and
pn = p such that each pi+1 is directly density-reachable from pi.
Density-Connectivity:
A pair of points p and q are density-connected if they are both density-reachable from a common
point o.
Density-connectivity is symmetric.
Weakness:
DBSCAN is sensitive to its parameters, Eps (the radius) and MinPts.
The size and shape of the clusters vary from one another based on the parameters given.
DBSCAN does not work well when the clusters have widely varying densities.
It also does not work well with high-dimensional data.
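Despite these weaknesses, DBSCAN is straightforward to apply. A minimal usage sketch with scikit-learn's DBSCAN, assuming that library is available (eps and min_samples correspond to Eps and MinPts; the data points are invented for illustration):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two small dense groups plus one isolated point.
X = np.array([[1, 2], [1.1, 2.1], [0.9, 1.9],
              [8, 8], [8.1, 8.2], [7.9, 7.8],
              [25, 80]])

db = DBSCAN(eps=0.5, min_samples=3).fit(X)
print(db.labels_)                # cluster index per point; noise points are labelled -1
print(db.core_sample_indices_)   # indices of the core points
```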