Data Mining Notes UNIT IV
Cluster:
A cluster is a group of objects that belong to the same class. In other words, similar objects
are grouped in one cluster and dissimilar objects are grouped in another cluster.
What is Clustering?
Clustering is the process of grouping a set of abstract objects into classes of similar
objects.
Points to Remember
A cluster of data objects can be treated as one group.
While doing cluster analysis, we first partition the set of data into groups based on
data similarity and then assign the labels to the groups.
The main advantage of clustering over classification is that it is adaptable to
changes and helps single out useful features that distinguish different groups.
The following points describe the requirements of clustering in data mining −
Scalability − We need highly scalable clustering algorithms to deal with large
databases.
Ability to deal with different kinds of attributes − Algorithms should be capable
of being applied to any kind of data, such as interval-based (numerical) data,
categorical data, and binary data.
Discovery of clusters with arbitrary shape − The clustering algorithm should be
capable of detecting clusters of arbitrary shape. It should not be bounded to
distance measures that tend to find only spherical clusters of small size.
High dimensionality − The clustering algorithm should be able to handle not only
low-dimensional data but also high-dimensional data.
Ability to deal with noisy data − Databases contain noisy, missing or erroneous
data. Some algorithms are sensitive to such data and may lead to poor quality
clusters.
Interpretability − The clustering results should be interpretable, comprehensible,
and usable.
Clustering Methods
Partitioning Method
Hierarchical Method
Density-based Method
Grid-Based Method
Model-Based Method
Constraint-based Method
Partitioning Method
Suppose we are given a database of ‘n’ objects and the partitioning method constructs ‘k’
partitions of the data. Each partition represents a cluster, and k ≤ n. That is, the method
classifies the data into k groups, which satisfy the following requirements −
Each group contains at least one object.
Each object must belong to exactly one group.
Points to remember −
For a given number of partitions (say k), the partitioning method creates an
initial partitioning.
It then uses an iterative relocation technique to improve the partitioning by
moving objects from one group to another, as illustrated in the sketch below.
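A minimal sketch of a partitioning method in practice, assuming scikit-learn is available; the small two-dimensional data set and the choice k = 2 below are illustrative, not part of the notes:

```python
# Partition n objects into k clusters (k <= n) with an iterative-relocation method.
import numpy as np
from sklearn.cluster import KMeans

# Six objects in two dimensions (illustrative values).
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [8.0, 8.2], [8.1, 7.9], [7.9, 8.1]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)   # each object belongs to exactly one group

print(labels)                    # e.g. [0 0 0 1 1 1]; every group has at least one object
print(kmeans.cluster_centers_)   # one representative (mean) per partition
```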
Hierarchical Methods
This method creates a hierarchical decomposition of the given set of data objects. We can
classify hierarchical methods on the basis of how the hierarchical decomposition is
formed. There are two approaches here −
Agglomerative Approach
Divisive Approach
Agglomerative Approach
This approach is also known as the bottom-up approach. In this, we start with each object
forming a separate group. It keeps on merging the objects or groups that are close to one
another. It keeps doing so until all of the groups are merged into one or until the
termination condition holds.
Divisive Approach
This approach is also known as the top-down approach. In this, we start with all of the
objects in the same cluster. In each successive iteration, a cluster is split up into smaller
clusters. This continues until each object forms its own cluster or the termination condition
holds. This method is rigid, i.e., once a merging or splitting is done, it can never be undone.
Here are the two approaches that are used to improve the quality of hierarchical
clustering −
Perform careful analysis of object linkages at each hierarchical partitioning.
Integrate hierarchical agglomeration by first using a hierarchical agglomerative
algorithm to group objects into micro-clusters, and then performing macro-
clustering on the micro-clusters.
Density-based Method
This method is based on the notion of density. The basic idea is to continue growing the
given cluster as long as the density in the neighborhood exceeds some threshold, i.e., for
each data point within a given cluster, the neighborhood of a given radius has to contain at
least a minimum number of points.
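As a hedged illustration, the density-based idea can be tried with scikit-learn's DBSCAN, where eps plays the role of the neighborhood radius and min_samples the minimum number of points; the data and parameter values are assumptions chosen for the example:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups plus one isolated point (illustrative data).
X = np.array([[1.0, 1.0], [1.1, 1.0], [0.9, 1.1],
              [5.0, 5.0], [5.1, 4.9], [5.0, 5.1],
              [9.0, 0.0]])

# eps = neighborhood radius, min_samples = minimum number of points in that neighborhood.
db = DBSCAN(eps=0.5, min_samples=2).fit(X)
print(db.labels_)   # the isolated point is labelled -1 (noise), not forced into a cluster
```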
Grid-based Method
In this, the objects together form a grid. The object space is quantized into a finite number
of cells that form a grid structure.
Advantages
The major advantage of this method is fast processing time.
It is dependent only on the number of cells in each dimension in the quantized
space.
Model-based methods
In this method, a model is hypothesized for each cluster to find the best fit of the data to a
given model. This method locates clusters by constructing a density function that reflects
the spatial distribution of the data points.
This method also provides a way to automatically determine the number of clusters based
on standard statistics, taking outlier or noise into account. It therefore yields robust
clustering methods.
Constraint-based Method
In this method, the clustering is performed by the incorporation of user- or application-
oriented constraints. A constraint refers to the user expectation or the properties of the
desired clustering results.
Partitioning Methods in Detail
A partitioning method classifies the data into k groups, which together satisfy the following
requirements
Each group must contain at least one object,
Each object must belong to exactly one group.
It uses an iterative relocation technique that attempts to improve the partitioning by moving
objects from one group to another.
The general criterion of a good partitioning is that objects in the same cluster are “close” or
related to each other, whereas objects of different clusters are “far apart” or very different.
There are various kinds of other criteria for judging the quality of partitions.
Achieving global optimality in partitioning-based clustering would require the exhaustive
enumeration of all possible partitions, which is computationally prohibitive. Most
applications therefore adopt one of two popular heuristic methods:
1) The k-means algorithm, where each cluster is represented by the mean value of the
objects in the cluster.
2) The k-medoids algorithm, where each cluster is represented by one of the objects
located near the center of the cluster.
The heuristic clustering methods work well for finding spherical-shaped clusters in small
to medium databases.
To find clusters with complex shapes and for clustering very large data sets, partitioning
based methods need to be extended.
Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion.
Global optimum: exhaustively enumerate all possible partitions.
Heuristic methods: the k-means and k-medoids algorithms.
k-means (MacQueen’67): Each cluster is represented by the center of the cluster
k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw’87): Each
cluster is represented by one of the objects in the cluster
First, it randomly selects k of the objects, each of which initially represents a cluster mean
or center.
Each of the remaining objects is assigned to the cluster to which it is most similar, based on
the distance between the object and the cluster mean.
It then computes the new mean for each cluster. This process iterates until the criterion
function converges.
K-Means Algorithm
The k-means algorithm uses the square-error criterion
E = Σ (i = 1..k) Σ (x ∈ Ci) |x − mi|²
Here, E is the sum of the square error for all objects in the data set, x is the point in space
representing a given object, and mi is the mean of cluster Ci (both x and mi are
multidimensional). In other words, for each object in each cluster, the distance from the
object to its cluster center is squared, and the distances are summed.
This criterion tries to make the resulting k clusters as compact and as separate as possible.
Suppose that there is a set of objects located in space, as depicted in the accompanying
figure.
Let k = 3; i.e., the user would like to cluster the objects into three clusters.
According to the algorithm, we arbitrarily choose three objects as the three initial cluster
centers, where the cluster centers are marked by a “+”.
Each object is distributed to the cluster whose center it is nearest to. Such a distribution
forms the silhouettes encircled by dotted curves.
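A minimal from-scratch sketch of these steps in NumPy; the convergence test on the centers and the handling of empty clusters are assumptions added for the example:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: random initial centers, assign to nearest center, recompute means."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly select k of the objects as the initial cluster centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each object to the cluster whose center is nearest.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute the mean of each cluster (keep the old center if a cluster is empty).
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # Step 4: stop once the criterion function converges (centers no longer move).
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    sse = ((X - centers[labels]) ** 2).sum()   # E, the sum of squared errors
    return labels, centers, sse
```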
Advantages Of K-Means
It is relatively scalable and efficient when processing large data sets: the computational
complexity is O(nkt), where n is the number of objects, k is the number of clusters, and t
is the number of iterations (normally k, t << n).
It is simple to implement and understand.
Disadvantages Of K-Means
Applicable only when mean is defined, then what about categorical data?
Need to specify k, the number of clusters, in advance.
Unable to handle noisy data and outliers.
Not suitable to discover clusters with non-convex shapes.
K-Medoids Clustering
K-Medoids: Instead of taking the mean value of the objects in a cluster as a reference
point, a medoid can be used, which is the most centrally located object in a cluster.
The quality of the resulting partitioning is estimated using a cost function that measures
the average dissimilarity between an object and the medoid of its cluster.
Case 1: P currently belongs to medoid Oj. If Oj is replaced by Orandom as a medoid and P is
closest to one of the other medoids Oi (i ≠ j), then P is reassigned to Oi.
Case 2: P currently belongs to medoid Oj. If Oj is replaced by Orandom as a medoid and P is
closest to Orandom, then P is reassigned to Orandom.
Case 3: P currently belongs to a medoid Oi other than Oj. If Oj is replaced by Orandom as a
medoid and P is still closest to Oi, then the assignment does not change.
Case 4: P currently belongs to a medoid Oi other than Oj. If Oj is replaced by Orandom as a
medoid and P is closest to Orandom, then P is reassigned to Orandom.
Which Is More Robust -- K-Means or K-Medoids
The k-medoids method is more robust than k-means in the presence of noise and
outliers because a medoid is less influenced by outliers or other extreme values than a
mean.
However, its processing is more costly than the k-means method. Both methods
require the user to specify k, the number of clusters.
Aside from using the mean or the medoid as a measure of cluster center, other
alternative measures are also commonly used in partitioning clustering methods.
The median can be used, resulting in the k-median method, where the median or
“middle value” is taken for each ordered attribute. Alternatively, in the k-modes method,
the most frequent value for each attribute is used.
1. Agglomerative:
In Agglomerative Hierarchical clustering, we start by treating each data point as a single
cluster and, in every iteration, merge the closest pairs of clusters until all points belong to
one cluster or a termination condition holds.
2. Divisive:
We can say that Divisive Hierarchical clustering is precisely the opposite of
Agglomerative Hierarchical clustering. In Divisive Hierarchical clustering, we take all of the
data points as a single cluster and, in every iteration, split off the data points that are not
similar to the rest of their cluster. In the end, we are left with N clusters (one per data
point).
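A short sketch of the hierarchical idea using SciPy's clustering utilities (assumed available); the data, linkage method and cut level are illustrative choices:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.2, 1.1], [5.0, 5.0], [5.1, 5.2], [9.0, 9.0]])

# Agglomerative (bottom-up): each point starts in its own cluster and the closest groups merge.
Z = linkage(X, method='average')          # Z records the full merge history (the dendrogram)

# Cutting the tree at a chosen level yields a flat clustering, e.g. two clusters.
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)
```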
Grid-Based Clustering
The statistical info of each cell is calculated and stored beforehand and is used to answer
queries.
The parameters of higher-level cells can be easily calculated from the parameters of lower-
level cells:
Count, mean, standard deviation, min, max
Type of distribution—normal, uniform, etc.
For each cell in the current level compute the confidence interval.
When finishing examining the current layer, proceed to the next lower level.
Advantages:
It is query-independent, easy to parallelize, and supports incremental updates.
Query processing time is O(K), where K is the number of grid cells at the lowest level.
Disadvantages:
All the cluster boundaries are either horizontal or vertical, and no diagonal
boundary is detected.
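A tiny NumPy sketch of the grid idea: quantize the two-dimensional object space into cells, keep only per-cell counts, and flag cells above a density threshold; the grid size and threshold are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(200, 2))   # illustrative 2-D objects

# Quantize the object space into a 10 x 10 grid and count the objects falling in each cell.
counts, xedges, yedges = np.histogram2d(X[:, 0], X[:, 1], bins=10)

# Further processing works on the cells rather than the objects, so the cost depends on the
# number of cells; cells above a threshold are treated as dense.
dense_cells = np.argwhere(counts >= 5)
print(len(dense_cells), "dense cells out of", counts.size)
```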
WaveCluster
Input parameters:
Number of grid cells for each dimension.
The wavelet, and the number of applications of the wavelet transform.
Major features:
The time complexity of this method is O(N).
It detects arbitrary shaped clusters at different scales.
It is not sensitive to noise, not sensitive to input order.
It is only applicable to low-dimensional data.
CLIQUE (Clustering In QUEst)
CLIQUE is based on automatically identifying the subspaces of a high-dimensional data
space that allow better clustering than the original space.
Partition the data space and find the number of points that lie inside each cell of the
partition.
Identify the subspaces that contain clusters using the Apriori principle.
Identify clusters:
Determine dense units in all subspaces of interests.
Determine connected dense units in all subspaces of interests.
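A small sketch of the first step only (finding dense one-dimensional units); xi (the number of intervals per dimension) and tau (the density threshold) are illustrative parameter names borrowed from the usual CLIQUE description:

```python
import numpy as np

def dense_units_1d(X, xi=10, tau=15):
    """For each dimension, partition it into xi intervals and return the indices of dense units."""
    dense = {}
    for dim in range(X.shape[1]):
        counts, _ = np.histogram(X[:, dim], bins=xi)
        dense[dim] = np.flatnonzero(counts >= tau)   # units holding at least tau points
    return dense

# Dense 1-D units would next be combined, Apriori-style, into candidate dense units of
# higher-dimensional subspaces, keeping only candidates whose lower projections are all dense.
rng = np.random.default_rng(0)
X = rng.random((300, 4))
print(dense_units_1d(X))
```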
Disadvantages
The accuracy of the clustering result may be degraded at the expense of the
simplicity of the method.
Model-Based Clustering
Model-based clustering method is an attempt to optimize the fit between the data
and some mathematical models.
It includes statistical and AI approaches.
Model-based clustering works on the intuition that gene expression data originates
from a finite mixture of underlying probability distributions (Ramoni et al. 2001).
Each cluster corresponds to a different distribution, and these distributions are
assumed to be Gaussians.
The parameters of each distribution (i.e., cluster) are estimated by maximizing the
likelihood of the expression data (Hogg and Craig 1994).
The k-means clustering method is a special case of model-based clustering,
where all the distributions are assumed to be Gaussians with equal variance.
The parameters are estimated by an EM-style iterative procedure:
1. Randomly generate the parameters (the mean and standard deviation or covariance
matrix) describing each probability distribution (i.e., cluster).
2. Repeat until the parameters of each distribution converge:
For each gene, estimate the probability that the gene's expression pattern was
generated from each of the distributions.
For each distribution, estimate the parameters of the distribution to maximize the
likelihood of the expression data, given the probability that each gene was generated from
the distribution.
3. Assign each gene to the distribution which generates the gene's expression profile
with maximum probability.
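A hedged sketch of this procedure using scikit-learn's GaussianMixture, which fits a finite mixture of Gaussians by EM; the synthetic data and the number of components are assumptions for the example:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),    # illustrative "expression" vectors
               rng.normal(6.0, 1.0, size=(100, 2))])

gm = GaussianMixture(n_components=2, random_state=0).fit(X)   # parameters estimated by EM

probs = gm.predict_proba(X)   # probability that each object was generated by each distribution
labels = gm.predict(X)        # assignment to the distribution with maximum probability
print(probs[:3])
```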
Model-based clustering has the advantage of providing the probability that each gene
belongs in each cluster.
However, model-based clustering operates under the assumption that expression data
comes from particular probability distributions, which may not be a reasonable
assumption for many microarray data sets.
Conceptual clustering
Conceptual clustering is a form of clustering in machine learning.
It produces a classification scheme for a set of unlabeled objects and finds
characteristic description for each concept (class).
COBWEB (Fisher’87)
COBWEB is a popular and simple method of incremental conceptual learning.
It creates a hierarchical clustering in the form of a classification tree.
Each node refers to a concept and contains a probabilistic description of that
concept.
Classification Tree
Limitations of COBWEB
The assumption that the attributes are independent of each other is often too strong
because correlation may exist.
It is not suitable for clustering large databases: it can produce a skewed tree and relies
on expensive probability distributions.
Some other methods similar to COBWEB are:
CLASSIT
It is an extension of COBWEB for incremental clustering of continuous data.
It suffers similar problems as COBWEB.
SOM (Self-Organizing Feature Map)
Clustering is also performed by having several units compete for the current
object.
The unit whose weight vector is closest to the current object wins.
The winner and its neighbors learn by having their weights adjusted.
SOMs are believed to resemble processing that can occur in the brain.
Useful for visualizing high-dimensional data in 2-D or 3-D space.
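A minimal NumPy sketch of the competitive update a SOM performs for each input; the grid size, learning rate and neighborhood width are illustrative constants (in practice they decay over time):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((500, 3))                 # input objects (3-dimensional here)
weights = rng.random((10, 10, 3))        # a 10 x 10 grid of units, one weight vector per unit

lr, sigma = 0.5, 2.0
for x in X:
    # The unit whose weight vector is closest to the current object wins.
    d = np.linalg.norm(weights - x, axis=2)
    wi, wj = np.unravel_index(d.argmin(), d.shape)
    # The winner and its neighbors learn: their weights are pulled toward the object.
    ii, jj = np.meshgrid(np.arange(10), np.arange(10), indexing='ij')
    neigh = np.exp(-((ii - wi) ** 2 + (jj - wj) ** 2) / (2 * sigma ** 2))
    weights += lr * neigh[..., None] * (x - weights)
```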
Outliers in Data Mining
An outlier is a data object that deviates significantly from the rest of the data objects
and behaves in a different manner. Outliers can be caused by measurement or execution
errors. The analysis of outlier data is referred to as outlier analysis or outlier mining.
An outlier cannot simply be dismissed as noise or error. Instead, outliers are suspected of
not being generated by the same mechanism as the rest of the data objects.
Outliers are of three types, namely –
1. Global (or Point) Outliers
2. Collective Outliers
3. Contextual (or Conditional) Outliers
1. Global Outliers
They are also known as Point Outliers. These are the simplest form of outliers. If, in a
given dataset, a data point strongly deviates from all the rest of the data points, it is known
as a global outlier. Mostly, all of the outlier detection methods are aimed at finding global
outliers.
For example, in an Intrusion Detection System, if a large number of packets are
broadcast in a very short span of time, this may be considered a global outlier, and
we can say that the particular system has potentially been hacked.
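A tiny illustration of flagging a global outlier with a z-score test; the threshold of three standard deviations is a common but arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.append(rng.normal(50.0, 2.0, size=50), 120.0)   # one value far from all the rest
z = (x - x.mean()) / x.std()
print(np.flatnonzero(np.abs(z) > 3))                    # index of the global outlier (the last point)
```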
2. Collective Outliers
As the name suggests, if in a given dataset some of the data points, as a whole,
deviate significantly from the rest of the dataset, they may be termed collective outliers.
Here, the individual data objects may not be outliers, but when seen as a whole, they may
behave as outliers. To detect these types of outliers, we might need background
information about the relationship between those data objects showing the behavior of
outliers.
For example: In an Intrusion Detection System, a DOS (denial-of-service) packet sent
from one computer to another may be considered normal behavior. However, if this
happens to several computers at the same time, it may be considered abnormal behavior,
and as a whole these packets can be termed collective outliers.
3. Contextual Outliers
They are also known as Conditional Outliers. Here, a data object is an outlier if it
deviates significantly from the other data points, but only with respect to a specific context
or condition. A data point may be an outlier under a certain condition and may show normal
behavior under another condition. Therefore, a context has to be specified as part of the
problem statement in order to identify contextual outliers.
Contextual outlier analysis provides flexibility for users where one can examine
outliers in different contexts, which can be highly desirable in many applications. The
attributes of the data point are decided on the basis of both contextual and behavioral
attributes.
For example: A temperature reading of 40°C may behave as an outlier in the context
of a “winter season” but will behave like a normal data point in the context of a “summer
season”.
A low temperature value in June is a contextual outlier because the same value in December
is not an outlier.
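A small pandas sketch of the same idea: the deviation is measured within each context (month), so a value that is unremarkable globally can still be flagged in its own context; the data and the 1.5-standard-deviation threshold are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "month":  ["Dec"] * 5 + ["Jun"] * 5,
    "temp_c": [2, 3, 1, 2, 3, 28, 29, 27, 30, 3],   # 3 C is ordinary in December, odd in June
})

# z-score of each reading within its own month (the context).
grp = df.groupby("month")["temp_c"]
df["z_in_context"] = (df["temp_c"] - grp.transform("mean")) / grp.transform("std")

print(df[df["z_in_context"].abs() > 1.5])            # only the June reading of 3 C is flagged
```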