Datamining Mod3
DATA MINING
MODULE-3
Introduction to Clustering:
● Imagine that a company wants to organize all of its customers into groups, so that a separate marketing strategy can be developed to target each group, based on common features shared by the customers per group.
● Here, the class label (or group ID) of each customer is unknown.
● We have to discover these groupings.
● Given a large number of customers and many attributes describing customer profiles, it can be very costly or even
infeasible to have a human study the data and manually come up with a way to partition the customers into
strategic groups. An appropriate tool must be used.
CLUSTERING:
● Clustering is the process of grouping a set of data objects into multiple groups or
clusters so that objects within a cluster have high similarity, but are very
dissimilar to objects in other clusters.
● Cluster analysis is the process of partitioning a set of data objects into subsets.
○ Each subset is a cluster, such that objects in a cluster are similar to one another,
yet dissimilar to objects in other clusters.
○ The set of clusters resulting from a cluster analysis can be referred to as a
clustering.
● Cluster analysis has been widely used in many applications such as business
intelligence, image pattern recognition, outlier detection, web search, biology, and
security.
CLUSTERING PARADIGMS:
● There are many clustering approaches.
● It is difficult to provide a crisp categorization of clustering methods because these
categories may overlap so that a method may have features from several categories.
● In general, the major fundamental clustering methods can be classified into the
following categories:
Partitioning methods:
● Given a set of n objects, a partitioning method constructs k (k ≤ n) partitions of the
data, where each partition represents a cluster and each cluster contains at least one object.
● Most partitioning methods are distance-based: each object is placed in the cluster whose
representative it is closest to.
● Typical partitioning methods are k-means and k-medoids, discussed later in this module.
Hierarchical methods:
● A hierarchical method creates a hierarchical decomposition of the given set of data
objects.
● Hierarchical clustering methods can be distance-based or density- and continuity-based.
● Based on how the hierarchical decomposition is formed, a hierarchical method can be
classified as;
○ Agglomerative approaches.
○ Divisive approaches.
Agglomerative approach:
● Also called the bottom-up approach.
● Starts with each object forming a separate group.
● It successively merges the objects or groups close to one another, until all the groups
are merged into one (the topmost level of the hierarchy), or a termination condition
holds.
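The bottom-up merging described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names, the Euclidean metric, and the use of single linkage (distance between the closest pair of members) are my assumptions, not choices stated in the notes.

```python
# Minimal agglomerative (bottom-up) clustering sketch using single linkage.
# Assumptions: Euclidean distance, single-linkage merge criterion.
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerative(points, k):
    """Repeatedly merge the two closest groups until only k groups remain."""
    clusters = [[p] for p in points]  # start: each object forms its own group
    while len(clusters) > k:
        best, best_d = (0, 1), float("inf")
        # find the pair of clusters with the smallest single-linkage distance
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(euclidean(p, q) for p in clusters[i] for q in clusters[j])
                if d < best_d:
                    best_d, best = d, (i, j)
        i, j = best
        clusters[i].extend(clusters[j])  # merge the closest pair of groups
        del clusters[j]
    return clusters

clusters = agglomerative([(1, 2), (2, 2), (8, 9), (9, 9)], k=2)
```

Here the merging stops when k = 2 groups remain; a real implementation could instead stop on any termination condition, as the notes mention.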
Divisive approach:
● Also called top-down approach.
● Starts with all the objects in the same cluster.
● In each successive iteration, a cluster is split into smaller clusters, until each object
forms its own cluster, or a termination condition holds.
Drawback:
● Once a step (merge or split) is done, it can never be undone.
● This rigidity is useful in that it leads to smaller computation costs by not having to worry
about a combinatorial number of different choices.
● Cannot correct erroneous decisions.
Density-based methods:
● Most partitioning methods cluster objects based on the distance between objects.
● Such methods can find only spherical-shaped clusters and encounter difficulty in discovering
clusters of arbitrary shapes.
● Other clustering methods have been developed based on the notion of density.
● Their general idea is to continue growing a given cluster as long as the density (number
of objects or data points) in the “neighborhood” exceeds some threshold.
○ For example, for each data point within a given cluster, the neighborhood of a given
radius has to contain at least a minimum number of points.
○ Such a method can be used to filter out noise or outliers and discover clusters of
arbitrary shape.
● Density-based methods can divide a set of objects into multiple exclusive clusters, or a
hierarchy of clusters.
● Typically, density-based methods consider exclusive clusters only, and do not consider fuzzy
clusters.
● E.g., density-based clustering methods: DBSCAN, OPTICS, DENCLUE.
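The "grow a cluster while the neighborhood is dense enough" idea can be sketched as a minimal DBSCAN-style procedure. This is a simplified sketch, not the full DBSCAN algorithm; the parameter values, helper names, and label conventions below are my own choices.

```python
# DBSCAN-style sketch: a cluster keeps growing as long as each point's
# eps-neighborhood contains at least min_pts points (including itself).
import math

def neighbors(points, i, eps):
    """Indices of all points within distance eps of point i (including i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    labels = [None] * len(points)  # None = unvisited, -1 = noise
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(points, i, eps)
        if len(seeds) < min_pts:
            labels[i] = -1  # neighborhood not dense enough: mark as noise
            continue
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id  # noise reachable from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_seeds = neighbors(points, j, eps)
            if len(j_seeds) >= min_pts:  # j is itself dense: keep growing
                queue.extend(j_seeds)
        cluster_id += 1
    return labels

# two dense groups plus one isolated outlier (filtered out as noise)
labels = dbscan([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (5, 5)],
                eps=1.5, min_pts=3)
```

Note how the isolated point (5, 5) receives the noise label, illustrating the claim above that density-based methods can filter out outliers.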
Grid-based methods:
● Grid-based methods quantize the object space into a finite number of cells that form a grid
structure.
● All the clustering operations are performed on the grid structure (i.e., on the quantized
space).
● The main advantage of this approach is its fast processing time, which is typically
independent of the number of data objects and dependent only on the number of cells in
each dimension in the quantized space.
● Using grids is often an efficient approach to many spatial data mining problems, including
clustering.
● Therefore, grid-based methods can be integrated with other clustering methods such as
density-based methods and hierarchical methods.
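The quantization step that grid-based methods rely on can be sketched as follows. The cell size and helper names are illustrative assumptions; the point is that after this step, clustering operations can work on the (few) cells instead of the (many) objects.

```python
# Grid-quantization sketch: map each object to the cell it falls into,
# so later clustering work only touches cells, not individual objects.
from collections import defaultdict

def quantize(points, cell_size):
    """Group objects by the grid cell containing them."""
    grid = defaultdict(list)
    for p in points:
        cell = tuple(int(coord // cell_size) for coord in p)
        grid[cell].append(p)
    return grid

grid = quantize([(0.2, 0.4), (0.3, 0.1), (2.5, 2.7)], cell_size=1.0)
# (0.2, 0.4) and (0.3, 0.1) share cell (0, 0); (2.5, 2.7) falls in cell (2, 2)
```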
Note:
● Most of the clustering algorithms integrate the ideas of several clustering methods, so that it is
sometimes difficult to classify a given algorithm as uniquely belonging to only one clustering method
category.
● Furthermore, some applications may have clustering criteria that require the integration of several
clustering techniques.
DISTANCE MEASURES IN CLUSTER ANALYSIS:
● Consider two n-dimensional data objects i = (xi1, xi2, ..., xin) and j = (xj1, xj2, ..., xjn).
● Euclidean distance between i and j is defined as:
      d(i, j) = √((xi1 − xj1)² + (xi2 − xj2)² + ... + (xin − xjn)²)
● Manhattan (or city block) distance, defined as:
      d(i, j) = |xi1 − xj1| + |xi2 − xj2| + ... + |xin − xjn|
● Minkowski distance, a generalization of both, defined as:
      d(i, j) = (|xi1 − xj1|^p + |xi2 − xj2|^p + ... + |xin − xjn|^p)^(1/p)
      (p = 2 gives the Euclidean distance; p = 1 gives the Manhattan distance.)
1. Let x1 = (1, 2) and x2 = (3, 5) represent two objects. Calculate the Euclidean and
Manhattan distance.
2. Given 5-dimensional samples A (1, 0, 2, 5, 3) and B (2, 1, 0, 3, -1), find the Euclidean,
Manhattan and Minkowski distance (given p = 3, for Minkowski distance).
Solution (1):
      Euclidean distance = √((1 − 3)² + (2 − 5)²) = √(4 + 9) = √13 = 3.61
      Manhattan distance = |1 − 3| + |2 − 5| = 2 + 3 = 5
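The worked examples above can be checked in code. The formulas follow the definitions given earlier; the function and variable names are my own.

```python
# Distance measures from the definitions above; Euclidean and Manhattan
# are the p = 2 and p = 1 special cases of the Minkowski distance.
def minkowski(a, b, p):
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def euclidean(a, b):
    return minkowski(a, b, 2)

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

# Problem 1: x1 = (1, 2), x2 = (3, 5)
print(round(euclidean((1, 2), (3, 5)), 2))  # 3.61
print(manhattan((1, 2), (3, 5)))            # 5

# Problem 2: A = (1, 0, 2, 5, 3), B = (2, 1, 0, 3, -1)
A, B = (1, 0, 2, 5, 3), (2, 1, 0, 3, -1)
print(round(euclidean(A, B), 2))            # 5.1  (= sqrt(26))
print(manhattan(A, B))                      # 10
print(round(minkowski(A, B, 3), 2))         # 4.34 (= cube root of 82)
```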
PARTITIONING METHODS
● Given a data set of n objects, a partitioning method distributes the objects into k clusters
so that objects within a cluster are "similar" to one another, while
objects of different clusters are "dissimilar" in terms of the data set attributes.
● The most well-known and commonly used partitioning methods are:
○ k-means
○ k-medoids
● Variants of k-medoids:
■ PAM (Partitioning Around Medoids)
■ CLARA (Clustering Large Applications)
■ CLARANS (Clustering Large Applications based upon Randomized Search)
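As a sketch of how a partitioning method operates, the classic k-means loop alternates between assigning each object to its nearest center and recomputing each center as the mean of its group. The fixed initial centers and iteration count below are simplifying assumptions (real implementations choose initial centers randomly and iterate until assignments stop changing).

```python
# Minimal k-means sketch: assign objects to the nearest center, then
# recompute each center as the mean of its assigned objects; repeat.
import math

def kmeans(points, centers, iterations=10):
    for _ in range(iterations):
        # assignment step: each object joins its nearest center's group
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda idx: math.dist(p, centers[idx]))
            groups[nearest].append(p)
        # update step: each center moves to the mean of its group
        centers = [tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers, groups

centers, groups = kmeans([(1, 1), (1, 2), (8, 8), (9, 8)],
                         centers=[(0, 0), (10, 10)])
```

On this tiny data set the centers converge to (1.0, 1.5) and (8.5, 8.0), the means of the two obvious groups.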
PAM:
● PAM was one of the first k-medoids algorithms introduced.
● It attempts to determine k partitions for n objects.
● Initially, we’ll select k representative objects randomly.
● After that, the algorithm repeatedly tries to make a better choice of cluster
representatives.
● All of the possible pairs of objects are analyzed, where one object in each pair is
considered a representative object and the other is not.
● The quality of the resulting clustering is calculated for each such combination.
● An object, oj, is replaced with the object causing the greatest reduction in error.
● The set of best objects for each cluster in one iteration forms the representative
objects for the next iteration.
● The final set of representative objects are the respective medoids of the clusters.
Drawback of PAM method:
● PAM does not scale well to large data sets: to decide whether a non-representative object,
orandom, is a good replacement for a current representative object, oj, a cost must be
computed with respect to each of the non-representative objects, p.
● Each time a reassignment occurs, a difference in absolute error, E, is contributed to the cost
function.
○ Therefore, the cost function calculates the difference in absolute-error value if a
current representative object is replaced by a nonrepresentative object.
● The total cost of swapping is the sum of costs incurred by all non-representative objects.
● If the total cost is negative, then oj is replaced or swapped with orandom, since the actual
absolute error E would be reduced.
● If the total cost is positive, the current representative object, oj, is considered acceptable,
and nothing is changed in the iteration.
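The swap test described above can be sketched directly: compute the absolute error E before and after replacing a medoid, and keep the swap only when the change (the total cost) is negative. The helper names and the toy data are my own; this is an illustration of the cost calculation, not the full PAM algorithm.

```python
# PAM swap-cost sketch: total cost of replacing medoid oj with candidate
# orandom is the change in absolute error E; a swap with negative cost
# improves the clustering and would be kept.
import math

def absolute_error(points, medoids):
    """E = sum over all objects of the distance to their nearest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def swap_cost(points, medoids, oj, orandom):
    """Change in E if medoid oj is replaced by non-medoid orandom."""
    new_medoids = [orandom if m == oj else m for m in medoids]
    return absolute_error(points, new_medoids) - absolute_error(points, medoids)

points = [(1, 1), (2, 2), (8, 8), (9, 9)]
medoids = [(1, 1), (9, 9)]
# Moving the second medoid to (2, 2) crowds both medoids into one group,
# so the absolute error grows and the swap is rejected (positive cost).
cost = swap_cost(points, medoids, (9, 9), (2, 2))
```

By contrast, swapping (9, 9) for (8, 8) leaves the error unchanged by symmetry, so that swap would also leave the clustering as-is.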