DW & DM Unit 4 Notes
Step-1: Consider each letter as a single cluster and calculate the distance of each
cluster from all the other clusters.
Step-2: Comparable clusters are merged together to form a single cluster. Say clusters
(B) and (C) are very similar to each other, so we merge them; similarly for clusters (D)
and (E). After this step we have the clusters [(A), (BC), (DE), (F)].
Step-3: We recalculate the proximity according to the algorithm and merge the two
nearest clusters, (DE) and (F), to obtain the clusters [(A), (BC), (DEF)].
Step-4: Repeating the same process, the clusters (DEF) and (BC) are comparable and are
merged to form a new cluster. We are now left with the clusters [(A), (BCDEF)].
Step-5: Finally, the two remaining clusters are merged into a single cluster
[(ABCDEF)].
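The merge sequence in these steps can be reproduced with a standard single-linkage routine. Below is a minimal Python sketch using SciPy; the 1-D coordinates for A-F are made up purely so that (B, C) and (D, E) come out as the closest pairs:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

labels = list("ABCDEF")
# Made-up 1-D coordinates chosen so that (B, C) and (D, E) are the closest pairs.
points = np.array([[0.0], [5.0], [5.2], [9.0], [9.3], [10.0]])

Z = linkage(points, method="single")   # agglomerative (bottom-up) merge sequence
for row in Z:                          # each row: the two clusters merged and their distance
    print(row)

# Cutting the tree into, e.g., 3 clusters reproduces Step-3: [(A), (BC), (DEF)].
print(dict(zip(labels, fcluster(Z, t=3, criterion="maxclust"))))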
Clustering creates groups such that similar instances fall within the same group and
dissimilar objects fall into different groups.
Partitioning Clustering
Partitioning clustering is split into two subtypes: K-Means clustering and Fuzzy C-Means.
In k-means clustering, the objects are divided into a number of clusters specified by the
number 'K'. So if we say K = 2, the objects are divided into two clusters, c1 and c2.
Fuzzy c-means is very similar to k-means in the sense that it clusters objects that have similar
characteristics together. In k-means clustering, a single object cannot belong to two different
clusters. But in c-means, objects can belong to more than one cluster.
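A minimal k-means sketch with K = 2 (scikit-learn; the 2-D points are made up for illustration):

import numpy as np
from sklearn.cluster import KMeans

# Made-up 2-D objects forming two loose groups.
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [5.0, 5.2], [5.1, 4.9], [4.8, 5.1]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)   # K = 2 -> clusters c1 and c2
print(km.labels_)            # hard assignment: each object belongs to exactly one cluster
print(km.cluster_centers_)   # the two cluster means
# In fuzzy c-means, by contrast, each object would instead receive a degree of
# membership in both clusters rather than a single hard label.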
• Amazon: Amazon also uses unsupervised learning to learn customers' purchases
and recommend products that are most frequently bought together, which is an
example of association rule mining.
• Anomaly detection: Unsupervised clustering can process large datasets and discover
data points that are atypical in a dataset.
Applications:
Cluster analysis has been widely used in numerous applications, including market research,
pattern recognition, data analysis, and image processing.
In business, clustering can help marketers discover distinct groups in their customer bases
and characterize customer groups based on purchasing patterns.
In biology, it can be used to derive plant and animal taxonomies, categorize genes with
similar functionality, and gain insight into structures inherent in populations.
Clustering may also help in the identification of areas of similar land use in an earth
observation database and in the identification of groups of houses in a city according to
house type, value, and geographic location, as well as the identification of groups of
automobile insurance policy holders with a high average claim cost.
Clustering is also called data segmentation in some applications because clustering
partitions large data sets into groups according to their similarity.
Clustering can also be used for outlier detection. Applications of outlier detection include
the detection of credit card fraud and the monitoring of criminal activities in electronic
commerce.
Partitioning Methods:
A partitioning method constructs k partitions of the data, where each partition represents a
cluster and k <= n (n being the number of data objects). That is, it classifies the data into k
groups, which together satisfy the
following requirements:
Each group must contain at least one object, and
Each object must belong to exactly one group.
The general criterion of a good partitioning is that objects in the same cluster are close or
related to each other, whereas objects of different clusters are far apart or very different.
Hierarchical Methods:
A hierarchical method creates a hierarchical decomposition of the given set of data objects. A
hierarchical method can be classified as being either agglomerative or divisive, based on
how the hierarchical decomposition is formed.
❖ The agglomerative approach, also called the bottom-up approach, starts with each
object forming a separate group. It successively merges the objects or groups that are
close to one another, until all of the groups are merged into one or until a termination
condition holds.
❖ The divisive approach, also called the top-down approach, starts with all of the objects in
the same cluster. In each successive iteration, a cluster is split up into smaller clusters,
until eventually each object is in one cluster, or until a termination condition holds.
Hierarchical methods suffer from the fact that once a step (merge or split) is done, it can never
be undone. This rigidity is useful in that it leads to smaller computation costs by not having
to worry about a combinatorial number of different choices.
Model-Based Methods:
❖ Model-based methods hypothesize a model for each of the clusters and find the best fit
of the data to the given model.
❖ A model-based algorithm may locate clusters by constructing a density function that
reflects the spatial distribution of the data points.
❖ It also leads to a way of automatically determining the number of clusters based on
standard statistics, taking "noise" or outliers into account and thus yielding robust
clustering methods.
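A common model-based example is a Gaussian mixture. The sketch below (scikit-learn, with made-up data) fits a density model and chooses the number of clusters automatically with a standard statistic, the BIC:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
# Made-up data: two Gaussian blobs plus a few noise points.
X = np.vstack([rng.normal(0, 1, (50, 2)),
               rng.normal(6, 1, (50, 2)),
               rng.uniform(-4, 10, (5, 2))])

# Fit mixtures with different numbers of components and keep the one with the best (lowest) BIC.
bics = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X) for k in range(1, 6)}
best_k = min(bics, key=bics.get)
print(bics, "-> chosen number of clusters:", best_k)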
Constraint-Based Clustering:
It is a clustering approach that incorporates user-specified or application-oriented
constraints.
A constraint expresses a user’s expectation or describes properties of the desired
clustering results, and provides an effective means for communicating with the
clustering process.
Various kinds of constraints can be specified, either by a user or as per application
requirements.
Spatial clustering deals with the existence of obstacles and with clustering under user-
specified constraints. In addition, semi-supervised clustering employs pairwise
constraints in order to improve the quality of the resulting clustering.
E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2
where E is the sum of the square error for all objects in the data set,
p is the point in space representing a given object, and
m_i is the mean of cluster C_i.
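This criterion can be computed directly; the helper name below (square_error) is illustrative, not from the text:

import numpy as np

def square_error(X, labels, means):
    # E = sum over clusters C_i of the squared distance from each object p to its mean m_i.
    return sum(np.sum((X[labels == i] - m) ** 2) for i, m in enumerate(means))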
The k-means algorithm is sensitive to outliers because an object with an extremely large
value may substantially distort the distribution of data. This effect is particularly
exacerbated due to the use of the square-error function.
Instead of taking the mean value of the objects in a cluster as a reference point, we can pick
actual objects to represent the clusters, using one representative object per cluster. Each
remaining object is clustered with the representative object to which it is the most similar.
The partitioning method is then performed based on the principle of minimizing the sum of
the dissimilarities between each object and its corresponding reference point. That is, an
absolute-error criterion is used, defined as
E = \sum_{j=1}^{k} \sum_{p \in C_j} |p - o_j|
where E is the sum of the absolute error for all objects in the data set, p is a given object in
cluster C_j, and o_j is the representative object (medoid) of C_j.
Case 1:
p currently belongs to representative object o_j. If o_j is replaced by o_random as a representative
object and p is closest to one of the other representative objects o_i, i ≠ j, then p is reassigned to o_i.
Case 2:
p currently belongs to representative object o_j. If o_j is replaced by o_random as a representative
object and p is closest to o_random, then p is reassigned to o_random.
Case 3:
p currently belongs to representative object o_i, i ≠ j. If o_j is replaced by o_random as a representative
object and p is still closest to o_i, then the assignment does not change.
Case 4:
p currently belongs to representative object o_i, i ≠ j. If o_j is replaced by o_random as a representative
object and p is closest to o_random, then p is reassigned to o_random.
Four cases of the cost function for k-medoids clustering
The k-Medoids Algorithm:
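The full PAM (Partitioning Around Medoids) listing is not reproduced here; the following is only a rough Python sketch of the swap-based idea described above, with illustrative function names (absolute_error, pam):

import numpy as np

def absolute_error(X, medoids):
    # E: sum of distances from every object to its closest representative object (medoid).
    d = np.linalg.norm(X[:, None, :] - X[medoids][None, :, :], axis=2)
    return d.min(axis=1).sum()

def pam(X, k, max_iter=100, seed=0):
    # Start from k random medoids, then repeatedly try replacing a medoid o_j by a
    # non-medoid o_random, keeping the swap whenever it lowers the absolute error.
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(X), size=k, replace=False))
    best_cost = absolute_error(X, medoids)
    for _ in range(max_iter):
        improved = False
        for j in range(k):
            for o_random in range(len(X)):
                if o_random in medoids:
                    continue
                candidate = medoids.copy()
                candidate[j] = o_random
                cost = absolute_error(X, candidate)
                if cost < best_cost:          # total swap cost is negative: accept the swap
                    medoids, best_cost = candidate, cost
                    improved = True
        if not improved:
            break
    # Reassign every object to its nearest medoid (this is where Cases 1-4 play out).
    d = np.linalg.norm(X[:, None, :] - X[medoids][None, :, :], axis=2)
    return medoids, d.argmin(axis=1), best_cost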
A hierarchical clustering method works by grouping data objects into a tree of clusters.
The quality of a pure hierarchical clustering method suffers from its inability to
perform adjustment once a merge or split decision has been executed. That is, if a particular
merge or split decision later turns out to have been a poor choice, the method cannot
backtrack and correct it.
We can specify constraints on the objects to beclustered. In a real estate application, for
example, one may like to spatially cluster only those luxury mansions worth over a million
dollars. This constraint confines the set of objects to be clustered. It can easily be handled
by preprocessing, after which the problem reduces to an instance of unconstrained
clustering.
A user may like to set a desired range for each clustering parameter. Clustering parameters
are usually quite specific to the given clustering algorithm. Examples of parameters include
k, the desired number of clusters in a k-means algorithm; or ε, the radius, and the minimum
number of points in the DBSCAN algorithm. Although such user-specified parameters may
strongly influence the clustering results, they are usually confined to the algorithm itself.
Thus, their fine tuning and processing are usually not considered a form of constraint-based
clustering.
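For example, k is supplied directly to k-means, while DBSCAN takes the radius and minimum number of points instead; a small sketch (scikit-learn, with made-up data and parameter values):

import numpy as np
from sklearn.cluster import KMeans, DBSCAN

X = np.random.RandomState(0).rand(100, 2)   # made-up 2-D data set

# k-means: the user supplies k, the desired number of clusters.
km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# DBSCAN: the user instead supplies eps (the radius) and min_samples
# (the minimum number of points in an eps-neighborhood).
db_labels = DBSCAN(eps=0.15, min_samples=5).fit_predict(X)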
➢ Constraints on distance or similarity functions:
We can specify different distance or similarity functions for specific attributes of the objects
to be clustered, or different distance measures for specific pairs of objects. When clustering
sportsmen, for example, we may use different weighting schemes for height, body weight,
age, and skill level. Although this will likely change the mining results, it may not alter the
clustering process per se. However, in some cases, such changes may make the evaluation of
the distance function nontrivial, especially when it is tightly intertwined with the clustering
process.
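For instance, an attribute-weighted distance function might look like the sketch below (the weights are made up, purely for illustration):

import numpy as np

# Hypothetical weights for the attributes height, body weight, age, and skill level.
WEIGHTS = np.array([0.4, 0.2, 0.1, 0.3])

def weighted_euclidean(x, y, w=WEIGHTS):
    # Euclidean distance in which each attribute contributes according to its weight.
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return np.sqrt(np.sum(w * diff ** 2))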
➢ User-specified constraints on the properties of individual clusters:
A user may like to specify desired characteristics of the resulting clusters, which may
strongly influence the clustering process.
➢ Semi-supervised clustering based on partial supervision:
The quality of unsupervised clustering can be significantly improved using some weak form
of supervision. This may be in the form of pairwise constraints (i.e., pairs of objects labeled
as belonging to the same or different cluster). Such a constrained clustering process is
called semi-supervised clustering.
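A minimal sketch of how must-link / cannot-link pairwise constraints might be represented and checked against a clustering result (the function name is illustrative, not from the text):

def satisfies_constraints(labels, must_link, cannot_link):
    # labels[i] is the cluster assigned to object i;
    # must_link pairs should share a cluster, cannot_link pairs should not.
    ok_must = all(labels[i] == labels[j] for i, j in must_link)
    ok_cannot = all(labels[i] != labels[j] for i, j in cannot_link)
    return ok_must and ok_cannot

# Example: objects 0 and 1 are labeled as belonging together, 0 and 4 as belonging apart.
print(satisfies_constraints([0, 0, 1, 1, 2], must_link=[(0, 1)], cannot_link=[(0, 4)]))   # True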
Outlier Analysis:
There exist data objects that do not comply with the general behavior or model of the data.
Such data objects, which are grossly different from or inconsistent with the remaining set
of data, are called outliers.
Many data mining algorithms try to minimize the influence of outliers or eliminate them
altogether. This, however, could result in the loss of important hidden information because
one person’s noise could be another person’s signal. In other words, the outliers may be of
particular interest, such as in the case of fraud detection, where outliers may indicate
fraudulent activity. Thus, outlier detection and analysis is an interesting data mining task,
referred to as outlier mining.
It can be used in fraud detection, for example, by detecting unusual usage of credit cards or
telecommunication services. In addition, it is useful in customized marketing for
identifying the spending behavior of customers with extremely low or extremely high
incomes, or in medical analysis for finding unusual responses to various medical treatments.
Outlier mining can be described as follows: Given a set of n data points or objects and k, the
expected number of outliers, find the top k objects that are considerably dissimilar,
exceptional, or inconsistent with respect to the remaining data. The outlier mining problem
can be viewed as two subproblems:
Define what data can be considered as inconsistent in a given data set, and
Find an efficient method to mine the outliers so defined.
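One simple (illustrative, not the only) way to realize the second subproblem is to score each object by how far it lies from its nearest neighbors and return the k highest-scoring objects:

import numpy as np

def top_k_outliers(X, k_outliers, n_neighbors=5):
    # Score each object by the average distance to its n_neighbors nearest neighbors
    # and return the indices of the k_outliers most dissimilar objects.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # pairwise distances
    np.fill_diagonal(d, np.inf)                                 # ignore self-distance
    nearest = np.sort(d, axis=1)[:, :n_neighbors]
    scores = nearest.mean(axis=1)
    return np.argsort(scores)[::-1][:k_outliers]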
Types of outlier detection:
➢ Statistical Distribution-Based Outlier Detection
➢ Distance-Based Outlier Detection
➢ Density-Based Local Outlier Detection
➢ Deviation-Based Outlier Detection
In this method, the data space is partitioned into cells with a side length equal to
dmin/(2√k), where k here denotes the dimensionality of the data. Each cell has two layers
surrounding it. The first layer is one cell thick, while the second is ⌈2√k − 1⌉ cells thick.
Let M be the maximum number of outliers that can exist in the dmin-neighborhood of an
outlier.
An object, o, in the current cell is considered an outlier only if the cell + 1 layer count is less
than or equal to M. If this condition does not hold, then all of the objects in the cell can be
removed from further investigation as they cannot be outliers.
If the cell + 2 layers count is less than or equal to M, then all of the objects in the cell are
considered outliers. Otherwise, if this number is more than M, then it is possible that some
of the objects in the cell may be outliers. To detect these outliers, object-by-object
processing is used where, for each object, o, in the cell, objects in the second layer of o
are examined. For objects in the cell, only those objects having no more than M points in
their dmin-neighborhoods are outliers. The dmin-neighborhood of an object consists of
the object's cell, all of its first layer, and some of its second layer.
A variation to the algorithm is linear with respect to n and guarantees that no more than three
passes over the data set are required. It can be used for large disk-resident data sets, yet does
not scale well for high dimensions.
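The cell-based bookkeeping itself is fairly involved; the sketch below implements only the underlying distance-based notion of an outlier (an object with at most M other objects inside its dmin-neighborhood) as a simple quadratic-time nested-loop check, with illustrative parameter names:

import numpy as np

def distance_based_outliers(X, dmin, M):
    # Flag object o as an outlier when at most M other objects lie within distance dmin of o.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = (d <= dmin).sum(axis=1) - 1   # exclude the object itself
    return np.where(neighbors <= M)[0]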
That is, there are at most k-1 objects that are closer to p than o. You may be wondering at this
point how k is determined. The LOF method links to density-based clustering in that it sets k
to the parameter MinPts, which specifies the minimum number of points for use in identifying
clusters based on density.
Here, MinPts (as k) is used to define the local neighborhood of an object, p.
The k-distance neighborhood of an object p is denoted N_{k-distance(p)}(p), or N_k(p) for short. By
setting k to MinPts, we get N_{MinPts}(p). It contains the MinPts-nearest neighbors of p. That is, it
contains every object whose distance is not greater than the MinPts-distance of p.
The reachability distance of an object p with respect to object o (where o is one of the
MinPts-nearest neighbors of p) is defined as
reach_dist_MinPts(p, o) = max{MinPts-distance(o), d(p, o)}.
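These two quantities can be computed directly; a small NumPy sketch with illustrative function names (minpts_distance, reach_dist):

import numpy as np

def minpts_distance(X, o, min_pts):
    # Distance from object o to its min_pts-th nearest neighbor (its MinPts-distance).
    d = np.linalg.norm(X - X[o], axis=1)
    return np.sort(d)[min_pts]   # index 0 is o itself, so index min_pts is the MinPts-th neighbor

def reach_dist(X, p, o, min_pts):
    # reach_dist_MinPts(p, o) = max{ MinPts-distance(o), d(p, o) }
    return max(minpts_distance(X, o, min_pts), np.linalg.norm(X[p] - X[o]))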