Unit 5
• Clustering is the process of finding groups of objects such that the objects in a
group are similar (or related) to one another and different from (or unrelated
to) the objects in other groups.
• Different clustering methods may generate different clusterings on the same
data set.
– intra-cluster similarity is high (the objects inside a cluster are similar to
one another)
– inter-cluster similarity is low (objects in different clusters are dissimilar to
each other)
– In image recognition
– In web search
– In outlier detection
– In biology
– For example, in handwritten digit recognition: some people may write “2”
with a small circle at the bottom-left part, while others may not. We can
use clustering to determine subclasses for “2,” each of which represents a
variation in the way “2” can be written.
• Requirements for clustering:
– Scalability: clustering on a sample of a given large data set may lead to
biased results.
– Ability to deal with different types of attributes: many algorithms are
designed for numeric data, but applications may also involve other data
types.
– Requirement for domain knowledge to determine input parameters: this
not only burdens users, but it also makes the quality of clustering
difficult to control.
– Ability to deal with noisy data: some clustering algorithms are sensitive
to such data and may produce clusters of poor quality.
– Incremental clustering and insensitivity to input order: some clustering
algorithms are sensitive to the order of input data; that is, given a set of
data objects, such an algorithm may return dramatically different
clusterings depending on the order in which the objects are presented.
– Capability of clustering high-dimensionality data: human eyes are good
at judging the quality of clustering for up to three dimensions only.
– Constraint-based clustering: suppose that your job is to choose the
locations for a given number of new automated teller machines (ATMs)
in a city; you may cluster households while considering constraints such
as the city's rivers and highway networks.
– Partitioning Methods
– Hierarchical Methods
– Density-Based Methods
– Grid-Based Methods
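• The following minimal sketch (an illustration only, assuming scikit-learn is
installed; all parameter values are arbitrary) contrasts the three families that
have standard scikit-learn implementations; grid-based methods are omitted:

    import numpy as np
    from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

    X = np.array([[1, 1.5], [1, 4.5], [2, 1.5], [2, 3.5], [3, 2.5], [3, 4]])

    # Partitioning: k-means with k = 2 clusters.
    print(KMeans(n_clusters=2, n_init=10).fit_predict(X))

    # Hierarchical: bottom-up (agglomerative) merging, cut into 2 clusters.
    print(AgglomerativeClustering(n_clusters=2).fit_predict(X))

    # Density-based: DBSCAN with neighbourhood radius eps and MinPts;
    # labels of -1 mark noise points.
    print(DBSCAN(eps=1.2, min_samples=2).fit_predict(X))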
– Given n objects, a partitioning method classifies the data into k groups
(k ≤ n), which together satisfy the following requirements: each group
must contain at least one object, and each object must belong to exactly
one group.
• The centroid of a cluster is its center point, for example the mean of the
objects (or points) assigned to the cluster.
• Example: use k-means to group the following six instances into k = 2 clusters:

    Instance    X     Y
        1       1    1.5
        2       1    4.5
        3       2    1.5
        4       2    3.5
        5       3    2.5
        6       3    4
• Solution:
– Initially, choose two points at random as the initial cluster centers; say
objects 1 and 3 are chosen.
– Assigning each object to its nearest center and recomputing the cluster
means gives C1 = {1, 2} and C2 = {3, 4, 5, 6}; a further iteration leaves
the assignment unchanged, so the algorithm terminates with:
C1 = {1, 2}   C2 = {3, 4, 5, 6}
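• The following plain-Python sketch reproduces this result, assuming the same
setup (objects 1 and 3 as the initial centers):

    from math import dist  # Euclidean distance (Python 3.8+)

    points = [(1, 1.5), (1, 4.5), (2, 1.5), (2, 3.5), (3, 2.5), (3, 4)]
    centers = [points[0], points[2]]  # objects 1 and 3 as initial centers

    while True:
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[], []]
        for p in points:
            nearest = min(range(2), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: each center becomes the mean of its cluster.
        new_centers = [(sum(x for x, _ in c) / len(c),
                        sum(y for _, y in c) / len(c)) for c in clusters]
        if new_centers == centers:  # no change: the algorithm has converged
            break
        centers = new_centers

    print(clusters)  # [[(1, 1.5), (1, 4.5)], [(2, 1.5), (2, 3.5), (3, 2.5), (3, 4)]]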
• Weakness of K-means:
– Applicable only when the mean is defined (so it is not directly applicable
to categorical data).
• Agglomerative hierarchical clustering:
– starts by letting each object form its own cluster and iteratively merges
clusters into larger and larger clusters, until all the objects are in a single
cluster or a certain termination condition is satisfied.
– For the merging step, it finds the two clusters that are closest to each other
(according to some similarity measure), and combines the two to form one
cluster.
• Divisive hierarchical clustering:
– It starts by placing all objects in one cluster, which is the hierarchy’s root.
– It then divides the root cluster into several smaller sub-clusters, and
recursively partitions those clusters into smaller ones.
– The partitioning process continues until each cluster at the lowest level
either contains only one object, or the objects within a cluster are
sufficiently similar to each other.
– Conversely, divisive methods initially let all the given objects form one
cluster, which they iteratively split into smaller clusters.
– Once a merge or split step is done, it can never be undone. Thus, merge or
split decisions, if not well chosen, may lead to low-quality clusters.
• Moreover, the methods do not scale well because each decision of merge or
split needs to examine and evaluate many objects or clusters.
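• A minimal agglomerative sketch, assuming SciPy is available; single-link
(closest-pair) distance is used as the similarity measure, and the six instances
from the k-means example are reused:

    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import pdist

    points = [(1, 1.5), (1, 4.5), (2, 1.5), (2, 3.5), (3, 2.5), (3, 4)]

    # pdist computes all pairwise distances; linkage then repeatedly merges
    # the two closest clusters, recording every merge of the bottom-up
    # hierarchy.
    merges = linkage(pdist(points), method="single")

    # Cut the hierarchy so that at most two clusters remain.
    print(fcluster(merges, t=2, criterion="maxclust"))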
• Major features of density-based clustering (e.g., DBSCAN):
– Discover clusters of arbitrary shape
– Handle noise
• Two parameters:
– Eps: the maximum radius of the neighborhood
– MinPts: the minimum number of points in an Eps-neighborhood of that
point
[Figure: points p and q with MinPts = 5 and Eps = 1 cm]
• Directly density-reachable: a point p is directly density-reachable from a
point q wrt. Eps and MinPts if:
1) p belongs to NEps(q), and
2) q satisfies the core point condition: |NEps(q)| ≥ MinPts
[Figure: p inside the Eps-neighborhood of q, with MinPts = 5 and Eps = 1 cm]
• Density-reachable: a point p is density-reachable from a point q wrt. Eps and
MinPts if there is a chain of points p1, …, pn with p1 = q and pn = p such
that each pi+1 is directly density-reachable from pi.
[Figure: p density-reachable from q through the intermediate point p1]
• Density-connected: a point p is density-connected to a point q wrt. Eps and
MinPts if there is a point o such that both p and q are density-reachable from
o wrt. Eps and MinPts.
[Figure: p and q both density-reachable from the common point o]
• Density = number of points within a specified radius (Eps).
• A core point has more than a specified number of points (MinPts) within
Eps; these points lie in the interior of a cluster.
• A border point is not a core point, but is in the neighborhood of a core point.
• A noise point is any point that is not a core point or a border point.
[Figure: core, border, and noise points illustrated with MinPts = 7]
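• A minimal sketch of this core/border/noise classification; the points, Eps,
and MinPts below are illustrative, and the Eps-neighborhood is taken to
include the point itself (conventions differ on this detail):

    from math import dist

    def classify(points, eps, min_pts):
        # NEps(p), taken here to include p itself.
        neigh = {p: [q for q in points if dist(p, q) <= eps] for p in points}
        cores = {p for p in points if len(neigh[p]) >= min_pts}
        labels = {}
        for p in points:
            if p in cores:
                labels[p] = "core"
            elif any(q in cores for q in neigh[p]):
                labels[p] = "border"  # not core, but near a core point
            else:
                labels[p] = "noise"   # neither core nor border
        return labels

    print(classify([(0, 0), (0, 1), (0, 2), (3, 3)], eps=1.0, min_pts=3))
    # (0, 1) is core; (0, 0) and (0, 2) are border; (3, 3) is noise.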
• Solution (DBSCAN with Eps = 2 on the points A1 to A8):
– d(a,b) denotes the Euclidean distance between a and b. It is obtained
directly from the distance matrix, which is calculated as follows:
– d(a,b) = √((xb − xa)² + (yb − ya)²)
          A1    A2    A3    A4    A5    A6    A7    A8
    A1     0   √25   √36   √13   √50   √52   √65    √5
    A2           0   √37   √18   √25   √17   √10   √20
    A3                 0   √25    √2    √2   √53   √41
    A4                       0   √13   √17   √52    √2
    A5                             0    √2   √45   √25
    A6                                   0   √29   √29
    A7                                         0   √58
    A8                                               0
– N2(x) denotes the Eps-neighborhood of x for Eps = 2:
• N2(A1)={};
• N2(A2)={};
• N2(A3)={A5, A6};
• N2(A4)={A8};
• N2(A5)={A3, A6};
• N2(A6)={A3, A5};
• N2(A7)={};
• N2(A8)={A4};
• So A1, A2, and A7 are outliers, while we have two clusters C1={A4,
A8} and C2={A3, A5, A6}
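• The following sketch reproduces this result directly from the squared
distances in the matrix above (the original coordinates are not given in these
notes); Eps = 2, and MinPts = 2 counting the point itself is an assumption
consistent with the stated clusters:

    pts = ["A1", "A2", "A3", "A4", "A5", "A6", "A7", "A8"]

    # Squared distances from the matrix above (upper triangle, row by row).
    sq = {("A1","A2"): 25, ("A1","A3"): 36, ("A1","A4"): 13, ("A1","A5"): 50,
          ("A1","A6"): 52, ("A1","A7"): 65, ("A1","A8"): 5,  ("A2","A3"): 37,
          ("A2","A4"): 18, ("A2","A5"): 25, ("A2","A6"): 17, ("A2","A7"): 10,
          ("A2","A8"): 20, ("A3","A4"): 25, ("A3","A5"): 2,  ("A3","A6"): 2,
          ("A3","A7"): 53, ("A3","A8"): 41, ("A4","A5"): 13, ("A4","A6"): 17,
          ("A4","A7"): 52, ("A4","A8"): 2,  ("A5","A6"): 2,  ("A5","A7"): 45,
          ("A5","A8"): 25, ("A6","A7"): 29, ("A6","A8"): 29, ("A7","A8"): 58}
    d2 = lambda a, b: sq.get((a, b)) or sq.get((b, a))

    eps2, min_pts = 4, 2  # Eps = 2 (squared: 4); MinPts = 2 is an assumption
    neigh = {p: {q for q in pts if q != p and d2(p, q) <= eps2} for p in pts}
    cores = {p for p in pts if len(neigh[p]) + 1 >= min_pts}

    clusters, seen = [], set()
    for p in pts:
        if p in cores and p not in seen:
            cluster, frontier = set(), [p]
            while frontier:                    # grow the cluster by following
                q = frontier.pop()             # density-reachable chains
                if q not in cluster:
                    cluster.add(q)
                    if q in cores:             # only core points extend chains
                        frontier.extend(neigh[q] - cluster)
            clusters.append(cluster)
            seen |= cluster

    print(clusters)                           # [{'A3','A5','A6'}, {'A4','A8'}]
    print([p for p in pts if p not in seen])  # outliers: ['A1', 'A2', 'A7']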
• Advantages:
– DBSCAN does not require one to specify the number of clusters in the
data a priori, as opposed to k-means.
– The parameters minPts and ε can be set by a domain expert, if the data is
well understood.
• Disadvantages:
– DBSCAN cannot cluster data sets well with large differences in densities,
since the minPts-ε combination cannot then be chosen appropriately for
all clusters.
– If the data and scale are not well understood, choosing a meaningful
distance threshold ε can be difficult.
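• A common remedy from the DBSCAN literature (not covered in these notes)
is the k-distance plot: sort each point's distance to its k-th nearest neighbour
(k = minPts) and look for an "elbow" that suggests ε. A minimal sketch,
assuming NumPy and matplotlib:

    import numpy as np
    import matplotlib.pyplot as plt

    def k_distance_plot(X, k):
        X = np.asarray(X, dtype=float)
        # All pairwise Euclidean distances (small data sets only).
        d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
        # Column 0 of each sorted row is the self-distance 0, so column k
        # holds the distance to the k-th nearest neighbour.
        kth = np.sort(d, axis=1)[:, k]
        plt.plot(np.sort(kth)[::-1])
        plt.xlabel("points sorted by k-distance")
        plt.ylabel(f"distance to {k}-th nearest neighbour")
        plt.show()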
• Explain the different types of cluster analysis methods and discuss their
features.