Unit IV and V
Clustering Validation
To find good clustering partitions for a data set, regardless of the clustering algorithm used, the
quality of the partitions must be evaluated. In contrast to the classification task, the identification
of the best clustering algorithm for a given data set lacks a clear definition. Several cluster
validity criteria have been proposed; some are automatic, while others rely on expert input.
The automatic validation measures for clustering partition evaluation can be roughly divided into
three categories:
• External indices: The external criterion uses external information, such as class labels, if available,
to define the quality of the clusters in a given partition. Two of the most common external
measures are the corrected Rand index and the Jaccard index.
• Internal indices: The internal criterion looks for compactness within each cluster and/or
separation between different clusters. Two of the most common internal measures are the
silhouette index, which measures both compactness and separation, and the within-groups sum of
squares, which measures only compactness.
• Relative indices: The relative criterion compares partitions found by two or more clustering
techniques or by different runs of the same technique.
Silhouette internal index
This evaluates both the compactness inside each cluster and the separation between clusters, measuring:
• how close to each other the objects inside a cluster are
• how far the objects in each cluster are from the closest object in another cluster.
To do this, it applies the following equation to each object xi:
s(xi) = 1 − a(xi)/b(xi),   if a(xi) < b(xi)
        0,                 if a(xi) = b(xi)
        b(xi)/a(xi) − 1,   if a(xi) > b(xi)        (5.5)
where:
• a(xi) is the average distance between xi and all other objects in its cluster
• b(xi) is the minimum average distance between xi and all other objects from each other cluster.
The average of all s(xi) gives the partition silhouette measure value.
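A minimal sketch of this computation in Python is shown below, using NumPy and assuming the Euclidean distance and at least two clusters; the function name and signature are illustrative, not a prescribed API.

import numpy as np

def silhouette(X, labels):
    # Average silhouette over all objects, following the piecewise formula (5.5)
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise Euclidean distances
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False
        if not same.any():              # singleton cluster: treat its silhouette as 0
            scores.append(0.0)
            continue
        a = dist[i, same].mean()        # a(xi): mean distance to the other objects in its cluster
        b = min(dist[i, labels == c].mean()     # b(xi): minimum mean distance to another cluster
                for c in set(labels) if c != labels[i])
        if a < b:
            scores.append(1 - a / b)
        elif a == b:
            scores.append(0.0)
        else:
            scores.append(b / a - 1)
    return float(np.mean(scores))       # the partition silhouette value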
Within-groups sum of squares
This is also an internal measure but only measures compactness. It sums the squared Euclidean
distance between each instance and the centroid of its cluster. From Equation (5.4) we know that
the squared Euclidean distance between two instances p and q with m attributes each is given by:
sed(p, q) = Σ_{k=1}^{m} |pk − qk|²
The within groups sum of squares is given by:
s = Σ_{i=1}^{K} Σ_{j=1}^{Ji} sed(pj, Ci)
where K is the number of clusters, Ji is the number of instances in cluster i, and Ci is the
centroid of cluster i.
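The sketch below computes this measure directly from the definitions above; it is a minimal illustration with NumPy, and the function name is illustrative.

import numpy as np

def within_groups_ss(X, labels):
    # Sum of squared Euclidean distances between each instance and the centroid of its cluster
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    total = 0.0
    for c in np.unique(labels):
        members = X[labels == c]
        centroid = members.mean(axis=0)            # Ci
        total += ((members - centroid) ** 2).sum() # sum of sed(pj, Ci) over the cluster
    return total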
Jaccard external measure
This is a variation of a similar measure used in classification tasks. It evaluates how uniform the
distribution of the objects in each cluster is with respect to the class label. It uses the following
equation:
J = M11 / (M01 + M10 + M11)        (5.8)
where:
• M01 is the number of objects in other clusters but with the same label
• M10 is the number of objects in the same cluster, but with different labels
• M00 is the number of objects in other clusters with different labels
• M11 is the number of objects in the same cluster with the same label.
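In practice these counts are often obtained by looking at pairs of objects. The sketch below is a minimal illustration under that pair-counting interpretation, which is an assumption relative to the description above; the function and variable names are hypothetical.

from itertools import combinations

def jaccard_external(cluster_labels, class_labels):
    # J = M11 / (M01 + M10 + M11), counting pairs of objects
    m11 = m10 = m01 = 0
    for i, j in combinations(range(len(cluster_labels)), 2):
        same_cluster = cluster_labels[i] == cluster_labels[j]
        same_class = class_labels[i] == class_labels[j]
        if same_cluster and same_class:
            m11 += 1            # same cluster, same label
        elif same_cluster:
            m10 += 1            # same cluster, different labels
        elif same_class:
            m01 += 1            # different clusters, same label
    return m11 / (m01 + m10 + m11)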
Clustering Techniques
Another criterion is the approach used to define what a cluster is, which determines the elements to be
included in the same cluster. According to this criterion, the main types of clusters are [20]:
• Separation-based: each object in the cluster is closer to every other object in the cluster than to
any object outside the cluster
• Prototype-based: each object in the cluster is closer to a prototype representing the cluster than
to a prototype representing any other cluster
• Graph-based: represents the data set by a graph structure associating each node with an object
and connecting objects that belong to the same cluster with an edge
• Density-based: a cluster is a region where the objects have a high number of close neighbors
(i.e. a dense region), surrounded by a region of low density
• Shared-property: a cluster is a group of objects that share a property.
Methods
• K-means: the most popular clustering algorithm and a representative of partitional and
prototype-based clustering methods
• DBSCAN: another partitional clustering method, but in this case density-based
• Agglomerative hierarchical clustering: a representative of hierarchical and graph-based
clustering methods
K-means
Centroids are a key concept in order to understand k-means. They represent a kind of centre of
gravity for a set of instances. We start by describing the concept before explaining how k-means
works, how to read the results, and how to set the hyper-parameters.
Centroids and Distance Measures
A centroid can also be seen as a prototype or profile of all the objects in a cluster, for example
the average of all the objects in the cluster. Thus, if we have several photos of cats and dogs, and
we put all the dogs in one cluster and all the cats in another, the centroid of the dog cluster
would be a photo representing the average features of all the dog photos. We can
observe, therefore, that the centroid of a cluster is not usually one of the objects in the cluster.
Example 5.7
The centroid for the friends Bernhard, Gwyneth and James has the average age and education level of
the three friends: 41 years (the average of 43, 38 and 42) and an education level of 3.4 (the average
of 2.0, 4.2 and 4.1). As you can see, none of the three friends has this age and education level.
In order to have an object of the cluster as a prototype, a medoid is used instead of a centroid.
The medoid of a cluster is the instance with the shortest sum of distances to the other instances of
the cluster.
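As a minimal illustration of the difference, the sketch below computes the centroid and the medoid of a small cluster with NumPy, using the age and education values from Example 5.7 (the assignment of values to particular friends is illustrative).

import numpy as np

# Age and education level of the three friends in Example 5.7
cluster = np.array([[43, 2.0],
                    [38, 4.2],
                    [42, 4.1]])

centroid = cluster.mean(axis=0)   # [41.0, ~3.43] -- not an actual member of the cluster

# Medoid: the instance with the smallest sum of distances to the other instances
dists = np.linalg.norm(cluster[:, None, :] - cluster[None, :, :], axis=2)
medoid = cluster[dists.sum(axis=1).argmin()]
print(centroid, medoid)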
How K-means Works
The way k-means works can be shown graphically; it follows the k-means algorithm, the
pseudocode for which can be seen below.
Algorithm K-means
1: INPUT D the data set
2: INPUT d the distance measure
3: INPUT K the number of clusters
4: Define the initial K centroids (they are usually randomly defined, but can be defined explicitly
in some software packages)
5: repeat
6: Associate each instance in D with the closest centroid according to the chosen distance
measure d
7: Recalculate each centroid using all instances from D associated with it
8: until no instance from D changes its associated centroid
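The pseudocode above can be turned into a short NumPy implementation. The sketch below is a minimal version that assumes the Euclidean distance as d and a random choice of K instances as the initial centroids; the function name and parameters are illustrative.

import numpy as np

def kmeans(D, K, seed=0):
    rng = np.random.default_rng(seed)
    D = np.asarray(D, dtype=float)
    centroids = D[rng.choice(len(D), size=K, replace=False)]   # step 4: random initial centroids
    assignment = np.full(len(D), -1)
    while True:
        dist = np.linalg.norm(D[:, None, :] - centroids[None, :, :], axis=2)
        new_assignment = dist.argmin(axis=1)                   # step 6: closest centroid
        if np.array_equal(new_assignment, assignment):         # step 8: stop when nothing changes
            return centroids, assignment
        assignment = new_assignment
        for k in range(K):                                     # step 7: recompute each centroid
            if np.any(assignment == k):
                centroids[k] = D[assignment == k].mean(axis=0)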
DBSCAN
Like k-means, DBSCAN (density-based spatial clustering of applications with noise) is used for
partitional clustering. In contrast to k-means, DBSCAN automatically defines the number of
clusters. DBSCAN is a density-based technique, defining objects forming a dense region as
belonging to the same cluster. Objects not belonging to dense regions are considered to be noise.
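As a small illustration, the snippet below runs scikit-learn's DBSCAN implementation on a tiny synthetic data set (assuming scikit-learn is available); the hyper-parameters eps (neighbourhood radius) and min_samples (minimum number of neighbours) control what counts as a dense region, and noise objects receive the label -1. The data values are illustrative.

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.1], [1.2, 0.9], [1.1, 1.0],   # one dense region -> one cluster
              [8.0, 8.1], [8.2, 7.9], [8.1, 8.0],   # another dense region -> second cluster
              [4.5, 4.5]])                          # isolated point -> labelled as noise

labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(labels)   # e.g. [0 0 0 1 1 1 -1]; the number of clusters is not given beforehand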
Agglomerative Hierarchical Clustering Technique
Hierarchical algorithms construct clusters progressively. This can be done by starting with all
instances in a single cluster and dividing it progressively, or by starting with as many clusters as
the number of instances and joining them up step by step. The first approach is top-down while
the second is bottom-up. The agglomerative hierarchical clustering method is a bottom-up
approach.
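A minimal sketch of the bottom-up approach using SciPy (assumed available) is shown below: linkage starts with one cluster per instance and merges clusters step by step, and fcluster cuts the resulting hierarchy into a chosen number of clusters. The data and parameter values are illustrative.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.2, 1.1], [5.0, 5.0], [5.1, 4.9]])

Z = linkage(X, method="average")                   # merge clusters step by step (bottom-up)
labels = fcluster(Z, t=2, criterion="maxclust")    # cut the hierarchy into 2 clusters
print(labels)                                      # e.g. [1 1 2 2]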
Frequent Itemsets
An arbitrary combination of items is called an “itemset”. It is, in essence, an arbitrary subset of
the set I of all items. Let us think a little about the number of possible itemsets (combinations of
items). In total, there are 2^|I| − 1 possible itemsets, where |I| is the number of items in I.
Example 6.1 In our example, I = {Arabic, Indian, Mediterranean, Oriental, Fast food} is the set
of all five items considered, so |I| = 5. The subsets {Fast food}, {Indian, Oriental} and
{Arabic, Oriental, Fast food} are itemsets of size 1, 2 and 3, respectively. The number of all
possible itemsets of length 1 that can be created from items in I is five; of lengths 2 and 3 it is ten
and ten, respectively; while there are five itemsets of length 4 and one itemset of length 5. In total,
there are 5 + 10 + 10 + 5 + 1 = 31 = 32 − 1 = 2^5 − 1 itemsets we can generate from items in I.
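The count in Example 6.1 can be reproduced with a short enumeration; the sketch below is illustrative and simply lists every non-empty subset of I.

from itertools import combinations

I = ["Arabic", "Indian", "Mediterranean", "Oriental", "Fast food"]

# Enumerate every non-empty itemset that can be built from I
itemsets = [set(c) for k in range(1, len(I) + 1) for c in combinations(I, k)]
print(len(itemsets))   # 31 = 2**5 - 1, as in Example 6.1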
Setting the minsup Threshold
The minsup threshold is a hyper-parameter of high importance, which has to be set carefully by
the user according to their expectations of the results:
• Setting it to a very low value would give a large number of itemsets that would be too specific
to be considered “frequent”. These itemsets might apply in too few cases to be useful.
• On the other hand, very high values for minsup would give a small number of itemsets. These
would be too generic to be useful. Thus, the resulting information would probably not represent
new knowledge for the user. Another important aspect of the minsup value is whether the
number of frequent itemsets that results is small enough for subsequent analysis.
Example 6.4
All itemsets generated from I = {Arabic, Indian, Mediterranean, Oriental, Fast food} can be numbered
and organized into a so-called “lattice”. Each itemset is connected to the subset(s) positioned
above it and to the superset(s) positioned below it. The uppermost itemset (with the number 0) is
an empty set, which should not be considered an itemset. It is introduced into the lattice only for
the sake of completeness.
Apriori–a Join-based Method
The oldest and simplest technique for mining frequent itemsets follows the generic, so-called
"join-based" principle, as set out below. Consider the Apriori principle applied to our dataset with a
minimum support threshold minsup = 3. In the first step of the algorithm, the support of each itemset of
length k = 1 is computed, resulting in four frequent itemsets and one non-frequent itemset. In the next
step, itemsets of length k = 2 are generated from the frequent itemsets of length k = 1, so no
itemset containing item F is considered. Step 2 results in four frequent itemsets, which are used
to generate itemsets of length k = 3, in the following step.
Algorithm Apriori.
1: INPUT T the transactional dataset
2: INPUT min_sup the minimum support threshold
3: Set k = 1
4: Set stop = false
5: repeat
6: Select all frequent itemsets of length k (with support at least min_sup)
7: if there are no two frequent itemsets of length k then
8: stop = true
9: else
10: Set k = k +1
11: until stop
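A minimal Python sketch of the pseudocode above is given below. It keeps the candidate-generation (join) step simple, omits the pruning refinements of full Apriori implementations, and uses a small hypothetical transaction set over the restaurant items of the running example, with an absolute min_sup count.

from itertools import combinations

def apriori(T, min_sup):
    # T is a list of transactions (each a set of items); min_sup is an absolute support count
    def support(itemset):
        # one scan of the transactions per candidate itemset
        return sum(1 for t in T if itemset <= t)

    items = {i for t in T for i in t}
    frequent = {}                                  # maps each frequent itemset to its support
    candidates = [frozenset([i]) for i in items]   # candidate itemsets of length k = 1
    k = 1
    while candidates:
        level = {}
        for c in candidates:
            s = support(c)
            if s >= min_sup:                       # keep only frequent itemsets of length k
                level[c] = s
        frequent.update(level)
        # join step: build candidates of length k + 1 from pairs of frequent length-k itemsets
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == k + 1}
        k += 1
    return frequent

# Hypothetical transactions over the restaurant items of the running example
T = [{"Arabic", "Indian"}, {"Indian", "Oriental"},
     {"Indian", "Oriental", "Fast food"}, {"Arabic", "Indian", "Oriental"}]
print(apriori(T, min_sup=3))   # {Indian}: 4, {Oriental}: 3, {Indian, Oriental}: 3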
Eclat
The main obstacle for the Apriori algorithm is that in every step it needs to scan the whole
transactional database in order to count the support of candidate itemsets. Counting support is
one of the bottlenecks for frequent itemset mining algorithms, especially if the database does not
fit into the memory. There are many technical issues, not relevant here, for why counting support
is computationally expensive if the database is large and does not fit into the memory.
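Eclat-style algorithms typically sidestep repeated database scans by keeping, for each item, the set of IDs of the transactions that contain it (a "vertical" or tidset representation), so that the support of an itemset is simply the size of an intersection of sets. The sketch below only illustrates this idea on hypothetical data; it is not a full Eclat implementation.

# Vertical (tidset) representation: for each item, the IDs of the transactions containing it
T = {1: {"a", "b"}, 2: {"a", "c"}, 3: {"a", "b", "c"}, 4: {"b", "c"}}   # hypothetical data

tidsets = {}
for tid, items in T.items():
    for item in items:
        tidsets.setdefault(item, set()).add(tid)

# Support of {a, b} is the size of the intersection of the two tidsets -- no database scan needed
support_ab = len(tidsets["a"] & tidsets["b"])
print(support_ab)   # 2 (transactions 1 and 3)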
Behind Support and Confidence
Consider the lattice of association rules corresponding to the frequent itemset {I, M, O} found in the data. In
some sense, each pattern reveals a kind of knowledge that might support further decisions of
users of these patterns. However, only some patterns are “interesting” enough for the user,
representing useful and unexpected knowledge. Evaluation of the interestingness of patterns
depends on the application domain and also on the subjective opinion of the user.
Cross-support Patterns
It is not rare in real-world data that most of the items have relatively low or modest support,
while a few of the items have high support. For example, more students at a university attend a
course on introductory data analytics than one on quantum computing. If a pattern contains low-
support items and high-support items, then it is called a cross-support pattern. A cross-support
pattern can represent interesting relationships between items but also, and most likely, it can be
spurious, since the items it contains are weakly correlated in the transactions. To measure the
extent to which a pattern P can be called a cross-support pattern, the so-called support ratio is
used.
It is defined as: supratio(P) = min{s(i1), s(i2), …, s(ik)} / max{s(i1), s(i2), …, s(ik)}
where s(i1),s(i2), …,s(ik) are the supports of items i1,i2,…,ik contained in P and min and max
return the minimum and maximum value, respectively, in their arguments. In other words,
supratio computes the ratio of the minimal support of items present in the pattern to the maximal
support of items present in the pattern.
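The computation itself is a one-liner; the sketch below uses hypothetical item supports chosen so that one rare item pulls the ratio down, which is the typical signature of a cross-support pattern.

def supratio(item_supports):
    # Support ratio of a pattern, given the supports s(i1), ..., s(ik) of its items
    return min(item_supports) / max(item_supports)

# Two popular items combined with one rare item -> a likely cross-support pattern
print(supratio([0.70, 0.65, 0.05]))   # 0.05 / 0.70 = ~0.071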
Lift
First, let us start with some considerations about the confidence measure and its relationship to
the strength of an association rule. We will use a so-called contingency table, which contains
some statistics related to an association rule. A contingency table related to two itemsets X and Y
appearing in the rule X ⇒ Y contains four frequency counts of transactions in which:
• X and Y are present
• X is present and Y is absent
• X is absent and Y is present
• neither X nor Y are present
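The lift measure itself is not spelled out in the excerpt above, but it is commonly computed as the confidence of X ⇒ Y divided by the support of Y. The sketch below derives support, confidence and lift from the four contingency-table counts under that standard definition; the counts used are hypothetical.

def confidence_and_lift(n_xy, n_x_noty, n_notx_y, n_notx_noty):
    # Derive support, confidence and lift of X => Y from the four contingency-table counts
    n = n_xy + n_x_noty + n_notx_y + n_notx_noty   # total number of transactions
    s_x = (n_xy + n_x_noty) / n                    # support of X
    s_y = (n_xy + n_notx_y) / n                    # support of Y
    confidence = (n_xy / n) / s_x                  # conf(X => Y) = s(X and Y) / s(X)
    lift = confidence / s_y                        # standard definition of lift
    return confidence, lift

print(confidence_and_lift(40, 10, 60, 90))   # confidence 0.8, lift 1.6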
Simpson’s Paradox
A related phenomenon, called Simpson’s paradox, says that certain correlations between pairs of
itemsets(antecedents and consequents of rules) appearing in different groups of data may
disappear or be reversed when these groups are combined.
Example 6.23
Consider 800 transcripts of records (transactions) formed by two groups of students, A and B.
The groups A and B may refer, for example, to students on the physics and biology study
programs, respectively, while X = {basics of genetics} and Y = {introduction to data analytics}
might be two itemsets, each consisting of a single course.
• In group A, the rule X ⇒ Y has high confidence (0.8) and good lift (1.79) values.
• In group B, the rule Y ⇒ X has high confidence (0.8) and good lift (1.66) values.
Other Types of Pattern
Sequential Patterns
The input to sequential pattern mining is a sequence database, denoted by S. Each row consists
of a sequence of events consecutively recorded in time. Each event is an itemset of arbitrary
length assembled from items available in the data.
Example 6.24 As an example, let the sequence database in Table 6.7 represent shopping records
of customers over some period of time. For example, the first row can be interpreted as follows:
the customer with ID = 1 bought the following items (a possible encoding of this row is sketched after the list):
• first visit: items a and b
• second visit: items a, b and c
• third visit: items a, c, d, e
• fourth visit: items b and f.
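One simple way to represent such a sequence database in Python is as a mapping from customer IDs to lists of itemsets. The encoding below covers only the first row described above and is an assumption about representation, not a prescribed format.

# Each customer is a sequence of visits, and each visit is an itemset (here, a Python set)
customer_1 = [{"a", "b"},             # first visit
              {"a", "b", "c"},        # second visit
              {"a", "c", "d", "e"},   # third visit
              {"b", "f"}]             # fourth visit

S = {1: customer_1}   # the sequence database S maps customer IDs to their sequences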
Frequent Sequence Mining
Given the set of all available items I, a sequence database S and a threshold value minsup, frequent
sequence mining aims at finding those sequences, called frequent sequences, generated from I
whose support in S is at least minsup. It is important to mention that the number of frequent
sequences that can be generated from S with available items I is usually much larger than the
number of frequent itemsets generated from I.
Example 6.27
As an example, the number of all possible itemsets which can be generated from the 6 items a, b, c,
d, e and f, regardless of the value of the minsup threshold, is 2^6 − 1 = 64 − 1 = 63.
The numbers of frequent sequences with respect to the sequence database in Table 6.7 are: 6 for
minsup = 1.0, 20 for minsup = 0.8, 53 for minsup = 0.6 and 237 for minsup = 0.4.
Closed and Maximal Sequences
Similar to closed and maximal frequent itemsets, closed and maximal sequential patterns can be
defined. A frequent sequential pattern is closed if it is not a subsequence of any other frequent
sequential pattern with the same support. A frequent sequential pattern s is maximal if it is not a
subsequence of any other frequent sequential pattern.
Example 6.28
Given the sequential database in Table 6.7 and minsup = 0.8, the frequent sequences ⟨{b,f} ⟩ and
⟨{a},{f}⟩, both with support 0.8, are not closed since they are subsequences of ⟨{a},{b,f} ⟩ with
the same support 0.8, which is a maximal frequent sequence. On the other hand, the frequent
sequence ⟨{a,e}⟩ with support 1.0 is closed since all of its “supersequences” ⟨{a},{a,e} ⟩, ⟨{b},
{a,e}⟩ and ⟨{c},{a,e}⟩ have less support, at 0.8.
UNIT-V
Since a classification task with two predictive attributes can be represented by a graph with two
axes, the representation is two-dimensional. With up to three predictive attributes, it is possible
to visualize the data distribution without a mathematical transformation.