Data Analytics Unit 4

Unit:- 4 Frequent Item-sets and Clustering

Frequent Itemsets in a Dataset (Association Rule Mining):- Association
rule mining searches for frequent itemsets in a dataset. In frequent itemset mining, the
interesting associations and correlations between itemsets in transactional and
relational databases are found. In short, frequent itemset mining shows which items
appear together in a transaction or relation.

Need for Association Mining: Frequent itemset mining is the generation of association
rules from a transactional dataset. If there are two items X and Y that are frequently
purchased together, then it is good to put them together in stores or to provide a
discount offer on one item on purchase of the other. This can really increase sales.

For example, it is likely to find that if a customer buys Milk and Bread, he/she also
buys Butter. So the association rule is ['Milk', 'Bread'] => ['Butter'], and the seller can
suggest that the customer buy Butter if he/she buys Milk and Bread.

Important Definitions:

 Support: It is one of the measures of interestingness. It tells about the usefulness
and certainty of rules. A support of 5% means that 5% of the transactions in the
database follow the rule.
Support(A -> B) = Support_count(A ∪ B) / total number of transactions
 Confidence: A confidence of 60% means that 60% of the customers who
purchased Milk and Bread also bought Butter.
Confidence(A -> B) = Support_count(A ∪ B) / Support_count(A)
If a rule satisfies both minimum support and minimum confidence, it is a strong
rule.
 Support_count(X): The number of transactions in which X appears. If X is
A ∪ B, then it is the number of transactions in which both A and B are
present.
 Maximal Itemset: An itemset is maximal frequent if it is frequent and none of its
supersets is frequent.
 Closed Itemset: An itemset is closed if none of its immediate supersets has the
same support count as the itemset.
 K-Itemset: An itemset which contains K items is a K-itemset. It can also be said
that an itemset is frequent if its support count is greater than or equal to the
minimum support count.
Example On finding Frequent Itemsets – Consider the given dataset with given
transactions.
TID Items
1 Bread, Milk
2 Bread, Butter, Beer, Egg
3 Milk, Butter, Beer, Coke
4 Bread, Milk, Butter, Beer
5 Bread, Milk, Butter, Coke
Before we start defining the rule, let us first see the basic definitions.
Support Count (σ) – Frequency of occurrence of an itemset.
Here σ({Milk, Bread, Butter}) = 2
Frequent Itemset – An itemset whose support is greater than or equal to the minsup
threshold.
Association Rule – An implication expression of the form X -> Y, where X and Y
are any two itemsets.
Example: {Milk, Butter} -> {Beer}
Rule Evaluation Metrics –
 Support(s) –
The number of transactions that include items in both the {X} and {Y} parts of the
rule, as a percentage of the total number of transactions. It is a measure of how
frequently the collection of items occurs together, as a fraction of all
transactions.
 Support(X => Y) = Support_count(X ∪ Y) / |T| –
It is interpreted as the fraction of transactions that contain both X and Y.
 Confidence(c) –
It is the ratio of the number of transactions that include all items in both {X}
and {Y} to the number of transactions that include all items in {X}.
 Conf(X => Y) = Supp(X ∪ Y) / Supp(X) –
It measures how often the items in Y appear in transactions that also contain the
items in X.
 Lift(l) –
The lift of the rule X => Y is the confidence of the rule divided by the expected
confidence, assuming that the itemsets X and Y are independent of each
other. The expected confidence is simply the support (frequency) of {Y}.
 Lift(X => Y) = Conf(X => Y) / Supp(Y) –
A lift value near 1 indicates that X and Y appear together about as often as expected,
greater than 1 means they appear together more often than expected, and less than 1
means they appear together less often than expected. Greater lift values indicate
stronger association.
Example – From the above table, for the rule {Milk, Butter} => {Beer}:
s = σ({Milk, Butter, Beer}) / |T|
= 2/5
= 0.4
c = σ({Milk, Butter, Beer}) / σ({Milk, Butter})
= 2/3
= 0.67
l = Supp({Milk, Butter, Beer}) / (Supp({Milk, Butter}) * Supp({Beer}))
= 0.4/(0.6*0.6)
= 1.11
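As a quick check, the same three numbers can be computed directly from the transaction list. The snippet below is a minimal sketch in plain Python (no external libraries); helper names such as support_count and support are illustrative, not from the text.

```python
# Transactions from the example table (TID 1-5)
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Butter", "Beer", "Egg"},
    {"Milk", "Butter", "Beer", "Coke"},
    {"Bread", "Milk", "Butter", "Beer"},
    {"Bread", "Milk", "Butter", "Coke"},
]

def support_count(itemset):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset):
    """Fraction of all transactions containing the itemset."""
    return support_count(itemset) / len(transactions)

X, Y = {"Milk", "Butter"}, {"Beer"}
s = support(X | Y)                              # 2/5 = 0.4
c = support_count(X | Y) / support_count(X)     # 2/3 ≈ 0.67
l = s / (support(X) * support(Y))               # 0.4 / (0.6 * 0.6) ≈ 1.11
print(round(s, 2), round(c, 2), round(l, 2))
```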
Association rules are very useful in analyzing datasets. The data is collected using
bar-code scanners in supermarkets. Such databases consist of a large number of
transaction records, each of which lists all items bought by a customer in a single
purchase. The manager can thus learn whether certain groups of items are consistently
purchased together, and use this information for adjusting store layouts, cross-selling,
and promotions based on these statistics.
Example On finding Frequent Itemsets – Consider the given dataset with given
transactions.
TID Items
1 {A,C,D}
2 {B,C,D}
3 {A,B,C,D}
4 {B,D}
5 {A,B,C,D}
Let's say the minimum support count is 3.
 The relation that holds is: maximal frequent ⊆ closed frequent ⊆ frequent.
1-frequent itemsets:
{A} = 3 // not closed due to {A, C}; not maximal
{B} = 4 // not closed due to {B, D}; not maximal
{C} = 4 // not closed due to {C, D}; not maximal
{D} = 5 // closed, since no immediate superset has the same count; not maximal
2-frequent itemsets:
{A, B} = 2 // not frequent because support count < minimum support count, so ignore
{A, C} = 3 // not closed due to {A, C, D}
{A, D} = 3 // not closed due to {A, C, D}
{B, C} = 3 // not closed due to {B, C, D}
{B, D} = 4 // closed, but not maximal due to {B, C, D}
{C, D} = 4 // closed, but not maximal due to {B, C, D}
3-frequent itemsets:
{A, B, C} = 2 // ignore, not frequent (support count < minimum support count)
{A, B, D} = 2 // ignore, not frequent (support count < minimum support count)
{A, C, D} = 3 // maximal frequent
{B, C, D} = 3 // maximal frequent
4-frequent itemsets:
{A, B, C, D} = 2 // ignore, not frequent
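The frequent/closed/maximal labels above can be reproduced with a brute-force enumeration over all itemsets. This is only a sketch for small examples (the helper names are illustrative); a real miner would use Apriori or FP-growth instead.

```python
from itertools import combinations

transactions = [{"A","C","D"}, {"B","C","D"}, {"A","B","C","D"}, {"B","D"}, {"A","B","C","D"}]
min_sup = 3
items = sorted(set().union(*transactions))

def count(itemset):
    return sum(1 for t in transactions if set(itemset) <= t)

# All frequent itemsets with their support counts
frequent = {frozenset(c): count(c)
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if count(c) >= min_sup}

# Closed: no proper superset has the same support count
closed = [s for s, n in frequent.items()
          if not any(s < t and m == n for t, m in frequent.items())]

# Maximal: no proper superset is frequent at all
maximal = [s for s in frequent
           if not any(s < t for t in frequent)]

print("frequent:", {tuple(sorted(s)): n for s, n in frequent.items()})
print("closed:", [tuple(sorted(s)) for s in closed])
print("maximal:", [tuple(sorted(s)) for s in maximal])
```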
Apriori Algorithm in Machine Learning

The Apriori algorithm uses frequent itemsets to generate association rules, and it is
designed to work on databases that contain transactions. With the help of these
association rules, it determines how strongly or how weakly two objects are connected.
This algorithm uses a breadth-first search and a hash tree to count the candidate
itemsets efficiently. It is an iterative process for finding the frequent itemsets in a
large dataset.

Steps for Apriori Algorithm

Below are the steps of the Apriori algorithm:

Step-1: Determine the support of the itemsets in the transactional database, and select
the minimum support and minimum confidence.

Step-2: Take all the itemsets in the transactions whose support value is higher than the
minimum (selected) support value.

Step-3: Find all the rules over these subsets that have a confidence value higher than
the threshold (minimum confidence).

Step-4: Sort the rules in decreasing order of lift.
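For reference, the same pipeline (frequent itemsets, then rules filtered by confidence and sorted by lift) is available off the shelf; the sketch below uses the mlxtend library, assuming it is installed, with illustrative thresholds of 40% support and 50% confidence.

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [
    ["Bread", "Milk"],
    ["Bread", "Butter", "Beer", "Egg"],
    ["Milk", "Butter", "Beer", "Coke"],
    ["Bread", "Milk", "Butter", "Beer"],
    ["Bread", "Milk", "Butter", "Coke"],
]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Step 1-2: frequent itemsets above the minimum support
frequent = apriori(df, min_support=0.4, use_colnames=True)

# Step 3-4: rules above the minimum confidence, sorted by decreasing lift
rules = association_rules(frequent, metric="confidence", min_threshold=0.5)
print(rules.sort_values("lift", ascending=False)[
    ["antecedents", "consequents", "support", "confidence", "lift"]])
```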

Apriori Algorithm Working

We will understand the apriori algorithm using an example and mathematical


calculation:

Example: Suppose we have the following dataset that has various transactions, and
from this dataset, we need to find the frequent itemsets and generate the association
rules using the Apriori algorithm:
Solution:

Step-1: Calculating C1 and L1:

o In the first step, we will create a table that contains the support count (the
frequency of each item individually in the dataset) of each item in the
given dataset. This table is called the candidate set C1.

o Now, we will take out all the itemsets that have a support count greater than
the minimum support (2). This gives us the table for the frequent itemset L1.
All the itemsets have a support count greater than or equal to the minimum
support, except E, so the itemset E is removed.

Step-2: Candidate Generation C2, and L2:

o In this step, we will generate C2 with the help of L1. In C2, we will create
pairs of the itemsets of L1 (all 2-item subsets).
o After creating the subsets, we will again find the support count from the main
transaction table of the dataset, i.e., how many times these pairs have occurred
together in the given dataset. This gives the table for C2.
o Again, we need to compare the C2 support counts with the minimum support
count, and after comparing, the itemsets with smaller support counts are
eliminated from the table C2. This gives the table for L2.

Step-3: Candidate generation C3, and L3:

o For C3, we will repeat the same two steps, but now we will form the C3
table with 3-item subsets, and will calculate their support counts from the
dataset.

o Now we will create the L3 table. As can be seen from the C3 table, there
is only one itemset combination whose support count is equal to the
minimum support count. So, L3 has only one combination, i.e., {A, B, C}.

Step-4: Finding the association rules for the subsets:

To generate the association rules, we first create a new table with the possible
rules from the frequent combination {A, B, C}. For each rule, we calculate the
confidence using the formula sup(A ∪ B)/sup(A). After calculating the confidence
value for all rules, we exclude the rules that have less confidence than the minimum
threshold (50%). Consider the table below:

Rules      Support  Confidence
A^B → C    2        sup{(A^B)^C}/sup(A^B) = 2/4 = 0.5 = 50%
B^C → A    2        sup{(B^C)^A}/sup(B^C) = 2/4 = 0.5 = 50%
A^C → B    2        sup{(A^C)^B}/sup(A^C) = 2/4 = 0.5 = 50%
C → A^B    2        sup{C^(A^B)}/sup(C) = 2/5 = 0.4 = 40%
A → B^C    2        sup{A^(B^C)}/sup(A) = 2/6 = 0.33 = 33.33%
B → A^C    2        sup{B^(A^C)}/sup(B) = 2/7 = 0.28 = 28%

As the given threshold or minimum confidence is 50%, the first three rules, A^B → C,
B^C → A, and A^C → B, can be considered strong association rules for the given
problem.

Advantages of Apriori Algorithm

o This is an easy-to-understand algorithm.

o The join and prune steps of the algorithm can be easily implemented on large
datasets.

Disadvantages of Apriori Algorithm

o The Apriori algorithm is slow compared to other algorithms.

o The overall performance can be reduced because it scans the database multiple
times.
o The time complexity and space complexity of the Apriori algorithm are O(2^D),
which is very high. Here D represents the horizontal width (the number of distinct
items) in the database.

Generating association rules from frequent itemsets.

1. Once the frequent itemsets from the transactions in a database D have been found, it
is straightforward to generate strong association rules from them (where strong
association rules satisfy both minimum support and minimum confidence).

2. This can be done using equation (4.6.1) for confidence, which is shown here for
completeness:

Confidence(A ⇒ B) = P(B | A) = support_count(A ∪ B) / support_count(A)

3. The conditional probability is expressed in terms of itemset support counts, where
support_count(A ∪ B) is the number of transactions containing the itemset A ∪ B,
and support_count(A) is the number of transactions containing the itemset A.

4. Based on equation (4.6.1), association rules can be generated as follows:

a. For each frequent itemset l, generate all non-empty proper subsets of l.
b. For every non-empty proper subset s of l, output the rule s ⇒ (l – s) if
support_count(l) / support_count(s) ≥ min_conf, where min_conf is the minimum
confidence threshold.
5. Because the rules are generated from frequent itemsets, each one automatically
satisfies the minimum support.

6. Frequent itemsets can be stored ahead of time in hash tables along with their counts
so that they can be accessed quickly.
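The procedure in step 4 translates almost directly into code. The sketch below is a minimal illustration (names such as gen_rules are hypothetical), assuming the frequent itemsets and their support counts have already been stored in a dictionary, as step 6 suggests.

```python
from itertools import combinations

def gen_rules(freq_counts, min_conf):
    """freq_counts: dict mapping frozenset(itemset) -> support count."""
    rules = []
    for l, count_l in freq_counts.items():
        if len(l) < 2:
            continue
        # All non-empty proper subsets s of l
        for k in range(1, len(l)):
            for s in map(frozenset, combinations(l, k)):
                conf = count_l / freq_counts[s]   # support_count(l) / support_count(s)
                if conf >= min_conf:
                    rules.append((set(s), set(l - s), conf))
    return rules

# Example with the {A, B, C} itemset and support counts from the worked example above
freq_counts = {
    frozenset("A"): 6, frozenset("B"): 7, frozenset("C"): 5,
    frozenset("AB"): 4, frozenset("AC"): 4, frozenset("BC"): 4,
    frozenset("ABC"): 2,
}
for lhs, rhs, conf in gen_rules(freq_counts, min_conf=0.5):
    print(lhs, "=>", rhs, f"confidence={conf:.2f}")
```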

Applications of frequent itemset analysis :

Related concepts:
1. Let items be words, and let baskets be documents (e.g., Web pages, blogs, tweets).
2. A basket/document contains those items/words that are present in the document.
3. If we look for sets of words that appear together in many documents, the sets will
be dominated by the most common words (stop words).
4. If the documents contain many stop words such as "and" and "a", then these words
will dominate the frequent itemsets.
5. However, if we ignore all the most common words, then we would hope to find
among the frequent pairs some pairs of words that represent a joint concept.

Plagiarism:
1. Let the items be documents and the baskets be sentences.
2. An item is in a basket if the sentence is in the document.
3. This arrangement appears backwards, but we should remember that the
relationship between items and baskets is an arbitrary many-many relationship.
4. In this application, we look for pairs of items that appear together in several baskets.
5. If we find such a pair, then we have two documents that share several sentences in
common.

Biomarkers :
1. Let the items be of two types: biomarkers, such as genes or blood proteins, and
diseases.
2. Each basket is the set of data about a patient: their genome and blood-chemistry
analysis, as well as their medical history of disease.
3. A frequent itemset that consists of one disease and one or more biomarkers suggests
a test for the disease.

Handling Large Data Sets in Main Memory

A large volume of data poses new challenges, such as overloaded memory and
algorithms that never stop running. It forces you to adapt and expand your repertoire
of techniques. But even when you can perform your analysis, you should take care of
issues such as I/O (input/output) and CPU starvation, because these can cause speed
issues.
General techniques for handling large volumes of data
Never-ending algorithms, out-of-memory errors, and speed issues are the most
common challenges you face when working with large data. In this section, we’ll
investigate solutions to overcome or alleviate these problems.
The solutions can be divided into three categories: using the correct algorithms,
choosing the right data structure, and using the right tools.

Choosing the right algorithm


Choosing the right algorithm can solve more problems than adding more or better
hardware. An algorithm that's well suited for handling large data doesn't need to load
the entire data set into memory to make predictions. Ideally, the algorithm also
supports parallelized calculations. In this section we'll dig into three types of
algorithms that can do that: online algorithms, block algorithms, and MapReduce
algorithms.

1. The triangular-matrix method :


a. Even after coding items as integers, we still have the problem that we must count a
pair {i, j} in only one place.
b. For example, we could order the pair so that i < j, and only use the entry a[i, j] in a
two-dimensional array a. That strategy would leave half the array unused.
c. A more space-efficient way is to use a one-dimensional triangular array.
d. We store in a[k] the count for the pair {i, j}, with 1 ≤ i < j ≤ n, where
k = (i − 1)(n − i/2) + j − i.
e. The result of this layout is that the pairs are stored in lexicographic order, that is,
first {1, 2}, {1, 3}, . . ., {1, n}, then {2, 3}, {2, 4}, . . . , {2, n}, and so on, down to
{n − 2, n − 1}, {n − 2, n}, and finally {n − 1, n}.
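A small sketch of this indexing scheme is shown below (plain Python, illustrative names). It stores the pair counts in a flat list and uses the formula k = (i − 1)(n − i/2) + j − i, rewritten with integer arithmetic.

```python
n = 5                                  # number of distinct items, coded 1..n
counts = [0] * (n * (n - 1) // 2)      # one slot per unordered pair {i, j}, i < j

def pair_index(i, j):
    """Map the pair {i, j}, 1 <= i < j <= n, to a 0-based slot in the flat array."""
    # k = (i - 1)(n - i/2) + j - i, computed with integers: (i - 1)(2n - i) is always even
    k = (i - 1) * (2 * n - i) // 2 + (j - i)
    return k - 1                       # the formula is 1-based, the list is 0-based

def count_pair(i, j):
    if i > j:
        i, j = j, i
    counts[pair_index(i, j)] += 1

count_pair(2, 4)                       # count the pair {2, 4}
print(counts)
```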

2. The triples method :


a. This is a more appropriate approach when only a small fraction of the possible
pairs of items actually appears in some basket.
b. We can store counts as triples [i, j, c], meaning that the count of the pair {i, j}, with
i < j, is c. A data structure such as a hash table, with i and j as the search key, is used
so we can tell whether there is a triple for a given i and j and, if so, find it quickly.
c. We call this approach the triples method of storing counts.
d. The triples method does not require us to store anything if the count for a pair is 0.
e. On the other hand, the triples method requires us to store three integers, rather than
one, for every pair that does appear in some basket.
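In Python, a natural realization of the triples method is a dictionary keyed by the ordered pair, since the hash table only stores pairs whose count is nonzero. A minimal sketch (illustrative names):

```python
from collections import defaultdict
from itertools import combinations

pair_counts = defaultdict(int)         # hash table: (i, j) with i < j  ->  count c

def count_basket(basket):
    """Add 1 to the count of every pair of items in the basket."""
    for i, j in combinations(sorted(basket), 2):
        pair_counts[(i, j)] += 1       # the triple [i, j, c] is the dict entry

count_basket([3, 1, 4])
count_basket([1, 4, 2])
print(dict(pair_counts))               # pairs with count 0 take no space at all
```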

3. PCY algorithm for handling large dataset in main memory.

1. In the first pass of the Apriori algorithm, there may be much unused space in main
memory.
2. The PCY algorithm uses the unused space for an array of integers that generalizes
the idea of a Bloom filter. The idea is shown schematically in the figure below.
[Figure: PCY memory organization. Pass 1 — a table translating item names to integers, the item counts, and a hash table holding bucket counts for pairs. Pass 2 — the frequent items, a bitmap summarizing the frequent buckets, and a data structure for the counts of candidate pairs.]
3. The array is treated as a hash table, whose buckets hold integers rather than sets of
keys or bits. Pairs of items are hashed to buckets of this hash table. As we examine a
basket during the first pass, we not only add 1 to the count for each item in the basket,
but we also generate all the pairs, using a double loop.
4. We hash each pair, and we add 1 to the bucket into which that pair hashes.
5. At the end of the first pass, each bucket has a count, which is the sum of the counts
of all the pairs that hash to that bucket.
6. If the count of a bucket is at least as great as the support threshold s, it is called a
frequent bucket. We can say nothing about the pairs that hash to a frequent bucket;
they could all be frequent pairs, as far as the available information tells us.
7. But if the count of the bucket is less than s (an infrequent bucket), we know that no
pair that hashes to this bucket can be frequent, even if the pair consists of two frequent
items.
8. We can therefore define the set of candidate pairs C2 to be those pairs {i, j} such that:
a. i and j are frequent items, and
b. {i, j} hashes to a frequent bucket.
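The first PCY pass can be sketched as below (illustrative names, plain Python). It counts single items and hashes every pair into a fixed number of buckets; the resulting bitmap of frequent buckets is what the second pass uses to restrict the candidate pairs.

```python
from collections import defaultdict
from itertools import combinations

def pcy_pass1(baskets, num_buckets, s):
    """Return item counts and the bitmap of frequent buckets."""
    item_counts = defaultdict(int)
    bucket_counts = [0] * num_buckets
    for basket in baskets:
        for item in basket:
            item_counts[item] += 1
        for pair in combinations(sorted(basket), 2):   # double loop over the basket
            bucket_counts[hash(pair) % num_buckets] += 1
    bitmap = [count >= s for count in bucket_counts]   # frequent buckets -> one bit each
    return item_counts, bitmap

def is_candidate(pair, item_counts, bitmap, s):
    """Candidate pair: both items frequent AND the pair hashes to a frequent bucket."""
    i, j = sorted(pair)
    return (item_counts[i] >= s and item_counts[j] >= s
            and bitmap[hash((i, j)) % len(bitmap)])

baskets = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c"]]
items, bitmap = pcy_pass1(baskets, num_buckets=11, s=3)
print([p for p in combinations(sorted(items), 2) if is_candidate(p, items, bitmap, s=3)])
```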
Limited-Pass Algorithms
The algorithms for frequent itemsets discussed so far use one pass for each size of
itemset we investigate. If main memory is too small to hold the data and the space
needed to count frequent itemsets of one size, there does not seem to be any way to
avoid k passes to compute the exact collection of frequent itemsets. However, there
are many applications where it is not essential to discover every frequent itemset. For
instance, if we are looking for items purchased together at a supermarket, we are not
going to run a sale based on every frequent itemset we find, so it is quite sufficient to
find most but not all of the frequent itemsets. In this section we explore some
algorithms that have been proposed to find all or most frequent itemsets using at most
two passes. We begin with the obvious approach of using a sample of the data rather
than the entire dataset. An algorithm called SON uses two passes, gets the exact
answer, and lends itself to implementation by map-reduce or another parallel
computing regime. Finally, Toivonen’s Algorithm uses two passes on average, gets an
exact answer, but may, rarely, not terminate in any given amount of time.

Simple and randomized algorithm :


1. In the simple, randomized algorithm, instead of using the entire file of baskets, we
pick a random subset of the baskets and pretend it is the entire dataset.
2. We must adjust the support threshold to reflect the smaller number of baskets.
3. For instance, if the support threshold for the full dataset is s, and we choose a
sample of 1% of the baskets, then we should examine the sample for itemsets that
appear in at least s/100 of the baskets.
4. The best way to pick the sample is to read the entire dataset, and for each basket,
select that basket for the sample with some fixed probability p.
5. Suppose there are m baskets in the entire file. At the end, we shall have a sample
whose size is very close to pm baskets.
6. However, if the baskets already appear in random order in the file, then we do not
even have to read the entire file.
7. We can select the first pm baskets for our sample. Or, if the file is part of a
distributed file system, we can pick some chunks at random to serve as the sample.
8. Having selected our sample of the baskets, we use part of main memory to store
these baskets.
9. The remaining main memory is used to execute one of the algorithms such as
A-Priori or PCY. However, the algorithm must run passes over the main-memory
sample for each itemset size, until we find a size with no frequent itemsets.

SON algorithm to find all or most frequent itemsets using at most two passes.

1. The idea is to divide the input file into chunks.


2. Treat each chunk as a sample, and run the simple, randomized algorithm on that
chunk.
3. We use ps as the threshold, where each chunk is a fraction p of the whole file and s
is the support threshold.
4. Store on disk all the frequent itemsets found for each chunk.
5. Once all the chunks have been processed in that way, take the union of all the
itemsets that have been found frequent for one or more chunks. These are the
candidate itemsets.
6. If an itemset is not frequent in any chunk, then its support is less than ps in each
chunk. Since the number of chunks is 1/p, we conclude that the total support for that
itemset is less than (1/p)ps = s.
7. Thus, every itemset that is frequent in the whole is frequent in at least one chunk,
and we can be sure that all the truly frequent itemsets are among the candidates; i.e.,
there are no false negatives. We have made a total of one pass through the data as we
read each chunk and processed it.
8. In a second pass, we count all the candidate itemsets and select those that have
support at least s as the frequent itemsets.
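A compact, single-machine sketch of the two SON passes is given below (illustrative names; mine_frequent is only a brute-force stand-in for any in-memory algorithm such as A-Priori or PCY).

```python
from itertools import combinations
from collections import defaultdict

def mine_frequent(baskets, threshold, max_size=3):
    """Stand-in for an in-memory miner: brute-force frequent itemsets in one chunk."""
    counts = defaultdict(int)
    for basket in baskets:
        for k in range(1, max_size + 1):
            for itemset in combinations(sorted(set(basket)), k):
                counts[itemset] += 1
    return {itemset for itemset, c in counts.items() if c >= threshold}

def son(chunks, p, s):
    # Pass 1: mine each chunk with the lowered threshold ps; the union gives the candidates
    candidates = set()
    for chunk in chunks:
        candidates |= mine_frequent(chunk, threshold=p * s)
    # Pass 2: count every candidate over all the data and keep those with support >= s
    counts = defaultdict(int)
    for chunk in chunks:
        for basket in chunk:
            items = set(basket)
            for cand in candidates:
                if set(cand) <= items:
                    counts[cand] += 1
    return {cand for cand, c in counts.items() if c >= s}
```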

SON algorithm using MapReduce.


1. The SON algorithm works well in a parallel-computing environment.
2. Each of the chunks can be processed in parallel, and the frequent itemsets from
each chunk combined to form the candidates.
3. We can distribute the candidates to many processors, have each processor count the
support for each candidate in a subset of the baskets, and finally sum those supports to
get the support for each candidate itemset in the whole dataset.
4. There is a natural way of expressing each of the two passes as a MapReduce
operation.

MapReduce-MapReduce sequence :

First Map function :


a. Take the assigned subset of the baskets and find the itemsets frequent in the subset
using the simple and randomized algorithm.
b. Lower the support threshold from s to ps if each Map task gets fraction p of the
total input file.
c. The output is a set of key-value pairs (F, 1), where F is a frequent itemset from the
sample.

First Reduce Function :


a. Each Reduce task is assigned a set of keys, which are itemsets.
b. The value is ignored, and the Reduce task simply produces those keys (itemsets)
that appear one or more times. Thus, the output of the first Reduce function is the set
of candidate itemsets.

Second Map function :


a. The Map tasks for the second Map function take all the output from the first
Reduce Function (the candidate itemsets) and a portion of the input data file.
b. Each Map task counts the number of occurrences of each of the candidate itemsets
among the baskets in the portion of the dataset that it was assigned.
c. The output is a set of key-value pairs (C, v), where C is one of the candidate sets
and v is the support for that itemset among the baskets that were input to this Map
task.

Second Reduce function :


a. The Reduce tasks take the itemsets they are given as keys and sum the associated
values.
b. The result is the total support for each of the itemsets that the Reduce task was
assigned to handle.
c. Those itemsets whose sum of values is at least s are frequent in the whole dataset,
so the Reduce task outputs these itemsets with their counts.
d. Itemsets that do not have total support at least s are not transmitted to the output of
the Reduce task.
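The four functions can be sketched as ordinary Python generators emitting key-value pairs, in the spirit of a MapReduce framework (illustrative names; mine_frequent is the same stand-in for an in-memory miner used in the SON sketch above):

```python
from collections import defaultdict

def first_map(basket_chunk, p, s):
    # Find itemsets frequent in this chunk at the lowered threshold ps
    for itemset in mine_frequent(basket_chunk, threshold=p * s):
        yield (itemset, 1)                   # key-value pair (F, 1)

def first_reduce(itemset, values):
    yield itemset                            # values ignored: every key is a candidate

def second_map(basket_chunk, candidates):
    counts = defaultdict(int)
    for basket in basket_chunk:
        items = set(basket)
        for cand in candidates:
            if set(cand) <= items:
                counts[cand] += 1
    for cand, v in counts.items():
        yield (cand, v)                      # key-value pair (C, v)

def second_reduce(itemset, values, s):
    total = sum(values)                      # total support across all chunks
    if total >= s:
        yield (itemset, total)               # frequent in the whole dataset
```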

What are clustering algorithms?


Clustering is an unsupervised machine learning task. You might also hear it referred
to as cluster analysis because of the way this method works. Using a clustering
algorithm means you're going to give the algorithm a lot of input data with no labels
and let it find any groupings in the data it can. Those groupings are called clusters. A
cluster is a group of data points that are similar to each other based on their relation to
surrounding data points. Clustering is used for things like feature engineering or
pattern discovery.

Types of clustering algorithms


There are different types of clustering algorithms that handle all kinds of unique data.

Density-based
In density-based clustering, data is grouped by areas of high concentration of data
points surrounded by areas of low concentration. Basically, the algorithm finds the
places that are dense with data points and calls those clusters. The great thing about
this is that the clusters can be any shape; you aren't constrained to expected shapes.
The clustering algorithms of this type don't try to assign outliers to clusters, so
outliers get ignored.

Distribution-based
With a distribution-based clustering approach, all of the data points are considered
parts of a cluster based on the probability that they belong to a given cluster. It works
like this: there is a center point, and as the distance of a data point from the center
increases, the probability of it being part of that cluster decreases. If you aren't sure
how the distribution in your data might look, you should consider a different type of
algorithm.

Centroid-based
Centroid-based clustering is the one you probably hear about most. It's a little
sensitive to the initial parameters you give it, but it's fast and efficient. These types of
algorithms separate data points based on multiple centroids in the data. Each data
point is assigned to a cluster based on its squared distance from the centroid. This is
the most commonly used type of clustering.

Hierarchical-based
Hierarchical-based clustering is typically used on hierarchical data, like you would get
from a company database or taxonomies. It builds a tree of clusters so everything is
organized from the top down. This is more restrictive than the other clustering types,
but it's perfect for specific kinds of data sets.
Hierarchical Clustering in Machine Learning

Hierarchical clustering is another unsupervised machine learning algorithm, which is
used to group unlabeled datasets into clusters; it is also known as hierarchical cluster
analysis or HCA.

In this algorithm, we develop the hierarchy of clusters in the form of a tree, and this
tree-shaped structure is known as a dendrogram.

Sometimes the results of K-means clustering and hierarchical clustering may look
similar, but they differ in how they work: in hierarchical clustering there is no
requirement to predetermine the number of clusters, as there is in the K-means
algorithm.

The hierarchical clustering technique has two approaches:

1. Agglomerative: Agglomerative is a bottom-up approach, in which the


algorithm starts with taking all data points as single clusters and merging them
until one cluster is left.
2. Divisive: Divisive algorithm is the reverse of the agglomerative algorithm as it
is a top-down approach.

Why hierarchical clustering?

As we already have other clustering algorithms such as K-means clustering, why do
we need hierarchical clustering? As we have seen, the K-means algorithm has some
challenges: it needs a predetermined number of clusters, and it always tries to create
clusters of the same size. To solve these two challenges, we can opt for the
hierarchical clustering algorithm, because in this algorithm we don't need to know the
predefined number of clusters.

In this topic, we will discuss the Agglomerative Hierarchical clustering algorithm.

Agglomerative Hierarchical clustering

The agglomerative hierarchical clustering algorithm is a popular example of HCA. To
group the data points into clusters, it follows the bottom-up approach. This means the
algorithm considers each data point as a single cluster at the beginning, and then starts
combining the closest pairs of clusters together. It does this until all the clusters are
merged into a single cluster that contains all the data points.

This hierarchy of clusters is represented in the form of the dendrogram.

How does the Agglomerative Hierarchical Clustering work?

The working of the AHC algorithm can be explained using the steps below:

o Step-1: Treat each data point as a single cluster. Let's say there are N data
points, so the number of clusters will also be N.

o Step-2: Take the two closest data points or clusters and merge them to form one
cluster. So, there will now be N-1 clusters.

o Step-3: Again, take the two closest clusters and merge them together to form one
cluster. There will be N-2 clusters.

o Step-4: Repeat Step 3 until only one cluster is left.

o Step-5: Once all the clusters are combined into one big cluster, develop the
dendrogram and cut it to divide the clusters as per the problem.

Measure for the distance between two clusters

As we have seen, the distance between the two closest clusters is crucial for
hierarchical clustering. There are various ways to calculate the distance between two
clusters, and these ways determine the rule for clustering. These measures are called
linkage methods. Some of the popular linkage methods are given below:

1. Single Linkage: It is the shortest distance between the closest points of the two
clusters.

2. Complete Linkage: It is the farthest distance between two points of two different
clusters. It is one of the popular linkage methods, as it forms tighter clusters than
single linkage.

3. Average Linkage: It is the linkage method in which the distance between each pair
of points (one from each cluster) is added up and then divided by the total number of
such pairs to calculate the average distance between two clusters. It is also one of the
most popular linkage methods.
4. Centroid Linkage: It is the linkage method in which the distance between the
centroids of the two clusters is calculated.

From the above-given approaches, we can apply any of them according to the type of
problem or business requirement.

Working of the Dendrogram in Hierarchical Clustering

The dendrogram is a tree-like structure that is mainly used to record each step that the
HC algorithm performs. In the dendrogram plot, the Y-axis shows the Euclidean
distances between the data points, and the X-axis shows all the data points of the
given dataset.

The working of the dendrogram can be explained using the below diagram:

In the above diagram, the left part shows how clusters are created in agglomerative
clustering, and the right part shows the corresponding dendrogram.

o As we have discussed above, first the data points P2 and P3 combine together
and form a cluster; correspondingly, a dendrogram is created, which connects P2
and P3 with a rectangular shape. The height is decided according to the Euclidean
distance between the data points.
o In the next step, P5 and P6 form a cluster, and the corresponding dendrogram is
created. It is higher than the previous one, as the Euclidean distance between P5
and P6 is a little greater than that between P2 and P3.
o Again, two new dendrograms are created that combine P1, P2, and P3 in one
dendrogram, and P4, P5, and P6 in another dendrogram.
o At last, the final dendrogram is created that combines all the data points
together.

We can cut the dendrogram tree structure at any level as per our requirement.
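As a concrete illustration, agglomerative clustering and its dendrogram can be produced with SciPy; the sketch below assumes SciPy and Matplotlib are installed, and the sample points are made up for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

# Six made-up 2-D points, loosely playing the role of P1..P6
X = np.array([[1.0, 1.0], [1.2, 1.1], [1.1, 1.3],
              [5.0, 5.0], [5.2, 5.1], [5.1, 5.4]])

# Build the merge tree; 'single', 'complete', 'average' and 'centroid'
# correspond to the linkage methods described above
Z = linkage(X, method="average")

# Cut the tree to obtain a flat clustering (here: 2 clusters)
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)

# Plot the dendrogram: x-axis = data points, y-axis = merge distances
dendrogram(Z)
plt.show()
```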

K-Means (centroid-based partitioning technique) Clustering Algorithm

K-Means Clustering is an unsupervised learning algorithm that is used to solve the


clustering problems in machine learning or data science. In this topic, we will learn
what the K-means clustering algorithm is and how the algorithm works, along with a
Python implementation of k-means clustering.

What is K-Means Algorithm?

K-Means Clustering is an Unsupervised Learning algorithm, which groups the


unlabeled dataset into different clusters. Here K defines the number of pre-defined
clusters that need to be created in the process; for example, if K=2, there will be two
clusters, and for K=3, there will be three clusters, and so on.

It is an iterative algorithm that divides the unlabeled dataset into k different clusters in
such a way that each data point belongs to only one group, and the points in a group
have similar properties.

It allows us to cluster the data into different groups and is a convenient way to
discover the categories of groups in the unlabeled dataset on its own, without the need
for any training.

It is a centroid-based algorithm, where each cluster is associated with a centroid. The
main aim of this algorithm is to minimize the sum of distances between the data points
and their corresponding cluster centroids. The algorithm takes the unlabeled dataset as
input, divides the dataset into k clusters, and repeats the process until it finds the best
clusters. The value of k should be predetermined in this algorithm.

The k-means clustering algorithm mainly performs two tasks:


o Determines the best values for the K center points or centroids by an iterative
process.
o Assigns each data point to its closest k-center. The data points that are near a
particular k-center create a cluster.

Hence each cluster has data points with some commonalities, and it is away from
other clusters.

The below diagram explains the working of the K-means Clustering Algorithm:
How does the K-Means Algorithm Work?

The working of the K-Means algorithm is explained in the below steps:

Step-1: Select the number K to decide the number of clusters.

Step-2: Select K random points or centroids. (They can be points other than those
from the input dataset.)

Step-3: Assign each data point to its closest centroid, which will form the predefined
K clusters.

Step-4: Calculate the variance and place a new centroid for each cluster.

Step-5: Repeat the third step, which means reassigning each data point to the new
closest centroid.

Step-6: If any reassignment occurred, then go to Step-4; else go to FINISH.

Step-7: The model is ready.
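The same loop (assign, recompute centroids, repeat until assignments stop changing) is implemented in scikit-learn's KMeans; the short sketch below assumes scikit-learn is installed and uses made-up data with K=2.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up 2-D data with two obvious groups
X = np.array([[1, 2], [1, 4], [0, 2],
              [10, 2], [10, 4], [11, 0]])

# n_clusters is the K from Step-1; n_init restarts the Step-2 random initialization
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.labels_)                      # cluster assignment of each data point (Steps 3-6)
print(kmeans.cluster_centers_)             # final centroids
print(kmeans.predict([[0, 0], [12, 3]]))   # assign new points to the nearest centroid
```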

Clustering High-Dimensional Data:

Clustering of high-dimensional data returns groups of objects, which are the clusters.
Cluster analysis of high-dimensional data requires grouping similar types of objects
together, but the high-dimensional data space is huge and it has complex data types
and attributes. A major challenge is that we need to find out the set of attributes that
are relevant to each cluster, since a cluster is defined and characterized based on the
attributes present in it. In clustering high-dimensional data we therefore need to search
both for the clusters and for the subspaces in which they exist.
High-dimensional data is often reduced to low-dimensional data to make the clustering
and the search for clusters simpler. Some applications need appropriate models of
clusters, especially for high-dimensional data. Clusters in high-dimensional data are
often significantly small, and the conventional distance measures can be ineffective.
Instead, to find the hidden clusters in high-dimensional data, we need to apply
sophisticated techniques that can model correlations among the objects in subspaces.
Subspace Clustering Methods: There are 3 Subspace Clustering Methods:

 Subspace search methods


 Correlation-based clustering methods
 Bi-clustering methods

Subspace clustering approaches search for clusters existing in subspaces of the given
high-dimensional data space, where a subspace is defined by a subset of attributes of
the full space.
1. Subspace Search Methods: A subspace search method searches the subspaces for
clusters. Here, a cluster is a group of similar objects in a subspace. The similarity
between objects in a cluster is measured using distance or density features. The
CLIQUE algorithm is a subspace clustering method. Subspace search methods search
a series of subspaces. There are two approaches in subspace search methods: the
bottom-up approach starts searching from the low-dimensional subspaces; if the
hidden clusters are not found in low-dimensional subspaces, it then searches in
higher-dimensional subspaces. The top-down approach starts searching from the
high-dimensional subspaces and then searches in subsets of low-dimensional
subspaces. Top-down approaches are effective if the subspace of a cluster can be
determined by the local neighborhood of the cluster.
2. Correlation-Based Clustering: Correlation-based approaches discover the hidden
clusters by developing advanced correlation models. Correlation-based models are
preferred if it is not possible to cluster the objects using the subspace search methods.
Correlation-based clustering includes advanced mining techniques for correlation
cluster analysis. Biclustering methods are correlation-based clustering methods in
which both the objects and the attributes are clustered.
3. Bi-clustering Methods:
Bi-clustering means clustering the data based on two factors: in some applications we
cluster both objects and attributes at the same time. The resultant clusters are called
biclusters. Bi-clustering has four requirements:
 Only a small set of objects participates in a cluster.
 A cluster only involves a small number of attributes.
 A data object can take part in multiple clusters, or may not be included in any
cluster at all.
 An attribute may be involved in multiple clusters.
Objects and attributes are not treated symmetrically: objects are clustered according
to their attribute values, and objects and attributes are treated as different dimensions
in biclustering analysis.

CLIQUE
CLIQUE is a density-based and grid-based subspace clustering algorithm. So let’s
first take a look at what grid-based and density-based clustering techniques are.
 Grid-Based Clustering Technique: In grid-based methods, the instance space is
divided into a grid structure. Clustering techniques are then applied using the cells
of the grid, instead of individual data points, as the base units.
 Density-Based Clustering Technique: In density-based methods, a cluster is a
maximal set of connected dense units in a subspace.

CLIQUE Algorithm:

The CLIQUE algorithm uses a density- and grid-based technique, i.e., it is a subspace
clustering algorithm that finds the clusters by taking a density threshold and a number
of grid intervals as input parameters. It is specially designed to handle datasets with a
large number of dimensions. The CLIQUE algorithm is very scalable with respect to
the number of records and the number of dimensions in the dataset, because it is
grid-based and uses the Apriori property effectively.
The Apriori property states that if an X-dimensional unit is dense, then all of its
projections in (X-1)-dimensional space are also dense.
This means that dense regions in a given subspace must produce dense regions when
projected onto a lower-dimensional subspace. Because CLIQUE uses the Apriori
property, it restricts its search for higher-dimensional dense cells to the intersections
of dense cells in the lower-dimensional subspaces.

Working of the CLIQUE Algorithm:

The CLIQUE algorithm first divides the data space into grids, by dividing each
dimension into equal intervals called units. After that, it identifies dense units: a unit
is dense if the fraction of data points it contains exceeds the density threshold.
Once the algorithm finds dense cells along one dimension, it tries to find dense cells
along two dimensions, and it continues until dense cells along all the dimensions are
found.
After finding all dense cells in all dimensions, the algorithm proceeds to find the
largest sets ("clusters") of connected dense cells. Finally, the CLIQUE algorithm
generates a minimal description of each cluster. Clusters are thus generated from all
dense subspaces using the Apriori approach.
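A toy sketch of the first step (finding dense units on a grid) is shown below; this is only an illustration of the idea with made-up parameters, not the full CLIQUE algorithm with its bottom-up subspace combination and minimal cluster descriptions.

```python
import numpy as np
from collections import Counter
from itertools import combinations

def dense_units(X, n_intervals=4, density_threshold=0.2):
    """Find dense grid cells in every 1-D and 2-D subspace: a cell is dense when it
    holds more than density_threshold of all points (toy version of CLIQUE's first phase)."""
    n, d = X.shape
    # Split every dimension into equal intervals and assign each point to a cell index
    interior_edges = [np.linspace(X[:, j].min(), X[:, j].max(), n_intervals + 1)[1:-1]
                      for j in range(d)]
    cell_ids = np.stack([np.digitize(X[:, j], interior_edges[j]) for j in range(d)], axis=1)
    dense = {}
    for dims in list(combinations(range(d), 1)) + list(combinations(range(d), 2)):
        counts = Counter(tuple(row) for row in cell_ids[:, dims])
        dense[dims] = {cell for cell, c in counts.items() if c / n > density_threshold}
        # Full CLIQUE would generate 2-D candidates only from intersections of dense 1-D units
    return dense

X = np.random.rand(200, 2)
print(dense_units(X))
```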

Advantage:

 CLIQUE is a subspace clustering algorithm that outperforms K-means,
DBSCAN, and Farthest First in both execution time and accuracy.
 CLIQUE can find clusters of any shape and is able to find any number of
clusters in any number of dimensions, where the number is not predetermined by
a parameter.
 It is one of the simplest methods, and its results are interpretable.

Disadvantage:

 The main disadvantage of the CLIQUE algorithm is that if the grid cell size is
unsuitable for the data, too much approximation takes place and the correct
clusters cannot be found.
PROCLUS (Projected Clustering):
Projected clustering (PROCLUS) is the first top-down partitioning projected clustering
algorithm; it is based on the notion of k-medoid clustering and was presented by
Aggarwal et al. (1999). It repeatedly determines medoids for each cluster on a sample
of data using a greedy hill-climbing technique and then iteratively improves the
results. Cluster quality in projected clustering is a function of the average distance
between data points and the closest medoid. The subspace dimensionality is also an
input parameter, which leads to clusters of similar sizes.

1. Projected clustering (PROCLUS) is a top-down subspace clustering algorithm.

2. PROCLUS samples the data, then selects a set of k medoids and iteratively
improves the clustering.
3. PROCLUS is actually faster than CLIQUE due to the sampling of large data sets.
4. The three phases of PROCLUS are as follows:
a. Initialization phase: Select a set of potential medoids that are far apart, using a
greedy algorithm.
b. Iteration phase:
i. Select a random set of k medoids from this reduced data set, and determine whether
clustering quality improves by replacing current medoids with randomly chosen new
medoids.
ii. Cluster quality is based on the average distance between instances and the nearest
medoid.
iii. For each medoid, a set of dimensions is chosen whose average distances are small
compared to statistical expectation.
iv. Once the subspaces have been selected for each medoid, average Manhattan
segmental distance is used to assign points to medoids, forming clusters.
c. Refinement phase:
i. Compute a new list of relevant dimensions for each medoid based on the clusters
formed, and reassign points to medoids, removing outliers.
ii. The distance-based approach of PROCLUS is biased toward clusters that are
hyper-spherical in shape.

Clustering for Streams and Parallelism:- Data stream clustering refers to the
clustering of data that arrives continually, such as financial transactions, multimedia
data, or telephone records. It is usually studied as a streaming algorithm. The purpose
of data stream clustering is to construct a good clustering of the stream using a small
amount of time and memory.
Technically, clustering is the act of grouping elements into sets. The main purpose of
this type of separation is to unite items that are similar to each other, by comparing a
series of their characteristics. When we talk about data stream clustering, we can
separate its methods into five categories, namely partitioning, hierarchical,
density-based, grid-based, and model-based methods.
There is one more factor to take into account when talking about clusters: it is possible
to divide the possible inter-cluster distances into four kinds, namely the minimum (or
single linkage), maximum (or complete linkage), mean distance, and average distance.
Each one has its own characteristics regarding implementation cost and computational
power; the minimum distance and the mean distance are the more common ones to use
in data stream clustering.
Basic subspace clustering approaches are :
1. Grid-based subspace clustering:
a. In this approach, the data space is divided into axis-parallel cells. Then the cells
containing more objects than a predefined threshold value, given as a parameter, are
merged to form subspace clusters. The number of intervals is another input parameter,
which defines the range of values in each grid cell.
b. The Apriori property is used to prune non-promising cells and to improve efficiency.
c. If a unit is found to be dense in k − 1 dimensions, then it is considered for finding
dense units in k dimensions.
d. If grid boundaries are strictly followed to separate objects, the accuracy of the
clustering result is decreased, as it may miss neighbouring objects that get separated
by a strict grid boundary. Clustering quality is highly dependent on the input
parameters.

2. Window-based subspace clustering :


a. Window-based subspace clustering overcomes the drawback of cell-based subspace
clustering that it may omit significant results.
b. Here a window slides across the attribute values and obtains overlapping intervals
to be used to form subspace clusters.
c. The size of the sliding window is one of the parameters. These algorithms generate
axis-parallel subspace clusters.

3. Density- based subspace clustering :


a. Density-based subspace clustering overcomes the drawbacks of grid-based subspace
clustering algorithms by not using grids.
b. A cluster is defined as a collection of objects forming a chain, which fall within a
given distance of each other and exceed a predefined threshold of object count.
Adjacent dense regions are then merged to form bigger clusters.
c. As no grids are used, these algorithms can find arbitrarily shaped subspace clusters.
d. Clusters are built by joining together the objects from adjacent dense regions.
e. These approaches are sensitive to the values of the distance parameters.
f. The effect of the curse of dimensionality is overcome in density-based algorithms by
using a density measure that is adaptive to the subspace size.
