Data Analytics (KIT601) Unit 4 Notes
Apriori Algorithm
The Apriori algorithm uses frequent itemsets to generate association rules, and it is
designed to work on databases that contain transactions. With the help of these
association rules, it determines how strongly or how weakly two objects are connected.
The algorithm uses a breadth-first search and a hash tree to calculate the itemset
associations efficiently. It is an iterative process for finding the frequent itemsets in a
large dataset.
The algorithm was proposed by R. Agrawal and R. Srikant in 1994. It is mainly
used for market basket analysis and helps to find products that can be bought
together. It can also be used in the healthcare field to find drug reactions for patients.
Frequent itemsets are those itemsets whose support is greater than the threshold value
(the user-specified minimum support). This also means that if {A, B} is a frequent
itemset, then A and B individually must also be frequent itemsets.
For example, suppose there are two transactions A = {1, 2, 3, 4, 5} and B = {2, 3, 7};
the items 2 and 3 appear in both, so {2, 3} is a frequent itemset in these two transactions.
Note: To better understand the Apriori algorithm and related terms such as support and
confidence, it is recommended to first understand association rule learning.
Step-1: Determine the support of the itemsets in the transactional database, and select the
minimum support and confidence.
Step-2: Take all itemsets in the transactions with a support value higher than the minimum
(selected) support value.
Step-3: Find all the rules of these subsets that have a confidence value higher than the
threshold (minimum) confidence.
Example: Suppose we have the following dataset that has various transactions, and
from this dataset, we need to find the frequent itemsets and generate the association
rules using the Apriori algorithm:
Solution:
● In the first step, we create a table that contains the support count (the frequency
of each itemset in the dataset) of each itemset in the given dataset.
This table is called the candidate set C1.
● Next, we take out all the itemsets that have a support count greater than or equal to
the minimum support (2). This gives us the table for the frequent itemset L1.
All the itemsets except E have a support count greater than or equal to the minimum
support, so the itemset E is removed.
● In this step, we generate C2 with the help of L1. In C2, we create pairs of the
itemsets of L1 in the form of subsets.
● After creating the subsets, we again find the support count from the main
transaction table of the dataset, i.e., how many times these pairs occur
together in the given dataset. This gives us the table for C2.
● Again, we compare the C2 support counts with the minimum support
count; after comparing, the itemsets with a smaller support count are
eliminated from table C2. This gives us the table for L2.
● For C3, we repeat the same two steps, but now we form the C3 table
with subsets of three items together and calculate their support counts
from the dataset.
● Now we create the L3 table. As we can see from the C3 table, there is
only one combination of itemsets that has a support count equal to the minimum
support count. So L3 will have only one combination, i.e., {A, B, C}.
To generate the association rules, we first create a new table with the possible
rules from the occurring combination {A, B, C}. For each rule X → Y, we calculate the
confidence using the formula confidence(X → Y) = sup(X ∪ Y) / sup(X). After calculating
the confidence value for all rules, we exclude the rules that have a confidence lower than
the minimum threshold (50%).
As the given threshold (minimum confidence) is 50%, the first three rules A ∧ B →
C, B ∧ C → A, and A ∧ C → B can be considered strong association rules for the
given problem.
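To make the rule-generation step concrete, here is a small Python sketch that computes support and confidence for the rules derived from {A, B, C}. The transaction list is only illustrative, since the example's transaction table is not reproduced in these notes.

from itertools import combinations

# Illustrative transactions (not the example's actual table)
transactions = [
    {"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"},
    {"A", "B", "C"}, {"A", "B", "D"},
]

def support(itemset):
    # Number of transactions that contain every item of the itemset
    return sum(itemset <= t for t in transactions)

min_confidence = 0.5
frequent = {"A", "B", "C"}            # the L3 itemset found by Apriori
for r in range(1, len(frequent)):
    for antecedent in combinations(sorted(frequent), r):
        antecedent = set(antecedent)
        consequent = frequent - antecedent
        # confidence(X -> Y) = sup(X U Y) / sup(X)
        conf = support(frequent) / support(antecedent)
        if conf >= min_confidence:
            print(antecedent, "->", consequent, "confidence =", round(conf, 2))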
Advantages of the Apriori algorithm:
● The join and prune steps of the algorithm can be easily implemented on large
datasets.
Disadvantages of the Apriori algorithm:
● The overall performance can be reduced because it scans the database multiple
times.
● The time complexity and space complexity of the Apriori algorithm are O(2^D),
which is very high. Here D represents the horizontal width (the number of distinct
items) present in the database.
Improvements to A-Priori
PCY Algorithm (Park-Chen-Yu)
• Hash-based improvement to A-Priori.
Picture of PCY: in Pass 1, main memory holds the item counts plus a hash table with counts of pair buckets; in Pass 2, the hash table is replaced by a bitmap of frequent buckets, and memory holds counts only for the candidate pairs.
• In the best case, the count for a bucket is less than the support s.
– Then all pairs that hash to this bucket can be eliminated as
candidates, even if the pair consists of two frequent items.
Memory Details
• The hash table requires bucket counts of 2–4 bytes each.
– The number of buckets is thus roughly 1/4 to 1/2 of the
number of bytes of main memory.
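A minimal Python sketch of the two PCY passes described in this section; the hash function and the number of buckets are illustrative assumptions.

from collections import Counter
from itertools import combinations

def pcy_pass1(baskets, s, num_buckets=100003):
    # Pass 1: count items and hash each pair to a bucket counter
    item_counts = Counter()
    bucket_counts = [0] * num_buckets
    for basket in baskets:
        item_counts.update(basket)
        for pair in combinations(sorted(basket), 2):
            bucket_counts[hash(pair) % num_buckets] += 1   # assumed hash function
    frequent_items = {i for i, c in item_counts.items() if c >= s}
    # Between passes, the bucket counts are summarized as a bitmap (1 bit per bucket)
    bitmap = [c >= s for c in bucket_counts]
    return frequent_items, bitmap

def pcy_pass2(baskets, frequent_items, bitmap, s, num_buckets=100003):
    # Pass 2: count only pairs of frequent items that hash to a frequent bucket
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(i for i in basket if i in frequent_items)
        for pair in combinations(items, 2):
            if bitmap[hash(pair) % num_buckets]:
                pair_counts[pair] += 1
    return {p: c for p, c in pair_counts.items() if c >= s}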
Multistage Picture: Pass 1 is as in PCY; Pass 2 re-hashes the surviving pairs with a second, independent hash function, producing a second bitmap (Bitmap 1 from Pass 1, Bitmap 2 from Pass 2) that is used in Pass 3.
Multistage – Pass 3
• Count only those pairs {i, j } that satisfy:
1. Both i and j are frequent items.
2. Using the first hash function, the pair hashes to a
bucket whose bit in the first bit-vector is 1.
3. Using the second hash function, the pair hashes
to a bucket whose bit in the second bit-vector is
1.
Multihash
• Key idea: use several independent hash
tables on the first pass.
Multihash Picture: in Pass 1, main memory holds the item counts plus two hash tables (a first and a second hash table) filled with independent hash functions; in Pass 2, each table is replaced by its bitmap (Bitmap 1 and Bitmap 2), together with the counts of candidate pairs.
Extensions
• Either multistage or multihash can use more
than two hash functions.
Simple Algorithm
• Take a random sample of the market baskets.
• Run A-Priori (for sets of all sizes, not just pairs) in main memory, so you don’t pay
for disk I/O each time you increase the size of itemsets.
– Be sure you leave enough space for counts.
(Picture: main memory holds a copy of the sample baskets plus space for counts.)
Theorem:
• If there is an itemset that is frequent in the
whole, but not frequent in the sample,
• then there is a member of the negative border
for the sample that is frequent in the whole.
Proof:
• Suppose not; i.e., there is an itemset S
frequent in the whole but
– Not frequent in the sample, and
– Not present in the sample’s negative border.
• Let T be a smallest subset of S that is not
frequent in the sample.
• T is frequent in the whole (S is frequent,
monotonicity).
• T is in the negative border (else not
“smallest”).
Random Sampling
◆ Take a random sample of the market baskets that fits in main memory
◗ Leave enough space in memory for counts
◗ For sets of all sizes, not just pairs
◗ Don’t pay for disk I/O each time we increase the size of itemsets
◗ Reduce the support threshold proportionally to match the sample size
(Picture: main memory holds a copy of the sample baskets plus space for counts.)
Random Sampling:
Not an exact algorithm
◆ With a single pass, cannot guarantee:
◗ That algorithm will produce all itemsets that are
frequent in the whole dataset
• False negative: itemset that is frequent in the whole but
not in the sample
◗ That it will produce only itemsets that are
frequent in the whole dataset
• False positive: frequent in the sample but not in the
whole
SON Algorithm
◆ Avoids false negatives and false positives
◆ Requires two full passes over data
SON: Map/Reduce
Phase 1: Find candidate itemsets
◆ Map
◗ Input is a chunk/subset of all baskets; fraction p of total input file
◗ Find itemsets frequent in that subset (e.g., using random
sampling algorithm)
◗ Use support threshold ps
◗ Output is set of key-value pairs (F, 1) where F is a
frequent itemset from sample
◆ Reduce
◗ Each reduce task is assigned set of keys, which are itemsets
◗ Produces the keys that appear one or more times
◗ Frequent in some subset
◗ These are candidate itemsets
SON: Map/Reduce
Phase 2: Find true frequent itemsets
◆ Map
◗ Each Map task takes output from first Reduce task AND a
chunk of the total input data file
◗ All candidate itemsets go to every Map task
◗ Count occurrences of each candidate itemset among the baskets
in the input chunk
◗ Output is set of key-value pairs (C, v), where C is a
candidate frequent itemset and v is the support for that
itemset among the baskets in the input chunk
◆ Reduce
◗ Each reduce task is assigned a set of keys (itemsets)
◗ Sums associated values for each key: total support for itemset
◗ If support of itemset >= s, emit itemset and its count
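A compact Python sketch of the two SON phases on in-memory chunks; the chunking scheme, the itemset-size limit, and the scaled local threshold p·s are assumptions made for illustration.

from collections import Counter
from itertools import combinations

def local_frequent_itemsets(chunk, local_support, max_size=2):
    # Phase 1 Map: itemsets frequent within one chunk, using threshold p*s
    counts = Counter()
    for basket in chunk:
        for k in range(1, max_size + 1):
            counts.update(combinations(sorted(basket), k))
    return {itemset for itemset, c in counts.items() if c >= local_support}

def son(baskets, s, num_chunks=4):
    chunks = [baskets[i::num_chunks] for i in range(num_chunks)]
    p = 1.0 / num_chunks
    # Phase 1 Reduce: the union of locally frequent itemsets = candidate itemsets
    candidates = set()
    for chunk in chunks:
        candidates |= local_frequent_itemsets(chunk, local_support=p * s)
    # Phase 2: count every candidate over the full data, keep those with support >= s
    totals = Counter()
    for basket in baskets:
        b = set(basket)
        for cand in candidates:
            if set(cand) <= b:
                totals[cand] += 1
    return {c: n for c, n in totals.items() if n >= s}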
Toivonen’s Algorithm
Toivonen’s Algorithm
◆ Given sufficient main memory, uses one pass
over a small sample and one full pass over
data
◆ Gives no false positives or false negatives
◆ BUT, there is a small but finite probability
it will fail to produce an answer
◗ Will not identify frequent itemsets
◆ Then must be repeated with a different
sample until it gives an answer
◆ Need only a small number of iterations
Negative Border (picture): the frequent itemsets from the sample – singletons, doubletons, tripletons, … – are surrounded by the negative border: itemsets that are not frequent in the sample but all of whose immediate subsets are frequent in the sample.
Hashing
In PCY algorithm, when generating L1, the set of
frequent itemsets of size 1, the algorithm also:
• generates all possible pairs for each basket
• hashes them to buckets
• keeps a count for each hash bucket
• Identifies frequent buckets (count >= s)
Recall: Picture of PCY. In Pass 1, main memory holds the item counts and a hash table with bucket counts for pairs; in Pass 2, it holds the frequent items, the bitmap of frequent buckets, and the counts of candidate pairs.
Example
Consider a basket database in the first table below
All itemsets of size 1 determined to be frequent on previous pass
The second table below shows all possible 2-itemsets for each basket
Basket ID Items
100 Bread, Cheese, Eggs, Juice
200 Bread, Cheese, Juice
300 Bread, Milk, Yogurt
400 Bread, Juice, Milk
500 Cheese, Juice, Milk
Hashing Example
Support threshold s = 3.
The possible pairs:
100 (B, C) (B, E) (B, J) (C, E) (C, J) (E, J)
200 (B, C) (B, J) (C, J)
300 (B, M) (B, Y) (M, Y)
400 (B, J) (B, M) (J, M)
500 (C, J) (C, M) (J, M)
Mapping table: B = 1, C = 2, E = 3, J = 4, M = 5, Y = 6.
Each pair is hashed by concatenating the two item numbers and taking the result modulo 8,
e.g. (B, C) -> 12, 12 % 8 = 4; (B, E) -> 13, 13 % 8 = 5; (C, J) -> 24, 24 % 8 = 0.
Bucket number | Count | Pairs that hash to bucket | Bit map for frequent buckets
0 | 5 | (C, J) (B, Y) (M, Y) | 1
1 | 1 | (C, M) | 0
2 | 1 | (E, J) | 0
3 | 0 | - | 0
4 | 2 | (B, C) | 0
5 | 3 | (B, E) (J, M) | 1
6 | 3 | (B, J) | 1
7 | 3 | (C, E) (B, M) | 1
Bucket 5 is frequent. Are any of the pairs that hash to the bucket frequent?
Does Pass 1 of PCY know which pairs contributed to the bucket?
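The bucket counts in the table above can be reproduced with a short Python sketch, using the item numbering B = 1 … Y = 6 and the concatenate-then-mod-8 hash shown in the example.

from collections import Counter
from itertools import combinations

baskets = {
    100: ["B", "C", "E", "J"],
    200: ["B", "C", "J"],
    300: ["B", "M", "Y"],
    400: ["B", "J", "M"],
    500: ["C", "J", "M"],
}
item_no = {"B": 1, "C": 2, "E": 3, "J": 4, "M": 5, "Y": 6}
SUPPORT = 3

def bucket(pair):
    a, b = sorted(pair, key=item_no.get)
    return int(str(item_no[a]) + str(item_no[b])) % 8   # e.g. (B, C) -> 12 -> bucket 4

bucket_counts = Counter()
pairs_in_bucket = {}
for items in baskets.values():
    for pair in combinations(sorted(items, key=item_no.get), 2):
        b = bucket(pair)
        bucket_counts[b] += 1
        pairs_in_bucket.setdefault(b, set()).add(pair)

for b in range(8):
    bit = 1 if bucket_counts[b] >= SUPPORT else 0
    print(b, bucket_counts[b], bit, sorted(pairs_in_bucket.get(b, set())))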
The PCY algorithm was developed by Park, Chen, and Yu. It is an algorithm used in the
field of big data analytics for frequent itemset mining when the dataset is very large.
Consider a huge collection of data containing a number of transactions; for example,
every product we buy online is recorded as a transaction. Suppose a person is buying a
shirt from a site; along with the shirt, the site advises the person to buy jeans as well,
with some discount. So we can see how two different items are made into a single set and
associated. The main purpose of this algorithm is to find frequent itemsets – say, that
along with a shirt people frequently buy jeans.
For example, if a shirt is most frequently bought along with jeans, then {shirt, jeans} is
considered a frequent itemset.
Question: Apply PCY algorithm on the following transaction to find the candidate sets
(frequent sets).
Given data
T1 = {1, 2, 3}
T2 = {2, 3, 4}
T3 = {3, 4, 5}
T4 = {4, 5, 6}
T5 = {1, 3, 5}
T6 = {2, 4, 6}
T7 = {1, 3, 4}
T8 = {2, 4, 5}
T9 = {3, 4, 6}
T10 = {1, 2, 4}
T11 = {2, 3, 5}
T12= {3, 4, 6}
Solution:
1. Count the occurrences (support) of each candidate item in the given dataset.
2. Reduce the candidate set to the items of length 1 that meet the minimum support.
3. Form pairs of candidates and count the occurrences of each pair.
4. Apply a hash function to each pair to find its bucket number.
5. Draw the candidate set table.
Step 1 and Step 2: Items → {1, 2, 3, 4, 5, 6}
Key (item): 1 2 3 4 5 6
Value (count): 4 6 8 8 6 4
Since no item in this example has a count below the minimum support threshold, the
candidate set = {1, 2, 3, 4, 5, 6}.
Step 3: Form all pairs of candidates and count the occurrences of each pair.
Note: Pairs should not be repeated; skip pairs that have already been listed.
Listing all the pairs whose count is at least the threshold value: {(1,3) (2,3) (2,4) (3,4) (3,5)
(4,5) (4,6)}
Now, apply the hash function to each pair to obtain its bucket number (the bucket numbers
below are consistent with hashing a pair (x, y) to (x × y) mod 10), and arrange the pairs in
ascending order of their bucket numbers.
Bit vector | Bucket no. | Highest support count | Pair | Candidate set
1 | 0 | 3 | (4,5) | (4,5)
1 | 2 | 4 | (3,4) | (3,4)
1 | 3 | 3 | (1,3) | (1,3)
1 | 4 | 3 | (4,6) | (4,6)
1 | 5 | 5 | (3,5) | (3,5)
1 | 6 | 3 | (2,3) | (2,3)
1 | 8 | 3 | (2,4) | (2,4)
Check the pairs whose highest support count is greater than or equal to 3 (the threshold)
and write those in the candidate set; if the count is less than 3, the pair is rejected.
(Note: if the highest support count of a bucket is greater than or equal to the threshold,
its bit vector is written as 1, otherwise 0.)
What is Clustering?
In this article, we shall understand the various types of clustering, numerous clustering methods used
in machine learning, and eventually see how they are key to solving various business problems.
Many things around us can be categorized as “this and that” or, to be less vague and more specific,
we have groupings that could be binary or groups that can be more than two, like a type of pizza
base or type of car that you might want to purchase. The choices are always clear – or, as the
technical lingo puts it, predefined groups – and the process of predicting them is an important
process in the Data Science stack called Classification.
But what if we bring into play a quest where we don’t have predefined choices initially; rather, we
derive those choices! Choices that are based on hidden patterns, underlying similarities between
the constituent variables, salient features of the data, etc. This process is known as Clustering in
Machine Learning or Cluster Analysis, where we group the data together into an unknown number
of groups and later use that information for further business processes.
So, to put it in simple words, in machine learning, clustering is the process by which we create
groups in data – like customers, products, employees, text documents – in such a way that objects
falling into one group exhibit many similar properties with each other and are different from the
objects that fall in the other groups created during the process.

Types of Clustering Methods
As we made the point earlier, for a successful grouping we need to attain two major goals: one, a
similarity between one data point and another, and two, a distinction of those similar data points
from others which most certainly, heuristically, differ from them. The basis of such divisions
begins with our ability to scale large datasets, and that’s a major starting point for us. Once we are
through it, we are presented with the challenge that our data contains different kinds of attributes –
categorical, continuous data, etc. – and we should be able to deal with them. Now, we know that our
data these days is not limited in terms of dimensions; we have data that is multi-dimensional in
nature. The clustering algorithm that we intend to use should successfully cross this hurdle as well.
The clusters that we need should not only be able to distinguish data points but should also be
inclusive. Sure, a distance metric helps a lot, but the cluster shape is often limited to being a
geometric shape and many important data points get excluded. This problem too needs to be taken
care of.
In our progress, we notice that our data is highly “noisy” in nature. Many unwanted features reside
in the data, which makes it a rather Herculean task to bring about any similarity between
the data points – leading to the creation of improper groups. As we move towards the end of the
line, we are faced with the challenge of business interpretation. The outputs from the clustering
algorithm should be understandable, should fit the business criteria, and should address the business
problem correctly.
To address the problem points above – scalability, attributes, dimensionality, boundary shape, noise,
and interpretation – we have various types of clustering methods that solve one or many of these
problems and, of course, many statistical and machine learning clustering algorithms that implement
the methodology.
1. Hierarchical Clustering (Connectivity-based Methods)
Hierarchical Clustering is a method of unsupervised machine learning clustering where it begins with
a pre-defined top-to-bottom hierarchy of clusters. It then proceeds to perform a decomposition of
the data objects based on this hierarchy, hence obtaining the clusters. This method follows two
approaches based on the direction of progress, i.e., whether it is the top-down or the bottom-up flow of
creating clusters. These are the Divisive Approach and the Agglomerative Approach respectively.
Hence, iteratively, we split the data which was once grouped as a single large cluster into “n”
smaller clusters to which the data points now belong. It must be taken into account that this
algorithm is highly “rigid” when splitting the clusters – meaning, once a clustering is done inside a
loop, there is no way that the task can be undone.

2. Centroid-based Clustering (Partitioning Methods)
These groups of clustering methods iteratively measure the distance between the clusters and the
characteristic centroids using various distance metrics. These are either Euclidean distance,
Manhattan distance or Minkowski distance. In this kind of process for clustering, as mentioned
above, a distance-based similarity metric plays a pivotal role in deciding the clustering.
The major setback here is that we should either intuitively or scientifically (Elbow Method) define
the number of clusters, “k”, to begin the iteration of any clustering machine learning algorithm and
start assigning the data points.
Despite the flaws, centroid-based clustering has proven its worth over hierarchical clustering when
working with large datasets. Also, owing to its simplicity in implementation and interpretation,
these algorithms have wide application areas, viz., market segmentation, customer segmentation,
text topic retrieval, image segmentation etc.

3. Density-based Clustering (Model-based Methods)
If one looks into the previous two methods that we discussed, one would observe that both
hierarchical and centroid-based algorithms are dependent on a distance (similarity/proximity) metric.
The very definition of a cluster is based on this metric. Density-based clustering methods take
density into consideration instead of distances. Clusters are considered the densest regions in a
data space, separated by regions of lower object density, and a cluster is defined as a maximal set
of connected points.
When performing most of the clustering, we take two major assumptions: one, the data is devoid of
any noise, and two, the shape of the cluster so formed is purely geometrical (circular or elliptical).
The fact is, data always has some extent of inconsistency (noise) which cannot be ignored. Added to
that, we must not limit ourselves to a fixed attribute shape; it is desirable to have arbitrary shapes so
as not to ignore any data points. These are the areas where density-based algorithms have proven
their worth!
Density-based algorithms can give us clusters with arbitrary shapes, clusters without any limitation on
cluster sizes, clusters that contain the maximum level of homogeneity by ensuring the same levels of
density within them, and clusters that are inclusive of outliers or noisy data.

4. Distribution-Based Clustering
Until now, the clustering techniques as we know them are based around either proximity
(similarity/distance) or composition (density). There is a family of clustering algorithms that takes a
totally different metric into consideration – probability. Distribution-based clustering creates and
groups data points based on their likelihood of belonging to the same probability distribution
(Gaussian, Binomial etc.) in the data.
The distribution models of clustering are most closely related to statistics, as the approach very
closely relates to the way datasets are generated and arranged using random sampling principles,
i.e., fetching data points from one form of distribution. Clusters can then easily be defined as objects
that are most likely to belong to the same distribution.
A major drawback of density- and boundary-based approaches is in specifying the clusters a priori
for some of the algorithms, and mostly in the definition of the shape of the clusters for most of the
algorithms. There is at least one tuning hyper-parameter which needs to be selected, and not only is
that non-trivial, but any inconsistency in it would lead to unwanted results.
Distribution-based clustering has a vivid advantage over the proximity- and centroid-based clustering
methods in terms of flexibility, correctness and the shape of the clusters formed. The major problem,
however, is that these clustering methods work well only with synthetic or simulated data, or with
data where most of the data points most certainly belong to a predefined distribution; if not, the
results will overfit.
5. Fuzzy Clustering
The general idea about clustering revolves around assigning data points to mutually exclusive
clusters, meaning a data point always resides uniquely inside one cluster and cannot belong to more
than one cluster. Fuzzy clustering methods change this paradigm by assigning a data point to
multiple clusters with a quantified degree-of-belongingness metric. The data points that are in
proximity to the center of a cluster may belong to that cluster to a higher degree than points at the
edge of the cluster. The possibility with which an element belongs to a given cluster is measured by a
membership coefficient that varies from 0 to 1.
Fuzzy clustering can be used with datasets where the variables have a high level of overlap. It is a
strongly preferred algorithm for image segmentation, especially in bioinformatics, where identifying
overlapping gene codes makes it difficult for generic clustering algorithms to differentiate between
the image’s pixels, and they fail to perform a proper clustering.

6. Constraint-based (Supervised Clustering)
The clustering process, in general, is based on the approach that the data can be divided into an
optimal number of “unknown” groups. The underlying stages of all the clustering algorithms are to find
those hidden patterns and similarities without any intervention or predefined conditions. However,
in certain business scenarios, we might be required to partition the data based on certain constraints.
Here is where a supervised version of clustering machine learning techniques comes into play.
A constraint is defined as the desired properties of the clustering results, or a user’s expectation of
the clusters so formed – this can be in terms of a fixed number of clusters, or the cluster size, or the
important dimensions (variables) that are required for the clustering process.
Usually, tree-based classification machine learning algorithms like Decision Trees, Random Forest,
Gradient Boosting, etc. are made use of to attain constraint-based clustering. A tree is constructed by
splitting without the interference of the constraints or clustering labels. Then, the leaf nodes of the
tree are combined together to form the clusters while incorporating the constraints and using
suitable algorithms.

Types of Clustering Algorithms with Detailed Description

1. k-Means Clustering
k-Means is one of the most widely used and perhaps the simplest unsupervised algorithms to solve
the clustering problems. Using this algorithm, we classify a given data set through a certain number
of predetermined clusters or “k” clusters. Each cluster is assigned a designated cluster center, and
the centers are placed as far away from each other as possible. Subsequently, each point gets
associated with its nearest centroid till no point is left unassigned. Once this is done, the centers are
re-calculated and the above steps are repeated. The algorithm converges at a point where the
centroids cannot move any further. This algorithm targets to minimize an objective function
called the squared error function F(V):
F(V) = Σ (over clusters j = 1..k) Σ (over points xi in cluster j) ( ||xi – vj|| )²
where ||xi – vj|| is the distance between the point xi and the cluster center vj.
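The notes below point to R’s kmeans() and Python’s sklearn.cluster.KMeans for this algorithm; here is a minimal usage sketch (the toy points are borrowed from the K-Means worked example later in these notes, and k = 3 is an assumption).

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[2, 10], [2, 5], [8, 4], [5, 8], [7, 5], [6, 4], [1, 2], [4, 9]])
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # centroids that minimize the squared-error objective
print(km.labels_)            # cluster index assigned to each point
print(km.inertia_)           # value of the objective F(V) at convergence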
Implementation:
In R, there is a built-in function kmeans() and in Python, we make use of the scikit-learn cluster
module, which has the KMeans function (sklearn.cluster.KMeans).
Advantages:
1. Can be applied to any form of data – as long as the data has numerical (continuous) entities.
2. Much faster than other algorithms.
3. Easy to understand and interpret.
Drawbacks:
Application Areas:
a. Document clustering – high application area in segmenting text-matrix related data like DTM,
TF-IDF etc.
b. Banking and insurance fraud detection, where the majority of the columns represent a financial
figure – continuous data.
c. Image segmentation.
d. Customer segmentation.

2. Hierarchical Clustering Algorithm
As discussed in the earlier section, hierarchical clustering methods follow two approaches – Divisive
and Agglomerative. Their implementation family contains two algorithms respectively: the
divisive DIANA (Divisive Analysis) and AGNES (Agglomerative Nesting).

2.1 DIANA or Divisive Analysis
As discussed earlier, the divisive approach begins with one single cluster to which all the data points
belong. It is then split into multiple clusters and the data points get reassigned to each of the
clusters on the basis of the nearest distance measure of the pairwise distance between the data
points. These distance measures can be Ward’s distance, centroid distance, average linkage,
complete linkage or single linkage. Ideally, the algorithm continues until each data point has its own
cluster.
Implementation: In R, we make use of the diana() function from the cluster package (cluster::diana).

2.2 Agglomerative Nesting or AGNES
AGNES starts by considering the fact that each data point has its own cluster, i.e., if there are n data
rows, then the algorithm begins with n clusters initially. Then, iteratively, clusters that are most
similar – again based on the distances as measured in DIANA – are combined to form a larger
cluster. The iterations are performed until we are left with one huge cluster that contains all the
data points.
Implementation:
In R, we make use of the agnes() function from the cluster package (cluster::agnes()) or the built-in
hclust() function from the native stats package. In Python, the implementation can be found in the
scikit-learn package via the AgglomerativeClustering function inside the cluster
module (sklearn.cluster.AgglomerativeClustering).
Advantages:
1. No prior knowledge about the number of clusters is needed, although the user needs to define a
threshold for divisions.
2. Easy to implement across various forms of data and known to provide robust results for data
generated via various sources. Hence it has a wide application area.
Disadvantages:
1. The cluster division (DIANA) or combination (AGNES) is really strict and once performed, it
cannot be undone and re-assigned in subsequent iterations or re-runs.
2. It has a high time complexity, in the order of O(n² log n) for all the n data points, hence it cannot
be used for larger datasets.
3. Cannot handle outliers and noise.
Application areas:
1. Widely used in DNA sequencing to analyse the evolutionary history and the relationships among
biological entities (Phylogenetics).
2. Identifying fake news by clustering the news article corpus, assigning the tokens or words into
these clusters and marking out suspicious and sensationalized words to get possible faux words.
3. Personalization and targeting in marketing and sales.
4. Classifying the incoming network traffic to a website by clustering the HTTP requests and then
heuristically identifying the problematic clusters and eventually restricting them.
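Since the implementation note above mentions sklearn.cluster.AgglomerativeClustering, here is a minimal AGNES-style sketch; the toy data and the choice of linkage are illustrative assumptions.

import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[2, 10], [2, 5], [8, 4], [5, 8], [7, 5], [6, 4], [1, 2], [4, 9]])
# Bottom-up merging; linkage can be "ward", "average", "complete" or "single"
agnes = AgglomerativeClustering(n_clusters=3, linkage="average").fit(X)
print(agnes.labels_)   # cluster index for each data point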
3. Fuzzy C Means Algorithm – FANNY (Fuzzy Analysis Clustering)
This algorithm follows the fuzzy cluster assignment methodology of clustering. The working of the
FCM algorithm is almost similar to k-means – distance-based cluster assignment – however, the
major difference is, as mentioned earlier, that according to this algorithm a data point can be put
into more than one cluster. This degree of belongingness can be clearly seen in the cost function of
this algorithm, J = Σi Σj (uij)^m ||xi – μj||², where:
uij is the degree of belongingness of data xi to a cluster cj,
μj is the cluster center of the cluster j, and
m is the fuzzifier.
So, just like the k-means algorithm, we first specify the number of clusters k and then assign the
degree of belongingness to the cluster. We then repeat the algorithm until max_iterations is reached,
which again can be tuned according to the requirements.
Implementation:
In R, FCM can be implemented using fanny() from the cluster package (cluster::fanny) and in
Python, fuzzy clustering can be performed using the cmeans() function from the skfuzzy module
(skfuzzy.cmeans); further, it can be applied to new data using the predictor function
(skfuzzy.cmeans_predict).
Advantages:
1. FCM works best for highly correlated and overlapped data, where k-means cannot give any
conclusive results.
2. It is an unsupervised algorithm and it has a higher rate of convergence than other partitioning-based
algorithms.
Disadvantages:
1. We need to specify the number of clusters “k” prior to the start of the algorithm.
2. Although convergence is always guaranteed, the process is very slow and it cannot be used for
larger data.
3. Prone to errors if the data has noise and outliers.
Application Areas:
1. Used widely in image segmentation of medical imagery, especially images generated by an MRI.
2. Market definition and segmentation.
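As a companion to the FCM cost function above, here is a minimal NumPy sketch of the standard fuzzy membership and center updates; it is a generic illustration, not the cluster::fanny or skfuzzy implementation.

import numpy as np

def fcm_memberships(X, centers, m=2.0):
    # u[i, j]: degree of belongingness of point x_i to cluster j (rows sum to 1)
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    d = np.fmax(d, 1e-10)                  # avoid division by zero
    ratio = d[:, :, None] / d[:, None, :]  # d_ij / d_ik for every pair of clusters
    return 1.0 / np.sum(ratio ** (2.0 / (m - 1.0)), axis=2)

def fcm_centers(X, u, m=2.0):
    # New centers: means of the points weighted by u^m (m is the fuzzifier)
    w = u ** m
    return (w.T @ X) / w.sum(axis=0)[:, None]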
4. Mean Shift Clustering
Mean shift clustering is a form of nonparametric clustering approach which not only eliminates the
need for a-priori specification of the number of clusters but also removes the spatial and shape
constraints of the clusters – two of the major problems of the most widely preferred k-means
algorithm. It is a density-based clustering algorithm that first seeks the stationary points of the
density function. The clusters are then shifted to a region of higher density by moving the center of
the cluster to the mean of the points present in the current window. The shift of the window is
repeated until no more points can be accommodated inside that window.
Disadvantages:
1. The selection of the window radius is highly arbitrary and cannot be related to any business logic,
and selecting an incorrect window size is never desirable.
Applications:
1. Image segmentation and computer vision – mostly used for handwritten text identification.
2. Image tracking in video analysis.

5. DBSCAN – Density-based Spatial Clustering
Density-based algorithms, in general, are pivotal in application areas where we require non-linear
cluster structures, purely based on density. One of the ways this principle can be made into reality is
by using the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm.
There are two major underlying concepts in DBSCAN – one, density reachability, and second,
density connectivity. This helps the algorithm differentiate and separate regions of varying degrees
of density – hence creating clusters.
For implementing DBSCAN, we first begin by defining two important parameters – a radius
parameter eps (ϵ) and a minimum number of points within the radius (m).
Implementation:
In Python, it is implemented via the DBSCAN() function from the scikit-learn cluster module
(sklearn.cluster.DBSCAN), and in R it is implemented through dbscan() from the dbscan package
(dbscan::dbscan(x, eps, minpts)).
Applications:
1. Used in document network analysis of text data for identifying plagiarism and copyrights in
various scientific documents and scholarly articles.
2. Widely used in recommendation systems for various web applications and eCommerce websites.
3. Used in X-ray crystallography to categorize the protein structure of a certain protein and to
determine its interactions with other proteins in the strands.
4. Clustering in Social Network Analysis is implemented by DBSCAN, where objects (points) are
clustered based on the object’s linkage rather than similarity.
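A minimal usage sketch of sklearn.cluster.DBSCAN mentioned above; the toy data, eps, and min_samples values are illustrative assumptions.

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[2, 10], [2, 5], [8, 4], [5, 8], [7, 5], [6, 4], [1, 2], [4, 9]])
db = DBSCAN(eps=3.0, min_samples=2).fit(X)   # eps = radius parameter, min_samples = m
print(db.labels_)   # -1 marks noise points that are not density-reachable from any cluster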
6. Gaussian Mixture Model (GMM)
Implementation:
In Python, it is implemented via the GaussianMixture() function from scikit-learn
(sklearn.mixture.GaussianMixture), and in R it is implemented using GMM() from the clusteR
package (clusteR.GMM()).
Advantages:
1. The associativity of a data point to a cluster is quantified using probability metrics – which can be
easily interpreted.
2. Proven to be accurate for real-time data sets.
3. Some versions of GMM allow for mixed membership of data points, hence it can be a good
alternative to Fuzzy C Means to achieve fuzzy clustering.
Applications:
1. GMM has been practically used in topic mining, where we can associate multiple topics to a
particular document (an atomic part of a text – a news article, online review, Twitter tweet, etc.).
2. Spectral clustering, combined with Gaussian Mixture Models-EM, is used in image processing.
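A minimal usage sketch of sklearn.mixture.GaussianMixture mentioned above; the toy data and n_components = 3 are assumptions.

import numpy as np
from sklearn.mixture import GaussianMixture

X = np.array([[2, 10], [2, 5], [8, 4], [5, 8], [7, 5], [6, 4], [1, 2], [4, 9]], dtype=float)
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0).fit(X)
print(gmm.predict(X))         # hard cluster assignments
print(gmm.predict_proba(X))   # soft (probabilistic) membership per component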
Applications of Clustering
We have seen numerous methodologies and approaches for clustering in machine learning and some
of the important algorithms that implement those techniques. Let’s have a quick overview of the
business applications of clustering and understand its role in Data Mining.
1. It is the backbone of search engine algorithms – where objects that are similar to each other must
be presented together and dissimilar objects should be ignored. Also, it is required to fetch objects
that are closely related to a search term, if not completely related.
2. A similar application of text clustering, like search engines, can be seen in academics, where
clustering can help in the associative analysis of various documents – which can in turn be used in
plagiarism, copyright infringement, patent analysis etc.
3. Used in image segmentation in bioinformatics, where clustering algorithms have proven their worth
in detecting cancerous cells from various medical imagery – eliminating the prevalent human errors
and other biases.
4. Netflix has used clustering in implementing movie recommendations for its users.
5. News summarization can be performed using cluster analysis, where articles can be divided into a
group of related topics.
6. Clustering is used in getting recommendations for sports training for athletes based on their goals
and various body-related metrics, and to assign the training regimen to the players accordingly.
7. Marketing and sales applications use clustering to identify the demand-supply gap based on
various past metrics – where a definitive meaning can be given to huge amounts of scattered data.
8. Various job search portals use clustering to divide job posting requirements into organized groups,
which makes it easier for a job-seeker to apply and target a suitable job.
9. Resumes of job-seekers can be segmented into groups based on various factors like skill sets,
experience, strengths, type of projects, expertise etc., which helps potential employers connect
with the correct resources.
10. Clustering effectively detects hidden patterns, rules, constraints, flow etc. based on various metrics
of traffic density from GPS data, and can be used for segmenting routes and suggesting users the
best routes, locations of essential services, searching for objects on a map etc.
11. Satellite imagery can be segmented to find suitable and arable land for agriculture.
12. Pizza Hut very famously used clustering to perform customer segmentation, which helped them
target their campaigns effectively and helped increase their customer engagement across various
channels.
13. Clustering can help in customer persona analysis based on Recency, Frequency, and Monetary
metrics and build an effective user profile – in turn, this can be used for customer loyalty methods
to curb customer churn.
14. Document clustering is effectively being used in preventing the spread of fake news on social
media.
15. Website network traffic can be divided into various segments, so we can heuristically prioritize
the requests; this also helps in detecting and preventing malicious activities.
16. Fantasy sports have become a part of popular culture across the globe, and clustering algorithms
can be used in identifying team trends, aggregating expert ranking data, player similarities, and
other strategies and recommendations for the users.
K-Means Clustering Algorithm | Examples
Pattern Recognition
K-Means Clustering
● K-Means clustering is an unsupervised iterative clustering technique.
● It partitions the given data set into k predefined distinct clusters.
● A cluster is defined as a collection of data points exhibiting certain similarities.
K-Means Clustering Algorithm-
Step-01:
● Choose the number of clusters K.
Step-02:
● Randomly select any K data points as the initial cluster centers, choosing them as far apart from
each other as possible.
Step-03:
● Calculate the distance between each data point and each cluster center.
● The distance may be calculated either by using the given distance function or by using the
Euclidean distance formula.
Step-04:
● Assign each data point to the cluster whose center is nearest to it.
Step-05:
● Re-compute the center of each newly formed cluster.
● The center of a cluster is computed by taking the mean of all the data points contained in
that cluster.
Step-06:
Keep repeating the procedure from Step-03 to Step-05 until any of the following stopping criteria
is met-
● the centers of the newly formed clusters do not change,
● the data points remain in the same cluster, or
● the maximum number of iterations is reached.
Advantages-
Point-01:
K-Means clustering is relatively efficient; its time complexity is O(n·k·t), where
● n = number of instances
● k = number of clusters
● t = number of iterations
Point-02:
Disadvantages-
Problem-01:
Cluster the following eight points (with (x, y) representing locations) into three clusters:
A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9)
Initial cluster centers are: A1(2, 10), A4(5, 8) and A7(1, 2).
The distance function between two points a = (x1, y1) and b = (x2, y2) is defined as
Ρ(a, b) = |x2 – x1| + |y2 – y1| (Manhattan distance).
Use K-Means Algorithm to find the three cluster centers after the second iteration.
Solution-
Iteration-01:
● We calculate the distance of each point from each of the center of the three clusters.
● The distance is calculated by using the given distance function.
The following illustration shows the calculation of distance between point A1(2, 10) and each of
the center of the three clusters-
Ρ(A1, C1)
= |2 – 2| + |10 – 10|
=0
Ρ(A1, C2)
= |5 – 2| + |8 – 10|
=3+2
=5
Ρ(A1, C3)
= |1 – 2| + |2 – 10|
=1+8
=9
In the similar manner, we calculate the distance of other points from each of the center of the
three clusters.
Next,
Given Points Distance from Distance from Distance from Point belongs
center (2, 10) of center (5, 8) of center (1, 2) of to Cluster
Cluster-01 Cluster-02 Cluster-03
A1(2, 10) 0 5 9 C1
A2(2, 5) 5 6 4 C3
A3(8, 4) 12 7 9 C2
A4(5, 8) 5 0 10 C2
A5(7, 5) 10 5 9 C2
A6(6, 4) 10 5 7 C2
A7(1, 2) 9 10 0 C3
A8(4, 9) 3 2 10 C2
Cluster-01:
● A1(2, 10)
Cluster-02:
● A3(8, 4)
● A4(5, 8)
● A5(7, 5)
● A6(6, 4)
● A8(4, 9)
Cluster-03:
● A2(2, 5)
● A7(1, 2)
Now,
For Cluster-01:
Center of Cluster-01
= (2, 10)
For Cluster-02:
Center of Cluster-02
= (6, 6)
For Cluster-03:
Center of Cluster-03
= (1.5, 3.5)
Iteration-02:
● We calculate the distance of each point from each of the center of the three clusters.
● The distance is calculated by using the given distance function.
The following illustration shows the calculation of distance between point A1(2, 10) and each of
the center of the three clusters-
Ρ(A1, C1)
= |2 – 2| + |10 – 10|
=0
Ρ(A1, C2)
= |6 – 2| + |6 – 10|
=4+4
=8
Ρ(A1, C3)
= |1.5 – 2| + |3.5 – 10|
= 0.5 + 6.5
= 7
In the similar manner, we calculate the distance of other points from each of the center of the
three clusters.
Next,
Given Points Distance from Distance from Distance from Point belongs
center (2, 10) of center (6, 6) of center (1.5, 3.5) of to Cluster
Cluster-01 Cluster-02 Cluster-03
A1(2, 10) 0 8 7 C1
A2(2, 5) 5 5 2 C3
A3(8, 4) 12 4 7 C2
A4(5, 8) 5 3 8 C2
A5(7, 5) 10 2 7 C2
A6(6, 4) 10 2 5 C2
A7(1, 2) 9 9 2 C3
A8(4, 9) 3 5 8 C1
Cluster-01:
● A1(2, 10)
● A8(4, 9)
Cluster-02:
● A3(8, 4)
● A4(5, 8)
● A5(7, 5)
● A6(6, 4)
Cluster-03:
● A2(2, 5)
● A7(1, 2)
Now,
For Cluster-01:
Center of Cluster-01
= (3, 9.5)
For Cluster-02:
Center of Cluster-02
= (6.5, 5.25)
For Cluster-03:
Center of Cluster-03
= (1.5, 3.5)
● C1(3, 9.5)
● C2(6.5, 5.25)
● C3(1.5, 3.5)
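The two iterations above can be reproduced with a short Python sketch of the assignment and update steps, using the given Manhattan distance function.

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

points = [(2, 10), (2, 5), (8, 4), (5, 8), (7, 5), (6, 4), (1, 2), (4, 9)]
centers = [(2, 10), (5, 8), (1, 2)]   # initial centers A1, A4, A7

for iteration in (1, 2):
    # Assignment step: each point goes to the nearest center under the given distance
    clusters = [[] for _ in centers]
    for p in points:
        j = min(range(len(centers)), key=lambda c: manhattan(p, centers[c]))
        clusters[j].append(p)
    # Update step: new center = mean of the points in the cluster
    centers = [(sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl))
               for cl in clusters]
    print("after iteration", iteration, ":", centers)
# prints (2, 10), (6, 6), (1.5, 3.5) after iteration 1
# and (3, 9.5), (6.5, 5.25), (1.5, 3.5) after iteration 2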
Problem-02:
Cluster the following five points into two clusters using the K-Means algorithm with the Euclidean
distance formula: A(2, 2), B(3, 2), C(1, 1), D(3, 1), E(1.5, 0.5). The initial cluster centers are
C1(2, 2) and C2(1, 1).
Solution-
Iteration-01:
● We calculate the distance of each point from each of the center of the two clusters.
● The distance is calculated by using the Euclidean distance formula.
The following illustration shows the calculation of distance between point A(2, 2) and each of the
center of the two clusters-
Ρ(A, C1)
= sqrt [ 0 + 0 ]
=0
Ρ(A, C2)
= sqrt [ 1 + 1 ]
= sqrt [ 2 ]
= 1.41
In the similar manner, we calculate the distance of other points from each of the center of the
two clusters.
Next,
Given Points Distance from center Distance from center Point belongs to
(2, 2) of Cluster-01 (1, 1) of Cluster-02 Cluster
A(2, 2) 0 1.41 C1
B(3, 2) 1 2.24 C1
C(1, 1) 1.41 0 C2
D(3, 1) 1.41 2 C1
E(1.5, 0.5) 1.58 0.71 C2
Cluster-01:
● A(2, 2)
● B(3, 2)
● D(3, 1)
Cluster-02:
● C(1, 1)
● E(1.5, 0.5)
Now,
For Cluster-01:
Center of Cluster-01
= (2.67, 1.67)
For Cluster-02:
Center of Cluster-02
= (1.25, 0.75)
Next, we go to iteration-02, iteration-03 and so on until the centers do not change anymore.
• Typical methods
– Frequent-term-based document clustering
– Clustering by pattern similarity in micro-array data (pClustering)
Why p-Clustering?
• Microarray data analysis may need to
– Clustering on thousands of dimensions (attributes)
– Discovery of both shift and scaling patterns
• Clustering with Euclidean distance measure? — cannot find shift patterns
• Clustering on derived attribute Aij = ai – aj? — introduces N(N-1) dimensions
• Bi-cluster using the transformed mean-squared residue score of a submatrix (I, J)
– where d_iJ = (1/|J|) Σ_{j in J} d_ij, d_Ij = (1/|I|) Σ_{i in I} d_ij, and
d_IJ = (1/(|I||J|)) Σ_{i in I, j in J} d_ij
– and H(I, J) = (1/(|I||J|)) Σ_{i in I, j in J} (d_ij − d_iJ − d_Ij + d_IJ)²
– A submatrix is a δ-cluster if H(I, J) ≤ δ for some δ > 0
• Problems with bi-cluster
– No downward closure property
– Due to averaging, it may contain outliers but still be within the δ-threshold
p-Clustering
• Given objects x, y in O and features a, b in T, the pScore is defined on the 2×2 submatrix
pScore( [[d_xa, d_xb], [d_ya, d_yb]] ) = | (d_xa − d_xb) − (d_ya − d_yb) |
• A pair (O, T) is a δ-pCluster if for any 2×2 submatrix X in (O, T), pScore(X) ≤ δ for some
δ > 0
• Properties of δ-pCluster
– Downward closure
– Clusters are more homogeneous than bi-cluster (thus the name: pair-wise
Cluster)
• Pattern-growth algorithm has been developed for efficient mining
• For scaling patterns, one can observe that taking logarithms on the ratio condition
(d_xa / d_ya) ÷ (d_xb / d_yb) ≤ δ will lead to the pScore (shift) form above.
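A tiny Python sketch of the pScore defined above; the numeric values are only illustrative.

def pscore(d_xa, d_xb, d_ya, d_yb):
    # pScore of the 2x2 submatrix [[d_xa, d_xb], [d_ya, d_yb]]
    return abs((d_xa - d_xb) - (d_ya - d_yb))

print(pscore(3, 5, 13, 15))   # 0: a perfect shift pattern
print(pscore(2, 4, 6, 12))    # 4: objects x and y do not shift consistently on (a, b)
# For scaling patterns, apply pscore to the logarithms of the values.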
Cluster Analysis
• What is Cluster Analysis?
• Types of Data in Cluster Analysis
• A Categorization of Major Clustering Methods
• Partitioning Methods
• Hierarchical Methods
• Density-Based Methods
• Grid-Based Methods
• Model-Based Clustering Methods
EM Algorithm
• Re-estimation of the cluster weight w_k from documents d_1 … d_N and clusters c_1 … c_m:
w_k = (1/N) Σ_{i=1}^{N} [ Pr(d_i | c_k) / Σ_{j=1}^{m} Pr(d_i | c_j) ]
• Basic Concepts
Evaluation Methods
• Summary
Apriori pseudo-code:
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
Implementation of Apriori
• How to generate candidates?
– Step 1: self-joining Lk
– Step 2: pruning
• Example of Candidate-generation
– L3={abc, abd, acd, ace, bcd}
– Self-joining: L3*L3
• abcd from abc and abd
• acde from acd and ace
– Pruning:
• acde is removed because ade is not in L3
– C4 = {abcd}
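The join-and-prune example above can be reproduced with a short Python sketch of candidate generation; encoding itemsets as sorted tuples is an implementation choice.

from itertools import combinations

def apriori_gen(Lk):
    # Generate C(k+1) from L(k): self-join on the first k-1 items, then prune
    Lk = {tuple(sorted(s)) for s in Lk}
    k = len(next(iter(Lk)))
    candidates = set()
    for a in Lk:
        for b in Lk:
            if a[:k - 1] == b[:k - 1] and a[k - 1] < b[k - 1]:     # join step
                c = a + (b[k - 1],)
                # prune step: every k-subset of c must itself be frequent
                if all(sub in Lk for sub in combinations(c, k)):
                    candidates.add(c)
    return candidates

L3 = {("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"), ("a", "c", "e"), ("b", "c", "d")}
print(apriori_gen(L3))   # {('a', 'b', 'c', 'd')}; acde is pruned because ade is not in L3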
Subset function
• Candidate itemsets are stored in a hash tree: interior nodes hash an item into one of three branches
(1,4,7 / 2,5,8 / 3,6,9), and the leaves hold candidate 3-itemsets such as 124, 125, 136, 145, 159,
234, 345, 356, 357, 367, 368, 457, 458, 567, 689.
• To count supports, a transaction such as 1 2 3 5 6 is matched against the tree recursively
(1 + 2356, 12 + 356, 13 + 56, …), so that only the candidates that can be contained in the
transaction are reached.
Clustering in non-Euclidean space, clustering for streams, and parallelism
Main Topics
What is Clustering?
Distance measures and spaces.
Algorithmic approaches.
The curse of dimensionality.
Hierarchical clustering.
Point-assignment clustering.
Non-main-memory data clustering.
Summary and other topics.
• Whether the algorithm assumes that the data is small enough to fit in main memory, or whether
data must reside primarily in secondary memory.
• Merging rule: merge the two clusters with the shortest Euclidean distance between their centroids.
Continuation.
• The diameter of a cluster is the maximum distance between any two points of the cluster.
We merge those clusters whose resulting cluster has the lowest diameter.
• For example, the centroid of the cluster in step 3 is (11, 4); its radius is the maximum distance
from this centroid to any point of the cluster, and its diameter is the maximum distance between
any two of its points.
Solution:
We pick one of the points in the cluster itself to represent the cluster. This point should be selected
so that it is as close as possible to all the points in the cluster, so that it represents some kind of
“center”.
We call this representative point the clustroid.
K-Means Algorithms
K-Means Algorithm
The algorithm:
Initially choose k points that are likely to be in different clusters;
Make these points the centroids of their clusters;
FOR each remaining point p DO
    Find the centroid to which p is closest;
    Add p to the cluster of that centroid;
    Adjust the centroid of that cluster to account for p;
END;
• Optional: fix the centroids of the clusters and re-assign each point to the k clusters
(this usually has little influence on the result).
K-Means Algorithm
Initializing clusters. A few approaches:
• Pick points that are as far away from one another as possible.
• Cluster a sample of the data (perhaps hierarchically) so there are k clusters. Pick a point from
each cluster (perhaps the point closest to the cluster centroid).

K-Means Algorithm
Example for initializing clusters:
We have the following set of points (the scatter plot is not reproduced here).
• We first pick a starting point; in the worst case it lies near the center, say (6, 8). That’s the first
point.
• The furthest point from (6, 8) is (12, 3), so that’s the next point.
K-Means Algorithm
• Now we look for the point whose minimum distance to either (6, 8) or (12, 3) is the maximum.
For (2, 2): d((2,2),(6,8)) = 7.21 and d((2,2),(12,3)) = 10.05, so its score is
min(7.21, 10.05) = 7.21.
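A minimal Python sketch of this farthest-point initialization; the full point set from the omitted figure is not reproduced, so the extra points below are hypothetical.

import math

def farthest_first_init(points, k):
    # Pick k initial points that are as far away from one another as possible
    centers = [points[0]]                       # start, e.g., from (6, 8)
    while len(centers) < k:
        # score of a point = distance to its nearest already-chosen center
        score = lambda p: min(math.dist(p, c) for c in centers)
        centers.append(max(points, key=score))
    return centers

pts = [(6, 8), (12, 3), (2, 2), (4, 10), (10, 5), (3, 4)]   # hypothetical point set
print(farthest_first_init(pts, 3))   # picks (6, 8), then (12, 3), then (2, 2)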
K-Means Algorithm
• Picking the right value of k:
• Recall the measures of appropriateness of clusters, i.e., radius or diameter.
• We run k-means for a series of values of k, say 1, …, 10, and search for a significant decrease
in the average cluster diameter, after which it does not change much.