Unit 4
Transaction   Items
10            A, B, D
20            A, C, D
30            A, D, E
40            B, E, F
50            B, C, D, E, F
◼ support, s: the probability that a transaction contains X ∪ Y
◼ confidence, c: the conditional probability that a transaction containing X also contains Y
Let supmin = 50%, confmin = 50%
Freq. Pat.: {A:3, B:3, D:4, E:3, AD:3}
Association rules:
A → D (60%, 100%)
D → A (60%, 75%)
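These support and confidence figures can be checked with a short Python sketch (illustrative only; the TID 10 row is the one inferred above from the stated frequent-pattern counts):

```python
# Minimal sketch: verify the support/confidence figures for the toy database above.
transactions = [
    {"A", "B", "D"},           # TID 10 (inferred from the stated counts)
    {"A", "C", "D"},           # TID 20
    {"A", "D", "E"},           # TID 30
    {"B", "E", "F"},           # TID 40
    {"B", "C", "D", "E", "F"}  # TID 50
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """Conditional probability that a transaction with `lhs` also contains `rhs`."""
    return support(lhs | rhs) / support(lhs)

print(support({"A", "D"}))        # 0.6  -> 60%
print(confidence({"A"}, {"D"}))   # 1.0  -> 100%  (A -> D)
print(confidence({"D"}, {"A"}))   # 0.75 -> 75%   (D -> A)
```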
X → Y                  Support   Confidence
Bread → PeanutButter   60%       75%
Jelly → Milk           0%        0%
Observations
Confidence measures the strength of the rule, whereas support
measures how often it occurs in the database.
Typically, large confidence values and a smaller support are used.
In our example, Bread → PeanutButter has confidence 75%,
which indicates that this rule holds 75% of the time that it could.
That is, ¾ of the times that Bread occurs, so does PeanutButter. This
is a stronger rule than Jelly → Milk because there are no times
Milk is purchased when Jelly is bought.
Lower values for support may be allowed, as support indicates the
percentage of time the rule occurs throughout the database. For example,
with Jelly → PeanutButter, the confidence is 100% but the
support is only 20%. It may be the case that this association
rule exists in only 20% of the transactions, but when the
antecedent Jelly occurs, the consequent always occurs. Here an
advertising strategy targeted at people who purchase jelly
would be appropriate.
The Apriori Algorithm: Finding Frequent
Itemsets Using Candidate Generation
Apriori is an algorithm proposed by R. Agrawal and R.
Srikant in 1994 for mining frequent itemsets for Boolean
association rules.
Apriori employs an iterative approach known as level-
wise search, where k-itemsets are used to explore
(k+1)-itemsets. First, the set of frequent 1-itemsets is
found by scanning the database to accumulate the count
for each item and collecting those items that satisfy
minimum support. The resulting set is denoted by L1.
Next, L1 is used to find L2, the set of frequent
2-itemsets, which is used to find L3, and so on, until no
more frequent k-itemsets can be found.
Apriori Property –
The Apriori property is based on the following
observation. By definition, if an itemset I does not
satisfy the minimum support threshold, min_sup, then I
is not frequent; that is, P(I) < min_sup. If an item A is
added to the itemset I, then the resulting itemset (i.e., I ∪ A)
cannot occur more frequently than I. Therefore, I ∪ A is not
frequent either; that is, P(I ∪ A) < min_sup.
Let us see how Lk-1 is used to find Lk for k ≥ 2. A two-
step process is followed, consisting of join and prune
actions.
The Join Step –
To find Lk, a set of candidate k-itemsets is generated by
joining Lk-1 with itself. This set of candidates is denoted
Ck. Let l1 and l2 be itemsets in Lk-1. The notation li[j]
refers to the jth item in li. By convention, Apriori
assumes that items within a transaction or itemset are
sorted in lexicographic order. For a (k-1)-itemset li,
this means that the items are sorted such that
li[1] < li[2] < … < li[k-1].
The join Lk-1 ⋈ Lk-1 is performed, where members of Lk-1
are joinable if their first (k-2) items are in common.
The resulting itemset formed by joining l1 and l2 is
{l1[1], l1[2], …, l1[k-2], l1[k-1], l2[k-1]}.
The Prune Step –
Ck is a superset of Lk; that is, its members may or may not
be frequent, but all of the frequent k-itemsets are
included in Ck. A scan of the database to
determine the count of each candidate in Ck
results in the determination of Lk.
The Apriori Algorithm—An Example (Supmin = 2)

Database TDB
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1 (frequent 1-itemsets): {A}:2, {B}:3, {C}:3, {E}:3

C2 (from L1 ⋈ L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan → counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2 (frequent 2-itemsets): {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3 (from L2 ⋈ L2, after pruning): {B,C,E}
3rd scan → L3 (frequent 3-itemsets): {B,C,E}:2
TID    List of Item IDs
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3
Step 1: Scan D for the count of each candidate (C1), then compare each
candidate's support count with the minimum support count (2) to obtain L1.
Since every item meets the minimum support count, L1 = C1.

C1 = L1
Itemset   Sup. Count
{I1}      6
{I2}      7
{I3}      6
{I4}      2
{I5}      2

Step 2: Generate the C2 candidates from L1, scan D for the count of each
candidate, then compare the candidate support counts with the minimum
support count to obtain L2.

C2
Itemset    Sup. Count
{I1,I2}    4
{I1,I3}    4
{I1,I4}    1
{I1,I5}    2
{I2,I3}    4
{I2,I4}    2
{I2,I5}    2
{I3,I4}    0
{I3,I5}    1
{I4,I5}    0

L2
Itemset    Sup. Count
{I1,I2}    4
{I1,I3}    4
{I1,I5}    2
{I2,I3}    4
{I2,I4}    2
{I2,I5}    2

Step 3: Generate the C3 candidates from L2, scan D for the count of each
candidate, then compare the candidate support counts with the minimum
support count to obtain L3.

C3 = L3
Itemset        Sup. Count
{I1, I2, I3}   2
{I1, I2, I5}   2
Once the frequent itemsets are found, the confidence of each candidate
association rule can be calculated using the formula
confidence(A ⇒ B) = support_count(A ∪ B) / support_count(A). Rules whose
confidence is at least the minimum confidence threshold are
called strong association rules.
In our example, if the frequent itemset is l = {I1, I2, I5}, the
candidate association rules are as follows:
I1 ∧ I2 → I5   Confidence = 2/4 = 50%
I1 ∧ I5 → I2   Confidence = 2/2 = 100%
I2 ∧ I5 → I1   Confidence = 2/2 = 100%
I1 → I2 ∧ I5   Confidence = 2/6 = 33%
I2 → I1 ∧ I5   Confidence = 2/7 = 29%
I5 → I1 ∧ I2   Confidence = 2/2 = 100%
If the minimum confidence threshold is, say, 70%, then
only the second, third and the last rules above are
output, because these are the only ones generated that
are strong.
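The six confidence values above can be reproduced with a short Python sketch (illustrative only; names are my own):

```python
# Sketch: compute the confidence of every rule derived from the frequent itemset {I1, I2, I5}.
from itertools import combinations

transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]

def count(itemset):
    """Support count: number of transactions containing the itemset."""
    return sum(itemset <= t for t in transactions)

freq = {"I1", "I2", "I5"}
for r in range(1, len(freq)):
    for lhs in combinations(sorted(freq), r):
        lhs, rhs = set(lhs), freq - set(lhs)
        conf = count(freq) / count(lhs)
        print(f"{sorted(lhs)} -> {sorted(rhs)}: {conf:.0%}")
```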
The Apriori Algorithm
• Pseudo-code:
    Ck: candidate itemsets of size k
    Lk: frequent itemsets of size k
    L1 = {frequent items};
    for (k = 1; Lk ≠ ∅; k++) do begin
        Ck+1 = candidates generated from Lk;
        for each transaction t in database do
            increment the count of all candidates in Ck+1
            that are contained in t;
        Lk+1 = candidates in Ck+1 with min_support;
    end
    return ∪k Lk;
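For reference, here is a direct, unoptimized Python rendering of this pseudo-code; the helper names are my own, and the last lines check it against the TDB example from the earlier slide:

```python
# Illustrative sketch of Apriori, following the pseudo-code above.
from itertools import combinations

def apriori(transactions, min_support_count):
    transactions = [frozenset(t) for t in transactions]
    # L1: frequent 1-itemsets
    items = {i for t in transactions for i in t}
    Lk = {frozenset([i]) for i in items
          if sum(i in t for t in transactions) >= min_support_count}
    all_frequent = set(Lk)
    k = 1
    while Lk:
        # generate C(k+1) by self-joining Lk, then prune candidates
        # that have an infrequent k-subset (Apriori property)
        candidates = set()
        for a in Lk:
            for b in Lk:
                union = a | b
                if len(union) == k + 1 and all(
                        frozenset(s) in Lk for s in combinations(union, k)):
                    candidates.add(union)
        # one scan of the database to count the surviving candidates
        Lk = {c for c in candidates
              if sum(c <= t for t in transactions) >= min_support_count}
        all_frequent |= Lk
        k += 1
    return all_frequent

# Example: database TDB from the earlier slide, minimum support count = 2
tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(sorted(sorted(s) for s in apriori(tdb, 2)))
```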
Important Details of Apriori
• How to generate candidates?
– Step 1: self-joining Lk
– Step 2: pruning
• How to count supports of candidates?
• Example of Candidate-generation
– L3={abc, abd, acd, ace, bcd}
– Self-joining: L3*L3
• abcd from abc and abd
• acde from acd and ace
– Pruning:
• acde is removed because ade is not in L3
– C4={abcd}
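A small sketch of the self-join and prune steps on this example (itemsets represented as sorted tuples; the function name is illustrative):

```python
# Sketch of the join + prune steps for the candidate-generation example above.
from itertools import combinations

L3 = [("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")]

def join_and_prune(Lk_minus_1, k):
    Lset = set(Lk_minus_1)
    Ck = set()
    for l1 in Lk_minus_1:
        for l2 in Lk_minus_1:
            # join: first (k-2) items equal, last item of l1 < last item of l2
            if l1[:k-2] == l2[:k-2] and l1[k-2] < l2[k-2]:
                candidate = l1 + (l2[k-2],)
                # prune: every (k-1)-subset must itself be frequent
                if all(sub in Lset for sub in combinations(candidate, k-1)):
                    Ck.add(candidate)
    return Ck

print(join_and_prune(L3, 4))   # {('a', 'b', 'c', 'd')}; acde is pruned because ade is not in L3
```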
Assignment
• Sample clothing Transactions are given as follows. Find
frequent items and strong association rules using Apriori
Algorithm.
TID Items TID Items
T1 Blouse T11 Tshirt
T2 Shoes, Skirt, Tshirt T12 Blouse, Jeans, Shoes, Skirt, Tshirt
T3 Jeans, Tshirt T13 Jeans, Shoes, Shorts, Tshirt
T4 Jeans, Shoes,Tshirt T14 Shoes, Skirt, Tshirt
T5 Jeans, Shorts T15 Jeans, Tshirt
T6 Shoes, Tshirt T16 Skirt, Tshirt
T7 Jeans, Skirt T17 Blouse, Jeans, Skirt
T8 Jeans, Shoes, Shorts, Tshirt T18 Jeans, Shoes, Shorts, Tshirt
T9 Jeans T19 Jeans
T10 Jeans, Shoes, Tshirt T20 Jeans, Shoes, Shorts, Tshirt
Problem:
TID Items
1 acd
2 bce
3 abce
4 be
Learning association rules basically means finding the items that are purchased together more
frequently than others.
For example, in the table above you can see that some items are bought together more
frequently than others.
So, as I said, Apriori is the classic and probably the most basic algorithm for doing this. If you
search online you can easily find the pseudo-code and mathematical equations and such. I would
like to make it more intuitive and easy, if I can.
I would like a 10th or 12th grader to be able to understand this without any problem, so I will try
not to use any terminology or jargon.
Now, we follow a simple golden rule: we say an item/itemset is frequently bought if it is bought
at least 60% of the time. So here it should be bought at least 3 times.
For simplicity
M = Mango
O = Onion
And so on……
Original table:
Transaction ID   Items Bought
T1 {M, O, N, K, E, Y }
T2 {D, O, N, K, E, Y }
T3 {M, A, K, E}
T4 {M, U, C, K, Y }
T5 {C, O, O, K, I, E}
Step 1: Count the number of transactions in which each item occurs. Note that ‘O = Onion’ is bought
4 times in total, but it occurs in just 3 transactions.
Item No of
transactions
M 3
O 3
N 2
K 5
E 4
Y 3
D 1
A 1
U 1
C 2
I 1
Step 2: Now remember we said an item is considered frequently bought if it is bought at least 3 times.
So in this step we remove all the items that are bought fewer than 3 times from the above table, and
we are left with
Item Number of
transactions
M 3
O 3
K 5
E 4
Y 3
These are the single items that are bought frequently. Now let’s say we want to find pairs of items
that are bought frequently. We continue from the above table (the table in Step 2).
Step 3: We start making pairs from the first item, like MO, MK, ME, MY, and then we start with
the second item, like OK, OE, OY. We did not do OM because we already did MO when we were
making pairs with M, and buying a Mango and an Onion together is the same as buying an Onion
and a Mango together. After making all the pairs we get
Item pairs
MO
MK
ME
MY
OK
OE
OY
KE
KY
EY
Step 4: Now we count how many times each pair is bought together. For example, M and O are
bought together only in {M,O,N,K,E,Y}, while M and K are bought together 3 times, in
{M,O,N,K,E,Y}, {M,A,K,E} and {M,U,C,K,Y}.
After doing that for all the pairs we get:
MO 1, MK 3, ME 2, MY 2, OK 3, OE 3, OY 2, KE 4, KY 3, EY 2
Step 5: Golden rule to the rescue. Remove all the item pairs bought together fewer than
three times and we are left with:
MK 3, OK 3, OE 3, KE 4, KY 3
Step 6: To make sets of three items we need one more rule (this is termed a self-join).
It simply means that, from the item pairs in the above table, we find two pairs with the same first
letter, so we get
OK and OE, which gives OKE
KE and KY, which gives KEY
Then we find how many times O, K, E are bought together in the original table, and the same for
K, E, Y, and we get the following: OKE 3, KEY 2
While we are on this, suppose you have sets of 3 items, say ABC, ABD, ACD, ACE, BCD, and
you want to generate item sets of 4 items: you look for two sets having the same first two
letters.
ABC and ABD -> ABCD
ACD and ACE -> ACDE
And so on… In general, you have to look for sets that differ only in the last letter/item.
Step 7: So we again apply the golden rule, that is, the item set must be bought together at least 3
times, which leaves us with just OKE, since K, E, Y are bought together only two times.
Thus the set of three items that are bought together most frequently is O, K, E.
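The pair and triple counts in the walkthrough above can be reproduced with the sketch below (for brevity it brute-force counts all candidate combinations of frequent items rather than doing the strict self-join; names and structure are my own):

```python
# Sketch: reproduce the item, pair, and triple counts above (threshold = 3).
from itertools import combinations

baskets = [
    set("MONKEY"), set("DONKEY"), set("MAKE"), set("MUCKY"), set("COOKIE"),
]

def frequent(candidates, min_count=3):
    """Keep only candidates (strings of item letters) bought at least min_count times."""
    return {c: n for c in candidates
            if (n := sum(set(c) <= b for b in baskets)) >= min_count}

items = sorted({i for b in baskets for i in b})
L1 = frequent(items)                                              # {'E':4,'K':5,'M':3,'O':3,'Y':3}
L2 = frequent(["".join(p) for p in combinations(sorted(L1), 2)])  # {'EK':4,'EO':3,'KM':3,'KO':3,'KY':3}
L3 = frequent(["".join(t) for t in combinations(sorted(L1), 3)])  # {'EKO': 3}; KEY occurs only twice
print(L1)
print(L2)
print(L3)
```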
Counting frequent itemsets in a stream
Points to be covered
❑ PCY algorithm
❑ Example
❑ References
Frequent itemsets:
Sets of items which occur frequently (satisfying the minimum support
count) in the given data set.
For example:
Bread and Butter generally occur together frequently in the transaction data
set of a grocery store.
PCY algorithm:
It was proposed by Park, Chen, and Yu (hence the name PCY). It is an
algorithm used in the field of big data analytics for frequent itemset
mining when the dataset is very large.
Steps:
1. Count the occurrences (support count) of each candidate item
in the given dataset.
2. Reduce the candidate set to the items of length 1 that are frequent.
3. Form pairs of candidates and find the count of each pair.
4. Apply a hash function to find the bucket number of each pair.
5. Draw the candidate set table.
Threshold value (minimum support count) = 3
Hash function = (i * j) mod 10
T1 = {1, 2, 3}
T2 = {2, 3, 4}
T3 = {3, 4, 5}
T4 = {4, 5, 6}
T5 = {1, 3, 5}
T6 = {2, 4, 6}
T7 = {1, 3, 4}
T8 = {2, 4, 5}
T9 = {3, 4, 6}
T10 = {1, 2, 4}
T11 = {2, 3, 5}
T12= {3, 4, 6}
Step 1: Count the occurrences (support count) of each element.
Items → {1, 2, 3, 4, 5, 6}
Key   Value
1     4
2     6
3     8
4     9
5     5
6     4
Step 2: Remove all elements whose count is less than the threshold value (3).
Here there is no key having a value less than 3.
Hence, candidate set = {1, 2, 3, 4, 5, 6}
Step 3: Form pairs from the candidate items and calculate the count of each pair,
listing each pair only under the first transaction in which it appears:
T1: {(1, 2) (1, 3) (2, 3)} = (2, 3, 3)
T2: {(2, 4) (3, 4)} = (4, 5)
T3: {(3, 5) (4, 5)} = (3, 3)
T4: {(4, 6) (5, 6)} = (4, 1)
T5: {(1, 5)} = 1
T6: {(2, 6)} = 1
T7: {(1, 4)} = 2
T8: {(2, 5)} = 2
T9: {(3, 6)} = 2
T10, T11, T12: no new pairs
• Note: Pairs should not be repeated; skip any pair that has already been listed.
• Listing all the pairs whose count is at least the threshold value: {(1,3), (2,3), (2,4),
(3,4), (3,5), (4,5), (4,6)}
Step 4: Apply Hash Functions. (It gives us bucket number)
Hash Function = ( i * j) mod 10
(1, 3) = (1*3) mod 10 = 3
(2,3) = (2*3) mod 10 = 6
(2,4) = (2*4) mod 10 = 8
(3,4) = (3*4) mod 10 = 2
(3,5) = (3*5) mod 10 = 5
(4,5) = (4*5) mod 10 = 0
(4,6) = (4*6) mod 10 = 4
Now, arrange the pairs according to the ascending order of their
obtained bucket number.
Bucket no. Pair
0 (4,5)
2 (3,4)
3 (1,3)
4 (4,6)
5 (3,5)
6 (2,3)
8 (2,4)
Step 5: In this final step we will prepare the candidate set.
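As an illustration of how PCY's hashing fits into its two passes, here is a sketch using the transactions, threshold, and hash function above (variable names are my own):

```python
# Sketch of the PCY idea: in pass 1 count items and hash every pair to a bucket;
# in pass 2 count only pairs whose items are frequent AND whose bucket is frequent.
from itertools import combinations
from collections import Counter

transactions = [
    {1, 2, 3}, {2, 3, 4}, {3, 4, 5}, {4, 5, 6}, {1, 3, 5}, {2, 4, 6},
    {1, 3, 4}, {2, 4, 5}, {3, 4, 6}, {1, 2, 4}, {2, 3, 5}, {3, 4, 6},
]
THRESHOLD = 3
NUM_BUCKETS = 10

item_counts = Counter()
bucket_counts = Counter()
for t in transactions:                          # pass 1
    item_counts.update(t)
    for i, j in combinations(sorted(t), 2):
        bucket_counts[(i * j) % NUM_BUCKETS] += 1

frequent_items = {i for i, c in item_counts.items() if c >= THRESHOLD}
frequent_buckets = {b for b, c in bucket_counts.items() if c >= THRESHOLD}

pair_counts = Counter()
for t in transactions:                          # pass 2
    for i, j in combinations(sorted(t & frequent_items), 2):
        if (i * j) % NUM_BUCKETS in frequent_buckets:
            pair_counts[(i, j)] += 1

# frequent pairs, matching the list obtained in Step 3 above
print({p: c for p, c in pair_counts.items() if c >= THRESHOLD})
```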
Example:
A set of items, such as milk and bread, that appears frequently together
in a transaction data set is a frequent itemset.
Market Basket Analysis
Frequent itemset mining leads to the discovery of associations and correlations
among items in large transactional and relational data sets. The discovery of
interesting correlation relationships among huge amounts of business transaction
records can help in many business decision-making processes, such as catalog design,
cross-marketing, and customer shopping behavior analysis.
A typical example of frequent itemset mining is market basket analysis. This process
analyzes customer buying habits by finding associations between the different items
that customers place in their “shopping baskets”. The discovery of such associations
can help retailers develop marketing strategies by gaining insight into which items are
frequently purchased together by customers. For instance, if customers are buying
milk, how likely are they to also buy bread (and what kind of bread) on the same trip to
the supermarket?
Patterns can be represented in the form of association rules. For example, the
information that customers who purchase computers also tend to buy antivirus
software at the same time is represented in the association rule below.
computer ⇒ antivirus_software [support = 2%, confidence = 60%]
Rule support and confidence are two measures of rule interestingness. They
respectively reflect the usefulness and certainty of discovered rules. A support of 2%
means that 2% of all the transactions under analysis show that computer and
antivirus software are purchased together. A confidence of 60% means that 60% of
the customers who purchased a computer also bought the software.
1. Marketing:
Clustering helps to find groups of customers with similar
behavior from a given customer data set.
2. Biology:
Classification of plants and animals according to their
features.
3. Hospitals:
Clustering is very useful for classifying patients
based on their symptoms.
4. Anomaly detection:
And so on…
Different clustering methods
Following is the categorization of Major Clustering Methods
• Partitioning methods
In this approach, several partitions are created and then evaluated based on given criteria.
• Hierarchical methods
In this method, the set of data objects is decomposed hierarchically (into multiple levels) using certain
criteria.
• Density-based methods
This method is based on density (density reachability and density connectivity).
• Grid-based methods
This approach is based on a multi-resolution grid data structure.
K Means algorithm( Partitioning Method)
• It is a simple unsupervised learning algorithm developed
by J. MacQueen in 1967 and later by J. A. Hartigan and
M. A. Wong in 1975.
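Since the slide only names the algorithm, here is a minimal K-means sketch (Lloyd's iterations) for 2-D points; the sample data and names are made up for illustration:

```python
# Minimal K-means sketch: alternate assignment and centroid-update steps.
import random

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=100):
    centroids = random.sample(points, k)          # pick k initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                          # assignment step
            idx = min(range(k), key=lambda c: dist2(p, centroids[c]))
            clusters[idx].append(p)
        for c, members in enumerate(clusters):    # update step
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids, clusters

points = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
```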
https://hanj.cs.illinois.edu/bk3/bk3_slidesindex.htm
https://www.tutorialride.com/
Clustering techniques: Hierarchical
Introduction
Points to be covered
❑ Different clustering methods
❑ Hierarchical algorithm
▪ AGNES (Agglomerative Nesting)
▪ DIANA (Divisive Analysis)
❑ References
Clustering Methods
• Partitioning methods
In this approach, several partitions are created and then evaluated based on given criteria.
• Hierarchical methods
In this method, the set of data objects is decomposed hierarchically (into multiple levels) using certain
criteria.
• Density-based methods
This method is based on density (density reachability and density connectivity).
• Grid-based methods
This approach is based on a multi-resolution grid data structure.
Hierarchical clustering: a method that works by grouping data into a
tree of clusters. It begins by treating every data point as a separate
cluster. Then it repeatedly executes the following steps:
1. Identify the two clusters that are closest together, and
2. Merge the two most similar clusters. These steps continue
until all the clusters are merged together.
In hierarchical clustering, the aim is to produce a hierarchical series of
nested clusters. A diagram called a dendrogram (a tree-
like diagram that records the sequence of merges or splits) graphically
represents this hierarchy; it is an inverted tree that describes the order
in which points are merged (bottom-up view) or clusters are split
(top-down view).
Methods to generate hierarchical clustering
1. Agglomerative:
Initially consider every data point as an individual cluster and, at every
step, merge the nearest pair of clusters. (It is a bottom-up method.) At
first every data point is considered an individual entity or cluster. At every
iteration, clusters merge with other clusters until one cluster is
formed.
The algorithm for agglomerative hierarchical clustering is as follows (a short
code sketch is given after this slide's notes):
• Calculate the similarity of each cluster with all the other clusters
(calculate the proximity matrix)
• Consider every data point as an individual cluster
• Merge the clusters which are highly similar or close to each other
• Recalculate the proximity matrix for each cluster
• Repeat steps 3 and 4 until only a single cluster remains
• Let’s see the graphical representation of this algorithm using a
dendrogram.
This method uses a distance matrix as the clustering criterion. It does not
require the number of clusters k as an input, but it needs a
termination condition.
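A minimal single-linkage sketch of the agglomerative steps above (stopping at a chosen number of clusters; the data and names are illustrative):

```python
# Sketch of agglomerative (bottom-up) clustering with single linkage:
# start with one cluster per point and repeatedly merge the closest pair.
def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def agglomerative(points, target_clusters=1):
    clusters = [[p] for p in points]              # every point is its own cluster
    while len(clusters) > target_clusters:
        # find the two closest clusters (single linkage: minimum pairwise distance)
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist2(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))       # merge the closest pair
    return clusters

points = [(1, 1), (1.2, 1.1), (5, 5), (5.1, 4.9), (9, 9)]
print(agglomerative(points, target_clusters=3))
```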
AGNES (Agglomerative Nesting)
[Figure: three scatter plots (axes 0–10) showing AGNES progressively merging nearby points into larger clusters]
DIANA (Divisive Analysis)
[Figure: three scatter plots (axes 0–10) showing DIANA starting from a single cluster and progressively splitting it]
References:
https://hanj.cs.illinois.edu/bk3/bk3_slidesindex.htm
https://www.geeksforgeeks.org/
• Frequent pattern mining can be applied to clustering, resulting in
frequent pattern based cluster analysis.
• Frequent pattern mining can lead to the discovery of interesting
associations and correlations among data objects.
• The idea behind frequent pattern based cluster analysis is that the
frequent patterns discovered may also indicate clusters.
• Frequent pattern based cluster analysis is well suited to high-
dimensional data.
• Rather than growing the clusters dimension by dimension, we grow
sets of frequent itemsets which eventually lead to cluster
descriptions.
• An example of frequent pattern based cluster analysis: clustering of
text documents that contain thousands of distinct keywords.
Example: Text Clustering
• Text clustering is the application of cluster analysis to text-based
documents. Descriptors (sets of words that describe topic matter) are
extracted from the document first.
• Then they are analyzed for the frequency in which they are found in the
document compared to other terms. After which, clusters of descriptors
can be identified and then auto-tagged.
• From there, the information can be used in any number of ways
• Google’s search engine is probably the best and most widely known
example.
• When you search for a term on Google, it pulls up pages that apply to
that term.
• How can Google analyze billions of web pages and deliver accurate
results so fast?
• It’s because of text clustering! Google’s algorithm breaks down
unstructured data from web pages and turns it into a matrix model,
tagging pages with keywords that are then used in search results!
There are two forms of frequent pattern based
cluster analysis
1. Frequent term based text clustering
2. Clustering by pattern similarity in
microarray data analysis.
Frequent term based text clustering
• In frequent term based text clustering, text
documents are clustered based on the
frequent terms they contain.
• A stemming algorithm is applied to reduce each
term to its basic stem; in this way each document
can be represented as a set of terms.
• The documents can then be mapped into a
vector space in which each document is represented
by a term vector.
Clustering by pattern similarity in
microarray data analysis
• Another approach for clustering high dimensional data is
based on pattern similarity among the objects on a subset
of dimensions.
• The pCluster method performs clustering by
pattern similarity in microarray data analysis. An example is
DNA microarray analysis.
• DNA microarray analysis: A microarray is a laboratory tool
used to detect the expression of thousands of genes at the
same time.
• DNA microarrays are microscope slides that are printed
with thousands of tiny spots in defined positions, with each
spot containing a known DNA sequence or gene.
Clustering in non-Euclidean space
• When the space is non-Euclidean, we need to use
some distance measure that is computed from
points, such as Jaccard, cosine, or edit distance.
• That is, we cannot base distances on the “location” of
points. A problem arises when we need to
represent a cluster, because we cannot replace a
collection of points by their centroid.
• Jaccard Similarity for Two Sets
• The Jaccard similarity measures the similarity between two sets of data to
see which members are shared and distinct.
• The Jaccard similarity is calculated by dividing the number of observations
in both sets by the number of observations in either set:
J(A, B) = |A ∩ B| / |A ∪ B|.
• Cosine Similarity
• Cosine similarity: this measures the similarity
using the cosine of the angle between two
vectors in a multidimensional space. It is given by
cos(A, B) = (A · B) / (‖A‖ ‖B‖).
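Both measures are easy to compute directly; a small sketch (the example sets and vectors are made up):

```python
# Sketch: Jaccard similarity for sets and cosine similarity for vectors.
import math

def jaccard(a, b):
    """|A intersection B| / |A union B| for two sets."""
    return len(a & b) / len(a | b)

def cosine(u, v):
    """cos(theta) = (u . v) / (||u|| * ||v||) for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

print(jaccard({"milk", "bread", "eggs"}, {"milk", "bread", "butter"}))  # 0.5
print(cosine([1, 2, 3], [2, 4, 6]))                                     # 1.0 (parallel vectors)
```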
• Find minimum number of edits (operations)
required to convert ‘str1’ into ‘str2’.
• Input: str1 = "geek", str2 = "gesek" Output: 1
We can convert str1 into str2 by inserting a 's'.
• Input: str1 = "cat", str2 = "cut" Output: 1 We
can convert str1 into str2 by replacing 'a' with
'u'.
• Input: str1 = "sunday", str2 = "saturday"
Output: 3
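A standard dynamic-programming sketch of edit (Levenshtein) distance that reproduces the outputs above:

```python
# Sketch: classic dynamic-programming edit distance (insert, delete, replace).
def edit_distance(s1, s2):
    m, n = len(s1), len(s2)
    # dp[i][j] = edits needed to turn s1[:i] into s2[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                       # delete all of s1[:i]
    for j in range(n + 1):
        dp[0][j] = j                       # insert all of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:                          # delete, insert, or replace
                dp[i][j] = 1 + min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1])
    return dp[m][n]

print(edit_distance("geek", "gesek"))      # 1
print(edit_distance("cat", "cut"))         # 1
print(edit_distance("sunday", "saturday")) # 3
```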
• Common choices include selecting as the
clustroid the point that minimizes:
1. The sum of the distances to the other
points in the cluster.
2. The maximum distance to another point in
the cluster.
3. The sum of the squares of the distances to
the other points in the cluster
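A sketch of choosing a clustroid under each of the three criteria above, using Jaccard distance as an example non-Euclidean measure (the example cluster is made up):

```python
# Sketch: pick a clustroid by minimizing the sum, maximum, or sum of squares
# of distances to the other points, with Jaccard distance between sets.
def jaccard_distance(a, b):
    return 1 - len(a & b) / len(a | b)

def clustroid(cluster, criterion):
    def score(p):
        dists = [jaccard_distance(p, q) for q in cluster if q is not p]
        if criterion == "sum":
            return sum(dists)
        if criterion == "max":
            return max(dists)
        return sum(d * d for d in dists)        # sum of squares
    return min(cluster, key=score)

cluster = [{"a", "b"}, {"a", "b", "c"}, {"b", "c", "d"}, {"a", "c"}]
for crit in ("sum", "max", "sumsq"):
    print(crit, clustroid(cluster, crit))
```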