Unit 4

Association Rules

Frequent patterns – a pattern (a set of items, subsequences, etc.) that occurs frequently in a data set.

For example, a set of items, such as milk and bread, that appear frequently together in a transaction data set is a frequent itemset.

A subsequence, such as buying a computer first, then a digital camera, and then a memory card, if it occurs frequently in a shopping history database, is a (frequent) sequential pattern.
Market Basket Analysis
Frequent itemset mining leads to the discovery of associations and correlations among items in large transactional and relational data sets. The discovery of interesting correlation relationships among huge amounts of business transaction records can help in many business decision-making processes, such as catalog design, cross-marketing, and customer shopping behavior analysis.
A typical example of frequent itemset mining is market basket analysis. This process analyzes customer buying habits by finding associations between the different items that customers place in their “shopping baskets”. The discovery of such associations can help retailers develop marketing strategies by gaining insight into which items are frequently purchased together by customers. For instance, if customers are buying milk, how likely are they to also buy bread (and what kind of bread) on the same trip to the supermarket?
Patterns can be represented in the form of association rules. For example, the information that customers who purchase computers also tend to buy antivirus software at the same time is represented in the association rule below:

computer ⇒ antivirus_software [support = 2%, confidence = 60%]

Rule support and confidence are two measures of rule interestingness. They respectively reflect the usefulness and certainty of discovered rules. A support of 2% means that 2% of all the transactions under analysis show that computer and antivirus software are purchased together. A confidence of 60% means that 60% of the customers who purchased a computer also bought the software.
Typically, association rules are considered interesting if they satisfy both a minimum support threshold and a minimum confidence threshold.
Definition of Association Rule –
Given a set of items I = {I1, I2, …, Im} and a database of transactions D = {t1, t2, …, tn}, where ti = {Ii1, Ii2, …, Iik} and Iij ∈ I, an association rule is an implication of the form X ⇒ Y, where X, Y ⊂ I are sets of items called itemsets and X ∩ Y = ∅.

Definition of Support (s) – The support (s) for an association rule X ⇒ Y is the percentage of transactions in the database that contain X ∪ Y,
i.e. support(X ⇒ Y) = P(X ∪ Y).

Definition of Confidence / Strength – The confidence (strength) for an association rule X ⇒ Y is the ratio of the number of transactions that contain X ∪ Y to the number of transactions that contain X,
i.e. confidence(X ⇒ Y) = P(Y|X).

Rules that satisfy both a minimum support threshold (min_sup) and a minimum confidence threshold (min_conf) are called strong.
A set of items is referred to as an itemset. An itemset that contains k items is a k-itemset. The set {computer, antivirus_software} is a 2-itemset.
The occurrence frequency of an itemset is the number of transactions that contain the itemset. This is also known as the frequency, support count, or count of the itemset.
The itemset support defined by the equation
support(X ⇒ Y) = P(X ∪ Y)
is sometimes referred to as relative support. If the relative support of an itemset I satisfies a predefined minimum support threshold, then I is called a frequent itemset.

confidence(X ⇒ Y) = P(Y|X)
= support(X ∪ Y) / support(X)
= support_count(X ∪ Y) / support_count(X)
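To make the two measures concrete, here is a minimal Python sketch (my own illustration, not part of the original notes) that computes support and confidence directly from a list of transactions; the transaction data and helper names are illustrative assumptions.

    def support(transactions, itemset):
        """Fraction of the transactions that contain every item in `itemset`."""
        itemset = set(itemset)
        return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

    def confidence(transactions, X, Y):
        """confidence(X => Y) = support(X union Y) / support(X)."""
        return support(transactions, set(X) | set(Y)) / support(transactions, X)

    # Illustrative data (the same five transactions used in the
    # bread / peanut-butter example later in this unit).
    transactions = [
        {"Bread", "Jelly", "PeanutButter"},
        {"Bread", "PeanutButter"},
        {"Bread", "Milk", "PeanutButter"},
        {"Beer", "Bread"},
        {"Beer", "Milk"},
    ]
    print(support(transactions, {"Bread", "PeanutButter"}))        # 0.6
    print(confidence(transactions, {"Bread"}, {"PeanutButter"}))   # 0.75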
Example
Transaction-id   Items bought
10               A, B, D
20               A, C, D
30               A, D, E
40               B, E, F
50               B, C, D, E, F

◼ Find all the rules X ⇒ Y with minimum support and confidence
◼ itemset X = {x1, …, xk}
◼ support, s: probability that a transaction contains X ∪ Y
◼ confidence, c: conditional probability that a transaction having X also contains Y

Let sup_min = 50%, conf_min = 50%.
Frequent patterns: {A:3, B:3, D:4, E:3, AD:3}
Association rules:
A ⇒ D (60%, 100%)
D ⇒ A (60%, 75%)
Transaction   Items
T1            Bread, Jelly, PeanutButter
T2            Bread, PeanutButter
T3            Bread, Milk, PeanutButter
T4            Beer, Bread
T5            Beer, Milk

Support (%) of the itemsets:
Set                    Support   Set                                       Support
Beer                   40        Beer, Bread, Milk                         0
Bread                  80        Beer, Bread, PeanutButter                 0
Jelly                  20        Beer, Jelly, Milk                         0
Milk                   40        Beer, Milk, PeanutButter                  0
PeanutButter           60        Beer, Jelly, PeanutButter                 0
Beer, Bread            20        Bread, Jelly, Milk                        0
Beer, Jelly            0         Bread, Jelly, PeanutButter                20
Beer, Milk             20        Bread, Milk, PeanutButter                 20
Beer, PeanutButter     0         Jelly, Milk, PeanutButter                 0
Bread, Jelly           20        Beer, Bread, Milk, Jelly                  0
Bread, Milk            20        Beer, Bread, Jelly, PeanutButter          0
Bread, PeanutButter    60        Beer, Bread, Milk, PeanutButter           0
Jelly, Milk            0         Beer, Jelly, Milk, PeanutButter           0
Jelly, PeanutButter    20        Bread, Jelly, Milk, PeanutButter          0
Milk, PeanutButter     20        Beer, Bread, Jelly, Milk, PeanutButter    0
Beer, Bread, Jelly     0
Support and Confidence for some Association Rules

X ⇒ Y                      Support   Confidence
Bread ⇒ PeanutButter       60%       75%
PeanutButter ⇒ Bread       60%       100%
Beer ⇒ Bread               20%       50%
PeanutButter ⇒ Jelly       20%       33.3%
Jelly ⇒ PeanutButter       20%       100%
Jelly ⇒ Milk               0%        0%
Observations
Confidence measures the strength of the rule, whereas support measures how often it occurs in the database. Typically, large confidence values and a smaller support are used.
In our example, Bread ⇒ PeanutButter has confidence 75%; this indicates that the rule holds 75% of the time that it could. That is, 3/4 of the times that Bread occurs, so does PeanutButter. This is a stronger rule than Jelly ⇒ Milk, because there are no times Milk is purchased when Jelly is bought.
Lower values for support may be allowed, as support indicates the percentage of time the rule occurs throughout the database. For example, with Jelly ⇒ PeanutButter the confidence is 100% but the support is only 20%. It may be the case that this association rule exists in only 20% of the transactions, but when the antecedent Jelly occurs, the consequent always occurs. Here an advertising strategy targeted at people who purchase jelly would be appropriate.
The Apriori Algorithm: Finding Frequent Itemsets Using Candidate Generation
Apriori is an algorithm proposed by R. Agrawal and R. Srikant in 1994 for mining frequent itemsets for Boolean association rules.
Apriori employs an iterative approach known as level-wise search, where k-itemsets are used to explore (k+1)-itemsets. First, the set of frequent 1-itemsets is found by scanning the database to accumulate the count for each item and collecting those items that satisfy minimum support. The resulting set is denoted L1. Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be found.
Apriori Property –
The Apriori property is based on the following observation. By definition, if an itemset I does not satisfy the minimum support threshold, min_sup, then I is not frequent; that is, P(I) < min_sup. If an item A is added to the itemset I, then the resulting itemset (i.e., I ∪ A) cannot occur more frequently than I. Therefore, I ∪ A is not frequent either; that is, P(I ∪ A) < min_sup.
Let us see how Lk-1 is used to find Lk for k ≥ 2. A two-step process is followed, consisting of join and prune actions.
The Join Step –
To find Lk, a set of candidate k-itemsets is generated by joining Lk-1 with itself. This set of candidates is denoted Ck. Let l1 and l2 be itemsets in Lk-1. The notation li[j] refers to the jth item in li. By convention, Apriori assumes that items within a transaction or itemset are sorted in lexicographic order. For the (k-1)-itemset li, this means that the items are sorted such that li[1] < li[2] < … < li[k-1].
The join Lk-1 ⋈ Lk-1 is performed, where members of Lk-1 are joinable if their first (k-2) items are in common (and l1[k-1] < l2[k-1], so that no duplicate candidates are generated). The resulting itemset formed by joining l1 and l2 is
l1[1], l1[2], …, l1[k-2], l1[k-1], l2[k-1].
The Prune Step –
Ck is a superset of Lk; that is, its members may or may not be frequent, but all of the frequent k-itemsets are included in Ck. Any candidate that has a (k-1)-subset which is not in Lk-1 cannot be frequent (by the Apriori property) and is removed from Ck. A scan of the database to determine the count of each remaining candidate in Ck then results in the determination of Lk.
The Apriori Algorithm – An Example
Let sup_min = 2 (minimum support count).

Database TDB:
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan → C1:
Itemset   sup
{A}       2
{B}       3
{C}       3
{D}       1
{E}       3

L1 (frequent 1-itemsets):
Itemset   sup
{A}       2
{B}       3
{C}       3
{E}       3

C2 generated from L1: {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

2nd scan → C2 with counts:
Itemset   sup
{A, B}    1
{A, C}    2
{A, E}    1
{B, C}    2
{B, E}    3
{C, E}    2

L2 (frequent 2-itemsets):
Itemset   sup
{A, C}    2
{B, C}    2
{B, E}    3
{C, E}    2

C3 generated from L2: {B, C, E}

3rd scan → L3:
Itemset     sup
{B, C, E}   2
A second example. Consider the following transaction database D, with a minimum support count of 2.

TID    List of Item IDs
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3

Scan D for the count of each candidate to obtain C1, then compare each candidate's support count with the minimum support count to obtain L1.

C1 / L1:
Itemset   Sup. count
{I1}      6
{I2}      7
{I3}      6
{I4}      2
{I5}      2
(Every 1-itemset meets the minimum support count, so L1 = C1.)

Generate the C2 candidates from L1, scan D for the count of each candidate, and compare the support counts with the minimum support count to obtain L2.

C2:
Itemset    Sup. count
{I1, I2}   4
{I1, I3}   4
{I1, I4}   1
{I1, I5}   2
{I2, I3}   4
{I2, I4}   2
{I2, I5}   2
{I3, I4}   0
{I3, I5}   1
{I4, I5}   0

L2:
Itemset    Sup. count
{I1, I2}   4
{I1, I3}   4
{I1, I5}   2
{I2, I3}   4
{I2, I4}   2
{I2, I5}   2

Generate the C3 candidates from L2, scan D for the count of each candidate, and compare the support counts with the minimum support count to obtain L3.

C3 / L3:
Itemset        Sup. count
{I1, I2, I3}   2
{I1, I2, I5}   2
Once the frequent itemsets are found, we can use the confidence formula to calculate the confidence of each candidate association rule. Rules whose confidence is at least the minimum confidence threshold are called strong association rules.
In our example, for the frequent itemset l = {I1, I2, I5}, the candidate association rules are as follows:
I1 ∧ I2 ⇒ I5   confidence = 2/4 = 50%
I1 ∧ I5 ⇒ I2   confidence = 2/2 = 100%
I2 ∧ I5 ⇒ I1   confidence = 2/2 = 100%
I1 ⇒ I2 ∧ I5   confidence = 2/6 = 33%
I2 ⇒ I1 ∧ I5   confidence = 2/7 = 29%
I5 ⇒ I1 ∧ I2   confidence = 2/2 = 100%
If the minimum confidence threshold is, say, 70%, then only the second, third and last rules above are output, because these are the only ones generated that are strong.
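As a hedged illustration of this rule-generation step (the code is mine, not the original example's), the sketch below enumerates every rule that can be formed from a frequent itemset and keeps only the strong ones; the support counts are copied from the worked example above.

    from itertools import combinations

    # Support counts taken from the worked example above (9 transactions).
    support_count = {
        frozenset({"I1"}): 6, frozenset({"I2"}): 7, frozenset({"I5"}): 2,
        frozenset({"I1", "I2"}): 4, frozenset({"I1", "I5"}): 2,
        frozenset({"I2", "I5"}): 2, frozenset({"I1", "I2", "I5"}): 2,
    }

    def strong_rules(itemset, min_conf):
        """Yield (antecedent, consequent, confidence) for every strong rule from `itemset`."""
        itemset = frozenset(itemset)
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                conf = support_count[itemset] / support_count[antecedent]
                if conf >= min_conf:
                    yield antecedent, itemset - antecedent, conf

    for lhs, rhs, conf in strong_rules({"I1", "I2", "I5"}, min_conf=0.7):
        print(sorted(lhs), "=>", sorted(rhs), f"{conf:.0%}")
    # Prints the three strong rules: {I1,I5}=>{I2}, {I2,I5}=>{I1} and {I5}=>{I1,I2}, all 100%.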
The Apriori Algorithm
• Pseudo-code:
  Ck: candidate itemsets of size k
  Lk: frequent itemsets of size k

  L1 = {frequent items};
  for (k = 1; Lk != ∅; k++) do begin
      Ck+1 = candidates generated from Lk;
      for each transaction t in database do
          increment the count of all candidates in Ck+1 that are contained in t;
      Lk+1 = candidates in Ck+1 with min_support;
  end
  return ∪k Lk;
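The pseudo-code above can be turned into a short working program. The following Python sketch is one possible reading of it (an assumption on my part, not reference code from the notes): it generates candidates from the previous level, prunes them with the Apriori property, and counts supports with one database pass per level.

    from itertools import combinations

    def apriori(transactions, min_support_count):
        """Return {frozenset: support_count} for every frequent itemset."""
        transactions = [frozenset(t) for t in transactions]

        # L1: count single items and keep those meeting the minimum support count.
        counts = {}
        for t in transactions:
            for item in t:
                key = frozenset([item])
                counts[key] = counts.get(key, 0) + 1
        frequent = {s: c for s, c in counts.items() if c >= min_support_count}
        result = dict(frequent)

        k = 1
        while frequent:
            # Join: combine frequent k-itemsets into (k+1)-item candidates.
            prev = list(frequent)
            candidates = set()
            for i in range(len(prev)):
                for j in range(i + 1, len(prev)):
                    union = prev[i] | prev[j]
                    if len(union) == k + 1:
                        # Prune: every k-subset of the candidate must itself be frequent.
                        if all(frozenset(sub) in frequent
                               for sub in combinations(union, k)):
                            candidates.add(union)
            # One database scan: count how many transactions contain each candidate.
            counts = {c: 0 for c in candidates}
            for t in transactions:
                for c in candidates:
                    if c <= t:
                        counts[c] += 1
            frequent = {s: c for s, c in counts.items() if c >= min_support_count}
            result.update(frequent)
            k += 1
        return result

    # The 4-transaction database used in the example above, with min_sup = 2;
    # this reproduces L1, L2 and L3 ({B, C, E}: 2) from that example.
    db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
    for itemset, count in sorted(apriori(db, 2).items(),
                                 key=lambda kv: (len(kv[0]), sorted(kv[0]))):
        print(sorted(itemset), count)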
Important Details of Apriori
• How to generate candidates?
– Step 1: self-joining Lk
– Step 2: pruning
• How to count supports of candidates?
• Example of Candidate-generation
– L3={abc, abd, acd, ace, bcd}
– Self-joining: L3*L3
• abcd from abc and abd
• acde from acd and ace
– Pruning:
• acde is removed because ade is not in L3
– C4={abcd}
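A small sketch (my own illustration, not from the slides) of exactly this candidate-generation example, treating each itemset as a sorted tuple of items:

    from itertools import combinations

    def gen_candidates(Lk):
        """Self-join Lk (sorted tuples sharing all but their last item), then prune."""
        Lk = sorted(Lk)
        members = set(Lk)
        Ck1 = set()
        for i in range(len(Lk)):
            for j in range(i + 1, len(Lk)):
                if Lk[i][:-1] == Lk[j][:-1]:                  # join step
                    cand = Lk[i] + (Lk[j][-1],)
                    # prune step: every k-subset of the candidate must be in Lk
                    if all(sub in members for sub in combinations(cand, len(cand) - 1)):
                        Ck1.add(cand)
        return Ck1

    L3 = {("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"), ("a", "c", "e"), ("b", "c", "d")}
    print(gen_candidates(L3))   # {('a', 'b', 'c', 'd')}; acde is pruned because ade is not in L3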
Assignment
• Sample clothing Transactions are given as follows. Find
frequent items and strong association rules using Apriori
Algorithm.
TID   Items                              TID   Items
T1    Blouse                             T11   Tshirt
T2    Shoes, Skirt, Tshirt               T12   Blouse, Jeans, Shoes, Skirt, Tshirt
T3    Jeans, Tshirt                      T13   Jeans, Shoes, Shorts, Tshirt
T4    Jeans, Shoes, Tshirt               T14   Shoes, Skirt, Tshirt
T5    Jeans, Shorts                      T15   Jeans, Tshirt
T6    Shoes, Tshirt                      T16   Skirt, Tshirt
T7    Jeans, Skirt                       T17   Blouse, Jeans, Skirt
T8    Jeans, Shoes, Shorts, Tshirt       T18   Jeans, Shoes, Shorts, Tshirt
T9    Jeans                              T19   Jeans
T10   Jeans, Shoes, Tshirt               T20   Jeans, Shoes, Shorts, Tshirt
Problem:
TID   Items
1     a, c, d
2     b, c, e
3     a, b, c, e
4     b, e

Min support = 50%; Min confidence = 70%
Solution: the largest frequent itemset is {b, c, e}.
Apriori algorithm
Without further ado, let's start talking about the Apriori algorithm. It is a classic algorithm used in data mining for learning association rules. It is nowhere near as complex as it sounds; on the contrary, it is very simple. Let me give you an example to explain it. Suppose you have records of a large number of transactions at a shopping center, as follows:

Transactions   Items bought
T1             Item1, Item2, Item3
T2             Item1, Item2
T3             Item2, Item5
T4             Item1, Item2, Item5

Learning association rules basically means finding the items that are purchased together more frequently than others.
For example, in the above table you can see that Item1 and Item2 are bought together frequently.

What is the use of learning association rules?

• Shopping centers use association rules to place items next to each other so that shoppers buy more items. If you are familiar with data mining you would know about the famous beer-and-diapers Wal-Mart story. Basically, Wal-Mart studied their data and found that on Friday afternoons young American males who buy diapers also tend to buy beer. So Wal-Mart placed beer next to diapers and beer sales went up. This is famous because no one would have predicted such a result, and that's the power of data mining. You can Google this if you are interested in further details.
• Also, if you are familiar with Amazon, they use association mining to recommend items based on the current item you are browsing/buying.
• Another application is Google auto-complete, where after you type in a word it suggests frequently associated words that users type after that particular word.

So, as I said, Apriori is the classic and probably the most basic algorithm for this. Now if you search online you can easily find the pseudo-code and the mathematical equations. I would like to make it more intuitive and easy, if I can.

I would like a 10th or 12th grader to be able to understand this without any problem, so I will try not to use any terminology or jargon.

Let's start with a slightly bigger example.

Transaction ID   Items Bought
T1               {Mango, Onion, Nintendo, Key-chain, Eggs, Yo-yo}
T2               {Doll, Onion, Nintendo, Key-chain, Eggs, Yo-yo}
T3               {Mango, Apple, Key-chain, Eggs}
T4               {Mango, Umbrella, Corn, Key-chain, Yo-yo}
T5               {Corn, Onion, Onion, Key-chain, Ice-cream, Eggs}

Now we follow a simple golden rule: we say an item/itemset is frequently bought if it is bought at least 60% of the time. Since there are 5 transactions here, that means it should be bought at least 3 times.

For simplicity
M = Mango
O = Onion
And so on……

So the table becomes

Original table:
Transaction Items Bought
ID
T1 {M, O, N, K, E, Y }
T2 {D, O, N, K, E, Y }
T3 {M, A, K, E}
T4 {M, U, C, K, Y }
T5 {C, O, O, K, I, E}

Step 1: Count the number of transactions in which each item occurs. Note that 'O = Onion' is bought 4 times in total, but it occurs in just 3 transactions.

Item   No. of transactions
M      3
O      3
N      2
K      5
E      4
Y      3
D      1
A      1
U      1
C      2
I      1

Step 2: Now remember we said an item is frequently bought if it is bought at least 3 times. So in this step we remove all the items that are bought fewer than 3 times from the above table and we are left with

Item   No. of transactions
M      3
O      3
K      5
E      4
Y      3

These are the single items that are bought frequently. Now let's say we want to find pairs of items that are bought frequently. We continue from the above table (the table in Step 2).

Step 3: We start making pairs from the first item, like MO, MK, ME, MY, and then we continue with the second item, like OK, OE, OY. We did not do OM because we already did MO when we were making pairs with M, and buying a Mango and an Onion together is the same as buying an Onion and a Mango together. After making all the pairs we get

Item pairs
MO
MK
ME
MY
OK
OE
OY
KE
KY
EY

Step 4: Now we count how many times each pair is bought together. For example, M and O are bought together only in {M,O,N,K,E,Y}, while M and K are bought together 3 times, in {M,O,N,K,E,Y}, {M,A,K,E} and {M,U,C,K,Y}. After doing that for all the pairs we get

Item Pairs   No. of transactions
MO 1
MK 3
ME 2
MY 2
OK 3
OE 3
OY 2
KE 4
KY 3
EY 2

Step 5: The golden rule to the rescue. Remove all the item pairs bought together in fewer than three transactions and we are left with

Item Pairs   No. of transactions
MK 3
OK 3
OE 3
KE 4
KY 3

These are the pairs of items frequently bought together.

Now let's say we want to find a set of three items that are bought together.
We use the above table (the table in Step 5) and make sets of 3 items.

Step 6: To make the sets of three items we need one more rule (it's termed self-joining). It simply means that, from the item pairs in the above table, we find two pairs with the same first letter, so we get
• OK and OE, this gives OKE
• KE and KY, this gives KEY

Then we find how many times O, K, E are bought together in the original table, and the same for K, E, Y, and we get the following table

Item Set   No. of transactions
OKE        3
KEY        2

While we are at it, suppose you have sets of 3 items, say ABC, ABD, ACD, ACE, BCD, and you want to generate itemsets of 4 items; you look for two sets having the same first two letters.
• ABC and ABD -> ABCD
• ACD and ACE -> ACDE

And so on... In general, you have to look for sets that differ only in the last letter/item.

Step 7: So we again apply the golden rule: the itemset must be bought together at least 3 times, which leaves us with just OKE, since K, E, Y are bought together only two times.
Thus the set of three items that are bought together most frequently is O, K, E.
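The whole walkthrough can be reproduced in a few lines. This is a simplified sketch of the same counting idea (the item letters and the 60% rule come from the example above; the code itself is mine and uses plain candidate enumeration rather than a strict self-join):

    from itertools import combinations

    transactions = [
        {"M", "O", "N", "K", "E", "Y"},
        {"D", "O", "N", "K", "E", "Y"},
        {"M", "A", "K", "E"},
        {"M", "U", "C", "K", "Y"},
        {"C", "O", "K", "I", "E"},   # the duplicate Onion collapses inside a set
    ]
    min_count = 3   # "golden rule": bought in at least 60% of the 5 transactions

    def count_frequent(candidates):
        """Count each candidate itemset and keep those bought at least min_count times."""
        kept = {}
        for cand in candidates:
            count = sum(1 for t in transactions if set(cand) <= t)
            if count >= min_count:
                kept[tuple(cand)] = count
        return kept

    items = sorted({i for t in transactions for i in t})
    L1 = count_frequent((i,) for i in items)                         # M, O, K, E, Y
    L2 = count_frequent(combinations(sorted(i for (i,) in L1), 2))   # MK, OK, OE, KE, KY
    L3 = count_frequent(combinations(sorted({i for pair in L2 for i in pair}), 3))
    print(L1, L2, L3, sep="\n")   # L3 keeps only ('E', 'K', 'O') with count 3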
Counting frequent itemsets in a stream
Points to be covered
❑ PCY algorithm
❑ Example
❑ References
Frequent itemsets:
Sets of items which occur frequently (satisfying a minimum support count) in the given data set.

For example:
Bread and butter generally occur frequently together in the transaction data set of a grocery store.
PCY algorithm:
The PCY algorithm was proposed by Park, Chen, and Yu. It is used in the field of big data analytics for frequent itemset mining when the dataset is very large.

Steps:
1. Count the occurrences (support count) of each item in the given dataset.
2. Reduce the candidate set to the items that satisfy the support threshold (the frequent 1-itemsets).
3. Map pairs of candidate items and find the count of each pair.
4. Apply a hash function to each pair to find its bucket number.
5. Draw the candidate set table.
Threshold value (minimum support count) = 3
Hash function = (i * j) mod 10
T1 = {1, 2, 3}
T2 = {2, 3, 4}
T3 = {3, 4, 5}
T4 = {4, 5, 6}
T5 = {1, 3, 5}
T6 = {2, 4, 6}
T7 = {1, 3, 4}
T8 = {2, 4, 5}
T9 = {3, 4, 6}
T10 = {1, 2, 4}
T11 = {2, 3, 5}
T12= {3, 4, 6}
Step 1: Count the occurrences of each item.
Items → {1, 2, 3, 4, 5, 6}

Item   Count
1      4
2      6
3      8
4      8
5      6
6      4
Step 2: Remove all items whose count is below the threshold.
In this example no item falls below the threshold.
Hence, candidate set = {1, 2, 3, 4, 5, 6}

Step 3: Map the candidate items into pairs and calculate the count of each pair.
T1: {(1, 2) (1, 3) (2, 3)} = (2, 3, 3)
T2: {(2, 4) (3, 4)} = (3, 4)
T3: {(3, 5) (4, 5)} = (5, 3)
T4: {(4, 5) (5, 6)} = (3, 2)
T5: {(1, 5)} = 1
T6: {(2, 6)} = 1
T7: {(1, 4)} = 2
T8: {(2, 5)} = 2
T9: {(3, 6)} = 2
T10: ______
T11: ______
T12: ______
• Note: Pairs should not be repeated; skip any pair that has already been listed.
• Listing all the pairs whose count is at least the threshold value: {(1,3) (2,3) (2,4) (3,4) (3,5) (4,5) (4,6)}
Step 4: Apply Hash Functions. (It gives us bucket number)
Hash Function = ( i * j) mod 10
(1, 3) = (1*3) mod 10 = 3
(2,3) = (2*3) mod 10 = 6
(2,4) = (2*4) mod 10 = 8
(3,4) = (3*4) mod 10 = 2
(3,5) = (3*5) mod 10 = 5
(4,5) = (4*5) mod 10 = 0
(4,6) = (4*6) mod 10 = 4
Now, arrange the pairs according to the ascending order of their
obtained bucket number.
Bucket no. Pair
0 (4,5)
2 (3,4)
3 (1,3)
4 (4,6)
5 (3,5)
6 (2,3)
8 (2,4)
Step 5: In this final step we prepare the candidate set.

Bit vector   Bucket no.   Highest support count   Pair    Candidate set
1            0            3                       (4,5)   (4,5)
1            2            4                       (3,4)   (3,4)
1            3            3                       (1,3)   (1,3)
1            4            3                       (4,6)   (4,6)
1            5            5                       (3,5)   (3,5)
1            6            3                       (2,3)   (2,3)
1            8            3                       (2,4)   (2,4)

• Note: The highest support count is the number of occurrences of the pairs hashing to that bucket.
• Check the pairs whose bucket support count is at least 3 and write them into the candidate set; if it is less than 3, reject the pair.
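For reference, here is a minimal Python sketch of the PCY idea (my own illustration, using the same transactions, threshold, and hash function as the worked example): the first pass counts single items and hashes every pair into buckets, and the second pass counts only pairs whose items are frequent and whose bucket is frequent.

    from itertools import combinations

    transactions = [
        {1, 2, 3}, {2, 3, 4}, {3, 4, 5}, {4, 5, 6}, {1, 3, 5}, {2, 4, 6},
        {1, 3, 4}, {2, 4, 5}, {3, 4, 6}, {1, 2, 4}, {2, 3, 5}, {3, 4, 6},
    ]
    threshold = 3
    NUM_BUCKETS = 10

    def bucket_of(i, j):
        return (i * j) % NUM_BUCKETS   # same hash function as in the example

    # Pass 1: count single items and hash every pair into a bucket counter.
    item_count = {}
    bucket_count = [0] * NUM_BUCKETS
    for t in transactions:
        for i in t:
            item_count[i] = item_count.get(i, 0) + 1
        for i, j in combinations(sorted(t), 2):
            bucket_count[bucket_of(i, j)] += 1

    frequent_items = {i for i, c in item_count.items() if c >= threshold}
    bitmap = [c >= threshold for c in bucket_count]   # bit vector of frequent buckets

    # Pass 2: count only candidate pairs (both items frequent and bucket frequent).
    pair_count = {}
    for t in transactions:
        for i, j in combinations(sorted(t & frequent_items), 2):
            if bitmap[bucket_of(i, j)]:
                pair_count[(i, j)] = pair_count.get((i, j), 0) + 1

    frequent_pairs = {p: c for p, c in pair_count.items() if c >= threshold}
    print(frequent_pairs)   # the same seven pairs listed in the example above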
References:
https://www.includehelp.com/big-data/pcy-algorithm-in-big-data-analytics.aspx
Points to be covered
❑ What is Clustering
❑ Different clustering methods
❑ K Means algorithm
❑ Example
❑ References
What is clustering:
• It is a data mining technique used to place data elements into related groups.
• A cluster consists of data objects with high intra-cluster similarity and low similarity to the objects of other clusters.
• The quality of a cluster depends on the method used.
• It is also called data segmentation, because it partitions large data sets into groups according to their similarity.
Clustering can be helpful in many fields, such as:

1. Marketing:
Clustering helps to find groups of customers with similar behavior from a given set of customer records.

2. Biology:
Classification of plants and animals according to their features.

3. Hospitals:
Clustering is very useful for classifying patients based on their symptoms.

4. Anomaly detection, and so on.
Different clustering methods
Following is the categorization of Major Clustering Methods

• Partitioning methods
In this approach, several partitions are created and then evaluated based on given criteria.

• Hierarchical methods
In this method, the set of data objects is decomposed (multilevel) hierarchically by using certain criteria.

• Density-based methods
This method is based on density (density reachability and density connectivity).

• Grid-based methods
This approach is based on multi-resolution grid data structure.
K Means algorithm (Partitioning Method)
• It is a simple unsupervised learning algorithm, developed by J. MacQueen in 1967 and later refined by J. A. Hartigan and M. A. Wong.

• In this approach, the n data objects are partitioned into k clusters, in which each observation belongs to the cluster with the nearest mean.

• It defines k seed points (each may be considered the center of a cluster), one for each cluster, with k ≤ n. The initial seed points are ideally placed far away from each other.
Steps of K Means algorithm
Given k, the k-means algorithm is implemented in four steps:
▪ Partition objects into k nonempty subsets
▪ Compute seed points as the centroids of the clusters of
the current partitioning (the centroid is the center, i.e.,
mean point, of the cluster)
▪ Assign each object to the cluster with the nearest seed
point
▪ Go back to Step 2, until no change in the value of mean
point of each cluster
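A short NumPy sketch of these four steps (the data and parameter choices are my own illustrations, not the slides' worked example):

    import numpy as np

    def k_means(points, k, n_iter=100, seed=0):
        """Plain k-means: assign points to the nearest centroid, then recompute centroids."""
        rng = np.random.default_rng(seed)
        centroids = points[rng.choice(len(points), k, replace=False)]
        for _ in range(n_iter):
            # Assign each object to the cluster with the nearest seed point.
            dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Recompute each centroid as the mean of its assigned points.
            new_centroids = np.array([points[labels == j].mean(axis=0)
                                      if np.any(labels == j) else centroids[j]
                                      for j in range(k)])
            if np.allclose(new_centroids, centroids):   # no change in the means: stop
                break
            centroids = new_centroids
        return labels, centroids

    # Invented 2-D data with two obvious groups.
    pts = np.array([[1.0, 1.0], [1.5, 2.0], [1.2, 0.8],
                    [8.0, 8.0], [8.5, 7.5], [9.0, 8.2]])
    labels, centers = k_means(pts, k=2)
    print(labels, centers, sep="\n")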
References:

https://hanj.cs.illinois.edu/bk3/bk3_slidesindex.htm
https://www.tutorialride.com/
Clustering techniques: Hierarchical
Introduction
Points to be covered
❑ Different clustering methods
❑ Hierarchical algorithm
▪ AGNES (Agglomerative Nesting)
▪ DIANA (Divisive Analysis)
❑ References
Clustering Methods
• Partitioning methods
In this approach, several partitions are created and then evaluated based on given criteria.

• Hierarchical methods
In this method, the set of data objects is decomposed (multilevel) hierarchically by using certain criteria.

• Density-based methods
This method is based on density (density reachability and density connectivity).

• Grid-based methods
This approach is based on multi-resolution grid data structure.
Hierarchical clustering: a method that works by grouping data into a tree of clusters. It begins by treating every data point as a separate cluster. Then, it repeatedly executes the following steps:
1. Identify the two clusters that are closest together, and
2. Merge the two most similar clusters. These steps continue until all the clusters are merged together.
In hierarchical clustering, the aim is to produce a hierarchical series of nested clusters. A diagram called a dendrogram (a tree-like diagram that records the sequences of merges or splits) graphically represents this hierarchy; it is an inverted tree that describes the order in which points are merged (bottom-up view) or clusters are split (top-down view).
Method to generate hierarchical clustering
1. Agglomerative:
Initially consider every data point as an individual cluster and, at every step, merge the nearest pair of clusters (it is a bottom-up method). At first every data point is considered an individual entity or cluster; at every iteration, clusters merge with other clusters until one cluster is formed.
The algorithm for agglomerative hierarchical clustering is as follows:
• Consider every data point as an individual cluster.
• Calculate the similarity of each cluster with all the other clusters (calculate the proximity matrix).
• Merge the clusters that are highly similar or close to each other.
• Recalculate the proximity matrix for each cluster.
• Repeat steps 3 and 4 until only a single cluster remains.
• Let's see the graphical representation of this algorithm using a dendrogram.
This approach uses a distance matrix as the clustering criterion. It does not require the number of clusters k as an input, but it needs a termination condition.

[Figure: AGNES (agglomerative) proceeds bottom-up over steps 0 to 4, merging a, b, c, d, e into ab, de, cde and finally abcde; DIANA (divisive) performs the corresponding splits in the reverse order, top-down.]
AGNES (Agglomerative Nesting)

• Produces tree of clusters (nodes)


• Initially: each object is a cluster (leaf)
• Recursively merges nodes that have the least dissimilarity
• Criteria: min distance, max distance, avg distance, center
distance
• Eventually all nodes belong to the same cluster (root)
[Figure: three scatter plots illustrating AGNES progressively merging nearby points into larger clusters.]
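As a hedged illustration (not part of the slides), SciPy's agglomerative clustering reproduces this behaviour; the points below are invented, and method="single" corresponds to the minimum-distance criterion listed above:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

    # Invented 2-D points; each starts as its own cluster (a leaf of the tree).
    points = np.array([[1, 1], [1.5, 1.2], [5, 5], [5.2, 4.8], [9, 9]])

    # Repeatedly merge the least-dissimilar clusters; "single" = min distance,
    # "complete" = max distance, "average" = avg distance, "centroid" = center distance.
    Z = linkage(points, method="single")

    # Cut the tree into, e.g., 3 clusters, or draw the dendrogram.
    print(fcluster(Z, t=3, criterion="maxclust"))
    # dendrogram(Z)   # draws the tree (needs matplotlib)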
DIANA (Divisive Analysis)

• Inverse order of AGNES


• Start with root cluster containing all objects
• Recursively divide into subclusters
• Eventually each cluster contains a single object
[Figure: scatter plots illustrating DIANA starting with all objects in one cluster and recursively dividing it into subclusters.]
References:

https://hanj.cs.illinois.edu/bk3/bk3_slidesindex.htm
https://www.geeksforgeeks.org/
• Frequent pattern mining can be applied to clustering, resulting in frequent pattern-based cluster analysis.
• Frequent pattern mining can lead to the discovery of interesting associations and correlations among data objects.
• The idea behind frequent pattern-based cluster analysis is that the frequent patterns discovered may also indicate clusters.
• Frequent pattern-based cluster analysis is well suited to high-dimensional data.
• Rather than growing the clusters dimension by dimension, we grow sets of frequent itemsets, which eventually lead to cluster descriptions.
• Example of frequent pattern-based cluster analysis: clustering of text documents that contain thousands of distinct keywords.
Example: Text Clustering
• Text clustering is the application of cluster analysis to text-based documents. Descriptors (sets of words that describe the topic matter) are first extracted from the documents.
• Then they are analyzed for the frequency with which they are found in the document compared to other terms. After that, clusters of descriptors can be identified and then auto-tagged.
• From there, the information can be used in any number of ways.
• Google's search engine is probably the best and most widely known example.
• When you search for a term on Google, it pulls up pages that apply to that term.
• How can Google analyze billions of web pages to deliver accurate and fast results?
• It's because of text clustering! Google's algorithm breaks down unstructured data from web pages and turns it into a matrix model, tagging pages with keywords that are then used in search results!
There are two forms of frequent pattern based
cluster analysis
1. Frequent term based text clustering
2. Clustering by pattern similarity in
microarray data analysis.
Frequent term-based text clustering
• In frequent term-based text clustering, text documents are clustered based on the frequent terms they contain.
• A stemming algorithm is applied to reduce each term to its basic stem; in this way each document can be represented as a set of terms.
• The document space can then be mapped to a vector space in which each document is represented by a term vector.
Clustering by pattern similarity in microarray data analysis
• Another approach for clustering high-dimensional data is based on pattern similarity among the objects on a subset of dimensions.
• The pCluster method performs clustering by pattern similarity; a typical application is DNA microarray data analysis.
• DNA microarray analysis: a microarray is a laboratory tool used to detect the expression of thousands of genes at the same time.
• DNA microarrays are microscope slides that are printed with thousands of tiny spots in defined positions, with each spot containing a known DNA sequence or gene.
Clustering in non-Euclidean space
• When the space is non-Euclidean, we need to use a distance measure that is computed from the points themselves, such as Jaccard, cosine, or edit distance.
• That is, we cannot base distances on the “location” of points. A problem arises when we need to represent a cluster, because we cannot replace a collection of points by their centroid.
• Jaccard similarity for two sets
• The Jaccard similarity measures the similarity between two sets of data to see which members are shared and distinct.
• The Jaccard similarity is calculated by dividing the number of observations in both sets by the number of observations in either set: J(A, B) = |A ∩ B| / |A ∪ B|.
• Cosine similarity
• Cosine similarity measures the similarity using the cosine of the angle between two vectors in a multidimensional space. It is given by:
cos(x, y) = (x · y) / (‖x‖ ‖y‖)
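A brief sketch of both measures (illustrative code, not from the slides):

    import math

    def jaccard(a, b):
        """Jaccard similarity of two sets: |A intersection B| / |A union B|."""
        return len(a & b) / len(a | b)

    def cosine(x, y):
        """Cosine of the angle between two vectors: (x . y) / (||x|| ||y||)."""
        dot = sum(xi * yi for xi, yi in zip(x, y))
        norm_x = math.sqrt(sum(xi * xi for xi in x))
        norm_y = math.sqrt(sum(yi * yi for yi in y))
        return dot / (norm_x * norm_y)

    print(jaccard({"data", "mining", "cluster"}, {"data", "cluster", "text"}))   # 2/4 = 0.5
    print(cosine([1, 0, 2], [2, 1, 2]))   # 6 / (sqrt(5) * 3), approximately 0.894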
• Edit distance: find the minimum number of edits (operations) required to convert 'str1' into 'str2'.
• Input: str1 = "geek", str2 = "gesek". Output: 1. We can convert str1 into str2 by inserting an 's'.
• Input: str1 = "cat", str2 = "cut". Output: 1. We can convert str1 into str2 by replacing 'a' with 'u'.
• Input: str1 = "sunday", str2 = "saturday". Output: 3.
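A standard dynamic-programming sketch for edit distance, reproducing the examples above (the code is illustrative, not from the slides):

    def edit_distance(s1, s2):
        """Minimum number of insertions, deletions or replacements to turn s1 into s2."""
        m, n = len(s1), len(s2)
        dp = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            dp[i][0] = i                        # delete the remaining characters of s1
        for j in range(n + 1):
            dp[0][j] = j                        # insert the remaining characters of s2
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if s1[i - 1] == s2[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,         # delete
                               dp[i][j - 1] + 1,         # insert
                               dp[i - 1][j - 1] + cost)  # replace (or match)
        return dp[m][n]

    print(edit_distance("geek", "gesek"))       # 1
    print(edit_distance("cat", "cut"))          # 1
    print(edit_distance("sunday", "saturday"))  # 3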
• In a non-Euclidean space a cluster is instead represented by a clustroid, an actual point of the cluster chosen to represent it. Common choices include selecting as the clustroid the point that minimizes:
1. The sum of the distances to the other points in the cluster.
2. The maximum distance to another point in the cluster.
3. The sum of the squares of the distances to the other points in the cluster.
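A small sketch (illustrative, assuming Jaccard distance between sets) that picks a clustroid by the first criterion, the point minimizing the sum of distances to the other points in the cluster:

    def jaccard_distance(a, b):
        """1 minus the Jaccard similarity of two sets."""
        return 1 - len(a & b) / len(a | b)

    def clustroid(cluster):
        """The member of the cluster minimizing the sum of distances to the other members."""
        return min(cluster,
                   key=lambda p: sum(jaccard_distance(p, q) for q in cluster if q is not p))

    docs = [{"data", "mining"}, {"data", "mining", "cluster"}, {"cluster", "text"}]
    print(clustroid(docs))   # {'data', 'mining', 'cluster'}: closest on average to the others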
