
INTRODUCTION TO DATA ANALYTICS

Module 3, Part 2: Frequent Pattern Mining (FPM)
Introduction – Frequent Pattern Mining
• Clustering assigns objects with similar attribute values to the same group.
• If, instead, we need to identify combinations of attributes that are common to many objects, or to identify significant (confident) associations between such combinations of attributes, we need to mine further into the data.
Introduction – Frequent Pattern Mining
• Consider the data table shown below (data related to the preferred food choices of contacts) – a commercial-domain data set.

Name     | Arabic | Indian | Mediterranean | Oriental | Fast food
Andrew   |        |   √    |      √        |          |
Bernhard |        |   √    |               |    √     |    √
Carolina |        |   √    |      √        |    √     |
Dennis   |   √    |        |      √        |          |
Eve      |        |        |               |    √     |
Fred     |        |   √    |      √        |    √     |
Gwyneth  |   √    |        |      √        |          |
Hayden   |        |   √    |               |    √     |    √
Irene    |        |   √    |      √        |    √     |
James    |   √    |        |      √        |          |

• Rows (objects) are transactions; items (attributes) are products.


Introduction – FPM
• In the data considered, the attributes are not well-defined domains (numbers) that characterize a person.
• The data represents concepts, which have more complex definitions and domains.
• These concepts are termed items. The values in the data table do not represent concrete attribute values of the objects; rather, they represent the presence of a connection between objects (persons) and items (cuisine types).
• Such data is common in commercial domains, where each row is called a transaction and represents a purchase or a market basket, and each item represents a product.
• A non-empty cell, containing a tick mark, indicates that the given item is connected to the given transaction. The common name for this type of data is "transactional data", and it is represented as shown in the table on the next slide.
• Another example is the recommendation of movies to persons.
Original data on cuisine preference (the table on the previous slide) represented as transactional data:

Transaction | Items
Andrew      | {Indian, Mediterranean}
Bernhard    | {Indian, Oriental, Fast food}
Carolina    | {Indian, Mediterranean, Oriental}
Dennis      | {Arabic, Mediterranean}
Eve         | {Oriental}
Fred        | {Indian, Mediterranean, Oriental}
Gwyneth     | {Arabic, Mediterranean}
Hayden      | {Indian, Oriental, Fast food}
Irene       | {Indian, Mediterranean, Oriental}
James       | {Arabic, Mediterranean}
Introduction – FPM
• Typical tasks of FPM are:
  • Finding frequent itemsets: aims to find "itemsets" that appear together in different transactions.
  • Finding association rules: aims to find interesting relations between itemsets.
  • Finding frequent sequences: aims to find frequent item sequences, not necessarily contiguous, that appear in the same order.
• FPM methods were developed to deal with very large data sets, recorded for example in hypermarkets and on social media sites.
• For each of these tasks there are different methods; for a given task, they all produce the same results.
FPM – Frequent Itemsets
• An arbitrary combination of items is called an "itemset".
• It is an arbitrary subset of the set I of all items.
• Number of possible itemsets = 2^|I| − 1, where |I| is the number of items in I.
• In our data set, I = {Arabic, Indian, Mediterranean, Oriental, Fast food} is the set of all cuisines, hence |I| = 5.
• Possible subsets = 2^5 − 1 = 31:
  • Itemsets of length 1: (Indian); (Arabic); … Total 5
  • Itemsets of length 2: (Indian, Oriental); (Indian, Arabic); … Total 10
  • Itemsets of length 3: (Indian, Oriental, Arabic); … Total 10
  • Itemsets of length 4: (Indian, Oriental, Arabic, Fast food); … Total 5
  • Itemsets of length 5: (Indian, Oriental, Arabic, Fast food, Mediterranean) – Total 1
  Total: 31
(A sketch enumerating these itemsets is shown below.)
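As a quick illustration (not from the slides), the Python sketch below enumerates the 31 possible itemsets of the cuisine data set with itertools and groups them by length; the variable names are illustrative.

```python
from itertools import combinations

# A minimal sketch: enumerate all non-empty itemsets generated from I,
# grouped by length.
I = ["Arabic", "Indian", "Mediterranean", "Oriental", "Fast food"]

itemsets_by_length = {
    k: [frozenset(c) for c in combinations(I, k)]
    for k in range(1, len(I) + 1)
}

for k, sets in itemsets_by_length.items():
    print(f"length {k}: {len(sets)} itemsets")            # 5, 10, 10, 5, 1
print("total:", sum(len(s) for s in itemsets_by_length.values()))  # 2^5 - 1 = 31
```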
FPM – Frequent Itemsets
• The measure used to express the frequency of an itemset in transactional data T is called "support".
• Support is computed as the ratio between the number of transactions (rows in T) in which the given itemset is present and the number of all transactions in the data.

Itemset                     | Support
Fast food                   | 2/10 = 0.2 or 20%
Indian                      | 6/10 = 0.6 or 60%
Indian, Oriental            | 5/10 = 0.5 or 50%
Arabic, Mediterranean       | 3/10 = 0.3 or 30%
Indian, Mediterranean       | 4/10 = 0.4 or 40%
Indian, Oriental, Fast food | 2/10 = 0.2 or 20%

(A sketch of this support computation is shown below.)
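A minimal Python sketch of the support computation, assuming the cuisine transactions from the earlier table are stored in a dictionary T; the names T and support() are illustrative.

```python
# A minimal sketch: relative support of an itemset in the transactional data T
# (the cuisine data shown earlier).
T = {
    "Andrew":   {"Indian", "Mediterranean"},
    "Bernhard": {"Indian", "Oriental", "Fast food"},
    "Carolina": {"Indian", "Mediterranean", "Oriental"},
    "Dennis":   {"Arabic", "Mediterranean"},
    "Eve":      {"Oriental"},
    "Fred":     {"Indian", "Mediterranean", "Oriental"},
    "Gwyneth":  {"Arabic", "Mediterranean"},
    "Hayden":   {"Indian", "Oriental", "Fast food"},
    "Irene":    {"Indian", "Mediterranean", "Oriental"},
    "James":    {"Arabic", "Mediterranean"},
}

def support(itemset, transactions):
    """Fraction of transactions containing every item of the itemset."""
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

print(support({"Fast food"}, T))                 # 0.2
print(support({"Indian", "Oriental"}, T))        # 0.5
print(support({"Arabic", "Mediterranean"}, T))   # 0.3
```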
FPM – Frequent Itemsets
• In the example, the itemset {Indian, Oriental} occurs in 5 out of the 10 transactions.
• Depending on the software and application, support can be reported as an absolute frequency (2, 5, etc.) or as a relative frequency (0.2, 0.5, etc.).
• Frequent itemset mining:
  • Given the set of all available items I, transactional data T, and a support threshold min_sup, frequent itemset mining aims at finding those itemsets, called "frequent itemsets", generated from I whose support in T is at least equal to min_sup.
• Naïve algorithm to find frequent itemsets from transactional data (a sketch follows below):
  • Generate all the different itemsets from I.
  • Count their support in T.
  • Filter out those itemsets whose support is below the pre-defined min_sup threshold.
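A minimal sketch of this naïve procedure, reusing the transactions T and the support() helper from the previous sketch (both illustrative names):

```python
from itertools import combinations

# A minimal sketch of the naive algorithm: generate every itemset from I,
# count its support in T, and keep those meeting min_sup.
def naive_frequent_itemsets(items, transactions, min_sup):
    frequent = {}
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            s = support(set(candidate), transactions)
            if s >= min_sup:
                frequent[frozenset(candidate)] = s
    return frequent

# With min_sup = 0.3 (i.e. 3 of 10 transactions), 9 itemsets meet the threshold.
print(len(naive_frequent_itemsets(
    ["Arabic", "Indian", "Mediterranean", "Oriental", "Fast food"], T, 0.3)))
```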
FPM – Frequent Itemsets
• The naïve algorithm is acceptable when I contains only a very small number of items.
• As the number of items in I grows, the number of itemsets (2^|I| − 1) explodes exponentially, and the cost of computation becomes unacceptable.
• In our data set, with |I| = 5, the naïve algorithm needs to generate 31 itemsets and count their support; assuming 1 ms per itemset, the run time is 31 ms.
FPM – Setting up min_sup threshold
• min_sup is a hyperparameter; the factors to consider depend on the expectations about the results. The important considerations are:
  • If the value is set very low, it results in a large number of itemsets that would be too specific to be considered "frequent".
  • If the value is set very high, it results in a small number of itemsets. These would be too generic to be useful, so the resulting information would probably not represent new knowledge.
• The value of min_sup should be such that the number of resulting frequent itemsets is small enough for subsequent analysis.
• A lattice diagram is used to represent all the itemsets generated from I.
• Each itemset is connected to the subset(s) positioned above it and to the superset(s) positioned below it.
• For our data set, the lattice diagram is summarized on the next slide; the transaction identifiers and the support of each combination (itemset) are given there.
Lattice diagram for the cuisine data set – itemsets grouped by support value:

Support value | Itemsets
0             | The original (full) set {A,I,M,O,F} and the other itemsets present in no transaction
2             | {F}, {I,F}, {O,F}, {I,O,F} – Total 4
3             | {A}, {A,M}, {M,O}, {I,M,O} – Total 4
4             | {I,M} – Total 1
5             | {I,O} – Total 1
6             | {I}, {O} – Total 2
7             | {M} – Total 1
10            | Empty set

Itemsets with support >= 2: 13
Itemsets with support >= 3: 9
Itemsets with support >= 4: 5
Itemsets with support >= 5: 4
Itemsets with support >= 6: 3
Itemsets with support >= 7: 1
FPM – Setting up min_sup threshold
• Monotonicity theorems (rules):
• They help in setting the min_sup value, which controls the number of frequent itemsets.
• Theorem 1: If an itemset is frequent, then each of its subsets is frequent too.
  • If we have a frequent itemset, then all the itemsets connected to it through a pathway up in the lattice are frequent too, with support equal to or greater than the support of the given itemset.
  • E.g., let the itemset {I,M,O} (numbered 22 in the lattice), with support equal to 3, be frequent.
  • Each transaction containing this itemset also contains each of its subsets, i.e., the support of each of the itemsets {I,M}, {I,O}, {M,O}, {I}, {M}, {O} is at least 3.
  • Indeed, the supports are {I,M} = 4, {I,O} = 5, {M,O} = 3, {I} = 6, {M} = 7, {O} = 6.
FPM – Setting up min_sup threshold
• Theorem 2: If an itemset is infrequent, then none of its supersets will be frequent.
  • If we have an infrequent itemset, then adding any other item to it gives itemsets (those connected to it through a pathway down in the lattice) that are infrequent too, with support equal to or lower than the support of the given itemset.
  • E.g., let the itemset {I,F} (numbered 12 in the lattice), with support equal to 2, be infrequent.
  • Each transaction containing a superset of this itemset also contains the itemset itself, i.e., the support of each of the supersets {A,I,F}, {I,M,F}, {I,O,F}, {A,I,M,F}, {A,I,O,F}, {I,M,O,F}, {A,I,M,O,F} is at most 2.
  • Indeed, the supports are {A,I,F} = 0, {I,M,F} = 0, {I,O,F} = 2, {A,I,M,F} = 0, {A,I,O,F} = 0, {I,M,O,F} = 0, {A,I,M,O,F} = 0.
FPM – Setting up min_sup threshold
• A general principle of frequent itemset mining methods is to traverse the itemset lattice in order to search for those itemsets with support greater than or equal to a predefined min_sup threshold.
• The methods differ in how effectively they traverse the lattice of all possible itemsets and in how they represent it.
• Different heuristics are used to optimize the computational time and memory usage.
• Independently of the method used, the result is expected to be the same.
• There are several methods for mining frequent itemsets. The common ones are:
  • Apriori – a join-based method
  • Eclat
  • FP-Growth
  • Maximal and closed frequent itemsets
FPM – Apriori – Mining Frequent Itemsets
• Proposed by two Indian researchers: Rakesh Agrawal and Ramakrishnan Srikant.
FPM – Apriori – Mining Frequent Itemsets
• It is the simplest and oldest technique for mining frequent itemsets. It uses the "join-based" principle.
• The input to the algorithm is the transactional dataset T.
• The hyperparameter to be set is min_sup.
• The algorithm starts by selecting all itemsets of length 1 that satisfy min_sup.
• In the next iteration the same process is carried out for itemsets of length 2, and so on, level by level (itemsets of length 1, 2, 3, 4, 5 in our data set).
• When only one frequent itemset (i.e., fewer than two) is identified at a level, no further candidates can be joined and the identification process stops.
FPM – Apriori – Mining Frequent Itemsets
• The algorithm that generates the itemsets satisfying min_sup is given below (a Python sketch follows it):
  • S1 : INPUT T, the transactional dataset
  • S2 : INPUT min_sup, the minimum support threshold
  • S3 : Set K = 1 [K is the length of the itemsets]
  • S4 : Set stop = false
  • S5 : repeat
  • S6 :   select all frequent itemsets of length K with support at least min_sup
  • S7 :   IF there are not at least two frequent itemsets of length K, then
  • S8 :     stop = true
  • S9 :   else
  • S10:    Set K = K + 1
  • S11: until stop = true
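A minimal Python sketch of this join-based procedure, reusing the transactions T and the support() helper from the earlier sketches; it is an illustrative approximation of steps S1–S11, not the authors' implementation.

```python
# A minimal Apriori sketch. Support values are relative, so min_sup = 0.3
# corresponds to 3 out of 10 transactions.
def apriori(items, transactions, min_sup):
    k = 1
    frequent = {}
    # S6 for K = 1: frequent itemsets of length 1
    current = {frozenset([i]) for i in items
               if support({i}, transactions) >= min_sup}
    while current:
        for itemset in current:
            frequent[itemset] = support(set(itemset), transactions)
        if len(current) < 2:       # S7: fewer than two frequent itemsets -> stop
            break
        k += 1
        # join step: candidates of length k from frequent itemsets of length k-1
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        current = {c for c in candidates
                   if support(set(c), transactions) >= min_sup}
    return frequent

result = apriori(["Arabic", "Indian", "Mediterranean", "Oriental", "Fast food"], T, 0.3)
print(len(result))   # 9 frequent itemsets, as in the worked example that follows
```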
FPM – Apriori – Mining Frequent Itemsets – e.g.
• Let min_sup = 3 and K = 1.
• Iteration 1: There are 4 frequent itemsets that satisfy the min_sup of 3, and one non-frequent itemset:
  Frequent = {A}, {I}, {M}, {O}; Non-frequent = {F}
• Iteration 2: Set K = 2 and identify the itemsets generated by joining the frequent itemsets. There are 4 frequent itemsets that satisfy the min_sup of 3:
  Frequent = {A,M}, {I,M}, {I,O}, {M,O}. Observe that no frequent itemset contains {F}.
• Iteration 3: Set K = 3 and identify the itemsets generated by joining the frequent itemsets. There is 1 frequent itemset that satisfies the min_sup of 3, i.e., {I,M,O}.
• Since there is only one frequent itemset at level K = 3, the mining stops.
• Thus the number of frequent itemsets with min_sup = 3 is 9 (iteration 1: 4, iteration 2: 4, iteration 3: 1):
  Frequent = {A}, {I}, {M}, {O}, {A,M}, {I,M}, {I,O}, {M,O}, {I,M,O}
FPM – Apriori – Mining Frequent Itemsets
• By utilizing the two monotonicity theorems (all subsets of a frequent itemset are frequent, and no superset of a non-frequent itemset is frequent), the amount of computation can be reduced.
• Candidate itemsets: the itemsets for which it makes sense to generate and measure the support in each iteration.
• In the example, 12 candidate itemsets are generated ({A}, {M}, {I}, {O}, {F}, {A,I}, {A,M}, {A,O}, {M,O}, {M,I}, {I,O}, {I,M,O}), of which 9 satisfy the min_sup threshold of 3 ({A}, {M}, {I}, {O}, {A,M}, {M,O}, {M,I}, {I,O}, {I,M,O}).
• Without the monotonicity theorems we would have had to generate 31 candidate itemsets and check their support.
• With Apriori, the number of candidate itemsets processed is about 40% of the possible itemsets.
FPM – Apriori – Mining Frequent Itemsets
• The structure used to illustrate the principle of Apriori is called an "enumeration tree".
• The enumeration tree consists of candidate itemsets ordered by their length (from top to bottom) and lexicographically (from left to right).
• This ordering defines an ancestral relationship in which each "parent" itemset is a subset of its "child" itemsets and, vice versa, each child itemset is a superset of its parent itemset.
• All frequent itemset mining methods can be considered variations of the Apriori method, using various strategies to explore the space of candidate itemsets defined by an enumeration tree.
FPM – Eclat – Mining Frequent Itemsets
• A disadvantage of the Apriori algorithm is that in every step it needs to scan the whole transactional database in order to count the support of the candidate itemsets.
• If the database does not fit into memory, support counting becomes the bottleneck of the frequent itemset mining algorithm.
• Even if the database does fit into memory, we still have to go through all the transactions and test whether the candidate itemset is a subset of each transaction.
• This is not efficient, and hence other ways of representing transactional data are desirable to speed up the support counting process.
• One way to store transactional data is the "vertical format", in which, for every item, a list of the identifiers of the transactions in which that item occurs is stored. This list is called a TID-set, standing for "Transaction IDentifier set".
• The (relative) support of an itemset is the ratio of the cardinality (length) of its TID-set to the number of transactions in the database. The TID-set of an itemset created by merging two other itemsets is the intersection of their TID-sets, and its cardinality gives the support.
FPM – Eclat – Mining Frequent Itemsets
• The frequent itemsets of size K = 1 are detected and stored together with their corresponding TID-sets.
• Each frequent itemset of size K is systematically expanded by one item, resulting in an itemset of size K + 1. K is incremented and the process iterates while there are candidates left to expand.
• E.g., the transactional data in vertical format for the first iteration (K = 1) of the Eclat algorithm on our data set:

Item          | TID-set
Arabic        | {Dennis, Gwyneth, James}
Indian        | {Andrew, Bernhard, Carolina, Fred, Hayden, Irene}
Mediterranean | {Andrew, Carolina, Dennis, Fred, Gwyneth, Irene, James}
Oriental      | {Bernhard, Carolina, Eve, Fred, Hayden, Irene}
Fast food     | {Bernhard, Hayden}
FPM – Eclat – e.g.
• Count the support for {M,O}:
  • The TID-set for {M,O} is found by intersecting the TID-set of {M} with the TID-set of {O}.
  • Therefore the TID-set of {M,O} is {Carolina, Fred, Irene}.
  • Its cardinality gives the support of {M,O} = 3, or 0.3 (3/10) [check itemset 13 in the lattice diagram].
• Count the support for {I,O,F}:
  • The TID-set for {I,O,F} is found by intersecting the TID-sets of {I}, {O} and {F}.
  • Therefore the TID-set of {I,O,F} is {Bernhard, Hayden}, with cardinality 2 [check itemset 24 in the lattice diagram].
(A sketch of this TID-set intersection is shown below.)
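A minimal sketch of Eclat-style support counting via TID-set intersection, assuming the vertical format shown above with the contact names used as transaction identifiers; the dictionary tidsets and the helper tidset() are illustrative names.

```python
# A minimal sketch: item -> set of transaction identifiers (vertical format).
tidsets = {
    "A": {"Dennis", "Gwyneth", "James"},
    "I": {"Andrew", "Bernhard", "Carolina", "Fred", "Hayden", "Irene"},
    "M": {"Andrew", "Carolina", "Dennis", "Fred", "Gwyneth", "Irene", "James"},
    "O": {"Bernhard", "Carolina", "Eve", "Fred", "Hayden", "Irene"},
    "F": {"Bernhard", "Hayden"},
}
n_transactions = 10

def tidset(itemset):
    """TID-set of an itemset = intersection of the TID-sets of its items."""
    result = None
    for item in itemset:
        result = tidsets[item] if result is None else result & tidsets[item]
    return result

mo = tidset({"M", "O"})
print(mo, len(mo) / n_transactions)   # {'Carolina', 'Fred', 'Irene'}  0.3
iof = tidset({"I", "O", "F"})
print(iof, len(iof))                  # {'Bernhard', 'Hayden'}  2
```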
FPM – Frequent Pattern Growth (FP-Growth)
• FP-Growth is an efficient approach when the database is too large to fit into memory.
• The transactional data is used as the reference for FP-Growth.
• The key idea of FP-Growth is that it compresses the transactional data into a so-called FP-tree structure.
• The FP-tree helps to make support counting and itemset generation faster.
• To build the FP-tree, only two passes over the data are required:
  • In the first pass, all the frequent items and their supports are found for the desired min_sup.
  • In the second pass, the items in each transaction are processed in decreasing order of their support. Infrequent items are no longer taken into account.
  • Items from a transaction are added to the tree following a path according to that order, and the counts of the affected nodes are incremented. If a new node is added to the tree, its count is initialized to 1.
  • Finally, a header table of items is created, with pointers to the first appearance of each item.
  • The pointers are linked and chained through all appearances of the corresponding items.
FPM – Frequent Pattern Growth (FP-Growth) – e.g. with min_sup = 3
• Consider the transactional data set of our data.
• Pass 1:
  • Items meeting min_sup = 3 are {A}, {M}, {O}, {I}.
  • Arranged in descending order of support: {M}, {I}, {O}, {A}.
• Pass 2:
  • Building the FP-tree transaction by transaction (new entries in italics):
    1) Andrew = {M}, {I}
    2) Bernhard = {I}, {O}
    3) Carolina = {M}, {I}, {O}
    4) Dennis = {M}, {A}
FPM-Frequent pattern Growth (FP-Growth) Eg. with min_sup=3
5) Eve = {O}
6) Fred = {M}, {I}, {O}
7) Gwyneth = {M}, {A}
8) Hayden = {I}, {O}
9) Irene = {M}, {I}. {O}
10) James = {M}, {A}
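A minimal sketch of the two passes that build an FP-tree for min_sup = 3, reusing the transactions T defined earlier; FPNode and build_fp_tree are illustrative names, and the mining phase of FP-Growth is omitted.

```python
from collections import Counter

# A minimal sketch: pass 1 counts item supports and fixes the item order,
# pass 2 inserts each transaction into the tree with shared prefixes.
class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 1, {}

def build_fp_tree(transactions, min_sup_count):
    # Pass 1: count item supports and keep only the frequent items.
    counts = Counter(item for items in transactions.values() for item in items)
    frequent = {i: c for i, c in counts.items() if c >= min_sup_count}
    order = sorted(frequent, key=frequent.get, reverse=True)   # M, I, O, A here

    # Pass 2: insert each transaction, items sorted by decreasing support.
    root, header = FPNode(None, None), {i: [] for i in frequent}
    for items in transactions.values():
        node = root
        for item in sorted(items & frequent.keys(), key=order.index):
            if item in node.children:
                node.children[item].count += 1
            else:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])   # chain appearances
            node = node.children[item]
    return root, header

root, header = build_fp_tree(T, 3)
print({i: sum(n.count for n in nodes) for i, nodes in header.items()})
# counts: Mediterranean 7, Indian 6, Oriental 6, Arabic 3 (order may vary)
```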
FPM – Frequent Pattern Growth (FP-Growth) – e.g. with min_sup = 3
• To count the support of an item, we sum the counts in the nodes linked and chained to that item in the header table.
• In the example, the support of item {O} is 6, of item {I} is 6, and of {A,M} is 3.
• A useful property of the FP-tree is that it is compact and contains complete information about the frequent itemsets in the data.
• The process of mining frequent itemsets is based on the observation that the set of all frequent itemsets can be divided into non-overlapping subsets.
• In the example, these are four subsets:
  • frequent itemsets containing item A;
  • frequent itemsets containing item O but not item A;
  • frequent itemsets containing item I but not items A or O;
  • frequent itemsets containing item M but not items I, A or O.
FPM – Maximal and Closed Frequent Itemsets
• FPM often results in a very large number of frequent itemsets, which complicates further processing.
• Certain itemsets are more representative than others.
• These are the maximal and closed frequent itemsets. From them, all other frequent itemsets can be derived.
• Maximal frequent itemset: a frequent itemset is maximal if none of its supersets is frequent, i.e., no other frequent itemsets are connected to it by a path on the way down in the itemset lattice.
• Closed frequent itemset: a frequent itemset is closed if none of its supersets has the same support.
FPM – Maximal and Closed Frequent Itemsets
• E.g., in the lattice diagram shown earlier, the itemsets {A,M} and {I,M,O} are the maximal frequent itemsets, since none of their supersets is frequent.
• Therefore we can derive all frequent itemsets as follows:
  i) from {A,M}: {A}, {M}
  ii) from {I,M,O}: {I}, {M}, {O}, {I,M}, {I,O}, {M,O}
• Hence the frequent itemsets are:
  {A}, {M}, {A,M}, {I}, {O}, {I,M}, {M,O}, {I,O}, {I,M,O}
• In the lattice diagram, the itemsets drawn with solid lines are the closed frequent itemsets: none of their supersets has the same support. For instance, {A} is not closed because the supports of {A} and {A,M} are the same.
• The closed frequent itemsets are:
  {I}, {M}, {O}, {A,M}, {I,M}, {I,O}, {I,M,O}
(A sketch deriving the frequent itemsets from the maximal ones follows.)
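A minimal sketch showing how all frequent itemsets of this example can be derived from the maximal ones; the variable names are illustrative.

```python
from itertools import combinations

# Every frequent itemset is a non-empty subset of some maximal frequent itemset,
# so the full collection can be derived from the maximal ones.
maximal = [frozenset({"Arabic", "Mediterranean"}),
           frozenset({"Indian", "Mediterranean", "Oriental"})]

frequent = set()
for m in maximal:
    for k in range(1, len(m) + 1):
        frequent.update(frozenset(c) for c in combinations(m, k))

print(len(frequent))   # 9 frequent itemsets, as listed above
```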
Association Rules
• An association rule is an implication of the form A → C, where A and C are itemsets that do not share common items. A is called the antecedent of the rule and C the consequent of the rule.
• The simple meaning of an association rule is that if its antecedent is present in some transactions, then its consequent should also be present in those transactions.
• Further measures are required to express the strength of an association.
• Support is used as one measure of the strength of an association rule A → C.
• For an association rule, it is defined as the support of the itemset formed by joining the antecedent and the consequent of the rule:
  support(A → C) = support(A ∪ C)
Association Rules
• Referring to the cuisine data table shown earlier:
• support(A → M) is the support of the itemset {A,M} = 3, i.e., three friends who like Arabic (antecedent) also like Mediterranean (consequent).
• Similarly, support(M → A) is the support of the itemset {M,A} = 3, i.e., three friends who like Mediterranean also like Arabic.
• Thus we observe that support is not enough to measure the quality of a rule.
• Support gives the same value when the antecedent and consequent of the rule are exchanged.
Association Rules
• The quality, or reliability, of a rule can be measured by its "confidence".
• Confidence is given by the expression: confidence(A → C) = support(A ∪ C) / support(A)
• E.g., confidence(A → M) = support({A,M}) / support({A}) = 3/3 = 1
• confidence(M → A) = support({M,A}) / support({M}) = 3/7 ≈ 0.43
• Interpretation:
  • All who like Arabic food also like Mediterranean food, so the confidence is 1.
  • Only 3 of the 7 friends who like Mediterranean food also like Arabic food, so the confidence ≈ 0.43.
  • The quality of the association {A} → {M} is greater than the quality of the association {M} → {A}.
(A sketch of the confidence computation is shown below.)
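A minimal sketch of the confidence computation, reusing support() and T from the earlier sketches:

```python
# confidence(A -> C) = support(A U C) / support(A)
def confidence(antecedent, consequent, transactions):
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

print(confidence({"Arabic"}, {"Mediterranean"}, T))   # 3/3  = 1.0
print(confidence({"Mediterranean"}, {"Arabic"}, T))   # 3/7 ~= 0.43
```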
Association Rules – Association Rule Mining
• Association rule mining: given the set of all available items I, transactional data T, and threshold values min_sup and min_conf, association rule mining aims at finding those association rules generated from I whose support in T is at least min_sup and whose confidence in T is at least min_conf.
• The number of different association rules that can be generated from the data is 3^|I| − 2^(|I|+1) + 1.
• E.g., if we have 5 different items, then:
  • number of association rules = 3^5 − 2^(5+1) + 1 = 243 − 64 + 1 = 180
  • number of itemsets = 2^|I| − 1 = 2^5 − 1 = 31
• Since the support of each rule should meet the min_sup threshold, only those rules for which the joined antecedent and consequent parts form a frequent itemset are considered.
Association Rules – Association Rule Mining
• The process of mining association rules consists of two phases:
  • mining the frequent itemsets that satisfy the min_sup threshold – this is the most computationally expensive part of the process;
  • generating, from the identified frequent itemsets, the association rules that satisfy the min_conf threshold.
• Generally, if min_sup is selected wisely, the number of frequent itemsets will be much smaller than the total number of all itemsets.
• Thus the first phase of association rule mining helps in reducing the search space.
• Despite this reduction, the number of possible association rules from a frequent itemset Z of size |Z| is 2^|Z| − 2.
• Generating and testing all rules for each frequent itemset would not be efficient.
• Hence a monotonicity theorem is used.
Monotonicity for Association Rules
• Rule: If the confidence of the rule X → Y − X is lower than min_conf, then the confidence of every rule X′ → Y − X′, where X′ is a subset of X, will be lower than min_conf as well.
• Y − X means that all the items of Y that are present in X are removed, and Y − X′ means that all the items of Y that are present in X′ are removed.
• E.g., let Y = {I,M,O}, X = {I,O}, X′ = {I} and min_conf = 0.75.
  a) Then Y − X = {I,M,O} − {I,O} = {M}, and the rule X → Y − X is {I,O} → {M}, with confidence
     confidence({I,O} → {M}) = support({I,M,O}) / support({I,O}) = 3/5 = 0.6
     Thus the rule X → Y − X does not meet the min_conf threshold.
  b) The rule X′ → Y − X′ is {I} → {M,O}, with confidence
     confidence({I} → {M,O}) = support({I,M,O}) / support({I}) = 3/6 = 0.5
     Thus the rule X′ → Y − X′ does not meet the min_conf threshold either.
• Practice: try the same with X′ = {O}.
Monotonicity for Association Rules
• In other words: if the confidence of a rule does not satisfy min_conf and we modify the rule by moving one or more items from its antecedent into its consequent, then the confidence of the modified rule will not satisfy min_conf either.
• Using this theorem, we can define an algorithm to generate the association rules for a given frequent itemset Z.
Monotonicity for Association Rules – e.g.
• Consider the process of mining association rules from the frequent itemset Z = {I,M,O}, with min_conf set to 0.75.
• First step:
  • Three rules with single-item consequents are constructed: {M,O} → {I}, {I,O} → {M} and {I,M} → {O}. Their confidences (and that of the rule considered in the second step) are:

Association rule | Confidence expression                    | Confidence value
{M,O} → {I}      | support({I,M,O}) / support({M,O}) = 3/3  | 1.0
{I,O} → {M}      | support({I,M,O}) / support({I,O}) = 3/5  | 0.6
{I,M} → {O}      | support({I,M,O}) / support({I,M}) = 3/4  | 0.75
{M} → {I,O}      | support({I,M,O}) / support({M}) = 3/7    | 0.43
Monotonicity for Association Rules – e.g.
• Since the confidence of the 2nd rule, {I,O} → {M}, is 0.6 < 0.75 (the min_conf), it is discarded. Therefore the first step returns two rules: {M,O} → {I} and {I,M} → {O}.
• The items {I} and {O}, the consequents of these two rules, are added to the set C1. Therefore C1 = ({I}, {O}) and k is set to 2.
• There is only one possible itemset V of size k = 2 that can be generated from the itemsets {I} and {O}, i.e., V = {I,O}.
• In the second step, the rule Z − V → V is generated, i.e., {I,M,O} − {I,O} → {I,O}, that is, {M} → {I,O}. The confidence of this rule is 0.43 < min_conf, hence it is discarded.
• Since there are no itemsets to be added to C2, the mining process stops. (A brute-force sketch of rule generation from a frequent itemset is shown below.)
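A brute-force sketch of rule generation from one frequent itemset: it checks every possible split rather than applying the monotonicity pruning described above, and it reuses confidence() and T from the earlier sketches.

```python
from itertools import combinations

# Every non-empty proper subset X of Z yields a candidate rule X -> Z - X,
# kept only if its confidence reaches min_conf.
def rules_from_itemset(Z, transactions, min_conf):
    rules = []
    for k in range(1, len(Z)):
        for X in combinations(Z, k):
            X = frozenset(X)
            conf = confidence(set(X), Z - X, transactions)
            if conf >= min_conf:
                rules.append((set(X), Z - X, conf))
    return rules

for antecedent, consequent, conf in rules_from_itemset(
        frozenset({"Indian", "Mediterranean", "Oriental"}), T, 0.75):
    print(antecedent, "->", set(consequent), round(conf, 2))
# {Mediterranean, Oriental} -> {Indian} 1.0 and {Indian, Mediterranean} -> {Oriental} 0.75
```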
Behind support and confidence
• Each pattern reveals a kind of knowledge that helps in further decisions by the users.
• However, only some patterns are interesting enough to provide useful and unexpected knowledge.
• Evaluation of the usefulness of a pattern depends on the application domain and also on the opinion of the user.
• Due to the high number of patterns in large data sets, manual analysis or incorporating human knowledge into an automated evaluation process would be cumbersome.
• To support the evaluation process, several objective evaluation measures have been developed to assess the quality of association rules, helping users to select interesting patterns.
• Some of the important objective evaluation measures and related effects are:
  • cross-support patterns
  • lift
  • Simpson's paradox
Cross-support patterns
• In real-world data, most of the items have relatively low or modest support, while a few items have high support.
• If a pattern contains both low-support items and high-support items, it is called a cross-support pattern.
• Cross-support patterns can represent interesting relationships between items, but at the same time they can be spurious, since the items they contain are only weakly correlated in the transactions.
• The support ratio (sup_ratio) is used to measure the extent to which a pattern P can be called a cross-support pattern. It is given as:
  sup_ratio(P) = min{s(i1), s(i2), …, s(ik)} / max{s(i1), s(i2), …, s(ik)}
  where s(i1), s(i2), …, s(ik) are the supports of the items i1, i2, …, ik contained in P.
• I.e., sup_ratio is the ratio of the minimal support of the items present in the pattern to the maximal support of the items present in the pattern.
Cross-support patterns
• This measure can be used to filter out patterns with sup_ratio below or above a user-defined threshold, depending on the interest of the user. (The threshold is a hyperparameter.)
• Another way to eliminate cross-support patterns is to set the min_sup threshold high, but this could eliminate some interesting patterns.
• E.g., let the supports of items i1, i2, i3 be s(i1) = 0.9, s(i2) = 0.1, s(i3) = 0.25, and let the sup_ratio threshold be 0.2.
  • For the pattern P = {i1, i2}: sup_ratio(P) = 0.1 / 0.9 ≈ 0.11 < 0.2, hence P is a cross-support pattern.
  • For the pattern Q = {i1, i3}: sup_ratio(Q) = 0.25 / 0.9 ≈ 0.28 > 0.2, hence Q is not a cross-support pattern.
(A minimal sketch of this computation follows.)
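A minimal sketch of the sup_ratio computation with the example values above; the helper name is illustrative.

```python
# sup_ratio = minimal item support / maximal item support within a pattern.
def sup_ratio(item_supports):
    return min(item_supports) / max(item_supports)

print(round(sup_ratio([0.9, 0.1]), 2))    # 0.11 -> below 0.2: cross-support pattern
print(round(sup_ratio([0.9, 0.25]), 2))   # 0.28 -> above 0.2: not a cross-support pattern
```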
Lift
• In some cases, high confidence and good support of a rule do not necessarily imply a cause-and-effect relationship between its antecedent and consequent.
• Lift is a measure of the effect of the antecedent on the consequent of a rule.
• It is given as:
  lift(X → Y) = confidence(X → Y) / support(Y) ≥ 0
  where confidence(X → Y) = support(X ∪ Y) / support(X).
• A lift value > 1 indicates a positive correlation between the rule's antecedent and consequent, i.e., the occurrence of the antecedent has a positive effect on the occurrence of the consequent.
• A lift value < 1 indicates a negative correlation between the rule's antecedent and consequent, i.e., the occurrence of the antecedent has a negative effect on the occurrence of the consequent.
• A lift value near 1 indicates no correlation between the rule's antecedent and consequent, i.e., the occurrence of the antecedent has almost no effect on the occurrence of the consequent.
(A minimal computation sketch follows.)
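A minimal sketch of the lift computation, reusing confidence() and support() from the earlier sketches; the example rule and its value are derived from the cuisine data rather than stated on the slides.

```python
# lift(X -> Y) = confidence(X -> Y) / support(Y)
def lift(antecedent, consequent, transactions):
    return (confidence(antecedent, consequent, transactions)
            / support(consequent, transactions))

# lift({Arabic} -> {Mediterranean}) = 1.0 / 0.7 ~= 1.43 > 1: positive correlation.
print(round(lift({"Arabic"}, {"Mediterranean"}, T), 2))   # 1.43
```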
Simpson's paradox
• The relationship observed between the antecedent and the consequent of a rule can also be influenced by hidden factors that are not captured in the data or not considered in the analysis.
• Simpson's paradox states that a correlation between pairs of itemsets appearing in different groups of data may disappear or be reversed when these groups are combined.
• E.g., consider 800 transactions formed by two groups of students, A and B. Group A are students of Physics and Group B are students of Biology. Let two itemsets be X = {genetics} and Y = {data analytics}. The corresponding contingency table yields the confidence and lift values summarized on the next slide.
Simpson's paradox
• In Group A, the rule X → Y has confidence = 0.8 and lift = 1.79.
• In Group B, the rule Y → X has confidence = 0.8 and lift = 1.66.
• Considered separately, these look like strong relationships.
• If groups A and B are combined, then the rule X → Y has confidence = 0.44 and lift = 1.4, and the rule Y → X has confidence = 0.48 and lift = 1.4.
• Thus the same rules that were strong when the groups were analyzed separately become much weaker when the groups are combined.
Other Types of Patterns - Sequences
• Other types of patterns are sequences and graphs.
• Their mining is based on principles similar to those presented for frequent itemset mining.
• The approaches for mining frequent sequences and graphs are mostly extensions and modifications of the Apriori, Eclat and FP-Growth methods.
• Here only the basic definitions, focusing on sequential patterns, are considered.
• The input to sequential pattern mining is a sequence database, denoted by S.
• Each row consists of a sequence of events consecutively recorded in time.
• Each event is an itemset of arbitrary length assembled from the items available in the data.
• Let there be two sequences s1 = <X1, X2, …, Xn> and s2 = <Y1, Y2, …, Ym>, where n <= m. Then s1 is called a "subsequence" of s2 if there exist indices 1 <= i1 < i2 < … < in <= m such that X1 is a subset of Yi1, X2 is a subset of Yi2, and so on up to Xn being a subset of Yin.
Other Types of Patterns - Sequences
• The support of a given sequence s in the sequence database S is the number of rows of S of which s is a subsequence.
• E.g., consider the purchasing pattern of the customer with id = 1:
  • 1st visit: items a and b; 2nd visit: items a, b and c; 3rd visit: items a, c, d and e; 4th visit: items b and f.
• Let the sequence be s2 = <Y1, Y2, Y3, Y4>, where Y1 = {a,b}, Y2 = {a,b,c}, Y3 = {a,c,d,e} and Y4 = {b,f}, so m = 4.
• Let s1 = <X1, X2>, where X1 = {b} and X2 = {a,d,e}, so n = 2.
• It can be observed that there exist indices i1 = 1 and i2 = 3 such that X1 is a subset of Y1 and X2 is a subset of Y3. Therefore s1 is a subsequence of s2.
• Similarly, another mapping also exists in which X1 is a subset of Y2 and X2 is a subset of Y3.
• An example of the calculation of support for a sequence is shown on the next slide; a minimal sketch of the subsequence test is shown below.
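A minimal sketch of the subsequence test described above; is_subsequence is an illustrative name, and greedy left-to-right matching is one simple way to implement the definition.

```python
# s1 is a subsequence of s2 if its itemsets can be matched, in order,
# to itemsets of s2 (as subsets).
def is_subsequence(s1, s2):
    j = 0
    for y in s2:                        # scan s2 left to right
        if j < len(s1) and s1[j] <= y:  # greedily match the next itemset of s1
            j += 1
    return j == len(s1)

s2 = [{"a", "b"}, {"a", "b", "c"}, {"a", "c", "d", "e"}, {"b", "f"}]
s1 = [{"b"}, {"a", "d", "e"}]
print(is_subsequence(s1, s2))   # True: {b} matches Y1 (or Y2), {a,d,e} matches Y3
```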
Other Types of Patterns - Sequences
• Consider a sequence database S with items a, b, c, d, e, f, shown alongside in the original slide (five sequences).
• Support for the sequence <{a}, {f}>:
  • From that table it can be seen that the sequence is a subsequence of 4 of the 5 rows, so its support in the database is 4/5 = 0.8.
Sequences: Frequent sequence mining
• Let I be the set of all available items, S the sequence database, and min_sup the threshold value.
• The aim of frequent sequence mining is to find those sequences, called frequent sequences, generated from I whose support in S is at least min_sup.
• Note: the number of frequent sequences generated from S with the available items I is usually greater than the number of itemsets generated from I.
• E.g., the number of possible itemsets with the 6 items a, b, c, d, e, f is 2^6 − 1 = 63. For the sequence database shown on the previous slide:
  a) with min_sup = 0.8, the number of frequent sequences is 20;
  b) with min_sup = 0.4, the number of frequent sequences is 237.
Sequences: Closed and maximal sequences
• A frequent sequential pattern is closed if it is not a subsequence of any other frequent sequential pattern with the same support.
• A frequent sequential pattern is maximal if it is not a subsequence of any other frequent sequential pattern.
End of Module 3
