Bigdata Section4
Data Management
Engineer Merna Magdy
Engineer Ahmed Ramadan
Association Rules
Key Concepts and Terms

Support
• Measures the frequency of occurrence of an itemset in a dataset.
• $Support(X) = \frac{Count(X)}{Total\ Records}$
• High support indicates that the itemset is frequently present in the dataset.

Confidence
• The percentage of records containing X that also contain Y.
• $Conf(X \to Y) = \frac{Support(X \cap Y)}{Support(X)}$, where $Support(X \cap Y)$ is the fraction of records containing both X and Y.
• High confidence implies a strong association between items X and Y.

Apriori Property: Any subset of a frequent itemset is also frequent.
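As a minimal sketch of the two definitions above, the following Python computes support and confidence over a small list of transactions. The transactions and the helper names `support` and `confidence` are hypothetical, made up for illustration; they do not come from the slides.

```python
# Hypothetical transactions, made up for this illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, records):
    """Fraction of records that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for record in records if itemset <= record) / len(records)


def confidence(x, y, records):
    """Conf(X -> Y): support of X and Y together, divided by support of X."""
    return support(set(x) | set(y), records) / support(x, records)


print(support({"bread"}, transactions))               # 3 of the 4 records
print(support({"bread", "milk"}, transactions))       # 2 of the 4 records
print(confidence({"bread"}, {"milk"}, transactions))  # 2 of the 3 bread records

# Apriori property: a subset is at least as frequent as its superset.
assert support({"bread"}, transactions) >= support({"bread", "milk"}, transactions)
```

The final assertion illustrates the Apriori property on this toy data: every record that contains {bread, milk} also contains {bread}, so the subset can never be less frequent.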
Lift
It provides valuable insights into how the presence of one item in a transaction affects the likelihood of the presence of another item.

Leverage
It quantifies the difference between the observed frequency of co-occurrence of two items in a dataset and the expected frequency of co-occurrence if the items were independent of each other.
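The two measures can be sketched directly from support values, using the lift and leverage formulas that appear later in this section. The function names and the sample numbers below are hypothetical and chosen only to show the independent case and one positively associated case.

```python
def lift(support_xy, support_x, support_y):
    """Lift(X -> Y) = Support(X and Y) / (Support(X) * Support(Y))."""
    return support_xy / (support_x * support_y)


def leverage(support_xy, support_x, support_y):
    """Leverage(X -> Y) = Support(X and Y) - Support(X) * Support(Y)."""
    return support_xy - support_x * support_y


# Independent items: they co-occur exactly as often as chance predicts.
print(lift(0.25, 0.5, 0.5))      # 1.0
print(leverage(0.25, 0.5, 0.5))  # 0.0

# Positively associated items (hypothetical numbers): lift above 1,
# leverage above 0.
print(lift(0.4, 0.5, 0.5))
print(leverage(0.4, 0.5, 0.5))
```

A lift of 1 (equivalently, a leverage of 0) corresponds to independence; values above that suggest the items occur together more often than expected, and values below suggest they occur together less often.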
Image ID    Associated Tags
2           sand, beach
5           holiday, sunshine
Solution
$\text{min support} = 40\%$
$\text{min support count} = \text{min support} \times \text{total count} = 40\% \times 5 = 2$
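The threshold calculation can be sketched as follows. The helper name `min_support_count` is hypothetical, and rounding up with `math.ceil` is an assumption for cases where the product is not a whole number (here it is exactly 2). The three support counts are the ones that remain legible in the 1-itemset table of this solution.

```python
import math


def min_support_count(min_support_percent, total_records):
    """Smallest record count that reaches the minimum support."""
    # Assumption: round up when the product is not a whole number.
    return math.ceil(min_support_percent * total_records / 100)


threshold = min_support_count(40, 5)
print(threshold)  # 2

# Support counts legible in the solution slide.
support_counts = {"beach": 4, "holiday": 2, "ocean": 2}
frequent = {item: n for item, n in support_counts.items() if n >= threshold}
print(frequent)
```

An item is kept when its count is at least the threshold, so an item that appears in exactly 2 of the 5 images still qualifies.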
Image ID    Associated Tags
1           beach, sunshine, holiday
2           sand, beach
3           sunshine, beach, ocean

1-itemset    Support count
beach        4
holiday      2
ocean        2

Frequent 1-itemset    Support count
beach                 4
$Lift(X \to Y) = \frac{Support(X \cap Y)}{Support(X) \times Support(Y)}$

Rule                           Confidence   Lift    Leverage
{beach} → {sunshine}           0.75         0.938   -0.04
{sunshine} → {beach}           0.75         0.938   -0.04
{ocean} → {beach, sunshine}    1            1.667   0.16
{beach, ocean} → {sunshine}    1            1.25    0.08
b) Assuming the minimum support is 0.4, which itemsets are considered frequent?
d) List all the candidate rules that can be formed from the statistics. Which rules are
considered interesting at the minimum confidence 0.2? Out of these interesting rules, which
rule is considered the most useful (least coincidental)?
Solution
a) What are the support values of the preceding itemsets?
$Support = \frac{count}{total}$, with total = 20,000

Itemset                           Count     Support
{Milk}                            18,000    0.9
{Cheese}                          16,000    0.8
{Rice}                            15,000    0.75
{Yogurt}                          14,000    0.7
{Pasta}                           13,500    0.675
{Oil}                             12,000    0.6
{Cereal}                          10,000    0.5
{Pasta, Oil}                      11,500    0.575
{Rice, Oil}                       9,500     0.475
{Milk, Cheese}                    13,000    0.65
{Milk, Cereal}                    10,000    0.5
{Milk, Yogurt}                    8,000     0.4
{Milk, Cereal, Cheese}            8,500     0.425
{Milk, Cereal, Yogurt}            7,500     0.375
{Milk, Cereal, Cheese, Yogurt}    7,350     0.368
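The support values in part a) can be reproduced with a short sketch that divides each count by the 20,000 total. The variable names are illustrative; the counts are the ones given in this solution.

```python
TOTAL = 20_000  # total number of transactions in the exercise

# Counts taken from the solution to part a).
counts = {
    ("Milk",): 18_000,
    ("Cheese",): 16_000,
    ("Rice",): 15_000,
    ("Yogurt",): 14_000,
    ("Pasta",): 13_500,
    ("Oil",): 12_000,
    ("Cereal",): 10_000,
    ("Pasta", "Oil"): 11_500,
    ("Rice", "Oil"): 9_500,
    ("Milk", "Cheese"): 13_000,
    ("Milk", "Cereal"): 10_000,
    ("Milk", "Yogurt"): 8_000,
    ("Milk", "Cereal", "Cheese"): 8_500,
    ("Milk", "Cereal", "Yogurt"): 7_500,
    ("Milk", "Cereal", "Cheese", "Yogurt"): 7_350,
}

# Support = count / total, for every itemset.
supports = {itemset: count / TOTAL for itemset, count in counts.items()}

for itemset, value in supports.items():
    print(f"{{{', '.join(itemset)}}}: {value:.4f}")
```

Printing four decimals shows that the last itemset's support is 0.3675, which the slide rounds to 0.368.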
Solution
b) Assuming the minimum support is 0.4, which itemsets are considered frequent?
The following itemsets have a support of at least 0.4 and are therefore frequent:

Itemset                   Count     Support
{Milk}                    18,000    0.9
{Cheese}                  16,000    0.8
{Rice}                    15,000    0.75
{Yogurt}                  14,000    0.7
{Pasta}                   13,500    0.675
{Oil}                     12,000    0.6
{Cereal}                  10,000    0.5
{Pasta, Oil}              11,500    0.575
{Rice, Oil}               9,500     0.475
{Milk, Cheese}            13,000    0.65
{Milk, Cereal}            10,000    0.5
{Milk, Yogurt}            8,000     0.4
{Milk, Cereal, Cheese}    8,500     0.425

{Milk, Cereal, Yogurt} (0.375) and {Milk, Cereal, Cheese, Yogurt} (0.368) fall below the threshold and are not frequent.
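The filtering step can be sketched by comparing counts against a minimum count instead of comparing floating-point ratios, so the borderline itemset {Milk, Yogurt} (exactly 0.4) is decided exactly. The names below are illustrative; the counts are those from part a).

```python
TOTAL = 20_000
# 40% of 20,000 is a whole number, so integer arithmetic is exact here.
MIN_COUNT = 40 * TOTAL // 100  # 8000

counts = {
    ("Milk",): 18_000,
    ("Cheese",): 16_000,
    ("Rice",): 15_000,
    ("Yogurt",): 14_000,
    ("Pasta",): 13_500,
    ("Oil",): 12_000,
    ("Cereal",): 10_000,
    ("Pasta", "Oil"): 11_500,
    ("Rice", "Oil"): 9_500,
    ("Milk", "Cheese"): 13_000,
    ("Milk", "Cereal"): 10_000,
    ("Milk", "Yogurt"): 8_000,
    ("Milk", "Cereal", "Cheese"): 8_500,
    ("Milk", "Cereal", "Yogurt"): 7_500,
    ("Milk", "Cereal", "Cheese", "Yogurt"): 7_350,
}

# An itemset is frequent when its count reaches the minimum count.
frequent = [s for s, c in counts.items() if c >= MIN_COUNT]
infrequent = [s for s, c in counts.items() if c < MIN_COUNT]

print(len(frequent), "frequent itemsets")
print("not frequent:", infrequent)
```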
Solution
c) What are the confidence values of the following rules:
1. {Milk} → {Cereal}
2. {Milk, Cereal} → {Cheese}
3. {Milk, Cereal, Cheese} → {Yogurt}
Which of the three rules is more interesting? Why?

$Conf(X \to Y) = \frac{Support(X \cap Y)}{Support(X)}$
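The three confidences asked for here can be checked with a short sketch. It works from the counts in part a) instead of the rounded supports; because both supports share the same total, the total cancels and confidence reduces to a ratio of counts. The function and variable names are illustrative.

```python
# Counts from part a) of the solution.
counts = {
    frozenset({"Milk"}): 18_000,
    frozenset({"Milk", "Cereal"}): 10_000,
    frozenset({"Milk", "Cereal", "Cheese"}): 8_500,
    frozenset({"Milk", "Cereal", "Cheese", "Yogurt"}): 7_350,
}


def confidence(x, y):
    """Conf(X -> Y) = count(X and Y together) / count(X)."""
    x, y = frozenset(x), frozenset(y)
    return counts[x | y] / counts[x]


rules = [
    ({"Milk"}, {"Cereal"}),
    ({"Milk", "Cereal"}, {"Cheese"}),
    ({"Milk", "Cereal", "Cheese"}, {"Yogurt"}),
]

for x, y in rules:
    print(f"{sorted(x)} -> {sorted(y)}: {confidence(x, y):.3f}")
```

Under these counts the three confidences come out to about 0.556, 0.850, and 0.865 respectively.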
$Conf(X \to Y) = \frac{Support(X \cap Y)}{Support(X)}$
$Lift(X \to Y) = \frac{Support(X \cap Y)}{Support(X) \times Support(Y)}$
$Leverage(X \to Y) = Support(X \cap Y) - [Support(X) \times Support(Y)]$

Frequent 2-itemset    Support
{Pasta, Oil}          0.575
{Rice, Oil}           0.475
{Milk, Cheese}        0.65
{Milk, Cereal}        0.5
{Milk, Yogurt}        0.4

Frequent 3-itemset        Support
{Milk, Cereal, Cheese}    0.425
Solution
d) List all the candidate rules that can be formed from the statistics. Which rules are
considered interesting at the minimum confidence 0.2? Out of these interesting rules,
which rule is considered the most useful (least coincidental)?
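Rule                          Confidence   Lift    Leverage
{Rice} → {Oil}                0.63         1.056   0.025
{Oil} → {Rice}                0.79         1.056   0.025
{Milk} → {Cheese}             0.72         0.903   -0.07
{Cheese} → {Milk}             0.813        0.903   -0.07
{Cheese} → {Milk, Cereal}     0.525        1.05    0.02
{Milk, Cereal} → {Cheese}     0.84         1.05    0.02
{Cereal} → {Milk, Cheese}     0.84         1.292   0.095
{Milk, Cheese} → {Cereal}     0.646        1.292   0.095

All of the rules listed here exceed the minimum confidence of 0.2. The values for the rules over {Milk, Cereal, Cheese} are computed with that itemset's support rounded to 0.42; with the unrounded 0.425 from part a) they shift slightly (for example, the confidence of {Milk, Cereal} → {Cheese} becomes 0.85).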
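The three measures for the rules listed above can be recomputed with the sketch below. It covers only those rules and uses the counts from part a); the function names are illustrative. Because it uses the unrounded support 0.425 for {Milk, Cereal, Cheese}, its results for the three-item rules differ slightly from the tabulated figures, which correspond to a support of 0.42.

```python
TOTAL = 20_000

# Counts from part a), limited to the itemsets the listed rules need.
counts = {
    frozenset({"Rice"}): 15_000,
    frozenset({"Oil"}): 12_000,
    frozenset({"Milk"}): 18_000,
    frozenset({"Cheese"}): 16_000,
    frozenset({"Cereal"}): 10_000,
    frozenset({"Rice", "Oil"}): 9_500,
    frozenset({"Milk", "Cheese"}): 13_000,
    frozenset({"Milk", "Cereal"}): 10_000,
    frozenset({"Milk", "Cereal", "Cheese"}): 8_500,
}


def support(items):
    """Support of an itemset as count / total."""
    return counts[frozenset(items)] / TOTAL


def rule_metrics(x, y):
    """Return (confidence, lift, leverage) for the rule X -> Y."""
    s_x, s_y = support(x), support(y)
    s_xy = support(set(x) | set(y))
    return s_xy / s_x, s_xy / (s_x * s_y), s_xy - s_x * s_y


rules = [
    ({"Rice"}, {"Oil"}),
    ({"Oil"}, {"Rice"}),
    ({"Milk"}, {"Cheese"}),
    ({"Cheese"}, {"Milk"}),
    ({"Cheese"}, {"Milk", "Cereal"}),
    ({"Milk", "Cereal"}, {"Cheese"}),
    ({"Cereal"}, {"Milk", "Cheese"}),
    ({"Milk", "Cheese"}, {"Cereal"}),
]

for x, y in rules:
    conf, lift, lev = rule_metrics(x, y)
    print(f"{sorted(x)} -> {sorted(y)}: "
          f"conf={conf:.3f} lift={lift:.3f} leverage={lev:.3f}")
```

Lift and leverage are symmetric in X and Y, which is why each pair of reversed rules shares the same lift and leverage while the confidences differ.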