Big Data - Week04 - Association Rules

The document discusses frequent itemset mining and association rule discovery, particularly in the context of supermarket sales data to identify items frequently bought together. It outlines the market-basket model, the importance of support and confidence in association rules, and various applications including retail and healthcare. Additionally, it describes algorithms for finding frequent itemsets and the challenges associated with processing large datasets efficiently.


Frequent Itemset Mining & Association Rules

Slides from: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org


Association Rule Discovery
Supermarket shelf management – the market-basket model:
◼ Goal: Identify items that are bought together by sufficiently many customers
◼ Approach: Process the sales data collected with barcode scanners to find dependencies among items
◼ A classic rule:
▪ If someone buys diapers and milk, then they are likely to buy beer
▪ Don’t be surprised if you find six-packs next to diapers!
The Market-Basket Model
◼ Input: A large set of items
▪ e.g., things sold in a supermarket
◼ Input: A large set of baskets, where each basket is a small subset of items
▪ e.g., the things one customer buys on one day
◼ Output: We want to discover association rules
▪ People who bought {x,y,z} tend to buy {v,w}
▪ Amazon!
◼ Example rules discovered: {Milk} --> {Coke}; {Diaper, Milk} --> {Beer}
Applications – (1)
◼ Items = products; Baskets = sets of products someone bought in one trip to the store
◼ Real market baskets: Chain stores keep TBs of data about what customers buy together
▪ Tells how typical customers navigate stores, lets them position tempting items
▪ Suggests tie-in “tricks”, e.g., run a sale on diapers and raise the price of beer
▪ Need the rule to occur frequently, or no $$’s
◼ Amazon’s “people who bought X also bought Y”
Applications – (2)
◼ Baskets = sentences; Items = documents containing those sentences
▪ Items that appear together too often could represent plagiarism
▪ Notice that items do not have to be “in” baskets

◼ Baskets = patients; Items = drugs & side-effects
▪ Has been used to detect combinations of drugs that result in particular side-effects
▪ But requires an extension: absence of an item needs to be observed as well as presence
Outline
First: Define
▪ Frequent itemsets
▪ Association rules: confidence, support, interestingness
Then: Algorithms for finding frequent itemsets
▪ Finding frequent pairs
▪ A-Priori algorithm
▪ PCY algorithm + 2 refinements
Frequent Itemsets
◼ Simplest question: Find sets of items that appear together “frequently” in baskets
◼ Support for itemset I: Number of baskets containing all items in I
▪ (Often expressed as a fraction of the total number of baskets)
◼ Given a support threshold s, the sets of items that appear in at least s baskets are called frequent itemsets
▪ Example (from the slide’s figure): support of {Beer, Bread} = 2
Example: Frequent Itemsets
◼ Items = {milk, coke, pepsi, beer, juice}
◼ Support threshold = 3 baskets
B1 = {m, c, b}      B2 = {m, p, j}
B3 = {m, b}         B4 = {c, j}
B5 = {m, p, b}      B6 = {m, c, b, j}
B7 = {c, b, j}      B8 = {b, c}

◼ Frequent itemsets: {m}, {c}, {b}, {j}, {m,b}, {b,c}, {c,j}
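To make the definition concrete, here is a minimal Python sketch of the support computation for this example. It is not from the slides; the `baskets`, `s`, and `support` names are our own:

```python
from itertools import combinations

# Baskets from the example above (m = milk, c = coke, p = pepsi, b = beer, j = juice)
baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]
s = 3  # support threshold

def support(itemset):
    """Number of baskets containing all items of the itemset."""
    return sum(set(itemset) <= basket for basket in baskets)

items = sorted(set().union(*baskets))
frequent_1 = [{i} for i in items if support({i}) >= s]                    # {b} {c} {j} {m}
frequent_2 = [set(p) for p in combinations(items, 2) if support(p) >= s]  # {b,c} {b,m} {c,j}
```

Running this reproduces the slide’s answer: the frequent singletons are {m}, {c}, {b}, {j} and the frequent pairs are {m,b}, {b,c}, {c,j}.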
Association Rules
◼ Association rules: if-then rules about the contents of baskets
◼ {i1, i2, …, ik} → j means: “if a basket contains all of i1, …, ik then it is likely to contain j”
◼ In practice there are many rules; we want to find the significant/interesting ones!
◼ Confidence of this association rule is the probability of j given I = {i1, …, ik}:
conf(I → j) = support(I ∪ {j}) / support(I)


Interesting Association Rules
◼ Not all high-confidence rules are interesting
▪ The rule X → milk may have high confidence for many itemsets X, because milk is simply purchased very often (independently of X), so the confidence will be high
◼ Interest of an association rule I → j: the difference between its confidence and the fraction of baskets that contain j:
Interest(I → j) = conf(I → j) − Pr[j]
▪ Interesting rules are those with high positive or negative interest values (usually above 0.5 in absolute value)


Example: Confidence and Interest
B1 = {m, c, b}      B2 = {m, p, j}
B3 = {m, b}         B4 = {c, j}
B5 = {m, p, b}      B6 = {m, c, b, j}
B7 = {c, b, j}      B8 = {b, c}

◼ Association rule: {m, b} → c
▪ Confidence = 2/4 = 0.5
▪ Interest = |0.5 – 5/8| = 1/8
▪ Item c appears in 5/8 of the baskets
▪ The rule is not very interesting!
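Reusing the `baskets` and `support` helper from the earlier sketch, confidence and interest can be computed directly (the function names are our own, not from the slides):

```python
def confidence(I, j):
    """conf(I -> j) = support(I ∪ {j}) / support(I)."""
    return support(set(I) | {j}) / support(I)

def interest(I, j):
    """Confidence minus the fraction of baskets that contain j."""
    return confidence(I, j) - support({j}) / len(baskets)

print(confidence({"m", "b"}, "c"))  # 2/4 = 0.5
print(interest({"m", "b"}, "c"))    # 0.5 - 5/8 = -0.125, i.e., |interest| = 1/8
```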
Finding Association Rules
◼ Problem: Find all association rules with support ≥ s and confidence ≥ c
▪ Note: The support of an association rule is the support of the set of items on both sides
◼ Hard part: Finding the frequent itemsets!
▪ If {i1, i2, …, ik} → j has high support and confidence, then both {i1, i2, …, ik} and {i1, i2, …, ik, j} will be “frequent”
Mining Association Rules
◼ Step 1: Find all frequent itemsets I
▪ (we will explain this next)
◼ Step 2: Rule generation (see the sketch below)
▪ For every subset A of I, generate a rule A → I \ A
▪ Since I is frequent, A is also frequent
▪ Variant 1: Single pass to compute the rule confidence
▪ confidence(A,B → C,D) = support(A,B,C,D) / support(A,B)
▪ Variant 2:
▪ Observation: If A,B,C → D is below confidence, so is A,B → C,D
▪ Can generate “bigger” rules from smaller ones!
▪ Output the rules above the confidence threshold
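A sketch of Step 2, again reusing the `support` helper defined earlier. The `rules` name is our own, and a production implementation would cache the itemset supports counted in Step 1 rather than recounting them:

```python
from itertools import combinations

def rules(I, min_conf):
    """Yield every rule A -> I \\ A from the frequent itemset I whose
    confidence support(I) / support(A) meets min_conf."""
    I = frozenset(I)
    for r in range(1, len(I)):
        for A in map(frozenset, combinations(I, r)):
            conf = support(I) / support(A)  # A ∪ (I \ A) = I
            if conf >= min_conf:
                yield set(A), set(I - A), conf

for A, B, conf in rules({"m", "c", "b"}, 0.75):
    print(A, "->", B, round(conf, 2))
```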
Example
B1 = {m, c, b}      B2 = {m, p, j}
B3 = {m, c, b, n}   B4 = {c, j}
B5 = {m, p, b}      B6 = {m, c, b, j}
B7 = {c, b, j}      B8 = {b, c}

◼ Support threshold s = 3, confidence threshold c = 0.75
◼ 1) Frequent itemsets:
▪ {b,m} {b,c} {c,m} {c,j} {m,c,b}
◼ 2) Generate rules:
▪ b → m: c = 4/6      b → c: c = 5/6      b,c → m: c = 3/5
▪ m → b: c = 4/5      …                   b,m → c: c = 3/4
Finding Frequent Itemsets
Itemsets: Computation Model
◼ Back to finding frequent itemsets
◼ Typically, data is kept in flat files rather than in a database system:
▪ Stored on disk
▪ Stored basket-by-basket
▪ Baskets are small, but we have many baskets and many items
▪ Expand baskets into pairs, triples, etc. as you read baskets
▪ Use k nested loops to generate all sets of size k
[Figure: a flat file of baskets; items are positive integers, and boundaries between baskets are –1]
Note: We want to find frequent itemsets. To find them, we have to count them. To count them, we have to generate them.
Computation Model
◼ The true cost of mining disk-resident data is usually the number of disk I/Os
◼ In practice, association-rule algorithms read the data in passes – all baskets are read in turn
◼ We measure the cost by the number of passes an algorithm makes over the data
Main-Memory Bottleneck
◼ For many frequent-itemset algorithms, main memory is the critical resource
▪ As we read baskets, we need to count something, e.g., occurrences of pairs of items
▪ The number of different things we can count is limited by main memory
▪ Swapping counts in/out is a disaster (why?)


Finding Frequent Pairs
◼ The hardest problem often turns out to be finding the frequent pairs of items {i1, i2}
▪ Why? Frequent pairs are common, frequent triples are rare
▪ Why? The probability of being frequent drops exponentially with size; the number of sets grows more slowly with size
◼ Let’s first concentrate on pairs, then extend to larger sets
◼ The approach:
▪ We always need to generate all the itemsets
▪ But we would only like to count (keep track of) those itemsets that in the end turn out to be frequent
Naïve Algorithm
◼ Naïve approach to finding frequent pairs
◼ Read the file once, counting in main memory the occurrences of each pair (see the sketch below):
▪ From each basket of n items, generate its n(n–1)/2 pairs by two nested loops
◼ Fails if (#items)^2 exceeds main memory
▪ Remember: #items can be 100K (Wal-Mart) or 10B (Web pages)
▪ Suppose 10^5 items and counts are 4-byte integers
▪ Number of pairs of items: 10^5(10^5 – 1)/2 ≈ 5·10^9
▪ Therefore, 2·10^10 bytes (20 gigabytes) of memory needed
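For reference, the naïve counting can be sketched in a few lines of Python (building on the `baskets` and `s` names from the earlier sketch; `pair_counts` is our own). It is exactly this in-memory table that blows up when (#items)^2 is large:

```python
from collections import defaultdict
from itertools import combinations

pair_counts = defaultdict(int)
for basket in baskets:
    # All n(n-1)/2 pairs of an n-item basket, i.e., the two nested loops
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

frequent_pairs = {p: c for p, c in pair_counts.items() if c >= s}
```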
Counting Pairs in Memory
Two approaches:
◼ Approach 1: Count all pairs using a matrix
◼ Approach 2: Keep a table of triples [i, j, c] = “the count of the pair of items {i, j} is c”
▪ If integers and item ids are 4 bytes, we need approximately 12 bytes for pairs with count > 0
▪ Plus some additional overhead for the hash table
Note:
◼ Approach 1 only requires 4 bytes per pair
◼ Approach 2 uses 12 bytes per pair (but only for pairs with count > 0)
Comparing the 2 Approaches
[Figure: the triangular matrix uses 4 bytes per pair; the table of triples uses 12 bytes per occurring pair]


Comparing the two approaches
◼ Approach 1: Triangular Matrix
▪ n = total number of items
▪ Count pair of items {i, j} only if i < j
▪ Keep pair counts in lexicographic order:
▪ {1,2}, {1,3}, …, {1,n}, {2,3}, {2,4}, …, {2,n}, {3,4}, …
▪ Pair {i, j} is at position (i – 1)(n – i/2) + j – i
▪ Total number of pairs n(n – 1)/2; total bytes = 2n^2
▪ The triangular matrix requires 4 bytes per pair
◼ Approach 2 uses 12 bytes per occurring pair (but only for pairs with count > 0)
▪ Beats Approach 1 if less than 1/3 of possible pairs actually occur
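A small sketch of the indexing formula in integer arithmetic (the function name is our own):

```python
def triangular_index(i, j, n):
    """1-based position of pair {i, j} (1 <= i < j <= n) in the lexicographic
    layout {1,2}, {1,3}, ..., {1,n}, {2,3}, ...
    Integer form of (i - 1)(n - i/2) + j - i."""
    assert 1 <= i < j <= n
    return (i - 1) * (2 * n - i) // 2 + (j - i)

# e.g., with n = 5: {1,2} -> 1, {1,5} -> 4, {2,3} -> 5, {4,5} -> 10
```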
Comparing the two approaches (continued)
◼ Same comparison as on the previous slide, with the key takeaway:
◼ The problem with the triangular matrix is that if we have too many items, the pair counts do not fit into memory
◼ Can we do better?
A-Priori Algorithm
A-Priori Algorithm – (1)
◼ A two-pass approach called A-Priori limits the need for main memory
◼ Key idea: monotonicity
▪ If a set of items I appears at least s times, so does every subset J of I
◼ Contrapositive for pairs: if item i does not appear in s baskets, then no pair including i can appear in s baskets
◼ So, how does A-Priori find frequent pairs?
A-Priori Algorithm – (2)
◼ Pass 1: Read baskets and count in main memory the occurrences of each individual item
▪ Requires only memory proportional to #items
◼ Items that appear ≥ s times are the frequent items
◼ Pass 2: Read baskets again and count in main memory only those pairs where both elements are frequent (from Pass 1)
▪ Requires memory proportional to the square of the number of frequent items (for the counts)
▪ Plus a list of the frequent items
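Putting the two passes together, here is a minimal sketch of A-Priori for pairs (the function is our own and is kept deliberately simple; a real implementation would store the pair counts in the triangular matrix or triples table discussed above):

```python
from collections import defaultdict
from itertools import combinations

def apriori_pairs(baskets, s):
    """Two-pass A-Priori sketch for frequent pairs."""
    # Pass 1: count individual items
    item_counts = defaultdict(int)
    for basket in baskets:
        for item in basket:
            item_counts[item] += 1
    frequent_items = {i for i, c in item_counts.items() if c >= s}

    # Pass 2: count only pairs whose two items are both frequent
    pair_counts = defaultdict(int)
    for basket in baskets:
        kept = sorted(i for i in basket if i in frequent_items)
        for pair in combinations(kept, 2):
            pair_counts[pair] += 1
    return {p: c for p, c in pair_counts.items() if c >= s}
```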
Main-Memory: Picture of A-Priori
[Figure: main-memory layout. Pass 1 holds the item counts; Pass 2 holds the frequent items plus the counts of pairs of frequent items (candidate pairs)]


Detail for A-Priori
◼ You can use the triangular matrix method with n = number of frequent items
▪ May save space compared with storing triples
◼ Trick: re-number the frequent items 1, 2, … and keep a table relating the new numbers to the original item numbers
Frequent Triples, Etc.
◼ For each k, we construct two sets of k-tuples (sets of size k):
▪ Ck = candidate k-tuples = those that might be frequent sets (support ≥ s) based on information from the pass for k–1
▪ Lk = the set of truly frequent k-tuples
[Figure: the pipeline – count all items (C1), filter to get L1; construct all pairs of items from L1 (C2), count and filter to get L2; construct C3 (to be explained); and so on]


Example
◼ Hypothetical steps of the A-Priori algorithm
▪ C1 = { {b} {c} {j} {m} {n} {p} }
▪ Count the support of itemsets in C1
▪ Prune non-frequent: L1 = { b, c, j, m }
▪ Generate C2 = { {b,c} {b,j} {b,m} {c,j} {c,m} {j,m} }
▪ Count the support of itemsets in C2
▪ Prune non-frequent: L2 = { {b,m} {b,c} {c,m} {c,j} }
▪ Generate C3 = { {b,c,m} {b,c,j} {b,m,j} {c,m,j} } **
▪ Count the support of itemsets in C3
▪ Prune non-frequent: L3 = { {b,c,m} }
** Note: here we generate the new candidates Ck from Lk–1 and L1, but one can be more careful with candidate generation. For example, in C3 we know {b,m,j} cannot be frequent since {m,j} is not frequent.
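A sketch of the simple candidate construction used above (the function name is ours; as the note says, a more careful generator would also prune candidates that have a non-frequent (k–1)-subset):

```python
def construct_candidates(L_prev, L1, k):
    """C_k from L_{k-1} and L_1, as in the example above."""
    return {a | b for a in L_prev for b in L1 if len(a | b) == k}

L1 = {frozenset([x]) for x in "bcjm"}
L2 = {frozenset(x) for x in [("b", "m"), ("b", "c"), ("c", "m"), ("c", "j")]}
C3 = construct_candidates(L2, L1, 3)
# {b,c,m} {b,c,j} {b,m,j} {c,m,j} -- matching the slide's C3
```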
A-Priori for All Frequent Itemsets
◼ One pass for each k (itemset size)
◼ Needs room in main memory to count each candidate k-tuple
◼ For typical market-basket data and reasonable support (e.g., 1%), k = 2 requires the most memory
◼ Many possible extensions:
▪ Association rules with intervals:
▪ For example: Men over 65 have 2 cars
▪ Association rules when items are in a taxonomy
▪ Bread, Butter → FruitJam
▪ BakedGoods, MilkProduct → PreservedGoods
▪ Lower the support s as the itemset gets bigger
PCY (Park-Chen-Yu) Algorithm
◼ Observation: In Pass 1 of A-Priori, most memory is idle
▪ We store only individual item counts
▪ Can we use the idle memory to reduce the memory required in Pass 2?
◼ Pass 1 of PCY: In addition to item counts, maintain a hash table with as many buckets as fit in memory
▪ Keep a count for each bucket into which pairs of items are hashed
▪ For each bucket, just keep the count, not the actual pairs that hash to the bucket!
PCY Algorithm – First Pass
FOR (each basket):
    FOR (each item in the basket):
        add 1 to item’s count;
    FOR (each pair of items):            // New in PCY
        hash the pair to a bucket;
        add 1 to the count for that bucket;
◼ A few things to note:
▪ Pairs of items need to be generated from the input file; they are not present in the file
▪ We are not just interested in the presence of a pair, but we need to see whether it is present at least s (support) times
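A sketch of PCY’s first pass in Python (our own function; any hash function on pairs works here, so Python’s built-in `hash` stands in for the slide’s bucket hash):

```python
from collections import defaultdict
from itertools import combinations

def pcy_pass1(baskets, num_buckets):
    """PCY pass 1 sketch: item counts plus one count per hash bucket,
    where every pair in every basket is hashed to some bucket."""
    item_counts = defaultdict(int)
    bucket_counts = [0] * num_buckets
    for basket in baskets:
        for item in basket:
            item_counts[item] += 1
        for pair in combinations(sorted(basket), 2):
            bucket_counts[hash(pair) % num_buckets] += 1
    return item_counts, bucket_counts
```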
Observations about Buckets
◼ Observation: If a bucket contains a frequent pair, then the bucket is surely frequent
◼ However, even without any frequent pair, a bucket can still be frequent ☹
▪ So, we cannot use the hash to eliminate any member (pair) of a “frequent” bucket
◼ But, for a bucket with total count less than s, none of its pairs can be frequent ☺
▪ Pairs that hash to this bucket can be eliminated as candidates (even if the pair consists of 2 frequent items)
◼ Pass 2: Only count pairs that hash to frequent buckets
PCY Algorithm – Between Passes
◼ Replace the buckets by a bit-vector:
▪ 1 means the bucket count exceeded the support s (call it a frequent bucket); 0 means it did not
◼ 4-byte integer counts are replaced by bits, so the bit-vector requires 1/32 of the memory
◼ Also, decide which items are frequent and list them for the second pass
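A sketch of this between-passes summarization (our own helpers; packing the bits into a single Python integer illustrates the 1/32-of-memory claim):

```python
def to_bitmap(bucket_counts, s):
    """One bit per bucket (1 = frequent bucket), packed into a single
    integer so the 4-byte counts shrink to 1/32 of the space."""
    bits = 0
    for idx, count in enumerate(bucket_counts):
        if count >= s:
            bits |= 1 << idx
    return bits

def is_frequent_bucket(bits, idx):
    """True iff bucket idx was marked frequent in the bit-vector."""
    return (bits >> idx) & 1 == 1
```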
PCY Algorithm – Pass 2
◼ Count all pairs {i, j} that meet the conditions for being a candidate pair:
1. Both i and j are frequent items
2. The pair {i, j} hashes to a bucket whose bit in the bit-vector is 1 (i.e., a frequent bucket)
◼ Both conditions are necessary for the pair to have a chance of being frequent
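And a sketch of Pass 2, reusing `is_frequent_bucket` from the previous sketch (the function name and signature are our own):

```python
from collections import defaultdict
from itertools import combinations

def pcy_pass2(baskets, frequent_items, bits, num_buckets, s):
    """PCY pass 2 sketch: count {i, j} only if both items are frequent
    and the pair hashes to a frequent bucket."""
    pair_counts = defaultdict(int)
    for basket in baskets:
        kept = sorted(i for i in basket if i in frequent_items)
        for pair in combinations(kept, 2):
            if is_frequent_bucket(bits, hash(pair) % num_buckets):
                pair_counts[pair] += 1
    return {p: c for p, c in pair_counts.items() if c >= s}
```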
Main-Memory: Picture of PCY
[Figure: main-memory layout. Pass 1 holds the item counts plus the hash table for pairs; Pass 2 holds the frequent items, the bitmap (which replaces the hash table), and the counts of candidate pairs]
Frequent Itemsets in < 2 Passes
◼ A-Priori, PCY, etc., take k passes to find frequent itemsets of size k
◼ Can we use fewer passes?
◼ Use 2 or fewer passes for all sizes, but we may miss some frequent itemsets
▪ Random sampling (often a cure for having too much data)
▪ SON (Savasere, Omiecinski, and Navathe)


Random Sampling (1)
◼ Take a random sample of the market baskets
◼ Run A-Priori or one of its improvements in main memory
▪ So we don’t pay for disk I/O each time we increase the size of itemsets
▪ Reduce the support threshold proportionally to match the sample size
[Figure: main memory holds a copy of the sample baskets plus space for counts]


Random Sampling (2)
◼ Optionally, verify that the candidate pairs are truly frequent in the entire data set with a second pass (avoids false positives)
◼ But you don’t catch sets that are frequent in the whole data but not in the sample
▪ A smaller threshold, e.g., s/1.25, helps catch more truly frequent itemsets
▪ But it requires more space
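A sketch of the sampling idea, reusing the `apriori_pairs` helper from the A-Priori sketch (the `fraction` and `safety` parameters are our own illustration of the proportional threshold reduction and the slide’s s/1.25 trick):

```python
import random

def sampled_frequent_pairs(baskets, s, fraction=0.1, safety=1.25):
    """Sampling sketch: keep each basket with probability `fraction`,
    then run in-memory A-Priori on the sample with the threshold scaled
    down to s * fraction and lowered a bit further by `safety`."""
    sample = [b for b in baskets if random.random() < fraction]
    sample_s = max(1, int(s * fraction / safety))
    return apriori_pairs(sample, sample_s)
```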
SON Algorithm – (1)
◼ Repeatedly read small subsets of the baskets into main memory and run an in-memory algorithm to find all frequent itemsets
▪ Note: we are not sampling, but processing the entire file in memory-sized chunks
◼ An itemset becomes a candidate if it is found to be frequent in any one or more subsets of the baskets
SON Algorithm – (2)
◼ On a second pass, count all the candidate itemsets and determine which are frequent in the entire set
◼ Key “monotonicity” idea: an itemset cannot be frequent in the entire set of baskets unless it is frequent in at least one subset
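A sketch of SON restricted to pairs, again reusing `apriori_pairs` (chunking by striding through the basket list is our own choice; any partition of the baskets into memory-sized chunks works):

```python
def son_pairs(baskets, s, num_chunks):
    """SON sketch: mine each memory-sized chunk with a proportionally
    lowered threshold, union the per-chunk results as candidates, then
    verify exact counts on a second pass over the full data."""
    candidates = set()
    for chunk_id in range(num_chunks):
        chunk = baskets[chunk_id::num_chunks]
        chunk_s = max(1, s * len(chunk) // len(baskets))
        candidates |= set(apriori_pairs(chunk, chunk_s))

    # Second pass: exact counts over the entire data set
    counts = {p: 0 for p in candidates}
    for basket in baskets:
        for p in candidates:
            if set(p) <= basket:
                counts[p] += 1
    return {p for p, c in counts.items() if c >= s}
```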


SON – Distributed Version
◼ SON lends itself to distributed data mining
◼ Baskets are distributed among many nodes
▪ Compute frequent itemsets at each node
▪ Distribute candidates to all nodes
▪ Accumulate the counts of all candidates
