Module1 Part2
The model: data
Transaction data: supermarket data
Market basket transactions:
t1: {bread, cheese, milk}
t2: {apple, eggs, salt, yogurt}
… …
tn: {biscuit, eggs, milk}
Concepts:
An item: an item/article in a basket
I: the set of all items sold in the store
A transaction: the items purchased in a basket; it may
have a TID (transaction ID)
A transactional dataset: A set of transactions
The model: rules
A transaction t contains X, a set of items
(itemset) in I, if X ⊆ t.
An association rule is an implication of the
form:
X → Y, where X, Y ⊂ I, and X ∩ Y = ∅
Rule strength measures
Support: The rule holds with support sup in T
(the transaction data set) if sup% of
transactions contain X ∪ Y.
sup = Pr(X ∪ Y).
Confidence: The rule holds in T with
confidence conf if conf% of transactions that
contain X also contain Y.
conf = Pr(Y | X)
An association rule is a pattern that states
that when X occurs, Y occurs with a certain
probability.
Support and Confidence
Support count: The support count of an
itemset X, denoted by X.count, in a data set
T is the number of transactions in T that
contain X. Assume T has n transactions.
Then,
support = (X ∪ Y).count / n
confidence = (X ∪ Y).count / X.count
Goal and key features
Goal: Find all rules that satisfy the user-
specified minimum support (minsup) and
minimum confidence (minconf).
Key Features
Completeness: find all rules.
No target item(s) on the right-hand-side
Mining with data on hard disk (not in memory)
An example
Transaction data:
t1: Beef, Chicken, Milk
t2: Beef, Cheese
t3: Cheese, Boots
t4: Beef, Chicken, Cheese
t5: Beef, Chicken, Clothes, Cheese, Milk
t6: Chicken, Clothes, Milk
t7: Chicken, Milk, Clothes
Assume:
minsup = 30%
minconf = 80%
An example frequent itemset:
{Chicken, Clothes, Milk} [sup = 3/7]
Association rules from the itemset:
Clothes → Milk, Chicken [sup = 3/7, conf = 3/3]
… …
Clothes, Chicken → Milk [sup = 3/7, conf = 3/3]
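The figures in this example can be checked directly from the definitions of support and confidence. The sketch below is a minimal illustration in Python and is not part of the original slides; the function names support_count, support and confidence are chosen here for illustration only.

```python
# The seven transactions from the example above.
transactions = [
    {"Beef", "Chicken", "Milk"},                        # t1
    {"Beef", "Cheese"},                                 # t2
    {"Cheese", "Boots"},                                # t3
    {"Beef", "Chicken", "Cheese"},                      # t4
    {"Beef", "Chicken", "Clothes", "Cheese", "Milk"},   # t5
    {"Chicken", "Clothes", "Milk"},                     # t6
    {"Chicken", "Milk", "Clothes"},                     # t7
]


def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def support(x, y, transactions):
    """Support of the rule X -> Y: (X u Y).count / n."""
    return support_count(x | y, transactions) / len(transactions)


def confidence(x, y, transactions):
    """Confidence of the rule X -> Y: (X u Y).count / X.count."""
    return support_count(x | y, transactions) / support_count(x, transactions)


# Rule: Clothes -> Milk, Chicken
sup1 = support({"Clothes"}, {"Milk", "Chicken"}, transactions)
conf1 = confidence({"Clothes"}, {"Milk", "Chicken"}, transactions)

# Rule: Clothes, Chicken -> Milk
sup2 = support({"Clothes", "Chicken"}, {"Milk"}, transactions)
conf2 = confidence({"Clothes", "Chicken"}, {"Milk"}, transactions)

print(sup1, conf1)
print(sup2, conf2)
```

Both rules come out with support 3/7 (about 0.43, above minsup = 30%) and confidence 3/3 = 1.0 (above minconf = 80%), matching the values given in the example.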
Transaction data representation
A simplistic view of shopping baskets.
Some important information is not considered,
e.g., the quantity of each item purchased and
the price paid.
Many mining algorithms
There are a large number of them!!
They use different strategies and data structures.
Their resulting sets of rules are all the same.
Given a transaction data set T, a minimum support, and
a minimum confidence, the set of association rules existing in
T is uniquely determined.
Any algorithm should find the same set of rules
although their computational efficiencies and
memory requirements may be different.
We study only one: the Apriori Algorithm
The Apriori algorithm
Probably the best known algorithm
Two steps:
Find all itemsets that have minimum support
(frequent itemsets, also called large itemsets).
Use frequent itemsets to generate rules.
Step 1: Mining all frequent itemsets
A frequent itemset is an itemset whose support
is ≥ minsup.
Key idea: The Apriori property (downward
closure property): any subset of a frequent
itemset is also a frequent itemset
ABC ABD ACD BCD
AB AC AD BC BD CD
A B C D
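The downward closure property can be checked on the seven-transaction example from these slides (minsup = 30%, i.e. a support count of at least 3 out of 7). The sketch below is a minimal illustration, not part of the original slides; the helper names support_count and nonempty_subsets are hypothetical.

```python
from itertools import combinations

# The seven transactions from the example in these slides.
transactions = [
    {"Beef", "Chicken", "Milk"},
    {"Beef", "Cheese"},
    {"Cheese", "Boots"},
    {"Beef", "Chicken", "Cheese"},
    {"Beef", "Chicken", "Clothes", "Cheese", "Milk"},
    {"Chicken", "Clothes", "Milk"},
    {"Chicken", "Milk", "Clothes"},
]
MIN_COUNT = 3  # smallest count with count / 7 >= 30%


def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def nonempty_subsets(itemset):
    """Yield every nonempty subset of `itemset`, including the itemset itself."""
    items = sorted(itemset)
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


# A frequent itemset: each of its subsets must also be frequent.
frequent = frozenset({"Chicken", "Clothes", "Milk"})
all_subsets_frequent = all(
    support_count(s, transactions) >= MIN_COUNT for s in nonempty_subsets(frequent)
)

# Read the other way round: a superset of an infrequent itemset
# cannot be frequent, so it never needs to be counted.
infrequent = frozenset({"Beef", "Milk"})
superset = infrequent | {"Chicken"}
superset_is_infrequent = support_count(superset, transactions) < MIN_COUNT

print(all_subsets_frequent, superset_is_infrequent)
```

The second check is what makes the property useful in practice: once {Beef, Milk} is known to be infrequent, any larger itemset containing it can be discarded without scanning the data.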
The Algorithm
Iterative algorithm (also called level-wise search):
Find all 1-item frequent itemsets; then all 2-item
frequent itemsets, and so on.
In each iteration k, only consider itemsets that
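A level-wise search of this kind can be sketched as below. This is a minimal illustration and not the slides' own pseudocode: the candidate-generation step (join two frequent (k-1)-itemsets, then prune any candidate that has an infrequent (k-1)-subset) is an assumption based on the Apriori property stated earlier, and the function name apriori_frequent_itemsets is hypothetical. The data is the seven-transaction example from these slides with minsup = 30%.

```python
from itertools import combinations

transactions = [
    {"Beef", "Chicken", "Milk"},
    {"Beef", "Cheese"},
    {"Cheese", "Boots"},
    {"Beef", "Chicken", "Cheese"},
    {"Beef", "Chicken", "Clothes", "Cheese", "Milk"},
    {"Chicken", "Clothes", "Milk"},
    {"Chicken", "Milk", "Clothes"},
]


def apriori_frequent_itemsets(transactions, minsup):
    """Return {itemset: support count} for all itemsets with support >= minsup."""
    n = len(transactions)

    def count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items if count(frozenset([i])) / n >= minsup}

    frequent = {}
    k = 1
    while current:
        for itemset in current:
            frequent[itemset] = count(itemset)
        k += 1
        # Join: unions of two frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune: drop candidates with any infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        # Count the surviving candidates against the data.
        current = {c for c in candidates if count(c) / n >= minsup}
    return frequent


result = apriori_frequent_itemsets(transactions, minsup=0.30)
for itemset, cnt in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), cnt)
```

The sketch rescans the transaction list for every candidate, which keeps it short but ignores the efficiency concerns (such as mining data on disk) that practical implementations address.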
Step 2: Generating rules from frequent itemsets
Frequent itemsets ≠ association rules
One more step is needed to generate
association rules
For each frequent itemset X,
For each proper nonempty subset A of X,
Let B = X - A
A → B is an association rule if
Confidence(A → B) ≥ minconf.
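The rule-generation procedure above can be sketched as follows. This is a minimal illustration, not the slides' own code; the function name generate_rules is hypothetical, and the input itemset is assumed to be frequent already (so every subset has a nonzero count). The data is the seven-transaction example from these slides with minconf = 80%.

```python
from itertools import combinations

transactions = [
    {"Beef", "Chicken", "Milk"},
    {"Beef", "Cheese"},
    {"Cheese", "Boots"},
    {"Beef", "Chicken", "Cheese"},
    {"Beef", "Chicken", "Clothes", "Cheese", "Milk"},
    {"Chicken", "Clothes", "Milk"},
    {"Chicken", "Milk", "Clothes"},
]


def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def generate_rules(x, transactions, minconf):
    """Return (A, B, confidence) for each rule A -> B, B = X - A, meeting minconf."""
    x = frozenset(x)
    x_count = support_count(x, transactions)
    rules = []
    for size in range(1, len(x)):  # proper, nonempty subsets only
        for combo in combinations(sorted(x), size):
            a = frozenset(combo)
            b = x - a
            # Confidence(A -> B) = X.count / A.count, since A u B = X.
            conf = x_count / support_count(a, transactions)
            if conf >= minconf:
                rules.append((a, b, conf))
    return rules


rules = generate_rules({"Chicken", "Clothes", "Milk"}, transactions, minconf=0.8)
for a, b, conf in rules:
    print(sorted(a), "->", sorted(b), f"conf = {conf:.2f}")
```

For the frequent itemset {Chicken, Clothes, Milk}, three of the six candidate rules pass the 80% threshold: those whose left-hand side is {Clothes}, {Chicken, Clothes} or {Clothes, Milk}.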