Association Rule (AR) Mining: Review
• Market-Basket Analysis
• Grocery store: large number of items
• Customers fill their market baskets with a subset of the items
• e.g., 98% of people who purchase diapers also buy beer
• Used for shelf management
• Used for deciding whether an item should be put on sale
• Other interesting applications
• Basket = document, Items = words
  Words appearing frequently together in documents may represent phrases or
  linked concepts. Can be used for intelligence gathering.
Association Rules
• The purchase of one product when another product is purchased represents an AR
• Used mainly in retail stores to
  • Assist in marketing
  • Shelf management
  • Inventory control
• Other domains: faults in telecommunication networks, traffic analysis,
  document analysis, bioinformatics, computational chemistry
• Key notions: transaction database; itemsets; frequent (large) itemsets
Types of Association Rules
• Boolean/Quantitative ARs
  Based on the type of values handled
  Bread ⇒ Butter (presence or absence)
  age(X, “30…39”) ∧ income(X, “42K…48K”) ⇒ buys(X, Projection TV)
• Single/Multi-Dimensional ARs
  Based on the number of dimensions (predicates) of data involved
  buys(X, Bread) ⇒ buys(X, Butter)
• Single/Multi-Level ARs
  Based on the levels of abstraction involved
  age(X, “30…39”) ⇒ buys(X, laptop)
  age(X, “30…39”) ⇒ buys(X, computer)
Support & Confidence
• A rule must have some minimum user-specified confidence
  1 & 2 ⇒ 3 has 90% confidence if, when a customer bought 1 and 2, in 90% of
  cases the customer also bought 3.
• A rule must have some minimum user-specified support
  1 & 2 ⇒ 3 should hold in some minimum percentage of transactions to have
  business value
• AR X ⇒ Y holds with support T if T% of the transactions in the DB contain
  both X and Y
Support & Confidence
[Venn diagram: transactions in which the customer buys beer, buys diapers, and buys both]
Support & Confidence
I = set of all items
D = transaction database
AR A ⇒ B has support s if s is the percentage of transactions in D that
contain A ∪ B (both A and B):
  s(A ⇒ B) = P(A ∪ B)
AR A ⇒ B has confidence c in D if c is the percentage of transactions in D
containing A that also contain B:
  c(A ⇒ B) = P(B | A) = P(A ∪ B) / P(A)
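These two measures can be computed directly from a transaction list. A minimal Python sketch (not from the slides; the toy transactions and item names are made up for illustration):

def support(itemset, transactions):
    # Fraction of transactions that contain every item in `itemset`
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(A, B, transactions):
    # c(A => B) = P(B | A) = support(A u B) / support(A)
    return support(A | B, transactions) / support(A, transactions)

transactions = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"milk", "bread"},
    {"diapers", "bread", "beer"},
]
A, B = {"diapers"}, {"beer"}
print(support(A | B, transactions))    # 0.75  -> s(A => B)
print(confidence(A, B, transactions))  # 1.0   -> c(A => B)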
Example
• Transaction Database [example table not recoverable from the slide]
Frequent Itemset Mining Algorithms
1. Apriori
2. Sampling
3. Partitioning
4. Hash-Based Technique
5. Transaction Reduction
6. etc.
Apriori Algorithm (Boolean ARs)
Candidate generation
Level-wise search:
  The frequent 1-itemsets (L1) are found first
  Then the frequent 2-itemsets (L2), and so on…
  Until no more frequent k-itemsets (Lk) can be found
Finding each Lk requires one full pass over the database
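A compact, hedged sketch of this level-wise search in Python (simplified and unoptimized; the join/prune details are spelled out on the following slides):

from itertools import combinations

def apriori(transactions, min_count):
    # Simplified level-wise Apriori: returns {frequent itemset: count}
    counts = {}
    for t in transactions:                       # pass 1: count 1-itemsets
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_count}   # L1
    all_frequent = dict(frequent)
    k = 2
    while frequent:
        # Join: unions of frequent (k-1)-itemsets that have size k
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        # One pass over the database per level to count the candidates
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_count}   # Lk
        all_frequent.update(frequent)
        k += 1
    return all_frequent

# e.g. apriori([{1,3,4}, {2,3,5}, {1,2,3,5}, {2,5}], min_count=2)
#      finds L1, L2 and L3 = {{2,3,5}} for that toy database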
Apriori Algorithm
• Apriori Property
  “All nonempty subsets of a frequent itemset must also be frequent”
  i.e., if {A, B} is a frequent itemset, then both {A} and {B} must be frequent itemsets
• Anti-Monotone Property
  “If a set cannot pass a test, all of its supersets will fail the test as well”
  If P(I) < min_sup, then P(I ∪ A) < min_sup, where A is any item
  The property is monotone in the context of failing a test
Frequent Itemset / Apriori Property: Example
If {a, c, d} is a large itemset, then {a, c}, {a, d}, {c, d}, {a}, {c}, {d},
and {} are large itemsets too.
[Itemset lattice over {a, b, c, d}: {} at the bottom, then the 1-itemsets, 2-itemsets, …, up to abcd]
Apriori Algorithm - Example
[Worked example; tables not recoverable from the slide: scan D to obtain C1 and L1,
join L1 with itself to obtain C2, scan D to obtain L2, generate C3, scan D to obtain L3]
Apriori Algorithm
2-Step Process
• Join step (candidate generation): given Lk-1, join it with itself to produce Ck
• Prune step: prunes those candidate itemsets any of whose (k-1)-subsets is not frequent

Candidate Generation
Given Lk-1
Ck = φ
For all itemsets l1 ∈ Lk-1 do
  For all itemsets l2 ∈ Lk-1 do
    If l1[1] = l2[1] ∧ l1[2] = l2[2] ∧ … ∧ l1[k-2] = l2[k-2] ∧ l1[k-1] < l2[k-1]
    Then c = l1[1], l1[2], l1[3], …, l1[k-1], l2[k-1]
         Ck = Ck ∪ {c}
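A direct translation of the join condition above into Python (itemsets are kept as sorted tuples so that the 1-based indexing l1[1]…l1[k-1] of the pseudocode maps onto tuple positions; the prune step is applied afterwards). A sketch only:

from itertools import combinations

def apriori_gen(L_prev, k):
    # L_prev: set of sorted (k-1)-tuples, i.e. L(k-1). Returns Ck.
    Ck = set()
    for l1 in L_prev:                 # join step
        for l2 in L_prev:
            if l1[:k - 2] == l2[:k - 2] and l1[k - 2] < l2[k - 2]:
                Ck.add(l1 + (l2[k - 2],))
    # prune step: drop candidates with an infrequent (k-1)-subset
    return {c for c in Ck if all(sub in L_prev for sub in combinations(c, k - 1))}

# Using the example on the next slide:
L3 = {('a','b','c'), ('a','b','d'), ('a','c','d'), ('a','c','e'), ('b','c','d')}
print(apriori_gen(L3, 4))   # {('a','b','c','d')} -- acde is pruned (ade not in L3)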
Example of Generating Candidates
• L3={abc, abd, acd, ace, bcd}
• Self-joining: L3*L3
• abcd from abc and abd
• acde from acd and ace
• Pruning:
• acde is removed because ade is not in L3
• C4 = {abcd}
ARs from FIs
• For each frequent itemset l, generate all non-empty proper subsets of l
• For every such subset s, output the rule s ⇒ (l − s) if its confidence,
  support(l) / support(s), is at least min_conf (see the sketch below)
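A hedged Python sketch of this step (the support counts are assumed to be available from the frequent-itemset pass; `support_count` is a hypothetical dict mapping frozensets to counts):

from itertools import combinations

def rules_from_itemset(l, support_count, min_conf):
    # For every non-empty proper subset s of the frequent itemset l,
    # emit s => (l - s) if confidence = support(l) / support(s) >= min_conf
    l = frozenset(l)
    rules = []
    for r in range(1, len(l)):
        for s in combinations(l, r):
            s = frozenset(s)
            conf = support_count[l] / support_count[s]
            if conf >= min_conf:
                rules.append((set(s), set(l - s), conf))
    return rules

counts = {frozenset({'a'}): 4, frozenset({'c'}): 5, frozenset({'a', 'c'}): 3}
print(rules_from_itemset({'a', 'c'}, counts, 0.7))
# [({'a'}, {'c'}, 0.75)]  -- the rule c => a has confidence 0.6 and is dropped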
[Itemset lattice over {a, b, c, d} with the candidate set C1 for the sampling example marked]
Sampling Algorithm: Example
Assume that L = {{a}, {c}, {d}, {a, c}, {a, d}, {c, d}} after the database scan
in step 4. Since {a, c} and {a, d} are in NB(PL), we need to execute step 5;
C2 will be L ∪ {{a, c, d}}.
[Itemset lattice over {a, b, c, d} with the enlarged candidate set C2 marked]
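The numbered steps of the sampling algorithm are not reproduced in these slides; as a rough, hedged sketch of the idea behind steps 4-5 (PL = itemsets frequent in the sample, NB(PL) = its negative border), in Python:

from itertools import combinations

def negative_border(PL, items):
    # NB(PL): minimal itemsets not in PL all of whose proper subsets are in PL.
    # PL is assumed downward-closed (a set of frozensets); items = all items.
    nb = set()
    for base in PL | {frozenset()}:
        for item in items:
            cand = base | {item}
            if cand in PL or cand in nb:
                continue
            if all(frozenset(s) in PL for s in combinations(cand, len(cand) - 1) if s):
                nb.add(cand)
    return nb

def sampling_verify(PL, transactions, min_count, items):
    # Step 4: count PL and its negative border against the full database
    C = PL | negative_border(PL, items)
    L = {c for c in C if sum(1 for t in transactions if c <= t) >= min_count}
    # Step 5 is needed only if something from the border turned out frequent,
    # as with {a, c} and {a, d} in the example above
    return L, bool(L - PL)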
Partitioning
• Instead of sampling transactions from the database, the database D is
  subdivided into n partitions D1, D2, …, Dn.
• Partitioning can improve performance because:
  – A globally large itemset must be large in at least one of the partitions.
  – The size of each partition can be adjusted so that it is small enough to
    fit in main memory.
Partitioning
Algorithm
1. Split database D into n partitions
2. Use the Apriori algorithm to find the set of large itemsets of each
   partition; let Li denote the set of large itemsets of partition i
3. Candidate set C = ∪i Li
4. Scan the original database and check the minimum support of each candidate
   c in C; if the criterion is met, add c to L
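A self-contained sketch of these four steps in Python (the per-partition miner below is a brute-force stand-in for Apriori, and the rounding of the local support threshold is an assumption made for illustration):

from itertools import combinations

def local_large_itemsets(partition, min_frac):
    # Brute-force stand-in for running Apriori on one partition (step 2)
    items = sorted({i for t in partition for i in t})
    min_count = max(1, int(min_frac * len(partition)))   # rounding choice assumed
    large = set()
    for k in range(1, len(items) + 1):
        found = False
        for cand in map(frozenset, combinations(items, k)):
            if sum(1 for t in partition if cand <= t) >= min_count:
                large.add(cand)
                found = True
        if not found:          # no large k-itemset => none of size k+1 either
            break
    return large

def partition_algorithm(transactions, n, min_frac):
    size = (len(transactions) + n - 1) // n                                  # step 1
    parts = [transactions[i:i + size] for i in range(0, len(transactions), size)]
    C = set().union(*(local_large_itemsets(p, min_frac) for p in parts))     # steps 2-3
    min_count = min_frac * len(transactions)
    return {c for c in C                                                     # step 4
            if sum(1 for t in transactions if c <= t) >= min_count}

# e.g. partition_algorithm(db, 3, 0.20), with db the list of 15 transaction
# sets from the example on the next slides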
Partitioning: Example
Minimum support σ = 20% (15 transactions over items A1…A9 ⇒ minimum support count = 3)

A1 A2 A3 A4 A5 A6 A7 A8 A9
1  0  0  0  1  1  0  1  0
0  1  0  1  0  0  0  1  0
0  0  0  1  1  0  1  0  0
0  1  1  0  0  0  0  0  0
0  0  0  0  1  1  1  0  0
0  1  1  1  0  0  0  0  0
0  1  0  0  0  1  1  0  1
0  0  0  0  1  0  0  0  0
0  0  0  0  0  0  0  1  0
0  0  1  0  1  0  1  0  0
0  0  1  0  1  0  1  0  0
0  0  0  0  1  1  0  1  0
0  1  0  1  0  1  1  0  0
1  0  1  0  1  0  1  0  0
0  1  1  0  0  0  0  0  1
Partitioning: Example
Apriori on the full database (minimum support count = 3):
L1 = { {2}:6, {3}:6, {4}:4, {5}:8, {6}:5, {7}:7, {8}:4 }
L2 = { {2,3}:3, {2,4}:3, {3,5}:3, {3,7}:3, {5,6}:3, {5,7}:5, {6,7}:3 }
L3 = { {3,5,7}:3 }

Partitioning (superscripts index the partitions; L¹ = locally large itemsets of partition 1):
L¹ = { {1}, {2}, {3}, {4}, {5}, {6}, {7}, {8}, {1,5}, {1,6}, {1,8},
       {2,3}, {2,4}, {2,8}, {4,5}, {4,7}, {4,8}, {5,6}, {5,8}, {5,7},
       {6,7}, {6,8}, {1,6,8}, {1,5,6}, {1,5,8}, {2,4,8}, {4,5,7},
       {5,6,8}, {5,6,7}, {1,5,6,8} }
L² = {……}   L³ = {…….}
The candidate set C = L¹ ∪ L² ∪ L³
Read the database once to compute the global support of the sets in C and
obtain the final set of frequent itemsets L
Hash-Based Algorithm
• The larger Ck is, the more processing cost is required to discover the
  frequent itemsets
• The hash-based approach reduces the size of Ck for k > 1
• DHP (PCY) has 2 major features:
  • Efficient generation of large itemsets (especially 2-itemsets)
  • Reduction of the transaction DB size (right after the generation of
    large 2-itemsets)
Hash-Based Algorithm
• Efficient counting
  • For each transaction, after the 1-itemsets are counted, the 2-itemsets of
    the transaction are generated and hashed into a hash table H2
  • When a 2-itemset is hashed to a bucket, the count of that bucket is
    incremented
• Subset function: finds all the candidates contained in a transaction
Hash-Based Algorithm: Example
Transactions: 100 = {1, 3, 4}, 200 = {2, 3, 5}, 300 = {1, 2, 3, 5}, 400 = {2, 5};
minimum support count = 2
C1 → L1 = { {1}, {2}, {3}, {5} }
L1*L1 = ( {1,2}, {1,3}, {1,5}, {2,3}, {2,5}, {3,5} )

Generating C2: while the 1-itemsets are counted, every 2-itemset of each
transaction is hashed into H2 with H(x, y) = ((order of x)*10 + (order of y)) mod 7
  100: (1,3), (1,4), (3,4)
  200: (2,3), (2,5), (3,5)
  300: (1,2), (1,3), (1,5), (2,3), (2,5), (3,5)
  400: (2,5)

Hash Table H2
Bucket no:   0                   1      2            3    4                   5      6
Contents:    {3,5} {3,5} {1,4}   {1,5}  {2,3} {2,3}  –    {2,5} {2,5} {2,5}   {1,2}  {1,3} {3,4} {1,3}
Count:       3                   1      2            0    3                   1      3
Bit vector:  1                   0      1            0    1                   0      1

A pair from L1*L1 is kept in C2 only if its bucket count meets the minimum
support (bit = 1). No. in the bucket for {1,2}, {1,3}, {1,5}, {2,3}, {2,5}, {3,5}: 1, 3, 1, 2, 3, 3
C2 = { {1,3}, {2,3}, {2,5}, {3,5} }
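A hedged Python sketch of this first pass (items are already numbers here, so the item number is used as its "order"; the minimum support count of 2 is the value implied by the bit vector above):

from itertools import combinations

def dhp_pass1(transactions, n_buckets=7, min_count=2):
    # Count 1-itemsets and, in the same pass, hash every 2-itemset of each
    # transaction into a bucket counter: H(x, y) = (10*x + y) mod n_buckets
    item_counts = {}
    buckets = [0] * n_buckets
    for t in transactions:
        for i in t:
            item_counts[i] = item_counts.get(i, 0) + 1
        for x, y in combinations(sorted(t), 2):
            buckets[(10 * x + y) % n_buckets] += 1
    bitmap = [int(c >= min_count) for c in buckets]
    L1 = sorted(i for i, c in item_counts.items() if c >= min_count)
    # C2: pairs of frequent items whose bucket survived (bit = 1)
    C2 = [(x, y) for x, y in combinations(L1, 2) if bitmap[(10 * x + y) % n_buckets]]
    return L1, C2, buckets, bitmap

db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]    # transactions 100-400 above
L1, C2, buckets, bitmap = dhp_pass1(db)
print(buckets)   # [3, 1, 2, 0, 3, 1, 3]
print(bitmap)    # [1, 0, 1, 0, 1, 0, 1]
print(C2)        # [(1, 3), (2, 3), (2, 5), (3, 5)]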
Multiple-Level Association Rules
• Items often form a hierarchy, e.g. Food → milk (skim, 2%) and bread (wheat,
  white), with brands such as Fraser and Sunset at the lowest level.
• Items at the lower levels are expected to have lower support.
• Rules regarding itemsets at appropriate levels could be quite useful:
  milk ⇒ bread [20%, 60%]
  2% milk ⇒ wheat bread [6%, 50%]
Multiple-Level Association Rules
Rules can be mined at a single level of the hierarchy or across levels, e.g.:
  2% milk ⇒ wheat bread (both items at the bottom level)
  2% milk ⇒ bread (cross-level rule)
Multi-level Association: Uniform Support vs. Reduced Support
• Uniform Support: the same minimum support for all levels
  + Only one minimum support threshold; no need to examine itemsets containing
    any item whose ancestors do not have minimum support
  – Lower-level items do not occur as frequently, so if the support threshold is
    • too high ⇒ miss low-level associations
    • too low ⇒ generate too many high-level associations
• Reduced Support: reduced minimum support at lower levels
Uniform Support
  Level 1 (min_sup = 5%): Milk [support = 10%]
  Level 2 (min_sup = 5%): 2% Milk [support = 6%], Skim Milk [support = 4%]

Reduced Support
  Level 1 (min_sup = 5%): Milk [support = 10%]
  Level 2 (min_sup = 3%): 2% Milk [support = 6%], Skim Milk [support = 4%]
Multi-level Association:
Redundancy Filtering
• Some rules may be redundant due to “ancestor” relationships between items.
• Example
  • milk ⇒ wheat bread [support = 8%, confidence = 70%]
  • 2% milk ⇒ wheat bread [support = 2%, confidence = 72%]
• We say the first rule is an ancestor of the second rule.
• A rule is redundant if its support is close to the “expected” value, based
  on the rule’s ancestor.
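For instance, if roughly a quarter of the milk sold is 2% milk, the support "expected" for the second rule from its ancestor is about 8% × 1/4 = 2%, which is exactly what was observed, so the specialized rule adds no new information. A small hedged sketch of that test (the 1/4 share and the closeness tolerance are illustrative assumptions, not from the slides):

def is_redundant(rule_support, ancestor_support, item_fraction, tol=0.25):
    # Redundant if the rule's support is close to the value expected from
    # its ancestor; `item_fraction` = share of the ancestor item's sales
    # accounted for by the specialized item, `tol` = closeness threshold
    expected = ancestor_support * item_fraction
    return abs(rule_support - expected) <= tol * expected

print(is_redundant(0.02, 0.08, 0.25))   # True: 2% matches the expected value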
Multi-Dimensional Association:
Concepts
• Single-dimensional rules:
buys(X, “milk”) ⇒ buys(X, “bread”)
• Multi-dimensional rules: ≥ 2 dimensions or predicates
• Inter-dimension association rules (no repeated predicates)
age(X,”19-25”) ∧ occupation(X,“student”) ⇒ buys(X,“coke”)
• hybrid-dimension association rules (repeated predicates)
age(X,”19-25”) ∧ buys(X, “popcorn”) ⇒ buys(X, “coke”)
• Categorical Attributes
• finite number of possible values, no ordering among values
• Quantitative Attributes
• numeric, implicit ordering among values
Techniques for Mining MD
Associations
• Search for frequent k-predicate set:
• Example: {age, occupation, buys} is a 3-predicate set.
• Techniques can be categorized by how quantitative attributes, such as age, are treated.
1. Using static discretization of quantitative attributes
• Quantitative attributes are statically discretized by using
predefined concept hierarchies.
2. Quantitative association rules
• Quantitative attributes are dynamically discretized into “bins” based on
  the distribution of the data.
3. Distance-based association rules
• This is a dynamic discretization process that considers the
distance between data points.
Static Discretization of
Quantitative Attributes
• Discretized prior to mining using concept hierarchy.
• Numeric values are replaced by ranges.
• In a relational database, finding all frequent k-predicate sets will require
  k or k+1 table scans.
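A hedged sketch of static discretization followed by mining-ready predicate items (the cut points, attribute names and record layout are illustrative assumptions, not from the slides):

AGE_BINS = [(19, 25, "19-25"), (26, 39, "26-39"), (40, 59, "40-59")]   # predefined hierarchy

def discretize_age(age):
    # Replace a numeric value by its predefined range
    for lo, hi, label in AGE_BINS:
        if lo <= age <= hi:
            return label
    return "other"

def to_predicate_items(record):
    # Map one record to items such as 'age=19-25' so that ordinary
    # (single-dimensional) frequent-itemset mining can be applied
    items = {"age=" + discretize_age(record["age"]),
             "occupation=" + record["occupation"]}
    items |= {"buys=" + p for p in record["buys"]}
    return items

record = {"age": 22, "occupation": "student", "buys": ["coke", "popcorn"]}
print(to_predicate_items(record))
# e.g. {'age=19-25', 'occupation=student', 'buys=coke', 'buys=popcorn'} (set order varies)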