
Association Rule Mining

• Market-Basket Analysis
• Grocery store: large number of items
• Customers fill their market baskets with a subset of those items
• Example: 98% of people who purchase diapers also buy beer
• Used for shelf management
• Used for deciding whether an item should be put on sale
• Other interesting applications
• Basket = documents, Items = words
  Words appearing frequently together in documents may represent phrases
  or linked concepts, and can be used for intelligence gathering.
Association Rules
• The purchase of one product when another product is purchased
  represents an association rule (AR)
• Used mainly in retail stores to
  • Assist in marketing
  • Shelf management
  • Inventory control
• Other domains: faults in telecommunication networks, traffic analysis,
  document analysis, bioinformatics, computational chemistry
• Transaction database
• Itemsets, frequent (large) itemsets
Types of Association Rules
• Boolean / Quantitative ARs
  Based on the type of values handled
  Bread ⇒ Butter (presence or absence)
  age(X, “30…39”) ∧ income(X, “42K…48K”) ⇒ buys(X, “projection TV”)
• Single- / Multi-Dimensional ARs
  Based on the number of dimensions of data involved
  buys(X, Bread) ⇒ buys(X, Butter)
• Single- / Multi-Level ARs
  Based on the levels of abstraction involved
  age(X, “30…39”) ⇒ buys(X, laptop)
  age(X, “30…39”) ⇒ buys(X, computer)
Support & Confidence
• A rule must have some minimum user-specified confidence
  1 & 2 ⇒ 3 has 90% confidence if, when a customer bought 1 and 2,
  in 90% of cases the customer also bought 3.
• A rule must have some minimum user-specified support
  1 & 2 ⇒ 3 should hold in some minimum percentage of transactions to
  have business value
• AR X ⇒ Y holds with support T if T% of the transactions in the DB
  contain both X and Y
Support & Confidence
[Venn diagram: transactions where the customer buys beer, transactions where
the customer buys diapers, and the overlap where the customer buys both]
Support & Confidence
I = set of all items
D = transaction database
AR A ⇒ B has support s if s is the percentage of transactions in D that
contain A ∪ B (both A and B):
  s(A ⇒ B) = P(A ∪ B)
AR A ⇒ B has confidence c in D if c is the percentage of transactions in D
containing A that also contain B:
  c(A ⇒ B) = P(B | A) = P(A ∪ B) / P(A)
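To make these definitions concrete, here is a minimal Python sketch; the
transaction database D below is an illustrative assumption, not a table from
the slides.

# Minimal sketch of support and confidence for itemsets represented as sets.
def support(itemset, D):
    """Fraction of transactions in D that contain every item in `itemset`."""
    return sum(1 for t in D if itemset <= t) / len(D)

def confidence(A, B, D):
    """Fraction of transactions containing A that also contain B (= P(B|A))."""
    return support(A | B, D) / support(A, D)

# Illustrative transaction database (assumed for demonstration only)
D = [{"bread", "butter", "milk"},
     {"bread", "butter"},
     {"milk"},
     {"bread", "milk"}]
print(support({"bread", "butter"}, D))        # 0.5
print(confidence({"bread"}, {"butter"}, D))   # 0.666...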
Example
• Transaction database (table not reproduced here)
• For minimum support = 50% and minimum confidence = 50%, we have the
  following rules:
  1 ⇒ 3 with 50% support and 66% confidence
  3 ⇒ 1 with 50% support and 100% confidence
Mining Association Rules
A 2-step process:

• Find all frequent itemsets,
  i.e. all itemsets satisfying min_sup
• Generate strong ARs from the frequent itemsets,
  i.e. ARs satisfying min_sup and min_conf
Frequent Itemsets (FIs)

Algorithms for finding FIs

1. Apriori
2. Sampling
3. Partitioning
4. Hash based Technique
5. Transaction Reduction
6. etc
Apriori Algorithm (Boolean ARs)
• Candidate generation
• Level-wise search:
  the frequent 1-itemsets (L1) are found first,
  then the frequent 2-itemsets (L2), and so on,
  until no more frequent k-itemsets (Lk) can be found
• Finding each Lk requires one full pass over the database
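As a rough illustration of this level-wise search, here is a compact,
self-contained Python sketch (function names and the sample transactions are
my own; candidate generation uses the join-and-prune idea described on the
following slides).

from itertools import combinations

def apriori(D, min_sup):
    """Level-wise search for frequent itemsets.
    D: list of transactions (sets of items); min_sup: minimum support count."""
    items = {frozenset([i]) for t in D for i in t}
    L = {}  # maps k -> set of frequent k-itemsets
    Lk = {c for c in items if sum(1 for t in D if c <= t) >= min_sup}
    k = 1
    while Lk:
        L[k] = Lk
        # Join step: merge pairs of frequent k-itemsets into (k+1)-candidates
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # Prune step: every k-subset of a candidate must itself be frequent
        Ck = {c for c in Ck
              if all(frozenset(s) in Lk for s in combinations(c, k))}
        # One pass over D to count the surviving candidates
        Lk = {c for c in Ck if sum(1 for t in D if c <= t) >= min_sup}
        k += 1
    return L

# Illustrative run (transactions assumed, not taken from the slides)
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
print(apriori(D, min_sup=2))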
Apriori Algorithm
• Apriori Property
  “All nonempty subsets of a frequent itemset must also be frequent”
  i.e., if {A, B} is a frequent itemset, both {A} and {B} must be
  frequent itemsets

• Anti-Monotone Property
  “If a set cannot pass a test, all of its supersets will fail the test
  as well”
  P(I) < min_sup ⇒ P(I ∪ A) < min_sup, where A is any item
  The property is monotonic in the context of failing a test
Frequent Itemset / Apriori Property: Example
If {a, c, d} is a large itemset then {a, c}, {a, d}, {c, d}, {a}, {c}, {d},
and {} are large itemsets too.

[Itemset lattice over {a, b, c, d}:
 {} ; a, b, c, d ; ab, ac, ad, bc, bd, cd ; abc, abd, acd, bcd ; abcd]
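The subsets in this example can be enumerated mechanically; a tiny Python
sketch (the helper name is mine):

from itertools import chain, combinations

def all_subsets(itemset):
    """Every subset of an itemset. By the Apriori property, if the itemset
    is frequent then each of these subsets must be frequent as well."""
    s = sorted(itemset)
    return list(chain.from_iterable(combinations(s, r) for r in range(len(s) + 1)))

print(all_subsets({'a', 'c', 'd'}))
# [(), ('a',), ('c',), ('d',), ('a', 'c'), ('a', 'd'), ('c', 'd'), ('a', 'c', 'd')]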
Apriori Algorithm – Example
[Figure: running example — scan database D to obtain C1 and L1, join L1 to
generate C2, scan D to obtain L2, generate C3, scan D to obtain L3]
Apriori Algorithm
A 2-step process:

• Join step (candidate generation)
  Ck is generated by joining Lk-1 with itself; this guarantees that no
  candidate of length > k is generated from Lk-1
• Prune step
  Prunes those candidate itemsets that have any (k-1)-subset which is
  not frequent
Candidate Generation
Given Lk-1:
  Ck = ∅
  for all itemsets l1 ∈ Lk-1 do
    for all itemsets l2 ∈ Lk-1 do
      if l1[1] = l2[1] ∧ l1[2] = l2[2] ∧ … ∧ l1[k-2] = l2[k-2] ∧ l1[k-1] < l2[k-1]
      then c = (l1[1], l1[2], …, l1[k-1], l2[k-1])
           Ck = Ck ∪ {c}
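A direct Python rendering of this join step, assuming each itemset in Lk-1
is stored as a sorted tuple so the comparison on the last element works as in
the pseudocode; the prune step (dropping candidates with an infrequent
(k-1)-subset) would follow.

def apriori_gen(L_prev, k):
    """Join step: build the candidate set Ck from Lk-1.
    L_prev: set of frequent (k-1)-itemsets, each a sorted tuple."""
    Ck = set()
    for l1 in L_prev:
        for l2 in L_prev:
            # Join l1 and l2 when their first k-2 items agree and
            # l1's last item precedes l2's last item.
            if l1[:k - 2] == l2[:k - 2] and l1[k - 2] < l2[k - 2]:
                Ck.add(l1 + (l2[k - 2],))
    return Ck

# Matches the example on the next slide (before pruning):
L3 = {('a','b','c'), ('a','b','d'), ('a','c','d'), ('a','c','e'), ('b','c','d')}
print(apriori_gen(L3, k=4))   # {('a','b','c','d'), ('a','c','d','e')}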
Example of Generating Candidates
• L3={abc, abd, acd, ace, bcd}
• Self-joining: L3*L3
• abcd from abc and abd
• acde from acd and ace

• Pruning:
• acde is removed because ade is not in L3

• C4 = {abcd}
ARs from FIs
• For each FI l, generate all non-empty subsets of l
• For each non-empty subset s of l, output the rule s ⇒ (l − s) if
  support_count(l) / support_count(s) ≥ min_conf
Example
• Suppose l = {2,3,5}
• Non-empty subsets: {2,3}, {2,5}, {3,5}, {2}, {3}, {5}
• Association Rules are
2,3 ⇒ 5 confidence 100%
2,5 ⇒ 3 confidence 66%
3,5 ⇒ 2 confidence 100%
2 ⇒ 3,5 confidence 100%
3 ⇒ 2,5 confidence 66%
5 ⇒ 2,3 confidence 100%
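A sketch of this rule-generation step in Python. The support counts below are
illustrative assumptions used only to drive the example; the slide's own
database is not reproduced here.

from itertools import combinations

def rules_from_itemset(l, support_count, min_conf):
    """Emit every rule s => (l - s) whose confidence meets min_conf."""
    l = frozenset(l)
    rules = []
    for r in range(1, len(l)):
        for s in map(frozenset, combinations(l, r)):
            conf = support_count[l] / support_count[s]
            if conf >= min_conf:
                rules.append((set(s), set(l - s), conf))
    return rules

# Illustrative support counts (assumed)
support_count = {
    frozenset({2}): 3, frozenset({3}): 3, frozenset({5}): 3,
    frozenset({2, 3}): 2, frozenset({2, 5}): 3, frozenset({3, 5}): 2,
    frozenset({2, 3, 5}): 2,
}
for lhs, rhs, conf in rules_from_itemset({2, 3, 5}, support_count, min_conf=0.6):
    print(lhs, "=>", rhs, f"  confidence {conf:.0%}")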
Apriori: Some Observations
• C2 = L1 * L1
• Number of candidates in C2 = |L1|C2, i.e. “|L1| choose 2”
• The larger C2 (or, in general, Ck), the more processing cost is
  required to discover the FIs
Variations of the Apriori
Many variations of Apriori have been proposed that focus on improving the
efficiency of the original algorithm:
• Hash-based technique – hashing itemset counts
• Transaction reduction – reducing the number of transactions scanned in
  future iterations
• Partitioning – partitioning the data to find candidate itemsets
• Sampling – mining on a subset of the given data
• Dynamic itemset counting – adding candidate itemsets at different
  points during a scan
Sampling Algorithm
• Random transactions of the original database are selected (sampled) and
  placed in a much smaller sampled database.
• The size of the sampled database is small enough that it can reside in
  main memory.
• This reduces the number of scans of the original database to at most two.
• Any standard algorithm, such as Apriori, can be used to find a set of
  large itemsets in the sampled database.
Sampling Algorithm cont…
• Since these large itemsets are found from the sampled database, some may
  not be actual large itemsets of the original database. They are therefore
  called potentially large itemsets, and PL denotes the set of potentially
  large itemsets.
• Some actual large itemsets may not be in PL. Additional candidates for
  large itemsets are determined by applying the negative border function,
  NB(), to PL.
• The negative border returns the itemsets that are not in PL but have all
  of their subsets in PL.
• Usually, the minimum support threshold is lowered when finding PL from
  the sampled database.
Sampling Algorithm: Algorithm
1. Sample transactions from the database D.
2. Use Apriori (or another algorithm) to find PL from the sampled database.
3. The candidate set C1 contains the itemsets from PL ∪ NB(PL).
4. Scan the original database and check the support of each candidate in
   C1. Those that meet the minimum support requirement are added into L.
5. If some itemsets from NB(PL) were added into L in step 4:
   initially set the candidate set C2 equal to L;
   repeatedly add NB(C2) into C2 until C2 no longer grows;
   scan the original database, check the support of each candidate in C2,
   and add the large itemsets into L.
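A sketch of the negative border computation in Python (the function name is
mine; the empty set is treated as trivially contained in PL so that missing
1-itemsets are included):

from itertools import combinations

def negative_border(PL, items):
    """Itemsets not in PL all of whose proper subsets are in PL."""
    PL = {frozenset(s) for s in PL} | {frozenset()}
    nb = set()
    max_size = max(len(s) for s in PL) + 1
    for k in range(1, max_size + 1):
        for cand in map(frozenset, combinations(sorted(items), k)):
            if cand not in PL and all(
                frozenset(sub) in PL for sub in combinations(cand, k - 1)
            ):
                nb.add(cand)
    return nb

# From the example below: PL = {{a}, {c}, {d}, {c,d}} over I = {a, b, c, d}
PL = [{'a'}, {'c'}, {'d'}, {'c', 'd'}]
print(negative_border(PL, items={'a', 'b', 'c', 'd'}))
# -> {b}, {a, c}, {a, d}, so PL U NB(PL) gives the C1 shown in the example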
Sampling Algorithm: Example
• Let I = {a, b, c, d}.
• After step 2, let PL = {{a}, {c}, {d}, {c, d}}.
• After step 3, C1 = PL ∪ NB(PL) = {{a}, {b}, {c}, {d}, {a, c}, {a, d}, {c, d}}.

[Itemset lattice over {a, b, c, d} with the members of C1 highlighted]
Sampling Algorithm: Example
Assume that L = {{a}, {c}, {d}, {a, c}, {a, d}, {c, d}} after the database
scan in step 4. Since {a, c} and {a, d} are in NB(PL), we need to execute
step 5. C2 will be L ∪ {{a, c, d}}.

[Itemset lattice over {a, b, c, d} with the members of C2 highlighted]
Partitioning
• Instead of sampling transactions, the database D is subdivided into n
  partitions D1, D2, …, Dn.
• Partitioning may improve performance because:
  – a large itemset must be large in at least one of the partitions;
  – the size of each partition can be adjusted so that it is small enough
    to fit in main memory.
Partitioning
Algorithm
1. Split database D into n partitions.
2. Use the Apriori algorithm to find the set of large itemsets of each
   partition; let Li denote the set of large itemsets of partition i.
3. Candidate set C = ∪i Li.
4. Scan the original database and check the support of each candidate
   c in C. If the minimum support criterion is met, add c into L.
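A sketch of this two-phase scheme in Python. The brute-force local miner is
only for illustration; in practice Apriori would be run on each in-memory
partition.

from itertools import combinations

def brute_force_frequent(D, min_count):
    """Naive local miner used for illustration (Apriori in practice)."""
    items = sorted({i for t in D for i in t})
    freq = set()
    for k in range(1, len(items) + 1):
        level = {frozenset(c) for c in combinations(items, k)
                 if sum(1 for t in D if set(c) <= t) >= min_count}
        if not level:      # no frequent k-itemsets => none of size k+1 either
            break
        freq |= level
    return freq

def partitioned_frequent_itemsets(D, n_partitions, min_sup_fraction):
    """Phase 1: mine each partition locally; phase 2: one global scan."""
    size = -(-len(D) // n_partitions)                    # ceiling division
    partitions = [D[i:i + size] for i in range(0, len(D), size)]
    # Candidate set C = union of the locally large itemsets of every partition
    C = set()
    for part in partitions:
        local_min = max(1, int(min_sup_fraction * len(part)))  # rounding down keeps the local threshold conservative
        C |= brute_force_frequent(part, local_min)
    # One scan of the whole database keeps only the globally large itemsets
    global_min = min_sup_fraction * len(D)
    return {c for c in C if sum(1 for t in D if c <= t) >= global_min}

Because every globally large itemset must be locally large in at least one
partition, the final result equals the frequent set L obtained by running
Apriori on the whole database (e.g., L = L1 ∪ L2 ∪ L3 in the example that
follows).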
Partitioning: Example
Minimum support σ = 20%. Transactions over items A1–A9 (one row per transaction):

A1 A2 A3 A4 A5 A6 A7 A8 A9
 1  0  0  0  1  1  0  1  0
 0  1  0  1  0  0  0  1  0
 0  0  0  1  1  0  1  0  0
 0  1  1  0  0  0  0  0  0
 0  0  0  0  1  1  1  0  0
 0  1  1  1  0  0  0  0  0
 0  1  0  0  0  1  1  0  1
 0  0  0  0  1  0  0  0  0
 0  0  0  0  0  0  0  1  0
 0  0  1  0  1  0  1  0  0
 0  0  1  0  1  0  1  0  0
 0  0  0  0  1  1  0  1  0
 0  1  0  1  0  1  1  0  0
 1  0  1  0  1  0  1  0  0
 0  1  1  0  0  0  0  0  1
Partitioning: Example
Apriori on the full database (support counts in parentheses):

L1 = {2} (6), {3} (6), {4} (4), {5} (8), {6} (5), {7} (7), {8} (4)
L2 = {2,3} (3), {2,4} (3), {3,5} (3), {3,7} (3), {5,6} (3), {5,7} (5), {6,7} (3)
L3 = {3,5,7} (3)

The frequent set L = L1 ∪ L2 ∪ L3
Partitioning: Example
Divide the database into 3 equal partitions, with local support
σ1 = σ2 = σ3 = σ = 20%.

Local large itemsets of partition 1:
L1 = {{1}, {2}, {3}, {4}, {5}, {6}, {7}, {8}, {1,5}, {1,6}, {1,8},
      {2,3}, {2,4}, {2,8}, {4,5}, {4,7}, {4,8}, {5,6}, {5,8}, {5,7},
      {6,7}, {6,8}, {1,6,8}, {1,5,6}, {1,5,8}, {2,4,8}, {4,5,7},
      {5,6,8}, {5,6,7}, {1,5,6,8}}
L2 = {……}   L3 = {…….}

The candidate set C = L1 ∪ L2 ∪ L3 (here Li denotes the local large itemsets
of partition i).
Read the database once to compute the global support of the sets in C and
obtain the final set of frequent itemsets L.
Hash-Based Algorithm
• The larger Ck, the more processing cost is required to discover the FIs
• Goal: reduce the size of Ck for k > 1
• DHP (also called PCY) has two major features:
  • efficient generation of frequent 2-itemsets
  • reduction of the transaction DB size (right after the generation of
    the large 2-itemsets)
Hash-Based Algorithm
• Efficient counting
  • For each transaction, after its 1-itemsets are counted, the
    2-itemsets of the transaction are generated and hashed into a hash
    table H2
  • Subset function: finds all the candidates contained in a transaction
  • When a 2-itemset is hashed to a bucket, the count of the bucket is
    incremented
Hash-Based Algorithm: Example
[Tables: C1 and L1 obtained from the first scan]
L1 * L1 = {{1,2}, {1,3}, {1,5}, {2,3}, {2,5}, {3,5}}
Hash-Based Algorithm: Example (generating C2)

Hash function: H(x, y) = ((order of x) * 10 + (order of y)) mod 7

2-itemsets generated per transaction:
  100: (1,3) (1,4) (3,4)
  200: (2,3) (2,5) (3,5)
  300: (1,2) (1,3) (1,5) (2,3) (2,5) (3,5)
  400: (2,5)

Hash table H2 (bit set when the bucket count meets the minimum support count):
  Bucket 0: (3,5) (3,5) (1,4)   count 3   bit 1
  Bucket 1: (1,5)               count 1   bit 0
  Bucket 2: (2,3) (2,3)         count 2   bit 1
  Bucket 3: –                   count 0   bit 0
  Bucket 4: (2,5) (2,5) (2,5)   count 3   bit 1
  Bucket 5: (1,2)               count 1   bit 0
  Bucket 6: (1,3) (3,4) (1,3)   count 3   bit 1

L1 * L1 = {{1,2}, {1,3}, {1,5}, {2,3}, {2,5}, {3,5}}
Bucket counts for these itemsets: 1, 3, 1, 2, 3, 3
Keeping only the itemsets whose bucket bit is set gives
C2 = {{1,3}, {2,3}, {2,5}, {3,5}}
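A Python sketch of the bucket counting behind this example. The transactions
and hash function come from the slide; the minimum support count of 2 is
inferred from the bit vector shown.

from itertools import combinations

# Transactions 100-400 from the example
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
MIN_COUNT = 2   # inferred from the bit vector on the slide

def h(x, y):
    """Hash function from the slide; item numbers double as their 'order'."""
    return (x * 10 + y) % 7

# Hash every 2-itemset of every transaction into one of 7 buckets
bucket_count = [0] * 7
for t in D:
    for x, y in combinations(sorted(t), 2):
        bucket_count[h(x, y)] += 1

bit_vector = [int(c >= MIN_COUNT) for c in bucket_count]
print(bucket_count)   # [3, 1, 2, 0, 3, 1, 3]
print(bit_vector)     # [1, 0, 1, 0, 1, 0, 1]

# C2: itemsets from L1 * L1 whose bucket is marked in the bit vector
L1 = [1, 2, 3, 5]
C2 = [(x, y) for x, y in combinations(L1, 2) if bit_vector[h(x, y)]]
print(C2)             # [(1, 3), (2, 3), (2, 5), (3, 5)]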
Multiple-Level Association Rules
• Items often form a hierarchy.
• Items at the lower levels are expected to have lower support.
• Rules regarding itemsets at the appropriate levels can be quite useful.

[Concept hierarchy: food → milk, bread; milk → skim, 2%; bread → wheat,
white; brand level below: Fraser, Sunset]

milk ⇒ bread            [20%, 60%]
2% milk ⇒ wheat bread   [6%, 50%]
Multiple-Level Association Rules
Mining multilevel association rules produces rules at several levels of
abstraction, e.g.:
  2% milk ⇒ wheat bread
  2% milk ⇒ bread
Multi-level Association: Uniform
Support vs. Reduced Support
• Uniform support: the same minimum support for all levels
  + One minimum support threshold; no need to examine itemsets containing
    any item whose ancestors do not have minimum support
  – Lower-level items do not occur as frequently. If the support threshold is
    • too high ⇒ low-level associations are missed
    • too low ⇒ too many high-level associations are generated
• Reduced support: reduced minimum support at lower levels
Uniform Support
Level 1 (min_sup = 5%):  Milk [support = 10%]
Level 2 (min_sup = 5%):  2% Milk [support = 6%]   Skim Milk [support = 4%]
Reduced Support
Level 1 (min_sup = 5%):  Milk [support = 10%]
Level 2 (min_sup = 3%):  2% Milk [support = 6%]   Skim Milk [support = 4%]
Multi-level Association:
Redundancy Filtering
• Some rules may be redundant due to “ancestor” relationships between items.
• Example:
  milk ⇒ wheat bread      [support = 8%, confidence = 70%]
  2% milk ⇒ wheat bread   [support = 2%, confidence = 72%]
• We say the first rule is an ancestor of the second rule.
• A rule is redundant if its support is close to the “expected” value,
  based on the rule’s ancestor.
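For instance, if roughly a quarter of all milk purchased is 2% milk (an
assumed figure), the expected support of “2% milk ⇒ wheat bread” based on its
ancestor is about 8% × 1/4 = 2%. Since the observed support is 2% and the
confidence (72%) is close to the ancestor’s (70%), the second rule carries no
extra information and can be filtered out.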
Multi-Dimensional Association:
Concepts
• Single-dimensional rules:
  buys(X, “milk”) ⇒ buys(X, “bread”)
• Multi-dimensional rules: ≥ 2 dimensions or predicates
  • Inter-dimension association rules (no repeated predicates):
    age(X, “19-25”) ∧ occupation(X, “student”) ⇒ buys(X, “coke”)
  • Hybrid-dimension association rules (repeated predicates):
    age(X, “19-25”) ∧ buys(X, “popcorn”) ⇒ buys(X, “coke”)
• Categorical attributes
  • finite number of possible values, no ordering among values
• Quantitative attributes
  • numeric, implicit ordering among values
Techniques for Mining MD
Associations
• Search for frequent k-predicate sets:
  • Example: {age, occupation, buys} is a 3-predicate set.
• Techniques can be categorized by how quantitative attributes (such as
  age) are treated:
  1. Static discretization of quantitative attributes
     • Quantitative attributes are statically discretized using
       predefined concept hierarchies.
  2. Quantitative association rules
     • Quantitative attributes are dynamically discretized into “bins”
       based on the distribution of the data.
  3. Distance-based association rules
     • A dynamic discretization process that considers the distance
       between data points.
Static Discretization of
Quantitative Attributes
• Discretized prior to mining using a concept hierarchy.
• Numeric values are replaced by ranges.
• In a relational database, finding all frequent k-predicate sets requires
  k or k+1 table scans.
• A data cube is well suited for mining: the cells of an n-dimensional
  cuboid correspond to the predicate sets, so mining from data cubes can be
  much faster.

[Cuboid lattice: () ; (age), (income), (buys) ; (age, income), (age, buys),
(income, buys) ; (age, income, buys)]
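A small sketch of this kind of static discretization in Python; the column
names, bin edges, and range labels are illustrative assumptions, not a
prescribed hierarchy.

import pandas as pd

# Illustrative customer records (assumed)
df = pd.DataFrame({
    "age":    [23, 37, 41, 58, 30],
    "income": [18_000, 45_000, 52_000, 31_000, 47_000],
    "buys":   ["coke", "laptop", "projection_tv", "bread", "laptop"],
})

# Replace numeric values by ranges taken from a predefined concept hierarchy
df["age"] = pd.cut(df["age"], bins=[0, 29, 39, 49, 120],
                   labels=["<30", "30-39", "40-49", "50+"])
df["income"] = pd.cut(df["income"], bins=[0, 30_000, 50_000, 10**9],
                      labels=["low", "medium", "high"])

print(df)   # each (age, income, buys) row is now a predicate tuple over ranges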
Quantitative Association
Rules
• Numeric attributes are dynamically discretized such that the confidence
  or compactness of the rules mined is maximized.
• 2-D quantitative association rules: Aquan1 ∧ Aquan2 ⇒ Acat
• Cluster “adjacent” association rules to form general rules using a 2-D
  grid.
• Example:
  age(X, “34-35”) ∧ income(X, “24K-48K”) ⇒ buys(X, “high resolution TV”)
Mining Distance-based
Association Rules
• Binning methods do not capture the semantics of interval data.
• Distance-based partitioning gives a more meaningful discretization by
  considering:
  • the density / number of points in an interval
  • the “closeness” of points in an interval