Unit 2 Material
Association Mining
• Given: (1) database of transactions, (2) each transaction is a list of items (purchased by
a customer in a visit)
• Find: all rules that correlate the presence of one set of items with that of another set of items
– E.g., 98% of people who purchase tires and auto accessories also get automotive
services done
• Applications
– * ⇒ Maintenance Agreement (what the store should do to boost
Maintenance Agreement sales)
– Home Electronics ⇒ * (what other products the store should stock up on)
– Attached mailing in direct marketing
– Detecting “ping-pong”ing of patients, faulty “collisions”
• Find all the rules X & Y ⇒ Z with minimum confidence and support
– support, s: probability that a transaction contains {X ∪ Y ∪ Z}
– confidence, c: conditional probability that a transaction having {X ∪ Y} also contains Z
– Example from the transaction table below, given as (support, confidence):
– A ⇒ C (50%, 66.7%)
– C ⇒ A (50%, 100%)
Transaction ID    Items Bought
2000              A, B, C
1000              A, C
4000              A, D
5000              B, E, F
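The support and confidence figures for A ⇒ C and C ⇒ A can be checked directly against the four transactions. The following is a minimal sketch; the function names `support` and `confidence` are illustrative choices, not part of the original material.

```python
# Transactions copied from the table above (Transaction ID -> items bought).
transactions = {
    2000: {"A", "B", "C"},
    1000: {"A", "C"},
    4000: {"A", "D"},
    5000: {"B", "E", "F"},
}

def support(itemset, db):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= items for items in db.values()) / len(db)

def confidence(lhs, rhs, db):
    """Conditional probability: support(lhs U rhs) / support(lhs)."""
    return support(set(lhs) | set(rhs), db) / support(lhs, db)

for lhs, rhs in [({"A"}, {"C"}), ({"C"}, {"A"})]:
    s = support(lhs | rhs, transactions)
    c = confidence(lhs, rhs, transactions)
    print(f"{sorted(lhs)} => {sorted(rhs)}: support={s:.1%}, confidence={c:.1%}")
```

Both rules share the same support because support depends only on the combined itemset {A, C}; the confidences differ because A appears in three transactions while C appears in two.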
Apriori: the method that mines the complete set of frequent itemsets using candidate generation.
Apriori property & the Apriori algorithm.
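Level-wise mining with candidate generation can be sketched as follows. This is a simplified illustration rather than a tuned implementation: the join step here simply unions pairs of frequent (k-1)-itemsets, and the 50% minimum support used in the example is an assumption chosen for demonstration.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    min_count = min_support * len(transactions)

    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    result = dict(frequent)

    k = 2
    while frequent:
        prev = set(frequent)
        # Join step: unions of two frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: drop a candidate if any (k-1)-subset is not frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        # Count step: one scan of the database for the surviving candidates.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_count}
        result.update(frequent)
        k += 1
    return result

# The four transactions from the table earlier in this unit.
db = [frozenset("ABC"), frozenset("AC"), frozenset("AD"), frozenset("BEF")]
freq = apriori(db, min_support=0.5)
for itemset, count in sorted(freq.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The prune step is where candidate generation pays off: a candidate with an infrequent subset is discarded before the database is scanned.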
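Apriori property
• Every subset of a frequent itemset must itself be frequent; if an itemset is infrequent, none of its supersets can be frequent.
Benefits of the FP-tree structure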
• Completeness:
– never breaks a long pattern of any transaction
– preserves complete information for frequent pattern mining
• Compactness:
– reduces irrelevant information: infrequent items are gone
– frequency-descending ordering: more frequent items are more likely to be shared
– is never larger than the original database (not counting node-links and counts)
– Example: for the Connect-4 DB, the compression ratio could be over 100
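The completeness and compactness points above read like the properties of a frequency-ordered prefix tree (the FP-tree). Assuming that reading, the sketch below builds such a tree without node-links, so it shows only prefix sharing, not the full mining structure. The class and function names and the minimum count of 2 are illustrative assumptions.

```python
from collections import Counter

class Node:
    """One tree node: an item, its count, and children keyed by item."""
    def __init__(self, item):
        self.item = item
        self.count = 0
        self.children = {}

def build_prefix_tree(transactions, min_count):
    # Pass 1: item frequencies; infrequent items are dropped entirely.
    freq = Counter(item for t in transactions for item in set(t))
    keep = {item for item, c in freq.items() if c >= min_count}

    root = Node(None)
    # Pass 2: insert each transaction in frequency-descending order
    # (ties broken by item name) so that common prefixes are shared.
    for t in transactions:
        ordered = sorted((i for i in set(t) if i in keep),
                         key=lambda i: (-freq[i], i))
        node = root
        for item in ordered:
            node = node.children.setdefault(item, Node(item))
            node.count += 1
    return root

def count_nodes(node):
    """Number of item nodes below `node` (the root itself is not counted)."""
    return sum(1 + count_nodes(child) for child in node.children.values())

db = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
tree = build_prefix_tree(db, min_count=2)
total_items = sum(len(t) for t in db)
print(count_nodes(tree), "tree nodes versus", total_items, "item occurrences")
```

Each transaction contributes at most one node per retained item, which is why the node count cannot exceed the number of item occurrences in the database.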
Concept hierarchy (figure):
• Food
– Milk: Skim, 2%
– Bread: Wheat, White
– Brands shown at the lowest level: Fraser, Sunset
TID    Items
T1     {111, 121, 211, 221}
T2     {111, 211, 222, 323}
T3     {112, 122, 221, 411}
T4     {111, 121}
T5     {111, 122, 211, 221, 413}
– If the same min_support is adopted across all levels, toss an item t if any of t’s ancestors is infrequent.
– If a reduced min_support is adopted at lower levels, examine only those
descendants whose ancestors are frequent (or have non-negligible support).
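The uniform-support rule can be illustrated on the encoded transactions T1 to T5. The sketch assumes that each digit of an item code names one level of the hierarchy (so `1`, `11` and `111` are ancestor, parent and item), and it uses a minimum count of 3 out of 5; both are assumptions made for the example, as is the function name.

```python
# Encoded transactions from the table above, kept as strings so prefixes are easy to take.
transactions = {
    "T1": {"111", "121", "211", "221"},
    "T2": {"111", "211", "222", "323"},
    "T3": {"112", "122", "221", "411"},
    "T4": {"111", "121"},
    "T5": {"111", "122", "211", "221", "413"},
}

def frequent_by_level(db, min_count, depth=3):
    """Top-down mining with one min_count for every level.

    A prefix is counted only if its parent prefix was frequent, so anything
    under an infrequent ancestor is tossed without being counted.
    """
    frequent = {}
    frequent_parents = None  # no ancestor check at level 1
    for level in range(1, depth + 1):
        counts = {}
        for items in db.values():
            for prefix in {code[:level] for code in items}:
                if frequent_parents is not None and prefix[:-1] not in frequent_parents:
                    continue  # ancestor infrequent: toss
                counts[prefix] = counts.get(prefix, 0) + 1
        frequent[level] = {p: c for p, c in counts.items() if c >= min_count}
        frequent_parents = set(frequent[level])
    return frequent

levels = frequent_by_level(transactions, min_count=3)
for level, items in levels.items():
    print(level, dict(sorted(items.items())))
```

With a reduced min_support at lower levels, the same loop would apply, except that `min_count` would vary with `level`.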
Correlation in detail.
$$\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$
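The observed-versus-expected statistic above (the chi-square statistic) can be computed for a contingency table as in the following sketch. The 2×2 counts are hypothetical numbers chosen so that the arithmetic is easy to follow; expected counts are derived from row and column totals under the independence assumption.

```python
def chi_square(observed):
    """Sum of (observed - expected)^2 / expected over all cells.

    `observed` is a list of rows of counts. The expected count of a cell is
    row_total * column_total / grand_total, i.e. what independence predicts.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (obs - expected) ** 2 / expected
    return chi2

# Hypothetical counts: rows = buys item X (yes/no), columns = buys item Y (yes/no).
table = [[40, 10],
         [10, 40]]
print(chi_square(table))  # every expected cell is 25, so 4 * (15**2 / 25) = 36.0
```

A value near zero suggests the two items occur independently; a large value suggests they are correlated, positively or negatively depending on which cells exceed their expected counts.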
Numeric correlation
• Database: (1) trans (TID, Itemset), (2) itemInfo (Item, Type, Price)
• A constrained association query (CAQ) is in the form of {(S1, S2) | C},
– where C is a set of constraints on S1, S2, including the frequency constraint
• A classification of (single-variable) constraints:
– Class constraint: S ⊂ A, e.g. S ⊂ Item
– Domain constraint:
• S θ v, θ ∈ {=, ≠, <, ≤, >, ≥}, e.g. S.Price < 100
• v θ S, θ is ∈ or ∉, e.g. snacks ∉ S.Type
• V θ S, or S θ V, θ ∈ {⊆, ⊂, ⊄, =, ≠}
– e.g. {snacks, sodas} ⊆ S.Type
– Aggregation constraint: agg(S) θ v, where agg is in {min, max, sum, count, avg}, and
θ ∈ {=, ≠, <, ≤, >, ≥}
• e.g. count(S1.Type) = 1, avg(S2.Price) < 100
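The constraint classes above can be read as simple predicates over an itemset. The sketch below is one possible reading, with loudly stated assumptions: the `ITEM_INFO` rows are invented sample data, the comparison operators are chosen for illustration, and a set-versus-value comparison such as S.Price < 100 is interpreted as "every price in S is below 100".

```python
# Hypothetical rows of itemInfo(Item, Type, Price); not data from the source.
ITEM_INFO = {
    "chips": ("snacks", 3.0),
    "cola": ("sodas", 2.0),
    "beer": ("beers", 8.0),
    "tv": ("electronics", 400.0),
}

def types_of(S):
    """S.Type: the set of item types occurring in S."""
    return {ITEM_INFO[item][0] for item in S}

def prices_of(S):
    """S.Price: the prices of the items in S."""
    return [ITEM_INFO[item][1] for item in S]

def all_prices_below(S, v):
    """Domain constraint, read as: every price in S is < v."""
    return all(p < v for p in prices_of(S))

def types_include(S, V):
    """Domain constraint, read as: V is a subset of S.Type."""
    return set(V) <= types_of(S)

def count_types_equals(S, n):
    """Aggregation constraint: count(S.Type) = n."""
    return len(types_of(S)) == n

def avg_price_below(S, v):
    """Aggregation constraint: avg(S.Price) < v."""
    prices = prices_of(S)
    return sum(prices) / len(prices) < v

def satisfies_query(S1, S2):
    """One illustrative constraint set C for a pair (S1, S2)."""
    return count_types_equals(S1, 1) and avg_price_below(S2, 100)

print(satisfies_query({"chips"}, {"cola", "beer"}))  # True
print(satisfies_query({"chips", "cola"}, {"tv"}))    # False: S1 spans two types
```

A full CAQ evaluation would also apply the frequency constraint, which needs the trans relation and is left out of this sketch.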
Constrained Association Query Optimization Problem
2. Succinct Constraint
• A subset of items Is ⊆ I is a succinct set if it can be expressed as σ_p(I) for some selection predicate
p, where σ is the selection operator
• SP ⊆ 2^I is a succinct power set if there is a fixed number of succinct sets I1, …, Ik ⊆ I, s.t. SP can
be expressed in terms of the strict power sets of I1, …, Ik using union and minus
• A constraint Cs is succinct provided SAT_Cs(I) is a succinct power set
3. Convertible Constraint
• Suppose all items in patterns are listed in a total order R
• A constraint C is convertible anti-monotone iff a pattern S satisfying the constraint implies
that each suffix of S w.r.t. R also satisfies C
• A constraint C is convertible monotone iff a pattern S satisfying the constraint implies
that each pattern of which S is a suffix w.r.t. R also satisfies C
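The two suffix-based definitions can be tested by brute force on a small example. The sketch below takes avg(S) ≥ v as the constraint, with hypothetical item values and threshold, and checks every pattern and every proper suffix under a value-ascending and a value-descending order R. The helper name `is_convertible` is an illustrative choice.

```python
from itertools import combinations

def is_convertible(values, descending, constraint, kind):
    """Brute-force test of the suffix-based definitions.

    Patterns are tuples listed in the order R (ascending or descending value).
    kind="anti-monotone": pattern satisfies C  =>  every suffix satisfies C.
    kind="monotone":      suffix satisfies C   =>  every pattern having it as
                          a suffix satisfies C.
    """
    ordered = sorted(values, reverse=descending)
    for k in range(2, len(ordered) + 1):
        for pattern in combinations(ordered, k):  # keeps the order R
            for i in range(1, k):
                suffix = pattern[i:]
                if kind == "anti-monotone" and constraint(pattern) and not constraint(suffix):
                    return False
                if kind == "monotone" and constraint(suffix) and not constraint(pattern):
                    return False
    return True

values = [9, 8, 6, 4, 3, 1]   # hypothetical item values (e.g. prices)
v = 5                         # hypothetical threshold

def avg_at_least_v(pattern):
    # avg(pattern) >= v, written without division to avoid rounding issues
    return sum(pattern) >= v * len(pattern)

print(is_convertible(values, False, avg_at_least_v, "anti-monotone"))  # ascending R
print(is_convertible(values, True, avg_at_least_v, "monotone"))        # descending R
```

Under an ascending order a suffix keeps the largest values, so its average cannot drop; under a descending order, prepending items to a suffix adds only larger values, so the average cannot drop either. The same constraint is therefore convertible in both senses, but with respect to opposite orders.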
Property of Constraints: Succinctness
• Succinctness:
– For any sets S1 and S2 satisfying C, S1 ∪ S2 satisfies C
– Given A1, the set of size-1 sets satisfying C, any set S satisfying C is based on A1,
i.e., it contains a subset belonging to A1
• Example:
– sum(S.Price) ≥ v is not succinct
– min(S.Price) ≤ v is succinct
• Optimization:
– If C is succinct, then C is pre-counting prunable. The satisfaction of the constraint
alone is not affected by the iterative support counting.
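Pre-counting pruning for a succinct constraint can be sketched with min(S.Price) ≤ v: the satisfying itemsets can be generated directly from a selection on the items, before any support counting. The prices and threshold below are hypothetical, and "strict power set" is taken to mean all non-empty subsets, which is an assumption about the intended definition.

```python
from itertools import combinations

prices = {"pen": 2, "book": 15, "lamp": 40, "tv": 400}  # hypothetical itemInfo prices
v = 10                                                  # hypothetical threshold

def strict_powerset(items):
    """All non-empty subsets of `items`, as frozensets."""
    items = sorted(items)
    return {frozenset(c)
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)}

# Selection: the items that satisfy the constraint on their own (the A1 of the text).
A1 = {item for item, price in prices.items() if price <= v}
rest = set(prices) - A1

# Generate the satisfying sets without testing each one:
# all non-empty subsets of I, minus those built only from items outside A1.
generated = strict_powerset(prices) - strict_powerset(rest)

# Reference answer: test min(S.Price) <= v on every non-empty subset.
checked = {S for S in strict_powerset(prices)
           if min(prices[i] for i in S) <= v}

print(generated == checked, len(generated))
```

Because the generated family depends only on the item prices, it can be fixed once up front, and support counting then runs only over itemsets that are already known to satisfy the constraint.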