Lecture 5

Association Rule Mining

Data Mining: Association Rules


Overview

■ Data Mining
■ Association rule mining
■ Apriori method
■ Some other methods – a brief overview.
■ We will look at these in detail later.



Association Mining?
• Association rule mining:
– Finding frequent patterns, associations, correlations, or
causal structures among sets of items or objects in
transaction databases, relational databases, and other
information repositories.
• Applications:
– Basket data analysis, cross-marketing, catalog design,
clustering, classification, etc.
• Examples:
– Rule form: “Body → Head [support, confidence]”.
– buys(x, “diapers”) → buys(x, “beers”) [0.5%, 60%]
– major(x, “CS”) ^ takes(x, “DB”) → grade(x, “A”) [1%, 75%]

Association Rules: Basic Concepts
• Given: (1) a database of transactions; (2) each transaction
is a list of items (purchased by a customer in one visit)
• Find: all rules that correlate the presence of one set of
items with that of another set of items
– E.g., 98% of people who buy a laptop also buy antivirus
software.

Association Rule

■ An example: there is a supermarket, and people buy items from it.
■ The items bought by each person are stored in a database.
■ Let the items be {A, B, C, …}.



Association Rule

■ A rule says: if a person buys the set of items {A,C,E},
then mostly he/she will also buy another set of items {D,F}.
■ {A,C,E} -> {D,F} is the association rule.
■ E.g.: people who buy potato chips also tend to buy cool drinks.
■ Potato chips -> cool drinks



Association Rule

■ But how good (sound) are these rules?
■ That is, how much can we trust these rules?
■ Are these rules useful?
■ How frequently is each rule applicable?



Association Rule

■ {D} -> {A} is an association rule.
■ According to the given database, this rule is true.
[confidence is high]
■ But only one person bought both D and A.
[support is low]



Association Rule

■ {A} -> {C} is an association rule.
■ According to the given database, this rule is only partly
true. [confidence is not 100%]
■ But 2 out of 4 bought both A and C.
[support is moderate]



Notation and Definitions

■ Let I be the set of all items.
■ Let X, Y, … be subsets of I.
■ We call X, Y, … itemsets.
■ If X has k items, then X is called a k-itemset.
■ Let I be of size n; that is, there are n items in total.
■ Then the total number of (non-empty) itemsets is 2^n – 1.
■ An association rule is of the form X -> Y.





The Example

For rule A ⇾ C :
support = 0.5 (or 50%)
confidence = 0.666 (or 66.6%)



Notation

■ Support is normally defined for an itemset.
■ Support(X) = the percentage of transactions containing X.
■ Confidence is defined for a rule.
■ Confidence(X ⇾ Y) = Support(X ∪ Y) / Support(X)

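These definitions are easy to turn into code. A minimal Python sketch (the four-transaction database below is hypothetical, standing in for the example database on the slides, which is not reproduced in this text):

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, body, head):
    """Confidence(body -> head) = support(body U head) / support(body)."""
    return support(transactions, set(body) | set(head)) / support(transactions, body)

# Hypothetical database, consistent with the slides' A -> C example:
# A appears in 3 of 4 transactions, and A and C appear together in 2.
db = [{"A", "C"}, {"A", "C", "D"}, {"A", "B"}, {"B", "C"}]
```

With this database, support(db, {"A", "C"}) is 0.5 and confidence(db, {"A"}, {"C"}) is 2/3 ≈ 0.666, matching the numbers quoted for A ⇾ C.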


An Exercise Problem

Transaction Id   Items bought
100              A,B,C
101              B,C
102              A,C
103              A,B,D
104              A,B,C
105              A,C,E
106              B,D
107              A,B,C

Find the support and confidence of A ⇾ B.
Find the support and confidence of B ⇾ A.
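A quick way to check your answers is to count directly (a small sketch; the rule X ⇾ Y has support = freq(X ∪ Y)/n and confidence = freq(X ∪ Y)/freq(X)):

```python
transactions = {
    100: {"A", "B", "C"}, 101: {"B", "C"}, 102: {"A", "C"}, 103: {"A", "B", "D"},
    104: {"A", "B", "C"}, 105: {"A", "C", "E"}, 106: {"B", "D"}, 107: {"A", "B", "C"},
}

def freq(itemset):
    """Number of transactions containing every item of `itemset`."""
    return sum(itemset <= items for items in transactions.values())

n = len(transactions)
sup_ab = freq({"A", "B"}) / n             # support of A -> B (same for B -> A)
conf_ab = freq({"A", "B"}) / freq({"A"})  # confidence of A -> B
conf_ba = freq({"A", "B"}) / freq({"B"})  # confidence of B -> A
```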




Functional Dependency in DBMS

■ Functional dependencies (FDs) in relational databases are
association rules with 100% confidence.
■ For FDs, support is irrelevant.



The Problem of finding Association Rules

■ Given a transactional database, find all association rules
satisfying the given minimum support and confidence.



The Problem of finding Association Rules

■ This problem boils down to two sub-problems:
■ Find all itemsets whose support is more than the minimum value.
■ This is called frequent itemset mining.
■ Find the association rules using the frequent itemsets.



The Problem

■ Frequent itemset mining is the more difficult problem.
■ Find all itemsets whose support is more than a given value.
■ How difficult is this problem?



Association rules from frequent
itemsets can be easily found !!

But how do we know that the support of A (the body of a rule) has already
been captured? Because A is a subset of a frequent itemset, A itself is
frequent, so its support was counted during frequent itemset mining.



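A sketch of this rule-generation phase in Python, assuming the frequent itemsets and their supports have already been computed (function and variable names are illustrative). The key point: every subset of a frequent itemset is itself frequent, so the support of each candidate rule body is guaranteed to be in the table.

```python
from itertools import combinations

def gen_rules(supports, min_conf):
    """Generate rules body -> head from frequent itemsets.

    `supports` maps each frequent itemset (a frozenset) to its support.
    Because every subset of a frequent itemset is frequent,
    supports[body] is always present.
    """
    rules = []
    for itemset, sup in supports.items():
        for r in range(1, len(itemset)):  # every proper, non-empty body
            for body in map(frozenset, combinations(itemset, r)):
                conf = sup / supports[body]
                if conf >= min_conf:
                    rules.append((set(body), set(itemset - body), conf))
    return rules
```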




A Naive Algorithm

1) For each itemset, create a counter.
2) Initialize all counters to zero.
3) For each transaction in the database, find all subsets of the
transaction and increment their respective counters.
4) Select those itemsets whose counter value is more than the given
threshold.

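The steps above can be sketched as follows (only subsets that actually occur get a counter, which spares the explicit initialization of 2^n – 1 counters but is still exponential in the width of each transaction):

```python
from itertools import combinations

def naive_frequent_itemsets(transactions, min_count):
    """One database scan; one counter per itemset observed in a transaction."""
    counters = {}
    for t in transactions:
        items = sorted(t)
        # Steps 1-3: enumerate every non-empty subset of the transaction
        # and increment its counter.
        for r in range(1, len(items) + 1):
            for subset in combinations(items, r):
                counters[subset] = counters.get(subset, 0) + 1
    # Step 4: keep the itemsets whose count reaches the threshold.
    return {s: c for s, c in counters.items() if c >= min_count}
```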


Analysis of the Algorithm

■ If there are n items, then the total number of counters is 2^n – 1.
■ If n is small (say, < 20), then this is a feasible solution.
■ But when n is large (say, 1000), it is not feasible to create
2^1000 – 1 counters.
■ As an exercise, try to figure out how big this number is.



Analysis

■ Time complexity is O(m). [Good]
■ Number of database scans: only one. [Good]
■ Space complexity is O(2^n). [Very Bad]
■ In data mining, the number of database scans is an important
measure of scalability.



Other Naïve Method

■ The other way is to use only one counter and find the support
of each itemset separately.
■ For this, one has to scan the database 2^n – 1 times.
■ Space complexity is reduced, but time complexity is increased.



Apriori Algorithm

■ One of the first algorithms to solve this problem in a better way.
■ It uses an important property of itemsets:
■ Every subset of a frequent itemset must also be a frequent itemset.
■ i.e., if {A,B} is a frequent itemset, both {A} and {B} must also
be frequent.
■ If either {A} or {B} is not frequent, then {A,B} is also
non-frequent.



Apriori Algorithm

■ Some itemsets can be discarded at an early stage.
■ For example, if X is a non-frequent itemset, then there is no
need to consider any superset of X.
■ But if X is frequent, then a superset of X may also be frequent.



Apriori Algorithm

■ This is a bottom-up method.
■ First find the frequent 1-itemsets, then the frequent 2-itemsets, …
■ Suppose we have already found the frequent k-itemsets.
■ We call this set L_k.



Apriori Algorithm Continued …

■ We generate candidates that can be frequent (k+1)-itemsets.
■ We call this set of candidates C_{k+1}.
■ We count these candidates and find L_{k+1}.



How candidates are generated

■ If {A,B,C} and {A,B,D} are two itemsets in L_3, then {A,B,C,D}
is a candidate itemset in C_4, provided all of its subsets of
size 3 are in L_3.
■ If, for example, {B,C,D} is not in L_3, then {A,B,C,D} cannot be
frequent and is removed from C_4. [This is called the pruning step.]

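The join and pruning steps can be sketched as follows (a minimal illustration; real implementations join only pairs that share their first k–1 items, for efficiency):

```python
from itertools import combinations

def gen_candidates(freq_k):
    """Generate C_{k+1} from L_k, a set of frozensets, each of size k."""
    if not freq_k:
        return set()
    k = len(next(iter(freq_k)))
    candidates = set()
    for a in freq_k:
        for b in freq_k:
            union = a | b
            # Join step: merging two k-itemsets must yield a (k+1)-itemset.
            if len(union) == k + 1:
                # Pruning step: every k-subset of the candidate must be frequent.
                if all(frozenset(s) in freq_k for s in combinations(union, k)):
                    candidates.add(union)
    return candidates
```

For example, with L_3 = {{A,B,C}, {A,B,D}} only, the merged candidate {A,B,C,D} is pruned, because {A,C,D} and {B,C,D} are not in L_3.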


The Apriori Algorithm
C_k : candidate itemsets of size k
L_k : frequent itemsets of size k

Find L_1;
for (k = 1; L_k != ∅; k++) do begin
    C_{k+1} = candidates generated from L_k;
    for each transaction t in the database do
        increment the count of every candidate in C_{k+1}
        that is contained in t;
    L_{k+1} = candidates in C_{k+1} with min_support;
end;
return ∪_k L_k;

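The pseudocode above can be fleshed out as a short Python sketch (candidate generation and pruning are inlined; this is an illustration, not an optimized implementation):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset: support} for all frequent itemsets.

    `transactions` is a list of sets; `min_support` is a fraction in (0, 1].
    """
    n = len(transactions)
    min_count = min_support * n

    # Find L_1 with one scan.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c / n for s, c in counts.items() if c >= min_count}
    level = set(frequent)  # current L_k
    k = 1

    while level:
        # Generate C_{k+1} from L_k, with the Apriori pruning step.
        candidates = set()
        for a in level:
            for b in level:
                u = a | b
                if len(u) == k + 1 and all(
                    frozenset(s) in level for s in combinations(u, k)
                ):
                    candidates.add(u)
        # One database scan to count the surviving candidates.
        counts = dict.fromkeys(candidates, 0)
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        # L_{k+1}: candidates that reach the minimum support.
        level = set()
        for c, cnt in counts.items():
            if cnt >= min_count:
                frequent[c] = cnt / n
                level.add(c)
        k += 1
    return frequent
```

Each pass over the `while` loop corresponds to one database scan, so the number of scans equals the size of the largest frequent itemset.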


The Apriori Algorithm — Example
[Slide figure: a worked example on database D — scan D to count C1 and
obtain L1; generate C2 and scan D to obtain L2; generate C3 and scan D
to obtain L3.]


Analysis of Apriori Algorithm

■ If the largest itemset size is k then we


need to scan the database atleast k times.

■ The space required depends on the


number of candidates generated.

■ But, certainly this is better than the naïve


methods.



Exercise Problem

Transaction Id   Items bought
100              A,B,C,D,E
101              A,B,C,D,F
102              B,C,F
103              A,C,F,G

Let the minimum support required be 50%. Find all frequent itemsets
using the Apriori algorithm.
At each stage, show the candidates generated and describe how the
Apriori property is used to prune the candidate set.





Methods to Improve Apriori’s Efficiency

■ Hash-based itemset counting: a k-itemset whose corresponding
hash-bucket count is below the threshold cannot be frequent.
■ Transaction reduction: a transaction that does not contain any
frequent k-itemset is useless in subsequent scans.
■ Partitioning: any itemset that is potentially frequent in DB must
be frequent in at least one of the partitions of DB.
Methods to Improve Apriori’s Efficiency

■ Sampling: mine a subset of the given data with a lower support
threshold, plus a method to determine completeness.
■ Dynamic itemset counting: add new candidate itemsets only when
all of their subsets are estimated to be frequent.



Mining Frequent Patterns
Without Candidate Generation

■ Compress a large database into a compact Frequent-Pattern tree
(FP-tree) structure.
■ The FP-tree is highly condensed, but complete for frequent
pattern mining.
■ This avoids costly repeated database scans.



FP-tree based mining

■ Develops an efficient FP-tree-based frequent pattern mining method.
■ A divide-and-conquer methodology: decompose mining tasks into
smaller ones.
■ Avoids candidate generation: sub-database tests only!



Partition based methods

■ Partition the database and then apply divide-and-conquer strategies.



Summary

■ Association rule mining is probably the most significant
contribution from the database community to KDD.
■ A large number of papers have been published.
■ Many interesting issues have been explored.
■ An interesting research direction: association analysis on other
types of data: spatial data, multimedia data, time-series data, etc.



Thank you !!!