Unit 5

Download as pptx, pdf, or txt
Download as pptx, pdf, or txt
You are on page 1of 40

102046706 – Data Mining &

Business Intelligence

Unit-5
Concept Description and
Association Rule Mining
Outline
 What is concept description?
 Market basket analysis
 Association Rule Mining
 Generating Rules
 Improved apriori algorithm
 Incremental ARM (Association Rule Mining)
 Associative Classification
 Rule Mining
Concept description
 Data mining can be classified into two categories: descriptive data
mining and predictive data mining.
 Descriptive data mining describes the data set in a concise and
summative manner and presents interesting general properties
of the data.
 Predictive data mining analyzes the data in order to construct one
or a set of models, and attempts to predict the behavior of new
data sets.
 Database is usually storing the large amounts of data in great
detail. However users often like to view sets of summarized data
in concise, descriptive terms.
Concept description (Cont..)
The simplest kind of descriptive data mining is called
concept description.
 A concept usually refers to a collection of data such as
frequent_buyers, graduate_students etc.
 As a data mining task, concept description is not a simple
enumeration (number of things done one by one) of the data.
 Concept description generates descriptions for characterization
and comparison of the data it is also called class description.
Concept description (Cont..)
 Characterization provides a concise and brief summarization of
the data.
 While concept or class comparison (also known as discrimination)
provides discriminations (inequity) comparing two or more
collections of data.
 Example
• Given the ABC Company database, for example, examining individual
customer transactions.
• Sales managers may prefer to view the data generalized to higher levels,
such as summarized by customer groups according to geographic regions,
frequency of purchases per group and customer income.
Market basket analysis
 Frequent itemset mining leads to the discovery of associations and
correlations among items in large transactional or relational data
sets.
 With massive amounts of data continuously being collected and
stored, many industries are becoming interested in mining such
patterns from their databases.
 The discovery of interesting correlation relationships among huge
amounts of business transaction records can help in many
business decision-making processes such as
✔ catalog design,
✔ cross-marketing, and
✔ customer shopping behavior analysis.
Market basket analysis (Cont..)
 An example of frequent itemset mining is market basket analysis.
 This process analyzes customer buying habits by finding
associations between the different items that customers place in
their “shopping baskets”.
 The discovery of these associations can help retailers develop
marketing strategies by gaining insight into which items are
frequently purchased together by customers.
 For instance, if customers are buying milk, how likely are they to
also buy bread on the same trip.
Market basket analysis (Cont..)
Association rule mining
 Given a set of transactions, we need rules that will predict the
occurrence of an item based on the occurrences of other items in
the transaction.
 Market-Basket transactions
TID Items
1 Bread, Milk Example of Association Rules
2 Bread, Chocolate, Pepsi, Eggs
3 Milk, Chocolate, Pepsi, Coke {Chocolate} → {Pepsi},
4 Bread, Milk, Chocolate, Pepsi {Milk, Bread} → {Eggs, Coke},
5 Bread, Milk, Chocolate, Coke {Pepsi, Bread} → {Milk}
Association rule mining (Cont..)
TID Items
 Itemset
• A collection of one or more items
1 Bread, Milk
o E.g. : {Milk, Bread, Chocolate} 2 Bread, Chocolate, Pepsi, Eggs
• k-itemset 3 Milk, Chocolate, Pepsi, Coke
4 Bread, Milk, Chocolate, Pepsi
An itemset that contains k items
5 Bread, Milk, Chocolate, Coke
 Support count (σ)
• Frequency of occurrence of an itemset
o E.g. σ({Milk, Bread, Chocolate}) = 2

 Support
• Fraction of transactions that contain an itemset
o E.g. s({Milk, Bread, Chocolate}) = 2/5

 Frequent Itemset
• An itemset whose support is greater than or equal to a minimum support
threshold
Association rule mining (Cont..)
 Association Rule
• An implication expression of the form X → Y, where X and Y are itemsets
o E.g.: {Milk, Chocolate} → {Pepsi}

 Rule Evaluation
• Support (s)
o Fraction of transactions that contain both X and Y
• Confidence (c)
o Measures how often items in Y appear in transactions that contain X
Association rule mining (Cont..)
TID Items
1 Bread, Milk
2 Bread, Chocolate, Pepsi, Eggs
3 Milk, Chocolate, Pepsi, Coke
4 Bread, Milk, Chocolate, Pepsi
5 Bread, Milk, Chocolate, Coke

Example:
Find support & confidence for {Milk, Chocolate} ⇒ Pepsi

= c = 67
Association rule mining - Example
TID Items
Calculate Support & Confidence:
1 Bread, Milk {Milk, Chocolate} → {Pepsi}
2 Bread, Chocolate, Pepsi, Eggs {Milk, Pepsi} → {Chocolate}
3 Milk, Chocolate, Pepsi, Coke {Chocolate, Pepsi} → {Milk}
4 Bread, Milk, Chocolate, Pepsi {Pepsi} → {Milk, Chocolate}
{Chocolate} → {Milk, Pepsi}
5 Bread, Milk, Chocolate, Coke {Milk} → {Chocolate, Pepsi}

Answer
Support (s) : 0.4
{Milk, Chocolate} → {Pepsi} c = 0.67
{Milk, Pepsi} → {Chocolate} c = 1.0
{Chocolate, Pepsi} → {Milk} c = 0.67
{Pepsi} → {Milk, Chocolate} c = 0.67
{Chocolate} → {Milk, Pepsi} c = 0.5
{Milk} → {Chocolate, Pepsi} c = 0.5
Association rule mining (Cont..)
 A common strategy adopted by many association rule
mining algorithms is to decompose the problem into two
major subtasks:
1. Frequent Itemset Generation
• The objective is to find all the item-sets that satisfy the
minimum support threshold.
• These itemsets are called frequent itemsets.
2. Rule Generation
• The objective is to extract all the high-confidence rules
from the frequent itemsets found in the previous step.
• These rules are called strong rules.
Apriori algorithm
 Purpose: The Apriori Algorithm is an influential algorithm for
mining frequent itemsets for Boolean association rules.
 Key Concepts:
• Frequent Itemsets:
The sets of item which has minimum support (denoted by Li for ith-Itemset).
• Apriori Property:
Any subset of frequent itemset must be frequent.
• Join Operation:
To find Lk, a set of candidate k-itemsets is generated by joining Lk-1 itself.
Apriori algorithm (Cont..)
 Find the frequent itemsets
• The sets of items that have minimum support and a subset of a frequent
itemset must also be a frequent itemset (Apriori Property).
• E.g. if {AB} is a frequent itemset, both {A} and {B} should be a frequent
itemset.
• Use the frequent item sets to generate association rules.
 The Apriori Algorithm : Pseudo code
• Join Step: Ck is generated by joining Lk-1with itself
• Prune Step: Any (k-1) itemset that is not frequent cannot be a subset of a
frequent k-itemset
Apriori algorithm steps (Cont..)
 Step 1:
• Start with itemsets containing just a single item (Individual items).
 Step 2:
• Determine the support for itemsets.
• Keep the itemsets that meet your minimum support threshold and
remove itemsets that do not support minimum support.
 Step 3:
• Using the itemsets you have kept from Step 1, generate all the possible
itemset combinations.
 Step 4:
• Repeat steps 1 & 2 until there are no more new itemsets.
Apriori algorithm - Pseudo code (Cont..)
Ck: Candidate itemset of size k
Lk: Frequent itemset of size k
L1= {frequent items};
for (k = 1; Lk != ∅; k++) do begin
Ck+1 = candidates generated from Lk;
for each transaction t in database do
Increment the count of all candidates in Ck+1
That are contained in t
Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
Apriori algorithm - Example Minimum Support = 2

TID Items C1 ItemSet Min. Sup L1 ItemSet Min. Sup


100 134 {1} 2 {1} 2
200 235 Scan D
{2} 3 {2} 3
300 1235 {3} 3 {3} 3
400 25 {4} 1 {5} 3
{5} 3
C2 Itemset Min. Sup C2 Itemset
L2 ItemSet Min. Sup
{1 2} 1 {1 2}
{1 3} 2
{1 3} 2 Scan D {1 3}
{2 3} 2
{1 5} 1 {1 5}
{2 5} 3
{2 3} 2 {2 3}
{3 5} 2
{2 5} 3 {2 5}
{3 5} 2 {3 5}
Apriori algorithm - Example Minimum Support = 2

ItemSet Min. Sup ItemSet Min. Sup


L2 {1 3} 2 C3 {1 2 3} 1
L3 Items Sup
{2 3} 2 Scan D {1 2 5} 1
{2 3 5} 2
{2 5} 3 {1 3 5} 1
{3 5} 2 {2 3 5} 2

Rules generation
Association Rule Support Confidence Confidence (%)
2^35 2 2/2 = 1 100 %
3^52 2 2/2 = 1 100 %
2^53 2 2/3 = 0.66 66%
2=3^5 2 2/3 = 0.66 66%
3=2^5 2 2/3 = 0.66 66%
5=2^3 2 2/3 = 0.66 66%
Improve apriori’s efficiency
 Hash-based itemset counting: A k-itemset whose corresponding hashing
bucket count is below the threshold cannot be frequent.
 Transaction reduction: A transaction that does not contain any frequent
k-itemset is useless in subsequent scans.
 Partitioning: Any itemset that is potentially frequent in DB must be
frequent in at least one of the partitions of DB.
 Sampling: Mining on a subset of given data, lower support threshold + a
method to determine the completeness.
 Dynamic itemset counting: Add new candidate itemsets only when all of
their subsets are estimated to be frequent.
Incremental Mining of Association Rules
 It is noted that analysis of past transaction data can provide very
valuable information on customer buying behavior, and thus improve
the quality of business decisions.
 With the increasing use of the record-based databases whose data is
being continuously added, updated, deleted etc.
 Examples of such applications include Web log records, stock market
data, grocery sales data, transactions in e-commerce, and daily
weather/traffic records etc.
 In many applications, we would like to mine the transaction database for
a fixed amount of most recent data (say, data in the last 12 months).
Incremental Mining of Association Rules
 Mining is not a one-time operation, a naive approach to solve the
incremental mining problem is to re-run the mining algorithm on the
updated database.
Mining

Data

Feedback Business
Strategy
FP-Growth Algorithm
 The FP-Growth Algorithm is proposed by Han.
 It is an efficient and scalable method for mining the complete set
of frequent patterns.
 Using prefix-tree structure for storing information about frequent
patterns named frequent-pattern tree (FP-tree).
 Once an FP-tree has been constructed, it uses a recursive divide-
and-conquer approach to mine the frequent item sets.
FP-Growth Algorithm (Cont..)
 Building the FP-Tree
1. Scan data to determine the support count of each item.
• Infrequent items are discarded, while the frequent items are sorted in
decreasing support counts.
2. Make a second pass over the data to construct the FP­-tree.
• As the transactions are read, before being processed, their items are sorted
according to the above order.
FP-Growth Algorithm - Example
FP-Tree Generation Step:1 Step:2
TID Transactions Freq. 1-Itemsets. Transactions with items sorted based on
Min_Sup  2 frequencies, and ignoring the infrequent
1 ABCEFO Transactions items.
2 ACG A:8 ACEBF
3 EI C:8 ACG
4 ACDEG E:8 E
5 ACEGL
G:5 ACEGD
B:2 ACEG
6 EJ
D:2 E
7 ABCEFP F:2 ACEBF
8 ACD Remaining all ACD
O,I,J,L,P,M & N
9 ACEGM is with ACEG
min_sup = 1
10 A C E G N ACEG
FP-Tree after reading 1st transaction
ACEBF Header null
ACG
A:8
E A:1
C:8
ACEGD E:8
C:1
ACEG G:5
E B:2
E:1
ACEBF D:2
ACD F:2
B:1
ACEG
ACEG F:1
FP-Tree after reading 2nd transaction
ACEBF Header null
ACG
A:8 A:2
E C:8
ACEGD E:8
C:2
ACEG G:5
E B:2
E:1 G:1
D:2
ACEBF
F:2
ACD B:1
ACEG
ACEG F:1
FP-Tree after reading 3rd transaction
ACEBF Header null
ACG
A:8
E A:2 E:1
C:8
ACEGD E:8
C:2
ACEG G:5
E B:2 G:1
E:1
ACEBF D:2
ACD F:2
B:1
ACEG
ACEG F:1
FP-Tree after reading 4th transaction
ACEBF Header null
ACG
A:8
E A:3 E:1
C:8
ACEGD E:8
C:3
ACEG G:5
E B:2 G:1
E:2
ACEBF D:2
ACD F:2 G:1
B:1
ACEG
ACEG F:1 D:1
FP-Tree after reading 5th transaction
ACEBF Header null
ACG
A:8
E A:4 E:1
C:8
ACEGD
E:8
C:4
ACEG G:5
E B:2 G:1
E:3
ACEBF D:2
ACD F:2 G:2
B:1
ACEG
ACEG F:1 D:1
FP-Tree after reading 6th transaction
ACEBF Header null
ACG
A:8
E A:4 E:2
C:8
ACEGD
E:8
ACEG C:4
G:5
E B:2
E:3 G:1
ACEBF D:2
ACD F:2 G:2
B:1
ACEG
ACEG F:1 D:1
FP-Tree after reading 7th transaction
ACEBF Header null
ACG
A:8
E A:5 E:2
C:8
ACEGD
E:8
ACEG C:5
G:5
E B:2
E:4 G:1
ACEBF D:2
ACD F:2 G:2
B:2
ACEG
ACEG F:2 D:1
FP-Tree after reading 8th transaction
ACEBF Header null
ACG
A:8
E A:6 E:2
C:8
ACEGD
E:8
ACEG C:6
G:5
E B:2
E:4 G:1 D:1
ACEBF D:2
ACD F:2
B:2
G:2
ACEG
ACEG F:2 D:1
FP-Tree after reading 9th transaction
ACEBF Header null
ACG
A:8
E A:7 E:2
C:8
ACEGD
E:8
ACEG C:7
G:5
E B:2
E:5 G:1 D:1
ACEBF D:2
ACD F:2 G:3
B:2
ACEG
ACEG F:2 D:1
FP-Tree after reading 10th transaction
ACEBF Header null
ACG
A:8
E A:8 E:2
C:8
ACEGD
E:8
ACEG C:8
G:5
E B:2
E:6 G:1 D:1
ACEBF D:2
ACD F:2 G:4
B:2
ACEG
ACEG F:2 D:1
FP-Growth algorithm - Example
Minimum Support
>= 2 FP-Tree
TID Items
null
1 AB
2 BCD Header B:8 A:2
3 ACDE Item Support
A:5 C:3 C:1 D:1
4 ADE B 8
5 ABC A 7 C:3 D:1 D:1 E:1 D:1 E:1
6 ABCD C 7
7 BC D 5 D:1 E:1
8 ABC E 3
9 ABD
10 BCE
FP-Growth Example (Try it) Minimum Support = 3

TID Items Bought Frequent Items


(sorted order)
100 FACDGIMP FCAMP
200 ABCFLMO FCABM
300 BFHJOW FB
400 BCKSP CBP
500 AFCELPMN FCAMP
FP-Growth Example - Answer
FP-Tree Construction

null
Header Table
f:4 c:1
Item frequency
f 4
c 4 c:3 b:1 b:1
a 3
b 3 a:3 p:1
m 3
p 3 m:2 b:1

p:2 m:1

You might also like