
Unit 5: Concept Description & Association Rule Mining
 What is concept description?
 Data generalization and summarization-based
characterization
 Analytical characterization: Analysis of attribute relevance
 Market basket analysis
 Finding frequent item sets
 Apriori algorithm
 Improved Apriori algorithm
 Incremental ARM
 Associative Classification
Concept Description
 Descriptive vs. predictive data mining
 Descriptive mining: describes concepts or task-relevant data sets in concise, summarative, informative, discriminative forms
 Predictive mining: based on data and analysis, constructs models for the database and predicts the trend and properties of unknown data
 Concept description:
 Characterization: provides a concise and succinct summarization of the given collection of data
 Comparison: provides descriptions comparing two or more collections of data
Concept Description vs. OLAP
 Concept description:
 can handle complex data types of the attributes and
their aggregations
 a more automated process
 OLAP:
 restricted to a small number of dimensions and measure types
 a user-controlled process
Unit 5- Concept Description &
Association Rule Mining
 What is concept description?
 Data generalization and summarization-based
characterization
 Analytical characterization: Analysis of attribute relevance
 Market basket analysis
 Finding frequent item sets
 Apriori algorithm
 Improved Apriori algorithm
 Incremental ARM
 Associative Classification
Data Generalization and Summarization-based
Characterization
 Data generalization
 A process which abstracts a large set of task-relevant data in a database from low conceptual levels to higher ones.
[Figure: a ladder of conceptual levels, 1 (lowest) through 5 (highest)]

 Approaches:
 Data cube approach(OLAP approach)
 Attribute-oriented induction approach
Characterization: Data Cube Approach
(without using AO-Induction)
 Perform computations and store results in data cubes
 Strength
 An efficient implementation of data generalization
 Computation of various kinds of measures
 e.g., count( ), sum( ), average( ), max( )
 Generalization and specialization can be performed on a data cube
by roll-up and drill-down
 Limitations
 handle only dimensions of simple nonnumeric data and measures of
simple aggregated numeric values.
 Lack of intelligent analysis: cannot tell which dimensions should be used, or to what level the generalization should reach
Attribute-Oriented Induction
 Not confined to categorical data nor particular measures.
 How it is done?
 Collect the task-relevant data (initial relation) using a relational database query
 Perform generalization by attribute removal or attribute
generalization.
 Apply aggregation by merging identical, generalized tuples
and accumulating their respective counts.
 Interactive presentation with users.
Basic Principles of Attribute-Oriented
Induction
 Data focusing: task-relevant data, including dimensions, and the result
is the initial relation.
 Attribute-removal: remove attribute A if there is a large set of distinct
values for A but (1) there is no generalization operator on A, or (2) A’s
higher level concepts are expressed in terms of other attributes.
 Attribute-generalization: If there is a large set of distinct values for A,
and there exists a set of generalization operators on A, then select an
operator and generalize A.
 Attribute-threshold control: typically 2-8 distinct values, specified by the user or by default.
 Generalized relation threshold control: control the final relation/rule
size.
Basic Algorithm for Attribute-Oriented
Induction
 InitialRel: Query processing of task-relevant data, deriving the initial
relation.
 PreGen: Based on the analysis of the number of distinct values in each
attribute, determine generalization plan for each attribute: removal? or
how high to generalize?
 PrimeGen: Based on the PreGen plan, perform generalization to the
right level to derive a “prime generalized relation”, accumulating the
counts.
 Presentation: User interaction: (1) adjust levels by drilling, (2) pivoting,
(3) mapping into rules, cross tabs, visualization presentations.
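
To make the four phases concrete, here is a minimal Python sketch of attribute-oriented induction, not the textbook's implementation; the relation format, the hierarchies mapping, and the threshold value are all hypothetical choices:

from collections import Counter

def attribute_oriented_induction(rows, hierarchies, attr_threshold=5):
    # rows: list of dicts, the initial relation from the query (InitialRel)
    # hierarchies: attr -> function mapping a value one level up its concept
    #              hierarchy, or None if no generalization operator exists
    attrs = list(rows[0])
    plan = {}
    for a in attrs:                                   # PreGen: plan per attribute
        if len({r[a] for r in rows}) <= attr_threshold:
            plan[a] = "keep"
        elif hierarchies.get(a) is None:
            plan[a] = "remove"                        # many values, no operator
        else:
            plan[a] = "generalize"                    # one level only in this
                                                      # sketch; real AOI iterates
    out = Counter()                                   # PrimeGen: generalize, merge
    for r in rows:                                    # identical tuples, count
        key = tuple(hierarchies[a](r[a]) if plan[a] == "generalize" else r[a]
                    for a in attrs if plan[a] != "remove")
        out[key] += 1
    return out                                        # prime generalized relation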
Example
 DMQL: Describe general characteristics of graduate students in
the Big-University database
use Big_University_DB
mine characteristics as “Science_Students”
in relevance to name, gender, major, birth_place,
birth_date, residence, phone#, gpa
from student
where status in “graduate”
 Corresponding SQL statement:
Select name, gender, major, birth_place, birth_date,
residence, phone#, gpa
from student
where status in {“Msc”, “MBA”, “PhD” }
Class Characterization: An Example

Initial relation:

Name            Gender  Major    Birth-Place            Birth_date  Residence                 Phone #   GPA
Jim Woodman     M       CS       Vancouver, BC, Canada  8-12-76     3511 Main St., Richmond   687-4598  3.67
Scott Lachance  M       CS       Montreal, Que, Canada  28-7-75     345 1st Ave., Richmond    253-9106  3.70
Laura Lee       F       Physics  Seattle, WA, USA       25-8-70     125 Austin Ave., Burnaby  420-5232  3.83
…               …       …        …                      …           …                         …         …

Generalization plan: Name removed; Gender retained; Major generalized to {Sci, Eng, Bus}; Birth-Place to country; Birth_date to age range; Residence to city; Phone # removed; GPA to {Excl, VG, …}.

Prime generalized relation:

Gender  Major    Birth_region  Age_range  Residence  GPA        Count
M       Science  Canada        20-25      Richmond   Very-good  16
F       Science  Foreign       25-30      Burnaby    Excellent  22
…       …        …             …          …          …          …

Crosstab on Gender and Birth_region:

Gender  Canada  Foreign  Total
M       16      14       30
F       10      22       32
Total   26      36       62
Presentation of Generalized Results
 Generalized relation:
 Relations where some or all attributes are generalized, with counts or other aggregation values accumulated.
 Cross tabulation:
 Mapping results into cross tabulation form (similar to contingency tables).
 Visualization techniques:
 Pie charts, bar charts, curves, cubes, and other visual forms.
 Quantitative characteristic rules:
 Mapping generalized results into characteristic rules with quantitative information associated with them, e.g.,

grad(x) ∧ male(x) ⇒ birth_region(x) = "Canada" [t: 53%] ∨ birth_region(x) = "foreign" [t: 47%]
Unit 5- Concept Description &
Association Rule Mining
 What is concept description?
 Data generalization and summarization-based
characterization
 Analytical characterization: Analysis of attribute relevance
 Association Mining
 Market basket analysis
 Finding frequent item sets
 Apriori algorithm
 Improved Apriori algorithm
 Incremental ARM
 Associative Classification
Association Mining
 Association rule mining:
 Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
 Applications:
 Basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.
What Is Frequent Pattern Analysis?
 Frequent pattern: a pattern (a set of items, subsequences, substructures,
etc.) that occurs frequently in a data set
 First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of
frequent itemsets and association rule mining
 Motivation: Finding inherent regularities in data
 What products were often purchased together?— Beer and diapers?!
 What are the subsequent purchases after buying a PC?
 What kinds of DNA are sensitive to this new drug?
 Can we automatically classify web documents?
 Applications
 Basket data analysis, cross-marketing, catalog design, sale campaign
analysis, Web log (click stream) analysis, and DNA sequence analysis.
Why Is Freq. Pattern Mining Important?
 Discloses an intrinsic and important property of data sets
 Forms the foundation for many essential data mining tasks
 Association, correlation, and causality analysis
 Sequential, structural (e.g., sub-graph) patterns
 Pattern analysis in spatiotemporal, multimedia, time-series,
and stream data
 Classification: associative classification
 Cluster analysis: frequent pattern-based clustering
 Data warehousing: iceberg cube and cube-gradient
 Semantic data compression: fascicles
 Broad applications
Basic Concepts: Frequent Patterns and Association Rules

Transaction-id   Items bought
10               A, B, D
20               A, C, D
30               A, D, E
40               B, E, F
50               B, C, D, E, F

 Itemset X = {x1, …, xk}
 Find all the rules X ⇒ Y with minimum support and confidence
 support, s: probability that a transaction contains X ∪ Y
 confidence, c: conditional probability that a transaction having X also contains Y

[Figure: Venn diagram of customers buying beer, customers buying diaper, and customers buying both]

Let sup_min = 50%, conf_min = 50%
Frequent patterns: {A:3, B:3, D:4, E:3, AD:3}
Association rules:
 A ⇒ D (60%, 100%)
 D ⇒ A (60%, 75%)
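
As a quick check of these definitions, a small Python sketch computing support and confidence over the table above (transaction IDs omitted):

db = [{'A','B','D'}, {'A','C','D'}, {'A','D','E'}, {'B','E','F'}, {'B','C','D','E','F'}]

def support(itemset, db):
    # fraction of transactions containing every item of itemset
    return sum(itemset <= t for t in db) / len(db)

def confidence(X, Y, db):
    # conditional probability that a transaction with X also contains Y
    return support(X | Y, db) / support(X, db)

print(support({'A','D'}, db))        # 0.6  -> A => D holds with 60% support
print(confidence({'A'}, {'D'}, db))  # 1.0  -> ... and 100% confidence
print(confidence({'D'}, {'A'}, db))  # 0.75 -> D => A: 60% support, 75% confidence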
Closed Patterns and Max-Patterns
 A long pattern contains a combinatorial number of sub-patterns; e.g., {a1, …, a100} contains C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27 × 10^30 sub-patterns!
 Solution: mine closed patterns and max-patterns instead
 An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X (proposed by Pasquier, et al. @ ICDT'99)
 An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X (proposed by Bayardo @ SIGMOD'98)
 Closed patterns are a lossless compression of frequent patterns
 Reduces the # of patterns and rules
Closed Patterns and Max-Patterns
 Exercise. DB = {<a1, …, a100>, <a1, …, a50>}
 Min_sup = 1.
 What is the set of closed itemsets?
 <a1, …, a100>: 1
 <a1, …, a50>: 2
 What is the set of max-patterns?
 <a1, …, a100>: 1
 What is the set of all frequent patterns?
 All 2^100 − 1 nonempty subsets of {a1, …, a100}, far too many to list!
Chapter 5: Mining Frequent Patterns,
Associations and Correlations
 Basic concepts and a road map
 Efficient and scalable frequent itemset mining
methods
 Mining various kinds of association rules
 From association mining to correlation analysis
 Constraint-based association mining
 Summary
Scalable Methods for Mining Frequent Patterns
 The downward closure property of frequent patterns
 Any subset of a frequent itemset must be frequent
 If {beer, diaper, nuts} is frequent, so is {beer, diaper}
 i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}
 Scalable mining methods: two major approaches
 Apriori
 Frequent-pattern growth (FP-growth)
Apriori: A Candidate Generation-and-Test Approach
 Apriori principle: all nonempty subsets of a frequent itemset must also be frequent.
 Apriori pruning principle: if there is any itemset which is infrequent, its superset should not be generated/tested!
 Method:
 Initially, scan the DB once to get the frequent 1-itemsets
 Generate length-(k+1) candidate itemsets from length-k frequent itemsets
 Test the candidates against the DB
 Terminate when no frequent or candidate set can be generated
The Apriori Algorithm: An Example (sup_min = 2)

Database TDB:
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1: {A}:2, {B}:3, {C}:3, {E}:3

C2 (generated from L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan → C2 counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3 (generated from L2): {B,C,E}
3rd scan → L3: {B,C,E}:2
The Apriori Algorithm
 Pseudo-code:
Ck: candidate itemset of size k
Lk: frequent itemset of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
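
A runnable Python rendering of this pseudo-code (a sketch: transactions are sets, min_support is an absolute count, and the prune step of candidate generation is shown separately on the next slide):

from collections import defaultdict

def apriori(transactions, min_support):
    counts = defaultdict(int)
    for t in transactions:                 # one scan for L1
        for item in t:
            counts[(item,)] += 1
    Lk = {i: c for i, c in counts.items() if c >= min_support}
    result = dict(Lk)
    k = 1
    while Lk:
        prev = sorted(Lk)                  # sorted tuples of length k
        Ck1 = {a + (b[k-1],) for a in prev for b in prev
               if a[:k-1] == b[:k-1] and a[k-1] < b[k-1]}   # self-join Lk
        counts = defaultdict(int)
        for t in transactions:             # one scan per level to count candidates
            for c in Ck1:
                if set(c) <= t:
                    counts[c] += 1
        Lk = {c: n for c, n in counts.items() if n >= min_support}
        result.update(Lk)
        k += 1
    return result                          # union of all Lk, with supports

db = [{'A','C','D'}, {'B','C','E'}, {'A','B','C','E'}, {'B','E'}]
print(apriori(db, 2))   # reproduces the example above, incl. ('B','C','E'): 2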
Important Details of Apriori
 How to generate candidates?
 Step 1: self-joining Lk
 Step 2: pruning
 How to count supports of candidates?
 Example of Candidate-generation
 L3={abc, abd, acd, ace, bcd}
 Self-joining: L3*L3
 abcd from abc and abd
 acde from acd and ace
 Pruning:
 acde is removed because ade is not in L3
 C4={abcd}
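
The self-join and prune steps can be coded directly; this sketch reproduces the L3 example above (itemsets represented as sorted tuples):

from itertools import combinations

def apriori_gen(Lk, k):
    Lk = set(Lk)
    Ck1 = set()
    for a in Lk:
        for b in Lk:
            if a[:k-1] == b[:k-1] and a[k-1] < b[k-1]:        # self-join step
                c = a + (b[k-1],)
                if all(s in Lk for s in combinations(c, k)):  # prune step
                    Ck1.add(c)
    return Ck1

L3 = {('a','b','c'), ('a','b','d'), ('a','c','d'), ('a','c','e'), ('b','c','d')}
print(apriori_gen(L3, 3))   # {('a','b','c','d')}; acde is pruned since ade not in L3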
Challenges of Frequent Pattern Mining
 Challenges
 Multiple scans of the transaction database
 Huge number of candidates
 Tedious workload of support counting for candidates
 Improving Apriori: general ideas
 Reduce passes of transaction database scans
 Shrink the number of candidates
 Facilitate support counting of candidates
Methods to Improve Apriori’s Efficiency
 Hash-based itemset counting: A k-itemset whose
corresponding hashing bucket count is below the
threshold cannot be frequent.
 Transaction reduction: A transaction that does not contain
any frequent k-itemset is useless in subsequent scans.
 Partitioning: Any itemset that is potentially frequent in DB
must be frequent in at least one of the partitions of DB.
 Sampling: mining on a subset of given data, lower support
threshold + a method to determine the completeness.
 Dynamic itemset counting: add new candidate itemsets
only when all of their subsets are estimated to be frequent.
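
As one illustration, the hash-based idea (in the spirit of the DHP algorithm) can be sketched as follows; the bucket count and the hash function are arbitrary choices here:

from itertools import combinations

def build_pair_filter(transactions, min_support, n_buckets=101):
    buckets = [0] * n_buckets
    h = lambda pair: hash(pair) % n_buckets       # arbitrary hash function
    for t in transactions:                        # done during the L1 counting scan
        for pair in combinations(sorted(t), 2):
            buckets[h(pair)] += 1
    # a 2-itemset whose bucket count is below min_support cannot be frequent,
    # so it never needs to become a candidate
    return lambda pair: buckets[h(pair)] >= min_support

may_be_frequent = build_pair_filter([{'A','C','D'}, {'B','C','E'},
                                     {'A','B','C','E'}, {'B','E'}], 2)
print(may_be_frequent(('A', 'C')))   # True: its bucket count reaches at least 2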
Outline
 Frequent pattern mining: problem statement and an example
 Review of Apriori-like approaches
 FP-growth:
 Overview
 FP-tree: structure, construction and advantages
 FP-growth: FP-tree → conditional pattern bases → conditional FP-trees → frequent patterns
 Experiments
 Discussion:
 Improvement of FP-growth
 Concluding remarks
Frequent Pattern Mining: An Example

Given a transaction database DB and a minimum support threshold ξ, find all frequent patterns (itemsets) with support no less than ξ.

Input: DB:
TID   Items bought
100   {f, a, c, d, g, i, m, p}
200   {a, b, c, f, l, m, o}
300   {b, f, h, j, o}
400   {b, c, k, s, p}
500   {a, f, c, e, l, p, m, n}

Minimum support: ξ = 3

Output: all frequent patterns, i.e., f, a, …, fa, fac, fam, fm, am, …

Problem statement: how to efficiently find all frequent patterns?
Apriori
 Main steps of the Apriori algorithm:
 Candidate generation: use frequent (k−1)-itemsets (Lk−1) to generate candidates for frequent k-itemsets (Ck)
 Candidate test: scan the database and count each pattern in Ck to get the frequent k-itemsets (Lk)
 E.g.:

TID   Items bought
100   {f, a, c, d, g, i, m, p}
200   {a, b, c, f, l, m, o}
300   {b, f, h, j, o}
400   {b, c, k, s, p}
500   {a, f, c, e, l, p, m, n}

Apriori iteration:
C1: f, a, c, d, g, i, m, p, l, o, h, j, k, s, b, e, n
L1: f, a, c, m, b, p
C2: fa, fc, fm, fp, ac, am, …, bp
L2: fa, fc, fm, …
…
Performance Bottlenecks of Apriori
 Bottlenecks of Apriori: candidate generation
 Generates huge candidate sets:
 10^4 frequent 1-itemsets will generate more than 10^7 candidate 2-itemsets
 To discover a frequent pattern of size 100, e.g., {a1, a2, …, a100}, one needs to generate 2^100 ≈ 10^30 candidates.
 Candidate testing incurs multiple scans of the database: one scan per iteration to count each candidate
Overview of FP-Growth: Ideas
 Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure
 highly compacted, but complete for frequent pattern mining
 avoids costly repeated database scans
 Develop an efficient FP-tree-based frequent pattern mining method (FP-growth)
 A divide-and-conquer methodology: decompose mining tasks into smaller ones
 Avoid candidate generation: sub-database test only
FP-tree: Construction and Design

Construct the FP-tree in two steps:
1. Scan the transaction DB for the first time, find the frequent items (single-item patterns) and order them into a list L in frequency-descending order.
   e.g., L = {f:4, c:4, a:3, b:3, m:3, p:3}
   in the format (item-name, support)
2. Scan the DB a second time; for each transaction, order its frequent items according to L, and construct the FP-tree by inserting each frequency-ordered transaction into it.
FP-tree Example: step 1

Step 1: scan the DB once, find the frequent 1-itemsets

TID   Items bought
100   {f, a, c, d, g, i, m, p}
200   {a, b, c, f, l, m, o}
300   {b, f, h, j, o}
400   {b, c, k, s, p}
500   {a, f, c, e, l, p, m, n}

Item frequency: f:4, c:4, a:3, b:3, m:3, p:3
FP-tree Example: step 2

Step 2: scan the DB a second time, order the frequent items in each transaction

TID   Items bought                 (ordered) frequent items
100   {f, a, c, d, g, i, m, p}     {f, c, a, m, p}
200   {a, b, c, f, l, m, o}        {f, c, a, b, m}
300   {b, f, h, j, o}              {f, b}
400   {b, c, k, s, p}              {c, b, p}
500   {a, f, c, e, l, p, m, n}     {f, c, a, m, p}
FP-tree Example: step 2 (construct the FP-tree)

[Figure: inserting {f, c, a, m, p} creates the single path f:1 → c:1 → a:1 → m:1 → p:1 under the root {}; inserting {f, c, a, b, m} increments the shared prefix to f:2 → c:2 → a:2 and adds the branch b:1 → m:1 beside m:1 → p:1]

NOTE: each transaction corresponds to one path in the FP-tree
FP-tree Example: step 2 (continued)

[Figure: inserting {f, b} adds b:1 under f:3; inserting {c, b, p} adds the new root branch c:1 → b:1 → p:1; inserting the final {f, c, a, m, p} yields f:4 → c:3 → a:3 with m:2 → p:2 and b:1 → m:1. Node-links connect the nodes that carry the same item]
Construction Example: Final FP-tree

[Figure: the final FP-tree with its header table (items f, c, a, b, m, p, each with a head-of-node-link pointer). Root {} has children f:4 and c:1; f:4 → c:3 → a:3, with a:3 → m:2 → p:2 and a:3 → b:1 → m:1; f:4 → b:1; c:1 → b:1 → p:1]
FP-Tree Definition
 The FP-tree is a frequent-pattern tree. Formally, an FP-tree is a tree structure defined as follows:
1. One root labeled as "null", a set of item-prefix subtrees as the children of the root, and a frequent-item header table.
2. Each node in the item-prefix subtrees has three fields:
 item-name: records which item this node represents,
 count: the number of transactions represented by the portion of the path reaching this node,
 node-link: links to the next node in the FP-tree carrying the same item-name, or null if there is none.
3. Each entry in the frequent-item header table has two fields:
 item-name, and
 head of node-link, which points to the first node in the FP-tree carrying that item-name.
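
A compact Python sketch of this structure and the two-scan construction (node-links are kept as per-item lists rather than chained pointers, an implementation convenience rather than part of the paper's definition):

class FPNode:
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 1, parent
        self.children = {}                        # item -> child FPNode

def build_fptree(transactions, min_support):
    counts = {}
    for t in transactions:                        # scan 1: count every item
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    L = [i for i, c in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
         if c >= min_support]                     # list L, frequency-descending
    rank = {item: r for r, item in enumerate(L)}  # (ties broken alphabetically)
    root = FPNode(None, None)                     # the "null" root
    header = {item: [] for item in L}             # header table: item -> its nodes
    for t in transactions:                        # scan 2: insert ordered paths
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            if item in node.children:
                node.children[item].count += 1    # shared prefix: bump the count
            else:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
    return root, header, L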
Advantages of the FP-tree Structure
 The most significant advantage of the FP-tree:
 Scan the DB twice, and only twice.
 Completeness:
 the FP-tree contains all the information related to mining frequent patterns (given the min-support threshold). Why?
 Compactness:
 the size of the tree is bounded by the occurrences of frequent items
 the height of the tree is bounded by the maximum number of items in a transaction
Questions?
 Why descending order?
 Example 1: inserting the same frequent items in different orders creates separate paths for one and the same itemset:

TID   (unordered) frequent items
100   {f, a, c, m, p}
500   {a, f, c, p, m}

[Figure: root {} with two disjoint paths f:1 → a:1 → c:1 → m:1 → p:1 and a:1 → f:1 → c:1 → p:1 → m:1]
Questions? (continued)
 Example 2: ordering the frequent items in ascending frequency:

TID   (ascended) frequent items
100   {p, m, a, c, f}
200   {m, b, a, c, f}
300   {b, f}
400   {p, b, c}
500   {p, m, a, c, f}

[Figure: the resulting tree, rooted at {} with children p:3, m:2, c:1 and many duplicated branches]

 This tree is larger than the FP-tree, because in the FP-tree more frequent items occupy higher positions, which produces fewer branches.
FP-growth: Mining Frequent Patterns Using the FP-tree
 General idea (divide-and-conquer):
 recursively grow frequent patterns using the FP-tree, looking for shorter ones recursively and then concatenating the suffix:
 for each frequent item, construct its conditional pattern base, and then its conditional FP-tree;
 repeat the process on each newly created conditional FP-tree until the resulting FP-tree is empty, or it contains only one path (a single path will generate all the combinations of its sub-paths, each of which is a frequent pattern)
3 Major Steps

Starting the processing from the end of list L:
 Step 1: construct the conditional pattern base for each item in the header table
 Step 2: construct the conditional FP-tree from each conditional pattern base
 Step 3: recursively mine the conditional FP-trees and grow the frequent patterns obtained so far. If a conditional FP-tree contains a single path, simply enumerate all the patterns.
Step 1: Construct Conditional Pattern Base
 Starting at the bottom of the frequent-item header table in the FP-tree
 Traverse the FP-tree by following the node-links of each frequent item
 Accumulate all transformed prefix paths of that item to form its conditional pattern base

[Figure: the FP-tree with header table, annotated with each item's prefix paths]

Conditional pattern bases:
item   cond. pattern base
p      fcam:2, cb:1
m      fca:2, fcab:1
b      fca:1, f:1, c:1
a      fc:3
c      f:3
f      {}
Properties of FP-Tree
 Node-link property
 For any frequent item ai, all the possible frequent patterns that contain ai can be obtained by following ai's node-links, starting from ai's head in the FP-tree header.
 Prefix path property
 To calculate the frequent patterns for a node ai in a path P, only the prefix sub-path of ai in P needs to be accumulated, and its frequency count should carry the same count as node ai.
Step 2: Construct Conditional FP-tree
 For each pattern base:
 accumulate the count for each item in the base
 construct the conditional FP-tree for the frequent items of the pattern base

Example: m's conditional pattern base is {fca:2, fcab:1}. The accumulated counts are f:3, c:3, a:3, b:1; with min support 3, b is dropped, so the m-conditional FP-tree is the single path f:3 → c:3 → a:3.
Step 3: Recursively Mine the Conditional FP-trees

conditional FP-tree of "m": (fca:3) → frequent pattern m
  add "a": conditional FP-tree of "am": (fc:3) → frequent pattern am
    add "c": conditional FP-tree of "cam": (f:3) → frequent pattern cam
      add "f": conditional FP-tree of "fcam" → frequent pattern fcam
    add "f": conditional FP-tree of "fam": 3 → frequent pattern fam
  add "c": conditional FP-tree of "cm": (f:3) → frequent pattern cm
    add "f": conditional FP-tree of "fcm": 3 → frequent pattern fcm
  add "f": conditional FP-tree of "fm": 3 → frequent pattern fm
Principles of FP-Growth
 Pattern growth property
 Let α be a frequent itemset in DB, B be α's conditional pattern base, and β be an itemset in B. Then α ∪ β is a frequent itemset in DB iff β is frequent in B.
 Is "fcabm" a frequent pattern?
 "fcab" is a branch of m's conditional pattern base
 "b" is NOT frequent in transactions containing "fcab"
 so "bm" is NOT a frequent itemset, and neither is "fcabm"
Conditional Pattern Bases and Conditional FP-Trees

Item   Conditional pattern base       Conditional FP-tree
p      {(fcam:2), (cb:1)}             {(c:3)}|p
m      {(fca:2), (fcab:1)}            {(f:3, c:3, a:3)}|m
b      {(fca:1), (f:1), (c:1)}        Empty
a      {(fc:3)}                       {(f:3, c:3)}|a
c      {(f:3)}                        {(f:3)}|c
f      Empty                          Empty

(items listed in the order of list L)
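
Combining the three steps, a recursive FP-growth sketch built on the build_fptree helper from the earlier FP-tree sketch (min_support is again an absolute count; replicating each prefix path count times is a simplification that ignores efficiency):

def fpgrowth(header, L, min_support, suffix=()):
    patterns = {}
    for item in reversed(L):                    # process from the end of list L
        support = sum(n.count for n in header[item])
        patterns[(item,) + suffix] = support    # grow the suffix by this item
        base = []                               # Step 1: conditional pattern base
        for node in header[item]:
            path, p = [], node.parent
            while p is not None and p.item is not None:
                path.append(p.item)
                p = p.parent
            base.extend([path[::-1]] * node.count)  # one prefix path per count
        if any(base):                           # Step 2: conditional FP-tree
            _, cheader, cL = build_fptree(base, min_support)
            patterns.update(fpgrowth(cheader, cL, min_support, (item,) + suffix))  # Step 3
    return patterns

db = [{'f','a','c','d','g','i','m','p'}, {'a','b','c','f','l','m','o'},
      {'b','f','h','j','o'}, {'b','c','k','s','p'}, {'a','f','c','e','l','p','m','n'}]
_, header, L = build_fptree(db, 3)
print({frozenset(p): s for p, s in fpgrowth(header, L, 3).items()})
# includes frozenset({'f','c','a','m'}): 3, matching the table above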
Single FP-tree Path Generation
 Suppose an FP-tree T has a single path P. The complete set of frequent patterns of T can be generated by enumerating all the combinations of the sub-paths of P.

Example: the m-conditional FP-tree is the single path f:3 → c:3 → a:3. All frequent patterns concerning m are m combined with every subset of {f, c, a}:
m, fm, cm, am, fcm, fam, cam, fcam
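
In code, this single-path shortcut is a one-line subset enumeration; shown here for the m-conditional path {f, c, a}:

from itertools import chain, combinations

path = ['f', 'c', 'a']                  # the single path of m's conditional FP-tree
suffix = ('m',)
patterns = [tuple(c) + suffix for c in
            chain.from_iterable(combinations(path, r) for r in range(len(path) + 1))]
print(patterns)   # m, fm, cm, am, fcm, fam, cam, fcam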
Summary of the FP-Growth Algorithm
 Mining frequent patterns can be viewed as first mining 1-itemsets and then progressively growing each 1-itemset by mining its conditional pattern base recursively
 This transforms a frequent k-itemset mining problem into a sequence of k frequent 1-itemset mining problems via a set of conditional pattern bases