
191CSC503T DATA MINING

UNIT II
DATA MINING - FREQUENT PATTERN ANALYSIS

Mining Frequent Patterns, Associations and Correlations – Mining Methods – Pattern Evaluation
Methods – Pattern Mining in Multilevel, Multidimensional Space – Constraint-Based Frequent Pattern
Mining – Classification using Frequent Patterns.

Association Mining

• Association rule mining:


– Finding frequent patterns, associations, correlations, or causal structures among sets of
items or objects in transaction databases, relational databases, and other information
repositories.
• Applications:
– Basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering,
classification, etc.
• Examples:
– Rule form: “Body → Head [support, confidence]”
– buys(x, “diapers”) → buys(x, “beers”) [0.5%, 60%]
– major(x, “CS”) ∧ takes(x, “DB”) → grade(x, “A”) [1%, 75%]

Association Rule: Basic Concepts

• Given: (1) database of transactions, (2) each transaction is a list of items (purchased by a
customer in a visit)
• Find: all rules that correlate the presence of one set of items with that of another set of items
– E.g., 98% of people who purchase tires and auto accessories also get automotive services
done
• Applications


– * → Maintenance Agreement (What should the store do to boost Maintenance Agreement sales?)
– Home Electronics → * (What other products should the store stock up on?)
– Attached mailing in direct marketing
– Detecting “ping-pong”ing of patients, faulty “collisions”

Rule Measures: Support and Confidence

• Find all the rules X & Y  Z with minimum confidence and support
– support, s, probability that a transaction contains {X  Y  Z}
– confidence, c, conditional probability that a transaction having {X  Y} also contains Z

Let minimum support 50%, and minimum confidence 50%, we have

– A  C (50%, 66.6%)
– C  A (50%, 100%)

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F
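The two rules above can be re-derived mechanically from this table. Below is a minimal sketch in plain Python; the helper names `support` and `confidence` are ours, for illustration, not from any library.

```python
# Verify A -> C and C -> A against the four transactions in the table.
transactions = [
    {"A", "B", "C"},   # TID 2000
    {"A", "C"},        # TID 1000
    {"A", "D"},        # TID 4000
    {"B", "E", "F"},   # TID 5000
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """support(lhs U rhs) / support(lhs)."""
    return support(lhs | rhs) / support(lhs)

print(support({"A", "C"}), confidence({"A"}, {"C"}))  # 0.5, 0.666...
print(support({"A", "C"}), confidence({"C"}, {"A"}))  # 0.5, 1.0
```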

Association Rule Mining: A Road Map

• Boolean vs. quantitative associations (Based on the types of values handled)


– buys(x, “SQLServer”) ∧ buys(x, “DMBook”) → buys(x, “DBMiner”) [0.2%, 60%]
– age(x, “30..39”) ∧ income(x, “42..48K”) → buys(x, “PC”) [1%, 75%]
• Single-dimensional vs. multidimensional associations (see the examples above)
• Single level vs. multiple-level analysis
– What brands of beers are associated with what brands of diapers?
• Various extensions
– Correlation, causality analysis
• Association does not necessarily imply correlation or causality


– Maxpatterns and closed itemsets


– Constraints enforced
• E.g., small sales (sum < 100) trigger big buys (sum > 1,000)?

Market – Basket analysis

A market basket is a collection of items purchased by a customer in a single transaction, which


is a well-defined business activity. For example, a customer's visits to a grocery store or an online
purchase from a virtual store on the Web are typical customer transactions. Retailers accumulate huge
collections of transactions by recording business activities over time. One common analysis run against
a transactions database is to find sets of items, or itemsets, that appear together in many transactions.
A business can use knowledge of these patterns to improve the placement of these items in the store or
the layout of mail-order catalog pages and Web pages. An itemset containing i items is called an i-itemset.
The percentage of transactions that contain an itemset is called the itemset's support. For an itemset to
be interesting, its support must be higher than a user-specified minimum. Such itemsets are said to be
frequent.

Figure: Market basket analysis.
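As a concrete illustration of these definitions, the brute-force sketch below (plain Python; the transaction data is a toy example, not from the text) enumerates every candidate itemset and keeps the frequent ones. This is workable only for tiny item universes, which is what motivates the Apriori and FP-growth methods discussed next.

```python
# Brute-force illustration of "frequent itemset": test every candidate
# itemset over the item universe against a user-specified minimum support.
from itertools import combinations

transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
min_sup = 0.5
items = sorted(set().union(*transactions))

frequent = {}
for k in range(1, len(items) + 1):          # k-itemsets for every size k
    for cand in combinations(items, k):
        sup = sum(set(cand) <= t for t in transactions) / len(transactions)
        if sup >= min_sup:
            frequent[cand] = sup

print(frequent)  # {('A',): 0.75, ('B',): 0.5, ('C',): 0.5, ('A', 'C'): 0.5}
```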


Rule support and confidence are two measures of rule interestingness. They respectively reflect
the usefulness and certainty of discovered rules. A support of 2% for an association rule means that 2%
of all the transactions under analysis show that computer and financial management software are
purchased together. A confidence of 60% means that 60% of the customers who purchased a computer
also bought the software. Typically, association rules are considered interesting if they satisfy both a
minimum support threshold and a minimum confidence threshold.

Mining Frequent Patterns

Apriori is the method that mines the complete set of frequent itemsets with candidate generation. The following covers the Apriori property and the Apriori algorithm.

Apriori property

• All nonempty subsets of a frequent itemset must also be frequent.
– If an itemset I does not satisfy the minimum support threshold, min_sup, then I is not frequent, i.e., support(I) < min_sup.
– If an item A is added to the itemset I, then the resulting itemset (I ∪ A) cannot occur more frequently than I.
• Monotonic functions are functions that move in only one direction.
• This property is called anti-monotonicity.
• If a set cannot pass a test, all of its supersets will fail the same test as well.
• In this sense the property is monotonic in failing the test.

The Apriori Algorithm

• Join step: Ck is generated by joining Lk-1 with itself
• Prune step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset
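A compact sketch of the level-wise loop built from these two steps, reusing the toy transactions from earlier (illustrative code, not the textbook's pseudocode):

```python
# Apriori sketch: join L(k-1) with itself, prune candidates that have an
# infrequent (k-1)-subset, then count support for the survivors.
from itertools import combinations

transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
min_sup = 0.5

def is_frequent(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions) >= min_sup

# L1: frequent 1-itemsets
Lk = {frozenset([i]) for t in transactions for i in t if is_frequent({i})}
all_frequent = set(Lk)
k = 2
while Lk:
    # Join step: unions of (k-1)-itemsets that differ in exactly one item.
    candidates = {a | b for a in Lk for b in Lk if len(a | b) == k}
    # Prune step: every (k-1)-subset of a candidate must itself be in L(k-1).
    candidates = {c for c in candidates
                  if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
    Lk = {c for c in candidates if is_frequent(c)}
    all_frequent |= Lk
    k += 1

print(all_frequent)  # {A}, {B}, {C}, and {A, C}
```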


Example


FP-growth is the method that mines the complete set of frequent itemsets without candidate generation.

• Compress a large database into a compact, Frequent-Pattern tree (FP-tree) structure


– highly condensed, but complete for frequent pattern mining
– avoid costly database scans
• Develop an efficient, FP-tree-based frequent pattern mining method
– A divide-and-conquer methodology: decompose mining tasks into smaller ones
– Avoid candidate generation: sub-database test only!

Construct FP-tree from a Transaction DB

TID   Items bought                 (Ordered) frequent items
100   {f, a, c, d, g, i, m, p}     {f, c, a, m, p}
200   {a, b, c, f, l, m, o}        {f, c, a, b, m}
300   {b, f, h, j, o}              {f, b}
400   {b, c, k, s, p}              {c, b, p}
500   {a, f, c, e, l, p, m, n}     {f, c, a, m, p}

(min_support = 0.5)
Steps:

1. Scan DB once, find frequent 1-itemset (single item pattern)


2. Order frequent items in frequency descending order
3. Scan DB again, construct FP-tree
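The three steps can be sketched as follows (illustrative Python; the class and variable names are ours, and ties among equally frequent items may be ordered slightly differently than in the table above):

```python
# Minimal FP-tree construction following the three steps listed above.
from collections import defaultdict

transactions = [
    ["f", "a", "c", "d", "g", "i", "m", "p"],
    ["a", "b", "c", "f", "l", "m", "o"],
    ["b", "f", "h", "j", "o"],
    ["b", "c", "k", "s", "p"],
    ["a", "f", "c", "e", "l", "p", "m", "n"],
]
min_count = 3  # min_support 0.5 over 5 transactions

# Step 1: scan the DB once, count single items, keep the frequent ones.
counts = defaultdict(int)
for t in transactions:
    for item in t:
        counts[item] += 1
freq = {i: c for i, c in counts.items() if c >= min_count}

# Step 2: rank frequent items in descending frequency order (ties broken
# arbitrarily, so the order may differ slightly from the table above).
order = sorted(freq, key=lambda i: -freq[i])
rank = {item: r for r, item in enumerate(order)}

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 0, {}

# Step 3: scan the DB again and insert each transaction's ordered
# frequent items into the tree, sharing common prefixes.
root = Node(None, None)
header = defaultdict(list)  # item -> node-links, as in the header table
for t in transactions:
    node = root
    for item in sorted((i for i in t if i in freq), key=rank.get):
        if item not in node.children:
            node.children[item] = Node(item, node)
            header[item].append(node.children[item])
        node = node.children[item]
        node.count += 1

print({i: freq[i] for i in order})
# {'f': 4, 'c': 4, 'a': 3, 'm': 3, 'p': 3, 'b': 3}
```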


Header Table

Item  frequency  head of node-links
f     4
c     4
a     3
b     3
m     3
p     3

Benefits of the FP-tree Structure

• Completeness:
– never breaks a long pattern of any transaction
– preserves complete information for frequent pattern mining
• Compactness
– reduce irrelevant information—infrequent items are gone
– frequency descending ordering: more frequent items are more likely to be shared
– never larger than the original database (not counting node-links and counts)
– Example: for the Connect-4 DB, the compression ratio could be over 100

Mining Frequent Patterns Using FP-tree

• General idea (divide-and-conquer)


– Recursively grow frequent pattern path using the FP-tree
• Method
– For each item, construct its conditional pattern-base, and then its conditional FP-tree
– Repeat the process on each newly created conditional FP-tree
– Until the resulting FP-tree is empty, or it contains only one path (single path will generate
all the combinations of its sub-paths, each of which is a frequent pattern)


Major Steps to Mine FP-tree

1) Construct conditional pattern base for each node in the FP-tree


2) Construct conditional FP-tree from each conditional pattern-base
3) Recursively mine conditional FP-trees and grow frequent patterns obtained so far
▪ If the conditional FP-tree contains a single path, simply enumerate all the patterns
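The single-path shortcut in step 3 can be sketched directly: every combination of items on the path, appended to the suffix pattern being grown, is frequent, with support equal to the minimum count among the chosen nodes. The helper below is illustrative; the suffix {m} example matches the conditional FP-tree f:3 → c:3 → a:3 obtained from the transactions above.

```python
# Enumerate all patterns from a single-path conditional FP-tree.
from itertools import combinations

def patterns_from_single_path(path, suffix):
    """path: list of (item, count) pairs down the single path;
    suffix: the frequent suffix itemset this conditional tree belongs to."""
    out = {}
    for k in range(1, len(path) + 1):
        for combo in combinations(path, k):
            items = frozenset(i for i, _ in combo) | suffix
            out[items] = min(c for _, c in combo)  # support = weakest node
    return out

# Conditional FP-tree for suffix {m}: the single path f:3 -> c:3 -> a:3.
print(patterns_from_single_path([("f", 3), ("c", 3), ("a", 3)],
                                frozenset({"m"})))
# {f,m}:3, {c,m}:3, {a,m}:3, {f,c,m}:3, {f,a,m}:3, {c,a,m}:3, {f,c,a,m}:3
```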

Principles of Frequent Pattern Growth

• Pattern growth property
– Let α be a frequent itemset in DB, B be α's conditional pattern base, and β be an itemset in B. Then α ∪ β is a frequent itemset in DB iff β is frequent in B.
• “abcdef” is a frequent pattern, if and only if
– “abcde” is a frequent pattern, and
– “f” is frequent in the set of transactions containing “abcde”

Why Is Frequent Pattern Growth Fast?

• Our performance study shows


– FP-growth is an order of magnitude faster than Apriori, and is also faster than tree-
projection
• Reasoning
– No candidate generation, no candidate test
– Use compact data structure
– Eliminate repeated database scan
– Basic operations are counting and FP-tree building

Mining multilevel association rules from transactional databases.

• Items often form hierarchy.


• Items at the lower level are expected to have lower support.


• Rules regarding itemsets at appropriate levels could be quite useful.

• Transaction database can be encoded based on dimensions and levels


• We can explore shared multi-level mining

Figure: concept hierarchy of the items — Food at the top level; Milk and Bread at the next level; Milk specializes into Skim and 2%, and Bread into Wheat and White; brand names such as Fraser and Sunset form the lowest level.

TID Items
T1 {111, 121, 211, 221}
T2 {111, 211, 222, 323}
T3 {112, 122, 221, 411}
T4 {111, 121}
T5 {111, 122, 211, 221, 413}

Mining Multi-Level Associations

• A top-down, progressive deepening approach:
– First find high-level strong rules:
milk → bread [20%, 60%]
– Then find their lower-level “weaker” rules:
2% milk → wheat bread [6%, 50%]
• Variations of mining multiple-level association rules:
– Level-crossed association rules:
2% milk → Wonder wheat bread


– Association rules with multiple, alternative hierarchies:
2% milk → Wonder bread

Multi-level Association: Uniform Support vs. Reduced Support

• Uniform support: the same minimum support for all levels
– (+) Only one minimum support threshold; no need to examine itemsets containing any item whose ancestors do not have minimum support.
– (−) Lower-level items do not occur as frequently. If the support threshold is
• too high → we miss low-level associations
• too low → we generate too many high-level associations
• Reduced support: reduced minimum support at lower levels
– There are 4 search strategies:
• Level-by-level independent
• Level-cross filtering by k-itemset
• Level-cross filtering by single item
• Controlled level-cross filtering by single item

Multi-level Association: Redundancy Filtering

• Some rules may be redundant due to “ancestor” relationships between items.


• Example
– milk  wheat bread [support = 8%, confidence = 70%]
– 2% milk  wheat bread [support = 2%, confidence = 72%]
• We say the first rule is an ancestor of the second rule.
• A rule is redundant if its support is close to the “expected” value, based on the rule’s ancestor
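A hedged worked check of the milk example above (the 25% market share of 2% milk within all milk purchases is an assumed figure, not given in the text):

```python
# Redundancy check: is the descendant rule's support close to what the
# ancestor rule predicts? Tolerances here are our own illustrative choice.
ancestor_sup, ancestor_conf = 0.08, 0.70   # milk -> wheat bread
share_of_ancestor = 0.25                   # assumed: 2% milk is 25% of milk
rule_sup, rule_conf = 0.02, 0.72           # 2% milk -> wheat bread

expected_sup = ancestor_sup * share_of_ancestor        # 0.08 * 0.25 = 0.02
redundant = (abs(rule_sup - expected_sup) < 0.005 and
             abs(rule_conf - ancestor_conf) < 0.05)
print(expected_sup, redundant)  # 0.02 True -> prune the descendant rule
```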

Multi-Level Mining: Progressive Deepening

• A top-down, progressive deepening approach:


– First mine high-level frequent items:
milk (15%), bread (10%)


– Then mine their lower-level “weaker” frequent itemsets:


2% milk (5%), wheat bread (4%)

• Different min_support thresholds across the levels lead to different algorithms:
– If adopting the same min_support across all levels, then toss itemset t if any of t's ancestors is infrequent.
– If adopting a reduced min_support at lower levels, then examine only those descendants whose ancestors are frequent/non-negligible.

Correlation in detail.

• Interest (correlation, lift)
– takes both P(A) and P(B) into consideration
– interest(A, B) = P(A ∪ B) / (P(A) × P(B)); if A and B are independent events, then P(A ∪ B) = P(A) × P(B) and the interest is 1
– A and B are negatively correlated if the value is less than 1; otherwise A and B are positively correlated

X  1 1 1 1 0 0 0 0
Y  1 1 0 0 0 0 0 0
Z  0 1 1 1 1 1 1 1

Itemset  Support  Interest
X, Y     25%      2
X, Z     37.50%   0.9
Y, Z     12.50%   0.57
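The Interest column can be recomputed from the truth table with a minimal sketch in plain Python:

```python
# Recompute support and interest (lift) for the eight rows above.
rows = list(zip(
    [1, 1, 1, 1, 0, 0, 0, 0],   # X
    [1, 1, 0, 0, 0, 0, 0, 0],   # Y
    [0, 1, 1, 1, 1, 1, 1, 1],   # Z
))
n = len(rows)

def p(*idx):
    """P(all the indexed variables equal 1 in a row)."""
    return sum(all(r[i] for i in idx) for r in rows) / n

X, Y, Z = 0, 1, 2
for a, b, name in [(X, Y, "X,Y"), (X, Z, "X,Z"), (Y, Z, "Y,Z")]:
    print(name, p(a, b), round(p(a, b) / (p(a) * p(b)), 2))
# X,Y 0.25 2.0    X,Z 0.375 0.86 (~0.9 in the table)    Y,Z 0.125 0.57
```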
χ² Correlation

• χ² measures correlation between categorical attributes

χ² = Σ (observed − expected)² / expected

            game          not game      sum (row)
video       4000 (4500)   3500 (3000)    7500
not video   2000 (1500)    500 (1000)    2500
sum (col.)  6000          4000          10000


• expected(i, j) = count(row i) × count(column j) / N
• χ² = (4000 − 4500)²/4500 + (3500 − 3000)²/3000 + (2000 − 1500)²/1500 + (500 − 1000)²/1000 = 555.6
• Since χ² > 1 and the observed count for (game, video) is less than its expected value, game and video are negatively correlated
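A sketch recomputing χ² from the contingency table above (plain Python):

```python
# Chi-square statistic for the 2x2 (game, video) contingency table.
observed = [[4000, 3500],   # video:     game, not game
            [2000,  500]]   # not video: game, not game
row = [sum(r) for r in observed]
col = [sum(c) for c in zip(*observed)]
n = sum(row)

chi2 = 0.0
for i in range(2):
    for j in range(2):
        expected = row[i] * col[j] / n          # e.g. 7500*6000/10000 = 4500
        chi2 += (observed[i][j] - expected) ** 2 / expected
print(round(chi2, 1))  # 555.6
```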

Numeric correlation

• Correlation concept in statistics


– Used to study the relationship existing between 2 or more numeric variables
– A correlation is a measure of the linear relationship between variables
Ex: number of hours spent studying for a class vs. the grade received
– Outcomes (sign of the measure):
• positive → positively related
• zero → not related
• negative → negatively related
– Statistical relationships
• Covariance
• Correlation coefficient
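A minimal sketch of both measures on illustrative data (the hours/grades values are invented for the example):

```python
# Covariance and Pearson correlation coefficient for two numeric variables.
hours  = [2.0, 4.0, 6.0, 8.0, 10.0]     # hours spent studying
grades = [55.0, 60.0, 72.0, 80.0, 88.0]  # grade received

n = len(hours)
mx = sum(hours) / n
my = sum(grades) / n
cov = sum((x - mx) * (y - my) for x, y in zip(hours, grades)) / n
sx = (sum((x - mx) ** 2 for x in hours) / n) ** 0.5
sy = (sum((y - my) ** 2 for y in grades) / n) ** 0.5
r = cov / (sx * sy)
print(round(cov, 2), round(r, 3))  # 34.4 0.994 -> strongly positively related
```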

Constraint-Based Mining in detail.

• Interactive, exploratory mining of gigabytes of data?
– Could it be real? — By making good use of constraints!
• What kinds of constraints can be used in mining?
– Knowledge type constraint: classification, association, etc.
– Data constraint: SQL-like queries
• Find product pairs sold together in Vancouver in Dec.’98.
– Dimension/level constraints:
• in relevance to region, price, brand, customer category.
– Rule constraints
• small sales (price < $10) triggers big sales (sum > $200).


– Interestingness constraints:
• strong rules (min_support ≥ 3%, min_confidence ≥ 60%).

Rule Constraints in Association Mining

• Two kinds of rule constraints:


– Rule form constraints: meta-rule guided mining.
• P(x, y) ∧ Q(x, w) → takes(x, “database systems”).
– Rule (content) constraint: constraint-based query optimization (Ng, et al., SIGMOD’98).
• sum(LHS) < 100 ^ min(LHS) > 20 ^ count(LHS) > 3 ^ sum(RHS) > 1000
• 1-variable vs. 2-variable constraints (Lakshmanan, et al. SIGMOD’99):
– 1-var: A constraint confining only one side (L/R) of the rule, e.g., as shown above.
– 2-var: A constraint confining both sides (L and R).
• sum(LHS) < min(RHS) ^ max(RHS) < 5* sum(LHS)


Pattern Mining in Multilevel, Multidimensional Space

This section first focuses on multilevel associations involving different abstraction levels.

Mining Multilevel Associations

Consider the example shown in Fig. 3.1.6. It shows the concept hierarchy of the transactions from Table 3.1.5.
Mining Multidimensional Associations

Mining Quantitative Association Rules

In this section, three methods of mining quantitative association rules are discussed:

1) A data cube method 2) A clustering based method 3) A statistical analysis method

Data cube method


In this method, numeric values are replaced by interval labels.

Constraint-Based Pattern Mining

Metarule Guided Mining of Association Rules


Pruning Pattern Space with Pattern Pruning Constraints

Classification using Frequent Patterns


Associative Classification
