
Unit 3

This document provides an overview of association analysis and frequent pattern mining. It discusses how retailers can analyze customer transaction data to discover purchasing patterns and relationships between items. The key goals are to find frequent itemsets that occur together above a minimum support threshold and generate high-confidence association rules. It describes the basic terminology, formulations, and challenges of the problem. It also gives an overview of the Apriori algorithm for mining frequent itemsets and generating association rules from transactional data.


Mining Frequent Patterns,

Association and Correlations:


Basic Concepts and Methods

BY
Bheema. Shirisha
Association Analysis
• Many business enterprises generate large quantities of data
from their daily operations.
– Example: Customer purchase data are collected daily at the
checkout counters of grocery stores.

• Each row corresponds to a transaction, which contains a unique identifier labeled TID and a set of items.
• Retailers are interested in analyzing the data to learn about
the purchasing behavior of their customers.
• Such valuable information can be used to support a variety of
business-related applications such as marketing promotions,
inventory management, catalog design, store layout and
customer relationship management.
Association Analysis
• Useful for discovering interesting relationships
hidden in large data sets.
• The uncovered relationships can be represented in
the form of association rules or sets of frequent
items
{Diapers}  {Beer}
• The rule suggests, customers who buy diapers also
buy beer.
• Retailers can use rules of this type to help them identify new opportunities for cross-selling products to their customers.
Key Issues
• Two key issues when applying association
analysis to market basket data.
– Discovering patterns from a large transaction data
set can be computationally expensive.
– Some of the discovered patterns are potentially
spurious because they may happen simply by
chance.
Basic
Terminology

• Each row corresponds to a transaction and each column corresponds to an item.
• An item can be treated as a binary variable whose value
is one if the item is present in a transaction and zero
otherwise.
• Because the presence of an item in a transaction is often
considered more important than its absence, an item is
an asymmetric binary variable.
• It ignores certain important aspects of the data such as
the quantity of items sold or the price paid to purchase
them.
Itemset and Support Count
• Let I ={i1,i2,... ,id} be the set of all items in a market basket data and T
={t1,t2,...,tN} be the set of all transactions.
• Each transaction ti contains a subset of items chosen from I.

• In association analysis, a collection of zero or more items is termed an itemset.
• If an itemset contains k items, it is called a k-itemset.
• {Beer, Diapers, Milk} is an example of a 3-itemset.
• The null (or empty) set is an itemset that does not contain any items.

• Support count refers to the number of transactions that contain a particular itemset.
• The support count for {Beer, Diapers, Milk} is equal to two because only two transactions contain all three items.

• The transaction width is defined as the number of items present in a transaction.
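The itemset and support-count definitions above can be sketched in a few lines of Python. The transactions below are a hypothetical toy market-basket table chosen to match the {Beer, Diapers, Milk} example in the text; the function name `support_count` is mine, not from any library.

```python
# Hypothetical toy market-basket data: each transaction is a set of items.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]

def support_count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

print(support_count({"Beer", "Diapers", "Milk"}, transactions))  # -> 2
```

Only two of the five transactions contain all three items, matching the support count of two quoted in the text.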
Association Rule
• An association rule is an implication expression of
the form X → Y, where X and Y are disjoint
itemsets, i.e., X ∩Y = ∅ .
• The strength of an association rule can be
measured in terms of its support and confidence.
• Support determines how often a rule is applicable
to a given data set
• Confidence determines how frequently items in Y
appear in transactions that contain X.
{Milk, Diapers} → {Beer}
Why Use Support and Confidence?
• A low support rule is uninteresting from a business
perspective because it may not be profitable to promote.
• Support is often used to eliminate uninteresting rules

• Confidence measures the reliability of the inference made by a rule.
• For a given rule X → Y, the higher the confidence, the more
likely it is for Y to be present in transactions that contain X.
• Confidence also provides an estimate of the conditional
probability of Y given X.
• It suggests a strong co-occurrence relationship between items
in the antecedent and consequent of the rule.
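Both measures can be computed directly from support counts: support(X → Y) = σ(X ∪ Y)/N and confidence(X → Y) = σ(X ∪ Y)/σ(X). A minimal sketch, using the same hypothetical toy transactions as before (`rule_metrics` is an illustrative name, not a standard API):

```python
# Hypothetical toy market-basket data (same five transactions as earlier).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]

def support_count(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t)

def rule_metrics(X, Y, transactions):
    """Support and confidence of the rule X -> Y."""
    sup_xy = support_count(X | Y, transactions)
    support = sup_xy / len(transactions)       # fraction of all transactions
    confidence = sup_xy / support_count(X, transactions)  # estimate of P(Y | X)
    return support, confidence

s, c = rule_metrics({"Milk", "Diapers"}, {"Beer"}, transactions)
print(round(s, 2), round(c, 2))  # -> 0.4 0.67
```

Here {Milk, Diapers, Beer} appears in 2 of 5 transactions (support 0.4), and 2 of the 3 transactions containing {Milk, Diapers} also contain Beer (confidence 2/3).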
Formulation of Association Rule
Mining Problem
• Definition 6.1 (Association Rule Discovery):

Given a set of transactions T, find all the rules having support ≥ min_sup and confidence ≥ min_conf, where min_sup and min_conf are the corresponding support and confidence thresholds.
Brute-force approach
• The support of a rule X → Y depends only on the
support of its corresponding itemset, X ∪ Y.
• For example, the following rules have identical support because they involve items from the same itemset, {Beer, Diapers, Milk}:
{Beer, Diapers} → {Milk}, {Beer, Milk} → {Diapers}, {Diapers, Milk} → {Beer},
{Beer} → {Diapers, Milk}, {Diapers} → {Beer, Milk}, {Milk} → {Beer, Diapers}
• If the itemset is infrequent, then all six candidate rules can be pruned immediately without computing their confidence values.
Association rule mining algorithms
• A common strategy adopted by many association
rule mining algorithms is to decompose the
problem into two major subtasks:
• 1. Frequent Itemset Generation whose objective
is to find all the itemsets that satisfy the min_sup
threshold. These itemsets are called frequent
itemsets.
• 2. Rule Generation whose objective is to extract
all the high-confidence rules from the frequent
itemsets found in the previous step. These rules are
called strong rules.
Basic Concepts
• Frequent patterns and Association rules are helpful for
making recommendations in business
• Frequent patterns are itemsets, subsequences or substructures
that appear frequently in a data set
– Frequent itemset – appear frequently in a transaction dataset
(Milk and Bread)
– Subsequence – Buying first PC, then camera, then a memory
card, if it occurs frequently in a shopping history DB is a
sequential pattern
– Substructure – refer to different structural forms like subgraphs,
subtrees or sublattices which may be combined with itemsets or
subsequences.
• Finding frequent patterns plays an essential role in mining
associations and correlation relationships among data stored
in transactional and relational data.
• Helps in data classification, clustering and other DM tasks
Market Basket Analysis (MBA)
• Discovery of interesting
correlation and association
relationships can help in many
business decision making
process
• Helpful for selective
marketing and plan for shelf
space
• Advertisement strategies or
design the new catalogue
• Design of different store
layouts
• To plan which items to put on
sale at reduced prices
Cont.,
• Given the set of items available at the store, each item has a Boolean variable, and each basket can be represented by a Boolean vector of values assigned to those items
• The Boolean vector can be analysed for buying patterns
that are frequently associated or purchased together.

• Customers who purchase a computer also tend to buy antivirus software at the same time
• Support and Confidence are two measures of rule
interestingness
Cont.,

• A support of 2% means that 2% of all the transactions under analysis show that the computer and antivirus software were purchased together
• A confidence of 60% means that 60% of the customers who purchased a computer also bought the antivirus software
• Association rules are considered interesting if they
satisfy both minimum support threshold and
minimum confidence threshold
• Additional analysis can be performed to discover
interesting statistical correlations between associated
items
Frequent Itemsets, Closed Itemsets and
Association Rules
• The set {computer, antivirus} is a 2-itemset
• The occurrence frequency of an itemset is the
number of transactions that contain the itemset.
• This is also known as frequency, support count or
count
• Association rule mining can be viewed as two step
process
• i. Find all frequent itemsets – those that satisfy min_sup
• ii. Generate strong association rules from the frequent itemsets – these rules must satisfy both min_sup and min_conf
Closed frequent itemsets and Maximal
frequent itemsets
• An itemset X is closed in a dataset D if there exists no proper super-itemset Y such that Y has the same support count as X in D
• An itemset X is a closed frequent itemset in D if X is both closed and frequent in D
• An itemset X is a maximal frequent itemset in a dataset D if X is frequent, and there exists no super-itemset Y such that X ⊂ Y and Y is frequent in D

• Let C be the set of closed frequent itemsets for a dataset D satisfying a minimum support threshold, min_sup
• Let M be the set of maximal frequent itemsets for D satisfying min_sup
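The closed/maximal distinction can be checked mechanically: among the frequent itemsets, an itemset is closed if no frequent superset has the same count, and maximal if it has no frequent superset at all. A brute-force sketch on a tiny hypothetical dataset (all names are mine; the exhaustive enumeration is for illustration only, not how real miners work):

```python
from itertools import combinations

def frequent_itemsets(transactions, min_sup):
    """All itemsets with support count >= min_sup, by exhaustive enumeration."""
    items = sorted(set().union(*transactions))
    freq = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            s = frozenset(combo)
            count = sum(1 for t in transactions if s <= t)
            if count >= min_sup:
                freq[s] = count
    return freq

def closed_and_maximal(freq):
    """Split frequent itemsets into closed and maximal ones.

    Checking supersets only inside `freq` is sound: a superset with the
    same support as a frequent itemset is itself frequent.
    """
    closed, maximal = set(), set()
    for x, cx in freq.items():
        supersets = [y for y in freq if x < y]
        if not any(freq[y] == cx for y in supersets):
            closed.add(x)          # no frequent superset with equal support
        if not supersets:
            maximal.add(x)         # no frequent superset at all
    return closed, maximal

db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}]
freq = frequent_itemsets(db, min_sup=2)
closed, maximal = closed_and_maximal(freq)
print(sorted("".join(sorted(s)) for s in closed))   # -> ['a', 'ab', 'ac']
print(sorted("".join(sorted(s)) for s in maximal))  # -> ['ab', 'ac']
```

Note that {b} is frequent (count 2) but not closed, because its superset {a, b} has the same count; every maximal itemset is closed, but not the other way around.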
Example
Frequent Itemset Mining Methods
• Apriori Algorithm : mining frequent itemsets for
Boolean association rules
• Apriori is a candidate generation and test approach
• Name of the algorithm is based on Prior
Knowledge of frequent itemset properties
• It is an iterative approach known as level-wise
search, where k-itemsets are used to explore (k+1)-
itemsets.
• First, the set of frequent 1-itemsets is found by
scanning the database to accumulate the count for
each item, and collecting those items that satisfy
minimum support. The resulting set is denoted by
L1.
• Next, L1 is used to find L2, the set of frequent 2-
itemsets, which is used to find L3, and so on, until
no more frequent k-itemsets can be found. The
finding of each Lk requires one full scan of the
database.
• Apriori property: All nonempty subsets of a frequent itemset
must also be frequent.

• By definition, if an itemset I does not satisfy the minimum support threshold, min_sup, then I is not frequent, that is, P(I) < min_sup.
• If an item A is added to the itemset I, then the resulting
itemset (i.e., I U A) cannot occur more frequently than I.
• Therefore, I U A is not frequent either, that is, P(I U A)<
min_sup.
• This property belongs to a special category of properties
called antimonotonicity in the sense that if a set cannot pass
a test, all of its supersets will fail the same test as well.
• It is called antimonotonicity because the property is
monotonic in the context of failing a test.
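The Apriori property is easy to confirm numerically: a transaction containing I ∪ A necessarily contains I, so adding an item can never raise support. A tiny check on the hypothetical toy transactions used earlier:

```python
# Hypothetical toy market-basket data (same five transactions as earlier).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]

def support_count(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t)

I = {"Diapers"}
IA = {"Diapers", "Beer"}  # I with one item A added, i.e. I U {A}
print(support_count(I, transactions), support_count(IA, transactions))  # -> 4 3

# Antimonotonicity: the superset can never be more frequent than the subset,
# so if I fails the min_sup test, I U {A} must fail it too.
assert support_count(IA, transactions) <= support_count(I, transactions)
```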
Apriori Algorithm (Pseudo Code)
Ck: Candidate itemset of size k
Lk : frequent itemset of size k

L1 = {frequent items};
for (k = 1; Lk !=; k++) do begin
Ck+1 = candidates generated from Lk;
for each transaction t in database do
increment the count of all candidates in Ck+1 that are
contained in t
Lk+1 = candidates in Ck+1 with min_support
end
return k Lk;
Apriori Example
Generating Association Rules from
Frequent Itemsets
• Strong association rules satisfy both minimum support and minimum confidence
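The rule-generation step can be sketched as follows: for each frequent itemset of size ≥ 2, try every nonempty proper subset as an antecedent and keep the rules whose confidence meets min_conf. The support counts below are hypothetical toy numbers (for a 2-itemset frequent at count 3), chosen so that both rules come out strong; `generate_rules` is an illustrative name, not a library function.

```python
from itertools import combinations

# Hypothetical support counts, as a frequent-itemset miner would produce them.
freq = {
    frozenset({"computer"}): 3,
    frozenset({"antivirus"}): 4,
    frozenset({"computer", "antivirus"}): 3,
}

def generate_rules(freq, min_conf):
    """Emit each rule X -> Y with X, Y disjoint, X U Y frequent, conf >= min_conf."""
    rules = []
    for itemset, sup in freq.items():
        if len(itemset) < 2:
            continue  # rules need a nonempty antecedent AND consequent
        for r in range(1, len(itemset)):
            for ant in combinations(sorted(itemset), r):
                X = frozenset(ant)
                # Apriori guarantees every subset of a frequent itemset is
                # frequent, so freq[X] is always available here.
                conf = sup / freq[X]
                if conf >= min_conf:
                    rules.append((X, itemset - X, conf))
    return rules

for X, Y, conf in generate_rules(freq, min_conf=0.7):
    print(sorted(X), "->", sorted(Y), round(conf, 2))
```

With these counts, {computer} → {antivirus} has confidence 3/3 = 1.0 and {antivirus} → {computer} has confidence 3/4 = 0.75, so both are strong at min_conf = 0.7.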
Pattern-Growth approach for
Mining Frequent Patterns
• The FP - Growth Approach
– Depth-first search
– Avoid explicit candidate generation
• Adopts a divide-and-conquer strategy
• First, it compresses the database representing frequent items into a frequent
pattern tree, or FP-tree, which retains the itemset association information.
• It then divides the compressed database into a set of conditional databases
(a special kind of projected database), each associated with one frequent
item or “pattern fragment,” and mines each database separately.
• For each “pattern fragment,” only its associated data sets need to be
examined.
• Therefore, this approach may substantially reduce the size of the data sets to
be searched, along with the “growth” of patterns being examined.
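The divide-and-conquer structure described above can be sketched without building an explicit FP-tree: for each frequent item, recurse on the conditional (projected) database of transactions containing it, growing the pattern suffix. This captures the pattern-growth idea, though a real FP-tree additionally compresses shared transaction prefixes; all names and the toy data here are mine.

```python
from collections import Counter

def pattern_growth(db, min_sup, suffix=frozenset()):
    """Mine frequent itemsets by recursive projection (no candidate generation)."""
    results = {}
    counts = Counter(item for t in db for item in t)
    frequent = sorted(i for i, c in counts.items() if c >= min_sup)
    for idx, item in enumerate(frequent):
        pattern = suffix | {item}
        results[pattern] = counts[item]
        # Conditional database for `item`: transactions containing it,
        # restricted to later items in the ordering so each itemset is
        # enumerated exactly once.
        cond_db = [
            {j for j in t if j in frequent[idx + 1:]}
            for t in db if item in t
        ]
        results.update(pattern_growth(cond_db, min_sup, pattern))
    return results

# Hypothetical toy market-basket data (same five transactions as earlier).
db = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]
res = pattern_growth(db, min_sup=3)
print(sorted(tuple(sorted(s)) for s in res))  # 8 frequent itemsets at min_sup = 3
```

On this data it returns the same eight frequent itemsets as Apriori, but each recursive call works only on the (shrinking) projected database of one "pattern fragment", which is the source of FP-Growth's efficiency on dense data.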
Example