Market Basket Analysis Using Association Rules Unit 5
Association Rules
• Think back to the last time you made an impulse
purchase. Maybe you were waiting in the grocery
store checkout lane and bought a pack of
chewing gum or a candy bar.
• You might have even bought this book on a whim
on a bookseller's recommendation.
• These impulse buys are no coincidence, as
retailers use sophisticated data analysis
techniques to identify patterns that will drive
retail behavior.
• In years past, such recommendation systems were
based on the subjective intuition of marketing
professionals and inventory managers or buyers.
• More recently, as barcode scanners, computerized
inventory systems, and online shopping trends have
built a wealth of transactional data, machine learning
has been increasingly applied to learn purchasing
patterns.
• The practice is commonly known as market basket
analysis due to the fact that it has been so frequently
applied to supermarket data.
• The building blocks of a market basket
analysis are the items that may appear in any
given transaction. Groups of one or more
items are surrounded by brackets to indicate
that they form a set, or more specifically, an
itemset that appears in the data with some
regularity. Transactions are specified in terms
of itemsets, such as the following transaction
that might be found in a typical grocery store:
{bread, peanut butter, jelly}
• Association rules are always composed from
subsets of itemsets and are denoted by
relating one itemset on the left-hand side
(LHS) of the rule to another itemset on the
right-hand side (RHS) of the rule. The LHS is
the condition that needs to be met in order to
trigger the rule, and the RHS is the expected
result of meeting that condition. A rule
identified from the example transaction might
be expressed in the form:
{peanut butter, jelly} → {bread}
• In plain language, this association rule states that
if peanut butter and jelly are purchased together,
then bread is also likely to be purchased. In other
words, "peanut butter and jelly imply bread."
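The LHS/RHS notation can be sketched in Python. This is only an illustration: it assumes itemsets are stored as frozensets and that the example transaction is {bread, peanut butter, jelly}. The `Rule` class and its method names are hypothetical and do not come from any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical container for an association rule LHS -> RHS."""
    lhs: frozenset
    rhs: frozenset

    def applies_to(self, transaction: frozenset) -> bool:
        # The rule is triggered when every LHS item is in the transaction.
        return self.lhs <= transaction

    def holds_in(self, transaction: frozenset) -> bool:
        # The rule holds when all LHS and RHS items appear together.
        return (self.lhs | self.rhs) <= transaction


# Assumed example transaction and the rule {peanut butter, jelly} -> {bread}.
transaction = frozenset({"bread", "peanut butter", "jelly"})
rule = Rule(lhs=frozenset({"peanut butter", "jelly"}),
            rhs=frozenset({"bread"}))

triggered = rule.applies_to(transaction)
satisfied = rule.holds_in(transaction)
print(triggered, satisfied)
```

A transaction without jelly, such as {bread, peanut butter}, would not trigger the rule, because the LHS condition is not met.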
• Association rule learners are unsupervised, so there
is no need for the algorithm to be trained, and data
does not need to be labeled ahead of time. The
program is simply unleashed on a dataset in the
hope that interesting associations are found.
• Although association rules are most often
used for market basket analysis, they are
helpful for finding patterns in many different
types of data. Other potential applications
include:
• Searching for interesting and frequently
occurring patterns of DNA and protein
sequences in cancer data
• Finding patterns of purchases or medical
claims that occur in combination with
fraudulent credit card or insurance use
• Identifying combinations of behavior that
precede customers dropping their cellular
phone service or upgrading their cable
television package
The Apriori algorithm for association rule learning
• Just as it is challenging for humans, transactional data
makes association rule mining a challenging task for
machines as well. Transactional datasets are typically
extremely large, both in terms of the number of
transactions as well as the number of items or features
that are monitored. The problem is that the number of
potential itemsets grows exponentially with the
number of features. Given k items that can appear or
not appear in a set, there are 2^k possible itemsets
that could be potential rules. A retailer that sells only
100 different items could have on the order of 2^100 =
1.27e+30 itemsets that an algorithm must evaluate—a
seemingly impossible task.
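The exponential growth is easy to check on a small scale. The sketch below enumerates every subset of three illustrative items (the item names are assumptions) and then computes the count for 100 items. Note that the 2^k figure includes the empty set.

```python
from itertools import combinations


def all_itemsets(items):
    """Yield every subset of `items`, including the empty set."""
    items = sorted(items)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


# Illustrative items only; any three distinct items give the same count.
small = ["bread", "jelly", "peanut butter"]
subsets = list(all_itemsets(small))

count_small = len(subsets)   # 2^3 subsets for k = 3 items
count_100 = 2 ** 100         # exact integer count for k = 100 items

print(count_small)
print(f"{count_100:.2e}")
```

Enumerating subsets this way is feasible for a handful of items but not for a realistic catalog, which is why a pruning strategy is needed.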
• The most widely used approach for efficiently
searching large databases for rules is known as
Apriori. Introduced in 1994 by Rakesh Agrawal
and Ramakrishnan Srikant, the Apriori algorithm
has since become somewhat synonymous with
association rule learning.
• The name is derived from the fact that the
algorithm utilizes a simple prior (that is, a priori)
belief about the properties of frequent itemsets.
• The Apriori algorithm employs a simple a priori belief
to reduce the association rule search space: all
subsets of a frequent itemset must also be frequent.
This heuristic is known as the Apriori property.
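The following is a minimal sketch of how the Apriori property can prune the search. It is a simplified level-wise search written for illustration, not the original published implementation. The function name, the toy grocery transactions, and the 0.4 support threshold are all assumptions.

```python
from itertools import combinations


def apriori_frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain the whole itemset.
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1: keep only the frequent single items.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}
    current = {s for s in current if support(s) >= min_support}
    frequent = {s: support(s) for s in current}

    k = 2
    while current:
        # Join step: combine frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k}
        # Prune step (Apriori property): discard any candidate with an
        # infrequent (k-1)-subset, without scanning the data.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current
                   for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
        frequent.update({c: support(c) for c in current})
        k += 1
    return frequent


# Hypothetical toy data, invented for this example.
transactions = [
    frozenset({"bread", "peanut butter", "jelly"}),
    frozenset({"bread", "peanut butter"}),
    frozenset({"bread", "milk"}),
    frozenset({"bread", "peanut butter", "jelly", "milk"}),
    frozenset({"milk", "eggs"}),
]
frequent = apriori_frequent_itemsets(transactions, min_support=0.4)
for itemset, s in sorted(frequent.items(),
                         key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

In this toy run, {bread, peanut butter, milk} is never counted against the data, because its subset {peanut butter, milk} is already known to be infrequent.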
• The Apriori algorithm is used to mine frequent
itemsets and the association rules derived from them.
It generally operates on a database containing a very
large number of transactions, for example, the items
customers buy at a Big Bazaar store.
• By looking at the sets of purchases in a hospital gift shop, one can
infer that there are a couple of typical buying
patterns. A person visiting a sick friend or
family member tends to buy a get well card
and flowers, while visitors to new mothers
tend to buy plush toy bears and balloons.
Measuring rule interest – support and confidence
• Whether or not an association rule is deemed
interesting is determined by two statistical
measures: support and confidence.
• The support of an itemset or rule measures
how frequently it occurs in the data. For
instance, the itemset {get well card, flowers}
has a support of 3 / 5 = 0.6 in the hospital gift
shop data.
• The support can be calculated for any itemset
or even a single item; for instance, the
support for {candy bar} is 2 / 5 = 0.4, since
candy bars appear in 40 percent of purchases.
The support of an itemset X can be defined
as follows:
support(X) = count(X) / N
• N is the number of transactions in the
database and count(X) is the number of
transactions containing itemset X.
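As a rough sketch, the support calculation can be written in a few lines of Python. The five transactions below are an assumption: they are hypothetical hospital gift shop purchases chosen only to be consistent with the figures quoted in the text (0.6 and 0.4). The `support` function name is illustrative.

```python
def support(itemset, transactions):
    """support(X) = count(X) / N."""
    itemset = frozenset(itemset)
    # count(X): number of transactions containing every item in X.
    count = sum(1 for t in transactions if itemset <= t)
    return count / len(transactions)


# Assumed data: hypothetical gift shop transactions, not from a real source.
transactions = [
    frozenset({"flowers", "get well card", "soda"}),
    frozenset({"plush toy bear", "flowers", "balloons", "candy bar"}),
    frozenset({"get well card", "candy bar", "flowers"}),
    frozenset({"plush toy bear", "balloons", "soda"}),
    frozenset({"get well card", "flowers", "soda"}),
]

card_flowers = support({"get well card", "flowers"}, transactions)
candy = support({"candy bar"}, transactions)
print(card_flowers, candy)
```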
• A rule's confidence is a measurement of its
predictive power or accuracy. It is defined as the
support of the itemset containing both X and Y
divided by the support of the itemset containing only
X:
confidence(X → Y) = support(X, Y) / support(X)
• Essentially, the confidence tells us the proportion of
transactions containing item or itemset X that also
contain item or itemset Y. Keep in mind that the
confidence that X leads to Y is not the same as the
confidence that Y leads to X.
• For example, the confidence of {flowers} →
{get well card} is 0.6 / 0.8 = 0.75. In
comparison, the confidence of {get well card}
→ {flowers} is 0.6 / 0.6 = 1.0. This means that
a purchase involving flowers is accompanied
by a purchase of a get well card 75 percent of
the time, while a purchase of a get well card is
associated with flowers 100 percent of the
time.
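The asymmetry of confidence can be checked with a short sketch. As before, the five transactions are hypothetical, chosen only to match the supports quoted in the text, and the function names are illustrative. Because N appears in both supports and cancels, the sketch divides the raw counts directly.

```python
def count(itemset, transactions):
    """Number of transactions containing every item in the itemset."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t)


def confidence(lhs, rhs, transactions):
    """confidence(X -> Y) = support(X, Y) / support(X)."""
    lhs_count = count(lhs, transactions)
    if lhs_count == 0:
        raise ValueError("LHS never occurs, so confidence is undefined")
    both = count(frozenset(lhs) | frozenset(rhs), transactions)
    return both / lhs_count


# Assumed data: hypothetical gift shop transactions, not from a real source.
transactions = [
    frozenset({"flowers", "get well card", "soda"}),
    frozenset({"plush toy bear", "flowers", "balloons", "candy bar"}),
    frozenset({"get well card", "candy bar", "flowers"}),
    frozenset({"plush toy bear", "balloons", "soda"}),
    frozenset({"get well card", "flowers", "soda"}),
]

flowers_to_card = confidence({"flowers"}, {"get well card"}, transactions)
card_to_flowers = confidence({"get well card"}, {"flowers"}, transactions)
print(flowers_to_card, card_to_flowers)
```

The two directions differ because flowers appear in four transactions while get well cards appear in only three, all of which also contain flowers.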