Data Mining Unit 2

Association rule learning is an unsupervised machine learning technique used to discover relationships between variables in large datasets. It analyzes frequent patterns and correlations between data to identify association rules showing itemsets that occur together in transactions. This allows retailers, for example, to analyze customer purchase data to discover related products customers buy together and inform marketing strategies. Key concepts include support, confidence and lift metrics for evaluating strong rules, and the Apriori algorithm for efficiently mining large transactional datasets to extract frequent itemsets and association rules.


Association Rule Learning

Association rule learning is a type of unsupervised learning technique that checks how one data item depends on another and maps those dependencies so that they can be exploited profitably. It tries to find interesting relations or associations among the variables of a dataset, and it relies on different measures and rules to discover these relations between variables in the database.
Association rule learning is one of the important concepts of machine learning, and it is employed in market basket analysis, web usage mining, continuous production, and so on. Market basket analysis is a technique used by various large retailers to discover associations between items. We can understand it with the example of a supermarket: products that are frequently purchased together are placed together.

For example, if a customer buys bread, they are also likely to buy butter, eggs, or milk, so these products are stocked on the same shelf or nearby.
Consider the following example transactions:
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke

Before we define the rules, let us first look at some basic definitions.

Support Count (σ) – The frequency of occurrence of an itemset.

Here σ({Milk, Bread, Diaper}) = 2

Frequent Itemset – An itemset whose support is greater than or equal to the minsup threshold.

Association Rule – An implication expression of the form X -> Y, where X and Y are disjoint itemsets.

Example: {Milk, Diaper}->{Beer}


Rule Evaluation Metrics –

• Support(s) –

The number of transactions that include all items in both {X} and {Y}, as a fraction of the total number of transactions. It is a measure of how frequently the collection of items occurs together across all transactions.
Support(X => Y) = σ(X ∪ Y) / |T|
It is interpreted as the fraction of transactions that contain both X and Y.
• Confidence(c) –

The ratio of the number of transactions that include all items in both {X} and {Y} to the number of transactions that include all items in {X}.
Conf(X => Y) = Supp(X ∪ Y) / Supp(X)
It measures how often the items in Y appear in transactions that also contain the items in X.
• Lift(l) –

The lift of the rule X => Y is the confidence of the rule divided by the expected confidence, assuming that the item sets X and Y are independent of each other. Under that independence assumption, the expected confidence is simply the support of {Y}.
Lift(X => Y) = Conf(X => Y) / Supp(Y)
A lift value near 1 indicates that X and Y appear together about as often as expected; greater than 1 means they appear together more often than expected, and less than 1 means they appear together less often than expected. Greater lift values indicate a stronger association.

Example – From the above table, consider the rule {Milk, Diaper} => {Beer}:

s = σ({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4

c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 = 0.67

l = Supp({Milk, Diaper, Beer}) / (Supp({Milk, Diaper}) × Supp({Beer})) = 0.4 / (0.6 × 0.6) = 1.11
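
To make the calculations above concrete, here is a minimal Python sketch (using illustrative variable and function names, not any particular library) that computes support, confidence, and lift for the rule {Milk, Diaper} => {Beer} from the five transactions in the table.

# The five transactions from the table above
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset):
    # sigma(X): number of transactions that contain every item of the itemset
    return sum(1 for t in transactions if itemset <= t)

def support(itemset):
    # Fraction of all transactions that contain the itemset
    return support_count(itemset) / len(transactions)

X, Y = {"Milk", "Diaper"}, {"Beer"}
s = support(X | Y)                             # 2/5 = 0.4
c = support_count(X | Y) / support_count(X)    # 2/3 ≈ 0.67
l = c / support(Y)                             # 0.67 / 0.6 ≈ 1.11
print(f"support={s:.2f}, confidence={c:.2f}, lift={l:.2f}")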
Association rules are very useful for analysing datasets. In supermarkets, the data is collected using bar-code scanners, and such databases consist of a large number of transaction records, each listing all items bought by a customer in a single purchase. A manager can therefore see whether certain groups of items are consistently purchased together and use this information for adjusting store layouts, cross-selling, and promotions.

Applications of Association Rule Learning


It has various applications in machine learning and data mining. Below are some popular
applications of association rule learning:
o Market Basket Analysis: This is one of the most popular examples and applications of association rule mining. The technique is commonly used by big retailers to determine the associations between items.
o Medical Diagnosis: Association rules can support diagnosis and treatment, as they help in identifying the probability of a particular illness given other observed conditions.
o Protein Sequence: Association rules help in determining the synthesis of artificial proteins.

o It is also used for catalog design, loss-leader analysis, and many other applications.

Some Important Definitions


o Apriori, translated from Latin as “from the former”, is an algorithm that generates
association rules.
o Association rules represent relationships between individual items or item sets within the
data. These are often written in {A}→{B} format.
o A market basket is a group of one or more items that a customer purchases in one
transaction.
o Confidence (denoted as c) is an estimate of the conditional probability that one item (for example, {A}) is in the basket given that another ({B}) is already present.
o High confidence rules meet or exceed a predefined confidence threshold.
o An item is an individual unit, especially when described as part of a set, list, basket, or other grouping.
o Support (denoted as s) of an individual item is “the fraction of transactions that contain
[it],” or the frequentist probability with which it occurs within a set of transactions. It is,
therefore, an estimate of the proportion of future baskets that will contain the item.
o An item set is any grouping of one or more items.
o Frequent item sets are individual item sets that meet or exceed a given minimum support
threshold.
o Support between item sets gives the frequency at which both item sets occur in the same basket. It is also an estimate of the proportion of future baskets that will contain both item sets.
And these are all the important definitions you need to know for association analysis!

Association Analysis
Association analysis is the task of finding interesting relationships in large datasets. These
interesting relationships can take two forms: frequent item sets or association rules.
Frequent item sets are a collection of items that frequently occur together. The second way
to view interesting relationships is association rules. Association rules suggest that a
strong relationship exists between two items. I’ll illustrate these two concepts with an
example. A list of transactions from a grocery store is shown in figure 11.1.
Figure 11.1. A simple list of transactions from a natural foods grocery store called Hole Foods

Frequent item sets are lists of items that commonly appear together. One example from
figure 11.1 is {wine, diapers, soy milk}. (Recall that sets are denoted by a pair of brackets
{}). From the dataset we can also find an association rule such as diapers → wine. This
means that if someone buys diapers, there’s a good chance they’ll buy wine. With the
frequent item sets and association rules, retailers have a much better understanding of
their customers. Although common examples of association analysis are from the retail
industry, it can be applied to a number of other industries, such as website traffic analysis
and medicine.

Association Analysis:
Basic Concepts and Algorithms

Many business enterprises accumulate large quantities of data from their day-to-day
operations. For example, huge amounts of customer purchase data are collected daily at
the checkout counters of grocery stores. Table 6.1 illustrates an example of such data,
commonly known as market basket transactions. Each row in this table corresponds to a
transaction, which contains a unique identifier labelled TID and a set of items bought by a
given customer. Retailers are interested in analysing the data to learn about the
purchasing behaviour of their customers. Such valuable information can be used to support
a variety of business-related applications such as marketing promotions, inventory
management, and customer relationship management.

This chapter presents a methodology known as association analysis,


which is useful for discovering interesting relationships hidden in large data
sets. The uncovered relationships can be represented in the form of association rules or
sets of frequent items. For example, the following rule can be
extracted from the data set shown in Table 6.1:
Table 6.1. An example of market basket transactions.
TID Items
1 {Bread, Milk}
2 {Bread, Diapers, Beer, Eggs}
3 {Milk, Diapers, Beer, Cola}
4 {Bread, Milk, Diapers, Beer}
5 {Bread, Milk, Diapers, Cola}

{Diapers} −→ {Beer}.

The rule suggests that a strong relationship exists between the sale of diapers and beer
because many customers who buy diapers also buy beer. Retailers can use this type of
rule to help them identify new opportunities for cross-selling their products to the
customers.
Besides market basket data, association analysis is also applicable to other application
domains such as bioinformatics, medical diagnosis, Web mining, and scientific data
analysis. In the analysis of Earth science data, for example, the association patterns may
reveal interesting connections among the ocean, land, and atmospheric processes. Such
information may help Earth scientists develop a better understanding of how the different
elements of the Earth system interact with each other. Even though the techniques
presented here are generally applicable to a wider variety of data sets, for illustrative
purposes, our discussion will focus mainly on market basket data.
There are two key issues that need to be addressed when applying association
analysis to market basket data. First, discovering patterns from a large transaction data set
can be computationally expensive. Second, some of the discovered patterns are potentially
spurious because they may happen simply by chance. The remainder of this chapter is
organized around these two issues. The first part of the chapter is devoted to explaining
the basic concepts of association analysis and the algorithms used to efficiently mine such
patterns. The second part of the chapter deals with the issue of evaluating the discovered
patterns in order to prevent the generation of spurious results.

Problem Definition
This section reviews the basic terminology used in association analysis and
presents a formal description of the task.

Binary Representation Market basket data can be represented in a binary


format as shown in Table 6.2, where each row corresponds to a transaction
and each column corresponds to an item. An item can be treated as a binary
variable whose value is one if the item is present in a transaction and zero
otherwise. Because the presence of an item in a transaction is often considered
more important than its absence, an item is an asymmetric binary variable.

Table 6.2. A binary 0/1 representation of market basket data.


TID  Bread  Milk  Diapers  Beer  Eggs  Cola
1    1      1     0        0     0     0
2    1      0     1        1     1     0
3    0      1     1        1     0     1
4    1      1     1        1     0     0
5    1      1     1        0     0     1
This representation is perhaps a very simplistic view of real market basket data
because it ignores certain important aspects of the data such as the quantity
of items sold or the price paid to purchase them.
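
As a small illustration of this binary encoding, the sketch below converts the transactions of Table 6.1 into the 0/1 matrix of Table 6.2 using plain Python; the variable names are chosen only for this example.

# Transactions from Table 6.1
transactions = {
    1: {"Bread", "Milk"},
    2: {"Bread", "Diapers", "Beer", "Eggs"},
    3: {"Milk", "Diapers", "Beer", "Cola"},
    4: {"Bread", "Milk", "Diapers", "Beer"},
    5: {"Bread", "Milk", "Diapers", "Cola"},
}
items = ["Bread", "Milk", "Diapers", "Beer", "Eggs", "Cola"]

# Asymmetric binary representation: 1 if the item appears in the basket, 0 otherwise
print("TID", *items)
for tid, basket in transactions.items():
    print(tid, *[1 if item in basket else 0 for item in items])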

Itemset and Support Count Let I = {i1,i2,. . .,id} be the set of all items
in a market basket data and T = {t1, t2, . . . , tN } be the set of all transactions.
Each transaction ti contains a subset of items chosen from I. In association
analysis, a collection of zero or more items is termed an itemset. If an itemset
contains k items, it is called a k-itemset. For instance, {Beer, Diapers, Milk}
is an example of a 3-itemset. The null (or empty) set is an itemset that does
not contain any items.

The transaction width is defined as the number of items present in a transaction. A transaction tj is said to contain an itemset X if X is a subset of
tj . For example, the second transaction shown in Table 6.2 contains the item-
set {Bread, Diapers} but not {Bread, Milk}. An important property of an
itemset is its support count, which refers to the number of transactions that
contain a particular itemset. Mathematically, the support count, σ(X), for an
itemset X can be stated as follows:
σ(X) = |{ti | X ⊆ ti, ti ∈ T}|,
where the symbol | · | denotes the number of elements in a set. In the data set
shown in Table 6.2, the support count for {Beer, Diapers, Milk} is equal to
two because there are only two transactions that contain all three items.
Association Rule An association rule is an implication expression of the
form X −→ Y , where X and Y are disjoint itemsets, i.e., X ∩ Y = ∅. The
strength of an association rule can be measured in terms of its support and
confidence. Support determines how often a rule is applicable to a given
data set, while confidence determines how frequently items in Y appear in
transactions that contain X. The formal definitions of these metrics are
Support, s(X −→ Y) = σ(X ∪ Y) / N        (6.1)

Confidence, c(X −→ Y) = σ(X ∪ Y) / σ(X)        (6.2)
Example 6.1. Consider the rule {Milk, Diapers} −→ {Beer}. Since the support count for {Milk, Diapers, Beer} is 2 and the total number of transactions is 5, the rule's support is 2/5 = 0.4. The rule's confidence is obtained by dividing the support count for {Milk, Diapers, Beer} by the support count for {Milk, Diapers}. Since there are 3 transactions that contain milk and diapers, the confidence for this rule is 2/3 = 0.67.
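
As a quick check of Example 6.1, the following minimal Python sketch implements the support count σ(X) and Equations 6.1 and 6.2 directly over the transactions of Table 6.1; the helper name sigma and the data layout are illustrative assumptions, not part of any library.

# Transactions from Table 6.1
T = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]
N = len(T)

def sigma(X):
    # Support count: number of transactions t such that X is a subset of t
    return sum(1 for t in T if X <= t)

X, Y = {"Milk", "Diapers"}, {"Beer"}
support = sigma(X | Y) / N            # Equation 6.1: 2/5 = 0.4
confidence = sigma(X | Y) / sigma(X)  # Equation 6.2: 2/3 ≈ 0.67
print(round(support, 2), round(confidence, 2))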
Why Use Support and Confidence?
Support is an important measure because a rule that has very low support may occur
simply by chance. A low support rule is also likely to be uninteresting from a business
perspective because it may not be profitable to promote items that customers seldom buy
together (with the exception of the situation described in Section 6.8). For these reasons,
support is often used to eliminate uninteresting rules. As will be shown in Section 6.2.1,
support also has a desirable property that can be exploited for the efficient discovery of
association rules. Confidence, on the other hand, measures the reliability of the inference
made by a rule. For a given rule X −→ Y , the higher the confidence, the more likely it is for
Y to be present in transactions that contain X. Confidence also provides an estimate of the
conditional probability of Y given X. Association analysis results should be interpreted with
caution. The inference made by an association rule does not necessarily imply causality.
Instead, it suggests a strong co-occurrence relationship between items in the antecedent
and consequent of the rule. Causality, on the other hand, requires knowledge about the
causal and effect attributes in the data and typically involves relationships occurring over
time (e.g., ozone depletion leads to global warming).

What is Apriori Algorithm?


The Apriori algorithm is an algorithm used for mining frequent item sets and the relevant association rules. Generally, the Apriori algorithm operates on a database containing a huge number of transactions, for example, the items customers buy at a Big Bazar.

The Apriori algorithm helps customers buy their products with ease and increases the sales performance of the particular store.

Apriori Algorithm
The Apriori algorithm is used to calculate association rules between objects, that is, to describe how two or more objects are related to one another. In other words, the Apriori algorithm is an association rule learning method that analyzes whether people who bought product A also tended to buy product B.

The primary objective of the Apriori algorithm is to create association rules between different objects. The association rule describes how two or more objects are related to one another. The Apriori algorithm is also used for frequent pattern mining. Generally, you run the Apriori algorithm on a database that consists of a huge number of transactions. Let's understand the Apriori algorithm with the help of an example: suppose you go to Big Bazar and buy different products. The resulting analysis helps customers buy their products with ease and increases the sales performance of the Big Bazar.
How does the Apriori Algorithm work in Data Mining?
We will understand this algorithm with the help of an example.

Consider a Big Bazar scenario where the product set is P = {Rice, Pulse, Oil, Milk, Apple}.
The database comprises six transactions where 1 represents the presence of the product
and 0 represents the absence of the product.

Transaction ID Rice Pulse Oil Milk Apple


t1 1 1 1 0 0
t2 0 1 1 1 0
t3 0 0 0 1 1
t4 1 1 0 1 0
t5 1 1 1 0 1
t6 1 1 1 1 1

The Apriori Algorithm makes the given assumptions

o All subsets of a frequent itemset must be frequent.


o All supersets of an infrequent item set must be infrequent.
o Fix a threshold support level. In our case, we have fixed it at 50 percent.

Step 1:

Make a frequency table of all the products that appear in the transactions. Now, shortlist the frequency table to include only those products whose support is above the 50 percent threshold. We get the following frequency table.

Product Frequency (Number of transactions)


Rice (R) 4
Pulse(P) 5
Oil(O) 4
Milk(M) 4
The above table indicates the products frequently bought by the customers.

Step 2:

Create pairs of products such as RP, RO, RM, PO, PM, OM. You will get the given
frequency table.

Itemset Frequency (Number of transactions)


RP 4
RO 3
RM 2
PO 4
PM 3
OM 2
Step 3:

Apply the same 50 percent support threshold and keep only the itemsets that satisfy it. In our case, that means itemsets appearing in at least 3 of the 6 transactions.

Thus, we get RP, RO, PO, and PM.
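
As a rough sketch of Steps 1 through 3 (reading the threshold as "at least 3 of the 6 transactions"; under this reading Apple also survives Step 1 but produces no frequent pairs), the following Python code counts single products and product pairs over the Big Bazar transactions. The variable names and the min_count constant are assumptions made for this illustration.

from itertools import combinations

# The six Big Bazar transactions from the table above
transactions = [
    {"Rice", "Pulse", "Oil"},
    {"Pulse", "Oil", "Milk"},
    {"Milk", "Apple"},
    {"Rice", "Pulse", "Milk"},
    {"Rice", "Pulse", "Oil", "Apple"},
    {"Rice", "Pulse", "Oil", "Milk", "Apple"},
]
min_count = 3  # 50 percent of 6 transactions

def count(itemset):
    # Number of transactions containing every item in the itemset
    return sum(1 for t in transactions if itemset <= t)

# Step 1: frequent single products
items = sorted(set().union(*transactions))
frequent_1 = [i for i in items if count({i}) >= min_count]
print("Frequent items:", {i: count({i}) for i in frequent_1})

# Steps 2-3: candidate pairs from the frequent items, filtered by the threshold
frequent_2 = [set(p) for p in combinations(frequent_1, 2) if count(set(p)) >= min_count]
print("Frequent pairs:", frequent_2)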

Step 4:

Now, look for sets of three products that the customers buy together. By combining the frequent pairs, we get the following combinations.

RP and RO give RPO


PO and PM give POM

Step 5:

Calculate the frequency of these two three-item sets, and you will get the following frequency table.

Itemset Frequency (Number of transactions)


RPO 3
POM 2
If you apply the threshold assumption, you can figure out that the customers' set of three products is RPO, since only RPO appears in at least 3 transactions.

We have considered an easy example to discuss the Apriori algorithm in data mining. In
reality, you find thousands of such combinations.
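
For readers who want to reproduce this kind of analysis with an existing library, here is a hedged sketch using the third-party mlxtend package (not mentioned in the text above, so treat the exact API calls and thresholds as assumptions of this example). It one-hot encodes the six Big Bazar transactions, mines frequent item sets at 50 percent minimum support, and derives association rules filtered by confidence.

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# The six Big Bazar transactions as lists of items
baskets = [
    ["Rice", "Pulse", "Oil"],
    ["Pulse", "Oil", "Milk"],
    ["Milk", "Apple"],
    ["Rice", "Pulse", "Milk"],
    ["Rice", "Pulse", "Oil", "Apple"],
    ["Rice", "Pulse", "Oil", "Milk", "Apple"],
]

# One-hot encode into the binary representation discussed earlier
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit_transform(baskets), columns=te.columns_)

# Frequent item sets with support >= 0.5, then rules filtered by confidence
frequent = apriori(onehot, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(frequent)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])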
