Data Mining Unit 2
Association rule learning is an unsupervised learning technique that checks for the dependency of one data item on another and maps them accordingly so that the relationships can be exploited profitably. It tries to find interesting relations or associations among the variables of a dataset, using different rule measures to discover those relations in the database.
Association rule learning is one of the important concepts of machine learning, and it is employed in market basket analysis, Web usage mining, continuous production, and so on. Market basket analysis is a technique used by large retailers to discover associations between items. We can understand it with the example of a supermarket, where products that are frequently purchased together are placed near each other.
For example, if a customer buys bread, he is also likely to buy butter, eggs, or milk, so these products are stored on the same shelf or nearby.
Consider the example transactions below:
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
Before we start defining the rule, let us first see the basic definitions.
Frequent Itemset – An itemset whose support is greater than or equal to a minsup threshold.
Association Rule – An implication expression of the form X => Y, where X and Y are two disjoint itemsets.
Support(s) –
The number of transactions that include all the items in both {X} and {Y}, expressed as a percentage of the total number of transactions. It is a measure of how frequently the collection of items occurs together across all transactions.
Support(X => Y) = \sigma(X \cup Y) \div N, where N is the total number of transactions.
It is interpreted as the fraction of transactions that contain both X and Y.
Confidence(c) –
It is the ratio of the number of transactions that include all items in both {X} and {Y} to the number of transactions that include all items in {X}.
Conf(X => Y) = Supp(X \cup Y) \div Supp(X)
It measures how often the items in Y appear in transactions that also contain the items in X.
Lift(l) –
The lift of the rule X => Y is the confidence of the rule divided by the expected confidence, assuming that the itemsets X and Y are independent of each other. The expected confidence is simply the frequency (support) of {Y}.
Lift(X => Y) = Conf(X => Y) \div Supp(Y)
A lift value near 1 indicates that X and Y appear together about as often as expected; a value greater than 1 means they appear together more often than expected, and a value less than 1 means they appear together less often than expected. Greater lift values indicate a stronger association.
Association rule learning is also used for catalog design, loss-leader analysis, and many other applications.
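To make these three measures concrete, here is a small Python sketch (an illustration added for this unit, not part of the source material) that computes support, confidence, and lift for the rule {Diaper} => {Beer} using the five example transactions listed above.

# Support, confidence and lift for a rule X => Y, computed on the
# five example transactions from the table above.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    # Fraction of transactions that contain every item in `itemset`.
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(X, Y):
    # Conf(X => Y) = Supp(X u Y) / Supp(X)
    return support(X | Y) / support(X)

def lift(X, Y):
    # Lift(X => Y) = Conf(X => Y) / Supp(Y)
    return confidence(X, Y) / support(Y)

X, Y = {"Diaper"}, {"Beer"}
print("Support:", support(X | Y))       # 3/5 = 0.6
print("Confidence:", confidence(X, Y))  # 0.6 / 0.8 = 0.75
print("Lift:", lift(X, Y))              # 0.75 / 0.6 = 1.25

The lift of 1.25 is greater than 1, so in this small example Diaper and Beer appear together somewhat more often than would be expected if they were independent.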
Association Analysis
Association analysis is the task of finding interesting relationships in large datasets. These
interesting relationships can take two forms: frequent item sets or association rules.
Frequent item sets are a collection of items that frequently occur together. The second way
to view interesting relationships is association rules. Association rules suggest that a
strong relationship exists between two items. I’ll illustrate these two concepts with an
example. A list of transactions from a grocery store is shown in figure 11.1.
Figure 11.1: A simple list of transactions from a natural foods grocery store called Hole Foods
Frequent item sets are lists of items that commonly appear together. One example from
figure 11.1 is {wine, diapers, soy milk}. (Recall that sets are denoted by a pair of brackets
{}). From the dataset we can also find an association rule such as diapers → wine. This
means that if someone buys diapers, there’s a good chance they’ll buy wine. With the
frequent item sets and association rules, retailers have a much better understanding of
their customers. Although common examples of association analysis are from the retail
industry, it can be applied to a number of other industries, such as website traffic analysis
and medicine.
Association Analysis:
Basic Concepts and Algorithms
Many business enterprises accumulate large quantities of data from their day-to-day
operations. For example, huge amounts of customer purchase data are collected daily at
the checkout counters of grocery stores. Table 6.1 illustrates an example of such data,
commonly known as market basket transactions. Each row in this table corresponds to a
transaction, which contains a unique identifier labelled TID and a set of items bought by a
given customer. Retailers are interested in analysing the data to learn about the
purchasing behaviour of their customers. Such valuable information can be used to support
a variety of business-related applications such as marketing promotions, inventory
management, and customer relationship management.
For example, the following rule can be extracted from such data:
{Diapers} → {Beer}
The rule suggests that a strong relationship exists between the sale of diapers and beer
because many customers who buy diapers also buy beer. Retailers can use this type of rule to help them identify new opportunities for cross-selling products to their customers.
Besides market basket data, association analysis is also applicable to other application
domains such as bioinformatics, medical diagnosis, Web mining, and scientific data
analysis. In the analysis of Earth science data, for example, the association patterns may
reveal interesting connections among the ocean, land, and atmospheric processes. Such
information may help Earth scientists develop a better understanding of how the different
elements of the Earth system interact with each other. Even though the techniques
presented here are generally applicable to a wider variety of data sets, for illustrative
purposes, our discussion will focus mainly on market basket data.
There are two key issues that need to be addressed when applying association
analysis to market basket data. First, discovering patterns from a large transaction data set
can be computationally expensive. Second, some of the discovered patterns are potentially
spurious because they may happen simply by chance. The remainder of this chapter is
organized around these two issues. The first part of the chapter is devoted to explaining
the basic concepts of association analysis and the algorithms used to efficiently mine such
patterns. The second part of the chapter deals with the issue of evaluating the discovered
patterns in order to prevent the generation of spurious results.
Problem Definition
This section reviews the basic terminology used in association analysis and
presents a formal description of the task.
Itemset and Support Count. Let I = {i1, i2, . . . , id} be the set of all items in a market basket data set and T = {t1, t2, . . . , tN} be the set of all transactions. Each transaction ti contains a subset of items chosen from I. In association analysis, a collection of zero or more items is termed an itemset. If an itemset contains k items, it is called a k-itemset. For instance, {Beer, Diapers, Milk} is an example of a 3-itemset. The null (or empty) set is an itemset that does not contain any items. An important property of an itemset is its support count, which refers to the number of transactions that contain the itemset; it is written σ(X), the same σ used in the support formula above.
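As a small illustration (a sketch added for this unit, not part of the source text), the support count σ(X), that is, the number of transactions that contain an itemset X, can be computed directly from a transaction list such as the one in the earlier table:

from itertools import combinations

# The five example transactions from the earlier table.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset):
    # sigma(X): number of transactions that contain every item of X.
    return sum(1 for t in transactions if set(itemset) <= t)

# Support count of the 3-itemset {Beer, Diaper, Milk}.
print(support_count({"Beer", "Diaper", "Milk"}))  # 2

# Support counts of all 2-itemsets that actually occur in the data.
items = sorted(set().union(*transactions))
for pair in combinations(items, 2):
    c = support_count(pair)
    if c > 0:
        print(pair, c)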
Apriori Algorithm
The Apriori algorithm is used to calculate association rules between objects, that is, to describe how two or more objects are related to one another. In other words, we can say that the Apriori algorithm is an association rule learning method that analyzes whether people who bought product A also bought product B.
The primary objective of the Apriori algorithm is to create association rules between different objects. The association rule describes how two or more objects are related to one another. The Apriori algorithm is also widely used for frequent pattern mining. Generally, you operate the Apriori algorithm on a database that consists of a huge number of transactions.
Let's understand the Apriori algorithm with the help of an example: suppose you go to Big Bazar and buy different products. Analyzing such purchases helps the customers buy their products with ease and increases the sales performance of Big Bazar.
How does the Apriori Algorithm work in Data Mining?
We will understand this algorithm with the help of an example
Consider a Big Bazar scenario where the product set is P = {Rice, Pulse, Oil, Milk, Apple}.
The database comprises six transactions where 1 represents the presence of the product
and 0 represents the absence of the product.
Step 1:
Make a frequency table of all the products that appear in the transactions. Then shortlist the table, keeping only those products whose support is above the 50 percent threshold, that is, products that appear in more than half of the six transactions. This gives the first frequency table.
Step 2:
Create pairs of the shortlisted products, such as RP, RO, RM, PO, PM, and OM, and count how often each pair occurs. This gives the pair frequency table.
Step 3:
Apply the same 50 percent support threshold and keep only the pairs that clear it. In our case, these are the pairs that occur in more than three of the six transactions.
Step 4:
Now, look for sets of three products that the customers buy together, built by combining the frequent pairs. We get the resulting combinations.
Step 5:
Calculate the frequency of the three-product itemsets found in Step 4, and you will get the resulting frequency table.
We have considered an easy example to discuss the Apriori algorithm in data mining. In
reality, you find thousands of such combinations.
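The step-by-step procedure above can also be expressed as a short program. The following is a simplified Apriori sketch in Python; because this unit does not reproduce the actual Big Bazar transaction table, the six baskets below are hypothetical stand-ins, and the 50 percent support threshold matches the walkthrough.

# Simplified Apriori pass over hypothetical transactions drawn from
# P = {Rice, Pulse, Oil, Milk, Apple}; the real Big Bazar table is not
# reproduced in this unit, so these six baskets are stand-ins.
transactions = [
    {"Rice", "Pulse", "Oil"},
    {"Rice", "Pulse", "Milk"},
    {"Rice", "Oil", "Milk"},
    {"Pulse", "Oil", "Milk", "Apple"},
    {"Rice", "Pulse", "Oil", "Milk"},
    {"Rice", "Pulse", "Oil", "Apple"},
]
N = len(transactions)
min_support = 0.5  # 50 percent threshold, as in the walkthrough

def support(itemset):
    # Fraction of transactions containing every item of `itemset`.
    return sum(1 for t in transactions if itemset <= t) / N

# Step 1: frequent 1-itemsets (single products above the threshold).
items = sorted(set().union(*transactions))
frequent = [frozenset([i]) for i in items if support({i}) >= min_support]
all_frequent = list(frequent)

# Steps 2-4: join frequent k-itemsets into candidate (k+1)-itemsets
# and keep only those whose support clears the threshold.
k = 1
while frequent:
    k += 1
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
    frequent = [c for c in candidates if support(c) >= min_support]
    all_frequent.extend(frequent)

# Step 5: report every frequent itemset with its support.
for itemset in all_frequent:
    print(sorted(itemset), round(support(itemset), 2))

With these stand-in transactions, the sketch finds the frequent single products, the frequent pairs, and one frequent three-product set, mirroring Steps 1 to 5 of the walkthrough.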
__________________________________________________________