
Wollega University

Chapter Three
Data Warehousing and Data Mining
November 13, 2021
CHAPTER-3
Association Rule Mining
Description
Principle
Design
Algorithm
Rule evaluation
• What Is Association Rule Mining?
• Association rule mining is the task of finding frequent patterns,
associations, correlations, or causal structures among sets of items
in transaction databases.
• It helps us understand customer buying habits by finding associations
and correlations between the different items that customers place in
their “shopping basket”.
• Applications: basket data analysis, cross-marketing, catalog design,
loss-leader analysis, web log analysis, fraud detection
(supervisor -> examiner).
• Basic Concepts of Association Rules:
• Given: (1) a database of transactions,
• (2) each transaction is a list of items purchased by a customer in a
visit.
• Basic Rule Measures: Support and Confidence
• A ⇒ B [ s, c ]
• Support: denotes the frequency of the rule within transactions. A
high value means that the rule involves a large part of the database.
• support(A ⇒ B [ s, c ]) = p(A ∪ B)
• Confidence: denotes the percentage of transactions containing A
that also contain B. It is an estimate of the conditional probability.
• confidence(A ⇒ B [ s, c ]) = p(B|A) = sup(A, B) / sup(A)
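• To make these two measures concrete, here is a minimal Python sketch
(illustrative, not from the original slides; the toy transactions and
item names are made up):

# Minimal sketch: support and confidence for the rule A => B.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "pepsi"},
    {"bread", "butter", "pepsi"},
]

def support(itemset, transactions):
    # Fraction of transactions that contain every item in `itemset`.
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    # Estimate of p(consequent | antecedent) = sup(A, B) / sup(A).
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

print(support({"bread", "butter"}, transactions))       # 0.75
print(confidence({"bread"}, {"butter"}, transactions))  # 1.0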
• Design of Association Rules
• How Association Rules Work
• Association rule mining, at a basic level, involves the use of
machine learning models to analyze data for patterns in a database.
• It identifies frequent if-then associations, which themselves are
the association rules.
• An association rule has two parts:
• an antecedent (if) and a consequent (then).
• An antecedent is an item found within the data. A consequent is an
item found in combination with the antecedent.
• Association rules are created by searching data for frequent if-then
patterns and using the criteria support and confidence to identify
the most important relationships.
• Support is an indication of how frequently the items appear in the
data.
• Confidence indicates how often the if-then statements are found to
be true.
• A third metric, called lift, can be used to compare confidence with
expected confidence, that is, how often an if-then statement is
expected to be found true.
• Principle Of Association Rules
• The strength of a given association rule is measured by two main
parameters: support and confidence.
• Support is an indication of how frequently the items appear in the
data.
• Confidence indicates how often the if-then statements are found to
be true.
• A rule may show a strong correlation in a data set because it appears
very often but may occur far less when applied. This would be a case
of high support, but low confidence.
• A third parameter, known as the lift value, is the ratio of the
rule's confidence to the expected confidence (the confidence the rule
would have if the antecedent and consequent were independent).
• If the lift value is less than 1, there is a negative correlation
between the data points. If the value is greater than 1, there is a
positive correlation, and if the ratio equals 1, there is no
correlation.
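• As a quick worked example (with illustrative numbers, not from the
original slides): suppose 40% of all transactions contain B, and
confidence(A ⇒ B) = 0.8. Then lift = 0.8 / 0.4 = 2, i.e. customers
who buy A are twice as likely to buy B as customers in general.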
• Understanding Basic Association Algorithm Concepts (Rule
evaluation)
• Before going to the algorithm principles, this section introduces a
few basic association algorithm concepts.
• The following sections define the terms and concepts you will need
to understand before implementing the algorithm principles:
• Item set
An item set is a set of items. Each item is an attribute value.
• An item set can contain a set of products such as cake, Pepsi, and milk.
• Each item set has a size, which is the number of items contained in
the item set.
• The size of item set {Cake, Pepsi, Milk} is 3.
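• A minimal sketch of this idea (assuming item sets are represented as
Python frozensets, a common but not prescribed choice):

# A frozenset is hashable, so it can later serve as a dictionary key
# when counting support.
itemset = frozenset({"Cake", "Pepsi", "Milk"})
print(len(itemset))  # 3, the size of the item set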
• What Is An Item Set?
• A set of items together is called an item set. If an item set has k
items, it is called a k-item set. An item set consists of one or more
items. An item set that occurs frequently is called a frequent item
set.
• Frequent item set mining is a data mining technique to identify
the items that often occur together.
• For example: bread and butter, laptop and antivirus software,
etc.
• What Is A Frequent Item Set?
• A set of items is called frequent if it satisfies a minimum threshold
value for support and confidence.
• Support shows transactions with items purchased together in a
single transaction.
• Confidence shows transactions where the items are purchased
one after the other.
• Support
Support is used to measure the popularity of an item set. The support
of an item set {A, B} is the total number of transactions that
contain both A and B.
• Support({A, B}) = Number of transactions containing both A and B

• Note:
• Minimum_Support is a threshold parameter you need to specify
before processing an association model.
• It means that you are interested only in those item sets and rules
that represent at least minimum support of the dataset.
• Probability(Confidence)
Probability is a property of an association rule.
• The probability of a rule A=>B is calculated using the support of
item set {A,B} divided by the support of {A}.
• This probability is also called confidence in the data mining research
community. It is defined as follows:
• Probability(A => B) = Probability(B|A) = Support(A, B) / Support(A)
• Importance
• Importance is also called the interesting score or the lift in some
literature. Importance can be used to measure item sets and rules.
The importance of an item set is defined using the following
formula:
• Importance({A, B}) = Probability(A, B) / (Probability(A) *
Probability(B))
• If importance = 1, A and B are independent items. It means that
the purchase of product A and the purchase of product B are two
independent events.
• If importance < 1, A and B are negatively correlated. This means if a
customer buys A, it is unlikely he will also buy B.
• If importance > 1, A and B are positively correlated. This means if a
customer buys A, it is very likely he will also buy B.
• In other words, with this ratio definition an importance of 1 means
there is no association between A and B, a score above 1 means that
the probability of B goes up when A is true, and a score below 1
means that the probability of B goes down when A is true.
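• A minimal sketch of computing importance (lift) in Python (toy
transactions, for illustration only):

# importance (lift) = P(A, B) / (P(A) * P(B))
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"milk"},
    {"bread"},
]

def p(itemset, transactions):
    # Probability that a random transaction contains `itemset`.
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def importance(a, b, transactions):
    return p(a | b, transactions) / (p(a, transactions) * p(b, transactions))

# Prints 1.33... > 1: buying bread makes buying butter more likely.
print(importance({"bread"}, {"butter"}, transactions))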
• Candidate Generation
• An item set of size k+1 is a candidate to be frequent only if all of its
subsets of size k are known to be frequent.
• Association rule algorithms
• Most algorithms used to identify large itemsets can be classified as
either sequential or parallel.
• i) Sequential Algorithms:
• i) AIS: The AIS algorithm makes multiple passes over the entire
database. During each pass, it scans all transactions.
• In the first pass, it counts the support of individual items and
determines which of them are large or frequent in the database.
• Large itemsets of each pass are extended to generate candidate
itemsets.
• After scanning a transaction, the common itemsets between large
itemsets of the previous pass and items of this transaction are
determined.
• The AIS algorithm was the first published algorithm developed to
generate all large itemsets in a transaction database.
• Advantage:
• The algorithm was used to find whether there was an association
between departments in customers' purchasing behavior.
• Disadvantage:
• A drawback of the AIS algorithm is that the data structures required
for maintaining large and candidate itemsets were not specified.
• ii) SETM:
• Similar to the AIS algorithm, the SETM algorithm makes multiple
passes over the database.
• In the first pass, it counts the support of individual items and
determines which of them are large or frequent in the database.
Then, it generates the candidate itemsets
• SETM also remembers the TIDs (transaction identifiers) of the
generating transactions along with the candidate itemsets.
• The relational merge-join operation can be used to generate
candidate itemsets.
• Advantage:
• While generating candidate sets, the SETM algorithm saves a copy of
the candidate itemsets together with the TID of the generating
transaction in a sequential manner.
• Disadvantage:
• Since for each candidate itemset there is a TID associated with it, it
requires more space to store a large number of TIDs.
• Apriori Algorithm:
• It is by far the most well-known association rule algorithm.
• The fundamental differences of this algorithm from the AIS and
SETM algorithms are the way of generating candidate itemsets and
the selection of candidate itemsets for counting.
• Apriori generates the candidate itemsets by joining the large
itemsets of the previous pass and deleting those candidates that have
a subset which was small in the previous pass, without considering
the transactions in the database.
• By only considering large itemsets of the previous pass, the number
of candidate large itemsets is significantly reduced.
• The Apriori principle
• Apriori is an algorithm for frequent item set mining and
association rule learning over transactional databases.
• It proceeds by identifying the frequent individual items in the
database and extending them to larger and larger item sets as long
as those item sets appear sufficiently often in the database.
• Apriori principle (main observation):
• – If an itemset is frequent, then all of its subsets must also be
frequent.
• – If an itemset is not frequent, then none of its supersets can be
frequent.
• – Formally: ∀X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)
• – The support of an itemset never exceeds the support of its
subsets. This is known as the anti-monotone property of support.
• Important Points About the Apriori Algorithm
• Apriori was proposed by R. Agrawal and R. Srikant as an improvement
over earlier frequent item set mining algorithms such as AIS and
SETM.
• This algorithm uses two steps, “join” and “prune”, to reduce the
search space. It is an iterative approach to discover the most
frequent itemsets.
• Apriori says:
• If P(I) < minimum support threshold, then I is not frequent.
• If P(I ∪ A) < minimum support threshold, then I ∪ A is also not
frequent, where A is any other item set.
• If an item set has support less than the minimum support, then all of
its supersets will also fall below the minimum support, and thus can
be ignored. This property is called the anti-monotone property.
• The steps followed in the Apriori algorithm of data mining are (see
the sketch after this list):
• Join Step: This step generates (k+1)-item sets from k-item sets by
joining each frequent k-item set with the others.
• Prune Step: This step scans the count of each candidate in the
database. If a candidate item set does not meet minimum support, it
is regarded as infrequent and thus removed. This step is performed to
reduce the size of the candidate item sets.
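• A minimal sketch of the join and prune steps (a generic
candidate-generation routine, not code from the original slides; item
sets are Python frozensets):

from itertools import combinations

def apriori_gen(frequent_k, k):
    # Join step: union pairs of frequent k-item sets into (k+1)-item
    # candidates.
    candidates = {a | b for a in frequent_k for b in frequent_k
                  if len(a | b) == k + 1}
    # Prune step: keep a candidate only if every k-subset of it is
    # itself frequent (the anti-monotone property).
    return {c for c in candidates
            if all(frozenset(s) in frequent_k for s in combinations(c, k))}

# Example: from frequent 2-item sets, generate candidate 3-item sets.
L2 = {frozenset({"I1", "I2"}), frozenset({"I1", "I3"}),
      frozenset({"I1", "I5"}), frozenset({"I2", "I3"}),
      frozenset({"I2", "I4"}), frozenset({"I2", "I5"})}
print(apriori_gen(L2, 2))  # only {I1,I2,I3} and {I1,I2,I5} survive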
• Example problem on the Apriori algorithm
• Let us consider the transaction database D shown below. There are 9
transactions in the database. Use the Apriori algorithm to find the
frequent itemsets in D.
• NOTE: Minimum support count = 2

T-ID   LIST OF ITEM-IDs
T1     I1, I2, I5
T2     I2, I4
T3     I2, I3
T4     I1, I2, I4
T5     I1, I3
T6     I2, I3
T7     I1, I3
T8     I1, I2, I3, I5
T9     I1, I2, I3
• STEP-1: Scan D for the count of each candidate to obtain C1.

C1:
ITEMS   SUPPORT COUNT
{I1}    6
{I2}    7
{I3}    6
{I4}    2
{I5}    2

• STEP-2: Compare each candidate's support count with the minimum
support count to obtain L1.

L1:
ITEMS   SUPPORT COUNT
{I1}    6
{I2}    7
{I3}    6
{I4}    2
{I5}    2
• STEP-3: Generate C2 from L1.

C2:
ITEMS      SUPPORT COUNT
{I1,I2}    4
{I1,I3}    4
{I1,I4}    1
{I1,I5}    2
{I2,I3}    4
{I2,I4}    2
{I2,I5}    2
{I3,I4}    0
{I3,I5}    1
{I4,I5}    0

• STEP-4: Compare each candidate's support count with the minimum
support count to obtain L2.

L2:
ITEMS      SUPPORT COUNT
{I1,I2}    4
{I1,I3}    4
{I1,I5}    2
{I2,I3}    4
{I2,I4}    2
{I2,I5}    2
• STEP-5: Generate C3 from L2.

C3:
ITEMS         SUPPORT COUNT
{I1,I2,I3}    2
{I1,I2,I5}    2
{I1,I2,I4}    1
{I1,I3,I5}    1
{I2,I3,I4}    0
{I2,I3,I5}    1
{I2,I4,I5}    0

• STEP-6: Compare each candidate's support count with the minimum
support count to obtain L3.

L3:
ITEMS         SUPPORT COUNT
{I1,I2,I3}    2
{I1,I2,I5}    2
• STEP-7: Generate C4 from L3.

C4:
ITEMS            SUPPORT COUNT
{I1,I2,I3,I5}    1

• STEP-8: Compare the candidate support count with the minimum support
count to obtain L4. Since {I1,I2,I3,I5} has a support count of 1, L4
is empty.
• As per the Apriori algorithm, whenever Lk = 0 the algorithm
terminates, and Lk-1 (here L4-1 = L3) contains the final frequent
item sets.
• The item sets in L3 are the frequent item sets, i.e. {I1,I2,I3} and
{I1,I2,I5}.
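• Putting the steps together, a minimal self-contained Python sketch
of the full Apriori loop (an illustrative implementation, not the
original slides' code), run on the 9-transaction database above:

from itertools import combinations

# Transaction database D from the example above.
D = [{"I1","I2","I5"}, {"I2","I4"}, {"I2","I3"}, {"I1","I2","I4"},
     {"I1","I3"}, {"I2","I3"}, {"I1","I3"}, {"I1","I2","I3","I5"},
     {"I1","I2","I3"}]
MIN_SUPPORT = 2  # minimum support count

def count_and_filter(candidates):
    # Scan D, count each candidate, keep those meeting MIN_SUPPORT.
    counts = {c: sum(1 for t in D if c <= t) for c in candidates}
    return {c: n for c, n in counts.items() if n >= MIN_SUPPORT}

# L1: frequent 1-item sets.
L = count_and_filter({frozenset({i}) for t in D for i in t})
k = 1
while L:
    frequent = L  # remember the last non-empty level
    # Join step: (k+1)-item candidates from frequent k-item sets.
    candidates = {a | b for a in L for b in L if len(a | b) == k + 1}
    # Prune step: drop candidates with an infrequent k-subset.
    candidates = {c for c in candidates
                  if all(frozenset(s) in L for s in combinations(c, k))}
    L = count_and_filter(candidates)
    k += 1

print(frequent)  # {I1,I2,I3}: 2 and {I1,I2,I5}: 2, matching L3 above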
• Below are a few real-world use cases for association
rules:
• Medicine. Doctors can use association rules to help diagnose
patients. There are many variables to consider when making a
diagnosis, as many diseases share symptoms.
• By using association rules and machine learning-fueled data
analysis, doctors can determine the conditional probability of a
given illness by comparing symptom relationships in the data from
past cases.
• Retail. Retailers can collect data about purchasing patterns,
recording purchase data as item barcodes are scanned by point-of-
sale systems.
• Machine learning models can look for co-occurrence in this data to
determine which products are most likely to be purchased together.
The retailer can then adjust marketing and sales strategy to take
advantage of this information.
• User experience (UX) design. Developers can collect data on how
consumers use a website they create. They can use associations in
the data to optimize the website user interface by analyzing where
users tend to click and what maximizes the chance that they
engage with a call to action.

• Entertainment. Services like Netflix and Spotify can use association
rules to fuel their content recommendation engines.
• Machine learning models analyze past user behavior data for
frequent patterns, develop association rules and use those rules to
recommend content that a user is likely to engage with, or organize
content in a way that is likely to put the most interesting content for
a given user first.
