Data Mining
Frequent patterns in data mining are patterns that occur frequently in a dataset. Identifying
these patterns is crucial in tasks like association rule mining, sequential pattern mining, and
structural pattern mining. Frequent patterns can take the form of itemsets, subsequences, or
substructures.
The goal is to find patterns that appear frequently enough to be of interest based on a user-
specified threshold, known as minimum support.
Association Rules
Association rules identify relationships between items in large datasets: they describe how
the occurrence of one item is associated with the occurrence of other items. These rules are
expressed as "if-then" statements that describe the likelihood of items being purchased or
occurring together.
The main task in association rule mining is to identify the strong rules in a database
using the measures of support, confidence, and lift.
Example
Computer → antivirus software [support = 2%, confidence = 60%]
A support of 2% means that 2% of all the transactions under analysis show that computer
and antivirus software are purchased together.
A confidence of 60% means that 60% of the customers who purchased a computer also bought
the software.
Typically, association rules are considered interesting if they satisfy both a minimum support
threshold and a minimum confidence threshold. These thresholds can be set by users or
domain experts.
a) Support
Support refers to the frequency with which an itemset appears in the dataset.
For a rule X → Y, support is the fraction of transactions in the dataset that contain both X and
Y. It reflects the prevalence of the rule in the dataset.
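Mathematically, support for an association rule X → Y is defined as:
$$\mathrm{Support}(X \rightarrow Y) = \mathrm{Support}(X \cup Y) = \frac{\text{number of transactions containing both } X \text{ and } Y}{\text{total number of transactions}}$$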
Example: If 100 out of 1,000 transactions contain both bread and butter, then the support for
the rule {Bread} → {Butter} is 10%.
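The support definition above can be sketched in Python. This is a minimal illustration, not a reference implementation: the `transactions` list and the `support` helper are hypothetical names and toy data introduced only for this example.

```python
# Hypothetical toy data: each transaction is a set of purchased items.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
    {"butter", "milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    matching = sum(1 for t in transactions if itemset <= t)
    return matching / len(transactions)


# Support of the rule {bread} -> {butter} is the support of {bread, butter}:
# 2 of the 5 toy transactions contain both items.
bread_butter_support = support({"bread", "butter"}, transactions)
print(bread_butter_support)
```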
b) Confidence
Confidence measures the strength of the association rule. It indicates the likelihood of Y
being purchased when X has already been purchased.
Confidence is defined as the ratio of the number of transactions that contain both X and Y to
the number of transactions that contain X.
Mathematically, confidence is calculated as:
$$\mathrm{Confidence}(X \rightarrow Y) = \frac{\mathrm{Support}(X \cup Y)}{\mathrm{Support}(X)}$$
Example: If 80 out of 100 transactions that include bread also include butter, the confidence
of the rule {Bread} → {Butter} is 80%.
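As a rough sketch of the ratio described above, confidence can be computed directly from transaction counts. The toy `transactions` data and the `confidence` function are assumptions made for illustration only.

```python
# Hypothetical toy data: each transaction is a set of purchased items.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
    {"butter", "milk"},
]


def confidence(antecedent, consequent, transactions):
    """Transactions containing both X and Y, divided by those containing X."""
    x = set(antecedent)
    xy = x | set(consequent)
    count_x = sum(1 for t in transactions if x <= t)
    count_xy = sum(1 for t in transactions if xy <= t)
    if count_x == 0:
        raise ValueError("antecedent never occurs; confidence is undefined")
    return count_xy / count_x


# 3 toy transactions contain bread, and 2 of those also contain butter.
bread_to_butter = confidence({"bread"}, {"butter"}, transactions)
print(bread_to_butter)
```

Dividing the counts gives the same value as dividing the two supports, because the total number of transactions cancels out.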
c) Lift
Lift measures how much more likely the consequent Y is to occur when the antecedent X has
occurred, compared to the likelihood of Y occurring independently.
Mathematically, lift is calculated as:
$$\mathrm{Lift}(X \rightarrow Y) = \frac{\mathrm{Confidence}(X \rightarrow Y)}{\mathrm{Support}(Y)}$$
A lift value:
o Greater than 1: Indicates a positive association, meaning that the occurrence of X
increases the likelihood of Y occurring.
o Equal to 1: Indicates independence between X and Y (no association).
o Less than 1: Indicates a negative association, meaning that the occurrence of X
decreases the likelihood of Y occurring.
Example: If the lift of {Bread} → {Butter} is 1.5, it means that customers who buy bread are
1.5 times more likely to buy butter than a customer chosen at random.
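The lift calculation can be sketched as follows. The data and the `lift` function are hypothetical, chosen only to show how confidence is compared with the baseline support of the consequent.

```python
# Hypothetical toy data: each transaction is a set of purchased items.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
    {"butter", "milk"},
]


def lift(antecedent, consequent, transactions):
    """Confidence(X -> Y) divided by Support(Y)."""
    x, y = set(antecedent), set(consequent)
    n = len(transactions)
    count_x = sum(1 for t in transactions if x <= t)
    count_y = sum(1 for t in transactions if y <= t)
    count_xy = sum(1 for t in transactions if (x | y) <= t)
    if count_x == 0 or count_y == 0:
        raise ValueError("lift is undefined when X or Y never occurs")
    confidence_xy = count_xy / count_x
    support_y = count_y / n
    return confidence_xy / support_y


# Confidence is 2/3 and Support({butter}) is 3/5, so lift is 10/9 (about 1.11):
# a weak positive association in this toy data.
bread_butter_lift = lift({"bread"}, {"butter"}, transactions)
print(bread_butter_lift)
```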
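a) Step 1: Find Frequent Itemsets
The first step is to find all itemsets whose support meets the minimum support threshold.
The most commonly used algorithms for frequent itemset mining include: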
o Apriori Algorithm: Uses a level-wise search and employs the downward closure
property (if an itemset is frequent, all its subsets must also be frequent).
o FP-Growth Algorithm: Builds an FP-tree to find frequent itemsets without
candidate generation.
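To make the level-wise search concrete, here is a simplified Apriori-style sketch. It is an illustration under stated assumptions rather than an optimized implementation: the `apriori` function, the toy `transactions`, and the 0.4 threshold are all hypothetical.

```python
from itertools import combinations

# Hypothetical toy data: each transaction is a set of purchased items.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
    {"butter", "milk"},
]


def apriori(transactions, min_support):
    """Return {frozenset: support} for every itemset meeting min_support."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1 candidates: every single item seen in the data.
    items = sorted({item for t in transactions for item in t})
    level = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while level:
        # Keep only the candidates that meet the support threshold.
        survivors = {}
        for candidate in level:
            s = support(candidate)
            if s >= min_support:
                survivors[candidate] = s
        frequent.update(survivors)

        # Join step: combine frequent k-itemsets into (k+1)-item candidates.
        k += 1
        candidates = {a | b for a in survivors for b in survivors
                      if len(a | b) == k}
        # Prune step (downward closure): drop any candidate that has an
        # infrequent subset of size k-1.
        level = [c for c in candidates
                 if all(frozenset(sub) in survivors
                        for sub in combinations(c, k - 1))]
    return frequent


frequent_itemsets = apriori(transactions, min_support=0.4)
for itemset, s in frequent_itemsets.items():
    print(sorted(itemset), s)
```

In this toy run the candidate {bread, butter, milk} is pruned without being counted, because its subset {bread, milk} is not frequent.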
b) Step 2: Generate Strong Association Rules
After identifying frequent itemsets, the next step is to generate association rules from
these itemsets. Rules are generated by dividing the frequent itemsets into antecedent
and consequent pairs.
The rules are evaluated based on confidence. Only the rules with confidence higher
than a user-specified minimum confidence threshold are considered strong association
rules.
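The rule-generation step can be sketched as below. This assumes the frequent itemsets and their supports are already available as a dictionary; the `generate_rules` function, the support values, and the 0.7 confidence threshold are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def generate_rules(frequent, min_confidence):
    """Split each frequent itemset into antecedent -> consequent pairs.

    `frequent` maps frozenset -> support and is assumed to contain every
    subset of each frequent itemset (which downward closure guarantees).
    """
    rules = []
    for itemset, support_xy in frequent.items():
        if len(itemset) < 2:
            continue  # a rule needs a non-empty antecedent and consequent
        for size in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), size):
                x = frozenset(antecedent)
                y = itemset - x
                conf = support_xy / frequent[x]
                if conf >= min_confidence:
                    rules.append((x, y, conf))
    return rules


# Hypothetical supports, as if produced by a frequent itemset miner.
frequent = {
    frozenset({"bread"}): 0.6,
    frozenset({"butter"}): 0.5,
    frozenset({"bread", "butter"}): 0.4,
}

strong_rules = generate_rules(frequent, min_confidence=0.7)
for x, y, conf in strong_rules:
    print(sorted(x), "->", sorted(y), round(conf, 2))
```

With these numbers, {butter} → {bread} passes the threshold (0.4 / 0.5 = 0.8) while {bread} → {butter} does not (0.4 / 0.6 ≈ 0.67), which shows that the same itemset can yield rules of different strength.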
Market Basket Analysis
Frequent itemset mining leads to the discovery of associations and correlations among items in
large transactional or relational data sets.
A typical example of frequent itemset mining is market basket analysis. This process analyzes
customer buying habits by finding associations between the different items that customers
place in their “shopping baskets” (see Figure below). The discovery of these associations can
help retailers develop marketing strategies by gaining insight into which items are frequently
purchased together by customers. For instance, if customers are buying milk, how likely are
they to also buy bread (and what kind of bread) on the same trip to the supermarket? This
information can lead to increased sales by helping retailers do selective marketing and plan
their shelf space.
A retailer might ask: “Which groups or sets of items are customers likely to purchase on a
given trip to the store?” To answer this question, market basket analysis may be performed on
the retail data of customer transactions at the store. The results can then be used to plan
marketing or advertising strategies, or to design a new catalog.