What is market basket analysis?
MARKET BASKET ANALYSIS IN PYTHON
Isaiah Hull
Economist
Selecting a bookstore layout
Exploring transaction data
TID      Transaction
1        biography, history
2        fiction
3        biography, poetry
4        fiction, history
5        biography
...      ...
75000    fiction, poetry
TID = unique ID associated with each transaction.
Transaction = set of unique items purchased together.
What is market basket analysis?
1. Identify products frequently purchased together.
   Biography and history
   Fiction and poetry
2. Construct recommendations based on these findings.
   Place biography and history sections together.
   Keep fiction and history apart.
The use cases of market basket analysis
1. Build a Netflix-style recommendation engine.
2. Improve product recommendations on an e-commerce store.
3. Cross-sell products in a retail setting.
4. Improve inventory management.
5. Upsell products.
Using market basket analysis
TID      Transaction
11       fiction, biography
12       fiction, biography
13       history, biography
...      ...
19       fiction, biography
20       fiction, biography
...      ...
Market basket analysis
   Construct association rules
   Identify items frequently purchased together
Association rules
   {antecedent} → {consequent}
   {fiction} → {biography}
Loading the data
import pandas as pd
# Load the transactions with pandas.
books = pd.read_csv("datasets/bookstore.csv")
# Print the first two rows.
print(books.head(2))
TID Transaction
0 biography, history
1 fiction
For a refresher, see the Pandas Cheat Sheet.
Building transactions
# Split transaction strings into lists.
transactions = books['Transaction'].apply(lambda t: t.split(','))
# Convert the Series of lists into a list of lists.
transactions = list(transactions)
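If the raw strings separate items with a comma followed by a space, as in "biography, history", a plain `split(',')` leaves a leading space on every item after the first. A minimal sketch of a more tolerant splitter follows; `split_items` is a hypothetical helper name, and the sample strings are made up for illustration rather than taken from the bookstore dataset.

```python
def split_items(transaction):
    """Split a comma-separated transaction string into clean item names."""
    # Strip whitespace around each item and drop empty fragments.
    return [item.strip() for item in transaction.split(',') if item.strip()]


# Made-up sample strings (assumption: not the real bookstore data).
raw = ["biography, history", "fiction", "biography,poetry"]
transactions = [split_items(t) for t in raw]
print(transactions)
```

With the DataFrame loaded earlier, the same helper could be passed to `books['Transaction'].apply(split_items)`.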
Counting the itemsets
# Print the first transaction.
print(transactions[0])
['biography', 'history']
# Count the transactions that are exactly ['biography', 'fiction'] (list.count matches whole lists).
transactions.count(['biography', 'fiction'])
218
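Note that `list.count` only matches transactions equal to the given list, so a basket such as ['history', 'biography', 'poetry'] would not be counted for ['biography', 'history']. A minimal sketch of counting every transaction that contains both items, in any order and alongside other items, could use sets. `count_containing` is a hypothetical helper, and the five transactions are made up for illustration.

```python
# Made-up transactions (assumption: not the real bookstore data).
transactions = [
    ['biography', 'history'],
    ['fiction'],
    ['history', 'biography', 'poetry'],
    ['fiction', 'history'],
    ['biography'],
]


def count_containing(transactions, items):
    """Count transactions that contain every item in `items`."""
    wanted = set(items)
    # issubset accepts any iterable, so each transaction list works directly.
    return sum(wanted.issubset(t) for t in transactions)


# Exact-match count: only the first transaction equals the list.
exact = transactions.count(['biography', 'history'])
# Containment count: the first and third transactions hold both items.
containing = count_containing(transactions, ['biography', 'history'])
print(exact, containing)
```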
Making a recommendation
# Count the transactions that are exactly ['fiction', 'poetry'] (list.count matches whole lists).
transactions.count(['fiction', 'poetry'])
5357
Let's practice!
Identifying association rules
Loading and preparing data
import pandas as pd
# Load the transactions with pandas.
books = pd.read_csv("datasets/bookstore.csv")
# Split transaction strings into lists.
transactions = books['Transaction'].apply(lambda t: t.split(','))
# Convert the Series of lists into a list of lists.
transactions = list(transactions)
Exploring the data
print(transactions[:5])
[['language', 'travel', 'humor', 'fiction'],
['humor', 'language'],
['humor', 'biography', 'cooking'],
['cooking', 'language'],
['travel']]
Association rules
Association rule
Contains antecedent and consequent
{health} → {cooking}
Multi-antecedent rule
{humor, travel} → {language}
Multi-consequent rule
{biography} → {history, language}
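One way to hold such rules in code is as a pair of item sets. The sketch below is only an illustration: the `Rule` class is a hypothetical structure, not something provided by pandas or mlxtend, and it uses the three example rules listed above.

```python
from typing import FrozenSet, NamedTuple


class Rule(NamedTuple):
    """Hypothetical container for an association rule."""
    antecedent: FrozenSet[str]
    consequent: FrozenSet[str]

    def __str__(self):
        # Sort item names so the printed form is stable.
        left = ", ".join(sorted(self.antecedent))
        right = ", ".join(sorted(self.consequent))
        return "{" + left + "} -> {" + right + "}"


rules = [
    Rule(frozenset({'health'}), frozenset({'cooking'})),
    Rule(frozenset({'humor', 'travel'}), frozenset({'language'})),
    Rule(frozenset({'biography'}), frozenset({'history', 'language'})),
]

for rule in rules:
    print(rule)
```

Using frozensets keeps the items unordered and makes each rule hashable, so rules can serve as dictionary keys when metrics are attached to them.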
Difficulty of selecting rules
Finding useful rules is difficult.
Set of all possible rules is large.
Most rules are not useful.
Must discard most rules.
What if we restrict ourselves to simple rules?
One antecedent and one consequent.
Still challenging, even for a small dataset.
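To give a sense of scale, the sketch below counts candidate rules for a catalogue of n distinct items. It assumes a rule needs a non-empty antecedent and a non-empty consequent that share no items; the function names are hypothetical, and n = 9 is chosen only as an example size.

```python
from itertools import product


def simple_rule_count(n):
    """Rules with exactly one antecedent item and one consequent item."""
    return n * (n - 1)


def all_rule_count(n):
    """Rules with disjoint, non-empty antecedent and consequent.

    Each item is placed in the antecedent, the consequent, or neither
    (3**n ways); remove placements with an empty antecedent or an empty
    consequent (2**n each) and add back the one counted twice.
    """
    return 3**n - 2**(n + 1) + 1


def brute_force_rule_count(n):
    """Enumerate every placement to check the formula for small n."""
    count = 0
    for placement in product(('antecedent', 'consequent', 'unused'), repeat=n):
        if 'antecedent' in placement and 'consequent' in placement:
            count += 1
    return count


print(simple_rule_count(9))   # one-to-one rules for nine items
print(all_rule_count(9))      # all rules for nine items
```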
Generating the rules
Items: fiction, poetry, history, biography, cooking, health, travel, language, humor
Generating the rules
Fiction Rules         Poetry Rules         ...   Humor Rules
fiction->poetry       poetry->fiction      ...   humor->fiction
fiction->history      poetry->history      ...   humor->history
fiction->biography    poetry->biography    ...   humor->biography
fiction->cooking      poetry->cooking      ...   humor->cooking
...                   ...                  ...   ...
fiction->humor        poetry->humor        ...
Generating rules with itertools
from itertools import permutations
# Extract unique items.
flattened = [item for transaction in transactions for item in transaction]
items = list(set(flattened))
# Compute and print rules.
rules = list(permutations(items, 2))
print(rules)
[('fiction', 'poetry'),
('fiction', 'history'),
...
('humor', 'travel'),
('humor', 'language')]
Counting the rules
# Print the number of rules
print(len(rules))
72
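Because a `set` does not guarantee an iteration order, the printed rules can appear in a different order from one run to the next. A small sketch, assuming reproducible output is wanted, sorts the unique items before generating the permutations. It uses the five transactions printed earlier in this chapter, which contain six distinct items.

```python
from itertools import permutations

# The first five transactions shown under "Exploring the data".
transactions = [
    ['language', 'travel', 'humor', 'fiction'],
    ['humor', 'language'],
    ['humor', 'biography', 'cooking'],
    ['cooking', 'language'],
    ['travel'],
]

# Sort the unique items so the rule order is the same on every run.
items = sorted({item for transaction in transactions for item in transaction})

# Every ordered pair of distinct items is one candidate rule.
rules = list(permutations(items, 2))

print(items)
print(len(rules))
```

For n unique items the count is n * (n - 1), so six items yield 30 candidate rules and nine items yield 72.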
Looking ahead
# Import the Apriori and association rules functions
from mlxtend.frequent_patterns import association_rules
from mlxtend.frequent_patterns import apriori
# Compute frequent itemsets using the Apriori algorithm
# (onehot is the one-hot encoded DataFrame of transactions)
frequent_itemsets = apriori(onehot, min_support = 0.001,
                            max_len = 2, use_colnames = True)
# Compute all association rules for frequent_itemsets
rules = association_rules(frequent_itemsets,
                          metric = "lift",
                          min_threshold = 1.0)
Let's practice!
The simplest metric
Metrics and pruning
A metric is a measure of performance for rules.
{humor} → {poetry}: 0.81
{fiction} → {travel}: 0.23
Pruning is the use of metrics to discard rules.
Retain: {humor} → {poetry}
Discard: {fiction} → {travel}
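The retain/discard decision above amounts to a threshold filter. A minimal sketch follows, assuming a hypothetical cutoff of 0.5; the source gives the two metric values but does not state a threshold or name the metric.

```python
# Metric values from the two example rules, keyed by (antecedent, consequent).
rule_metrics = {
    ('humor', 'poetry'): 0.81,
    ('fiction', 'travel'): 0.23,
}

# Hypothetical cutoff (assumption: not given in the source).
threshold = 0.5

# Pruning: keep rules at or above the cutoff, discard the rest.
retained = {rule: value for rule, value in rule_metrics.items() if value >= threshold}
discarded = {rule: value for rule, value in rule_metrics.items() if value < threshold}

print("Retain:", list(retained))
print("Discard:", list(discarded))
```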
The simplest metric
The support metric measures the share of transactions that contain an itemset.
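$$\text{support}(X) = \frac{\text{number of transactions with item(s) } X}{\text{number of transactions}}$$
For example:
$$\text{support}(\{\text{milk}\}) = \frac{\text{number of transactions with milk}}{\text{total transactions}}$$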
Support for language
TID   Transaction
0     travel, humor, fiction
1     humor, language
2     humor, biography, cooking
3     cooking, language
4     travel
5     poetry, health, travel, history
6     humor
7     travel
8     poetry, fiction, humor
9     fiction, biography
Support for {language} = 2 / 10 = 0.2
Support for {Humor} → {Language}
TID   Transaction
0     travel, humor, fiction
1     humor, language
2     humor, biography, cooking
3     cooking, language
4     travel
5     poetry, health, travel, history
6     humor
7     travel
8     poetry, fiction, humor
9     fiction, biography
Support for {humor} → {language} = 1 / 10 = 0.1 (only TID 1 contains both items)
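Both support values can be checked directly from the ten transactions. The sketch below is a plain-Python illustration; `support` is a hypothetical helper name, and the support of a rule is taken here to be the support of the combined itemset, which is why the direction of the arrow does not change the result.

```python
# The ten transactions from the table (TID 0 to 9).
transactions = [
    ['travel', 'humor', 'fiction'],
    ['humor', 'language'],
    ['humor', 'biography', 'cooking'],
    ['cooking', 'language'],
    ['travel'],
    ['poetry', 'health', 'travel', 'history'],
    ['humor'],
    ['travel'],
    ['poetry', 'fiction', 'humor'],
    ['fiction', 'biography'],
]


def support(itemset, transactions):
    """Share of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    matches = sum(itemset.issubset(t) for t in transactions)
    return matches / len(transactions)


language_support = support({'language'}, transactions)
rule_support = support({'humor', 'language'}, transactions)
print(language_support, rule_support)
```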
Preparing the data
print(transactions)
[['travel', 'humor', 'fiction'],
...
['fiction', 'biography']]
from mlxtend.preprocessing import TransactionEncoder
# Instantiate the transaction encoder and fit it to the transactions
encoder = TransactionEncoder().fit(transactions)
Preparing the data
# One-hot encode the transactions with the fitted encoder
onehot = encoder.transform(transactions)
# Convert one-hot encoded data to DataFrame
onehot = pd.DataFrame(onehot, columns = encoder.columns_)
print(onehot)
   biography  cooking  ...  poetry  travel
0      False    False  ...   False    True
...
9       True    False  ...   False   False
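To make the encoded table less of a black box, the same kind of boolean matrix can be built by hand. This is a dependency-free sketch of the idea, not mlxtend's implementation; the columns are sorted alphabetically here, which matches the column order in the printed output above.

```python
# The ten example transactions (TID 0 to 9).
transactions = [
    ['travel', 'humor', 'fiction'],
    ['humor', 'language'],
    ['humor', 'biography', 'cooking'],
    ['cooking', 'language'],
    ['travel'],
    ['poetry', 'health', 'travel', 'history'],
    ['humor'],
    ['travel'],
    ['poetry', 'fiction', 'humor'],
    ['fiction', 'biography'],
]

# One column per unique item, in alphabetical order.
columns = sorted({item for t in transactions for item in t})

# One row per transaction: True where the transaction contains the item.
rows = [[item in t for item in columns] for t in transactions]

print(columns)
print(rows[0])
print(rows[9])
```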
Computing support for single items
print(onehot.mean())
biography 0.2
cooking 0.2
fiction 0.3
health 0.1
history 0.1
humor 0.5
language 0.2
poetry 0.2
travel 0.4
dtype: float64
Computing support for multiple items
import numpy as np
# Define itemset that contains fiction and poetry
onehot['fiction+poetry'] = np.logical_and(onehot['fiction'], onehot['poetry'])
print(onehot.mean())
biography 0.2
cooking 0.2
... ...
travel 0.4
fiction+poetry 0.1
dtype: float64
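Adding one column per pair becomes tedious as the number of items grows. As an illustration only, the sketch below computes support for every pair that occurs at least once, using the standard library instead of pandas; the variable names are hypothetical.

```python
from collections import Counter
from itertools import combinations

# The ten example transactions (TID 0 to 9).
transactions = [
    ['travel', 'humor', 'fiction'],
    ['humor', 'language'],
    ['humor', 'biography', 'cooking'],
    ['cooking', 'language'],
    ['travel'],
    ['poetry', 'health', 'travel', 'history'],
    ['humor'],
    ['travel'],
    ['poetry', 'fiction', 'humor'],
    ['fiction', 'biography'],
]

# Count each unordered pair once per transaction that contains it.
pair_counts = Counter()
for t in transactions:
    # Sorting gives every pair a single canonical (alphabetical) key.
    for pair in combinations(sorted(set(t)), 2):
        pair_counts[pair] += 1

# Support is the share of transactions containing the pair.
pair_support = {pair: count / len(transactions) for pair, count in pair_counts.items()}

print(pair_support[('fiction', 'poetry')])
```

Pairs that never occur together are absent from `pair_support`, so a lookup with `pair_support.get(pair, 0.0)` is the safer choice for arbitrary pairs.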
Let's practice!