Data Mining

This case study applies data mining techniques to the Online Retail Dataset to identify customer segments, product associations, and anomalies in transactions. Key methods include clustering for customer segmentation, association rule mining for product affinity, and anomaly detection to flag unusual purchases. The insights gained can inform marketing strategies, customer retention efforts, and inventory management.

Data Mining Case Study: Online Retail Dataset

1. Introduction & Objective

This case study applies data mining techniques to the Online Retail dataset from
the UCI Machine Learning Repository. The dataset contains transactions from a
UK-based online retailer between 1 December 2010 and 9 December 2011.

Objective:
To identify purchasing patterns, customer segments, and product associations that
can inform business strategies such as customer retention, marketing campaigns, and
inventory management.

2. Dataset Overview

• Source: UCI Machine Learning Repository
• Type: Tabular
• Size: 541,909 rows, 8 columns
• Fields:
o InvoiceNo
o StockCode
o Description
o Quantity
o InvoiceDate
o UnitPrice
o CustomerID
o Country

3. Case Study Components

a. Problem Statement

The main business problems are:

• Customer segmentation: Who are the valuable customers?
• Product affinity: What items are often bought together?
• Anomaly detection: Are there suspicious or outlier transactions?

b. Data Preprocessing

• Removed rows with a null CustomerID
• Kept only rows with positive Quantity and UnitPrice (this also drops
cancellations, which appear with negative quantities)
• Created a TotalPrice field = Quantity × UnitPrice
• Converted InvoiceDate to datetime and extracted time features
• Removed duplicate rows and cleaned inconsistent text data

c. Data Mining Techniques

1. Clustering (Customer Segmentation using RFM and K-Means):
o RFM = Recency (days since last purchase), Frequency (number of purchases),
Monetary value (total spend)
o Scaled data using MinMaxScaler
o Optimal K found using Elbow and Silhouette methods
2. Association Rule Mining (Apriori Algorithm):
o Converted transactions into basket format
o Generated frequent itemsets
o Derived rules with lift > 1 for actionable insights
3. Anomaly Detection (Isolation Forest):
o Trained on TotalPrice, Quantity, and UnitPrice
o Detected unusually large or irregular purchases

d. Evaluation

• Clustering:
o Silhouette Score used for evaluation (optimal K = 4)
• Association Rules:
o Metrics: Support, Confidence, and Lift
• Anomaly Detection:
o Manual inspection of flagged records
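
The three rule metrics can be illustrated on a handful of baskets. The sketch below is a minimal pure-Python version, assuming toy transactions and hypothetical helper names (support, rule_metrics) that are not part of the notebook; the notebook itself relies on mlxtend for these numbers.

```python
# Toy baskets; the item names are illustrative, not taken from the dataset.
transactions = [
    {"balloons", "gift tags", "candles"},
    {"balloons", "gift tags"},
    {"balloons", "napkins"},
    {"gift tags", "napkins"},
    {"candles"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def rule_metrics(antecedent, consequent, baskets):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    s_both = support(set(antecedent) | set(consequent), baskets)
    s_ante = support(antecedent, baskets)
    s_cons = support(consequent, baskets)
    confidence = s_both / s_ante   # P(consequent | antecedent)
    lift = confidence / s_cons     # > 1 means bought together more than by chance
    return s_both, confidence, lift


s, c, l = rule_metrics({"balloons"}, {"gift tags"}, transactions)
print(f"support={s:.2f} confidence={c:.2f} lift={l:.2f}")
# prints: support=0.40 confidence=0.67 lift=1.11
```

A lift only slightly above 1, as here, indicates a weak association; the lift > 1 filter used in this study is a minimum bar rather than evidence of a strong rule.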

e. Insights & Interpretation

• Customer Clusters:
o High-value customers buy frequently and spend more
o Some clusters are seasonal or country-specific
• Association Rules:
o Gift sets and party supplies are often bought together
o Useful for bundle promotions
• Anomalies:
o Outliers correspond to data entry errors or bulk purchases

4. Deliverables

A. Jupyter Notebook

Includes:

• Data preprocessing code
• Visualizations (e.g., customer segments via PCA, heatmaps of associations)
• Model building and evaluation steps

B. Short Report (this document)

Covers:

• Introduction & Objective
• Dataset Overview
• Techniques Used
• Results & Visuals
• Insights & Conclusion

The complete content of the Jupyter Notebook follows:

1. Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.ensemble import IsolationForest
from mlxtend.frequent_patterns import apriori, association_rules

import warnings
warnings.filterwarnings("ignore")

2. Load dataset
df = pd.read_excel('Online Retail.xlsx')
df.head()

3. Data Preprocessing

# Remove rows with missing customer IDs
df.dropna(subset=['CustomerID'], inplace=True)

# Keep only positive quantities and prices; .copy() avoids
# SettingWithCopyWarning when new columns are added below
df = df[(df['Quantity'] > 0) & (df['UnitPrice'] > 0)].copy()

# Create total price per line item
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']

# Convert InvoiceDate to datetime
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])

# Drop exact duplicate rows
df.drop_duplicates(inplace=True)

# Preview column types and non-null counts
df.info()
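
The preprocessing list mentions time features and inconsistent text, but the cell above does not show those steps. The sketch below is one possible version on a two-row stand-in frame; the feature names (Year, Month, DayOfWeek, Hour) and the cleanup rules are assumptions, not the author's exact choices.

```python
import pandas as pd

# Tiny stand-in for the cleaned frame; the values are made up for illustration.
sample = pd.DataFrame({
    "InvoiceDate": pd.to_datetime(["2011-03-14 09:30", "2011-12-02 16:05"]),
    "Description": ["  white metal lantern ", "PARTY  BUNTING"],
})

# Calendar features derived from the invoice timestamp.
sample["Year"] = sample["InvoiceDate"].dt.year
sample["Month"] = sample["InvoiceDate"].dt.month
sample["DayOfWeek"] = sample["InvoiceDate"].dt.dayofweek  # Monday = 0
sample["Hour"] = sample["InvoiceDate"].dt.hour

# Normalise product names: trim, collapse repeated spaces, upper-case.
sample["Description"] = (
    sample["Description"]
    .str.strip()
    .str.replace(r"\s+", " ", regex=True)
    .str.upper()
)

print(sample)
```

Normalising Description before building baskets matters because two spellings of the same product would otherwise be treated as different items.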

4. RFM Analysis for Customer Segmentation

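import datetime as dt

# Reference date: the day after the last invoice in the data
snapshot_date = df['InvoiceDate'].max() + dt.timedelta(days=1)

rfm = df.groupby('CustomerID').agg({
    'InvoiceDate': lambda x: (snapshot_date - x.max()).days,  # days since last purchase
    'InvoiceNo': 'count',   # number of line items; use 'nunique' to count distinct invoices
    'TotalPrice': 'sum'     # total spend
}).rename(columns={'InvoiceDate': 'Recency',
                   'InvoiceNo': 'Frequency',
                   'TotalPrice': 'Monetary'})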

# Normalize
scaler = MinMaxScaler()
rfm_scaled = scaler.fit_transform(rfm)

# KMeans: silhouette score for K = 2..9
sil = []
for k in range(2, 10):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(rfm_scaled)
    sil.append(silhouette_score(rfm_scaled, kmeans.labels_))

# Plot silhouette score
plt.plot(range(2, 10), sil, marker='o')
plt.title("Silhouette Score for K")
plt.xlabel("K")
plt.ylabel("Score")
plt.show()

# Fit best K
k_opt = sil.index(max(sil)) + 2
kmeans = KMeans(n_clusters=k_opt, random_state=42)
rfm['Cluster'] = kmeans.fit_predict(rfm_scaled)
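
The techniques list also names the Elbow method, which the cell above does not show. A minimal sketch follows, assuming synthetic data from make_blobs as a stand-in for rfm_scaled; in the notebook, rfm_scaled would take the place of X_scaled.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the scaled RFM matrix (3 features, 4 groups).
X, _ = make_blobs(n_samples=400, centers=4, n_features=3, random_state=42)
X_scaled = MinMaxScaler().fit_transform(X)

# Inertia = within-cluster sum of squared distances; it shrinks as K grows.
k_values = list(range(2, 10))
inertias = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
    inertias.append(model.inertia_)

# Relative drop in inertia when moving to each K; the "elbow" is where
# the drops become small compared with the earlier ones.
drops = -np.diff(inertias) / np.array(inertias[:-1])
for k, drop in zip(k_values[1:], drops):
    print(f"K={k}: inertia falls by {drop:.1%}")
```

Plotting inertias against k_values gives the usual elbow curve. The elbow is a visual heuristic, so it is reasonable to cross-check it against the silhouette result rather than rely on it alone.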

5. Visualize Customer Segments

sns.scatterplot(data=rfm, x='Recency', y='Monetary', hue='Cluster',
                palette='viridis')
plt.title("Customer Segmentation (Recency vs Monetary)")
plt.show()

6. Association Rule Mining

# One row per invoice, one column per product (UK orders only)
basket = (df[df['Country'] == "United Kingdom"]
          .groupby(['InvoiceNo', 'Description'])['Quantity']
          .sum().unstack().reset_index().fillna(0).set_index('InvoiceNo'))

# Convert quantities to True/False flags; apriori expects boolean input
basket = basket > 0

frequent_itemsets = apriori(basket, min_support=0.01, use_colnames=True)

rules = association_rules(frequent_itemsets, metric='lift', min_threshold=1)

rules.sort_values('lift', ascending=False).head()

7. Anomaly Detection

features = df[['Quantity', 'UnitPrice', 'TotalPrice']]

# contamination=0.01 flags roughly 1% of rows as anomalies
iso = IsolationForest(contamination=0.01, random_state=42)
df['Anomaly'] = iso.fit_predict(features)  # -1 = anomaly, 1 = normal

# Show anomalies
df[df['Anomaly'] == -1].head()
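
Manual inspection is easier when flagged rows are ranked rather than viewed in file order. The sketch below shows one way to do that with score_samples, assuming synthetic line items (500 ordinary rows plus 5 bulk orders) in place of the real data; the toy frame and its values are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Synthetic line items: 500 ordinary rows plus 5 very large bulk orders.
qty = np.concatenate([rng.integers(1, 13, size=500), np.full(5, 5000)])
price = np.concatenate([rng.uniform(0.5, 10.0, size=500), np.full(5, 2.0)])
toy = pd.DataFrame({"Quantity": qty, "UnitPrice": price})
toy["TotalPrice"] = toy["Quantity"] * toy["UnitPrice"]

cols = ["Quantity", "UnitPrice", "TotalPrice"]
iso_toy = IsolationForest(contamination=0.01, random_state=42).fit(toy[cols])
toy["Anomaly"] = iso_toy.predict(toy[cols])      # -1 = flagged, 1 = normal
toy["Score"] = iso_toy.score_samples(toy[cols])  # lower = more anomalous

# Most anomalous rows first, for manual review.
print(toy.sort_values("Score").head(10))

# Compare typical values of flagged and unflagged rows.
print(toy.groupby("Anomaly")[cols].median())
```

Sorting by score puts the most extreme rows at the top of the review list, and the per-group medians give a quick sense of what the model treats as unusual.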

8. Results & Key Visuals

# Cluster count
sns.countplot(x='Cluster', data=rfm)
plt.title("Customer Distribution per Cluster")
plt.show()

# Top association rules, ranked by lift
cols = ['antecedents', 'consequents', 'support', 'confidence', 'lift']
rules.sort_values('lift', ascending=False)[cols].head()

9. Insights & Conclusion


- High-value customers were identified using K-Means on RFM scores. This
can help in targeting loyalty programs.
- Association rules showed product combinations with strong buying
patterns, helpful for promotions and bundling.
- Anomalies were mostly extreme transactions, some of which could be
errors or bulk purchases.
- This analysis offers practical insights for marketing and operations based on
real transactional data.

Data Mining Case Study Report: Online Retail Dataset

1. Introduction & Objective

In the age of digital commerce, understanding customer behavior through data is
crucial for businesses to stay competitive. The aim of this case study is to apply data
mining techniques to a real-world online retail dataset and extract meaningful
insights that can help in customer segmentation, marketing strategies, and inventory
management.

Objective:
Analyze transaction data using clustering, association rule mining, and anomaly
detection techniques to identify:

• Valuable customer groups
• Frequently co-purchased items
• Unusual or suspicious transactions

2. Dataset Overview

• Source: UCI Machine Learning Repository
• Dataset: Online Retail
• Type: Tabular
• Size: 541,909 rows, 8 columns
• Fields:
o InvoiceNo: Unique invoice number
o StockCode: Product code
o Description: Product name
o Quantity: Number of units purchased
o InvoiceDate: Date and time of purchase
o UnitPrice: Price per unit in GBP
o CustomerID: Identifier for each customer
o Country: Country of transaction

Cleaning & Preprocessing Steps:

• Removed missing CustomerID entries
• Filtered out negative or zero values for Quantity and UnitPrice
• Added a TotalPrice field: Quantity × UnitPrice
• Extracted time-based features and handled duplicates

3. Techniques Used

A. Customer Segmentation using Clustering (RFM Analysis + K-Means)

We used Recency, Frequency, and Monetary (RFM) values to assess customer
behavior. After scaling the features with MinMaxScaler, we ran K-Means for a range
of K values and kept the one with the highest Silhouette Score (K=4).

B. Association Rule Mining (Apriori Algorithm)

By converting transaction data into a basket format, the Apriori algorithm was used
to discover frequently bought itemsets. Association rules with high lift values were
identified to inform bundling and promotion strategies.
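
The "basket format" is a table with one row per invoice and one True/False column per product. A minimal sketch on made-up line items follows; the invoice numbers and product names are hypothetical, and pivot_table is used here as one of several equivalent ways to build the table.

```python
import pandas as pd

# Made-up line items: one row per product on an invoice.
lines = pd.DataFrame({
    "InvoiceNo":   ["A1", "A1", "A2", "A2", "A2", "A3"],
    "Description": ["BALLOONS", "GIFT TAGS", "BALLOONS",
                    "NAPKINS", "GIFT TAGS", "NAPKINS"],
    "Quantity":    [6, 12, 3, 24, 6, 12],
})

# Sum quantities per invoice and product, then keep only presence/absence.
basket_toy = (
    lines.pivot_table(index="InvoiceNo", columns="Description",
                      values="Quantity", aggfunc="sum", fill_value=0) > 0
)

# Support of each single item = share of invoices that contain it.
item_support = basket_toy.mean()

print(basket_toy)
print(item_support)
```

Apriori works on this presence/absence table: it keeps items whose support clears the minimum threshold and then grows larger itemsets only from those survivors.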

C. Anomaly Detection (Isolation Forest)

To detect unusual purchases, the Isolation Forest method was applied to Quantity,
UnitPrice, and TotalPrice features. Outliers were flagged for further review.

4. Results & Key Visuals


Customer Segments:

• Cluster 0: Low frequency, low spending customers
• Cluster 1: High-value customers (frequent, recent, and high spenders)
• Cluster 2: Moderate activity and recent buyers
• Cluster 3: Infrequent, older buyers
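
Descriptions like these are usually read off a per-cluster summary of the RFM values. The sketch below shows the idea on a hypothetical six-customer frame with labels already assigned; the numbers are invented and do not reproduce the study's clusters.

```python
import pandas as pd

# Hypothetical RFM rows with cluster labels already assigned.
rfm_toy = pd.DataFrame({
    "Recency":   [300, 280, 5, 10, 40, 35],
    "Frequency": [1, 2, 40, 55, 8, 10],
    "Monetary":  [50.0, 80.0, 9000.0, 12000.0, 700.0, 900.0],
    "Cluster":   [0, 0, 1, 1, 2, 2],
})

# Mean R, F, M per cluster, plus the number of customers in each.
profile = (
    rfm_toy.groupby("Cluster")
    .agg(Recency=("Recency", "mean"),
         Frequency=("Frequency", "mean"),
         Monetary=("Monetary", "mean"),
         Customers=("Recency", "size"))
    .round(1)
)

print(profile)
```

In this toy table, cluster 1 has low Recency with high Frequency and Monetary values, which is the pattern the report describes as high-value customers.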

Visualization:

• Scatterplot: Recency vs Monetary, color-coded by cluster
• Countplot: Number of customers per cluster

Association Rules:

• High confidence and lift observed in:
o Gift sets often bought with party items
o Specific holiday decorations commonly purchased together

Top Rule Example:

• {Party Balloons} → {Gift Tags}
o Support: 0.012
o Confidence: 0.65
o Lift: 3.1
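
Taking the three figures above at face value, the standard definitions (confidence = support(A and B) / support(A), lift = confidence / support(B)) imply how common each item is on its own. The short calculation below is a sketch of that arithmetic; the variable names are illustrative.

```python
# Figures quoted for the rule {Party Balloons} -> {Gift Tags}.
support_rule = 0.012   # share of baskets containing both items
confidence = 0.65      # share of balloon baskets that also contain gift tags
lift = 3.1             # confidence divided by the gift-tag base rate

# Rearranging the definitions gives the individual item supports.
support_antecedent = support_rule / confidence   # share of baskets with balloons
support_consequent = confidence / lift           # share of baskets with gift tags

print(f"Party Balloons support: {support_antecedent:.4f}")  # 0.0185
print(f"Gift Tags support:      {support_consequent:.4f}")  # 0.2097
```

Read this way, about 1.8% of baskets contain Party Balloons and about 21% contain Gift Tags, so a balloon buyer is roughly three times as likely as an average shopper to add Gift Tags.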

Anomaly Detection:

• Anomalies mostly include:
o Large bulk orders
o Possibly fraudulent or erroneous transactions

5. Insights & Conclusion

• Customer segmentation helps tailor marketing and prioritize valuable clients.
• Product affinity rules are ideal for cross-selling, discount offers, or bundle
promotions.
• Flagged anomalies can support fraud detection and data quality control.

Using data mining techniques on retail transaction data delivers valuable insights
into customer behavior, product relationships, and operational irregularities. These
insights can directly enhance marketing efforts, improve sales strategies, and
strengthen data quality control.
