Data Mining

This case study applies data mining techniques to the Online Retail Dataset to identify customer segments, product associations, and anomalies in transactions. Key methods include clustering for customer segmentation, association rule mining for product affinity, and anomaly detection to flag unusual purchases. The insights gained can inform marketing strategies, customer retention efforts, and inventory management.

Uploaded by

muhammad abdullah


Data Mining Case Study: Online Retail Dataset

1. Introduction & Objective

The goal of this case study is to apply data mining techniques to the Online Retail
Dataset available on the UCI Machine Learning Repository. This dataset contains
transactional data from a UK-based online retail store between 1 December 2010 and
9 December 2011.

Objective:
To identify purchasing patterns, customer segments, and product associations that
can inform business strategies such as customer retention, marketing campaigns, and
inventory management.

2. Dataset Overview

• Source: UCI Machine Learning Repository

• Type: Tabular
• Size: 541,909 rows, 8 columns (before cleaning)
• Fields:
o InvoiceNo
o StockCode
o Description
o Quantity
o InvoiceDate
o UnitPrice
o CustomerID
o Country

3. Case Study Components

a. Problem Statement

The main business problems are:

• Customer segmentation: Who are the valuable customers?

• Product affinity: What items are often bought together?
• Anomaly detection: Are there suspicious or outlier transactions?

b. Data Preprocessing

• Removed null CustomerID rows

• Kept only rows with positive Quantity and UnitPrice
• Created a TotalPrice field = Quantity × UnitPrice
• Converted InvoiceDate to datetime and extracted time features
• Handled duplicates and inconsistent text data
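
The time-feature step listed above is not shown in the notebook code, so the following is only a minimal sketch of what it could look like. The sample rows and the new column names (Year, Month, DayOfWeek, Hour) are assumptions made for illustration; only InvoiceDate, Quantity, UnitPrice, and TotalPrice come from this report.

```python
import pandas as pd

# Tiny hypothetical sample using the dataset's column names.
sample = pd.DataFrame({
    "InvoiceNo": ["536365", "536366", "536367"],
    "InvoiceDate": ["2010-12-01 08:26:00", "2010-12-04 13:05:00",
                    "2011-01-15 17:45:00"],
    "Quantity": [6, 2, 12],
    "UnitPrice": [2.55, 1.85, 0.65],
})

sample["InvoiceDate"] = pd.to_datetime(sample["InvoiceDate"])
sample["TotalPrice"] = sample["Quantity"] * sample["UnitPrice"]

# Calendar features derived from the timestamp (names are assumptions).
sample["Year"] = sample["InvoiceDate"].dt.year
sample["Month"] = sample["InvoiceDate"].dt.month
sample["DayOfWeek"] = sample["InvoiceDate"].dt.dayofweek  # Monday = 0
sample["Hour"] = sample["InvoiceDate"].dt.hour

print(sample[["InvoiceDate", "Month", "DayOfWeek", "Hour", "TotalPrice"]])
```

Features like these could support the seasonal comparisons mentioned later in the report, for example by grouping TotalPrice by Month.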

c. Data Mining Techniques

1. Clustering (Customer Segmentation using RFM and K-Means):

o RFM = Recency, Frequency, Monetary value
o Scaled data using MinMaxScaler
o Optimal K found using Elbow and Silhouette methods
2. Association Rule Mining (Apriori Algorithm):
o Converted transactions into basket format
o Generated frequent itemsets
o Derived rules with lift > 1 for actionable insights
3. Anomaly Detection (Isolation Forest):
o Trained on TotalPrice, Quantity, and UnitPrice
o Detected unusually large or irregular purchases
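
To make the "basket format" concrete, here is a small sketch on hypothetical invoice lines. The invoice numbers and product names are invented; the groupby/unstack pattern mirrors the one used in the notebook, with a boolean encoding as one possible choice.

```python
import pandas as pd

# Hypothetical invoice lines; one row per product on an invoice.
lines = pd.DataFrame({
    "InvoiceNo":   ["A1", "A1", "A2", "A2", "A3"],
    "Description": ["BALLOONS", "GIFT TAGS", "BALLOONS", "NAPKINS", "GIFT TAGS"],
    "Quantity":    [10, 5, 3, 20, 1],
})

# One row per invoice, one column per product, True if the product was bought.
basket = (lines.groupby(["InvoiceNo", "Description"])["Quantity"]
          .sum()
          .unstack(fill_value=0) > 0)

print(basket)
```

Each row is now a transaction and each column an item, which is the shape that frequent-itemset algorithms such as Apriori expect.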

d. Evaluation

• Clustering:
o Silhouette Score used for evaluation (optimal K = 4)
• Association Rules:
o Metrics: Support, Confidence, and Lift
• Anomaly Detection:
o Manual inspection of flagged records
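
As a reminder of what the three association-rule metrics measure, the sketch below computes them by hand on five hypothetical baskets. The item names and the helper functions `support` and `rule_metrics` are illustrative and are not part of the notebook or of mlxtend.

```python
# Five hypothetical baskets; item names are illustrative only.
transactions = [
    {"balloons", "gift tags", "candles"},
    {"balloons", "gift tags"},
    {"balloons", "napkins"},
    {"gift tags", "napkins"},
    {"candles"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def rule_metrics(antecedent, consequent, baskets):
    """Support, confidence and lift for the rule antecedent -> consequent."""
    supp_both = support(set(antecedent) | set(consequent), baskets)
    confidence = supp_both / support(antecedent, baskets)
    lift = confidence / support(consequent, baskets)
    return supp_both, confidence, lift

supp, conf, lift = rule_metrics({"balloons"}, {"gift tags"}, transactions)
print(f"support={supp:.2f} confidence={conf:.2f} lift={lift:.2f}")
```

A lift above 1 means the two items appear together more often than they would if purchases were independent, which is why rules are filtered on lift.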

e. Insights & Interpretation

• Customer Clusters:
o High-value customers buy frequently and spend more
o Some clusters are seasonal or country-specific
• Association Rules:
o Gift sets and party supplies are often bought together
o Useful for bundle promotions
• Anomalies:
o Outliers correspond to data entry errors or bulk purchases

4. Deliverables

A. Jupyter Notebook

Includes:

• Data preprocessing code

• Visualizations (e.g., customer segments via PCA, heatmaps of associations)
• Model building and evaluation steps
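
The PCA view of customer segments is mentioned above but not included in the notebook listing, so this is a hedged sketch of one way to produce the 2-D coordinates. The six RFM rows are hypothetical; plotting is left out so that the block stays self-contained.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

# Hypothetical (Recency, Frequency, Monetary) values for six customers.
rfm_values = np.array([
    [5, 40, 5200.0],
    [12, 25, 3100.0],
    [90, 3, 150.0],
    [200, 1, 40.0],
    [30, 10, 900.0],
    [300, 2, 75.0],
])

# Same scaling step as in the notebook, then a 2-component projection.
scaled = MinMaxScaler().fit_transform(rfm_values)
pca = PCA(n_components=2)
coords = pca.fit_transform(scaled)

print(coords.shape)                   # one (x, y) pair per customer
print(pca.explained_variance_ratio_)  # share of variance kept per component
```

The two columns of `coords` can then be passed to a scatterplot and colored by cluster label.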

B. Short Report (this document)

Covers:

• Introduction & Objective

• Dataset Overview
• Techniques Used
• Results & Visuals
• Insights & Conclusion

The complete content of the Jupyter Notebook follows:

1. Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import MinMaxScaler

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.ensemble import IsolationForest
from mlxtend.frequent_patterns import apriori, association_rules

import warnings
warnings.filterwarnings("ignore")

2. Load dataset
df = pd.read_excel('Online Retail.xlsx')
df.head()

3. Data Preprocessing

# Remove rows with missing Customer IDs
df = df.dropna(subset=['CustomerID'])

# Keep only rows with positive quantity and price (this also drops
# cancellations); .copy() avoids a SettingWithCopyWarning below
df = df[(df['Quantity'] > 0) & (df['UnitPrice'] > 0)].copy()

# Create total price per line item
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']

# Convert InvoiceDate to datetime
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])

# Drop exact duplicate rows
df = df.drop_duplicates()

# Preview
df.info()

4. RFM Analysis for Customer Segmentation

import datetime as dt

# Reference date: one day after the last invoice in the data
snapshot_date = df['InvoiceDate'].max() + dt.timedelta(days=1)

rfm = df.groupby('CustomerID').agg({
    'InvoiceDate': lambda x: (snapshot_date - x.max()).days,  # days since last purchase
    'InvoiceNo': 'nunique',  # distinct invoices ('count' would count line items)
    'TotalPrice': 'sum'      # total spend
}).rename(columns={'InvoiceDate': 'Recency',
                   'InvoiceNo': 'Frequency',
                   'TotalPrice': 'Monetary'})

# Normalize each RFM column to the [0, 1] range
scaler = MinMaxScaler()
rfm_scaled = scaler.fit_transform(rfm)

# KMeans: silhouette score for each candidate K
k_values = range(2, 10)
sil = []
for k in k_values:
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    kmeans.fit(rfm_scaled)
    sil.append(silhouette_score(rfm_scaled, kmeans.labels_))

# Plot silhouette score
plt.plot(k_values, sil, marker='o')
plt.title("Silhouette Score for K")
plt.xlabel("K")
plt.ylabel("Score")
plt.show()

# Fit best K (the K with the highest silhouette score)
k_opt = k_values[sil.index(max(sil))]
kmeans = KMeans(n_clusters=k_opt, n_init=10, random_state=42)
rfm['Cluster'] = kmeans.fit_predict(rfm_scaled)
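
The report also names the Elbow method, which the listing above does not show. A minimal sketch under stated assumptions: the data are three synthetic, well-separated blobs rather than the real RFM table, and the "elbow" is read off the printed inertia values instead of a plot.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)

# Three hypothetical, well-separated blobs in the unit square.
centers = np.array([[0.1, 0.1], [0.5, 0.9], [0.9, 0.2]])
points = np.vstack([c + 0.03 * rng.standard_normal((50, 2)) for c in centers])

# Inertia = within-cluster sum of squared distances for each candidate K.
k_range = range(1, 7)
inertias = []
for k in k_range:
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(points)
    inertias.append(model.inertia_)

for k, inertia in zip(k_range, inertias):
    print(k, round(inertia, 3))
```

Inertia keeps falling as K grows, so the useful signal is where the decrease flattens; on this toy data that happens after K = 3. On real RFM data the bend is often less sharp, which is one reason to check it against the silhouette score.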

5. Visualize Customer Segments

sns.scatterplot(data=rfm, x='Recency', y='Monetary', hue='Cluster',
                palette='viridis')
plt.title("Customer Segmentation (Recency vs Monetary)")
plt.show()

6. Association Rule Mining

# One row per invoice, one column per product (UK orders only)
basket = (df[df['Country'] == "United Kingdom"]
          .groupby(['InvoiceNo', 'Description'])['Quantity']
          .sum().unstack(fill_value=0))

# Encode as booleans: True if the product appears on the invoice
# (replaces applymap, which is deprecated in recent pandas versions)
basket = basket > 0

frequent_itemsets = apriori(basket, min_support=0.01, use_colnames=True)

rules = association_rules(frequent_itemsets, metric='lift', min_threshold=1)

rules.sort_values('lift', ascending=False).head()

7. Anomaly Detection

features = df[['Quantity', 'UnitPrice', 'TotalPrice']]

iso = IsolationForest(contamination=0.01, random_state=42)
df['Anomaly'] = iso.fit_predict(features)

# Show anomalies
df[df['Anomaly'] == -1].head()
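
Since evaluation relies on manual inspection, it can help to rank flagged rows by how anomalous they are. The sketch below does this on synthetic data; the 200 ordinary order lines and the single bulk order are invented, and the `Score` column is an added assumption that uses scikit-learn's `decision_function` (lower means more anomalous).

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# 200 hypothetical ordinary order lines plus one extreme bulk order (last row).
qty = np.append(rng.integers(1, 13, size=200), 5000)
price = np.append(rng.uniform(0.5, 5.0, size=200).round(2), 2.0)
toy = pd.DataFrame({"Quantity": qty, "UnitPrice": price})
toy["TotalPrice"] = toy["Quantity"] * toy["UnitPrice"]

cols = ["Quantity", "UnitPrice", "TotalPrice"]
iso = IsolationForest(contamination=0.01, random_state=42)
toy["Anomaly"] = iso.fit_predict(toy[cols])      # -1 = anomaly, 1 = normal
toy["Score"] = iso.decision_function(toy[cols])  # lower = more anomalous

# Most anomalous rows first, ready for manual review.
print(toy.sort_values("Score").head(3))
```

Note that `contamination=0.01` flags roughly 1% of rows whether or not they are truly unusual, so the ranking is a review queue rather than a verdict.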

8. Results & Key Visuals

# Cluster count
sns.countplot(x='Cluster', data=rfm)
plt.title("Customer Distribution per Cluster")
plt.show()

# Top Association Rules (sorted by lift, strongest first)
top_rules = rules.sort_values('lift', ascending=False)
top_rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']].head()

9. Insights & Conclusion

- High-value customers were identified using K-Means on RFM scores. This
can help in targeting loyalty programs.
- Association rules showed product combinations with strong buying
patterns, helpful for promotions and bundling.
- Anomalies were mostly extreme transactions, some of which could be
errors or bulk purchases.
- This analysis offers practical insights for marketing and operations based on
real transactional data.

Data Mining Case Study Report: Online Retail Dataset

1. Introduction & Objective

In the age of digital commerce, understanding customer behavior through data is
crucial for businesses to stay competitive. The aim of this case study is to apply data
mining techniques to a real-world online retail dataset and extract meaningful
insights that can help in customer segmentation, marketing strategies, and inventory
management.

Objective:
Analyze transaction data using clustering, association rule mining, and anomaly
detection techniques to identify:

• Valuable customer groups

• Frequently co-purchased items
• Unusual or suspicious transactions

2. Dataset Overview

• Source: UCI Machine Learning Repository

• Dataset: Online Retail
• Type: Tabular
• Size: 541,909 rows, 8 columns (before cleaning)
• Fields:
o InvoiceNo: Unique invoice number
o StockCode: Product code
o Description: Product name
o Quantity: Number of units purchased
o InvoiceDate: Date and time of purchase
o UnitPrice: Price per unit in GBP
o CustomerID: Identifier for each customer
o Country: Country of transaction

Cleaning & Preprocessing Steps:

• Removed missing CustomerID entries

• Filtered out negative or zero values for Quantity and UnitPrice
• Added a TotalPrice field: Quantity × UnitPrice
• Extracted time-based features and handled duplicates

3. Techniques Used

A. Customer Segmentation using Clustering (RFM Analysis + K-Means)

We used Recency, Frequency, and Monetary (RFM) values to characterize customer
behavior. After scaling the data with MinMaxScaler, we ran K-Means for K = 2 to 9
and selected K = 4, the value with the highest Silhouette Score.

B. Association Rule Mining (Apriori Algorithm)

By converting transaction data into a basket format, the Apriori algorithm was used
to discover frequently bought itemsets. Association rules with high lift values were
identified to inform bundling and promotion strategies.

C. Anomaly Detection (Isolation Forest)

To detect unusual purchases, the Isolation Forest method was applied to Quantity,
UnitPrice, and TotalPrice features. Outliers were flagged for further review.

4. Results & Key Visuals

Customer Segments:

• Cluster 0: Low frequency, low spending customers

• Cluster 1: High-value customers (frequent, recent, and high spenders)
• Cluster 2: Moderate activity and recent buyers
• Cluster 3: Infrequent, older buyers
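
Labels such as "high-value" or "infrequent" are typically read off the average RFM values per cluster. The following is a sketch of that profiling step on a hypothetical six-customer table; the numbers and the `Customers` column name are assumptions, not results from this study.

```python
import pandas as pd

# Hypothetical RFM table with cluster labels already assigned.
rfm_demo = pd.DataFrame({
    "Recency":   [3, 10, 250, 300, 40, 35],
    "Frequency": [30, 22, 1, 2, 6, 8],
    "Monetary":  [5000.0, 4200.0, 60.0, 90.0, 700.0, 850.0],
    "Cluster":   [1, 1, 3, 3, 2, 2],
})

# Mean RFM values and customer count per cluster.
profile = (rfm_demo.groupby("Cluster")
           .agg(Recency=("Recency", "mean"),
                Frequency=("Frequency", "mean"),
                Monetary=("Monetary", "mean"),
                Customers=("Recency", "size"))
           .round(1))

print(profile)
```

In this toy table, the cluster with low Recency and high Frequency and Monetary would be the one to label as high-value.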

Visualization:

• Scatterplot: Recency vs Monetary, color-coded by cluster

• Countplot: Number of customers per cluster

Association Rules:

• High confidence and lift observed in:

o Gift sets often bought with party items
o Specific holiday decorations commonly purchased together

Top Rule Example:

• {Party Balloons} → {Gift Tags}

o Support: 0.012
o Confidence: 0.65
o Lift: 3.1
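
The three figures are linked by confidence = support(A and B) / support(A) and lift = confidence / support(B). Taking the quoted values at face value, the short calculation below recovers the individual item supports they imply; the variable names are illustrative.

```python
support_ab = 0.012  # support of {Party Balloons, Gift Tags} together
confidence = 0.65   # share of Party Balloons baskets that also contain Gift Tags
lift = 3.1

# Rearranging the definitions of confidence and lift.
support_a = support_ab / confidence  # implied support of {Party Balloons}
support_b = confidence / lift        # implied support of {Gift Tags}

print(f"support(Party Balloons) ~ {support_a:.4f}")
print(f"support(Gift Tags)      ~ {support_b:.4f}")
```

Checking implied supports like this is a quick sanity test when reading rule tables: an implied consequent support that looks implausibly high suggests the rule should be re-checked against the data.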

Anomaly Detection:

• Anomalies mostly include:

o Large bulk orders
o Possibly fraudulent or error-prone transactions

5. Insights & Conclusion

• Customer Segmentation helps tailor marketing and prioritize valuable clients.
• Product Affinity rules are ideal for cross-selling, discount offers, or bundle
promotions.
• Anomalies flagged can help in fraud detection or quality control of data.

Using data mining techniques on retail transaction data delivers valuable insights
into customer behavior, product relationships, and operational irregularities. These
insights can directly enhance marketing efforts, improve sales strategies, and
strengthen data quality control.
