Data Mining
The goal of this case study is to apply data mining techniques to the Online Retail
dataset available from the UCI Machine Learning Repository. The dataset contains
transactional data from a UK-based online retail store, recorded between December 2010
and December 2011.
Objective:
To identify purchasing patterns, customer segments, and product associations that
can inform business strategies such as customer retention, marketing campaigns, and
inventory management.
2. Dataset Overview
a. Problem Statement
b. Data Preprocessing
d. Evaluation
• Clustering:
o Silhouette Score used for evaluation (optimal K = 4)
• Association Rules:
o Metrics: Support, Confidence, and Lift
• Anomaly Detection:
o Manual inspection of flagged records
• Customer Clusters:
o High-value customers buy frequently and spend more
o Some clusters are seasonal or country-specific
• Association Rules:
o Gift sets and party supplies are often bought together
o Useful for bundle promotions
• Anomalies:
o Outliers correspond to data entry errors or bulk purchases
4. Deliverables
A. Jupyter Notebook
Includes:
Covers:
1. Important libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
2. Load dataset
df = pd.read_excel('Online Retail.xlsx')
df.head()
3. Data Preprocessing
# Convert InvoiceDate
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])
# Drop duplicates
df.drop_duplicates(inplace=True)
# Preview
df.info()
import datetime as dt
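The normalization step expects a per-customer table named rfm, whose construction is not shown in this section. The following is a minimal sketch of one common way to build such a table, assuming Recency is the number of days since the last purchase, Frequency is the number of distinct invoices, and Monetary is total spend. The toy rows, the TotalPrice definition, the snapshot date, and the name rfm_toy are illustrative assumptions, not the notebook's actual code.

```python
import datetime as dt
import pandas as pd

# Toy transactions using the column names of the Online Retail dataset
toy = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 2, 3],
    "InvoiceNo": ["A1", "A2", "B1", "B1", "B2", "C1"],
    "InvoiceDate": pd.to_datetime([
        "2011-11-01", "2011-12-01", "2011-10-15",
        "2011-10-15", "2011-12-05", "2011-06-30",
    ]),
    "Quantity": [2, 1, 10, 5, 3, 1],
    "UnitPrice": [3.0, 10.0, 1.5, 2.0, 4.0, 20.0],
})

# Line total per row (assumed definition: Quantity * UnitPrice)
toy["TotalPrice"] = toy["Quantity"] * toy["UnitPrice"]

# Reference date: one day after the last transaction in the data (assumption)
snapshot = toy["InvoiceDate"].max() + dt.timedelta(days=1)

# One row per customer with the three RFM features
rfm_toy = toy.groupby("CustomerID").agg(
    Recency=("InvoiceDate", lambda d: (snapshot - d.max()).days),
    Frequency=("InvoiceNo", "nunique"),
    Monetary=("TotalPrice", "sum"),
)
print(rfm_toy)
```

On the real data, the same aggregation would be applied to df after rows without a CustomerID have been handled, since those rows cannot be assigned to a customer.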
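# Imports needed for scaling, clustering, and evaluation
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# Normalize the RFM features to the [0, 1] range
# (rfm is assumed to be the per-customer RFM table built in an earlier step)
scaler = MinMaxScaler()
rfm_scaled = scaler.fit_transform(rfm)
# KMeans: compute the silhouette score for K = 2..9
sil = []
for k in range(2, 10):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(rfm_scaled)
    sil.append(silhouette_score(rfm_scaled, kmeans.labels_))
# Fit best K (the list index is offset by 2 because K starts at 2)
k_opt = sil.index(max(sil)) + 2
kmeans = KMeans(n_clusters=k_opt, random_state=42)
rfm['Cluster'] = kmeans.fit_predict(rfm_scaled)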
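Once each customer carries a cluster label, averaging the RFM features per cluster is one simple way to interpret the segments, for example to spot the high-value group. The sketch below uses a hand-made table; the values, the labels, and the names rfm_demo and profile are made up for illustration, whereas in the notebook the Cluster column comes from KMeans.

```python
import pandas as pd

# Hypothetical RFM table with cluster labels already assigned
rfm_demo = pd.DataFrame({
    "Recency": [5, 8, 200, 180, 30],
    "Frequency": [40, 35, 1, 2, 10],
    "Monetary": [5000.0, 4200.0, 50.0, 80.0, 900.0],
    "Cluster": [0, 0, 1, 1, 2],
})

# Mean of each RFM feature per cluster, plus the cluster size
profile = rfm_demo.groupby("Cluster")[["Recency", "Frequency", "Monetary"]].mean()
profile["Customers"] = rfm_demo.groupby("Cluster").size()
print(profile)
```

A cluster with low mean Recency and high mean Frequency and Monetary would correspond to frequent, high-spending customers.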
rules.sort_values('lift', ascending=False).head()
7. Anomaly Detection
# Show anomalies
df[df['Anomaly'] == -1].head()
# Cluster count
sns.countplot(x='Cluster', data=rfm)
plt.title("Customer Distribution per Cluster")
plt.show()
Objective:
Analyze transaction data using clustering, association rule mining, and anomaly
detection techniques to identify:
2. Dataset Overview
3. Techniques Used
By converting transaction data into a basket format, the Apriori algorithm was used
to discover frequently bought itemsets. Association rules with high lift values were
identified to inform bundling and promotion strategies.
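To make the three rule metrics concrete, the sketch below computes support, confidence, and lift by hand for a single rule on a few hypothetical baskets. It uses only the standard library and does not reproduce the Apriori run itself; the basket contents and the helper names support and rule_metrics are illustrative assumptions.

```python
# Hypothetical baskets: one set of items per invoice
baskets = [
    {"gift set", "party bunting", "candles"},
    {"gift set", "party bunting"},
    {"gift set", "mug"},
    {"party bunting", "candles"},
    {"mug"},
]
n_baskets = len(baskets)


def support(items):
    """Fraction of baskets that contain every item in `items`."""
    return sum(1 for basket in baskets if items <= basket) / n_baskets


def rule_metrics(antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    s_both = support(antecedent | consequent)
    confidence = s_both / support(antecedent)
    lift = confidence / support(consequent)
    return s_both, confidence, lift


s, c, lift_value = rule_metrics({"gift set"}, {"party bunting"})
print(f"support={s:.2f} confidence={c:.2f} lift={lift_value:.2f}")
```

A lift above 1 means the two items appear together more often than would be expected if they were bought independently, which is why high-lift rules are candidates for bundling.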
To detect unusual purchases, the Isolation Forest method was applied to Quantity,
UnitPrice, and TotalPrice features. Outliers were flagged for further review.
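A minimal sketch of this step with scikit-learn's IsolationForest, run on toy data rather than the real transactions. The toy rows, the TotalPrice definition (Quantity * UnitPrice), and the contamination value are assumptions chosen for illustration; the study does not state which settings were used.

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

# Toy data: 20 ordinary order lines plus one extreme bulk purchase
toy = pd.DataFrame({
    "Quantity": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] * 2 + [10000],
    "UnitPrice": [2.5, 3.0, 1.5, 4.0, 2.0] * 4 + [2.0],
})
toy["TotalPrice"] = toy["Quantity"] * toy["UnitPrice"]

features = toy[["Quantity", "UnitPrice", "TotalPrice"]]

# contamination is an assumed share of anomalies, not a value from the study
iso = IsolationForest(contamination=0.05, random_state=42)
toy["Anomaly"] = iso.fit_predict(features)  # -1 = anomaly, 1 = normal

# Flagged rows, to be reviewed manually
print(toy[toy["Anomaly"] == -1])
```

Because fit_predict labels anomalies as -1, the flagged rows can be selected with a filter on Anomaly == -1 and then inspected by hand.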
Visualization:
Association Rules:
Anomaly Detection: