Sales Data Clustering

The document outlines a sales data clustering analysis in Python on a sample sales dataset (2,823 rows, 25 columns). It covers data inspection, visualization of sales distributions, and K-Means and hierarchical (Ward) clustering on four standardized numeric features. Clustering quality is evaluated with inertia, silhouette scores, and the Davies-Bouldin and Calinski-Harabasz indices.

Uploaded by

Kpranit

sales-data-clustering

September 27, 2024

[46]: # This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/sample-sales-data/sales_data_sample.csv

[47]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=UserWarning)

df = pd.read_csv("/kaggle/input/sample-sales-data/sales_data_sample.csv", encoding='latin1')

[48]: df.head()

[48]: ORDERNUMBER QUANTITYORDERED PRICEEACH ORDERLINENUMBER SALES \

0 10107 30 95.70 2 2871.00
1 10121 34 81.35 5 2765.90
2 10134 41 94.74 2 3884.34
3 10145 45 83.26 6 3746.70
4 10159 49 100.00 14 5205.27

ORDERDATE STATUS QTR_ID MONTH_ID YEAR_ID … \

0 2/24/2003 0:00 Shipped 1 2 2003 …
1 5/7/2003 0:00 Shipped 2 5 2003 …
2 7/1/2003 0:00 Shipped 3 7 2003 …
3 8/25/2003 0:00 Shipped 3 8 2003 …
4 10/10/2003 0:00 Shipped 4 10 2003 …

ADDRESSLINE1 ADDRESSLINE2 CITY STATE \

0 897 Long Airport Avenue NaN NYC NY
1 59 rue de l'Abbaye NaN Reims NaN
2 27 rue du Colonel Pierre Avia NaN Paris NaN
3 78934 Hillside Dr. NaN Pasadena CA
4 7734 Strong St. NaN San Francisco CA

POSTALCODE COUNTRY TERRITORY CONTACTLASTNAME CONTACTFIRSTNAME DEALSIZE

0 10022 USA NaN Yu Kwai Small
1 51100 France EMEA Henriot Paul Small
2 75508 France EMEA Da Cunha Daniel Medium
3 90003 USA NaN Young Julie Medium
4 NaN USA NaN Brown Julie Medium

[5 rows x 25 columns]

[49]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2823 entries, 0 to 2822
Data columns (total 25 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ORDERNUMBER 2823 non-null int64
1 QUANTITYORDERED 2823 non-null int64
2 PRICEEACH 2823 non-null float64
3 ORDERLINENUMBER 2823 non-null int64
4 SALES 2823 non-null float64
5 ORDERDATE 2823 non-null object
6 STATUS 2823 non-null object
7 QTR_ID 2823 non-null int64
8 MONTH_ID 2823 non-null int64
9 YEAR_ID 2823 non-null int64
10 PRODUCTLINE 2823 non-null object
11 MSRP 2823 non-null int64
12 PRODUCTCODE 2823 non-null object
13 CUSTOMERNAME 2823 non-null object
14 PHONE 2823 non-null object
15 ADDRESSLINE1 2823 non-null object
16 ADDRESSLINE2 302 non-null object
17 CITY 2823 non-null object
18 STATE 1337 non-null object
19 POSTALCODE 2747 non-null object
20 COUNTRY 2823 non-null object
21 TERRITORY 1749 non-null object
22 CONTACTLASTNAME 2823 non-null object
23 CONTACTFIRSTNAME 2823 non-null object
24 DEALSIZE 2823 non-null object
dtypes: float64(2), int64(7), object(16)
memory usage: 551.5+ KB

[50]: df.isnull().sum()

[50]: ORDERNUMBER 0
QUANTITYORDERED 0
PRICEEACH 0
ORDERLINENUMBER 0
SALES 0
ORDERDATE 0
STATUS 0
QTR_ID 0
MONTH_ID 0
YEAR_ID 0
PRODUCTLINE 0
MSRP 0
PRODUCTCODE 0
CUSTOMERNAME 0
PHONE 0
ADDRESSLINE1 0
ADDRESSLINE2 2521
CITY 0
STATE 1486
POSTALCODE 76
COUNTRY 0
TERRITORY 1074
CONTACTLASTNAME 0
CONTACTFIRSTNAME 0
DEALSIZE 0
dtype: int64

[51]: plt.figure(figsize=(8,6))
sns.histplot(df['SALES'], kde=True, color='blue', bins=30)
plt.title('Sales Distribution')
plt.xlabel('Sales Amount')
plt.ylabel('Frequency')
plt.show()

[52]: plt.figure(figsize=(8,6))
sns.scatterplot(x='QUANTITYORDERED', y='SALES', data=df, hue='DEALSIZE', palette='coolwarm')
plt.title('Quantity Ordered vs Sales')
plt.xlabel('Quantity Ordered')
plt.ylabel('Sales')
plt.show()

[53]: plt.figure(figsize=(8,6))
sns.boxplot(x='PRICEEACH', data=df, color='lightcoral')
plt.title('Price Each Distribution')
plt.xlabel('Price Each')
plt.show()

[54]: plt.figure(figsize=(10,6))
# Note: seaborn 0.12+ deprecates ci=None in favor of errorbar=None
sns.barplot(x='PRODUCTLINE', y='SALES', data=df, estimator=sum, ci=None, palette='viridis')
plt.title('Total Sales per Product Line')
plt.xticks(rotation=45)
plt.ylabel('Total Sales')
plt.show()

[55]: plt.figure(figsize=(8,6))
sales_per_year = df.groupby('YEAR_ID')['SALES'].sum().reset_index()
sns.lineplot(x='YEAR_ID', y='SALES', data=sales_per_year, marker='o', color='dodgerblue')
plt.title('Yearly Sales Trend')
plt.xlabel('Year')
plt.ylabel('Total Sales')
plt.show()

[56]: plt.figure(figsize=(8,6))
sns.boxplot(x='DEALSIZE', y='SALES', data=df, palette='Set2')
plt.title('Deal Size Impact on Sales')
plt.xlabel('Deal Size')
plt.ylabel('Sales')
plt.show()

[57]: plt.figure(figsize=(12,6))
top_countries = df.groupby('COUNTRY')['SALES'].sum().nlargest(10).reset_index()
sns.barplot(x='COUNTRY', y='SALES', data=top_countries, palette='magma')
plt.title('Top 10 Countries by Total Sales')
plt.xticks(rotation=45)
plt.ylabel('Total Sales')
plt.show()

[58]: # Select numerical features for clustering
features = ['QUANTITYORDERED', 'PRICEEACH', 'SALES', 'MSRP']
X = df[features]

[59]: # Handle missing values (if any). The df.isnull().sum() output above shows
# no nulls in these four columns, so dropna() leaves all 2823 rows in place.
X = X.dropna()

# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
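
As a quick check on what the standardization step does, the sketch below compares `StandardScaler` with a manual z-score on a small made-up array. The values are illustrative and not taken from the sales data. It assumes the scaler's default settings, under which each column is centered on its mean and divided by its population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy values standing in for two numeric columns
# (for example a quantity and a price); not taken from the sales dataset.
toy = np.array([[30.0, 95.7],
                [34.0, 81.4],
                [41.0, 94.7],
                [45.0, 83.3],
                [49.0, 100.0]])

scaled = StandardScaler().fit_transform(toy)

# Manual z-score: subtract the column mean, divide by the population std (ddof=0)
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

Putting every feature on the same scale matters here because K-Means and Ward linkage both rely on Euclidean distance, so an unscaled column such as SALES would otherwise dominate the result.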

[60]: # Calculate WCSS for different cluster sizes
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, random_state=42)
    kmeans.fit(X_scaled)
    wcss.append(kmeans.inertia_)

# Plot the Elbow Curve
plt.figure(figsize=(8, 5))
plt.plot(range(1, 11), wcss, marker='o')
plt.title('Elbow Method for Optimal k')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
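
The elbow in a WCSS curve can be hard to read by eye, so a common complement is to compare silhouette scores across candidate values of k. The sketch below is not part of the original notebook: it uses synthetic blobs from `make_blobs` as a stand-in for `X_scaled`, and `n_init=10` is set explicitly as an assumption so that results do not depend on the installed scikit-learn version.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for X_scaled: three well-separated blobs (an assumption
# made for illustration; the real sales features are far less cleanly separated).
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=42,
)

silhouettes = {}
for k in range(2, 7):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo)
    silhouettes[k] = silhouette_score(X_demo, labels)

# Pick the k whose clustering has the highest mean silhouette score
best_k = max(silhouettes, key=silhouettes.get)
print(silhouettes)
print("k with the highest silhouette score:", best_k)
```

On the real data the same loop would run over `X_scaled`, and the preferred k may differ from the one suggested by the elbow plot.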

[61]: # Choose k based on the elbow method, e.g., k=3
kmeans = KMeans(n_clusters=3, random_state=42)
clusters = kmeans.fit_predict(X_scaled)

# Add the cluster labels to the original dataframe
df['Cluster'] = clusters

# Visualize the clusters on the first two standardized features
plt.figure(figsize=(8, 6))
sns.scatterplot(x=X_scaled[:, 0], y=X_scaled[:, 1], hue=clusters, palette='Set1')
plt.title('K-Means Clustering Results')
plt.xlabel('QUANTITYORDERED (standardized)')
plt.ylabel('PRICEEACH (standardized)')
plt.show()
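
Once labels are attached to the dataframe, a natural follow-up is to profile each cluster in the original units. The sketch below is an illustration rather than the notebook's own step: the miniature dataframe `demo` is randomly generated, its column names only mirror two of the notebook's features, and `n_init=10` is an assumed setting.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical miniature dataset; the values are random and for illustration only.
feature_cols = ["QUANTITYORDERED", "PRICEEACH"]
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "QUANTITYORDERED": rng.integers(20, 60, size=150),
    "PRICEEACH": rng.uniform(30, 100, size=150).round(2),
})

scaler = StandardScaler()
demo_scaled = scaler.fit_transform(demo[feature_cols])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
demo["Cluster"] = kmeans.fit_predict(demo_scaled)

# Per-cluster feature means and cluster sizes, in the original units
profile = demo.groupby("Cluster")[feature_cols].mean()
sizes = demo["Cluster"].value_counts().sort_index()

# Centroids mapped back from standardized space to the original units
centers_original = pd.DataFrame(
    scaler.inverse_transform(kmeans.cluster_centers_),
    columns=feature_cols,
)

print(profile)
print(sizes)
print(centers_original)
```

Reading the centroids in original units makes the clusters easier to describe, for example as low-quantity versus high-price orders.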

[62]: # Generate linkage matrix using the 'ward' method
linked = linkage(X_scaled, method='ward')

# Plot the dendrogram
plt.figure(figsize=(10, 7))
dendrogram(linked)
plt.title('Dendrogram for Hierarchical Clustering')
plt.xlabel('Samples')
plt.ylabel('Distance')
plt.show()
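
A dendrogram shows the full merge tree, and flat cluster labels can be read off the same linkage matrix by cutting it. The sketch below is a minimal illustration using SciPy's `fcluster` with the `maxclust` criterion. The synthetic blobs are an assumed stand-in for `X_scaled`, and this is not the step the notebook itself uses.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

# Synthetic stand-in for X_scaled (illustrative only)
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=42,
)

# Same linkage call as in the notebook, applied to the demo data
Z = linkage(X_demo, method="ward")

# Cut the tree so that at most 3 flat clusters remain; labels start at 1
flat_labels = fcluster(Z, t=3, criterion="maxclust")

print(np.unique(flat_labels, return_counts=True))
```

Cutting the Ward tree this way is closely related to fitting `AgglomerativeClustering` with `n_clusters=3` and `linkage='ward'`, although the label numbering differs.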

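[63]: # Choose the number of clusters, e.g., 3
# Ward linkage requires Euclidean distances, which is the default. The `affinity`
# argument was deprecated in scikit-learn 1.2 and later removed in favor of
# `metric`, so it is omitted here.
hierarchical_clustering = AgglomerativeClustering(n_clusters=3, linkage='ward')
hc_clusters = hierarchical_clustering.fit_predict(X_scaled)

# Add the cluster labels to the original dataframe
df['HC_Cluster'] = hc_clusters

# Visualize the clusters on the first two standardized features
plt.figure(figsize=(8, 6))
sns.scatterplot(x=X_scaled[:, 0], y=X_scaled[:, 1], hue=hc_clusters, palette='Set1')
plt.title('Hierarchical Clustering Results')
plt.xlabel('QUANTITYORDERED (standardized)')
plt.ylabel('PRICEEACH (standardized)')
plt.show()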

[64]: from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Inertia for K-Means
inertia = kmeans.inertia_
print(f'Inertia: {inertia}')

# Silhouette Score for K-Means
kmeans_silhouette = silhouette_score(X_scaled, clusters)
print(f'Silhouette Score (K-Means): {kmeans_silhouette}')
# Silhouette Score for Hierarchical Clustering
hc_silhouette = silhouette_score(X_scaled, hc_clusters)
print(f'Silhouette Score (Hierarchical): {hc_silhouette}')

# Davies-Bouldin Index for K-Means
kmeans_dbi = davies_bouldin_score(X_scaled, clusters)
print(f'Davies-Bouldin Index (K-Means): {kmeans_dbi}')
# Davies-Bouldin Index for Hierarchical Clustering
hc_dbi = davies_bouldin_score(X_scaled, hc_clusters)
print(f'Davies-Bouldin Index (Hierarchical): {hc_dbi}')

# Calinski-Harabasz Index for K-Means
kmeans_ch = calinski_harabasz_score(X_scaled, clusters)
print(f'Calinski-Harabasz Index (K-Means): {kmeans_ch}')
# Calinski-Harabasz Index for Hierarchical Clustering
hc_ch = calinski_harabasz_score(X_scaled, hc_clusters)
print(f'Calinski-Harabasz Index (Hierarchical): {hc_ch}')

Inertia: 4766.017133863959
Silhouette Score (K-Means): 0.3504523735351092
Silhouette Score (Hierarchical): 0.31527494740967016
Davies-Bouldin Index (K-Means): 1.0091895570978797
Davies-Bouldin Index (Hierarchical): 0.9888199869306714
Calinski-Harabasz Index (K-Means): 1930.6761983442068
Calinski-Harabasz Index (Hierarchical): 1672.3553032169289
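
The three indices are not read in the same direction: higher is better for the silhouette score and the Calinski-Harabasz index, while lower is better for the Davies-Bouldin index. The sketch below tabulates which method each metric favors, using the scores printed above rounded to four decimals. The dictionary layout and variable names are illustrative. Inertia is left out because it is reported for K-Means only.

```python
# Scores copied from the output above (rounded to four decimals)
scores = {
    "Silhouette": {"K-Means": 0.3505, "Hierarchical": 0.3153},
    "Davies-Bouldin": {"K-Means": 1.0092, "Hierarchical": 0.9888},
    "Calinski-Harabasz": {"K-Means": 1930.6762, "Hierarchical": 1672.3553},
}

# Direction of each metric: True means a higher value indicates better clustering
higher_is_better = {
    "Silhouette": True,
    "Davies-Bouldin": False,
    "Calinski-Harabasz": True,
}

winners = {}
for metric, by_method in scores.items():
    pick = max if higher_is_better[metric] else min
    winners[metric] = pick(by_method, key=by_method.get)

for metric, method in winners.items():
    print(f"{metric}: {method}")
```

On these numbers K-Means is favored by the silhouette and Calinski-Harabasz scores, while hierarchical clustering has a slightly lower Davies-Bouldin index, so the metrics do not fully agree.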
