Kmeansclustering Sales Dataset

Uploaded by

tryhackkme123
kmeansclustering-sales-dataset

November 6, 2024

[1]: import pandas as pd

C:\Users\ASUS\AppData\Local\Temp\ipykernel_24628\4080736814.py:1:
DeprecationWarning:
Pyarrow will become a required dependency of pandas in the next major release of
pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better
interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466

import pandas as pd

[2]: # Read the Dataset


dataframe = pd.read_csv("sales_data_sample.csv", encoding="ISO-8859-1")

[3]: # Create a Copy of the Dataset, we will work on this Copy


df = dataframe.copy()  # .copy() makes an independent DataFrame; plain assignment would only alias it
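
The difference between binding a second name and making a real copy can be seen on a toy frame. This is a minimal sketch; the frame and its values are made up for illustration and are not taken from the sales data.

```python
import pandas as pd

# Hypothetical toy frame standing in for the sales data
original = pd.DataFrame({"SALES": [100.0, 200.0]})

alias = original               # second name for the same object, nothing is copied
independent = original.copy()  # separate object with its own data

alias.loc[0, "SALES"] = -1.0        # also visible through `original`
independent.loc[1, "SALES"] = -2.0  # `original` is not affected

print(alias is original)
print(original["SALES"].tolist())
print(independent["SALES"].tolist())
```

Working on a true copy keeps the frame returned by `read_csv` intact if later cells add or overwrite columns.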

[4]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2823 entries, 0 to 2822
Data columns (total 25 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   ORDERNUMBER       2823 non-null   int64
 1   QUANTITYORDERED   2823 non-null   int64
 2   PRICEEACH         2823 non-null   float64
 3   ORDERLINENUMBER   2823 non-null   int64
 4   SALES             2823 non-null   float64
 5   ORDERDATE         2823 non-null   object
 6   STATUS            2823 non-null   object
 7   QTR_ID            2823 non-null   int64
 8   MONTH_ID          2823 non-null   int64
 9   YEAR_ID           2823 non-null   int64
 10  PRODUCTLINE       2823 non-null   object
 11  MSRP              2823 non-null   int64
 12  PRODUCTCODE       2823 non-null   object
 13  CUSTOMERNAME      2823 non-null   object
 14  PHONE             2823 non-null   object
 15  ADDRESSLINE1      2823 non-null   object
 16  ADDRESSLINE2      302 non-null    object
 17  CITY              2823 non-null   object
 18  STATE             1337 non-null   object
 19  POSTALCODE        2747 non-null   object
 20  COUNTRY           2823 non-null   object
 21  TERRITORY         1749 non-null   object
 22  CONTACTLASTNAME   2823 non-null   object
 23  CONTACTFIRSTNAME  2823 non-null   object
 24  DEALSIZE          2823 non-null   object
dtypes: float64(2), int64(7), object(16)
memory usage: 551.5+ KB

[5]: # Drop the Unnecessary Columns


df = df[['ORDERLINENUMBER', 'SALES']].copy()  # .copy() avoids SettingWithCopyWarning when a column is added later

[6]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2823 entries, 0 to 2822
Data columns (total 2 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   ORDERLINENUMBER  2823 non-null   int64
 1   SALES            2823 non-null   float64
dtypes: float64(1), int64(1)
memory usage: 44.2 KB

[7]: df.isna().sum()

[7]: ORDERLINENUMBER    0
     SALES              0
     dtype: int64

[8]: # Standard Preprocessing

from sklearn.preprocessing import StandardScaler


scaler = StandardScaler()
scaled_values = scaler.fit_transform(df.values)

# StandardScaler rescales each column to mean 0 and standard deviation 1,
# so both features contribute comparably to the Euclidean distances used by K-means
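
What the scaler computes can be checked by hand on a tiny array. The sketch below uses made-up numbers (not rows from the dataset) and assumes nothing beyond NumPy and scikit-learn.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values standing in for the ORDERLINENUMBER and SALES columns
X = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [6.0, 5000.0]])

scaled = StandardScaler().fit_transform(X)

# Same result by hand: subtract each column's mean and divide by its
# population standard deviation (ddof=0), which is what StandardScaler uses
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(scaled, manual))
print(np.allclose(scaled.mean(axis=0), 0.0), np.allclose(scaled.std(axis=0), 1.0))
```

Without this step the SALES column, whose values are in the thousands, would dominate the distance computation over ORDERLINENUMBER.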
[9]: # Import KMeansClustering

from sklearn.cluster import KMeans

[18]: # Finding k with the Elbow Method


# Within Cluster Sum of Squares of Distances

wcss = []

for i in range(1, 11):
    model = KMeans(n_clusters=i)
    model.fit(scaled_values)
    wcss.append(model.inertia_)

wcss
# The inertia is the sum of squared distances from each data point
# to the center of its assigned cluster

[18]: [5646.0,
3598.6969488881828,
2087.4819726029436,
1737.9042147878802,
1394.3494502342178,
1122.143072102116,
1000.5032931492917,
868.2475253342712,
794.2912840469619,
735.634443264441]
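
The definition of inertia in the comment above can be verified directly. The following is a sketch on synthetic data (the random points, `n_init=10` and `random_state=0` are assumptions chosen for reproducibility, not settings from the notebook).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for scaled_values
rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.normal(size=(200, 2)))

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Look up the center assigned to every point, then sum the squared distances
assigned_centers = model.cluster_centers_[model.labels_]
manual_wcss = ((X - assigned_centers) ** 2).sum()
print(np.isclose(manual_wcss, model.inertia_))

# With k = 1 the single center is the overall mean, which is 0 after
# standardizing, so the WCSS equals n_samples * n_features
one = KMeans(n_clusters=1, n_init=10, random_state=0).fit(X)
print(np.isclose(one.inertia_, X.shape[0] * X.shape[1]))
```

The second check explains the first value in the list above: 2823 rows times 2 standardized features gives 5646.0.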

[32]: # Plot the Elbow Plot


import matplotlib.pyplot as plt

plt.plot(range(1, 11), wcss, 'ro-')
plt.xlabel("Number of clusters (k)")
plt.ylabel("WCSS (inertia)")
plt.show()
[12]: # From the elbow plot, k = 7 seems to be a reasonable choice for the number of clusters
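
Reading an elbow off a plot is a judgement call, so it may help to put numbers next to the curve. The sketch below reuses the WCSS values printed earlier in this notebook and reports how much each extra cluster reduces the WCSS relative to the previous k; it is only one possible aid, not a rule for picking k.

```python
# WCSS values copied from the notebook output for k = 1..10
wcss = [5646.0, 3598.6969488881828, 2087.4819726029436, 1737.9042147878802,
        1394.3494502342178, 1122.143072102116, 1000.5032931492917,
        868.2475253342712, 794.2912840469619, 735.634443264441]

# Relative reduction (in percent) gained by going from k-1 to k clusters
drops = [(prev - cur) / prev * 100 for prev, cur in zip(wcss, wcss[1:])]

for k, drop in enumerate(drops, start=2):
    print(f"k={k}: WCSS falls by {drop:.1f}% relative to k={k - 1}")
```

On these values the largest relative reductions come from moving to k = 2 and k = 3, and the gains taper off gradually after that.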

[13]: kmeans_model = KMeans(n_clusters=7)

[14]: cluster = kmeans_model.fit_predict(scaled_values)

[33]: # import warnings


# warnings.filterwarnings('ignore')
df['Cluster'] = cluster
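
Once labels are attached, the clusters are easier to interpret in the original units. This sketch runs on synthetic data with 3 clusters (the data, `n_clusters=3`, `n_init=10` and `random_state=0` are assumptions for illustration); it maps the fitted centers back with `inverse_transform` and summarizes each cluster with `groupby`.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the two columns used in the notebook
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "ORDERLINENUMBER": rng.integers(1, 19, size=300),
    "SALES": rng.uniform(500, 14000, size=300),
})

scaler = StandardScaler()
scaled_values = scaler.fit_transform(df.values)

kmeans_model = KMeans(n_clusters=3, n_init=10, random_state=0)
df["Cluster"] = kmeans_model.fit_predict(scaled_values)

# Centers live in standardized space; map them back to the original units
centers = pd.DataFrame(
    scaler.inverse_transform(kmeans_model.cluster_centers_),
    columns=["ORDERLINENUMBER", "SALES"],
)
print(centers)

# Size and per-feature mean of every cluster
summary = df.groupby("Cluster").agg(
    n=("SALES", "size"),
    mean_line=("ORDERLINENUMBER", "mean"),
    mean_sales=("SALES", "mean"),
)
print(summary)
```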

[37]: from sklearn.cluster import AgglomerativeClustering


import scipy.cluster.hierarchy as sch

# Dendrogram to visualize hierarchical clustering


plt.figure(figsize=(10, 7))
dendrogram = sch.dendrogram(sch.linkage(scaled_values, method='ward'))
plt.title("Dendrogram for Hierarchical Clustering")
plt.xlabel("Samples")
plt.ylabel("Euclidean distances")
plt.show()

# Implement Agglomerative Clustering

agg_cluster = AgglomerativeClustering(n_clusters=7, metric='euclidean', linkage='ward')

y_agg = agg_cluster.fit_predict(scaled_values)
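
The linkage matrix behind the dendrogram can also be cut into flat clusters with SciPy's `fcluster`, which is one way to relate the tree to a fixed number of groups. The sketch below uses three synthetic, well-separated blobs (an assumption for illustration) rather than the sales data.

```python
import numpy as np
import scipy.cluster.hierarchy as sch

# Three well-separated synthetic blobs of 20 points each
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.2, size=(20, 2)) for c in (0.0, 5.0, 10.0)])

# Same Ward linkage as used for the dendrogram
Z = sch.linkage(X, method="ward")

# Cut the tree so that at most 3 flat clusters remain; labels start at 1
labels = sch.fcluster(Z, t=3, criterion="maxclust")
print(np.unique(labels))
```

Note that `fcluster` numbers clusters from 1, while `AgglomerativeClustering.fit_predict` numbers them from 0, so labels from the two are not directly comparable value by value.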

[38]: y_agg

[38]: array([0, 0, 4, …, 4, 0, 2], dtype=int64)

[34]: df

[34]:       ORDERLINENUMBER    SALES  Cluster
      0                   2  2871.00        6
      1                   5  2765.90        6
      2                   2  3884.34        2
      3                   6  3746.70        2
      4                  14  5205.27        1
      …                   …        …        …
      2818               15  2244.40        3
      2819                1  3978.51        2
      2820                4  5417.57        5
      2821                1  2116.16        6
      2822                9  3079.44        0

[2823 rows x 3 columns]

[35]: plt.scatter(df['ORDERLINENUMBER'], df['SALES'],c=df['Cluster'])

[35]: <matplotlib.collections.PathCollection at 0x2407312e9f0>
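
A labeled version of the scatter plot is usually easier to read. This is a sketch on synthetic points with made-up cluster labels; the `Agg` backend line is only there so the example also runs outside a notebook and can be dropped in Jupyter.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-ins for df['ORDERLINENUMBER'], df['SALES'] and df['Cluster']
rng = np.random.default_rng(0)
x = rng.integers(1, 19, size=100)
y = rng.uniform(500, 14000, size=100)
labels = rng.integers(0, 7, size=100)  # hypothetical cluster labels 0..6

fig, ax = plt.subplots(figsize=(8, 5))
points = ax.scatter(x, y, c=labels, cmap="tab10", s=15)
ax.set_xlabel("ORDERLINENUMBER")
ax.set_ylabel("SALES")
ax.set_title("K-means clusters")

# legend_elements() builds one legend entry per distinct color value
ax.legend(*points.legend_elements(), title="Cluster")
```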

[ ]: