Kmeansclustering Sales Dataset
November 6, 2024
C:\Users\ASUS\AppData\Local\Temp\ipykernel_24628\4080736814.py:1:
DeprecationWarning:
Pyarrow will become a required dependency of pandas in the next major release of
pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better
interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466
import pandas as pd
[4]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2823 entries, 0 to 2822
Data columns (total 25 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ORDERNUMBER 2823 non-null int64
1 QUANTITYORDERED 2823 non-null int64
2 PRICEEACH 2823 non-null float64
3 ORDERLINENUMBER 2823 non-null int64
4 SALES 2823 non-null float64
5 ORDERDATE 2823 non-null object
6 STATUS 2823 non-null object
7 QTR_ID 2823 non-null int64
8 MONTH_ID 2823 non-null int64
9 YEAR_ID 2823 non-null int64
10 PRODUCTLINE 2823 non-null object
11 MSRP 2823 non-null int64
12 PRODUCTCODE 2823 non-null object
13 CUSTOMERNAME 2823 non-null object
14 PHONE 2823 non-null object
15 ADDRESSLINE1 2823 non-null object
16 ADDRESSLINE2 302 non-null object
17 CITY 2823 non-null object
18 STATE 1337 non-null object
19 POSTALCODE 2747 non-null object
20 COUNTRY 2823 non-null object
21 TERRITORY 1749 non-null object
22 CONTACTLASTNAME 2823 non-null object
23 CONTACTFIRSTNAME 2823 non-null object
24 DEALSIZE 2823 non-null object
dtypes: float64(2), int64(7), object(16)
memory usage: 551.5+ KB
[6]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2823 entries, 0 to 2822
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ORDERLINENUMBER 2823 non-null int64
1 SALES 2823 non-null float64
dtypes: float64(1), int64(1)
memory usage: 44.2 KB
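The cell that reduced the frame from 25 columns to 2 is not shown in this export. A minimal sketch of one way to do it, assuming plain column selection. The small `raw` frame and its values are made up for illustration and are not the sales data:

```python
import pandas as pd

# Hypothetical stand-in rows; the real notebook loads the sales dataset instead.
raw = pd.DataFrame({
    "ORDERNUMBER": [1, 2, 3],
    "ORDERLINENUMBER": [2, 5, 1],
    "SALES": [1500.0, 2750.5, 980.25],
    "STATUS": ["Shipped", "Shipped", "Shipped"],
})

# Keep only the two numeric features used for clustering.
features = raw[["ORDERLINENUMBER", "SALES"]].copy()
print(features.dtypes)
```

Using `.copy()` avoids pandas chained-assignment warnings if columns are added to the reduced frame later.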
[7]: df.isna().sum()
[7]: ORDERLINENUMBER 0
SALES 0
dtype: int64
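The clustering cells below use `scaled_values`, but the cell that creates it is not shown. The WCSS reported for k = 1 is 5646.0, which equals 2823 rows × 2 columns, and that is what standardizing both columns to zero mean and unit variance would produce. So `StandardScaler` is a plausible guess, though it remains an assumption. A sketch on synthetic data (not the real sales values):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for the two features (ORDERLINENUMBER, SALES).
X = np.column_stack([
    rng.integers(1, 19, size=200),
    rng.normal(3500, 1800, size=200),
])

# Assumed preprocessing step: standardize each column.
scaled_values = StandardScaler().fit_transform(X)

# Each standardized column has mean 0 and population variance 1, so the
# total sum of squares equals n_samples * n_features.
total_ss = float((scaled_values ** 2).sum())
print(total_ss, X.shape[0] * X.shape[1])
```

Scaling matters here because SALES is in the thousands while ORDERLINENUMBER is a small integer; without it, Euclidean distances would be dominated by SALES.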
[9]: # Import KMeans and compute the within-cluster sum of squares (WCSS) for k = 1..10
from sklearn.cluster import KMeans

wcss = []
for i in range(1, 11):
    model = KMeans(n_clusters=i)
    model.fit(scaled_values)
    # The inertia is the sum of squared distances from each data point
    # to the center of its assigned cluster
    wcss.append(model.inertia_)
wcss
[18]: [5646.0,
3598.6969488881828,
2087.4819726029436,
1737.9042147878802,
1394.3494502342178,
1122.143072102116,
1000.5032931492917,
868.2475253342712,
794.2912840469619,
735.634443264441]
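To make the inertia definition concrete, the sketch below recomputes it by hand and compares it with `model.inertia_`. It runs on synthetic data, and `n_init` and `random_state` are set here only to make the run repeatable; the notebook's own call sets neither, so its WCSS values can vary slightly between runs.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))  # synthetic stand-in for scaled_values

model = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = model.fit_predict(X)

# Look up the centroid assigned to each point, then sum the squared
# Euclidean distances between points and their centroids.
assigned_centers = model.cluster_centers_[labels]
manual_inertia = float(((X - assigned_centers) ** 2).sum())

print(manual_inertia, model.inertia_)
```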
import matplotlib.pyplot as plt

plt.plot(range(1, 11), wcss, 'ro-')  # elbow plot: WCSS against k
plt.show()
[12]: # Based on the elbow plot, K = 7 seems to be a reasonable choice for k
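Since the plot itself is not reproduced in this export, one way to read the elbow numerically is to look at the relative reduction in WCSS from each additional cluster. This is only a complement to inspecting the plot, not a rule for choosing k. The values are copied from the WCSS output listed earlier:

```python
# WCSS values copied from the notebook output, for k = 1..10.
wcss = [
    5646.0,
    3598.6969488881828,
    2087.4819726029436,
    1737.9042147878802,
    1394.3494502342178,
    1122.143072102116,
    1000.5032931492917,
    868.2475253342712,
    794.2912840469619,
    735.634443264441,
]

# Relative reduction in WCSS gained by going from k-1 to k clusters.
rel_drop = {
    k: (wcss[k - 2] - wcss[k - 1]) / wcss[k - 2]
    for k in range(2, len(wcss) + 1)
}

for k, drop in rel_drop.items():
    print(f"k={k}: {drop:.1%}")
```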
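from sklearn.cluster import AgglomerativeClustering

# Ward linkage merges the pair of clusters that least increases within-cluster variance
agg_cluster = AgglomerativeClustering(n_clusters=7, metric='euclidean', linkage='ward')
y_agg = agg_cluster.fit_predict(scaled_values)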
[38]: y_agg
[34]: df
2822 9 3079.44 0
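The last row of the listing shows a third value next to ORDERLINENUMBER and SALES, presumably a cluster label, but the cell that attached it is not shown. A minimal sketch of how labels might be added to the frame and summarized. The column name `cluster` is hypothetical, and the data is synthetic:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for the two-column frame used in the notebook.
df = pd.DataFrame({
    "ORDERLINENUMBER": rng.integers(1, 19, size=100),
    "SALES": rng.normal(3500, 1800, size=100).round(2),
})
scaled_values = StandardScaler().fit_transform(df)

# Ward linkage only supports Euclidean distance, so `metric` is left at
# its default here.
agg_cluster = AgglomerativeClustering(n_clusters=7, linkage="ward")
df["cluster"] = agg_cluster.fit_predict(scaled_values)  # hypothetical column name

# Number of rows assigned to each cluster.
print(df["cluster"].value_counts().sort_index())
```

Checking the cluster sizes is a quick sanity test: a cluster with only a handful of rows may indicate outliers rather than a meaningful group.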