K Medoids

This document discusses using k-medoids clustering to cluster housing data. The data is first preprocessed by standardizing it. KMedoids clustering from the scikit-learn-extra library is then applied to cluster the data into 3 clusters. The cluster labels are then added back to the original data frame.

Uploaded by

prerna sharma

k-medoids

February 29, 2024

[1]: import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

[9]: df = pd.read_csv("kc_house_data.csv")
df.head()

[9]: id date price bedrooms bathrooms sqft_living \


0 7129300520 20141013T000000 221900.0 3 1.00 1180
1 6414100192 20141209T000000 538000.0 3 2.25 2570
2 5631500400 20150225T000000 180000.0 2 1.00 770
3 2487200875 20141209T000000 604000.0 4 3.00 1960
4 1954400510 20150218T000000 510000.0 3 2.00 1680

sqft_lot floors waterfront view … grade sqft_above sqft_basement \


0 5650 1.0 0 0 … 7 1180 0
1 7242 2.0 0 0 … 7 2170 400
2 10000 1.0 0 0 … 6 770 0
3 5000 1.0 0 0 … 7 1050 910
4 8080 1.0 0 0 … 8 1680 0

yr_built yr_renovated zipcode lat long sqft_living15 \


0 1955 0 98178 47.5112 -122.257 1340
1 1951 1991 98125 47.7210 -122.319 1690
2 1933 0 98028 47.7379 -122.233 2720
3 1965 0 98136 47.5208 -122.393 1360
4 1987 0 98074 47.6168 -122.045 1800

sqft_lot15
0 5650
1 7639
2 8062
3 5000
4 7503

[5 rows x 21 columns]

[10]: df.describe()

[10]: id price bedrooms bathrooms sqft_living \


count 2.161300e+04 2.161300e+04 21613.000000 21613.000000 21613.000000
mean 4.580302e+09 5.400881e+05 3.370842 2.114757 2079.899736
std 2.876566e+09 3.671272e+05 0.930062 0.770163 918.440897
min 1.000102e+06 7.500000e+04 0.000000 0.000000 290.000000
25% 2.123049e+09 3.219500e+05 3.000000 1.750000 1427.000000
50% 3.904930e+09 4.500000e+05 3.000000 2.250000 1910.000000
75% 7.308900e+09 6.450000e+05 4.000000 2.500000 2550.000000
max 9.900000e+09 7.700000e+06 33.000000 8.000000 13540.000000

sqft_lot floors waterfront view condition \


count 2.161300e+04 21613.000000 21613.000000 21613.000000 21613.000000
mean 1.510697e+04 1.494309 0.007542 0.234303 3.409430
std 4.142051e+04 0.539989 0.086517 0.766318 0.650743
min 5.200000e+02 1.000000 0.000000 0.000000 1.000000
25% 5.040000e+03 1.000000 0.000000 0.000000 3.000000
50% 7.618000e+03 1.500000 0.000000 0.000000 3.000000
75% 1.068800e+04 2.000000 0.000000 0.000000 4.000000
max 1.651359e+06 3.500000 1.000000 4.000000 5.000000

grade sqft_above sqft_basement yr_built yr_renovated \


count 21613.000000 21613.000000 21613.000000 21613.000000 21613.000000
mean 7.656873 1788.390691 291.509045 1971.005136 84.402258
std 1.175459 828.090978 442.575043 29.373411 401.679240
min 1.000000 290.000000 0.000000 1900.000000 0.000000
25% 7.000000 1190.000000 0.000000 1951.000000 0.000000
50% 7.000000 1560.000000 0.000000 1975.000000 0.000000
75% 8.000000 2210.000000 560.000000 1997.000000 0.000000
max 13.000000 9410.000000 4820.000000 2015.000000 2015.000000

zipcode lat long sqft_living15 sqft_lot15


count 21613.000000 21613.000000 21613.000000 21613.000000 21613.000000
mean 98077.939805 47.560053 -122.213896 1986.552492 12768.455652
std 53.505026 0.138564 0.140828 685.391304 27304.179631
min 98001.000000 47.155900 -122.519000 399.000000 651.000000
25% 98033.000000 47.471000 -122.328000 1490.000000 5100.000000
50% 98065.000000 47.571800 -122.230000 1840.000000 7620.000000
75% 98118.000000 47.678000 -122.125000 2360.000000 10083.000000
max 98199.000000 47.777600 -121.315000 6210.000000 871200.000000

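[11]: # Drop row 15870 and renumber the index. The notebook does not say why;
# presumably this is the 33-bedroom listing seen in the describe() output.
df.drop(15870, axis=0, inplace=True)
df.reset_index(drop=True, inplace=True)
df.shape

[11]: (21612, 21)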
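
The hard-coded index above is not explained in the notebook. As a minimal sketch, assuming the intent was to remove the listing with 33 bedrooms, the row can be selected by condition instead of by position. The toy frame and the threshold of 30 are illustrative assumptions, not values from the original.

```python
import pandas as pd

# Toy stand-in for the housing data (values are made up for illustration).
toy = pd.DataFrame({
    "price": [221900.0, 640000.0, 180000.0],
    "bedrooms": [3, 33, 2],
})

# Assumed rule: treat more than 30 bedrooms as a data-entry error.
outlier_idx = toy.index[toy["bedrooms"] > 30]

# Drop by index label and renumber the rows, as the notebook does.
cleaned = toy.drop(outlier_idx).reset_index(drop=True)
print(cleaned.shape)
```

Selecting by condition keeps working if the CSV is reordered, whereas a fixed row number silently removes whatever happens to sit at that position.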

[12]: df[df.columns[df.isnull().sum()>0]].isnull().sum()

[12]: Series([], dtype: float64)

[13]: pip install scikit-learn-extra

Collecting scikit-learn-extra
Downloading scikit_learn_extra-0.3.0-cp311-cp311-win_amd64.whl.metadata (3.7
kB)
Requirement already satisfied: numpy>=1.13.3 in
c:\users\lenovo\appdata\local\programs\python\python311\lib\site-packages (from
scikit-learn-extra) (1.26.0)
Requirement already satisfied: scipy>=0.19.1 in
c:\users\lenovo\appdata\local\programs\python\python311\lib\site-packages (from
scikit-learn-extra) (1.11.4)
Requirement already satisfied: scikit-learn>=0.23.0 in
c:\users\lenovo\appdata\local\programs\python\python311\lib\site-packages (from
scikit-learn-extra) (1.3.2)
Requirement already satisfied: joblib>=1.1.1 in
c:\users\lenovo\appdata\local\programs\python\python311\lib\site-packages (from
scikit-learn>=0.23.0->scikit-learn-extra) (1.3.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in
c:\users\lenovo\appdata\local\programs\python\python311\lib\site-packages (from
scikit-learn>=0.23.0->scikit-learn-extra) (3.2.0)
Downloading scikit_learn_extra-0.3.0-cp311-cp311-win_amd64.whl (340 kB)
Installing collected packages: scikit-learn-extra
Successfully installed scikit-learn-extra-0.3.0
Note: you may need to restart the kernel to use updated packages.

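[14]: # Drop the identifier and the raw date string; neither is used as a feature.
df.drop(['date', 'id'], axis=1, inplace=True)

# Standardize every remaining column to zero mean and unit variance.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
Clus_dataSet = scaler.fit_transform(df)
Clus_dataSet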

[14]: array([[-0.86668617, -0.40692359, -1.44745951, …, -0.30611525,
        -0.94339773, -0.26072358],
       [-0.00567521, -0.40692359,  0.17558163, …, -0.74637458,
        -0.43272969, -0.18787744],
       [-0.98081575, -1.50829275, -1.44745951, …, -0.13569228,
         1.07009338, -0.17238527],
       …,
       [-0.37584455, -1.50829275, -1.77206774, …, -0.60435544,
        -1.41029422, -0.39414664],
       [-0.38156737, -0.40692359,  0.50018986, …,  1.02886466,
        -0.84126412, -0.42051628],
       [-0.58585659, -1.50829275, -1.77206774, …, -0.60435544,
        -1.41029422, -0.41795257]])
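
Standardization replaces each column x with (x - mean) / std, using the population standard deviation, so that large-valued columns such as price do not dominate the distance computation. The sketch below reproduces that arithmetic with NumPy on a made-up 4 x 2 matrix; the variable names are illustrative and not part of the notebook.

```python
import numpy as np

# Toy feature matrix: rows are houses, columns are features (made-up values).
X_toy = np.array([[221900.0, 3.0],
                  [538000.0, 3.0],
                  [180000.0, 2.0],
                  [604000.0, 4.0]])

mean = X_toy.mean(axis=0)
std = X_toy.std(axis=0)   # population standard deviation (ddof=0)

# Column-wise z-scores.
Z = (X_toy - mean) / std
print(Z.mean(axis=0).round(6), Z.std(axis=0).round(6))
```

After the transform every column has mean 0 and standard deviation 1, which is why the values in the array above are mostly small numbers around zero.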

[15]: from sklearn_extra.cluster import KMedoids

# Fit k-medoids with 3 clusters on the standardized features.
kmedoids = KMedoids(n_clusters=3).fit(Clus_dataSet)
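
For readers who want to see what a k-medoids fit does, here is a minimal NumPy sketch of the alternating scheme: assign each point to its nearest medoid, then move each medoid to the cluster member with the smallest total distance to the rest of its cluster. This is an illustration under stated assumptions (Euclidean distance, random initialization, a hypothetical function name `k_medoids`), not the scikit-learn-extra implementation, and it is run on synthetic data rather than the housing data.

```python
import numpy as np

def k_medoids(X, k, max_iter=100, seed=0):
    """Alternating k-medoids on Euclidean distances (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Full pairwise distance matrix: O(n^2) memory, fine for toy data only.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(max_iter):
        # Assignment step: nearest medoid for every point.
        labels = D[:, medoids].argmin(axis=1)
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if members.size:
                # Update step: the member with the smallest total distance
                # to the other members of its cluster becomes the medoid.
                within = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[j] = members[within.argmin()]
        if np.array_equal(new_medoids, medoids):
            break  # medoids stopped moving
        medoids = new_medoids
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels

# Synthetic data: three blobs of 20 points each.
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X_toy = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

medoids, labels = k_medoids(X_toy, k=3)
print("medoid rows:", medoids)
print("cluster sizes:", np.bincount(labels, minlength=3))
```

Unlike k-means centroids, the medoids are actual rows of the data, so each cluster is represented by a real observation. The pairwise distance matrix grows with the square of the number of rows, which is worth keeping in mind for a data set of about 21,600 houses.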

[16]: # Attach each row's cluster label as the first column of the data frame.
df.insert(0, 'kmedoids Cluster Labels', kmedoids.labels_)
df.head()

[16]: kmedoids Cluster Labels price bedrooms bathrooms sqft_living \


0 1 221900.0 3 1.00 1180
1 0 538000.0 3 2.25 2570
2 2 180000.0 2 1.00 770
3 0 604000.0 4 3.00 1960
4 2 510000.0 3 2.00 1680

sqft_lot floors waterfront view condition grade sqft_above \


0 5650 1.0 0 0 3 7 1180
1 7242 2.0 0 0 3 7 2170
2 10000 1.0 0 0 3 6 770
3 5000 1.0 0 0 5 7 1050
4 8080 1.0 0 0 3 8 1680

sqft_basement yr_built yr_renovated zipcode lat long \


0 0 1955 0 98178 47.5112 -122.257
1 400 1951 1991 98125 47.7210 -122.319
2 0 1933 0 98028 47.7379 -122.233
3 910 1965 0 98136 47.5208 -122.393
4 0 1987 0 98074 47.6168 -122.045

sqft_living15 sqft_lot15
0 1340 5650
1 1690 7639
2 2720 8062
3 1360 5000
4 1800 7503

[17]: X = df.loc[:, df.columns != 'kmedoids Cluster Labels']
X.head()

[17]: price bedrooms bathrooms sqft_living sqft_lot floors waterfront \


0 221900.0 3 1.00 1180 5650 1.0 0
1 538000.0 3 2.25 2570 7242 2.0 0
2 180000.0 2 1.00 770 10000 1.0 0
3 604000.0 4 3.00 1960 5000 1.0 0
4 510000.0 3 2.00 1680 8080 1.0 0

view condition grade sqft_above sqft_basement yr_built yr_renovated \


0 0 3 7 1180 0 1955 0
1 0 3 7 2170 400 1951 1991
2 0 3 6 770 0 1933 0
3 0 5 7 1050 910 1965 0
4 0 3 8 1680 0 1987 0

zipcode lat long sqft_living15 sqft_lot15


0 98178 47.5112 -122.257 1340 5650
1 98125 47.7210 -122.319 1690 7639
2 98028 47.7379 -122.233 2720 8062
3 98136 47.5208 -122.393 1360 5000
4 98074 47.6168 -122.045 1800 7503

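[18]: from sklearn import preprocessing

# This repeats the standardization from cell [14]; the rows printed below
# match the first rows of Clus_dataSet.
X = preprocessing.StandardScaler().fit(X).transform(X)
X[0:5]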

[18]: array([[-0.86668617, -0.40692359, -1.44745951, -0.97984121, -0.22832648,
        -0.91546593, -0.08717466, -0.3057672 , -0.62914619, -0.55885272,
        -0.73474634, -0.65864212, -0.5449314 , -0.21013346,  1.87013949,
        -0.35252787, -0.30611525, -0.94339773, -0.26072358],
       [-0.00567521, -0.40692359,  0.17558163,  0.53360192, -0.18989137,
         0.93645991, -0.08717466, -0.3057672 , -0.62914619, -0.55885272,
         0.46079706,  0.2451683 , -0.68111108,  4.74656291,  0.87957332,
         1.16160686, -0.74637458, -0.43272969, -0.18787744],
       [-0.98081575, -1.50829275, -1.44745951, -1.42625249, -0.12330593,
        -0.91546593, -0.08717466, -0.3057672 , -0.62914619, -1.40959054,
        -1.22987038, -0.65864212, -1.29391966, -0.21013346, -0.93334967,
         1.28357482, -0.13569228,  1.07009338, -0.17238527],
       [ 0.17409931,  0.69444556,  1.14940631, -0.13057096, -0.2440192 ,
        -0.91546593, -0.08717466, -0.3057672 ,  2.44468843, -0.55885272,
        -0.89173689,  1.39752658, -0.20448219, -0.21013346,  1.08516253,
        -0.28324429, -1.2718454 , -0.9142167 , -0.2845295 ],
       [-0.08194318, -0.40692359, -0.1490266 , -0.4354372 , -0.16965983,
        -0.91546593, -0.08717466, -0.3057672 , -0.62914619,  0.29188511,
        -0.13093654, -0.65864212,  0.54450607, -0.21013346, -0.07361299,
         0.40959143,  1.19928763, -0.27223402, -0.19285837]])

[19]: y = df["kmedoids Cluster Labels"]
y.head()

[19]: 0 1
1 0
2 2
3 0
4 2
Name: kmedoids Cluster Labels, dtype: int64

[20]: pip install pywaffle

Collecting pywaffle
Note: you may need to restart the kernel to use updated packages.
Downloading pywaffle-1.1.0-py2.py3-none-any.whl.metadata (2.6 kB)
Collecting fontawesomefree (from pywaffle)
Downloading fontawesomefree-6.5.1-py3-none-any.whl.metadata (824 bytes)
Requirement already satisfied: matplotlib in
c:\users\lenovo\appdata\local\programs\python\python311\lib\site-packages (from
pywaffle) (3.8.0)
Requirement already satisfied: contourpy>=1.0.1 in
c:\users\lenovo\appdata\local\programs\python\python311\lib\site-packages (from
matplotlib->pywaffle) (1.1.1)
Requirement already satisfied: cycler>=0.10 in
c:\users\lenovo\appdata\local\programs\python\python311\lib\site-packages (from
matplotlib->pywaffle) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in
c:\users\lenovo\appdata\local\programs\python\python311\lib\site-packages (from
matplotlib->pywaffle) (4.43.1)
Requirement already satisfied: kiwisolver>=1.0.1 in
c:\users\lenovo\appdata\local\programs\python\python311\lib\site-packages (from
matplotlib->pywaffle) (1.4.5)
Requirement already satisfied: numpy<2,>=1.21 in
c:\users\lenovo\appdata\local\programs\python\python311\lib\site-packages (from
matplotlib->pywaffle) (1.26.0)
Requirement already satisfied: packaging>=20.0 in
c:\users\lenovo\appdata\local\programs\python\python311\lib\site-packages (from
matplotlib->pywaffle) (23.2)
Requirement already satisfied: pillow>=6.2.0 in
c:\users\lenovo\appdata\local\programs\python\python311\lib\site-packages (from
matplotlib->pywaffle) (10.1.0)
Requirement already satisfied: pyparsing>=2.3.1 in
c:\users\lenovo\appdata\local\programs\python\python311\lib\site-packages (from
matplotlib->pywaffle) (3.1.1)

Requirement already satisfied: python-dateutil>=2.7 in
c:\users\lenovo\appdata\local\programs\python\python311\lib\site-packages (from
matplotlib->pywaffle) (2.8.2)
Requirement already satisfied: six>=1.5 in
c:\users\lenovo\appdata\local\programs\python\python311\lib\site-packages (from
python-dateutil>=2.7->matplotlib->pywaffle) (1.16.0)
Downloading pywaffle-1.1.0-py2.py3-none-any.whl (30 kB)
Downloading fontawesomefree-6.5.1-py3-none-any.whl (25.6 MB)
Installing collected packages: fontawesomefree, pywaffle
Successfully installed fontawesomefree-6.5.1 pywaffle-1.1.0

[21]: # Count the houses in each cluster; 'price' is used only as a column to count.
Count = (df.groupby(["kmedoids Cluster Labels"], as_index=False)
           .count()[["kmedoids Cluster Labels", "price"]])
Count.columns = ["kmedoids Cluster Labels", "Count"]
Count

[21]: kmedoids Cluster Labels Count


0 0 6776
1 1 7964
2 2 6872

[23]: from pywaffle import Waffle

fig = plt.figure(
    FigureClass=Waffle,
    figsize=(6, 8),
    rows=5,
    values=list(Count.Count/150),  # scale counts so one icon is roughly 150 houses
    colors=("magenta", "yellow", "cyan"),
    legend={'loc': 'upper left', 'bbox_to_anchor': (1, 1)},
    icons='sticky-note', icon_size=18,
    icon_legend=True,
    title={'label': 'Number of Houses in each K-medoids Cluster', 'loc': 'center'},
    labels=list(Count['kmedoids Cluster Labels']))
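
The division by 150 in the plotting cell sets the scale of the chart. The short sketch below works through that arithmetic using the cluster sizes from cell [21]; the variable name `houses_per_icon` is illustrative, and how the plotting library rounds fractional values is not covered here.

```python
counts = [6776, 7964, 6872]   # cluster sizes from cell [21]
houses_per_icon = 150         # divisor used in the plotting cell

# Icons per cluster before any rounding, and each cluster's share of all houses.
values = [c / houses_per_icon for c in counts]
shares = [c / sum(counts) for c in counts]

for label, v, s in zip(range(3), values, shares):
    print(f"cluster {label}: {v:.1f} icons, {s:.1%} of houses")
```

The three clusters are of comparable size, so the chart shows three blocks of similar area, with cluster 1 the largest.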

[24]: # Mean price per cluster.
labels = (df.groupby(["kmedoids Cluster Labels"], as_index=False)
            .mean()[["kmedoids Cluster Labels", "price"]])
labels

[24]: kmedoids Cluster Labels price


0 0 783367.394923
1 1 327354.355977
2 2 546731.293510
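
The mean price separates the clusters clearly, but a mean can be pulled by a few expensive houses. As a minimal sketch on made-up numbers (the toy frame and the choice of `sqft_living` as a second column are assumptions for illustration), several statistics can be computed per cluster in one pass:

```python
import pandas as pd

# Toy data: cluster labels with made-up prices and living areas.
toy = pd.DataFrame({
    "kmedoids Cluster Labels": [0, 0, 1, 1, 2],
    "price": [800000.0, 760000.0, 300000.0, 350000.0, 550000.0],
    "sqft_living": [2900, 2700, 1300, 1400, 2000],
})

# Mean and median of each selected column, per cluster.
summary = (toy.groupby("kmedoids Cluster Labels")[["price", "sqft_living"]]
              .agg(["mean", "median"]))
print(summary)
```

Comparing mean and median side by side gives a quick check on whether a cluster's average price is driven by outliers.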
