This project analyzes the AMESHOUSING3 dataset to identify clusters of houses based on their features and sale prices using clustering techniques. The analysis revealed four distinct groups of houses, with overall quality, size, and year built being significant indicators of pricing. The client is encouraged to use these insights for tailored pricing strategies to enhance market segmentation and sales performance.
4/18/25, 9:07 PM
lab 11 ml - Jupyter Notebook
INTRODUCTION
The real estate market often includes houses with varying characteristics,
making it challenging to analyze patterns across sales. The client seeks to
identify natural groupings within the AMESHOUSING3 dataset to better
understand the types of houses sold. This project aims to apply clustering
techniques to housing feature data to identify meaningful clusters and
relate them to the sale price of houses. Understanding these clusters can
help in pricing, marketing, and investment decisions in the real estate
domain.
In [2]:
# Load the dataset and perform initial data cleaning and exploration
import pandas as pd

df = pd.read_csv("ameshousing3.csv")
df.head()
Out[2]:
(first five rows, showing Obs, PID, Lot_Area, House_Style, Overall_Qual, Overall_Cond, Year_Built, Heating, …; 5 rows × 31 columns)
In [4]:
# Check for missing values, then drop incomplete rows
df.isnull().sum().sort_values(ascending=False).head(10)
df = df.dropna()
In [5]:
df.describe()
Out[5]: (summary statistics for the 266 remaining rows: count, mean, std, min, quartiles, and max for Fireplaces, Garage_Area, Basement_Area, Full_Bathroom, Half_Bathroom, …)
In [7]:
# Select relevant numeric variables based on their significance
features = ['Lot_Area', 'Overall_Qual', 'Year_Built', 'Gr_Liv_Area']
df_selected = df[features + ['SalePrice']].copy()
In [9]:
# Normalize the selected features for clustering
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_data = scaler.fit_transform(df_selected[features])
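StandardScaler rescales each feature to zero mean and unit variance (z = (x − μ)/σ), which keeps large-scale features such as Lot_Area from dominating the distance computations in k-means. A minimal sketch of that property on toy data (not the housing matrix itself):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
Z = StandardScaler().fit_transform(X)

# After scaling, each column has mean ~0 and standard deviation ~1
print(Z.mean(axis=0))
print(Z.std(axis=0))
```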
In [12]:
# Convert the data to ARFF format to use in WEKA
from scipy.io import arff
import numpy as np

# Convert to a DataFrame first, then save as CSV
df_scaled = pd.DataFrame(scaled_data, columns=features)
df_scaled.to_csv("housing_scaled.csv", index=False)
# Manually convert this CSV to ARFF using WEKA's CSVLoader

Load the ARFF file into WEKA and apply the k-means clustering algorithm.
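WEKA's CSVLoader handles the conversion from the GUI, but the ARFF format is simple enough to write directly from pandas. The `to_arff` helper below is a hypothetical sketch (not part of pandas or scipy) that assumes all columns are numeric, which holds for the scaled feature matrix:

```python
import pandas as pd

def to_arff(df: pd.DataFrame, relation: str, path: str) -> None:
    """Write a numeric-only DataFrame to a minimal ARFF file."""
    with open(path, "w") as f:
        f.write(f"@RELATION {relation}\n\n")
        for col in df.columns:
            f.write(f"@ATTRIBUTE {col} NUMERIC\n")
        f.write("\n@DATA\n")
        for _, row in df.iterrows():
            f.write(",".join(str(v) for v in row.values) + "\n")

# Tiny demo frame standing in for df_scaled
demo = pd.DataFrame({"Lot_Area": [0.5, -1.2], "Overall_Qual": [1.0, 0.3]})
to_arff(demo, "housing_scaled", "housing_scaled.arff")
```

Calling `to_arff(df_scaled, "housing_scaled", "housing_scaled.arff")` would then produce a file WEKA can open directly, skipping the manual CSVLoader step.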
In [13]:
# Visualize the distributions of key features
import seaborn as sns
import matplotlib.pyplot as plt

sns.histplot(df['Gr_Liv_Area'], kde=True)
plt.title("Distribution of Above Ground Living Area")
plt.show()
[Histogram: Distribution of Above Ground Living Area — x-axis Gr_Liv_Area, y-axis Count]
Cluster Analysis Process
In [16]:
from sklearn.cluster import KMeans

# Choose number of clusters (e.g., k=4)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
clusters = kmeans.fit_predict(scaled_data)

# Add cluster labels back to the selected dataframe
df_selected['Cluster'] = clusters
In [17]:
# See the average values for each feature within each cluster
print(df_selected.groupby('Cluster').mean())
             Lot_Area  Overall_Qual   Year_Built  Gr_Liv_Area      SalePrice
Cluster
0         7381.975000      5.800000  1929.150000  1285.075000  126851.350000
1         7844.548387      4.623656  1953.666667     924.8602  115198.924731
2        11530.171429      5.557143  1967.128571  1249.242857  157602.142857
3         6001.301587      6.761905  1995.714286  1251.190476  171197.317460
In [18]:
sns.boxplot(x='Cluster', y='SalePrice', data=df_selected)
plt.title("Sale Price Distribution by Cluster")
plt.show()
[Boxplot: Sale Price Distribution by Cluster — SalePrice roughly 50,000–300,000 across Clusters 0–3]
In [19]:
inertia = []
k_range = range(1, 10)

for k in k_range:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(scaled_data)
    inertia.append(km.inertia_)

# Plot the elbow curve
plt.plot(k_range, inertia, marker='o')
plt.xlabel("Number of Clusters")
plt.ylabel("Inertia")
plt.title("Elbow Method - Optimal k")
plt.show()
[Elbow plot: Inertia falling from roughly 1100 to 300 as the Number of Clusters increases from 1 to 9]
After evaluating the cluster results for different k values, the model with k = 4 provided the most distinct and interpretable clusters. The selection is based on intra-cluster distance, cluster sizes, and meaningful group differentiation.
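The "most distinct clusters" criterion can be made quantitative with the silhouette score, which averages how much closer each point is to its own cluster than to the nearest other cluster (higher is better, range −1 to 1). A sketch on synthetic blobs standing in for the scaled housing features, since the actual matrix is not reproduced here:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)
# Two well-separated synthetic blobs as a stand-in for scaled_data
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])

for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
# For two true blobs, k=2 scores highest; splitting a blob lowers the score
```

Running the same loop on `scaled_data` for k = 2 through 9 would complement the elbow curve when justifying the choice of k = 4.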
The cluster analysis identified four main groups of houses with distinct
characteristics and sale price profiles. The findings show that overall
quality, size, and year built are strong indicators of pricing. The client
is advised to use these insights to tailor pricing strategies for different
property types. Implementing these findings can enhance market segmentation
and improve sales performance.