ML Lab

This project analyzes the AMESHOUSING3 dataset to identify clusters of houses based on their features and sale prices using clustering techniques. The analysis revealed four distinct groups of houses, with overall quality, size, and year built being significant indicators of pricing. The client is encouraged to use these insights for tailored pricing strategies to enhance market segmentation and sales performance.
INTRODUCTION

The real estate market often includes houses with widely varying characteristics, making it challenging to analyze patterns across sales. The client seeks to identify natural groupings within the AMESHOUSING3 dataset to better understand the types of houses sold. This project applies clustering techniques to housing feature data to identify meaningful clusters and relate them to house sale prices. Understanding these clusters can support pricing, marketing, and investment decisions in the real estate domain.

```python
# Load the dataset and perform initial data cleaning and exploration.
import pandas as pd

df = pd.read_csv("ameshousing3.csv")
df.head()
```

(Output: the first five rows of the 31-column dataset, including Obs, PID, Lot_Area, House_Style, Overall_Qual, Overall_Cond, Year_Built, and Heating.)

```python
# Check for missing values, then drop incomplete rows.
df.isnull().sum().sort_values(ascending=False).head(10)
df = df.dropna()
```

```python
df.describe()
```

(Output: summary statistics for the 266 remaining observations, covering Fireplaces, Garage_Area, Basement_Area, Full_Bathroom, Half_Bathroom, and the other numeric columns.)

Next, select relevant numeric variables based on their significance.
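The scaling step that follows uses StandardScaler, which rescales each feature to zero mean and unit variance. As a minimal illustration (plain Python, not part of the original notebook), the z-score it computes per column looks like:

```python
def standardize(column):
    """z-score a column: subtract the mean, divide by the standard deviation."""
    n = len(column)
    mean = sum(column) / n
    # Population standard deviation, matching StandardScaler's default.
    std = (sum((x - mean) ** 2 for x in column) / n) ** 0.5
    return [(x - mean) / std for x in column]

standardize([2.0, 4.0, 6.0])  # mean 4.0, std ~1.633 -> roughly [-1.22, 0.0, 1.22]
```

Standardizing matters here because Lot_Area (thousands of square feet) would otherwise dominate the Euclidean distances that k-means uses, drowning out Overall_Qual (a 1-10 scale).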
```python
# Select the relevant numeric features and the target variable.
features = ['Lot_Area', 'Overall_Qual', 'Year_Built', 'Gr_Liv_Area']
df_selected = df[features + ['SalePrice']].copy()  # .copy() avoids a SettingWithCopyWarning later

# Normalize the selected features for clustering.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_data = scaler.fit_transform(df_selected[features])
```

```python
# Export the scaled data to CSV for use in WEKA.
df_scaled = pd.DataFrame(scaled_data, columns=features)
df_scaled.to_csv("housing_scaled.csv", index=False)
# Manually convert this CSV to ARFF using WEKA's CSVLoader.
```

Load the ARFF file into WEKA and apply the k-means clustering algorithm.

```python
# Visualize the distribution of a key feature.
import seaborn as sns
import matplotlib.pyplot as plt

sns.histplot(df['Gr_Liv_Area'], kde=True)
plt.title("Distribution of Above Ground Living Area")
plt.show()
```

(Output: histogram of Gr_Liv_Area with a kernel density overlay.)

Cluster Analysis Process
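The CSV-to-ARFF conversion above is done manually through WEKA's CSVLoader. For reference, the ARFF text format is simple enough that a minimal writer can be sketched in plain Python (a sketch only: it assumes every column is numeric, and the function name `csv_to_arff` is illustrative, not part of the notebook):

```python
import csv

def csv_to_arff(csv_path, arff_path, relation="housing_scaled"):
    """Write a minimal ARFF file, declaring every CSV column as numeric."""
    with open(csv_path, newline="") as f:
        rows = list(csv.reader(f))
    header, data = rows[0], rows[1:]
    with open(arff_path, "w") as out:
        out.write(f"@relation {relation}\n\n")
        for name in header:
            out.write(f"@attribute {name} numeric\n")
        out.write("\n@data\n")
        for row in data:
            out.write(",".join(row) + "\n")
```

WEKA's CSVLoader additionally infers nominal attributes and handles quoting, which this sketch does not attempt.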
```python
from sklearn.cluster import KMeans

# Choose the number of clusters (e.g., k=4).
# n_init is set explicitly to suppress sklearn's FutureWarning about its changing default.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
clusters = kmeans.fit_predict(scaled_data)

# Add the cluster labels back to the dataframe.
df_selected['Cluster'] = clusters

# See the average values for each feature within each cluster.
print(df_selected.groupby('Cluster').mean())
```

Per-cluster means (values rounded to two decimals):

```
Cluster    Lot_Area  Overall_Qual  Year_Built  Gr_Liv_Area  SalePrice
0           7381.98          5.80     1929.15      1285.08  126851.35
1           7844.55          4.62     1953.67       924.86  115198.92
2          11530.17          5.56     1967.13      1249.24  157602.14
3           6001.30          6.76     1995.71      1251.19  171197.32
```

```python
# Compare sale price distributions across clusters.
sns.boxplot(x='Cluster', y='SalePrice', data=df_selected)
plt.title("Sale Price Distribution by Cluster")
plt.show()
```

(Output: boxplots of SalePrice for each of the four clusters.)

```python
# Use the elbow method to evaluate candidate values of k.
inertia = []
k_range = range(1, 10)
for k in k_range:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(scaled_data)
    inertia.append(km.inertia_)

# Plot the elbow curve.
plt.plot(k_range, inertia, marker='o')
plt.xlabel("Number of Clusters")
plt.ylabel("Inertia")
plt.title("Elbow Method - Optimal k")
plt.show()
```
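`fit_predict` labels each observation with the index of its nearest centroid in the scaled feature space. A minimal sketch of that assignment step, using hypothetical 2-D points and centroids and Euclidean distance (not the notebook's actual data):

```python
import math

def assign_clusters(points, centroids):
    """Assign each point the index of its nearest centroid (Euclidean distance)."""
    labels = []
    for p in points:
        dists = [math.dist(p, c) for c in centroids]
        labels.append(dists.index(min(dists)))
    return labels

points = [(0.0, 0.1), (0.9, 1.0), (0.1, -0.1)]
centroids = [(0.0, 0.0), (1.0, 1.0)]
assign_clusters(points, centroids)  # -> [0, 1, 0]
```

The full algorithm alternates this assignment step with recomputing each centroid as the mean of its assigned points until the labels stop changing.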
(Output: elbow curve; inertia decreases from about 1100 at k=1 to about 300 at k=9.)

After evaluating the cluster results for different values of k, the model with k = 4 provided the most distinct and interpretable clusters. The selection is based on intra-cluster distance, cluster sizes, and meaningful group differentiation.

The cluster analysis identified four main groups of houses with distinct characteristics and sale price profiles. The findings show that overall quality, size, and year built are strong indicators of pricing. The client is advised to use these insights to tailor pricing strategies for different property types. Implementing these findings can enhance market segmentation and improve sales performance.
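The inertia plotted above is the within-cluster sum of squared distances from each point to its assigned centroid, which is the quantity k-means minimizes and which sklearn exposes as `km.inertia_`. A minimal sketch of the computation with hypothetical 1-D data:

```python
def inertia(points, labels, centroids):
    """Within-cluster sum of squared Euclidean distances to the centroids."""
    total = 0.0
    for p, lab in zip(points, labels):
        c = centroids[lab]
        total += sum((pi - ci) ** 2 for pi, ci in zip(p, c))
    return total

points = [(0.0,), (2.0,), (10.0,), (12.0,)]
labels = [0, 0, 1, 1]
centroids = [(1.0,), (11.0,)]
inertia(points, labels, centroids)  # 1 + 1 + 1 + 1 -> 4.0
```

Inertia always decreases as k grows (more centroids can only sit closer to the data), which is why the elbow method looks for the k where the rate of decrease flattens rather than for a minimum.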
