Housing Main
In [ ]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # needed by the plotting cells below
In [ ]: housing_df = pd.read_csv('housing_data.csv')
print(housing_df.head())
Miles (dist. between school and house) Rent Price per Month Sell Price
0 463 7401 74446632
1 210 9259 76199794
2 157 16469 16249579
3 133 18096 24291317
4 10 9923 50273384
Next, we check the data type of each column and whether the dataset has any missing values.
In [ ]: print(housing_df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Area 1000 non-null int64
1 No. of Rooms 1000 non-null int64
2 No. of Bathrooms 1000 non-null int64
3 Location 1000 non-null object
4 Miles (dist. between school and house) 1000 non-null int64
5 Rent Price per Month 1000 non-null int64
6 Sell Price 1000 non-null int64
dtypes: int64(6), object(1)
memory usage: 54.8+ KB
None
In [ ]: print(housing_df.describe())
Area No. of Rooms No. of Bathrooms \
count 1000.000000 1000.000000 1000.0
mean 1763.241000 1.974000 1.0
std 704.717323 0.814855 0.0
min 501.000000 1.000000 1.0
25% 1170.000000 1.000000 1.0
50% 1753.000000 2.000000 1.0
75% 2366.250000 3.000000 1.0
max 2997.000000 3.000000 1.0
Miles (dist. between school and house) Rent Price per Month \
count 1000.000000 1000.000000
mean 255.405000 13133.528000
std 142.346449 4106.514878
min 10.000000 6018.000000
25% 133.000000 9600.250000
50% 259.500000 13210.000000
75% 378.250000 16844.750000
max 498.000000 19993.000000
Sell Price
count 1.000000e+03
mean 4.207750e+07
std 2.164932e+07
min 6.113936e+06
25% 2.343184e+07
50% 4.284373e+07
75% 6.118787e+07
max 7.998578e+07
Area 0
No. of Rooms 0
No. of Bathrooms 0
Location 0
Miles (dist. between school and house) 0
Rent Price per Month 0
Sell Price 0
dtype: int64
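The per-column zeros above are the kind of result a null count produces. The cell that generated them is not shown, so the following is only a minimal sketch, assuming the check was done with `isna().sum()`. The frame `toy_df` and its values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy frame standing in for housing_df (illustrative values, not from the dataset)
toy_df = pd.DataFrame({
    "Area": [1200, 1800, None],
    "Location": ["Suburb", None, "Rural"],
    "Sell Price": [100, 200, 300],
})

# Count missing values per column; a column of all zeros means nothing is missing
missing_per_column = toy_df.isna().sum()
print(missing_per_column)

# Collapse the check into a single yes/no answer for the whole frame
has_missing = bool(toy_df.isna().any().any())
print(has_missing)
```

On the real data, the same two lines applied to `housing_df` would be expected to report zero for every column, matching the output above.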
Having confirmed that there are no missing values, we proceed to data analysis, using matplotlib.pyplot and Seaborn for data
visualization.
To begin with, we focus on the columns No. of Rooms , No. of Bathrooms , and Location . The pie charts show that
every residential property has between 1 and 3 rooms and exactly 1 bathroom, and that the properties are evenly distributed
among the three locations: city center, suburb, and rural area.
# title
ax[0].set_title('No. of Rooms', fontsize=12, fontweight='bold', color='navy')
# title
ax[1].set_title('No. of Bathrooms', fontsize=12, fontweight='bold', color='navy')
# title
ax[2].set_title('Location', fontsize=12, fontweight='bold', color='navy')
plt.show()
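Only the title calls of the pie-chart cell survive above. The sketch below shows one way the three pies could be drawn, assuming each slice size comes from `value_counts()` on the column. The frame `toy_df` is illustrative, and the `Agg` backend and `plt.close` are used only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Toy data standing in for housing_df (illustrative values only)
toy_df = pd.DataFrame({
    "No. of Rooms": [1, 2, 3, 1, 2, 3],
    "No. of Bathrooms": [1, 1, 1, 1, 1, 1],
    "Location": ["City Center", "Suburb", "Rural",
                 "City Center", "Suburb", "Rural"],
})

columns = ["No. of Rooms", "No. of Bathrooms", "Location"]
fig, ax = plt.subplots(1, 3, figsize=(15, 5))

counts_by_column = {}
for axis, column in zip(ax, columns):
    # value_counts gives the size of each slice; sort_index keeps the order stable
    counts = toy_df[column].value_counts().sort_index()
    counts_by_column[column] = counts
    axis.pie(counts.values, labels=counts.index.astype(str), autopct="%1.1f%%")
    axis.set_title(column, fontsize=12, fontweight="bold", color="navy")

plt.close(fig)  # in a notebook, plt.show() would be used instead
```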
Next, we examine the distribution of the column Miles (dist. between school and house) with a histogram and find that the
distances between the residential properties and the school are evenly distributed between 0 and 500 miles. A similar pattern is
observed in the column Rent Price per Month , whose values are evenly distributed between 6000 and 20000
dollars per month.
In [ ]: # draw a bar chart for the miles from the school
# from 0-50, 50-100, 100-150, 150-200, 200-250, 250-300, 300-350, 350-400, 400-450, 450-500
fig, ax = plt.subplots(figsize=(15, 5))
# title
ax.set_title('Miles from Schools', fontsize=18, fontweight='bold', color='navy')
plt.show()
In [ ]: # draw a bar chart for the rent price
# from 6000-7000, 7000-8000, 8000-9000, 9000-10000, 10000-11000, 11000-12000, 12000-13000, 13000-14000, 14000-15000,
keys = rent_categories.keys().tolist()
values = rent_categories.tolist()
# title
ax.set_title('Rent Price per Month', fontsize=12, fontweight='bold', color='navy')
plt.show()
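The cell above uses `rent_categories` without showing how it was built. The sketch below is a hypothetical reconstruction, assuming the rents were grouped into 1000-dollar bins with `pd.cut` and then counted. The toy rents, `bin_edges`, and `bin_labels` are illustrative names and values, not the original ones.

```python
import pandas as pd

# Toy rents standing in for housing_df['Rent Price per Month'] (illustrative values)
rent = pd.Series([6018, 6500, 7400, 9923, 13210, 19993],
                 name="Rent Price per Month")

# Bin edges 6000, 7000, ..., 20000; right=False makes each bin [low, high)
bin_edges = list(range(6000, 21000, 1000))
bin_labels = [f"{low}-{low + 1000}" for low in bin_edges[:-1]]

# Hypothetical reconstruction of rent_categories: one count per bin, in bin order
rent_categories = (
    pd.cut(rent, bins=bin_edges, labels=bin_labels, right=False)
    .value_counts()
    .sort_index()
)

# Same extraction as in the cell above, ready to pass to ax.bar(keys, values)
keys = rent_categories.keys().tolist()
values = rent_categories.tolist()
print(list(zip(keys, values)))
```

The same approach with 50-mile edges would produce the bins listed in the comment of the distance chart.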
To further analyze the relationship between the rent price and the distance from the school, we use a scatter plot to
observe the correlation between the two variables, coloring the points to differentiate the three
locations: city center, suburb, and rural area. Note that rows whose No. of Rooms is not equal to 1 are filtered out, so that
the number of rooms does not influence the rent price.
As shown below, there is no strong correlation between the rent price and the distance from the school, and the rent prices are
evenly distributed among the three locations. A similar conclusion can be drawn from the scatter
plot of the selling price against the distance from the school.
In [ ]: # draw a scatter plot between the rent price and the miles from the school
color_map = {
"City Center": "lightcoral",
"Suburb": "lightgreen",
"Rural": "lightblue"
}
# ax.scatter(one_room['Miles (dist. between school and house)'], one_room['Rent Price per Month'], color='lightcoral
scatter = ax.scatter(one_room['Miles (dist. between school and house)'],
one_room['Rent Price per Month'],
c=one_room["Location"].map(color_map))
# Add legend
ax.legend(handles=legend_handles, title="Location",bbox_to_anchor=(1.01, 1), loc='upper left')
plt.show()
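The scatter cell above relies on `one_room` and `legend_handles`, neither of which is defined in the surviving code. The sketch below shows one plausible way to build both, assuming `one_room` is the subset with exactly one room and the legend entries are `matplotlib.patches.Patch` objects created from `color_map`. The frame `toy_df` is illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib.patches import Patch
import pandas as pd

# Toy data standing in for housing_df (illustrative values only)
toy_df = pd.DataFrame({
    "No. of Rooms": [1, 1, 2, 1],
    "Location": ["City Center", "Suburb", "Rural", "Rural"],
    "Miles (dist. between school and house)": [20, 150, 240, 410],
    "Rent Price per Month": [9000, 17000, 8000, 12000],
})

color_map = {
    "City Center": "lightcoral",
    "Suburb": "lightgreen",
    "Rural": "lightblue",
}

# Keep only one-room properties so room count does not affect the comparison
one_room = toy_df[toy_df["No. of Rooms"] == 1]

# One legend entry per location, colored to match the points
legend_handles = [Patch(color=color, label=location)
                  for location, color in color_map.items()]

fig, ax = plt.subplots(figsize=(10, 5))
ax.scatter(one_room["Miles (dist. between school and house)"],
           one_room["Rent Price per Month"],
           c=one_room["Location"].map(color_map).tolist())
ax.legend(handles=legend_handles, title="Location",
          bbox_to_anchor=(1.01, 1), loc="upper left")
plt.close(fig)  # in a notebook, plt.show() would be used instead
```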
In [ ]: # draw a scatter plot between the sell price and the miles from the school
color_map = {
"City Center": "lightcoral",
"Suburb": "lightgreen",
"Rural": "lightblue"
}
# ax.scatter(one_room['Miles (dist. between school and house)'], one_room['Rent Price per Month'], color='lightcoral
scatter = ax.scatter(one_room['Miles (dist. between school and house)'],
one_room['Sell Price'],
c=one_room["Location"].map(color_map))
# Add legend
ax.legend(handles=legend_handles, title="Location", bbox_to_anchor=(1.01, 1), loc='upper left')
plt.show()
In [ ]: # show the relationship between the sell price and the location
# as the type of box plots
plt.show()
To quantify the relationship between each pair of columns in the dataset, we create a heatmap of their correlation
coefficients. Surprisingly, all pairs of features appear to be essentially uncorrelated, as the correlation coefficients are all close to 0.
Given these findings, I would not take any further steps upon receiving this dataset before commencing data
analysis.
In [ ]: # draw a heatmap for the correlation matrix
import seaborn as sns
plt.show()
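The heatmap cell above does not show how the correlation matrix was computed. The sketch below is a minimal version, assuming Pearson correlation over the numeric columns only (the text column Location has to be left out or encoded first). The frame `toy_df` is illustrative.

```python
import pandas as pd

# Toy data standing in for housing_df (illustrative values only)
toy_df = pd.DataFrame({
    "Area": [501, 1170, 1753, 2366, 2997],
    "Rent Price per Month": [13210, 6018, 19993, 9600, 16845],
    "Sell Price": [50, 20, 80, 10, 60],
    "Location": ["Rural", "Suburb", "City Center", "Suburb", "Rural"],
})

# Drop non-numeric columns, then compute pairwise Pearson correlations
numeric_df = toy_df.select_dtypes(include="number")
corr = numeric_df.corr()
print(corr.round(2))

# The matrix could then be drawn with seaborn, for example:
# sns.heatmap(corr, annot=True)
```

Off-diagonal values near 0 in this matrix indicate the absence of a linear relationship, which is weaker than full independence.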
**Q2. If you are to inquire about Q1 from ChatGPT or Bing, what responses will you receive? Do you find them reasonable? If not, how will you rectify it?**
Below is the response of ChatGPT regarding steps to perform when conducting a data analysis:
Before commencing data analysis on the housing dataset, it's essential to perform several steps to ensure data quality and
understand the characteristics of the dataset. Here are the steps I would take:
1. Data Cleaning:
   - Check for missing values: Look for any missing or null values in the dataset and decide on how to handle them. Depending on the amount of missing data, options include imputation, deletion, or leaving them as is if they are negligible.
   - Check for duplicates: Identify and remove any duplicate entries in the dataset, if present.
   - Check for outliers: Examine the distribution of each numerical variable to identify any outliers that may skew the analysis. Decide on whether to remove or transform outliers based on domain knowledge and statistical analysis.
2. Data Exploration:
   - Summary statistics: Calculate descriptive statistics such as mean, median, standard deviation, minimum, and maximum values for numerical variables. This provides an initial understanding of the data distribution.
   - Visualizations: Create visualizations such as histograms, box plots, or scatter plots to explore the distribution and relationships between variables. For example, scatter plots can help visualize the relationship between sell price and other variables like area, number of rooms, and number of bathrooms.
3. Feature Engineering:
   - Create new features if necessary: Derive new features that might be useful for analysis or modeling. For example, calculate the price per square foot/meter by dividing sell price by the area.
   - Encode categorical variables: If there are categorical variables like 'Location', encode them into numerical values using techniques such as one-hot encoding or label encoding.
4. Data Transformation:
   - Normalize or standardize numerical variables if needed: Depending on the algorithm chosen for analysis, scaling numerical variables might be necessary to ensure all variables contribute equally to the analysis.
   - Transform skewed variables: If any numerical variables are heavily skewed, consider applying transformations such as logarithmic or square root transformations to make the distribution more symmetrical.
5. Data Splitting:
   - If the dataset is intended for predictive modeling, split the data into training and testing sets to evaluate the performance of the model accurately.
6. Documentation:
   - Document any changes made to the dataset during the cleaning, exploration, and feature engineering stages. This documentation helps maintain transparency and reproducibility of the analysis.
By following these steps, we can ensure that the dataset is well-prepared for analysis, leading to more accurate insights and reliable
modeling results.
I consider ChatGPT's response reasonable, as it covers the essential steps required to prepare a
dataset for analysis. However, the response is generic and covers a broader range of data analysis tasks than this dataset calls for. For the specific
housing dataset provided, the steps I outlined in my initial response are sufficient. If I were to rectify it, I would
provide a more concise and focused response tailored to the housing dataset.
**Q3. If you are restricted to renting a house, which one or ones will you select, and why?**
When renting a house, I would weigh several factors: location, rent price, distance from the school, and the
number of rooms and bathrooms. Based on the dataset, I would apply the following criteria:
1. Location: I would prefer a house located in the city center due to the convenience of access to amenities, public
transportation, and proximity to schools and workplaces.
2. Miles from School: I would prefer a house that is 10 miles or less from the school to minimize commuting time and
transportation costs.
3. Rent Price per Month: I would select the house with minimal rent price per month, as it would be more cost-effective and
allow for better budget management.
4. Number of Rooms and Bathrooms: I would prefer the house with as many rooms and bathrooms as needed for my family
size and lifestyle.
In [ ]: # draw a scatter plot between the rent price and the miles from the school
color_map = {
3: "lightcoral",
2: "lightgreen",
1: "lightblue"
}
# ax.scatter(one_room['Miles (dist. between school and house)'], one_room['Rent Price per Month'], color='lightcoral
scatter = ax.scatter(housing_new['Miles (dist. between school and house)'],
housing_new['Rent Price per Month'],
c=housing_new["No. of Rooms"].map(color_map))
# Add legend
ax.legend(handles=legend_handles, title="No. of Rooms",bbox_to_anchor=(1.01, 1), loc='upper left')
plt.show()
After weighing the tradeoff between the rent price and the distance from the school, I chose the gold-colored point in
the scatter plot, with a rent of around 9000 dollars per month and a distance of around 19 miles from the school. This property is located in the
city center and has 2 rooms and 1 bathroom, which suits me best.
In [ ]: housing_new[(housing_new["Rent Price per Month"] <= 10000) & (housing_new["Miles (dist. between school and house)"] <
Out[ ]:
     Area  No. of Rooms  No. of Bathrooms  Location     Miles (dist. between school and house)  Rent Price per Month  Sell Price
785  2041  2             1                 City Center  19                                      8912                  27709264
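The filter expression above is cut off, so its exact distance threshold is unknown. The sketch below shows the general shape of such a selection under stated assumptions: the rent cap of 10000 comes from the visible part of the cell, while `MAX_MILES = 20` and the city-center condition are hypothetical choices made here for illustration. Row 785 uses the values from the output above, and the other rows of `toy_df` are made up.

```python
import pandas as pd

# Row 785 matches the output above; the other rows are illustrative
toy_df = pd.DataFrame(
    {
        "Location": ["City Center", "Suburb", "City Center", "Rural"],
        "Miles (dist. between school and house)": [19, 15, 300, 12],
        "Rent Price per Month": [8912, 12000, 7000, 9500],
    },
    index=[785, 1, 2, 3],
)

MAX_RENT = 10000   # visible in the truncated cell
MAX_MILES = 20     # assumption: the original threshold is cut off

# Combine the conditions with &; each one needs its own parentheses
candidates = toy_df[
    (toy_df["Rent Price per Month"] <= MAX_RENT)
    & (toy_df["Miles (dist. between school and house)"] < MAX_MILES)
    & (toy_df["Location"] == "City Center")
].sort_values("Rent Price per Month")

print(candidates)
```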
To evaluate the decision between renting and purchasing a house, I create a new column Sell Rent Ratio by dividing the Sell
Price by the annual rent, that is, the Rent Price per Month multiplied by 12. The Sell Rent Ratio represents the number of years it would
take to pay off the house if the rent were put toward its purchase. A lower Sell Rent Ratio indicates a better investment
opportunity.
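As a worked example of the ratio, the sketch below applies it to the property with index 785 from Q3, assuming the ratio is the sell price divided by the annual rent (monthly rent times 12), which is what the "number of years" reading requires. The original cell that created the column is not shown.

```python
import pandas as pd

# Rent and sell price of property 785, taken from the Q3 output
row = pd.DataFrame(
    {"Rent Price per Month": [8912], "Sell Price": [27709264]},
    index=[785],
)

# Annual rent is the monthly rent times 12
annual_rent = row["Rent Price per Month"] * 12

# Years of rent needed to match the sell price
row["Sell Rent Ratio"] = row["Sell Price"] / annual_rent
print(row["Sell Rent Ratio"].round(1))
```

For this property the ratio comes out at roughly 259 years, which illustrates how high sell prices in this dataset are relative to rents.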
After calculating the Sell Rent Ratio , I filtered the data with the criteria below:
These criteria were chosen because I would prefer to purchase a house that is a good investment opportunity, is located in the
suburb, and is close to the school.
# draw a scatter plot between the sell rent ratio and the miles from the school
color_map = {
3: "lightcoral",
2: "lightgreen",
1: "lightblue"
}
# ax.scatter(one_room['Miles (dist. between school and house)'], one_room['Rent Price per Month'], color='lightcoral
scatter = ax.scatter(housing_new['Miles (dist. between school and house)'],
housing_new['Sell Rent Ratio'],
c=housing_new["No. of Rooms"].map(color_map))
# Add legend
ax.legend(handles=legend_handles, title="No. of Rooms",bbox_to_anchor=(1.01, 1), loc='upper left')
plt.show()
In [ ]: # print the house with index 928
housing_df.iloc[928]
**Q5. Are there any properties with rent or selling prices that
As the box plots below show, neither the rent price nor the selling price has any outliers: no data points fall
below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR.
In [ ]: # show the relationship between the sell price and the location
# as the type of box plots
plt.show()
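The box-plot rule can also be checked numerically. The sketch below applies the Q1 - 1.5 * IQR and Q3 + 1.5 * IQR bounds directly. The helper `iqr_outliers` and the two toy series are hypothetical and only illustrate the rule, since the original check was done visually.

```python
import pandas as pd

def iqr_outliers(values: pd.Series) -> pd.Series:
    """Return the entries lying outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return values[(values < lower) | (values > upper)]

# Toy rents (illustrative): an evenly spread series has no outliers
toy_rent = pd.Series([6000, 9000, 13000, 17000, 20000])
print(iqr_outliers(toy_rent).tolist())

# Adding one extreme value makes the rule flag it
toy_with_outlier = pd.Series([6000, 9000, 13000, 17000, 20000, 90000])
print(iqr_outliers(toy_with_outlier).tolist())
```

Running the helper on the rent and sell price columns of `housing_df` would be expected to return empty results if the reading of the box plots is correct.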