Housing Main

The document discusses analyzing a housing dataset using Python. It shows importing the dataset, checking for missing values and data types, then creating pie charts and histograms to visualize the distribution of rooms, bathrooms, locations, miles from school, and rent prices. Scatter plots are used to examine correlations between rent price and distance from school and other variables while differentiating locations.

Homework 1

B10702053, Accounting (3rd year), 黃少凱


1. Housing Dataset
**Q1. What steps will you take upon receiving this dataset before commencing data analysis?**

First, import the dataset as a pandas DataFrame.

In [ ]: import pandas as pd
import numpy as np

In [ ]: housing_df = pd.read_csv('housing_data.csv')
print(housing_df.head())

   Area  No. of Rooms  No. of Bathrooms  Location  \
0  1360             1                 1     Rural
1  1794             3                 1    Suburb
2  1630             2                 1    Suburb
3  1595             1                 1    Suburb
4  2138             1                 1    Suburb

   Miles (dist. between school and house)  Rent Price per Month  Sell Price
0                                     463                  7401    74446632
1                                     210                  9259    76199794
2                                     157                 16469    16249579
3                                     133                 18096    24291317
4                                      10                  9923    50273384

Next, check for types of data in the dataset, and see if there are any missing values.

In [ ]: print(housing_df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Area 1000 non-null int64
1 No. of Rooms 1000 non-null int64
2 No. of Bathrooms 1000 non-null int64
3 Location 1000 non-null object
4 Miles (dist. between school and house) 1000 non-null int64
5 Rent Price per Month 1000 non-null int64
6 Sell Price 1000 non-null int64
dtypes: int64(6), object(1)
memory usage: 54.8+ KB
None

In [ ]: print(housing_df.describe())
Area No. of Rooms No. of Bathrooms \
count 1000.000000 1000.000000 1000.0
mean 1763.241000 1.974000 1.0
std 704.717323 0.814855 0.0
min 501.000000 1.000000 1.0
25% 1170.000000 1.000000 1.0
50% 1753.000000 2.000000 1.0
75% 2366.250000 3.000000 1.0
max 2997.000000 3.000000 1.0

Miles (dist. between school and house) Rent Price per Month \
count 1000.000000 1000.000000
mean 255.405000 13133.528000
std 142.346449 4106.514878
min 10.000000 6018.000000
25% 133.000000 9600.250000
50% 259.500000 13210.000000
75% 378.250000 16844.750000
max 498.000000 19993.000000

Sell Price
count 1.000000e+03
mean 4.207750e+07
std 2.164932e+07
min 6.113936e+06
25% 2.343184e+07
50% 4.284373e+07
75% 6.118787e+07
max 7.998578e+07

In [ ]: # look at any missing values


print(housing_df.isnull().sum())

Area 0
No. of Rooms 0
No. of Bathrooms 0
Location 0
Miles (dist. between school and house) 0
Rent Price per Month 0
Sell Price 0
dtype: int64
After confirming that there are no missing values, we can proceed to data analysis, using matplotlib.pyplot and seaborn for data visualization.

To begin with, we focus on the columns No. of Rooms , No. of Bathrooms , and Location . With pie charts, we observe that every residential property has between 1 and 3 rooms and exactly 1 bathroom, and that the properties are split evenly across the room counts and across the three locations: city center, suburb, and rural area.
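One way to check the "evenly distributed" reading numerically is a cross-tabulation of room count by location. The sketch below is only an illustration: `toy_df` is a made-up frame (not the homework data) that reuses the dataset's column names, and with the real data `housing_df` would take its place.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the housing dataset.
toy_df = pd.DataFrame({
    "No. of Rooms": [1, 2, 3, 1, 2, 3],
    "Location": ["Rural", "Rural", "Suburb", "Suburb",
                 "City Center", "City Center"],
})

# Row-normalized shares: for each location, the fraction of
# properties that have each room count.
room_share = pd.crosstab(toy_df["Location"], toy_df["No. of Rooms"],
                         normalize="index")
print(room_share)
```

If the room counts really are spread evenly within each location, every row of `room_share` should be close to one third in each column.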

In [ ]: # draw a pie chart for the number of rooms (1 to 3)


import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 3, figsize=(15, 5))

rooms = housing_df['No. of Rooms']


keys = rooms.value_counts().keys().tolist()
keys = [f"{i} rooms" for i in keys]
values = rooms.value_counts().tolist()

custom_colors = ['lightgreen','lightskyblue','lightcoral', 'gold']

ax[0].pie(values, labels=['']*len(keys), autopct='%1.1f%%',


pctdistance=0.8, textprops={'color':'w', 'weight':'bold', 'size':10},
wedgeprops=dict(width=0.4, edgecolor='w'), colors=custom_colors)

# show the legends


ax[0].legend(keys, loc='lower left')

# title
ax[0].set_title('No. of Rooms', fontsize=12
, fontweight='bold', color='navy')

bathrooms = housing_df['No. of Bathrooms']


keys = bathrooms.value_counts().keys().tolist()
keys = [f"{i} bathrooms" for i in keys]
values = bathrooms.value_counts().tolist()

ax[1].pie(values, labels=['']*len(keys), autopct='%1.1f%%',


pctdistance=0.8, textprops={'color':'w', 'weight':'bold', 'size':10},
wedgeprops=dict(width=0.4, edgecolor='w'), colors=custom_colors)

# show the legends


ax[1].legend(keys, loc='lower left')

# title
ax[1].set_title('No. of Bathrooms', fontsize=12
, fontweight='bold', color='navy')

# show the categories of location


location = housing_df['Location']
keys = location.value_counts().keys().tolist()
values = location.value_counts().tolist()

ax[2].pie(values, labels=['']*len(keys), autopct='%1.1f%%',


pctdistance=0.8, textprops={'color':'w', 'weight':'bold', 'size':10},
wedgeprops=dict(width=0.4, edgecolor='w'), colors=custom_colors)

# show the legends


ax[2].legend(keys, loc='lower left')

# title
ax[2].set_title('Location', fontsize=12
, fontweight='bold', color='navy')

plt.show()
For further analysis, we examine the distribution of the column Miles (dist. between school and house) with a histogram, and find that the distances between the residential properties and the school are spread evenly between 0 and 500 miles. A similar pattern is observed in the column Rent Price per Month : the rent prices are spread evenly between 6000 and 20000 dollars per month.
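The binning behind such a chart can be sketched with `pd.cut` and evenly spaced edges. This is a minimal sketch under stated assumptions: the `rent` values are a hypothetical sample, and the edges 6000, 7000, ..., 20000 are assumed from the range described in the text rather than taken from the notebook's own bin list.

```python
import numpy as np
import pandas as pd

# Hypothetical rent values; the real series would be
# housing_df['Rent Price per Month'].
rent = pd.Series([6018, 7401, 9259, 9923, 13210, 16469, 18096, 19993])

# Assumed bin edges: 6000, 7000, ..., 20000 (14 bins of width 1000).
edges = np.arange(6000, 21000, 1000)

# Count the values in each bin and keep the bins in ascending order.
counts = pd.cut(rent, bins=edges).value_counts().sort_index()

# Short labels such as "$6K-$7K" for the x axis of a bar chart.
labels = [f"${int(iv.left / 1000)}K-${int(iv.right / 1000)}K"
          for iv in counts.index]
print(dict(zip(labels, counts.tolist())))
```

Note that `pd.cut` builds right-closed intervals by default, so a value exactly equal to the lowest edge would fall outside every bin unless `include_lowest=True` is passed.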

In [ ]: # draw a bar chart for the miles from the school
# from 0-50, 50-100, 100-150, 150-200, 200-250, 250-300, 300-350, 350-400, 400-450, 450-500
fig, ax = plt.subplots(figsize=(15, 5))

miles = housing_df['Miles (dist. between school and house)']


miles_categories = pd.cut(miles, bins=[0, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500]).value_counts().sort_index()
keys = miles_categories.keys().tolist()
values = miles_categories.tolist()

keys = [f"{i.left}-{i.right} miles" for i in keys]

ax.bar(keys, values, width=0.4, color=plt.cm.Set3(np.arange(len(keys))))

# get rid of the top and right spines


ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

# title
ax.set_title('Miles from Schools', fontsize=18
, fontweight='bold', color='navy')

plt.show()
In [ ]: # draw a bar chart for the rent price
# from 6000-7000, 7000-8000, 8000-9000, 9000-10000, 10000-11000, 11000-12000, 12000-13000, 13000-14000, 14000-15000,

fig, ax = plt.subplots(figsize=(20, 5))

rent = housing_df['Rent Price per Month']


rent_categories = pd.cut(rent, bins=[6000, 7000, 8000, 9000, 10000, 11000, 12000, 13000, 14000, 15000, 16000, 17000,

keys = rent_categories.keys().tolist()
values = rent_categories.tolist()

keys = [f"${int(i.left / 1000)}K-${int(i.right / 1000)}K" for i in keys]

ax.bar(keys, values, width=0.4, color=plt.cm.tab20c(np.arange(len(keys))))

# get rid of the top and right spines


ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

# title
ax.set_title('Rent Price per Month', fontsize=12
, fontweight='bold', color='navy')
plt.show()

To further analyze the relationship between the rent price and the distance from the school, we can use a scatter plot to observe the correlation between the two variables, coloring the points to differentiate the three locations: city center, suburb, and rural area. Note that rows whose No. of Rooms is not equal to 1 are filtered out, so that the number of rooms does not influence the rent price.

As shown below, there is no strong correlation between the rent price and the distance from the school, and the rent prices are spread evenly across the three locations. A similar conclusion can be drawn from the scatter plot of the selling price against the distance from the school.

In [ ]: # draw a scatter plot between the rent price and the miles from the school
color_map = {
"City Center": "lightcoral",
"Suburb": "lightgreen",
"Rural": "lightblue"
}

fig, ax = plt.subplots(figsize=(10, 5))


one_room = housing_df[(housing_df['No. of Rooms'] == 1) &
(housing_df['No. of Bathrooms'] == 1)]

# ax.scatter(one_room['Miles (dist. between school and house)'], one_room['Rent Price per Month'], color='lightcoral
scatter = ax.scatter(one_room['Miles (dist. between school and house)'],
one_room['Rent Price per Month'],
c=one_room["Location"].map(color_map))

# show the legends


# Create legend handles and labels
legend_handles = [plt.Line2D([0], [0], marker='o', color='w',
markerfacecolor=color, label=label)
for label, color in color_map.items()]

# Add legend
ax.legend(handles=legend_handles, title="Location",bbox_to_anchor=(1.01, 1), loc='upper left')

# Title and labels


ax.set_title('Rent Price per Month vs. Miles from Schools', fontsize=12, fontweight='bold', color='navy')
ax.set_xlabel('Miles from Schools')
ax.set_ylabel('Rent Price per Month')

plt.show()
In [ ]: # draw a scatter plot between the sell price and the miles from the school
color_map = {
"City Center": "lightcoral",
"Suburb": "lightgreen",
"Rural": "lightblue"
}

fig, ax = plt.subplots(figsize=(10, 5))


one_room = housing_df[(housing_df['No. of Rooms'] == 1) &
(housing_df['No. of Bathrooms'] == 1)]

# ax.scatter(one_room['Miles (dist. between school and house)'], one_room['Rent Price per Month'], color='lightcoral
scatter = ax.scatter(one_room['Miles (dist. between school and house)'],
one_room['Sell Price'],
c=one_room["Location"].map(color_map))

# show the legends


# Create legend handles and labels
legend_handles = [plt.Line2D([0], [0], marker='o', color='w',
markerfacecolor=color, label=label)
for label, color in color_map.items()]

# Add legend
ax.legend(handles=legend_handles, title="Location", bbox_to_anchor=(1.01, 1), loc='upper left')

# Title and labels


ax.set_title('Sell Price vs. Miles from Schools', fontsize=12, fontweight='bold', color='navy')
ax.set_xlabel('Miles from Schools')
ax.set_ylabel('Sell Price')

plt.show()
In [ ]: # show the relationship between the sell price and the location
# as the type of box plots

fig, ax = plt.subplots(1, 3, figsize=(15, 5))


locations = housing_df['Location'].unique()

for i, loc in enumerate(locations):
    now_df = housing_df[housing_df['Location'] == loc]
    ax[i].boxplot(now_df['Sell Price'], patch_artist=True,
                  boxprops=dict(facecolor='lightblue'))
    ax[i].set_title(f'Sell Price in {loc}', fontsize=12, fontweight='bold', color='navy')
    ax[i].set_xticklabels([''])

plt.show()

To quantify the relationship between each pair of columns in the dataset, we create a heatmap of the correlation matrix. Surprisingly, the correlation coefficients are all close to 0, so no pair of features shows a meaningful linear relationship.

Given this conclusion, I decided not to take any further steps before commencing data analysis.
In [ ]: # draw a heatmap for the correlation matrix
import seaborn as sns

# convert the location to numerical values


housing_new = housing_df.copy()
housing_new['Location'] = housing_new['Location'].map({'City Center': 0, 'Suburb': 1, 'Rural': 2})

fig, ax = plt.subplots(figsize=(10, 5))


corr = housing_new.corr()
sns.heatmap(corr, annot=True, cmap='coolwarm', ax=ax)

plt.show()
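Mapping Location to 0, 1, 2 as in the cell above imposes an order on the three categories. An alternative that avoids this is one-hot encoding before computing correlations. The sketch below assumes a small made-up frame (`toy_df` and its values are hypothetical) with the dataset's column names; it is not the notebook's own method.

```python
import pandas as pd

# Hypothetical toy rows using the dataset's column names.
toy_df = pd.DataFrame({
    "Location": ["City Center", "Suburb", "Rural",
                 "Suburb", "Rural", "City Center"],
    "Rent Price per Month": [7401, 9259, 16469, 18096, 9923, 8912],
    "Sell Price": [74446632, 76199794, 16249579,
                   24291317, 50273384, 27709264],
})

# One indicator column per location instead of a single 0/1/2 code.
encoded = pd.get_dummies(toy_df, columns=["Location"], dtype=float)

# Pearson correlation between every pair of columns.
corr = encoded.corr()
print(corr.round(2))
```

The resulting matrix can be passed to `sns.heatmap` in the same way as before; each location then gets its own row and column.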
**Q2. If you are to inquire about Q1 from ChatGPT or Bing, what responses will you receive? Do you find them reasonable? If not, how will you rectify it?**

Below is the response of ChatGPT regarding steps to perform when conducting a data analysis:

Before commencing data analysis on the housing dataset, it's essential to perform several steps to ensure data quality and
understand the characteristics of the dataset. Here are the steps I would take:

1. Data Cleaning:
   - Check for missing values: Look for any missing or null values in the dataset and decide on how to handle them. Depending on the amount of missing data, options include imputation, deletion, or leaving them as is if they are negligible.
   - Check for duplicates: Identify and remove any duplicate entries in the dataset, if present.
   - Check for outliers: Examine the distribution of each numerical variable to identify any outliers that may skew the analysis. Decide on whether to remove or transform outliers based on domain knowledge and statistical analysis.

2. Data Exploration:
   - Summary statistics: Calculate descriptive statistics such as mean, median, standard deviation, minimum, and maximum values for numerical variables. This provides an initial understanding of the data distribution.
   - Visualizations: Create visualizations such as histograms, box plots, or scatter plots to explore the distribution and relationships between variables. For example, scatter plots can help visualize the relationship between sell price and other variables like area, number of rooms, and number of bathrooms.

3. Feature Engineering:
   - Create new features if necessary: Derive new features that might be useful for analysis or modeling. For example, calculate the price per square foot/meter by dividing sell price by the area.
   - Encode categorical variables: If there are categorical variables like 'Location', encode them into numerical values using techniques such as one-hot encoding or label encoding.

4. Data Transformation:
   - Normalize or standardize numerical variables if needed: Depending on the algorithm chosen for analysis, scaling numerical variables might be necessary to ensure all variables contribute equally to the analysis.
   - Transform skewed variables: If any numerical variables are heavily skewed, consider applying transformations such as logarithmic or square root transformations to make the distribution more symmetrical.

5. Data Splitting:
   - If the dataset is intended for predictive modeling, split the data into training and testing sets to evaluate the performance of the model accurately.

6. Documentation:
   - Document any changes made to the dataset during the cleaning, exploration, and feature engineering stages. This documentation helps maintain transparency and reproducibility of the analysis.

By following these steps, we can ensure that the dataset is well-prepared for analysis, leading to more accurate insights and reliable
modeling results.

I consider it reasonable for ChatGPT to provide the above response, as it covers the essential steps required to prepare the
dataset for analysis. However, the response is more general and covers a broader range of data analysis tasks. For the specific
housing dataset provided, the steps I outlined in my initial response are sufficient for the analysis. If I were to rectify it, I would
provide a more concise and focused response tailored to the housing dataset.
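Two steps from the ChatGPT response that Q1 did not cover, the duplicate check and a price-per-area feature, could be sketched as follows. The frame `toy_df` is hypothetical (its last row repeats the first on purpose), and the column name "Price per Area" is an illustrative choice, not part of the dataset.

```python
import pandas as pd

# Hypothetical toy rows; the last row duplicates the first on purpose.
toy_df = pd.DataFrame({
    "Area": [1360, 1794, 1630, 1360],
    "Sell Price": [74446632, 76199794, 16249579, 74446632],
})

# Duplicate check: count rows that exactly repeat an earlier row.
n_duplicates = int(toy_df.duplicated().sum())

# Feature engineering: sell price per unit of area, computed
# after dropping the duplicated rows.
deduped = toy_df.drop_duplicates().copy()
deduped["Price per Area"] = deduped["Sell Price"] / deduped["Area"]
print(n_duplicates)
print(deduped)
```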

**Q3. If you are restricted to renting a house, which one or ones will you select, and why?**

When considering renting a house, I would consider several factors such as location, rent price, distance from the school, and the
number of rooms and bathrooms. Based on the dataset, I would consider the following criteria:

1. Location: I would prefer a house located in the city center due to the convenience of access to amenities, public
transportation, and proximity to schools and workplaces.

2. Miles from School: I would prefer a house that is 10 miles or less from the school to minimize commuting time and
transportation costs.

3. Rent Price per Month: I would select the house with minimal rent price per month, as it would be more cost-effective and
allow for better budget management.

4. Number of Rooms and Bathrooms: I would prefer the house with as many rooms and bathrooms as needed for my family
size and lifestyle.
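These criteria translate into a filter followed by a sort. The sketch below is a hypothetical illustration: `toy_df`, the helper `shortlist`, and the threshold `max_miles=20` are all assumptions made for the example and are not the values used in the notebook.

```python
import pandas as pd

MILES = "Miles (dist. between school and house)"
RENT = "Rent Price per Month"

# Hypothetical toy rows with the dataset's column names.
toy_df = pd.DataFrame({
    "Location": ["City Center", "City Center", "Suburb", "City Center"],
    "No. of Rooms": [2, 1, 3, 3],
    MILES: [19, 8, 5, 240],
    RENT: [8912, 12500, 7000, 6500],
})

def shortlist(df, location, max_miles):
    """Keep rows in `location` within `max_miles` of the school, cheapest first."""
    mask = (df["Location"] == location) & (df[MILES] <= max_miles)
    return df[mask].sort_values(RENT)

# max_miles=20 is an arbitrary value chosen for the toy data.
candidates = shortlist(toy_df, "City Center", max_miles=20)
print(candidates)
```

Sorting by rent puts the cheapest qualifying property first; the remaining trade-off between rent, distance, and room count is still a judgment call.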

In [ ]: # draw a scatter plot between the rent price and the miles from the school
color_map = {
3: "lightcoral",
2: "lightgreen",
1: "lightblue"
}

fig, ax = plt.subplots(figsize=(10, 5))


housing_new = housing_df[(housing_df['Location'] == "City Center") & (housing_df['Miles (dist. between school and hou

# ax.scatter(one_room['Miles (dist. between school and house)'], one_room['Rent Price per Month'], color='lightcoral
scatter = ax.scatter(housing_new['Miles (dist. between school and house)'],
housing_new['Rent Price per Month'],
c=housing_new["No. of Rooms"].map(color_map))

# show the legends


# Create legend handles and labels
legend_handles = [plt.Line2D([0], [0], marker='o', color='w',
markerfacecolor=color, label=label)
for label, color in color_map.items()]

# Add legend
ax.legend(handles=legend_handles, title="No. of Rooms",bbox_to_anchor=(1.01, 1), loc='upper left')

# highlight the chosen house (index label 785) in gold
ax.scatter(housing_new.loc[785, 'Miles (dist. between school and house)'],
           housing_new.loc[785, 'Rent Price per Month'],
           color='gold', s=100, edgecolor='black')

# Title and labels


ax.set_title('Rent Price per Month vs. Miles from Schools', fontsize=12, fontweight='bold', color='navy')
ax.set_xlabel('Miles from Schools')
ax.set_ylabel('Rent Price per Month')

plt.show()
After making trade-offs between the rent price and the distance from the school, I chose the gold point in the scatter plot, with a rent of around 9000 dollars per month and a distance of around 19 miles from the school. This property is located in the city center and has 2 rooms and 1 bathroom, which suits me best.

In [ ]: housing_new[(housing_new["Rent Price per Month"] <= 10000) & (housing_new["Miles (dist. between school and house)"] <

Out[ ]:      Area  No. of Rooms  No. of Bathrooms     Location  Miles (dist. between school and house)  Rent Price per Month  Sell Price
        785  2041             2                 1  City Center                                      19                  8912    27709264

**Q4. Assuming you have enough funds to purchase a house, will you opt to continue renting or proceed with a purchase? If renting, which one will you choose? If buying, which one will you select? Why?**

To evaluate the decision between renting and purchasing a house, I create a new column Sell Rent Ratio by dividing the Sell Price by the annual rent, that is, the Rent Price per Month multiplied by 12. The Sell Rent Ratio represents the number of years it would take to pay off the house if the rent were put toward the purchase instead. A lower Sell Rent Ratio indicates a better investment opportunity.
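As a worked example of the ratio, the sketch below applies the formula to the sell and rent prices printed for rows 928 and 785 in this notebook. The function name `sell_rent_ratio` is an illustrative choice, not something defined in the notebook.

```python
def sell_rent_ratio(sell_price, monthly_rent):
    """Years of rent needed to equal the sell price."""
    return sell_price / (monthly_rent * 12)

# Sell price and monthly rent as printed for rows 928 and 785.
ratio_928 = sell_rent_ratio(7196311, 16445)
ratio_785 = sell_rent_ratio(27709264, 8912)
print(f"{ratio_928:.2f}, {ratio_785:.2f}")
```

Row 928 comes out at roughly 36.5 years of rent, while row 785 comes out at roughly 259 years, which illustrates how widely the ratio varies across the dataset.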

After calculating the Sell Rent Ratio , I filtered the data with the criteria below:

1. Sell Rent Ratio no greater than 50
2. Location is Suburb
3. Miles from School as close to 0 as possible

These criteria are chosen because I would prefer to purchase a house that is a good investment opportunity, is located in the suburb, and is close to the school.

In [ ]: # create a new column indicating selling price / rent price / 12


housing_new = housing_df.copy()
housing_new["Sell Rent Ratio"] = housing_new["Sell Price"] / (housing_new["Rent Price per Month"] * 12)

# draw a scatter plot between the sell rent ratio and the miles from the school
color_map = {
3: "lightcoral",
2: "lightgreen",
1: "lightblue"
}

fig, ax = plt.subplots(figsize=(10, 5))


housing_new = housing_new[(housing_new['Location'] == "Suburb") & (housing_new['Sell Rent Ratio'] <= 50)]

# ax.scatter(one_room['Miles (dist. between school and house)'], one_room['Rent Price per Month'], color='lightcoral
scatter = ax.scatter(housing_new['Miles (dist. between school and house)'],
housing_new['Sell Rent Ratio'],
c=housing_new["No. of Rooms"].map(color_map))

# show the legends


# Create legend handles and labels
legend_handles = [plt.Line2D([0], [0], marker='o', color='w',
markerfacecolor=color, label=label)
for label, color in color_map.items()]

# highlight the chosen house (index label 928) in gold
ax.scatter(housing_new.loc[928, 'Miles (dist. between school and house)'],
           housing_new.loc[928, 'Sell Rent Ratio'],
           color='gold', s=100, edgecolor='black')

# Add legend
ax.legend(handles=legend_handles, title="No. of Rooms",bbox_to_anchor=(1.01, 1), loc='upper left')

# title and labels


ax.set_title('Sell Rent Ratio vs. Miles from Schools', fontsize=12, fontweight='bold', color='navy')

ax.set_xlabel('Miles from Schools')


ax.set_ylabel('Sell Rent Ratio')

plt.show()
In [ ]: # print the house with index 928
housing_df.iloc[928]

Out[ ]: Area 2158


No. of Rooms 3
No. of Bathrooms 1
Location Suburb
Miles (dist. between school and house) 25
Rent Price per Month 16445
Sell Price 7196311
Name: 928, dtype: object

**Q5. Are there any properties with rent or selling prices that seem unusually high or low? Why?**

To identify properties with unusually high or low rent or selling prices, I examine the distribution of rent and selling prices using box plots. Under the interquartile range (IQR) rule, any data point that falls below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR is considered an outlier. With showfliers=True , the box plot displays such outliers.

As shown in the box plots below, neither the rent price nor the selling price has any outliers: no data point falls below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR.
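The same conclusion can be checked by hand from the describe() output in Q1. The sketch below plugs in the quartiles, minima, and maxima printed there; the helper `iqr_fences` is an illustrative name, not part of the notebook.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) outlier fences for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles copied from the describe() output in Q1.
rent_low, rent_high = iqr_fences(9600.25, 16844.75)
sell_low, sell_high = iqr_fences(2.343184e7, 6.118787e7)

# Compare the printed minimum and maximum of each column with its fences.
rent_has_outliers = 6018 < rent_low or 19993 > rent_high
sell_has_outliers = 6.113936e6 < sell_low or 7.998578e7 > sell_high
print(rent_low, rent_high, rent_has_outliers, sell_has_outliers)
```

Both lower fences are negative and both upper fences lie well above the column maxima, so neither column can contain a point outside the fences.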

In [ ]: # draw box plots of the rent price and the sell price,
# with outliers (fliers) shown

fig, ax = plt.subplots(1, 2, figsize=(15, 5))

ax[0].boxplot(housing_df['Rent Price per Month'], patch_artist=True,


boxprops=dict(facecolor='lightblue'), showfliers=True)

ax[0].set_title(f'Rent Price per Month', fontsize=12, fontweight='bold', color='navy')


ax[0].set_xticklabels([''])

ax[1].boxplot(housing_df['Sell Price'], patch_artist=True,


boxprops=dict(facecolor='lightblue'), showfliers=True)
ax[1].set_title(f'Sell Price', fontsize=12, fontweight='bold', color='navy')
ax[1].set_xticklabels([''])

plt.show()
