Information Regarding Sales Made in Real Estate in A Tabular Format

The document discusses analyzing a real estate sales dataset from Kaggle to understand which features most influence house prices. It loads and cleans the CSV data, removes transactions before 2012.9 and checks for null values. Simple statistics are calculated on the cleaned data, like average transaction year and min/max prices. The document also examines the range of latitudes and longitudes of properties and plans to visualize the data to identify correlations between features and house prices.

Uploaded by

frankh

In [1]: import pandas as pd
import csv

The dataset is taken from Kaggle.com. It consists of information regarding sales made in real estate in a tabular format.

Features such as the transaction date, house age, distance to the nearest Metro station, number of convenience stores and the location are given, with the final column being the house price per unit area.

The question I want to explore here is: on which features does the house price depend most heavily?

In [36]: df = pd.read_csv('estate.csv')
df.head(100)

Out[36]:
      No  X1 transaction date  X2 house age  X3 distance to the nearest MRT station  X4 number of convenience stores  X5 latitude  X6 longitude  Y house price of unit area
  0    1             2012.917          32.0                                84.87882                               10     24.98298     121.54024                        37.9
  1    2             2012.917          19.5                               306.59470                                9     24.98034     121.53951                        42.2
  2    3             2013.583          13.3                               561.98450                                5     24.98746     121.54391                        47.3
  3    4             2013.500          13.3                               561.98450                                5     24.98746     121.54391                        54.8
  4    5             2012.833           5.0                               390.56840                                5     24.97937     121.54245                        43.1
...  ...                  ...           ...                                     ...                              ...          ...           ...                         ...
 95   96             2012.917           8.0                               104.81010                                5     24.96674     121.54067                        51.8
 96   97             2013.417           6.4                                90.45606                                9     24.97433     121.54310                        59.5
 97   98             2013.083          28.4                               617.44240                                3     24.97746     121.53299                        34.6
 98   99             2013.417          16.4                               289.32480                                5     24.98203     121.54348                        51.0
 99  100             2013.417           6.4                                90.45606                                9     24.97433     121.54310                        62.2

100 rows × 8 columns

Let's do some minor operations on the dataset and see if the dataset needs cleaning!

In [37]: len(df)

Out[37]: 414

Let's remove transactions that occurred before a certain date:

In [38]: df = df[df['X1 transaction date'] > 2012.900]

In [39]: len(df)

Out[39]: 326

As we can see, the dataset has now shrunk from 414 rows to 326.
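The filter used here is a boolean mask. A minimal sketch of the same idea, assuming only the first five rows printed in the table near the top (the names `sample`, `mask` and `kept` are illustrative and do not appear in the notebook):

```python
import pandas as pd

# First five rows of the table shown earlier; `sample` is an illustrative name.
sample = pd.DataFrame({
    "X1 transaction date": [2012.917, 2012.917, 2013.583, 2013.500, 2012.833],
    "Y house price of unit area": [37.9, 42.2, 47.3, 54.8, 43.1],
})

# Comparing a column with a number yields one True/False value per row.
mask = sample["X1 transaction date"] > 2012.900

# Indexing with the mask keeps only the rows marked True.
kept = sample[mask]

print(len(sample), "->", len(kept))
print(kept.index.tolist())  # original row labels are preserved, not renumbered
```

Because the row labels survive the filter, positional access with `.iloc` is the safer choice afterwards; `reset_index(drop=True)` renumbers the rows if that is preferred.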

Our next objective is to see if there are any null values in the dataset

In [40]: df_cleared = df.dropna()  # drop any row that contains a null value

In [41]: len(df_cleared)

Out[41]: 326

Thankfully, we see that we have no null values and we can proceed with our simple statistics
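A more direct null check counts the missing cells per column. The sketch below uses a small hypothetical frame with one deliberately missing price (the data and the names `frame`, `masked` and `dropped` are made up for illustration). One caveat worth knowing: indexing a DataFrame with a boolean DataFrame keeps every row and only blanks out cells, so comparing lengths after such a mask cannot reveal nulls, whereas `dropna()` does change the length.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing price, for illustration only.
frame = pd.DataFrame({
    "X2 house age": [32.0, 19.5, 13.3],
    "Y house price of unit area": [37.9, np.nan, 47.3],
})

nulls_per_column = frame.isna().sum()    # missing cells in each column
total_nulls = int(nulls_per_column.sum())

masked = frame[frame.notnull()]          # same shape, nulls stay as NaN
dropped = frame.dropna()                 # rows with any null are removed

print(nulls_per_column)
print(len(frame), len(masked), len(dropped))
```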

Let's try out some simple statistics to know more about our data

In [42]: #Simple Statistics

#Let's start with the average time around which the transactions were held!

count = 0
for i in df_cleared['X1 transaction date']:
    count += i

print(count/len(df_cleared))

#another way

df_cleared['X1 transaction date'].mean()

2013.25641411043

Out[42]: 2013.2564141104294

In [43]: #let's see the max and min price of unit area

# .iloc[0] takes the first remaining row by position, so it still works
# even if the row labelled 0 has been filtered out
mn = df_cleared['Y house price of unit area'].iloc[0]
mx = 0

for i in df_cleared['Y house price of unit area']:
    if mx < i:
        mx = i
    if mn > i:
        mn = i

print(mx, mn)

#another way:
print(df_cleared['Y house price of unit area'].max(), df_cleared['Y house price of unit area'].min())

117.5 7.6
117.5 7.6
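Several statistics can also be requested in one call with `Series.agg`. A small sketch, assuming only the five prices printed at the top of the table (so the numbers differ from the full-data results above):

```python
import pandas as pd

# Prices from the first five rows of the table; not the full dataset.
prices = pd.Series([37.9, 42.2, 47.3, 54.8, 43.1],
                   name="Y house price of unit area")

# One call returns a Series indexed by the statistic names.
summary = prices.agg(["min", "max", "mean"])
print(summary)
```

`describe()` is a related shortcut that also reports the count, standard deviation and quartiles.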

In [49]: #let's see if the locations (longitude and latitude) differ much
#X5 latitude X6 longitude
mx = 0
mn = df_cleared['X5 latitude'].max()

for i in df_cleared['X5 latitude']:
    if mx < i:
        mx = i
    if mn > i:
        mn = i

print(mx, mn, df['X5 latitude'].mean())

mx = 0
mn = df_cleared['X6 longitude'].max()

for i in df_cleared['X6 longitude']:
    if mx < i:
        mx = i
    if mn > i:
        mn = i

print(mx, mn, df['X6 longitude'].mean())

25.01459 24.93207 24.969077699386506
121.56626999999999 121.47353000000001 121.53358107361963

This reveals a useful piece of information: our estates are mostly situated near each other.
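To put a number on "near each other", the degree ranges printed above can be converted into rough distances. This is a back-of-the-envelope sketch, not part of the original notebook: it assumes about 111.32 km per degree of latitude and scales longitude by the cosine of the mid latitude, which is adequate only for small areas. The function name `span_km` is hypothetical.

```python
import math

KM_PER_DEGREE = 111.32  # rough length of one degree of latitude (approximation)

def span_km(lat_min, lat_max, lon_min, lon_max):
    """Approximate north-south and east-west extent of a bounding box in km."""
    mid_lat = math.radians((lat_min + lat_max) / 2)
    north_south = (lat_max - lat_min) * KM_PER_DEGREE
    # A degree of longitude shrinks with the cosine of the latitude.
    east_west = (lon_max - lon_min) * KM_PER_DEGREE * math.cos(mid_lat)
    return north_south, east_west

# Min/max values taken from the output printed above.
ns, ew = span_km(24.93207, 25.01459, 121.47353, 121.56627)
print(round(ns, 1), round(ew, 1))
```

Under these assumptions the properties fit in a box of roughly nine kilometres on each side.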

Now it's time to visualize our data to look for correlations

In [50]: #time for some visualizations


import matplotlib.pyplot as plt
%matplotlib notebook

In [51]: #Let's see if we can find a relation between age and price per unit area
plt.figure()
plt.scatter(df['X2 house age'], df['Y house price of unit area'])
plt.xlabel('House Age')
plt.ylabel('Price per unit area')
plt.grid()

It's not clear from this one plot alone, but on closer inspection we see that as the age of the house increases, most prices sit somewhat lower than those of younger houses, with some exceptions.
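One way to check a visual impression like this is to bin the ages and compare the median price per bin. The sketch below assumes only the ten rows printed in the table near the top, and the bin edges are an arbitrary choice, so it illustrates the technique rather than a result for the full data.

```python
import pandas as pd

# The ten rows visible in the table above (rows 0-4 and 95-99).
sample = pd.DataFrame({
    "X2 house age": [32.0, 19.5, 13.3, 13.3, 5.0, 8.0, 6.4, 28.4, 16.4, 6.4],
    "Y house price of unit area": [37.9, 42.2, 47.3, 54.8, 43.1,
                                   51.8, 59.5, 34.6, 51.0, 62.2],
})

# Bin edges and labels are arbitrary choices for this sketch.
age_band = pd.cut(sample["X2 house age"], bins=[0, 10, 20, 40], right=False,
                  labels=["0-10", "10-20", "20-40"])

# Median price within each age band.
median_by_band = (
    sample.groupby(age_band, observed=True)["Y house price of unit area"].median()
)
print(median_by_band)
```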

Let's try another relation: the number of convenience stores versus the price per unit area

In [52]: #let's see if we can find a better relationship between the number of convenience stores and the price per unit area:
plt.figure()
plt.scatter(df['X4 number of convenience stores'], df['Y house price of unit area'])
plt.xlabel('number of convenience stores')
plt.ylabel('Price per unit area')
plt.grid()

As we can see, no solid relationship can be established from this plot.
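Because the store count is a small integer, points pile up in vertical stripes on a scatter plot, and a per-group average can be easier to read. A minimal sketch, assuming the ten rows printed in the table near the top stand in for the data (with so few rows the averages are only illustrative):

```python
import pandas as pd

# The ten rows visible in the table above (rows 0-4 and 95-99).
sample = pd.DataFrame({
    "X4 number of convenience stores": [10, 9, 5, 5, 5, 5, 9, 3, 5, 9],
    "Y house price of unit area": [37.9, 42.2, 47.3, 54.8, 43.1,
                                   51.8, 59.5, 34.6, 51.0, 62.2],
})

# Average price for each distinct store count.
mean_by_stores = (
    sample.groupby("X4 number of convenience stores")["Y house price of unit area"]
    .mean()
)
print(mean_by_stores)
```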

Let's try with another relation; the nearest metro station distance to
the price per unit area

In [53]: plt.figure()
plt.scatter(df['X3 distance to the nearest MRT station'], df['Y house price of unit area'])
plt.xlabel('distance to the nearest MRT station')
plt.ylabel('Price per unit area')
plt.grid()

Here we see a strong negative correlation between the two chosen features: the shorter the distance, the higher the price.
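A correlation read off a plot can be backed by a number. The sketch below computes the Pearson coefficient with `Series.corr`, assuming only the ten rows printed in the table near the top; a sample this small shows how the call works but says little about the strength of the relationship in the full data.

```python
import pandas as pd

# The ten rows visible in the table above (rows 0-4 and 95-99).
sample = pd.DataFrame({
    "X3 distance to the nearest MRT station": [
        84.87882, 306.59470, 561.98450, 561.98450, 390.56840,
        104.81010, 90.45606, 617.44240, 289.32480, 90.45606,
    ],
    "Y house price of unit area": [37.9, 42.2, 47.3, 54.8, 43.1,
                                   51.8, 59.5, 34.6, 51.0, 62.2],
})

# Pearson correlation: -1 is a perfect inverse line, 0 is no linear relation.
r = sample["X3 distance to the nearest MRT station"].corr(
    sample["Y house price of unit area"]
)
print(round(r, 3))
```

A negative value agrees with the direction seen in the scatter plot. Calling `corr()` on the whole DataFrame gives the same measure for every pair of columns at once.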

Let's see a histogram of the number of convenience stores present in our dataset.
It can be valuable information to show clients when we are talking about the type of properties and real estate we have in general.

In [54]: plt.figure()
plt.hist(df['X4 number of convenience stores'], bins = 20)
plt.xlabel('Number of Convenience stores near an area')
plt.grid()

A simple yet informative plot of how many convenience stores we have near our properties
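For a feature that only takes whole-number values, an exact tally per value is an alternative to choosing histogram bins. A minimal sketch with `value_counts`, assuming the store counts from the ten rows printed in the table near the top:

```python
import pandas as pd

# Store counts from the ten rows visible in the table above.
stores = pd.Series([10, 9, 5, 5, 5, 5, 9, 3, 5, 9],
                   name="X4 number of convenience stores")

# How many properties have each store count, ordered by the count itself.
counts = stores.value_counts().sort_index()
print(counts)
```

The resulting Series can be passed to `plt.bar(counts.index, counts.values)` to draw one bar per store count.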

How about a 3D plot to look for a correlation? So far we have looked at 3 features, 2 of which did not show much on their own. What if we use multiple features at the same time?

In [55]: from mpl_toolkits.mplot3d import Axes3D

In [56]: fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(df['X2 house age'], df['X3 distance to the nearest MRT station'], df['Y house price of unit area'])
ax.set_xlabel('house age')
ax.set_ylabel('distance to the nearest MRT station')
ax.set_zlabel('house price of unit area')

Out[56]: Text(0.5, 0, 'house price of unit area')

As expected, when we use two features together, we can see an even clearer relationship
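A 3D scatter is hard to judge by eye, so one hedged way to compare "one feature" against "two features" is to fit a least-squares plane and look at R². This is a swapped-in technique, not something the notebook does, and it assumes only the ten rows printed in the table near the top. The helper name `r_squared` is hypothetical.

```python
import numpy as np

# The ten rows visible in the table above (rows 0-4 and 95-99).
age = np.array([32.0, 19.5, 13.3, 13.3, 5.0, 8.0, 6.4, 28.4, 16.4, 6.4])
dist = np.array([84.87882, 306.59470, 561.98450, 561.98450, 390.56840,
                 104.81010, 90.45606, 617.44240, 289.32480, 90.45606])
price = np.array([37.9, 42.2, 47.3, 54.8, 43.1, 51.8, 59.5, 34.6, 51.0, 62.2])

def r_squared(features, target):
    """Fit target ~ intercept + features by least squares and return R^2."""
    X = np.column_stack([np.ones(len(target))] + list(features))
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    residual = target - X @ coef
    return 1 - residual.var() / target.var()

r2_dist = r_squared([dist], price)        # distance only
r2_both = r_squared([age, dist], price)   # age and distance together
print(round(r2_dist, 3), round(r2_both, 3))
```

Note that R² on the fitting data can never go down when a feature is added, so a higher value for two features is expected by construction; a held-out split would be needed to tell whether the second feature helps.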

Let's wrap up this notebook with a boxplot of the locations

In [63]: plt.figure()
plt.boxplot(df['X5 latitude'])
plt.grid()  # called once: a second bare plt.grid() would toggle the grid off again

In [62]: plt.figure()
plt.boxplot(df['X6 longitude'])
plt.grid()

Boxplots give us a clear understanding of how spread out our data is and also which points are outliers.
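The points a boxplot draws as outliers can also be listed explicitly. With matplotlib's default whisker setting (`whis=1.5`), a point is flagged when it lies more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule to the latitudes from the ten rows printed in the table near the top; the helper name `iqr_outliers` is hypothetical, and quartile conventions may differ slightly between libraries.

```python
import pandas as pd

def iqr_outliers(values, whisker=1.5):
    """Return the values lying beyond whisker * IQR from the quartiles."""
    s = pd.Series(values, dtype="float64")
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - whisker * iqr, q3 + whisker * iqr
    return s[(s < lower) | (s > upper)].tolist()

# Latitudes from the ten rows visible in the table above.
latitudes = [24.98298, 24.98034, 24.98746, 24.98746, 24.97937,
             24.96674, 24.97433, 24.97746, 24.98203, 24.97433]
print(iqr_outliers(latitudes))
```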

In [ ]:
