Information Regarding Sales Made in Real Estate in A Tabular Format
The question I want to explore here is: which features does the house price depend on most heavily?
In [36]: df = pd.read_csv('estate.csv')
df.head(100)
Out[36]:
[Table output, rows not recoverable. Columns: No, X1 transaction date, X2 house age, X3 distance to the nearest MRT station, X4 number of convenience stores, X5 latitude, X6 longitude, Y house price of unit area]
Let's do some minor operations on the dataset and see if the dataset needs cleaning!
In [37]: len(df)
Out[37]: 414
In [39]: len(df)
Out[39]: 326
Our next objective is to see if there are any null values in the dataset
In [40]: #drop any row that contains a null value
df_cleared = df.dropna()
In [41]: len(df_cleared)
Out[41]: 326
Thankfully, the row count is unchanged, so we have no null values and can proceed with our simple statistics.
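Counting the nulls per column is a more direct check than comparing lengths. Here is a minimal sketch on a small made-up frame that stands in for `estate.csv` (the values are hypothetical, only the column names come from this notebook):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data; one price is deliberately missing.
toy = pd.DataFrame({
    'X2 house age': [32.0, 19.5, 13.3, 5.0],
    'Y house price of unit area': [37.9, 42.2, np.nan, 43.1],
})

# Count missing values in each column.
null_counts = toy.isnull().sum()
print(null_counts)

# Drop every row that contains at least one missing value.
toy_cleared = toy.dropna()
print(len(toy), len(toy_cleared))
```

If every count is zero, the frame has no nulls and `dropna()` leaves the length unchanged.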
Let's try out some simple statistics to know more about our data
In [42]: #Let's start with the average time around which the transactions were held!
count = 0
for i in df_cleared['X1 transaction date']:
    count += i
print(count/len(df_cleared))
#another way
2013.25641411043
Out[42]: 2013.2564141104294
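The manual loop can be replaced by pandas' built-in `Series.mean()`, which is presumably what the second result, `Out[42]`, shows. A minimal sketch on hypothetical dates, assuming the column stores the transaction date as a fractional year:

```python
import pandas as pd

# Hypothetical transaction dates, stored as fractional years.
dates = pd.Series([2012.917, 2013.250, 2013.500, 2013.583],
                  name='X1 transaction date')

# Manual accumulation, as in the cell above.
total = 0
for value in dates:
    total += value
loop_mean = total / len(dates)

# Vectorized equivalent.
builtin_mean = dates.mean()
print(loop_mean, builtin_mean)
```

Both approaches agree up to floating-point rounding; the built-in is shorter and faster on large columns.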
In [43]: #let's see the max and min price of unit area
mx = 0
mn = df_cleared['Y house price of unit area'].max()
for i in df_cleared['Y house price of unit area']:
    if mx < i:
        mx = i
    if mn > i:
        mn = i
print(mx, mn)
#another way:
print(df_cleared['Y house price of unit area'].max(), df_cleared['Y house price of unit area'].min())
117.5 7.6
117.5 7.6
In [49]: #let's see if the locations (latitude and longitude) differ much
#X5 latitude X6 longitude
mx = 0
mn = df_cleared['X5 latitude'].max()
mx = 0
mn = df_cleared['X6 longitude'].max()
print(mx, mn, df['X6 longitude'].mean())
This reveals a nice piece of information: our estates are mostly situated near each other.
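To put a number on how close the estates are, the range of each coordinate can be computed in one call. This is only a sketch on hypothetical coordinates (not values from the dataset); `agg` with a list of function names is standard pandas:

```python
import pandas as pd

# Hypothetical coordinates standing in for the real columns.
coords = pd.DataFrame({
    'X5 latitude': [24.98, 24.97, 24.96, 24.99],
    'X6 longitude': [121.54, 121.53, 121.51, 121.55],
})

# Minimum, maximum and mean of each coordinate in one call.
summary = coords.agg(['min', 'max', 'mean'])
print(summary)

# Range (max - min) per coordinate, in degrees.
spread = summary.loc['max'] - summary.loc['min']
print(spread)
```

A spread of a few hundredths of a degree would correspond to properties within a few kilometres of each other.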
In [51]: #Let's see if we can find a relation between age and price per unit area
plt.figure()
plt.scatter(df['X2 house age'], df['Y house price of unit area'])
plt.xlabel('House Age')
plt.ylabel('Price per unit area')
plt.grid()
It's not clear from a single plot, but on closer inspection we see that as the age of the house increases, most prices sit somewhat lower than those of newer houses, with some exceptions.
In [52]: #let's see if we can find a better relationship between the number of convenience stores and the price per unit area:
plt.figure()
plt.scatter(df['X4 number of convenience stores'], df['Y house price of unit area'])
plt.xlabel('number of convenience stores')
plt.ylabel('Price per unit area')
plt.grid()
Let's try another relation: the distance to the nearest MRT station against the price per unit area.
In [53]: plt.figure()
plt.scatter(df['X3 distance to the nearest MRT station'], df['Y house price of unit area'])
plt.xlabel('distance to the nearest MRT station')
plt.ylabel('Price per unit area')
plt.grid()
Here we see a clear negative relationship between the two chosen features: the shorter the distance to the nearest MRT station, the higher the price.
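The strength of a relationship seen in a scatter plot can be summarized with Pearson's correlation coefficient. A minimal sketch on hypothetical values (not taken from the dataset); a coefficient close to -1 would match the pattern described above:

```python
import pandas as pd

# Hypothetical sample: price falls as distance grows.
sample = pd.DataFrame({
    'X3 distance to the nearest MRT station': [100.0, 300.0, 600.0, 1500.0, 4000.0],
    'Y house price of unit area': [55.0, 47.0, 42.0, 30.0, 15.0],
})

# Pearson correlation between the two columns (Series.corr defaults to Pearson).
r = sample['X3 distance to the nearest MRT station'].corr(
    sample['Y house price of unit area'])
print(round(r, 3))
```

On the real data the same call would be made on `df` with the two column names used in the scatter plot.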
In [54]: plt.figure()
plt.hist(df['X4 number of convenience stores'], bins = 20)
plt.xlabel('Number of Convenience stores near an area')
plt.grid()
A simple yet informative plot of how many convenience stores we have near our real estates.
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(df['X2 house age'], df['X3 distance to the nearest MRT station'], df['Y house price of unit area'])
ax.set_xlabel('house age')
ax.set_ylabel('distance to the nearest MRT station')
ax.set_zlabel('house price of unit area')
As expected, when we use two features together, the relationship with the price looks even clearer.
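One way to check whether two features really explain the price better than one is to compare the R² of a least-squares fit. The sketch below is an assumption of mine, not part of the original analysis: the data is synthetic and `r_squared` is a hypothetical helper built on `numpy.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: price depends on both age and distance, plus noise.
age = rng.uniform(0, 40, size=200)
dist = rng.uniform(50, 5000, size=200)
price = 50 - 0.3 * age - 0.006 * dist + rng.normal(0, 3, size=200)

def r_squared(features, target):
    """Fit ordinary least squares with an intercept and return R^2."""
    X = np.column_stack([np.ones(len(target))] + list(features))
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    residual = target - X @ coef
    return 1 - residual.var() / target.var()

r2_dist = r_squared([dist], price)
r2_both = r_squared([age, dist], price)
print(round(r2_dist, 3), round(r2_both, 3))
```

On training data, adding a feature can never lower R², so the size of the gain matters more than its sign.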
Let's wrap up this notebook with a boxplot of the locations
In [63]: plt.figure()
plt.boxplot(df['X5 latitude'])
plt.grid()
In [62]: plt.figure()
plt.boxplot(df['X6 longitude'])
plt.grid()
Boxplots give us a clear understanding of how spread out our data is and which points are outliers.
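By default, matplotlib's `boxplot` draws its whiskers out to 1.5 times the interquartile range and marks anything beyond them as an outlier. The same rule can be applied numerically; here is a minimal sketch on hypothetical latitudes (not values from the dataset):

```python
import pandas as pd

# Hypothetical latitudes; the last value is deliberately far from the rest.
lat = pd.Series([24.96, 24.97, 24.97, 24.98, 24.98, 24.99, 25.20])

q1, q3 = lat.quantile(0.25), lat.quantile(0.75)
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are the ones a default boxplot flags.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = lat[(lat < lower) | (lat > upper)]
print(outliers.tolist())
```

Listing the flagged rows this way makes it easy to look up which properties they are.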