Information Regarding Sales Made in Real Estate in A Tabular Format

The document discusses analyzing a real estate sales dataset from Kaggle to understand which features most influence house prices. It loads and cleans the CSV data, removes transactions before 2012.9 and checks for null values. Simple statistics are calculated on the cleaned data, like average transaction year and min/max prices. The document also examines the range of latitudes and longitudes of properties and plans to visualize the data to identify correlations between features and house prices.

Uploaded by

frankh

In [1]: import pandas as pd
import csv

The dataset is taken from Kaggle.com. It consists of information regarding sales made in real estate, in a tabular format.

Features such as the transaction date, house age, distance to the nearest Metro station, number of convenience stores and location are given, with the final column being the house price per unit area.

The question I want to answer here is: which features does the house price depend on most heavily?

In [36]: df = pd.read_csv('estate.csv')
df.head(100)

Out[36]:
      No  X1 transaction date  X2 house age  X3 distance to the nearest MRT station  X4 number of convenience stores  X5 latitude  X6 longitude  Y house price of unit area
  0    1             2012.917          32.0                                84.87882                               10     24.98298     121.54024                        37.9
  1    2             2012.917          19.5                               306.59470                                9     24.98034     121.53951                        42.2
  2    3             2013.583          13.3                               561.98450                                5     24.98746     121.54391                        47.3
  3    4             2013.500          13.3                               561.98450                                5     24.98746     121.54391                        54.8
  4    5             2012.833           5.0                               390.56840                                5     24.97937     121.54245                        43.1
...  ...                  ...           ...                                     ...                              ...          ...           ...                         ...
 95   96             2012.917           8.0                               104.81010                                5     24.96674     121.54067                        51.8
 96   97             2013.417           6.4                                90.45606                                9     24.97433     121.54310                        59.5
 97   98             2013.083          28.4                               617.44240                                3     24.97746     121.53299                        34.6
 98   99             2013.417          16.4                               289.32480                                5     24.98203     121.54348                        51.0
 99  100             2013.417           6.4                                90.45606                                9     24.97433     121.54310                        62.2

100 rows × 8 columns

Let's do some minor operations on the dataset and see if the dataset needs cleaning!

In [37]: len(df)

Out[37]: 414

Let's remove transactions that occurred on or before a certain date (2012.900):

In [38]: df = df[df['X1 transaction date'] > 2012.900]

In [39]: len(df)

Out[39]: 326

As we can see, the dataset has been reduced from 414 rows to 326.

Our next objective is to see if there are any null values in the dataset.

In [40]: # Drop rows that contain any null value; df[df.notnull()] would keep every row
df_cleared = df.dropna()

In [41]: len(df_cleared)

Out[41]: 326

Thankfully, the row count is unchanged, so we have no null values and can proceed with our simple statistics.
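A row count only reveals missing values if the rows containing them were actually dropped. A more direct check is to count the nulls in each column. The sketch below uses a small made-up frame (the values and the deliberately missing price are assumptions for illustration, not rows from `estate.csv`):

```python
import pandas as pd

# Hypothetical three-row sample with one missing price, for illustration only.
sample = pd.DataFrame({
    'X2 house age': [32.0, 19.5, 13.3],
    'Y house price of unit area': [37.9, None, 47.3],
})

# Count missing values per column; all zeros would mean the data is complete.
null_counts = sample.isnull().sum()
print(null_counts)

# Drop every row that contains at least one missing value.
sample_cleared = sample.dropna()
print(len(sample), len(sample_cleared))
```

Running the same two calls on `df` should show zeros in every column if the dataset really has no nulls.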

Let's try out some simple statistics to learn more about our data.

In [42]: # Simple statistics

# Let's start with the average date around which the transactions were held
count = 0
for i in df_cleared['X1 transaction date']:
    count += i

print(count / len(df_cleared))

# Another way
df_cleared['X1 transaction date'].mean()

2013.25641411043

Out[42]: 2013.2564141104294

In [43]: # Let's see the max and min price per unit area
# .iloc[0] takes the first remaining row, whatever its label is after filtering
mn = df_cleared['Y house price of unit area'].iloc[0]
mx = 0

for i in df_cleared['Y house price of unit area']:
    if mx < i:
        mx = i
    if mn > i:
        mn = i

print(mx, mn)

# Another way:
print(df_cleared['Y house price of unit area'].max(), df_cleared['Y house price of unit area'].min())

117.5 7.6
117.5 7.6

In [49]: # Let's see if the locations (latitude and longitude) differ much
# X5 latitude, X6 longitude
mx = 0
mn = df_cleared['X5 latitude'].max()

for i in df_cleared['X5 latitude']:
    if mx < i:
        mx = i
    if mn > i:
        mn = i

print(mx, mn, df['X5 latitude'].mean())

mx = 0
mn = df_cleared['X6 longitude'].max()

for i in df_cleared['X6 longitude']:
    if mx < i:
        mx = i
    if mn > i:
        mn = i

print(mx, mn, df['X6 longitude'].mean())

25.01459 24.93207 24.969077699386506

121.56626999999999 121.47353000000001 121.53358107361963

This reveals a useful piece of information: our estates are situated close to one another. The latitudes span only about 0.08 degrees and the longitudes about 0.09 degrees.
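The hand-written loops above can also be condensed into a single aggregation. The following is a minimal sketch on a made-up three-row frame (the frame name `sample` is hypothetical); applying the same calls to `df_cleared` should give the figures printed above.

```python
import pandas as pd

# Hypothetical rows in the same shape as the dataset, for illustration only.
sample = pd.DataFrame({
    'X5 latitude': [24.98298, 24.98034, 24.98746],
    'X6 longitude': [121.54024, 121.53951, 121.54391],
})

# One call returns min, max and mean for every selected column.
summary = sample.agg(['min', 'max', 'mean'])
print(summary)

# Spread of each coordinate: max minus min, in degrees.
spread = summary.loc['max'] - summary.loc['min']
print(spread)
```

Looking at the spread next to the mean makes it easier to judge how tightly the locations cluster.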

Now it's time to visualize our data and look for correlations.

In [50]: #time for some visualizations

import matplotlib.pyplot as plt
%matplotlib notebook

In [51]: # Let's see if we can find a relation between age and price per unit area
plt.figure()
plt.scatter(df['X2 house age'], df['Y house price of unit area'])
plt.xlabel('House Age')
plt.ylabel('Price per unit area')
plt.grid()

It's not clear from a single plot, but on closer inspection we see that as the age of the house increases, the majority of prices sit somewhat lower than those of younger houses, with some exceptions.

Let's try another relation: number of convenience stores to the price per unit area

In [52]: # Let's see if we can find a better relationship between the number of convenience stores and the price per unit area
plt.figure()
plt.scatter(df['X4 number of convenience stores'], df['Y house price of unit area'])
plt.xlabel('number of convenience stores')
plt.ylabel('Price per unit area')
plt.grid()

As we can see, a solid relationship cannot be established from this plot alone.

Let's try another relation: the nearest metro station distance to the price per unit area

In [53]: plt.figure()
plt.scatter(df['X3 distance to the nearest MRT station'], df['Y house price of unit area'])
plt.xlabel('distance to the nearest MRT station')
plt.ylabel('Price per unit area')
plt.grid()

Here we see a solid negative correlation between the two chosen features: the shorter the distance, the higher the price.
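A visual impression like this can be backed with a number. One option, sketched here on made-up values (not rows from the dataset), is the Pearson correlation coefficient, which pandas exposes through `Series.corr` and `DataFrame.corr`:

```python
import pandas as pd

# Hypothetical distances and prices chosen to fall together, for illustration only.
sample = pd.DataFrame({
    'X3 distance to the nearest MRT station': [100.0, 300.0, 600.0, 1500.0, 4000.0],
    'Y house price of unit area': [55.0, 48.0, 42.0, 30.0, 15.0],
})

# Pearson correlation between one feature and the price (the default method).
r = sample['X3 distance to the nearest MRT station'].corr(
    sample['Y house price of unit area']
)
print(round(r, 3))

# Pairwise correlations between all numeric columns at once.
corr_matrix = sample.corr()
print(corr_matrix)
```

A value near -1 or +1 indicates a strong linear relation and a value near 0 a weak one. Calling `df.corr()` on the real data would let the three scatter plots be compared on the same scale.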

Let's see a histogram plot for the number of convenience stores present in our dataset

It can be valuable information to show clients when we are talking about the type of properties and real estate we have in general.

In [54]: plt.figure()
plt.hist(df['X4 number of convenience stores'], bins = 20)
plt.xlabel('Number of Convenience stores near an area')
plt.grid()

A simple yet informative plot of how many convenience stores we have near our real estate.

How about a 3D plot to see a correlation? We have previously looked at 3 features, 2 of which did not show much of a relation. What if we use multiple features at the same time?

In [55]: from mpl_toolkits.mplot3d import Axes3D

In [56]: fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(df['X2 house age'], df['X3 distance to the nearest MRT station'], df['Y house price of unit area'])
ax.set_xlabel('house age')
ax.set_ylabel('distance to the nearest MRT station')
ax.set_zlabel('house price of unit area')

Out[56]: Text(0.5, 0, 'house price of unit area')

As expected, when we use two features together, we can see an even clearer relation to the price.
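One way to check whether two features really explain the price better than one is to compare the R² of a least-squares fit with and without the second feature. The sketch below is not part of the original notebook: the helper `r_squared` and all the values are assumptions for illustration, and it uses plain NumPy rather than a modelling library.

```python
import numpy as np

def r_squared(features, target):
    """In-sample R^2 of an ordinary least-squares fit with an intercept."""
    X = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    residuals = target - X @ coef
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((target - target.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Hypothetical values, for illustration only.
age = np.array([5.0, 10.0, 15.0, 20.0, 30.0, 35.0])
distance = np.array([200.0, 1500.0, 400.0, 3000.0, 800.0, 2500.0])
price = np.array([58.0, 40.0, 50.0, 25.0, 42.0, 27.0])

r2_age = r_squared(age, price)
r2_both = r_squared(np.column_stack([age, distance]), price)
print(r2_age, r2_both)
```

Note that in-sample R² can never decrease when a feature is added, so a higher value alone does not prove the second feature is useful. A held-out subset of the data would give a fairer comparison.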

Let's wrap up this notebook with a boxplot of the locations

In [63]: plt.figure()
plt.boxplot(df['X5 latitude'])
# Call grid() once: with no arguments it toggles, so a second call would hide the grid again
plt.grid()

In [62]: plt.figure()
plt.boxplot(df['X6 longitude'])
plt.grid()

Boxplots give us a clear understanding of how spread out our data is and also what the outliers are.
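The points a boxplot draws as outliers can also be listed explicitly. By default matplotlib places the whiskers at 1.5 times the interquartile range (IQR) beyond the quartiles, and the same rule can be applied by hand. This is a minimal sketch on made-up latitudes; the last value is deliberately far from the rest.

```python
import pandas as pd

# Hypothetical latitudes, for illustration only.
lat = pd.Series([24.96, 24.97, 24.97, 24.98, 24.98, 24.99, 25.20])

# Quartiles and interquartile range.
q1 = lat.quantile(0.25)
q3 = lat.quantile(0.75)
iqr = q3 - q1

# Anything beyond 1.5 * IQR from the quartiles counts as an outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = lat[(lat < lower) | (lat > upper)]
print(outliers.tolist())
```

Applying the same steps to `df['X5 latitude']` and `df['X6 longitude']` would name the properties that sit away from the main cluster.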

In [ ]:
