House Rent Prediction EDA
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline
Loading dataset
In [2]:
df=pd.read_csv("/content/House_Rent_Dataset.csv")
df.shape
Out[2]:
(4746, 12)
In [3]:
df.head(5)
Out[3]:
   Posted On   BHK  Rent   Size  Floor       Area Type   Area Locality             City     Furnishing Status  Tenant Preferred
1  2022-05-13  2    20000  800   1 out of 3  Super Area  Phool Bagan, Kankurgachi  Kolkata  Semi-Furnished     Bachelors/Fam
2  2022-05-16  2    17000  1000  1 out of 3  Super Area  Salt Lake City Sector 2   Kolkata  Semi-Furnished     Bachelors/Fam
In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4746 entries, 0 to 4745
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Posted On 4746 non-null object
1 BHK 4746 non-null int64
2 Rent 4746 non-null int64
3 Size 4746 non-null int64
4 Floor 4746 non-null object
5 Area Type 4746 non-null object
6 Area Locality 4746 non-null object
7 City 4746 non-null object
8 Furnishing Status 4746 non-null object
9 Tenant Preferred 4746 non-null object
10 Bathroom 4746 non-null int64
11 Point of Contact 4746 non-null object
dtypes: int64(4), object(8)
memory usage: 445.1+ KB
In [5]:
df.describe()
Out[5]:
In [6]:
df.columns
Out[6]:
In [7]:
df.isnull().sum()
Out[7]:
Posted On 0
BHK 0
Rent 0
Size 0
Floor 0
Area Type 0
Area Locality 0
City 0
Furnishing Status 0
Tenant Preferred 0
Bathroom 0
Point of Contact 0
dtype: int64
In [8]:
len(df.columns)
Out[8]:
12
In [9]:
# Count the distinct values in every column; df.nunique() replaces the hand-written list
columns = df.columns
counts = df.nunique().values
fig, ax = plt.subplots(figsize=(15,9))
x=np.arange(len(columns))
width = 0.35
# Draw one bar per feature
ax.bar(x, counts, width)
ax.set_ylabel('Unique Counts')
ax.set_xlabel('Features in Dataset')
ax.set_title('Number of Unique Values for each feature')
ax.set_xticks(x)
ax.set_xticklabels(columns)
plt.show()
In [10]:
df.dtypes
Out[10]:
Posted On object
BHK int64
Rent int64
Size int64
Floor object
Area Type object
Area Locality object
City object
Furnishing Status object
Tenant Preferred object
Bathroom int64
Point of Contact object
dtype: object
In [11]:
ax = sns.countplot(x="BHK", data=df)
for p in ax.patches:
    ax.annotate('{:.1f}'.format(p.get_height()), (p.get_x()+0.25, p.get_height()+0.01))
plt.show()
Observations:
1. Most of the houses that are available for rent are 2BHK houses.
2. 6BHK has the least availability for renting.
In [12]:
df['Rent'].describe()
Out[12]:
count 4.746000e+03
mean 3.499345e+04
std 7.810641e+04
min 1.200000e+03
25% 1.000000e+04
50% 1.600000e+04
75% 3.300000e+04
max 3.500000e+06
Name: Rent, dtype: float64
Observations:
The Rent feature spans a very wide range, from a minimum of 1,200 to a maximum of 3,500,000, and its mean
(about 35,000) is more than twice its median (16,000), which points to a strong right skew. This can cause
problems when building machine learning models because many algorithms are sensitive to the scale and skew
of the values they are trained on.
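A common remedy for this kind of skew is a log transform. The later cells plot a `Rent_log` column whose construction is not shown, so the sketch below is only an assumption about how such a column could be built: it uses the natural log (`np.log`) on a small set of illustrative values rather than the real dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Rent column (illustrative values, not the real dataset).
rent = pd.Series([1200, 10000, 16000, 33000, 3500000], name="Rent")

# Natural log compresses the long right tail. Rents are strictly positive,
# so np.log is safe here; np.log1p would be the choice if zeros could occur.
rent_log = np.log(rent).rename("Rent_log")

# Compare the spread before and after the transform.
ratio_raw = rent.max() / rent.min()
ratio_log = rent_log.max() / rent_log.min()
print(rent_log.round(2).tolist())
print(f"max/min raw: {ratio_raw:.0f}, max/min log: {ratio_log:.2f}")
```

On the real data the same idea would be `df['Rent_log'] = np.log(df['Rent'])`; predictions made on the log scale can be mapped back with `np.exp`.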
In [13]:
df.head(5)
Out[13]:
   Posted On   BHK  Rent   Size  Floor       Area Type   Area Locality             City     Furnishing Status  Tenant Preferred
1  2022-05-13  2    20000  800   1 out of 3  Super Area  Phool Bagan, Kankurgachi  Kolkata  Semi-Furnished     Bachelors/Fam
2  2022-05-16  2    17000  1000  1 out of 3  Super Area  Salt Lake City Sector 2   Kolkata  Semi-Furnished     Bachelors/Fam
In [14]:
df.columns
Out[14]:
In [19]:
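# distplot is deprecated in recent seaborn releases; histplot with kde=True is the closest replacement
sns.histplot(df['Rent'], kde=True)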
Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f6210c39400>
In [15]:
sns.catplot(x="BHK", y="Rent",kind="box",data=df)
Out[15]:
<seaborn.axisgrid.FacetGrid at 0x7f6222d9c3a0>
In [16]:
In [17]:
df['Rent'].describe()
Out[17]:
count 4.746000e+03
mean 3.499345e+04
std 7.810641e+04
min 1.200000e+03
25% 1.000000e+04
50% 1.600000e+04
75% 3.300000e+04
max 3.500000e+06
Name: Rent, dtype: float64
In [20]:
Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f621346d6a0>
In [21]:
df.drop(['Rent'],axis=1,inplace=True)
In [22]:
sns.catplot(x="BHK", y="Rent_log",kind="box",data=df)
Out[22]:
<seaborn.axisgrid.FacetGrid at 0x7f6213a8ad00>
Observations:
1. Outliers appear at both the lower and the higher end for 1 BHK, 2 BHK, and 3 BHK houses.
2. Higher-end outliers are also observed for 6 BHK houses.
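The points drawn outside the whiskers of a box plot are, by default, those lying more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies that rule per BHK group so the outliers can be counted rather than read off the plot. The data frame `toy` and its values are illustrative, not taken from the dataset.

```python
import pandas as pd

# Toy data (illustrative, not the real dataset).
toy = pd.DataFrame({
    "BHK":      [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "Rent_log": [8.9, 9.0, 9.1, 9.2, 12.5, 9.4, 9.5, 9.6, 9.7, 6.0],
})

# Quartiles per BHK group, mapped back onto each row.
grouped = toy.groupby("BHK")["Rent_log"]
q1 = toy["BHK"].map(grouped.quantile(0.25))
q3 = toy["BHK"].map(grouped.quantile(0.75))
iqr = q3 - q1

# Box-plot whisker rule: flag values beyond 1.5 * IQR from the quartiles.
toy["is_outlier"] = (toy["Rent_log"] < q1 - 1.5 * iqr) | (toy["Rent_log"] > q3 + 1.5 * iqr)

# Number of flagged rows in each BHK group.
outlier_counts = toy.groupby("BHK")["is_outlier"].sum()
print(outlier_counts)
```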
In [16]:
sns.barplot(x='City',y='Rent_log',data=df,hue='BHK')
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f21d3fdd340>
Observations:
1. Kolkata, Chennai, and Hyderabad have houses of up to 6 BHK available for rent.
2. Mumbai and Delhi have houses of up to 5 BHK.
3. Bangalore has houses of up to 4 BHK.
4. For every house size up to 5 BHK, Mumbai has the highest rent.
5. Kolkata has the lowest rent.
6. Among 6 BHK houses, Chennai has the highest rent.
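The bar heights in this kind of plot are group means, so the same comparisons can be checked numerically with a pivot table. This is a minimal sketch on made-up rows; the frame `toy` and its values are assumptions for illustration only.

```python
import pandas as pd

# Toy rows (illustrative, not the real dataset).
toy = pd.DataFrame({
    "City":     ["Mumbai", "Mumbai", "Kolkata", "Kolkata", "Chennai", "Chennai"],
    "BHK":      [2, 2, 2, 3, 2, 3],
    "Rent_log": [10.8, 11.0, 9.4, 9.9, 9.8, 10.3],
})

# Mean Rent_log for every City/BHK pair; missing pairs show up as NaN.
mean_rent = toy.pivot_table(index="City", columns="BHK",
                            values="Rent_log", aggfunc="mean")

# Largest BHK count listed in each city.
max_bhk = toy.groupby("City")["BHK"].max()

print(mean_rent)
print(max_bhk)
```

Reading across a row of `mean_rent` compares BHK counts within one city, and `mean_rent[2].idxmax()` names the city with the highest mean for 2 BHK houses.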
In [17]:
sns.barplot(x='City',y='Size',data=df,hue='BHK')
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f21d3e81f10>
Observations:
1. Kolkata has the smallest houses available for rent, on average, across all BHK counts.
2. Chennai has the largest 5 BHK houses.
3. Bangalore has the largest 4 BHK houses.
4. Hyderabad and Bangalore have the largest 3 BHK houses, with nearly equal sizes.
5. Hyderabad has the largest 2 BHK houses.
6. Hyderabad also has the largest 1 BHK houses.
In [18]:
sns.barplot(x='Area Type',y='Rent_log',data=df,hue='BHK')
Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f21d3d3c490>
Observations:
In [19]:
sns.barplot(x='Area Type',y='Size',data=df,hue='BHK')
Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f21d3ecd9d0>
Observations:
In [20]:
# 'Rent' was dropped earlier, so the log-transformed column is plotted instead
sns.barplot(x='Furnishing Status',y='Rent_log',hue='BHK',data=df)
Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f21d3bd5f10>
Observations:
1. For houses up to 4 BHK, Semi-Furnished and Furnished houses have nearly the same rent.
2. For 5 BHK houses, however, Furnished houses have higher rent.
3. For 6 BHK houses, Semi-Furnished houses have higher rent.
In [21]:
sns.barplot(x='Furnishing Status',y='Size',hue='BHK',data=df)
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f21d3b25250>
Observations:
In [22]:
sns.barplot(x='Bathroom',y='Rent_log',data=df,hue='BHK')
Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f21d3afb250>
Observations:
1. The highest rent is observed for houses with 5 bathrooms.
2. Houses with 1 bathroom have the lowest rent.
In [23]:
sns.barplot(x='Bathroom',y='Size',data=df,hue='BHK')
Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f21d38c47f0>
Observations:
1. The graph suggests that the number of bathrooms increases with house size.
2. This does not always hold: for houses with 6 bathrooms, the size is not correspondingly larger.
In [24]:
sns.barplot(x='Point of Contact',y='Rent_log',hue='BHK',data=df)
Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f21d3ede220>
Observations:
1. If the Point of Contact is an Agent, the rent is higher for every BHK count.
In [28]:
sns.barplot(x='BHK',y='Rent_log',hue='City',data=df)
Out[28]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f21d35b2f70>
Observations
Preferred Tenant
In [29]:
tenant=df['Tenant Preferred'].value_counts()
plt.figure(figsize=(10,5))
plt.pie(tenant.values,labels=tenant.index,autopct="%1.2f%%");
Observations:
In [30]:
furn=df['Furnishing Status'].value_counts()
plt.figure(figsize=(10,5))
plt.pie(furn.values,labels=furn.index,autopct="%1.2f%%");
Observations:
In [31]:
poc=df['Point of Contact'].value_counts()
plt.figure(figsize=(10,5))
plt.pie(poc.values,labels=poc.index,autopct="%1.2f%%");
Observations:
In [32]:
at=df['Area Type'].value_counts()
plt.figure(figsize=(10,5))
plt.pie(at.values,labels=at.index,autopct="%1.2f%%");
Observations:
In [33]:
ct=df['City'].value_counts()
plt.figure(figsize=(10,5))
plt.pie(ct.values,labels=ct.index,autopct="%1.2f%%");
Observations
1. Mumbai has the largest share of rental listings, followed by Chennai, Bangalore, and Hyderabad.
2. Kolkata has the smallest share of rental listings.
In [34]:
al=df['Area Locality'].value_counts()
plt.figure(figsize=(10,5))
plt.pie(al.values[:10],labels=al.index[:10],autopct="%1.2f%%");
Observations
The above pie chart shows the ten most frequently listed area localities for renting houses.
In [35]:
In [44]:
df.head(5)
Out[44]:
   Posted On   BHK  Size  Floor       Area Type   Area Locality             City     Furnishing Status  Tenant Preferred
1  2022-05-13  2    800   1 out of 3  Super Area  Phool Bagan, Kankurgachi  Kolkata  Semi-Furnished     Bachelors/Family
2  2022-05-16  2    1000  1 out of 3  Super Area  Salt Lake City Sector 2   Kolkata  Semi-Furnished     Bachelors/Family
In [52]:
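def show_values_on_bars(axs):
    """Write each bar's height above it, for one Axes or an array of Axes."""
    def _show_on_single_plot(ax):
        for p in ax.patches:
            # Centre the label horizontally on the bar, at the top of the bar
            _x = p.get_x() + p.get_width() / 2
            _y = p.get_y() + p.get_height()
            value = '{:.2f}'.format(p.get_height())
            ax.text(_x, _y, value, ha="center")
    # plt.subplots returns an ndarray of Axes when there is more than one subplot
    if isinstance(axs, np.ndarray):
        for idx, ax in np.ndenumerate(axs):
            _show_on_single_plot(ax)
    else:
        _show_on_single_plot(axs)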
In [61]:
Observations:
In [62]:
Observations
In [60]:
Observations
Rents varying across the quarters of 2022 (the data was taken from magicbricks.com for the year 2022)
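The code for this cell is not shown. As a hedged sketch of how rents could be grouped by quarter, the example below parses a date column with `pd.to_datetime`, derives the quarter with `.dt.quarter`, and aggregates per quarter. The column name `Quarter` and the toy values are assumptions for illustration, not the author's code or data.

```python
import pandas as pd

# Toy rows (illustrative, not the real dataset).
toy = pd.DataFrame({
    "Posted On": ["2022-04-20", "2022-05-13", "2022-06-30", "2022-07-01", "2022-07-10"],
    "Rent_log":  [9.2, 9.9, 9.7, 10.1, 9.5],
})

# 'Posted On' is stored as text (dtype object), so parse it into datetimes first.
toy["Posted On"] = pd.to_datetime(toy["Posted On"], format="%Y-%m-%d")

# Calendar quarter (1 to 4) of each posting date; 'Quarter' is a hypothetical column name.
toy["Quarter"] = toy["Posted On"].dt.quarter

# Number of listings and mean log rent in each quarter.
rent_by_quarter = toy.groupby("Quarter")["Rent_log"].agg(["count", "mean"])
print(rent_by_quarter)
```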
In [66]:
Observations
In [71]:
Out[71]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f21cf7aad60>
Observations:
1. BHK, Size, and Bathroom are highly correlated with the Rent feature.
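The cell that produced the plot above is not shown. A correlation matrix of the numeric columns is one way to arrive at this kind of observation, so the sketch below computes Pearson correlations on made-up values and ranks the features by their correlation with `Rent_log`. The frame `toy` is an assumption for illustration only.

```python
import pandas as pd

# Toy rows (illustrative, not the real dataset).
toy = pd.DataFrame({
    "BHK":      [1, 2, 2, 3, 4],
    "Size":     [400, 800, 1000, 1500, 2500],
    "Bathroom": [1, 2, 2, 3, 4],
    "Rent_log": [8.9, 9.6, 9.8, 10.4, 11.2],
})

# Pearson correlation between every pair of numeric columns.
corr = toy.select_dtypes("number").corr()

# Correlation of each feature with the target, strongest first.
rent_corr = corr["Rent_log"].drop("Rent_log").sort_values(ascending=False)
print(rent_corr)
```

Passing `corr` to `sns.heatmap(corr, annot=True)` would draw the matrix as a heatmap.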
BHK vs Rent
In [78]:
sns.scatterplot(x='BHK',y='Rent_log',data=df)
Out[78]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f21cf0a8bb0>
Observations
Bathrooms vs Rent
In [79]:
sns.scatterplot(x='Bathroom',y='Rent_log',data=df)
Out[79]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f21cee0d8e0>
Observations
Size vs Rent
In [83]:
sns.scatterplot(x='Size',y='Rent_log',data=df)
Out[83]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f21cf4c47f0>
Observations
1. As house size increases, the range of rents also widens, but in some cases the city in which the house is
located and its number of BHK also need to be considered.
2. Depending on City and BHK, rent may increase or decrease.