CS Assignment (Raam Kumar)
ON
(2023 - 2024)
Grade: XII B
VEDIC VIDYASHRAM SENIOR SECONDARY SCHOOL
CERTIFICATE
He has taken proper care and shown utmost sincerity in the completion of this
project as per the guidelines issued by CBSE.
ACKNOWLEDGEMENT
● I sincerely thank Mr. C P ENOSH, M.A., M.Phil., B.Ed., for all his substantial and valuable guidance and moral support, which helped me complete this project with undoubted success.
● At last, I extend thanks with all my heart to the Teaching and Non-Teaching staff.
DECLARATION
duly acknowledged.
CONTENTS
S.NO  TITLE            Pg. no
01.   ACKNOWLEDGEMENT  1
02.   DECLARATION      2
05.   OBJECTIVE        6
09.   CODING           11
11.   CONCLUSION       30
12.   BIBLIOGRAPHY     31
PROBLEM DEFINITION
People looking to buy a new home tend to be conservative with their budgets and market strategies. The existing system involves calculating house prices without the necessary prediction of future market trends and price increases. The goal of this project is to predict efficient house pricing for real estate customers with respect to their budgets and priorities.
PROJECT STAGES
● IMPORTING LIBRARIES AND DATASET
● EXPLORING AND PREPROCESSING THE DATASET
● MODEL IMPLEMENTATION
● MODEL TESTING
OBJECTIVES
High-Level Approach:
EXISTING AND PROPOSED SYSTEM
EXISTING SYSTEM:
There are several approaches that can be used to determine the price of a house; one of them is prediction analysis. The first approach is quantitative prediction.
Mathematical relationships help us understand many aspects of everyday life. When such relationships are expressed with exact numbers, we gain additional clarity. Regression is concerned with specifying the relationship between a single numeric dependent variable and one or more numeric independent variables. House prices increase every year, so there is a need for a system to predict house prices in the future. House price prediction can help the developer determine the selling price of a house and can help the customer choose the right time to purchase a house.
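As a small illustration of such a numeric relationship (the years and prices below are made-up toy values, not taken from this project's dataset), a linear regression can be fitted on past prices and then extrapolated to a future year:

```python
# Sketch: fit a straight line through (year, price) pairs and predict a
# future year.  All numbers here are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression

years = np.array([[2019], [2020], [2021], [2022], [2023]])
prices = np.array([50.0, 53.0, 55.5, 58.0, 61.0])  # price in lakhs (toy data)

model = LinearRegression()
model.fit(years, prices)

# extrapolate the fitted line one year ahead
predicted_2024 = model.predict(np.array([[2024]]))[0]
print(round(predicted_2024, 2))
```

A positive slope (`model.coef_[0]`) here mirrors the statement that house prices increase every year.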
PROPOSED SYSTEM:
Nowadays, everything is shifting from manual to automated systems. The objective of this project is to predict house prices so as to minimize the problems faced by the customer. In the present method, the customer approaches a real estate agent to manage his/her investments and suggest suitable estates. This method is risky, as the agent might suggest the wrong estates, leading to a loss of the customer's investment.
The manual method currently used in the market is outdated and carries high risk. To overcome this fault, there is a need for an updated and automated system. Data mining algorithms can be used to help investors invest in an appropriate estate according to their stated requirements. The new system will also be cost- and time-efficient, with simple operations. The proposed system works on the Linear Regression algorithm.
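Before any model is involved, the "stated requirements" step can be pictured as a simple filter over an estate table. The table, column names and values below are hypothetical, only to show the idea:

```python
import pandas as pd

# Hypothetical estate listings (toy data, not the system's real database).
estates = pd.DataFrame({
    'city':  ['Chennai', 'Chennai', 'Madurai', 'Chennai'],
    'type':  ['flat', 'villa', 'flat', 'flat'],
    'price': [45.0, 120.0, 30.0, 65.0],   # price in lakhs
})

# Keep only estates matching the customer's city, house type and budget.
budget = 70.0
matches = estates[(estates['city'] == 'Chennai')
                  & (estates['type'] == 'flat')
                  & (estates['price'] <= budget)]
print(matches)
```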
REQUIREMENTS
LIBRARIES:
● NumPy
● Pandas
● scikit-learn (sklearn)
● Matplotlib
● Seaborn
SOFTWARE:
● PYTHON 3.7
● MYSQL 5.0
HARDWARE:
● Operating System: Windows or Linux
WORKING DESCRIPTION
The sequence diagram above explains the working of the system. The proposed system is supposed to be a website with three objects, namely: the Customer, the Web Interface and the Database Server. The database server also includes the computational mechanism described in the algorithm.
When the customer first enters the website, they are shown a GUI where they can enter inputs such as the type of house, the area in which it is located, etc. A data index search then provides outputs consisting of matching properties. Now, if the customer wants to check the house price in the future, they can enter a future date. The system will identify the date and categorize it into quarters. The algorithm will then compute the value of the rate and return the results to the customer.
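The date-to-quarter step described above can be sketched with pandas. The rate table and the multiplier logic below are illustrative assumptions, not the project's actual algorithm:

```python
import pandas as pd

# Map a future date entered by the customer to its year-quarter bucket.
future_date = pd.Timestamp('2025-08-15')
quarter = str(future_date.to_period('Q'))   # e.g. '2025Q3'

# Hypothetical per-quarter rate multipliers applied to today's price.
rates = {'2025Q3': 1.04, '2025Q4': 1.05}
current_price = 60.0                        # price in lakhs (toy value)
future_price = current_price * rates[quarter]
print(quarter, future_price)
```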
CODING
# -*- coding: utf-8 -*-
"""house-price-prediction-top-14-xgboost.ipynb

https://colab.research.google.com/drive/16p1a388cb30t6r0sgf6w0tahiwqewtw-

* Applying exploratory data analysis and trying to get some insights about our dataset
* Building and tuning a couple of models to get some stable results on predicting housing prices
"""

import os

# List the dataset files (the Kaggle-style walk over the input directory is
# assumed here; the loop header was lost in extraction)
for dirname, _, filenames in os.walk('../input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
"""We're going to start by loading the data and taking a first look at it, as usual. For the column names we have a great dictionary file in our dataset location, so we can get familiar with them in no time. I highly recommend looking at that before you start working on the dataset."""
import pandas as pd
import numpy as np

df_train = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')
df_test = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')

df_train.head()
df_test.head()
df_train.shape
df_test.shape

"""As we can see, the train set has 1460 rows with 81 columns and the test set has 1459 rows with 80 columns. Our dependent variable is **'SalePrice'**."""

df_train.describe()
df_test.describe()
df_train.columns, df_test.columns
import seaborn as sns
import matplotlib.pyplot as plt

# correlation matrix (the 'cols' selection and the heatmap call were lost
# in extraction and are reconstructed here)
corrmat = df_train.corr()
cols = corrmat.nlargest(10, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(df_train[cols].values.T)
sns.set(font_scale=1.25)
plt.figure(figsize=(10, 10))
sns.heatmap(cm, annot=True, xticklabels=cols.values, yticklabels=cols.values)
plt.show()

# distribution of the target, before and after a log transform
fig_saleprice = plt.figure(figsize=(12, 5))
sns.distplot(df_train['SalePrice'])

df_train['SalePrice'] = np.log(df_train['SalePrice'])
fig_saleprice2 = plt.figure(figsize=(12, 5))
sns.distplot(df_train['SalePrice'])
"""The code below lists the columns most highly correlated with SalePrice, of which OverallQual, GrLivArea, GarageCars, GarageArea, TotalBsmtSF and 1stFlrSF are the strongest."""

# sort every column's correlation with the dependent variable, highest first
corr = df_train.corr()['SalePrice']
corr[np.argsort(corr, axis=0)[::-1]]
"""# **Outliers**

We are going to plot the first 10 highly correlated columns to see how many outliers we have in our dataset.
"""

# Scatter plots of the top correlated features against SalePrice (the
# individual scatter calls were lost in extraction, so they are rebuilt
# here as one loop over the same columns)
for col in ['GrLivArea', 'OverallQual', 'GarageCars', 'GarageArea',
            'TotalBsmtSF', '1stFlrSF', 'FullBath', 'TotRmsAbvGrd', 'YearBuilt']:
    plt.figure()
    plt.scatter(df_train[col], df_train['SalePrice'])
    plt.xlabel(col, fontsize=13)
    plt.ylabel('SalePrice', fontsize=13)
    plt.show()
# pairwise scatter plots of the selected columns
sns.set()
cols = ['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars',
        'TotalBsmtSF', 'FullBath', 'YearBuilt']
sns.pairplot(df_train[cols], height=3)
plt.show()

"""Here I have merged some columns just to reduce complexity. I also tried with all the columns, but I did not get as much accuracy as I am getting right now."""
# feature engineering: merge related columns, then drop the originals
df_train['totalsf'] = df_train['TotalBsmtSF'] + df_train['1stFlrSF'] + df_train['2ndFlrSF']
df_train = df_train.drop(columns=['1stFlrSF', '2ndFlrSF', 'TotalBsmtSF'])
df_train['wholeexterior'] = df_train['Exterior1st'] + df_train['Exterior2nd']
df_train = df_train.drop(columns=['Exterior1st', 'Exterior2nd'])
# the merged 'bsmt' column (filled later) is assumed to be the sum of the
# two basement-finish areas; the original line was lost in extraction
df_train['bsmt'] = df_train['BsmtFinSF1'] + df_train['BsmtFinSF2']
df_train = df_train.drop(columns=['BsmtFinSF1', 'BsmtFinSF2'])
df_train = df_train.drop(columns=['FullBath', 'HalfBath'])

df_test['totalsf'] = df_test['TotalBsmtSF'] + df_test['1stFlrSF'] + df_test['2ndFlrSF']
df_test = df_test.drop(columns=['1stFlrSF', '2ndFlrSF', 'TotalBsmtSF'])
df_test['wholeexterior'] = df_test['Exterior1st'] + df_test['Exterior2nd']
df_test = df_test.drop(columns=['Exterior1st', 'Exterior2nd'])
df_test['bsmt'] = df_test['BsmtFinSF1'] + df_test['BsmtFinSF2']
df_test = df_test.drop(columns=['BsmtFinSF1', 'BsmtFinSF2'])
df_test = df_test.drop(columns=['FullBath', 'HalfBath'])
"""**We're going to merge the datasets here before we start editing them, so we don't have to do these operations twice. Let's call the result `df`.**"""

frames = [df_train, df_test]
df = pd.concat(frames, keys=['train', 'test'])

"""There are 2919 observations, including the target variable SalePrice and the Id column. The train set has 1460 observations while the test set has 1459 observations; the target variable SalePrice is absent in test. The aim of this study is to train a model on the train set and use it to predict the target SalePrice of the test set."""

df
df_missing = df.isnull().sum().sort_values(ascending=False)
df_missing

"""Now we separate the categorical columns and the numerical columns for filling missing values."""

cat_col = df.select_dtypes(include=['object'])
cat_col.isnull().sum()
cat_col.columns

num_col = df.select_dtypes(exclude=['object'])  # numeric counterpart (definition was missing)
num_col.columns
"""The cell below handles the numerical columns, where I replace NaN with 0. I also tried the mode, median and mean, but I got the best result with 0. If you want to try those, just fork my notebook and apply those functions.

# Numerical columns
"""
df['LotFrontage'] = df['LotFrontage'].fillna(value=0)
df['GarageYrBlt'] = df['GarageYrBlt'].fillna(value=0)
df['MasVnrArea'] = df['MasVnrArea'].fillna(value=0)
df['BsmtFullBath'] = df['BsmtFullBath'].fillna(value=0)
df['BsmtHalfBath'] = df['BsmtHalfBath'].fillna(value=0)
df['GarageArea'] = df['GarageArea'].fillna(value=0)
df['GarageCars'] = df['GarageCars'].fillna(value=0)
df['BsmtUnfSF'] = df['BsmtUnfSF'].fillna(value=0)
df['bsmt'] = df['bsmt'].fillna(value=0)
df['totalsf'] = df['totalsf'].fillna(value=0)
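The repeated `fillna` calls above all follow one pattern; on a toy DataFrame (hypothetical columns `a` and `b`, not the project's data), the same idea can be written as one loop:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the project's data.
toy = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [np.nan, 2.0, 2.0]})

# Replace NaN with 0 in every listed numerical column.
for col in ['a', 'b']:
    toy[col] = toy[col].fillna(0)

print(toy['a'].tolist())  # [1.0, 0.0, 3.0]
```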
"""I have applied the same technique as for the numerical columns, where I put 0; here I have replaced all the NaN values with 'none'. If the original dataset has a NaN value, it means the particular house does not have that feature. For example, if the house with Id 220 does not have a garage, why should we put values implying that it has one? So I replaced them with 'none'.

# Categorical columns
"""
df['MSZoning'] = df['MSZoning'].fillna(value='none')
df['GarageQual'] = df['GarageQual'].fillna(value='none')
df['GarageCond'] = df['GarageCond'].fillna(value='none')
df['GarageFinish'] = df['GarageFinish'].fillna(value='none')
df['GarageType'] = df['GarageType'].fillna(value='none')
df['BsmtExposure'] = df['BsmtExposure'].fillna(value='none')
df['BsmtCond'] = df['BsmtCond'].fillna(value='none')
df['BsmtQual'] = df['BsmtQual'].fillna(value='none')
df['BsmtFinType2'] = df['BsmtFinType2'].fillna(value='none')
df['BsmtFinType1'] = df['BsmtFinType1'].fillna(value='none')
df['MasVnrType'] = df['MasVnrType'].fillna(value='none')
df['Utilities'] = df['Utilities'].fillna(value='none')
df['Functional'] = df['Functional'].fillna(value='none')
df['Electrical'] = df['Electrical'].fillna(value='none')
df['KitchenQual'] = df['KitchenQual'].fillna(value='none')
df['SaleType'] = df['SaleType'].fillna(value='none')
df['wholeexterior'] = df['wholeexterior'].fillna(value='none')
# one-hot encode the categorical columns, then split back into train/test
# using the concat keys (the df_main definition and the x_train split were
# lost in extraction and are reconstructed here)
df_main = pd.get_dummies(df)

df_test = df_main.loc['test']
df_train = df_main.loc['train']
eid = df_test['Id']

y_train = df_train['SalePrice']
x_train = df_train.drop(columns=['SalePrice'])
df_test = df_test.drop(columns=['SalePrice'])

import xgboost

xgb_model = xgboost.XGBRegressor(learning_rate=0.05,
                                 colsample_bytree=0.5,
                                 subsample=0.8,
                                 n_estimators=1000,
                                 max_depth=5,
                                 gamma=5)
xgb_model.fit(x_train, y_train)
y_pred = xgb_model.predict(df_test)
y_pred

# undo the earlier log transform of SalePrice before writing the submission
main_submission = pd.DataFrame({'Id': eid, 'SalePrice': np.exp(y_pred)})
main_submission.to_csv("submission.csv", index=False)
main_submission.head()
OUTPUT SCREEN
***
cm = np.corrcoef(df_train[cols].values.T)
sns.set(font_scale=1.25)
plt.figure(figsize=(10, 10))
sns.heatmap(cm, annot=True, xticklabels=cols.values, yticklabels=cols.values)
plt.show()
OUTPUT:-
[Correlation heatmap of SalePrice, OverallQual, GrLivArea, GarageCars, GarageArea, TotalBsmtSF, 1stFlrSF, FullBath, TotRmsAbvGrd and YearBuilt]
***
#values of correlation
abs(df_train.corr()['SalePrice']).nlargest(10)
***
OUTPUT:-
SalePrice 1.000000
OverallQual 0.790982
GrLivArea 0.708624
GarageCars 0.640409
GarageArea 0.623431
TotalBsmtSF 0.613581
1stFlrSF 0.605852
FullBath 0.560664
TotRmsAbvGrd 0.533723
YearBuilt 0.522897
Name: SalePrice, dtype: float64
***
#sum of missing data
df.isnull().sum().sort_values(ascending=False)
***
OUTPUT:-
SalePrice :1459
MSZoning: 4
LotFrontage: 486
Alley: 2721
Utilities: 2
Exterior1st: 1
Exterior2nd: 1
MasVnrType: 24
MasVnrArea: 23
BsmtQual: 81
BsmtCond: 82
BsmtExposure: 82
BsmtFinType1: 79
BsmtFinSF1: 79
BsmtFinType2: 80
BsmtFinSF2: 1
BsmtUnfSF: 1
TotalBsmtSF: 1
Electrical: 1
BsmtFullBath: 2
BsmtHalfBath: 2
KitchenQual: 1
Functional: 2
FireplaceQu: 1420
GarageType: 157
GarageYrBlt: 159
GarageFinish: 159
GarageCars: 1
GarageArea: 1
GarageQual: 159
GarageCond: 159
PoolQC: 2909
Fence: 2348
MiscFeature:2814
SaleType: 1
Length: 36, dtype: int64
#encoded
df_main = pd.get_dummies(df)
df_main.shape
***
OUTPUT:-
(2919, 339)
***
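The jump to 339 columns comes from one-hot encoding: every distinct category becomes its own 0/1 column. A tiny example (toy columns, not the real dataset) of what `pd.get_dummies` does:

```python
import pandas as pd

# One text column with two categories expands into two indicator columns.
toy = pd.DataFrame({'zone': ['RL', 'RM', 'RL'], 'area': [8450, 9600, 11250]})
encoded = pd.get_dummies(toy)
print(list(encoded.columns))  # ['area', 'zone_RL', 'zone_RM']
```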
#rmse
***
OUTPUT:-
xgb rmse: 0.1223501568206363
gbr rmse: 0.5585375883105338
rf rmse: 0.43600854434323927
lightgbm rmse: 0.5596622356678556
SVR rmse: 0.5246953605047906
stacked rmse: 0.5026308085477498
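For reference, every RMSE value above is just the square root of the mean squared error between true and predicted values; a minimal sketch with toy numbers:

```python
import numpy as np

def rmse(y_true, y_pred):
    # Root mean squared error: sqrt of the average squared difference.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

print(rmse([3.0, 5.0, 7.0], [2.0, 5.0, 9.0]))
```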
CONCLUSION
In today’s real estate world, it has become tough to store such huge data and extract it for one’s own requirements; the extracted data should also be useful. The system makes optimal use of the Linear Regression algorithm and uses such data in the most efficient way. The linear regression algorithm helps satisfy customers by increasing the accuracy of estate choice and reducing the risk of investing in an estate.
A lot of features could be added to make the system more widely acceptable. One of the major future scopes is adding estate databases of more cities, which will allow the user to explore more estates and reach an accurate decision. More factors that affect house prices, like recession, shall be added. In-depth details of every property will be added to provide ample information about a desired estate. This will help the system run on a larger level.
REFERENCES
● Wikipedia
● https://www.crio.do/
● https://www.geeksforgeeks.org/
● https://www.kaggle.com/
● https://www.github.com/
*********