This document is a hint sheet for a project on developing a machine learning model for house price prediction, submitted to the Foundation for Innovation and Technology Transfer at the Indian Institute of Technology Delhi. It walks through data preprocessing, exploratory data analysis, feature engineering, and model training using linear regression, and it covers model evaluation metrics and how to save the trained model for future use.
Hint Sheet
Project Title
FOUNDATION FOR INNOVATION
AND TECHNOLOGY TRANSFER
Indian Institute of Technology Delhi
Submitted By:
Name:
College :
ID:
Explain Goal: House Price Prediction Regression Model

We have all experienced a time when we had to look for a new house to buy. The journey then begins with a lot of fraud, negotiating deals, researching the local areas, and so on.

House Price Prediction using Machine Learning: to deal with these kinds of issues, today we will be preparing a machine learning based model, trained on the House Price Prediction Dataset.
Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
Drive Mounting

from google.colab import drive
drive.mount('/content/drive')
Loading the dataset file from the location (Drive) using pd.read_csv

df = pd.read_csv('/Housing Price Dataset - Housing.csv')
df.head()
How to show the first 5 and last 5 entries from the dataset?

df.head()
df.tail()
How to find the total number of rows and columns in the dataset?

df.shape    # shape is an attribute, not a method, so no parentheses
Identify Features and Labels

There are 12 features in this data, and the target/label is the price of the house, predicted based on these features.
Exploratory Data Analysis

Explain the need of EDA. We will preprocess and clean the data.

1. How to check the total number of records from the dataset
2. Handle missing values
3. How to fill missing values
4. How to delete columns/rows
5. How to add a new column
6. Label encoding
7. How to deal with duplicate values (see the sketch after this list)
8. How to get a statistical inference of the data
9. Plotting the relationship between features of the data
10. Feature engineering
11. Data visualization
12. Understanding the relationship between the variables
13. Drawing conclusions
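Duplicate handling (item 7) is not demonstrated later in this sheet, so here is a minimal sketch using the standard pandas API, assuming df is the dataframe loaded above:

print(df.duplicated().sum())    # count fully duplicated rows
df = df.drop_duplicates()       # drop them, keeping the first occurrence of each row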
Describe Function: Gives the statistical facts of the dataset

df.describe()
Null values

#df.isnull()
Finding null values using .isna().sum()

df.isna().sum()
How to understand the complete data, its columns (features, their datatypes), null values, etc.?

df.info()
Finding the unique values for the column furnishingstatus

#df['Column_name'].nunique()

For furnishingstatus this returns 3.
How to fill an empty column in a dataframe?

#df['Column_name'].fillna('Not req', inplace=True)

Or by filling the numerical values with some statistical operation (mean, median, mode), etc.:

#mean_value = df['Amount'].mean()
# Replace NaNs in column Amount with the mean of values in the same column
#df['Amount'].fillna(value=mean_value, inplace=True)
#print('Updated DataFrame:')
#print(df.head())
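The same pattern works for the median and mode mentioned above. A small sketch ('Amount' is the same placeholder column name as in the hint above):

median_value = df['Amount'].median()
df['Amount'].fillna(value=median_value, inplace=True)

mode_value = df['Amount'].mode()[0]    # mode() returns a Series; take the first value
df['Amount'].fillna(value=mode_value, inplace=True)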
Dropping/Deleting the empty/not required columns

# drop unrelated/blank columns
#df.drop(['Status', 'unnamed1', 'NewColumn'], axis=1, inplace=True)
Label Encoding

Converting all categorical values into numeric form.

#columns_to_transform = ['mainroad', 'guestroom', 'basement', 'hotwaterheating', 'airconditioning', 'prefarea']
#df[columns_to_transform] = df[columns_to_transform].replace({'yes': 1, 'no': 0})
#df['furnishingstatus'] = df['furnishingstatus'].replace({'unfurnished': 0, 'semi-furnished': 1, 'furnished': 2})
#df.head()
StandardScaler

StandardScaler is a preprocessing technique in scikit-learn used for standardizing features by removing the mean and scaling to unit variance. It offers a simple yet effective method for standardizing feature values. Let's delve deeper into how StandardScaler works.

Normalization process: StandardScaler operates on the principle of normalization, transforming the distribution of each feature to have a mean of zero and a standard deviation of one. This ensures that all features are on the same scale, preventing any single feature from dominating the learning process due to its larger magnitude.

In essence, StandardScaler is a versatile and widely used preprocessing technique that contributes to the robustness, interpretability, and performance of machine learning models trained on diverse datasets. Understanding its principles and application is essential for preparing data effectively and achieving reliable results in machine learning tasks.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

#Columns = ['price', 'area']
#df[Columns] = scaler.fit_transform(df[Columns])
#df.head()
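To make the formula concrete, here is a self-contained sketch (the numbers are made up) showing that StandardScaler applies z = (x - mean) / std, using the population standard deviation:

import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[7420.0], [8960.0], [9960.0], [7500.0]])   # toy stand-in for one numeric column

z_sklearn = StandardScaler().fit_transform(x)
z_manual = (x - x.mean()) / x.std()                      # z = (x - mean) / std, ddof=0
print(np.allclose(z_sklearn, z_manual))                  # True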
Check the datatypes for all the variables.

df.dtypes
df.shape
Correlation Matrix

A correlation matrix is a table that shows the correlation coefficients between a set of variables. It's a tool used to identify patterns and trends in data.

What does it show?

Correlation coefficients: the correlation coefficient measures how closely two variables are related. It ranges from -1 to +1, with 0 indicating no correlation and +1 indicating a perfect positive relationship.

Direction: a positive value indicates a positive relationship, while a negative value indicates a negative relationship.

What's it used for?

Summarizing data: a correlation matrix can summarize a large dataset. Identifying patterns: a correlation matrix can help identify patterns and trends in data. Understanding relationships: a correlation matrix can help understand the relationships between variables.
#corr_matrix = df.corr()
#plt.figure(figsize=(18, 5))
#sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
#plt.show()
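To make the range of the coefficient concrete, here is a minimal, self-contained illustration with made-up numbers, using numpy's corrcoef:

import numpy as np

a = np.array([1, 2, 3, 4, 5])
b = np.array([2, 4, 6, 8, 10])   # b = 2 * a: a perfect positive relationship
c = np.array([5, 3, 4, 1, 2])    # roughly decreases as a increases

print(np.corrcoef(a, b)[0, 1])   # 1.0  -> perfect positive correlation
print(np.corrcoef(a, c)[0, 1])   # -0.8 -> strong negative correlation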
Data Visualization using seaborn and matplotlib
Histogram

A histogram is a graph that shows the frequency distribution of numerical data. It's used to represent continuous or discrete data, and is especially useful for large datasets.

How does a histogram work? A histogram divides the data into groups called bins. The height of each bin's rectangle represents the number of data points in that bin, and its width represents the range of values the bin covers.

#df.hist(figsize=(10, 10), bins=10)
#plt.suptitle("Histograms for All Columns", fontsize=16)
#plt.show()
1. Gender count plot from data: Bar Graph (a generic example; Gender is not a column in the housing dataset)

# plotting a bar chart for Gender and its count
#ax = sns.countplot(x='Gender', data=df)
#for bars in ax.containers:
#    ax.bar_label(bars)
Pie Chart

# plotting a pie chart for Gender and its count
# Calculate value counts for Gender
#gender_counts = df['Gender'].value_counts()
# Create a pie chart using matplotlib
#plt.pie(gender_counts, labels=gender_counts.index, autopct='%1.1f%%')
#plt.title('Gender Distribution')
#plt.show()
Line Graph

# total number of orders from the top 10 states
#sales_state = df.groupby(['State'], as_index=False)['Orders'].sum().sort_values(by='Orders', ascending=False).head(10)
#sns.set(rc={'figure.figsize': (15, 5)})
#sns.lineplot(data=sales_state, x='State', y='Orders')
Separating the features and labels from the data frame

Now, X stores all the independent values, and y stores the dependent values. X will have all the features; y will have the target value (price).

#X = df.drop('price', axis=1)
#y = df['price']
Preparing the feature set and labels for training the model.

#print(X.shape)
#print(y.shape)
Look at some samples of X and y.

#print(X[:10])
#print(y[:10])
Now, the task is to split the data into training and testing sets for model training, by importing train_test_split from sklearn.

#from sklearn.model_selection import train_test_split

Creating the training and testing data with a split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
Loading the model from sklearn

from sklearn.linear_model import LinearRegression
#lr_model = LinearRegression()

Train the model on the dataset using the fit function

#lr_model.fit(X_train, y_train)
Now, test the trained model on the test data using the predict function

#lr_y_pred = lr_model.predict(X_test)
MODEL EVALUATION: Importing the error metrics from sklearn

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

Calculating MSE, MAE, & R2 for the regression model

#mse = mean_squared_error(y_test, lr_y_pred)
#mae = mean_absolute_error(y_test, lr_y_pred)
#r2 = r2_score(y_test, lr_y_pred)

#print("\nModel Performance Metrics:")
#print(f"Mean Squared Error (MSE): {mse:.2f}")
#print(f"Mean Absolute Error (MAE): {mae:.2f}")
#print(f"R-Squared (R2): {r2:.2f}")
Confusion Matrix

Note: confusion matrices and classification reports evaluate classification models, so they do not apply directly to this regression model; this cell and the next are generic hints for classification tasks.

#class_labels = labels_test
#from sklearn.metrics import confusion_matrix
#plt.figure(figsize=(8, 8))
#y_pred_labels = [np.argmax(label) for label in predicted_classes]
#cm = confusion_matrix(y_test, y_pred_labels)
# show cm
#sns.heatmap(cm, annot=True, fmt='d', xticklabels=class_labels, yticklabels=class_labels)
Classification Report

#from sklearn.metrics import classification_report
#cr = classification_report(y_test, y_pred_labels, target_names=labels_test)
#print(cr)
Plotting the Regression Line

#plt.figure(figsize=(8, 6))
#plt.scatter(y_test, lr_y_pred, color='blue', alpha=0.6, label='Predictions')
#plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='red', linestyle='--')
#plt.xlabel('Target Values')
#plt.ylabel('Predicted Values')
#plt.title('Linear Regression Predicted vs Target Values')
#plt.legend()
#plt.grid(True)
#plt.show()
Score: R-Squared Value

#print('Training score', lr_model.score(X_train, y_train))
#print('Testing score', lr_model.score(X_test, y_test))
Important: How to Save a Model

The pickle library is used to save a trained model so it can be reloaded and used in real time.

import pickle
Save the model

'wb' means 'write binary' and is used for the file handle: open('save.p', 'wb') writes the pickled data into a file.

#with open('model_pickle', 'wb') as file:
#    pickle.dump(lr_model, file)
Load the model with a name

#with open('model_pickle', 'rb') as file:
#    LR = pickle.load(file)
Calculate the coefficients of the regression line for this data.

#LR.coef_

array([0.29361062, 0.04427349, 0.59793397, 0.22250551, 0.21850242,
       0.14958561, 0.25952496, 0.33174227, 0.36388859, 0.16271987,
       0.27261478, 0.10597148])
Calculate the intercept of the regression line for this data.

#LR.intercept_

-1.9988122975861022
Equation for the Multi-Regression Model for this house price prediction

Y = a + b1*X1 + b2*X2 + b3*X3 + b4*X4 + b5*X5 + b6*X6 + b7*X7 + b8*X8 + b9*X9 + b10*X10 + b11*X11 + b12*X12

Y = -1.9988 + 0.29361062*X1 + 0.04427349*X2 + 0.59793397*X3 + 0.22250551*X4 + 0.21850242*X5 + 0.14958561*X6 + 0.25952496*X7 + 0.33174227*X8 + 0.36388859*X9 + 0.16271987*X10 + 0.27261478*X11 + 0.10597148*X12
#df['bedrooms'].unique()

array([4, 3, 5, 2, 6, 1])
Predicting house price for some user input values

#LR.predict([[3.000677, 3, 2, 1, 0, 2, 0, 1, 1, 2, 1, 2]])

array([2.08862621])
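As a sanity check (assuming LR is the model loaded above), the same prediction can be reproduced directly from the equation: the intercept plus the dot product of the coefficients with the input features.

import numpy as np

x_new = np.array([3.000677, 3, 2, 1, 0, 2, 0, 1, 1, 2, 1, 2])   # same user input as above

y_manual = LR.intercept_ + np.dot(LR.coef_, x_new)   # a + b1*X1 + ... + b12*X12
print(y_manual)                                      # equals LR.predict([x_new])[0]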
Another way: by creating a new variable

#new_data = [2.347980, 3, 2, 3, 1, 0, 1, 1, 1, 2, 1, 3]
#LR.predict([new_data])
Conclusion

Discuss the results. Write down 4-5 lines about the model used, the training time, and the testing time. Write down the advantages, the problem solved, and closing statements. Hence, we have learned to build our first machine learning based model that predicts house prices based on user input.