Ds ML House Price Book
Renan Moura - renanmf.com
Table of Contents
1. Preface
2. GitHub Repository
3. Exploratory Data Analysis
4. Data Cleaning Script
5. Machine Learning Model
6. API
7. Conclusion
Preface
I’ve been working with software for almost 10 years now and
do my best to share what I learned throughout the years.
Twitter: https://twitter.com/renanmouraf
LinkedIn: https://www.linkedin.com/in/renanmouraf
Instagram: https://www.instagram.com/renanmouraf
GitHub Repository
You can download the complete code from the GitHub
repository.
To use the data and code in the repository, follow the steps in
the next sections.
source ./venv/bin/activate
jupyter-notebook Exploratory-Data-Analysis-House-Prices.ipynb
Then, with the Jupyter Notebook open, go to Cell > Run All
to run all the commands.
python data_cleaning.py
Expected output:
python train_model.py
Expected output:
uvicorn api:app
Expected output:
source ./venv/bin/activate
python test_api.py
Expected output:
We will:
The Problem
This is the description of the problem on Kaggle:
the dataset.
Importing Libraries
import math
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
Loading Data
The CSV file actually has 1169 lines, but the first one is the
header that names the columns, so it doesn't count as a record.
train = pd.read_csv('raw_data.csv')
train.shape
(1168, 81)
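As a quick check of the header remark, the sketch below reads a tiny in-memory CSV instead of raw_data.csv. The three-line text and its values are made up for illustration; only the idea that pandas uses the first line as column names comes from the text above.

```python
import io
import pandas as pd

# Hypothetical three-line CSV: one header line plus two data rows
csv_text = "Id,LotArea,SalePrice\n1,8450,208500\n2,9600,181500\n"

total_lines = len(csv_text.splitlines())
sample = pd.read_csv(io.StringIO(csv_text))

print(total_lines)   # 3 lines in the file
print(sample.shape)  # (2, 3): the header line is not counted as a row
```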
train.head(3).T
0 1 2
81 rows × 3 columns
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1168 entries, 0 to 1167
Data columns (total 81 columns):
Id 1168 non-null int64
MSSubClass 1168 non-null int64
MSZoning 1168 non-null object
LotFrontage 964 non-null float64
LotArea 1168 non-null int64
Street 1168 non-null object
Alley 70 non-null object
LotShape 1168 non-null object
LandContour 1168 non-null object
Utilities 1168 non-null object
LotConfig 1168 non-null object
LandSlope 1168 non-null object
Neighborhood 1168 non-null object
Condition1 1168 non-null object
Condition2 1168 non-null object
BldgType 1168 non-null object
train.describe().T
Data Cleaning
In this section, we will perform some Data Cleaning.
The id column
train.drop(columns=['Id'], inplace=True)
Missing values
isna() from pandas flags every missing value with True, and
the sum() function then adds up those flags to give you the
number of missing values in each column.
columns_with_miss = train.isna().sum()
#filtering only the columns with at least 1 missing value
columns_with_miss = columns_with_miss[columns_with_miss!=0]
#The number of columns with missing values
print('Columns with missing values:', len(columns_with_miss))
#sorting the columns by the number of missing values descending
columns_with_miss.sort_values(ascending=False)
PoolQC 1164
MiscFeature 1129
Alley 1098
Fence 951
FireplaceQu 551
LotFrontage 204
GarageYrBlt 69
GarageType 69
GarageFinish 69
GarageQual 69
GarageCond 69
BsmtFinType2 31
BsmtExposure 31
BsmtFinType1 30
BsmtCond 30
BsmtQual 30
MasVnrArea 8
MasVnrType 8
Electrical 1
dtype: int64
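To see the two steps separately, here is a minimal sketch on a toy DataFrame. The three columns reuse names from the dataset, but the values are invented for the example.

```python
import numpy as np
import pandas as pd

# Toy data: values are made up, column names borrowed from the dataset
toy = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, "Gd"],
    "LotArea": [8450, 9600, 11250],
    "Electrical": ["SBrkr", np.nan, "SBrkr"],
})

mask = toy.isna()     # True where a value is missing
missing = mask.sum()  # True counts as 1, so this is a per-column total

# Keep only the columns with at least one missing value, largest first
missing = missing[missing != 0].sort_values(ascending=False)
print(missing)
```

LotArea has no missing values, so it is filtered out of the result.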
# Removing columns
train.drop(columns=['PoolQC', 'MiscFeature',
                    'Alley', 'Fence'], inplace=True)
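# Assign the result back instead of calling inplace=True on a single
# column, which recent pandas versions warn about
train['FireplaceQu'] = train['FireplaceQu'].fillna(0)
train['FireplaceQu'] = train['FireplaceQu'].replace(
    {'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5})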
It is also worth noting how much higher the value is when the
house has an Excellent fireplace.
sns.set(style="whitegrid")
sns.barplot(x='FireplaceQu', y="SalePrice", data=train)
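The same encoding can be tried on a handful of values. This is only a sketch: it uses map() followed by fillna(0) as an alternative to the fillna/replace pair above, and the five sample values are invented.

```python
import numpy as np
import pandas as pd

# Quality codes ordered from worst to best; 0 stands for "no fireplace"
quality_scale = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

# Made-up sample of the FireplaceQu column
toy = pd.DataFrame({"FireplaceQu": ["Gd", np.nan, "TA", "Ex", np.nan]})

# map() turns each code into its rank and leaves missing values as NaN,
# which fillna(0) then converts to 0
toy["FireplaceQu"] = toy["FireplaceQu"].map(quality_scale).fillna(0).astype(int)

print(toy["FireplaceQu"].tolist())  # [4, 0, 3, 5, 0]
```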
columns_with_miss = train.isna().sum()
columns_with_miss = columns_with_miss[columns_with_miss!=0]
c = list(columns_with_miss.index)
c.append('SalePrice')
train[c].corr()
columns_with_miss = train.isna().sum()
columns_with_miss = columns_with_miss[columns_with_miss!=0]
print(f'Columns with missing values: {len(columns_with_miss)}')
columns_with_miss.sort_values(ascending=False)
GarageCond 69
GarageQual 69
GarageFinish 69
GarageType 69
BsmtFinType2 31
BsmtExposure 31
BsmtFinType1 30
BsmtCond 30
BsmtQual 30
MasVnrType 8
Electrical 1
dtype: int64
Categorical variables
Let’s work on the categorical variables of our dataset.
With this, we have only 5 columns with missing values left in
our dataset.
columns_with_miss = train.isna().sum()
columns_with_miss = columns_with_miss[columns_with_miss!=0]
print(f'Columns with missing values: {len(columns_with_miss)}')
columns_with_miss.sort_values(ascending=False)
GarageCond 69
GarageQual 69
BsmtCond 30
BsmtQual 30
Electrical 1
dtype: int64
Ordinal
Ex Excellent
Gd Good
TA Average/Typical
Fa Fair
Po Poor
Ex Excellent
Gd Good
TA Average/Typical
Fa Fair
Po Poor
Ex Excellent
Gd Good
TA Typical - slight dampness allowed
Fa Fair - dampness or some cracking or settling
Po Poor - Severe cracking, settling, or wetness
NA No Basement
Ex Excellent
Gd Good
TA Average/Typical
Fa Fair
Po Poor
Ex Excellent
Gd Good
TA Average/Typical
Fa Fair
Po Poor
Ex Excellent
Gd Good
TA Average/Typical
Fa Fair
Po Poor
NA No Garage
Ex Excellent
Gd Good
TA Average/Typical
Fa Fair
Po Poor
NA No Garage
plt.tight_layout()
plt.show()
Nominal
Other categorical variables don’t seem to follow any clear
ordering.
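cols = train.columns
# select_dtypes is the public way to pick the numeric columns
num_cols = train.select_dtypes(include='number').columns
nom_cols = list(set(cols) - set(num_cols))
print(f'Nominal columns: {len(nom_cols)}')
# number of distinct values in each nominal column
value_counts = {}
for c in nom_cols:
    value_counts[c] = len(train[c].value_counts())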
Nominal columns: 31
{'CentralAir': 2,
'Street': 2,
'Utilities': 2,
'LandSlope': 3,
'PavedDrive': 3,
'MasVnrType': 4,
'GarageFinish': 4,
'LotShape': 4,
'LandContour': 4,
'BsmtCond': 5,
'MSZoning': 5,
'Electrical': 5,
'Heating': 5,
'BldgType': 5,
'BsmtExposure': 5,
'LotConfig': 5,
'Foundation': 6,
'RoofStyle': 6,
'SaleCondition': 6,
'BsmtFinType2': 7,
'Functional': 7,
'GarageType': 7,
'BsmtFinType1': 7,
'RoofMatl': 7,
'HouseStyle': 8,
'Condition2': 8,
'SaleType': 9,
'Condition1': 9,
'Exterior1st': 15,
'Exterior2nd': 16,
'Neighborhood': 25}
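A compact way to get the same kind of summary is sketched below on a toy DataFrame. It relies on select_dtypes() and nunique() rather than the loop above, and the rows are made up.

```python
import pandas as pd

# Toy data with two text columns and one numeric column (values invented)
toy = pd.DataFrame({
    "Street": ["Pave", "Pave", "Grvl"],
    "CentralAir": ["Y", "N", "Y"],
    "LotArea": [8450, 9600, 11250],
})

# Everything that is not numeric is treated as nominal here
nominal = toy.select_dtypes(exclude="number").columns

# Number of distinct values per nominal column, smallest first
unique_counts = toy[nominal].nunique().sort_values()
print(unique_counts)
```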
# Despite the name, the condition keeps columns with up to 6 distinct values
nom_cols_less_than_6 = []
for c in nom_cols:
    n_values = len(train[c].value_counts())
    if n_values < 7:
        nom_cols_less_than_6.append(c)
ncols = 3
nrows = math.ceil(len(nom_cols_less_than_6) / ncols)
f, axes = plt.subplots(nrows, ncols, figsize=(15, 30))
plt.tight_layout()
plt.show()
train['Electrical'] = train['Electrical'].fillna('SBrkr')
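One common reason to fill a single gap with a fixed category is that it is the most frequent one. The sketch below shows that approach on invented values; treating 'SBrkr' as the most frequent value is an assumption made for the example, not something shown in the lines above.

```python
import numpy as np
import pandas as pd

# Made-up sample of the Electrical column with one missing value
toy = pd.DataFrame({"Electrical": ["SBrkr", "FuseA", np.nan, "SBrkr"]})

# mode() returns the most frequent value(s); take the first one
most_common = toy["Electrical"].mode()[0]
toy["Electrical"] = toy["Electrical"].fillna(most_common)

print(most_common)                       # SBrkr
print(toy["Electrical"].isna().sum())    # 0
```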
Zero values
train.isin([0]).sum().sort_values(ascending=False).head(25)
PoolArea 1164
LowQualFinSF 1148
3SsnPorch 1148
MiscVal 1131
BsmtHalfBath 1097
ScreenPorch 1079
BsmtFinSF2 1033
EnclosedPorch 1007
HalfBath 727
BsmtFullBath 686
2ndFlrSF 655
WoodDeckSF 610
Fireplaces 551
FireplaceQu 551
OpenPorchSF 534
BsmtFinSF1 382
BsmtUnfSF 98
GarageCars 69
GarageArea 69
GarageCond 69
GarageQual 69
TotalBsmtSF 30
BsmtCond 30
BsmtQual 30
FullBath 8
dtype: int64
In this case, even though there are many 0’s, they have
meaning.
Outliers
We can also take a look at the outliers in the numeric
variables.
len(numerical_columns)
42
x, y = 0, 0
if y < columns-1:
    y += 1
elif y == columns-1:
    x += 1
    y = 0
else:
    y += 1
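The x and y counters above walk through a grid of subplots row by row. As a sketch of what that bookkeeping does, the snippet below tracks the positions by hand for 7 hypothetical plots in 3 columns and compares them with divmod(), which yields the same (row, column) pairs. The loop and the numbers 7 and 3 are assumptions for the example.

```python
columns = 3   # assumed number of plots per row
n_plots = 7   # assumed number of plots

# Manual bookkeeping: move right until the row is full, then start a new row
positions = []
x, y = 0, 0
for _ in range(n_plots):
    positions.append((x, y))
    if y < columns - 1:
        y += 1
    else:
        x += 1
        y = 0

# divmod(i, columns) gives the same (row, column) pair directly
same = [divmod(i, columns) for i in range(n_plots)]

print(positions)
print(positions == same)  # True
```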
columns_with_miss = train.isna().sum()
columns_with_miss = columns_with_miss[columns_with_miss!=0]
print(f'Columns with missing values: {len(columns_with_miss)}')
columns_with_miss.sort_values(ascending=False)
After cleaning the data, we are left with 73 columns out of the
initial 81.
train.shape
(1168, 73)
train.head(3).T
0 1 2
MSSubClass 20 60 30
MSZoning RL RL RM
LotArea 8414 12256 8960
Street Pave Pave Pave
LotShape Reg IR1 Reg
… … … …
MoSold 2 4 3
YrSold 2006 2010 2010
SaleType WD WD WD
73 rows × 3 columns
We can see a summary of the data showing that, for all the
1168 records, there isn’t a single missing (null) value.
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1168 entries, 0 to 1167
Data columns (total 73 columns):
MSSubClass 1168 non-null int64
MSZoning 1168 non-null object
LotArea 1168 non-null int64
Street 1168 non-null object
LotShape 1168 non-null object
LandContour 1168 non-null object
Utilities 1168 non-null object
LotConfig 1168 non-null object
LandSlope 1168 non-null object
Neighborhood 1168 non-null object
Condition1 1168 non-null object
Condition2 1168 non-null object
BldgType 1168 non-null object
HouseStyle 1168 non-null object
OverallQual 1168 non-null int64
OverallCond 1168 non-null int64
YearBuilt 1168 non-null int64
YearRemodAdd 1168 non-null int64
RoofStyle 1168 non-null object
RoofMatl 1168 non-null object
Exterior1st 1168 non-null object
Exterior2nd 1168 non-null object
MasVnrType 1168 non-null object
ExterQual 1168 non-null int64
ExterCond 1168 non-null int64
Foundation 1168 non-null object
BsmtQual 1168 non-null int64
BsmtCond 1168 non-null object
BsmtExposure 1168 non-null object
BsmtFinType1 1168 non-null object
BsmtFinSF1 1168 non-null int64
BsmtFinType2 1168 non-null object
BsmtFinSF2 1168 non-null int64
train.to_csv('train-cleaned.csv')
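Note that to_csv() also writes the DataFrame index as an unnamed first column by default. The sketch below, using an in-memory buffer and two made-up rows, shows what that looks like when the file is read back with and without index_col=0.

```python
import io
import pandas as pd

# Two invented rows standing in for the cleaned data
toy = pd.DataFrame({"LotArea": [8414, 12256], "Street": ["Pave", "Pave"]})

buffer = io.StringIO()
toy.to_csv(buffer)  # the index is written as the first, unnamed column
csv_text = buffer.getvalue()

# Read back without and with index_col=0
with_index = pd.read_csv(io.StringIO(csv_text))
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_index.columns))  # ['Unnamed: 0', 'LotArea', 'Street']
print(list(restored.columns))    # ['LotArea', 'Street']
```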
Conclusions
We dealt with missing values and removed the following
columns: ‘Id’, ‘PoolQC’, ‘MiscFeature’, ‘Alley’, ‘Fence’,
‘LotFrontage’, ‘GarageYrBlt’, ‘MasVnrArea’.
We also:
Please note that the removed columns are not useless and
may contribute to the final model.
Code
You can save the script in a file named ‘data_cleaning.py’ and
execute it directly with python3 data_cleaning.py or python
data_cleaning.py, depending on your installation.
It will also print the shape of the original data and the shape
of the new cleaned data.
import os
import pandas as pd
    # Saves a copy
    cleaned_data = os.path.join(output_file)
    df.to_csv(cleaned_data)
    return df

if __name__ == "__main__":
    # Reads the file train.csv
    train_file = os.path.join('train.csv')
    if os.path.exists(train_file):
        df = pd.read_csv(train_file)
        print(f'Original Data: {df.shape}')
        cleaned_df = clean_data(df)
        print(f'After Cleaning: {cleaned_df.shape}')
    else:
        print(f'File not found {train_file}')
It means the models used 934 data points to train and 234
data points to test.
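The 934/234 figures follow from an 80/20 split of the 1168 records. The sketch below reproduces the arithmetic on a stand-in DataFrame with 1168 rows; the single SalePrice column filled with a counter is an assumption made so that the rows can be told apart.

```python
import pandas as pd

# Stand-in dataset: 1168 rows, each with a distinct made-up value
dataset = pd.DataFrame({"SalePrice": range(1168)})

# Sample 80% of the rows, keeping their original index labels
sampled = dataset.sample(frac=0.8, random_state=30)

# Training set: the sampled rows; test set: every row that was not sampled
data_train = sampled.reset_index(drop=True)
data_test = dataset.drop(sampled.index).reset_index(drop=True)

print(len(data_train), len(data_test))  # 934 234
```

Dropping by the sampled index labels before resetting them keeps the two sets disjoint.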
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
import pickle
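def create_train_test_data(dataset):
    # load and split the data
    # Sample 80% of the rows and keep their original index labels, so
    # the very same rows can be dropped to build the test set
    sampled = dataset.sample(frac=0.8, random_state=30)
    data_train = sampled.reset_index(drop=True)
    data_test = dataset.drop(sampled.index).reset_index(drop=True)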
    # The step names are only labels: the pipeline one-hot encodes the
    # features and then fits a linear regression
    model = Pipeline(steps=[
        ("label encoding",
         OneHotEncoder(handle_unknown='ignore')),
        ("tree model", LinearRegression())
    ])
    model.fit(x_train, y_train)
    return model

def export_model(model):
    # Save the model
    pkl_path = 'model.pkl'
    with open(pkl_path, 'wb') as file:
        pickle.dump(model, file)
    print(f"Model saved at {pkl_path}")

def main():
    # Load the whole data
    data = pd.read_csv('cleaned_data.csv',
                       keep_default_na=False, index_col=0)
    # Split train/test
    # Creates train.csv and test.csv
    create_train_test_data(data)
if __name__ == '__main__':
    main()
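To get a feel for what the pipeline in the script does, here is a minimal sketch on four invented houses. The feature values, prices, and step names are assumptions for the example; the combination of OneHotEncoder(handle_unknown='ignore') and LinearRegression is taken from the script.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Four made-up houses with two categorical features
x_train = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave", "Pave"],
    "CentralAir": ["Y", "N", "Y", "N"],
})
y_train = [200000, 90000, 210000, 120000]

model = Pipeline(steps=[
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
    ("regressor", LinearRegression()),
])
model.fit(x_train, y_train)

# "Dirt" was never seen during training; handle_unknown="ignore"
# encodes it as all zeros instead of raising an error
x_new = pd.DataFrame({"Street": ["Dirt"], "CentralAir": ["Y"]})
pred = model.predict(x_new)
print(pred.shape)  # (1,)
```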
API
The output of the last chapter is the Machine Learning Model
that we are going to use in the API.
Class HousePriceModel
Save this script in a file named predict.py.
This file has the class HousePriceModel and is used to load the
Machine Learning model and make the predictions.
import pickle

class HousePriceModel():
    def __init__(self):
        self.model = self.load_model()
        self.preds = None

    def load_model(self):
        # uses the file model.pkl
        pkl_filename = 'model.pkl'
        try:
            with open(pkl_filename, 'rb') as file:
                pickle_model = pickle.load(file)
        except Exception:
            # e.g. the file is missing or is not a valid pickle
            print(f'Error loading the model at {pkl_filename}')
            return None
        return pickle_model
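The save/load round trip that load_model() depends on can be tried without a trained model. In the sketch below a plain dictionary stands in for the model and a temporary directory stands in for the project folder; both are assumptions for the example.

```python
import os
import pickle
import tempfile

# A plain dict stands in for the trained model
stand_in_model = {"coef": [1.5, -2.0], "intercept": 10.0}

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, "model.pkl")

    # Save, then load again from the same path
    with open(pkl_path, "wb") as file:
        pickle.dump(stand_in_model, file)
    with open(pkl_path, "rb") as file:
        loaded = pickle.load(file)

    # Loading from a path that does not exist raises FileNotFoundError
    missing_path = os.path.join(tmp_dir, "missing.pkl")
    try:
        with open(missing_path, "rb") as file:
            pickle.load(file)
        load_failed = False
    except FileNotFoundError:
        load_failed = True

print(loaded == stand_in_model)  # True
print(load_failed)               # True
```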
uvicorn api:app
Expected output:
from datetime import datetime
from fastapi import FastAPI
from predict import HousePriceModel

app = FastAPI()

@app.get("/")
def root():
    return {"status": "online"}

@app.post("/predict")
def predict(inputs: dict):
    model = HousePriceModel()
    start = datetime.today()
    pred = model.predict(inputs)[0]
    dur = (datetime.today() - start).total_seconds()
    return pred
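The start/dur pair above measures how long a prediction takes. As a sketch of the same idea outside the API, the helper below times any function call. The name timed_call is hypothetical, and time.perf_counter() is used here in place of datetime.today() because it is designed for measuring short intervals.

```python
import time

def timed_call(func, *args):
    """Hypothetical helper: run func(*args) and return (result, seconds)."""
    start = time.perf_counter()
    result = func(*args)
    dur = time.perf_counter() - start
    return result, dur

# sum() stands in for the model's predict method
result, dur = timed_call(sum, [1, 2, 3])
print(result)        # 6
print(f"{dur:.6f}")  # elapsed time in seconds
```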
Expected output:
def run_prediction_from_sample():
    url = "http://127.0.0.1:8000/predict"
    headers = {"Content-Type": "application/json",
               "Accept": "text/plain"}

if __name__ == "__main__":
    run_prediction_from_sample()
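As a rough idea of what a request to the /predict endpoint looks like, the sketch below builds a POST request with the same URL and headers using only the standard library. The three fields in the sample payload are an assumption (the real script sends a complete house record), and urllib is used here only to keep the example dependency-free.

```python
import json
import urllib.request

url = "http://127.0.0.1:8000/predict"
headers = {"Content-Type": "application/json", "Accept": "text/plain"}

# Hypothetical, incomplete sample: a real request needs every model feature
sample = {"MSSubClass": 20, "LotArea": 8414, "Street": "Pave"}

# The JSON body must be sent as bytes
body = json.dumps(sample).encode("utf-8")
request = urllib.request.Request(url, data=body, headers=headers, method="POST")

print(request.get_method())  # POST

# With the API running, the request could be sent like this:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```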
Conclusion
That’s it!
Twitter: https://twitter.com/renanmouraf
LinkedIn: https://www.linkedin.com/in/renanmouraf
Instagram: https://www.instagram.com/renanmouraf