
Machine Learning Project Presentation

This document summarizes Samuel Odulaja's machine learning project to predict house prices using a Kaggle dataset containing information on over 1400 homes. Various regression techniques were applied to transform, engineer, and select features from the original 79 variables. Models like Ridge, Lasso, SVR, LightGBM, and Gradient Boosting were trained and evaluated on the task. The best performing model was Gradient Boosting, achieving an RMSE of 0.3848232 on the test set.


Machine Learning Project
Samuel Odulaja
Background

● Kaggle Dataset
○ Contains around 1400 house prices and associated predictors
● 79 explanatory variables describing aspects of residential homes in Ames, Iowa
● Using advanced regression techniques, predict the final price of each home
Concatenate training and testing features

● Concatenated the training and test feature sets so that missing-value imputation, feature transformations, etc. only have to be done once for both sets
● Removed houses with above-grade living area greater than 4,500 sq. ft. from the training set
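These two steps can be sketched as below; the toy frames stand in for the Kaggle train.csv/test.csv files, and the column names follow the Ames data dictionary.

```python
import pandas as pd

# Toy stand-ins for the Kaggle train.csv / test.csv tables
train = pd.DataFrame({"GrLivArea": [1500, 5000, 2000],
                      "SalePrice": [200000, 180000, 250000]})
test = pd.DataFrame({"GrLivArea": [1600, 1700]})

# Drop training houses with above-grade living area over 4,500 sq. ft.
train = train[train["GrLivArea"] <= 4500].reset_index(drop=True)

# Stack the feature columns so imputation/transforms only happen once
features = pd.concat([train.drop(columns="SalePrice"), test],
                     ignore_index=True)
```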
SalePrice Distribution
SalePrice Transformation

● Transformed the target variable
○ y_train = np.log(train["SalePrice"])
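The same one-liner in context, with the inverse transform that maps log-scale predictions back to dollar prices (np.exp undoes np.log):

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({"SalePrice": [100000, 150000, 300000]})

# Log-transform the target to reduce right skew
y_train = np.log(train["SalePrice"])

# Predictions made on the log scale are mapped back with np.exp
prices = np.exp(y_train)
```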
Impute missing values

● The plot shows the number of missing values in columns with at least one missing value
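The counts behind such a plot can be computed like this (toy columns chosen for illustration):

```python
import pandas as pd

# Toy frame with a few Ames-style columns containing missing values
df = pd.DataFrame({"LotFrontage": [65.0, None, 80.0, None],
                   "GarageType": ["Attchd", None, "Detchd", "Attchd"],
                   "YrSold": [2008, 2007, 2009, 2006]})

# Count missing values, keeping only columns with at least one
missing = df.isnull().sum()
missing = missing[missing > 0].sort_values(ascending=False)
```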
Engineer features

● Created new features for the dataset
○ TotalSF, TotalPorchSF, TotalBath
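A sketch of how these might be built; the exact component columns are an assumption (the deck doesn't show the formulas), but these are the usual Ames candidates.

```python
import pandas as pd

# One toy house; the component columns chosen here are an assumption
# about how TotalSF / TotalPorchSF / TotalBath were constructed
df = pd.DataFrame({"TotalBsmtSF": [800], "1stFlrSF": [900], "2ndFlrSF": [700],
                   "OpenPorchSF": [50], "EnclosedPorch": [0],
                   "3SsnPorch": [0], "ScreenPorch": [20], "WoodDeckSF": [100],
                   "FullBath": [2], "HalfBath": [1],
                   "BsmtFullBath": [1], "BsmtHalfBath": [0]})

df["TotalSF"] = df["TotalBsmtSF"] + df["1stFlrSF"] + df["2ndFlrSF"]
df["TotalPorchSF"] = (df["OpenPorchSF"] + df["EnclosedPorch"]
                      + df["3SsnPorch"] + df["ScreenPorch"] + df["WoodDeckSF"])
df["TotalBath"] = (df["FullBath"] + 0.5 * df["HalfBath"]
                   + df["BsmtFullBath"] + 0.5 * df["BsmtHalfBath"])
```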
Categorize MSSubClass and YrSold

● From the MSSubClass description, the levels don't seem to have a natural ordering
○ Represented MSSubClass as a categorical feature rather than a numerical one
● Also represented YrSold as a categorical feature
○ This allowed for a more flexible relationship with SalePrice
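Casting these numeric codes to strings is one minimal way to get this behavior, since pd.get_dummies then one-hot encodes them like any other categorical column:

```python
import pandas as pd

df = pd.DataFrame({"MSSubClass": [20, 60, 20], "YrSold": [2007, 2008, 2007]})

# Cast the numeric codes to strings so they are treated as categories,
# not magnitudes, and get one-hot encoded by pd.get_dummies
df["MSSubClass"] = df["MSSubClass"].astype(str)
df["YrSold"] = df["YrSold"].astype(str)

dummies = pd.get_dummies(df)
```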
Transform features

● To better highlight any recurring patterns in SalePrice, MoSold was transformed

● Also transformed highly skewed features

● Used pd.get_dummies to convert all categorical values into dummy variables

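A sketch of the skew transform plus dummy encoding; the 0.75 skewness cutoff and log1p choice are common conventions for this Kaggle task, not values taken from the deck:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"LotArea": [8000, 9000, 150000, 7000],
                   "Neighborhood": ["NAmes", "CollgCr", "NAmes", "Edwards"]})

# Log1p-transform numeric columns whose skewness exceeds a threshold
# (0.75 is a common cutoff; the deck's exact threshold isn't shown)
num_cols = df.select_dtypes(include="number").columns
skewed = [c for c in num_cols if abs(df[c].skew()) > 0.75]
df[skewed] = np.log1p(df[skewed])

# One-hot encode the remaining categorical columns
df = pd.get_dummies(df)
```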

Removing outliers from training data

● Fitted a linear model to the training data and removed examples with a studentized residual greater than 3
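A numpy-only sketch of the idea, using internally studentized residuals (the deck may well have used a library routine such as statsmodels' externally studentized version instead):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(scale=0.5, size=n)
y[0] += 10.0  # plant one obvious outlier

# Ordinary least squares fit with an intercept
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Internally studentized residuals: r_i / (sigma * sqrt(1 - h_ii))
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)  # leverage values
sigma2 = resid @ resid / (n - X.shape[1])
student = resid / np.sqrt(sigma2 * (1 - h))

keep = np.abs(student) <= 3  # drop rows exceeding the cutoff
```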
Define random search

● Used random search to optimize the hyperparameters of each model

● Used 5-fold cross-validation to score each iteration
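With scikit-learn this looks roughly like the following; the parameter grid and toy data here are illustrative, not the settings from the deck:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] + rng.normal(size=100)

param_dist = {"n_estimators": [50, 100, 200],
              "learning_rate": [0.01, 0.05, 0.1],
              "max_depth": [2, 3, 4]}

search = RandomizedSearchCV(GradientBoostingRegressor(random_state=0),
                            param_distributions=param_dist,
                            n_iter=5,   # try 5 random settings
                            cv=5,       # 5-fold CV scores each setting
                            scoring="neg_root_mean_squared_error",
                            random_state=0)
search.fit(X, y)
```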
Trained Models

● Overall the models did well, with Gradient Boosting performing best.
○ Ridge: 0.0778
○ Lasso: 0.0796
○ SVR: 0.0712
○ LGBM: 0.0640
○ GBM: 0.0436
Creating Predictions and RMSE

● Stored the predictions of the base learners and the stacked ensemble in a list
● Averaged the predictions, giving a weight of 0.13 to each base learner and 0.35 to the stacked ensemble
● RMSE: 0.3848232
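The blend can be sketched as below; the prediction values are made up, and note that the weights sum to one (5 × 0.13 + 0.35 = 1.00):

```python
import numpy as np

# Made-up log-scale predictions from the five base learners
base_preds = [np.array([12.0, 12.5]), np.array([12.1, 12.4]),
              np.array([11.9, 12.6]), np.array([12.2, 12.3]),
              np.array([12.0, 12.5])]
stacked_pred = np.array([12.1, 12.45])  # stacked-ensemble predictions

# 0.13 per base learner plus 0.35 for the stack (weights sum to 1.00)
final = 0.13 * sum(base_preds) + 0.35 * stacked_pred
```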
Conclusions

● Overall the models seemed to perform well
● However, the RMSE seemed a little high
○ Most likely an error in the code
● In the future, would improve on the RMSE by using different methods