Presentation - Final Thesis
FINAL THESIS
PRESENTATION
M.Sc. In Construction Management
Department of Civil Engineering, Pulchowk Campus, IOE, TU, Nepal
21st November, 2023
PRESENTED BY
ANJULI SAPKOTA (078/MSCOM/002)
SUPERVISED BY
Er. Samrakshya Karki
Building Design Authority Pvt. Ltd.
OUTLINE
Introduction
Problem Statement
Objectives
Literature Review
Research Methodology
Results
Conclusion
INTRODUCTION
Predicting construction expenses is crucial in the early
phases of a building project [2].
PROBLEM STATEMENT
Construction cost estimation relies on the knowledge of human experts and engineers,
whose experience is frequently neither verified nor documented.
Incorrect cost estimation causes a variety of issues, including modification orders and
delays in the construction process [5].
Traditional technologies are unable to process and evaluate the vast volume of data
generated by the construction sector, so a significant amount of data is stored but
effectively lost to analysis [7].
OBJECTIVES
Main Objective:
To compare the performance of different machine learning algorithms in estimating
the early cost of building projects.
Specific Objectives:
To identify the most significant features/input for cost prediction models of buildings.
To develop and compare the performance of different machine learning models.
LITERATURE REVIEW
(Sonmez, 2004) found that regression models typically require fewer parameters than
neural networks, which leads to better prediction performance when the relationships
between the variables are well defined.
The research done by (Cho, 2013) showed that the artificial neural network model had a
lower error rate than the multiple regression model in predicting building costs.
(Kim G. H., 2013) used 197 cases for model construction and validation and the remaining
20 cases for testing, and discovered that the NN model provided more accurate estimates
than the RA and SVM models.
(Badawy, 2020) conducted research using data from 174 actual residential projects in
Egypt. The hybrid model's mean absolute percentage error of 10.64% was lower than
that of the ANN and regression models.
The selection of the topic “A Comparative Study of Machine Learning Algorithms for
Early Cost Estimation of Building Projects in Nepal.” for the thesis was driven by its
profound relevance and practicality within the context of the Nepalese construction
industry.
By learning from data on past projects, computers can produce estimates that are faster
and more accurate than those obtained from software tools and human expertise, which
are tedious and time-consuming.
Expert Opinion
The questionnaire was filled out by 5 experts working as consultants or contractors.
Pilot Testing
A pilot test was conducted with 3 respondents to check the clarity and
comprehensibility of the questionnaire.
The respondents easily understood the questionnaire; hence, there was no difficulty in
filling up the questionnaire.
The minimum time taken to fill the questionnaire was around 10 minutes and the
maximum time taken was almost 20 minutes.
The summary of the respondents of the pilot test is given in the table
below:
Data Collection
The data collection process was very tough, as the bidding amount is confidential
for contractors.
Data Preprocessing
Separated Numerical and Categorical Features (excluding the total cost of the project).
Plotted individual scatter plots for numerical features to visualize the data and identify
outliers.
Plotted scatter plots between numerical features and the total cost of the project to
analyze their relationship.
Plotted a normal distribution graph for numerical features to analyze the distribution
of the data.
Calculated mean, median, mode, and variance for individual numeric features.
Replaced missing values in numerical features with the mean, median, and mode, and
data was saved in separate Excel sheets.
Based on the analysis of variance, replaced missing values with mean as there is less
variance in data when replaced with the mean value.
Plotted histograms for categorical features to find whether values are unique or in some
order.
Replaced missing values in categorical features with the mode (i.e., the most frequent
value) and again plotted histograms.
Categorical features were encoded using one-hot encoding, since they represent distinct
names without any inherent order.
One-hot encoding converts all categorical values into numeric features, making the data
suitable for further analysis.
The final dataset was then split into training and testing sets in a ratio of 80:20
for model implementation.
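The preprocessing steps above can be sketched with pandas and scikit-learn. This is a minimal illustration on toy data; the column names and values are hypothetical stand-ins for the building attributes in the actual dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with missing values; the real dataset uses the building
# attributes listed earlier in the slides.
df = pd.DataFrame({
    "Floor Area(sqm)": [120.0, None, 300.0, 450.0, 210.0],
    "Number of Floors": [2, 3, None, 5, 2],
    "Type of Building": ["Residential", None, "Hospital", "Residential", "Public"],
    "Total Cost": [1.2e7, 2.5e7, 6.0e7, 9.1e7, 2.0e7],
})

num_cols = ["Floor Area(sqm)", "Number of Floors"]
cat_cols = ["Type of Building"]

# Impute numeric features with the mean (chosen in the thesis because it
# changed the variance least) and categorical features with the mode.
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())
df[cat_cols] = df[cat_cols].fillna(df[cat_cols].mode().iloc[0])

# One-hot encode categorical features so every column is numeric.
df = pd.get_dummies(df, columns=cat_cols)

# 80:20 train/test split.
X = df.drop(columns="Total Cost")
y = df["Total Cost"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```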
Models Implementation
In this work, Linear Regression, Decision Tree, Random Forest, Artificial Neural
Network, Support Vector Machine, XGBoost, Extra Trees, Voting Regressor, and
Stacking Regressor models are implemented.
Linear Regression (LR) is a basic regression model used to establish a linear
relationship between independent and dependent variables.
Decision Tree Regressor (DT) partitions data into subsets based on features for
predictions.
Random Forest Regressor (RF), an ensemble method, combines predictions from
multiple decision trees.
The Neural Network (NN) comprises several dense layers with varying activations,
trained using the 'Adam' optimizer for 100 epochs to minimize mean squared error.
XGBoost Regressor (XGB) is a gradient-boosting algorithm that combines weak
learners to boost predictive performance.
Support Vector Machine (SVM) with a linear kernel is used for regression.
Extra Trees Regressor (ET) is similar to Random Forest but employs random thresholds
for feature splitting.
The Voting Regressor (Voting) combines the LR, DT, and RF models.
Stacking Regressor (Stacking) combines LR, DT, RF, ET, and Gradient Boosting
models via a meta-regressor.
Each model showcases unique methodologies and predictive strengths tailored to the
task at hand.
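The scikit-learn models above can be instantiated as in the following sketch. The hyperparameters shown are library defaults, not necessarily those used in the thesis; the XGBoost model, which comes from the separate xgboost package, is noted in a comment.

```python
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import (RandomForestRegressor, ExtraTreesRegressor,
                              GradientBoostingRegressor, VotingRegressor,
                              StackingRegressor)
from sklearn.svm import SVR

lr = LinearRegression()
dt = DecisionTreeRegressor(random_state=42)
rf = RandomForestRegressor(random_state=42)
et = ExtraTreesRegressor(random_state=42)
gb = GradientBoostingRegressor(random_state=42)
svm = SVR(kernel="linear")  # SVM with a linear kernel, as described above
# XGBRegressor from the xgboost package plugs in the same way.

# Voting averages the predictions of LR, DT, and RF.
voting = VotingRegressor([("lr", lr), ("dt", dt), ("rf", rf)])

# Stacking feeds the base models' predictions into a meta-regressor (LR).
stacking = StackingRegressor(
    estimators=[("lr", lr), ("dt", dt), ("rf", rf), ("et", et), ("gb", gb)],
    final_estimator=LinearRegression(),
)
```

Each of these exposes the same fit/predict interface, so the nine models can be trained and evaluated in a single loop.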
Linear Regression, Decision Tree, Random Forest, Extra Trees, Voting Regressor, Support
Vector Machine, Gradient Boosting: scikit-learn
Neural Network: Keras with TensorFlow backend
XGBoost: xgboost library
Model Architectures
Linear Regression (LR): Utilizes a simple linear model to establish a linear
relationship between input features and the target variable. No hidden layers are involved.
Decision Tree (DT) Regressor: Employs a decision tree-based model to make
predictions using a tree-like graph, consisting of nodes representing features, branches,
and leaf nodes containing the predicted values.
Random Forest (RF) Regressor: Comprises an ensemble of decision trees to enhance
prediction accuracy by averaging the outputs of multiple decision trees.
Neural Network (NN) Regressor: Implements a feedforward neural network with
four layers: an input layer with the number of features as neurons, followed by three
hidden layers having 2048, 256, and 64 neurons respectively, and an output layer with
one neuron for prediction.
Input shape:
The input shape is determined by the number of features in the training data. It is
specified as (X_train.shape[1],), i.e., the number of columns (features) in the input data.
Model compilation:
The model is compiled using the 'adam' optimizer and the loss function set to 'mean_squared_error'.
Training:
The model is trained using the fit method on the X_train and y_train data.
Training runs for 100 epochs with a batch size of 32.
The verbose parameter set to 0 means no output is printed during training.
Validation data (X_test, y_test) is used to evaluate the model's performance after each epoch.
Predictions:
After training, the model makes predictions on the test data (X_test), which are
stored in nn_predictions.
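The NN architecture and training setup described above can be sketched in Keras as follows. The synthetic data stands in for the preprocessed building dataset; layer sizes, optimizer, loss, epochs, and batch size match the description in the slides, while the relu activations are an assumption ("varying activations" is not specified per layer).

```python
import numpy as np
from tensorflow import keras

# Synthetic stand-in data; in the thesis, X_train/X_test come from the
# preprocessed building dataset.
rng = np.random.default_rng(0)
X_train, X_test = rng.random((20, 3)), rng.random((5, 3))
y_train, y_test = X_train.sum(axis=1), X_test.sum(axis=1)

# Input layer sized by the feature count, three hidden layers
# (2048, 256, 64), and one output neuron for the predicted cost.
model = keras.Sequential([
    keras.layers.Input(shape=(X_train.shape[1],)),
    keras.layers.Dense(2048, activation="relu"),
    keras.layers.Dense(256, activation="relu"),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mean_squared_error")

# 100 epochs, batch size 32, silent training, validated on the test split.
model.fit(X_train, y_train, epochs=100, batch_size=32, verbose=0,
          validation_data=(X_test, y_test))
nn_predictions = model.predict(X_test, verbose=0)
```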
XGBoost Regressor: Deploys an XGBoost-based ensemble model using gradient boosting that
sequentially builds multiple decision trees to predict the target variable.
Support Vector Machine (SVM) Regressor: Uses a support vector machine algorithm to find the
hyperplane that best separates the data points.
Extra Trees (ET) Regressor: Functions as an ensemble model using extremely randomized trees,
which are an extension of Random Forests.
Voting Regressor: Creates an ensemble by combining the predictions from multiple base estimators
(LR, DT, RF) and generates a final prediction based on the aggregated results.
Stacking Regressor: Combines predictions from multiple base estimators (LR, DT, RF, ET, GB) using
a meta-estimator (LR) to produce final predictions.
Performance Metrics for Models:
Root Mean Square Error: Measure of the average magnitude of the errors between
predicted and actual values.
Mean Absolute Error: Average absolute differences between predicted and actual values
in the dataset.
Mean Squared Error: Average squared differences between predicted and actual values
in the dataset.
Coefficient of Determination (R²): Quantifies the proportion of variance in the target
variable that is predictable from the independent variable.
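All four metrics are available through scikit-learn; a minimal sketch on toy predictions:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy actual vs. predicted values for illustration.
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 310.0])

mse = mean_squared_error(y_true, y_pred)   # 100.0
rmse = np.sqrt(mse)                        # 10.0
mae = mean_absolute_error(y_true, y_pred)  # 10.0
r2 = r2_score(y_true, y_pred)              # 0.985
```

A higher R² and lower MSE/MAE/RMSE indicate a better-fitting model, which is how the nine models are compared later.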
TOOLS AND EXPERIMENT
SETUP
Google Colaboratory platform
Python Libraries:
• scikit-learn, Keras (TensorFlow backend), and xgboost for machine learning
• Matplotlib for data visualization (plotting graphs)
RESULTS AND DISCUSSION
Filtering the Input factor from expert opinion
Factors with a high count for the aspect "Yes" are taken into consideration, and
factors with a high count for the aspect "No" are eliminated.
Additional factors were also given by Experts.
After eliminating and adding some factors, a new questionnaire was made.
The Building Attributes that are finally considered for further processing are listed.
The attributes listed include the name of the project, location of the building, type of
building, construction completion year, site/geographic conditions, access to the site,
site area, type of foundation, plinth area, floor area, floor height, number of floors,
number of columns, number of rooms, number of bathrooms, number of kitchens, number of
lifts/elevators, number of basements, use of building code, type of window, type of door,
type of flooring works, external painting, internal finishing, HVAC work, sanitary works,
electrical works, landscaping, and road works/river training works.
Scatter Plots
Scatter plots can reveal trends or patterns in the data.
For example, if the points form an upward or downward slope, it indicates a positive or
negative linear trend between the variables.
Outliers, which are data points that significantly deviate from the main cluster of points,
are easily identified in scatter plots.
This can help in identifying complex patterns and interactions in the data.
Scatter plots can also reveal non-linear relationships between variables.
If the points form a curve or some other non-linear shape, it suggests a non-linear
relationship.
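A minimal Matplotlib sketch of one such plot, using hypothetical feature values (the last point deliberately sits far from the cluster to mimic an outlier):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the figure is just saved
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical values for one numeric feature against total cost; an
# upward slope suggests a positive linear trend.
floor_area = np.array([100, 150, 200, 250, 300, 950])
total_cost = np.array([1.1, 1.6, 2.1, 2.5, 3.0, 2.2])  # NPR crore, toy values

plt.figure()
plt.scatter(floor_area, total_cost)
plt.xlabel("Floor Area (sqm)")
plt.ylabel("Total Cost (NPR crore)")
plt.title("Floor Area vs Total Cost")
plt.savefig("scatter_floor_area_vs_cost.png")
plt.close()
```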
Normal Distribution Graph
A normal distribution graph describes how data is distributed when many independent,
random factors contribute to an outcome.
Deviations from the normal distribution can indicate outliers or anomalies in data.
Detecting outliers is crucial in data analysis.
This distribution appears with a tail stretching towards the right side of the curve. It
indicates that the data has a longer right tail compared to the left side.
In such cases, the mean tends to be larger than the median, and most data points cluster
towards the left side.
A tail on the right side of a distribution indicates a right-skewed pattern, and outliers
situated near this tail represent unusually high values that might have a significant impact
on statistical measures and require thorough examination during data analysis.
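The right-skew diagnosis above can be checked numerically; this sketch uses a toy sample (most values clustered low, a long right tail) to show that the skewness statistic is positive and the mean exceeds the median in such data:

```python
import pandas as pd

# Toy right-skewed sample standing in for the cost data described above.
values = pd.Series([10, 12, 13, 14, 15, 16, 18, 20, 60, 95])

skewness = values.skew()  # positive for a right-skewed distribution
mean_exceeds_median = values.mean() > values.median()  # True when right-skewed
```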
Data Preprocessing Results
Histograms were plotted for categorical features to find whether values are unique or
follow some order.
Calculating the correlation matrix for the DataFrame:
The correlation matrix shows how each numerical column in the DataFrame is related to
every other numerical column by calculating Pearson correlation coefficients.
The Pearson correlation coefficient ranges from -1 (perfect negative correlation) to 1
(perfect positive correlation), with 0 indicating no linear correlation.
Positive values indicate a positive correlation, while negative values indicate a
negative correlation.
It is useful for feature selection or understanding the data’s pattern.
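In pandas this is a single call; the columns and values below are toy stand-ins for the building attributes:

```python
import pandas as pd

# Toy numeric columns echoing two of the building attributes; floor area
# and plinth area move almost in lockstep, so their correlation is ~1.
df = pd.DataFrame({
    "Floor Area(sqm)": [100, 200, 300, 400],
    "Plinth Area(sqm)": [90, 185, 290, 395],
    "Number of Floors": [1, 4, 2, 3],
})

corr = df.corr(method="pearson")  # every entry lies in [-1, 1]
```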
Dropping Plinth Area(sqm) and Number of Bathrooms:
These two features are being removed because they exhibit a
high correlation with other variables (’Floor Area(sqm)’ and
’Number of Rooms’ respectively) beyond a predefined
threshold of 0.70.
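A common way to apply such a threshold is to scan the upper triangle of the absolute correlation matrix and drop one feature from each highly correlated pair. The sketch below uses toy columns (the second nearly duplicates the first), not the thesis dataset:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Floor Area(sqm)": [100, 200, 300, 400],
    "Plinth Area(sqm)": [90, 185, 290, 395],   # nearly duplicates floor area
    "Number of Floors": [1, 4, 2, 3],
})

corr = df.corr().abs()
# Keep only the upper triangle so each feature pair is tested once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.70).any()]
df_reduced = df.drop(columns=to_drop)
```

Here only the near-duplicate column exceeds the 0.70 threshold and is removed, mirroring how 'Plinth Area(sqm)' and 'Number of Bathrooms' were dropped in the thesis.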
RESULTS OF MODELS
IMPLEMENTATION
Regarding the comparison of the models, the
Decision Tree, Random Forest, ExtraTree, Voting,
and Stacking models exhibit relatively better
performance in terms of MSE, MAE, RMSE, and
R2.
CONCLUSION
Regarding the datasets, the buildings used were 12 educational, 3 commercial, 6 hospital,
18 residential, 13 public, 17 official, and 3 hotel buildings, having 0 to 2 basements
and costs above NPR 1 crore.
The input features were taken from the literature review, and validated by expert opinion.
After pilot testing, a survey questionnaire was distributed among contractors and
consultants.
Data preprocessing helps to clean data.
Missing values were substituted with the mean for numeric features and the mode for
categorical features.
By analyzing the correlation heat map, highly correlated features were dropped.
The final dataset is divided into train and test sets in the ratio of 80:20.
The nine models were implemented, and the mean absolute error, mean squared error,
and R² values were recorded for evaluation.
The Decision Tree, Random Forest, ExtraTree, Voting, and Stacking models exhibit
relatively better performance in terms of MSE, MAE, RMSE, and R2.
Among these, the Decision Tree, ExtraTree, and Voting models demonstrate
particularly strong performance across multiple metrics.
The Decision Tree or ExtraTree model is considered the best choice based on the
provided metrics, as they seem to have lower errors and higher R2 values compared to
other models.
REFERENCES
Akalya, K. R. (2018). Minimizing the cost of construction materials through optimization
techniques. IOSR Journal of Engineering.
Atapattu, C. N. (2022, November). Statistical cost modelling for preliminary stage cost estimation of infrastructure
projects. Earth and Environmental Science, 1101(5), 052031.
Badawy, M. (2020). A hybrid approach for a cost estimate of residential buildings in Egypt at the early stage. Asian
Journal of Civil Engineering, 21(5), 763-774.
Badra, I. B. (n.d.). Conceptual Cost Estimate of Buildings Using Regression Analysis In Egypt. 17(5).
Beltman, J. F. (2021). Predicting construction costs in the program phase of the construction process: a machine
learning approach. Bachelor's thesis, University of Twente.
(more in report...)
THANK YOU!