Presentation - Final Thesis

The document presents a comparative study of machine learning algorithms for early cost estimation of building projects in Nepal. It discusses traditional cost estimation methods and the potential of machine learning models to provide more accurate predictions. Various machine learning algorithms are implemented and compared, including linear regression, decision trees, random forests, neural networks and others.


A COMPARATIVE STUDY OF MACHINE LEARNING ALGORITHMS FOR EARLY COST ESTIMATION OF BUILDING PROJECTS IN NEPAL

FINAL THESIS PRESENTATION
M.Sc. in Construction Management
Department of Civil Engineering, Pulchowk Campus, IOE, TU, Nepal
21st November, 2023

PRESENTED BY
ANJULI SAPKOTA (078/MSCOM/002)

SUPERVISED BY
Er. Samrakshya Karki
Building Design Authority Pvt. Ltd.
OUTLINE
Introduction
Problem Statement
Objectives
Literature Review
Research Methodology
Results
Conclusion
INTRODUCTION
 Predicting construction expenses is crucial in the early phases of a building project [2].
 Cost is seen as a standard indicator of the resources used on a project [3].
 Quantity Rate Analysis is the primary conventional method for estimating costs that is commonly utilized [4].
 Despite the growing popularity of machine learning in various industries, its application to the construction sector in Nepal has been relatively limited.
 Traditional cost estimation methods might not fully account for the complexities of construction projects, making machine learning an appealing option for developing more accurate cost estimation models.
 Accurate cost estimation can lead to potential cost savings and improved time efficiency during project execution, making it an attractive proposition for the construction industry in Nepal.

PROBLEM STATEMENT
 Construction cost estimation relies on the knowledge of human experts and engineers, whose experience is frequently not verified or documented.
 Incorrect cost estimation causes a variety of issues, including modification orders and delays in the construction process [5].
 Traditional technologies are unable to process and evaluate the vast volume of data generated by the construction sector, resulting in the loss of a significant amount of data [7].
OBJECTIVES
Main Objective:
To compare the performance of different machine learning algorithms in estimating the preliminary costs of building construction projects in Nepal.

Specific Objectives:
To identify the most significant features/inputs for cost prediction models of buildings.
To develop and compare the performance of different machine learning models.
LITERATURE REVIEW
 Sonmez (2004) found that regression models typically require fewer model parameters than neural networks, which can lead to better prediction performance when the relationships between the variables are well specified.

 Cho (2013) showed that the artificial neural network model had a lower error rate than the multiple regression model for projected building costs.

 Kim G. H. (2013) used 197 cases for model construction and validation and the remaining 20 instances for testing, and discovered that the NN model provided more accurate estimation results than the RA and SVM models.

 Badawy (2020) conducted research using statistics from 174 actual residential projects in Egypt. The hybrid model's mean absolute percentage error was 10.64%, lower than that of the ANN model and the regression models.

More details in the report.
RESEARCH METHODOLOGY
Topic Selection

The topic "A Comparative Study of Machine Learning Algorithms for Early Cost Estimation of Building Projects in Nepal" was selected for the thesis because of its profound relevance and practicality within the context of the Nepalese construction industry.

By learning from data on past projects, computers can make fast and accurate estimates, whereas relying on software tools and human expertise alone is tedious and time-consuming.
Expert Opinion

The input factors were gathered from the literature review.

The questionnaire was filled out by 5 experts, including both contractors and consultants.

The following criteria were used to qualify as an expert:

 More than 12 years of experience in the construction field.
 A relevant educational background.
 Currently working as a consultant or contractor.
Pilot Testing

 A pilot test was conducted with 3 respondents to check the clarity and comprehensibility of the questionnaire.
 The respondents easily understood the questionnaire, so there was no difficulty in filling it out.
 The minimum time taken to fill out the questionnaire was around 10 minutes, and the maximum was almost 20 minutes.
 The summary of the respondents of the pilot test is given in the table
below:

Data Collection

 Building projects' structural data, along with the final cost of each project, were gathered from various construction firms and consultancies.
 Data were collected from the Department of Urban Development and Building Construction (DUDBC), consultancies, and contractors.
 The data collection process was very tough, as the bidding amount is confidential for contractors.
Data Preprocessing
 Read the Excel file into a DataFrame (df).
 Separated numerical and categorical features (excluding the total cost of the project).
 Counted the number of numerical features.
 Plotted individual scatter plots for numerical features to visualize the data and identify outliers.
 Plotted scatter plots between numerical features and the total cost of the project to analyze their relationship.
 Plotted a normal distribution graph for numerical features to analyze the distribution of the data.
 Calculated the mean, median, mode, and variance for individual numeric features.
 Replaced missing values in numerical features with the mean, median, and mode in turn, saving each result in a separate Excel sheet.
 Plotted a normal distribution graph for each imputed version of the data.
 Based on the analysis of variance, replaced missing values with the mean, as the data show less variance when imputed with the mean value.
 Counted the number of missing categorical features.
 Plotted histograms for categorical features to find whether values are unique or in some order.
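The mean-versus-median imputation comparison described above can be sketched in pandas. The column names and values here are illustrative, not the actual thesis dataset:

```python
import pandas as pd

# Toy numeric columns with missing values (illustrative only).
df = pd.DataFrame({
    "Floor Area(sqm)": [120.0, 150.0, None, 200.0, 135.0],
    "Number of Floors": [2.0, 3.0, 4.0, None, 2.0],
})

# Impute one copy with the column means and another with the medians.
mean_imputed = df.fillna(df.mean())
median_imputed = df.fillna(df.median())

# Mean imputation leaves the column mean unchanged and adds zero-deviation
# points, so its post-imputation variance is never higher than the median's.
print(mean_imputed.var())
print(median_imputed.var())
```

Comparing the two `var()` outputs per column reproduces the decision rule used in the study: pick the imputation that leaves the data less dispersed.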

 Replaced missing values in categorical features with the mode (i.e., the most frequent value) and plotted the histograms again.
 Encoded the categorical features using one-hot encoding, since the categories represent distinct names without any inherent order.
 One-hot encoding turns all categorical values into numeric features, making the data suitable for further analysis.
 There are no missing categorical data now.
 The final dataset was then split into training and testing sets in an 80:20 ratio for use in the models.
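A minimal sketch of the encoding and splitting steps, with made-up column names and values standing in for the real dataset:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the cleaned dataset (illustrative values).
df = pd.DataFrame({
    "Type of Building": ["Residential", "Hospital", "Official", "Hotel"] * 5,
    "Floor Area(sqm)": [120, 450, 300, 500] * 5,
    "Total Cost": [1.2e7, 9.5e7, 4.0e7, 8.0e7] * 5,
})

# One-hot encode the categorical feature: categories are distinct names
# with no inherent order, so each becomes its own 0/1 column.
encoded = pd.get_dummies(df, columns=["Type of Building"])

X = encoded.drop(columns="Total Cost")
y = encoded["Total Cost"]

# 80:20 train/test split, as used in the study.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```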

Models Implementation
 In this work, models such as Linear Regression, the Decision Tree method, the Random Forest method, Artificial Neural Networks, Support Vector Machine, the XGBoost method, the Extra Trees method, Voting Regression, and the Stacking method are implemented.
 Linear Regression (LR) is a basic regression model used to establish a linear relationship between independent and dependent variables.
 Decision Tree Regressor (DT) partitions data into subsets based on features for predictions.
 Random Forest Regressor (RF), an ensemble method, combines predictions from multiple decision trees.
 The Neural Network (NN) comprises several dense layers with varying activations, trained using the 'Adam' optimizer for 100 epochs to minimize mean squared error.
 XGBoost Regressor (XGB) is a gradient-boosting algorithm that combines weak learners to boost predictive performance.
 Support Vector Machine (SVM) with a linear kernel is used for regression.
 Extra Trees Regressor (ET) is similar to Random Forest but employs random thresholds for feature splitting.
 Voting Regressor (Voting) combines the LR, DT, and RF models.
 Stacking Regressor (Stacking) combines the LR, DT, RF, ET, and Gradient Boosting models via a meta-regressor.
 Each model showcases a unique methodology and predictive strengths tailored to the task at hand.
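The scikit-learn members of this model zoo can be instantiated as a sketch like the following (defaults everywhere except the linear SVM kernel, which the slides state explicitly; the XGBoost regressor is left out of the sketch because it lives in the separate xgboost package):

```python
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import (RandomForestRegressor, ExtraTreesRegressor,
                              GradientBoostingRegressor)
from sklearn.svm import SVR

# One instance per model family described above.
models = {
    "LR": LinearRegression(),
    "DT": DecisionTreeRegressor(random_state=42),
    "RF": RandomForestRegressor(random_state=42),
    "ET": ExtraTreesRegressor(random_state=42),
    "GB": GradientBoostingRegressor(random_state=42),
    "SVM": SVR(kernel="linear"),  # linear kernel, as stated in the slides
}
```

Each entry exposes the same `fit`/`predict` interface, so the whole dictionary can be looped over for training and evaluation.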

Libraries used:
 Linear Regression, Decision Tree, Random Forest, Extra Trees, Voting Regressor, Support Vector Machine, Gradient Boosting: scikit-learn
 Neural Network: Keras with TensorFlow backend
 XGBoost: xgboost library

Model Architectures
 Linear Regression (LR): Utilizes a simple linear model to establish a linear relationship between input features and the target variable. No hidden layers are involved.
 Decision Tree (DT) Regressor: Employs a decision-tree-based model to make predictions using a tree-like graph consisting of nodes representing features, branches, and leaf nodes containing the predicted values.
 Random Forest (RF) Regressor: Comprises an ensemble of decision trees that enhances prediction accuracy by averaging the outputs of multiple decision trees.
 Neural Network (NN) Regressor: Implements a feedforward neural network with four layers: an input layer with one neuron per feature, followed by three hidden layers of 2048, 256, and 64 neurons respectively, and an output layer with one neuron for prediction.
Input shape:

 The input shape is determined by the number of features in the training data. It is specified as (X_train.shape[1],), which indicates the number of columns, or features, in the input data.

Model compilation:

 The model is compiled using the 'adam' optimizer with the loss function set to 'mean_squared_error'.

Training:

 The model is trained using the fit method on the X_train and y_train data.
 The training is performed for 100 epochs with a batch size of 32.
 The verbose parameter is set to 0, so no output is printed during training.
 Validation data (X_test, y_test) is used to validate the model's performance after each epoch.

Predictions:

 After training, the model is used to make predictions on the test data (X_test), and the predictions are stored in nn_predictions.
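Putting these pieces together, a minimal Keras sketch of the network could read as follows; the random stand-in data and the reduced epoch count are assumptions made to keep the sketch small, while the layer sizes, optimizer, loss, batch size, and verbosity follow the description above:

```python
import numpy as np
from tensorflow import keras

# Random stand-in data; the real X_train/X_test come from the 80:20 split.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(64, 10)), rng.normal(size=64)
X_test, y_test = rng.normal(size=(16, 10)), rng.normal(size=16)

# Architecture as described: hidden layers of 2048, 256, and 64 neurons,
# then a single output neuron for the predicted cost.
model = keras.Sequential([
    keras.layers.Input(shape=(X_train.shape[1],)),
    keras.layers.Dense(2048, activation="relu"),
    keras.layers.Dense(256, activation="relu"),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mean_squared_error")

# The study trains for 100 epochs; only 2 here to keep the sketch fast.
model.fit(X_train, y_train, epochs=2, batch_size=32, verbose=0,
          validation_data=(X_test, y_test))
nn_predictions = model.predict(X_test, verbose=0)
```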
 XGBoost Regressor: Deploys an XGBoost-based ensemble model using gradient boosting, which sequentially builds multiple decision trees to predict the target variable.
 Support Vector Machine (SVM) Regressor: Uses a support vector machine algorithm to find the hyperplane that best fits the data points within a margin of tolerance.
 Extra Trees (ET) Regressor: Functions as an ensemble model using extremely randomized trees, which are an extension of Random Forests.
 Voting Regressor: Creates an ensemble by combining the predictions from multiple base estimators (LR, DT, RF) and generates a final prediction based on the aggregated results.
 Stacking Regressor: Combines predictions from multiple base estimators (LR, DT, RF, ET, GB) using a meta-estimator (LR) to produce final predictions.
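The two ensemble wrappers can be sketched with scikit-learn as below; the synthetic regression problem and the reduced tree counts are assumptions to keep the sketch fast, while the base-estimator composition follows the description above:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (VotingRegressor, StackingRegressor,
                              RandomForestRegressor, ExtraTreesRegressor,
                              GradientBoostingRegressor)
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Small synthetic regression problem in place of the building dataset.
X, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=0)

# Voting: averages the predictions of LR, DT, and RF.
voting = VotingRegressor([
    ("lr", LinearRegression()),
    ("dt", DecisionTreeRegressor(random_state=0)),
    ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
])

# Stacking: LR, DT, RF, ET, and GB feed a linear meta-estimator.
stacking = StackingRegressor(
    estimators=[
        ("lr", LinearRegression()),
        ("dt", DecisionTreeRegressor(random_state=0)),
        ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
        ("et", ExtraTreesRegressor(n_estimators=50, random_state=0)),
        ("gb", GradientBoostingRegressor(random_state=0)),
    ],
    final_estimator=LinearRegression(),
)

voting.fit(X, y)
stacking.fit(X, y)
```

The design difference: voting simply averages base predictions, while stacking learns how to weight them by fitting the meta-estimator on cross-validated base predictions.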

Performance Metrics for the Models:

 Root Mean Square Error (RMSE): A measure of the average magnitude of the errors between predicted and actual values.
 Mean Absolute Error (MAE): The average absolute difference between predicted and actual values in the dataset.
 Mean Squared Error (MSE): The average squared difference between predicted and actual values in the dataset.
 Coefficient of Determination (R²): Quantifies the proportion of variance in the target variable that is predictable from the independent variables.
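All four metrics are one-liners with scikit-learn; the actual/predicted values below are illustrative:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Illustrative actual vs. predicted costs (arbitrary units).
y_true = np.array([10.0, 12.0, 15.0, 20.0, 18.0])
y_pred = np.array([11.0, 11.5, 14.0, 21.0, 17.0])

mse = mean_squared_error(y_true, y_pred)   # average squared error
rmse = np.sqrt(mse)                        # back in the target's units
mae = mean_absolute_error(y_true, y_pred)  # average absolute error
r2 = r2_score(y_true, y_pred)              # share of variance explained
print(f"MSE={mse:.3f}  RMSE={rmse:.3f}  MAE={mae:.3f}  R2={r2:.3f}")
```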

TOOLS AND EXPERIMENT SETUP

 Google Colaboratory platform
 Python libraries:
• scikit-learn, Keras (TensorFlow backend), and xgboost for machine learning
• Matplotlib for data visualization (to plot graphs)
RESULTS AND DISCUSSION
Filtering the Input Factors from Expert Opinion
 Aspects with a high numeric value for "Yes" were taken into consideration, and aspects with a high value for "No" were eliminated.
 Additional factors were also suggested by the experts.
 After eliminating and adding some factors, a new questionnaire was made.
 The building attributes finally considered for further processing are listed below.
 The attributes include the name of the project, location of the building, type of building, construction completion year, site/geographic conditions, access to the site, site area, type of foundation, plinth area, floor area, floor height, number of floors, number of columns, number of rooms, number of bathrooms, number of kitchens, number of lifts/elevators, number of basements, use of building code, type of window, type of door, type of flooring works, external painting, internal finishing, HVAC work, sanitary works, electrical works, landscaping, and road works/river training works.
Scatter Plots
 Scatter plots can reveal trends or patterns in the data. For example, if the points form an upward or downward slope, this indicates a positive or negative linear trend between the variables.
 Outliers, which are data points that significantly deviate from the main cluster of points, are easily identified in scatter plots.
 Scatter plots can also reveal non-linear relationships between variables: if the points form a curve or some other non-linear shape, this suggests a non-linear relationship. This can help in identifying complex patterns and interactions in the data.
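A small Matplotlib sketch of such a plot, with made-up feature/cost data containing an upward trend and one deliberate outlier:

```python
import io

import matplotlib
matplotlib.use("Agg")  # headless backend so no display is required
import matplotlib.pyplot as plt
import numpy as np

# Illustrative feature-vs-cost data (not the thesis dataset).
rng = np.random.default_rng(1)
floor_area = rng.uniform(100, 500, size=40)
cost = 0.05 * floor_area + rng.normal(scale=2.0, size=40)
floor_area = np.append(floor_area, 450.0)  # outlier: large area but
cost = np.append(cost, 2.0)                # an implausibly low cost

fig, ax = plt.subplots()
ax.scatter(floor_area, cost)
ax.set_xlabel("Floor Area (sqm)")
ax.set_ylabel("Cost (arbitrary units)")
ax.set_title("Upward trend with one outlier")

buf = io.BytesIO()
fig.savefig(buf, format="png")  # render to memory instead of a file
```

The lone point far below the cluster is exactly the kind of outlier these plots make visible at a glance.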

Normal Distribution Graph
 A normal distribution graph describes how data are distributed when many independent, random factors contribute to an outcome.
 Deviations from the normal distribution can indicate outliers or anomalies in the data, and detecting outliers is crucial in data analysis.
 This distribution appears with a tail stretching towards the right side of the curve, indicating that the data have a longer right tail than left. In such cases, the mean tends to be larger than the median, and most data points cluster towards the left side.
 A tail on the right side of a distribution indicates a right-skewed pattern, and outliers situated near this tail represent unusually high values that might have a significant impact on statistical measures and require thorough examination during data analysis.
Data Preprocessing Results

 The data had missing values, so preprocessing had to be done.
 The mean, median, mode, and variance were calculated for individual numeric features.
 Missing values in numerical features were replaced with the mean, median, and mode in turn.
 Taking variance as the measure for deciding between mean and median imputation, imputation with the mean showed a lower variance than imputation with the median.
 Lower variance signifies less dispersion of data points from the mean value, implying that the dataset tends to be more tightly clustered around the mean.
 Comparing the variance values across attributes between the raw data, the mean-imputed data, and the median-imputed data, mean imputation tends to preserve the original variability of the dataset better than median imputation.
 The decision is also justified by the table.

 Histograms were plotted for the categorical features to find whether values are unique or in some order.

Figure 5.7.1: Histogram for Categorical Features

 Replacing missing values in categorical features with the mode (the most frequent value) is a common approach and often a reasonable strategy, especially when dealing with categorical data.
 Imputing missing categorical values with the mode preserves the overall distribution of the categories and minimizes the potential impact of missing data on the analysis.
 The values in the "Total final cost of the project including VAT" column were replaced with their natural logarithms, as shown in Figure 5.10.
 Taking the logarithm can normalize the distribution and reduce the impact of extreme values, making the data more suitable for analysis.
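The log transform is a one-liner with NumPy; the cost values below are illustrative, with one extreme project to show the effect on skewness:

```python
import numpy as np
import pandas as pd

# Right-skewed costs: four typical projects and one extreme value.
col = "Total final cost of the project including VAT"
df = pd.DataFrame({col: [1.2e7, 1.5e7, 2.0e7, 3.5e7, 9.0e8]})

# Replace the target with its natural logarithm to tame the long right tail.
df["log_cost"] = np.log(df[col])

# The log transform shrinks the skewness of the distribution.
print(df[[col, "log_cost"]].skew())
```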

Calculating the correlation matrix for the DataFrame:
 The correlation matrix shows how each numerical column in the DataFrame is related to every other numerical column by calculating Pearson correlation coefficients.
 The Pearson correlation coefficient ranges from -1 (perfect negative correlation) to 1 (perfect positive correlation), with 0 indicating no linear correlation.
 Positive values indicate a positive correlation, while negative values indicate a negative correlation.
 It is useful for feature selection and for understanding patterns in the data.

Dropping Plinth Area (sqm) and Number of Bathrooms:
 These two features are removed because they exhibit a high correlation with other variables ('Floor Area(sqm)' and 'Number of Rooms', respectively) beyond a predefined threshold of 0.70.
 Due to their strong correlation with other variables, it is assumed that they might not provide additional significant information for the analysis or modeling and could potentially lead to multicollinearity issues.

Dropping Construction Year:
 This feature is dropped because its correlation with the target variable ('Total final cost of the project including VAT') is lower than a specified threshold of 0.70, specifically 0.091.
 A correlation below this threshold suggests a weak linear relationship between 'Construction Year' and the target variable, which might not significantly contribute to explaining the variability in the target.
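This correlation-based filtering can be sketched with pandas; the toy data below is constructed to mimic the correlations described (plinth area tracks floor area, bathrooms track rooms, and construction year is unrelated noise), and is not the thesis dataset:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 50
floor_area = rng.uniform(100, 1000, n)
rooms = rng.integers(2, 30, n).astype(float)
df = pd.DataFrame({
    "Floor Area(sqm)": floor_area,
    "Plinth Area(sqm)": 0.9 * floor_area + rng.normal(scale=10, size=n),
    "Number of Rooms": rooms,
    "Number of Bathrooms": 0.5 * rooms + rng.normal(scale=0.5, size=n),
    "Construction Year": rng.integers(2005, 2023, n).astype(float),
    "Total final cost of the project including VAT":
        5e4 * floor_area + 1e5 * rooms + rng.normal(scale=1e5, size=n),
})

corr = df.corr()
target = "Total final cost of the project including VAT"
to_drop = []

# Drop a feature when it is correlated above 0.70 with a kept feature.
pairs = {"Plinth Area(sqm)": "Floor Area(sqm)",
         "Number of Bathrooms": "Number of Rooms"}
for feature, partner in pairs.items():
    if abs(corr.loc[feature, partner]) > 0.70:
        to_drop.append(feature)

# Drop a feature when its correlation with the target is below 0.70.
if abs(corr.loc["Construction Year", target]) < 0.70:
    to_drop.append("Construction Year")

reduced = df.drop(columns=to_drop)
```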
 Cleaning the 'Location of Building' column, since the dataset is limited to certain places only: each record was categorized as either inside or outside Kathmandu.
 Since 'Location of Building' was cleaned into inside/outside the valley and 'Type of foundation' was cleaned into individual foundation types, these two original columns can be dropped.
 The features were encoded using one-hot encoding, as shown in Figure 5.13.
 With one-hot encoding, the categorical variables have been transformed into a numerical format.
 All the variables that are either highly correlated with each other or weakly correlated with the target variable, the 'Total final cost of the project including VAT', were dropped.

RESULTS OF MODEL IMPLEMENTATION
 Comparing the models, the Decision Tree, Random Forest, Extra Trees, Voting, and Stacking models exhibit relatively better performance in terms of MSE, MAE, RMSE, and R².
 Among these, the Decision Tree, Extra Trees, and Voting models demonstrate particularly strong performance across multiple metrics.
 The Decision Tree or Extra Trees model is considered the best choice based on the reported metrics, as they show lower errors and higher R² values than the other models.

CONCLUSION
 The dataset comprised 12 educational, 3 commercial, 6 hospital, 18 residential, 13 public, 17 official, and 3 hotel buildings, with 0 to 2 basements and costs above 1 crore.
 The input features were taken from the literature review and validated by expert opinion.
 After pilot testing, a survey questionnaire was distributed among contractors and consultants.
 Data preprocessing was used to clean the data.
 Missing values were substituted with the mean for numeric features and the mode for categorical features.
 By analyzing the correlation heat map, the unwanted features were dropped.
 The final dataset was divided into train and test sets in an 80:20 ratio.
 Nine models were implemented, and the mean absolute error, mean squared error, and R² value were recorded for evaluation.
 The Decision Tree, Random Forest, Extra Trees, Voting, and Stacking models exhibit relatively better performance in terms of MSE, MAE, RMSE, and R².
 Among these, the Decision Tree, Extra Trees, and Voting models demonstrate particularly strong performance across multiple metrics.
 The Decision Tree or Extra Trees model is considered the best choice based on the reported metrics, as they show lower errors and higher R² values than the other models.

REFERENCES
Akalya, K. R. (2018). Minimizing the cost of construction materials through optimization techniques. IOSR Journal of Engineering.

Atapattu, C. N. (2022, November). Statistical cost modelling for preliminary stage cost estimation of infrastructure projects. Earth and Environmental Science, 1101(5), 052031.

Badawy, M. (2020). A hybrid approach for a cost estimate of residential buildings in Egypt at the early stage. Asian Journal of Civil Engineering, 21(5), 763-774.

Badra, I. B. (n.d.). Conceptual cost estimate of buildings using regression analysis in Egypt. 17(5).

Beltman, J. F. (2021). Predicting construction costs in the program phase of the construction process: a machine learning approach. Bachelor's thesis, University of Twente.

(more in report...)
THANK YOU!
