Final Thesis
INSTITUTE OF ENGINEERING
PULCHOWK CAMPUS
by
Anjuli Sapkota
A THESIS
SUBMITTED TO THE DEPARTMENT OF CIVIL ENGINEERING IN
PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE
OF MASTER IN CONSTRUCTION MANAGEMENT
December 2023
COPYRIGHT©
The author has agreed that the library, Department of Civil Engineering, Institute
of Engineering, Pulchowk Campus, may make this thesis freely available for inspection.
Moreover, the author has agreed that permission for extensive copying of this thesis work
for scholarly purposes may be granted by the professor(s) who supervised the thesis work
recorded herein or, in their absence, by the Head of the Department wherein this thesis was
done. It is understood that recognition will be given to the author of this thesis and to
the Department of Civil Engineering, Pulchowk Campus, in any use of the material of this
thesis. Copying, publication, or any other use of this thesis for financial gain without the approval
of the Department of Civil Engineering, Institute of Engineering, Pulchowk Campus, and the
author's written permission is prohibited.
Request for permission to copy or to make any use of the material in this thesis in whole or
part should be addressed to:
.....................................................................
Head of Department of Civil Engineering
Institute of Engineering, Pulchowk Campus
Pulchowk, Lalitpur, Nepal
DECLARATION
I hereby declare that the work submitted herein for the degree of Master of Science in
Construction Management (MSCoM) at IOE, Pulchowk Campus, entitled "A Comparative Study of the Performance of Different
Machine Learning Algorithms In Estimating the Preliminary Costs of Building
Construction Projects Specifically In Nepal", is my original work and has not been
previously submitted by me at any university for any academic award.
I authorize IOE, Pulchowk Campus to lend this thesis to other institutions or individuals for
scholarly research.
......................
Anjuli Sapkota
078MSCoM002
RECOMMENDATION
The undersigned certify that we have read and recommended to the Department of Civil
Engineering for acceptance a thesis entitled "A Comparative Study of the Performance
of Different Machine Learning Algorithms In Estimating the Preliminary Costs
of Building Construction Projects Specifically In Nepal", submitted by Anjuli
Sapkota in partial fulfillment of the requirements for the award of the degree of "Master of
Science in Construction Management".
..........................................................................
Supervisor: Er. Samrakshya Karki
Building Design Authority (P). Ltd.
Project Planning Department
..................................................................................
External Examiner: Er. Krishna Singh Basnet
Former Executive Director
Road Board Nepal
.................................................................................................
Program Coordinator: Asst. Professor Mahendra Raj Dhital
M.Sc. in Construction Management
Department of Civil Engineering
IOE, Pulchowk Campus
ACKNOWLEDGEMENT
I express my deep sense of gratitude to the Department of Civil Engineering, IOE, Pulchowk
Campus for providing me with the opportunity to work on this thesis as part of the coursework
for my Master of Science in Construction Management (MSCoM). I owe a special debt of gratitude to
the program coordinator Mahendra Raj Dhital, Associate Professor, Institute of Engineering, Pulchowk
Campus, Tribhuvan University, and to my thesis supervisor Er. Samrakshya Karki, Building
Design Authority, for their encouragement, suggestions, and continuous guidance throughout the thesis.
My sincere thanks also go to my mother, Mrs. Bishnu Devi Sapkota, and my husband,
Er. Bishwas Pokharel, Masters student in Computer Systems and Knowledge Engineering,
for their continuous guidance and motivation.
Any further suggestions or criticisms for the improvement will be highly appreciated.
Sincerely,
Anjuli Sapkota
078MSCoM002
ABSTRACT
Contents
COPYRIGHT iii
DECLARATION iv
RECOMMENDATION v
ACKNOWLEDGEMENT i
ABSTRACT ii
Contents iii
List of Figures v
1 INTRODUCTION 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2 LITERATURE REVIEW 4
3 THEORETICAL BACKGROUND 9
4 METHODOLOGY 21
6 CONCLUSION 48
7 FUTURE RECOMMENDATIONS 49
REFERENCES 53
APPENDIX A 54
List of Figures
3.10 Stacking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.1 Workflow diagram of the cost estimation using machine learning models . . . 21
5.10 Replacing final cost with their natural logarithms . . . . . . . . . . . . . . . 40
5.16 R square . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
List of Tables
CHAPTER 1
INTRODUCTION
1.1 Background
Building construction projects play a crucial part in the Nepalese economy. Villages in Nepal
are more likely to have adobe constructions, wooden-framed homes, and rubble stone masonry
structures, while the bulk of metropolitan areas and suburbs have stone or brick masonry
[1]. Twenty percent (20%) of buildings are reinforced concrete (RC) structures. Reasonably
predicting construction expenses is crucial in the early phases of a building project [2]. Cost
is seen as a standard indicator of the resources used on a project [3]. Cost estimation is a
critical aspect of any construction project, as it provides an initial budgetary framework
and assists stakeholders in making informed decisions. Accurate cost estimation can lead to
potential cost savings and improved time efficiency during project execution, making it an
attractive proposition for the construction industry in Nepal.
In Nepal, like many other developing countries, construction projects often face budget
overruns and delays due to inaccurate early cost estimates. Nepal’s construction sector
faces specific challenges such as limited resources, topographical constraints, and varying
socioeconomic conditions across the region. Despite the growing popularity of machine
learning in various industries, its application to the construction sector in Nepal has been
relatively limited. Machine learning methods forecast building costs using historical
records [4]. Traditional cost estimation methods might not fully account for these
complexities, making machine learning an appealing option for developing context-aware and
more accurate cost estimation models. Rate analysis of quantities is the primary conventional
method commonly utilized for estimating costs [5].
cost. In actuality, it takes engineers several years to acquire the skills required to carry out
the cost estimation procedure. The fundamental issue here is that the engineers' experience
is frequently not verified or documented. This skill is therefore vulnerable to subjectivity
and bias. On the other hand, incorrect cost estimation causes a variety of issues,
including modification orders and delays in the construction process. These two elements,
the inability to perform cost estimating manually and the consequences of inaccurate cost
prediction, encourage researchers and construction businesses to look for new creative
solutions to the cost estimation challenge [6]. Traditional technologies are unable to process
and evaluate the vast volume of data generated by the construction sector, resulting in the
loss of a significant amount of data. The availability of drawings and information
is quite limited during the early stages of project development; thus, cost estimation plays a
crucial role in the investors' decision-making [7]. Due to limited project information accessible
in the early stages, numerous methodologies have been developed to predict construction
costs, and one of them is machine learning [8]. Sometimes, even after viewing the data, we
are unable to evaluate or extrapolate the information; machine learning is used in that
situation [9]. With the ability to identify patterns and relationships between input factors and
cost, machine learning (ML) technologies can help cost prediction in an early phase of
design [10]. A more precise estimation of construction costs can be made as more cost-related
data becomes accessible [11]. For this, data is gathered first; machine learning models
can then automatically identify relevant features from a large data set, allowing the model to
focus on the most important factors influencing cost, and can capture nonlinear relationships
between variables that traditional methods may struggle to handle. Supervised learning
algorithms, such as linear regression and decision trees, can be trained on historical project data
to predict costs based on various inputs, and ML models can continuously learn and adapt as
new data becomes available, leading to continuous improvement in cost estimation accuracy
over time.
1.3.1 Specific Objectives
• To identify the most significant features/input for cost prediction models of buildings.
The research focuses on nine machine learning algorithms (Artificial Neural Networks, Regression
Analysis, Support Vector Machine, Decision Tree, XGBoost, Random Forest, Extra Tree,
Voting Regression, and Stacking) for preliminary cost estimation only. The study is focused
on projects whose estimated cost is above one crore rupees. The research selected relevant
features that can impact cost estimation accuracy by identifying factors from the literature
review, further validated by expert opinion. Data was collected from different residential,
commercial, hospital, office, and public buildings.
CHAPTER 2
LITERATURE REVIEW
The building is defined as any structure made of any material, whether or not it is inhabited
by people, and which includes the foundation, plinth, walls, floors, roofs, and building
services. Tents, tarpaulin shelters, and other transient structures are not to be regarded as
buildings. All governmental, non-governmental, and private structures that offer general
public services, amenities, opportunities, and products are referred to as public buildings.
Based on occupancy, buildings are classified as residential, assembly, educational, hospitals
and clinics, commercial, office, industrial, and storage. Based on storey count and height,
buildings are classified as General Buildings (1 to 5 storeys or below 16 m), Medium Rise
(6 to 8 storeys or 16 m to below 25 m), High Rise (9 to 39 storeys or 25 m to below 100 m),
and Skyscrapers (40 storeys and above or above 100 m) (NBC 206:2015). The Nepal National
Building Code (NBC), formulated in 1994, is implemented in buildings throughout Nepal.
The variables affecting the building project are Project Characteristics (Building Type, Num-
ber of Storeys, Number of Blocks, Project Complexity Representative value, Programmed
Duration, Original Cost Estimate), Procurement System (Functional Grouping, Payment
Method, Contract Conditions), Project Team Performance (Contractor, Design Team, Man-
agement Team), Client: Client Representatives Characteristics (Client Type, Client Priority,
Client Source of Finance, Client Characteristics Representative Value ), Contractor Charac-
teristics (Contractor Characteristics Representatives Value), Design Team Characteristics
(Design Team Characteristics Representative Value), External Conditions (External Condi-
tions Representative Value)[12].
Cost estimate provides a general concept of the cost of the work, allowing for the determination
of the project’s viability, or if it might be completed within the allocated budget. They are
mainly of three types [13]:
• Abstract estimate: solely contains the entire quantities of the items of work, rates
determined by the schedule or market values, and the project's overall cost.
• Revised estimate: includes updated quantities, specifications, and rates.
The research done by [2] showed that the artificial neural network model had a lower
error rate than the multiple regression model in predicting building costs. In this study, the
multiple regression model and an artificial neural network model were compared using cost
information kept by a provincial office of education on primary schools built between 2004
and 2007. A total of 96 historical data points were divided into 76 for building the models
and 20 for comparing the built regression model with the artificial neural network model.
The artificial neural network model was shown to be superior in terms of average error rate
and standard deviation by comparing the estimated values of the two models.
[8] used 197 cases for model construction and validation and the remaining 20 instances for
testing, and discovered that the NN model provided more accurate estimation results than the
RA and SVM models. As a result, it was decided that the NN model was best suited for
determining the cost of school construction projects.
A data collection with 530 historical cost records was employed by [14]. Compared to the CBR
or MRA models, the best NN model produced more precise estimates. The lengthy trial-and-
error procedure, however, made it difficult to find the optimal NN model. In comparison to
the other models, the CBR model was more effective concerning these tradeoffs, particularly
its clarity of explanation when calculating construction costs. The ability to update the
building cost model easily and maintain consistency in the variables contained are key aspects
of the model’s long-term use.
[15] described that, whereas conventional modeling techniques frequently fall short, ANN offers
solutions for difficult problems. For instance, ANN succeeds where many conventional modeling
techniques fail in capturing nonlinear and intricate interactions between the variables.
ANNs do, however, have their restrictions. They frequently require a specific set of inputs
and outputs and can only be trained for that problem. As a result, any modification that
calls for updating the network's architecture cannot be carried out automatically and must
instead involve human intervention.
[16] conducted research where 174 actual residential projects in Egypt provided the source of
the statistics. To come to an understanding of the crucial elements influencing early-stage cost
assessment, the Delphi method was employed. Regression techniques such as gamma, Poisson,
and multiple linear regression were utilized. Using multiple linear regression, the suggested
hybrid model was derived from the ANN model and the regression models. In comparison
to the ANN model and regression models, the hybrid model’s mean absolute percentage
error was 10.64%, which is lower. The hybrid model’s results show that it was effective at
estimating the cost of residential projects and that it would be helpful to decision-makers in
the construction sector.
The research done by [17] showed that the R² values of RF, SVM, and CatBoost were calculated
to be 0.900, 0.897, and 0.906, respectively. When the performances of the different models
were compared, the stacking model was the best among them.
[18] discovered that regression models, on the other hand, typically required fewer model
parameters than neural networks, which led to greater prediction performance if the relation-
ships between the variables were well stated. Comparison of the regression model’s results
with those from the neural network model may help to determine whether the regression
model needs nonlinear or interaction terms.
[19] used 10,000 parametric building configurations. Among the 13 ML regression algorithms
used, the Artificial Neural Network (ANN), Gradient Boosting, and XGBoost models appeared
to be the most suitable for estimating the building costs and the required resources, with an
accuracy of 99% and less than a second of training time.
Table 2.1: Machine Learning Models Used in Construction Cost Estimation

S.N. 7. (Badawy, 2020). Input parameters: Area of the floors, Number of floors, Type of external finishing, Type of interior finishing. Machine learning models: ANN model and regression models (multiple linear regression, polynomial regression, gamma regression, and Poisson regression).

S.N. 8. (E. Chandgude, 2020). Input parameters: Number of storeys, Number of basements, Floor area, Volume of concrete, Area of formworks, Weight of reinforcing steel. Machine learning models: Artificial Neural Network, Support Vector Machine.

S.N. 9. (Chandanshive, 2021). Input parameters: Ground floor area, Typical floor, No. of floors, Structural parking area, Quantity of elevator wall, Quantity of exterior wall, Quantity of exterior plaster, Area of flooring, No. of columns, Type of foundation, No. of householders. Machine learning models: SVM, Gradient Boosting.

S.N. 10. (Veliyampatt, 2021). Input parameters: Geographic conditions, Ground conditions, Type of foundation, Type of building, Market conditions, Design complexity, Quality of work, Changes in material, Unforeseen items/conditions. Machine learning models: Artificial Neural Networks, Fuzzy Inference System, and Regression Analysis.

S.N. 11. (Uyeol Park, 2022). Input parameters: Gross floor area, Building area, Building height, Number of floors, Number of basement floors, Number of parking spaces. Machine learning models: Bagging, Boosting, Stacking.
CHAPTER 3
THEORETICAL BACKGROUND
Artificial intelligence (AI) is a sub-field of computer science (CS), which implies the use of
computers and related technology to make a machine replicate or duplicate human behavior.
The goal of Artificial Intelligence is to create machines that can think like humans do,
including through learning, reasoning, and self-correction [20], [16]. Large amounts of data
from prior tenders are used by AI techniques, which then employ a self-learning process
to find patterns or links in the data sets. The identified relationships are not subject to
the subjectivity of estimators, and the utilization of AI methods reduces the effect of the
varying levels of expertise that estimators have on the accuracy of an estimate. The extensive
database of estimations and actual expenses that are recorded for prior projects is used by
these AI approaches, which also make use of implicit information about project execution [21].
Machine learning is a method that creates a model from data. In the field of machine learning,
data from the past is utilized to anticipate future results. The machine learning technique is
first applied to a training data set, and following the learning process, a model is created. The
final result of the machine learning process is this model, which may be applied in real-world
situations. The model can then, given unknown input data, produce an output based on the patterns
or relationships found in the training set. Data is what drives machine learning techniques.
Typically, data sets are gathered by humans and used for training. Depending on the training
approach, three different categories can be applied to machine learning techniques. The
learning process in a supervised machine learning technique is dependent on data sets that
offer both input and output values. Because the patterns in the data are identified using
the accurate output values of the input values, the process is referred to as supervised. The
supervised learning method is comparable to how people learn. Humans solve problems using
current knowledge, and then contrast the result with the original solution. If the response is
incorrect, the present understanding is changed to approach the issue more effectively the
following time. This is accomplished in supervised learning by repeatedly revising a model to
minimize the discrepancy between the correct output and the model’s output for the same
input. Unsupervised learning is applied when there is no known correct solution and relies
solely on the input values. This method looks for patterns in the data and responds based on
whether or not those patterns are present in each new piece of data. Unsupervised learning is
frequently used in clustering techniques. Unsupervised learning’s main benefit is that it can
uncover hidden data structures and learn features of the data that were previously unknown.
Sets of input, some output, and grades are used as training data in reinforcement learning. It
is typically employed when the best possible engagement is needed, such as during control and
games. The main focus is on handling issues in a dynamic setting where a given circumstance
necessitates a given course of action. The emphasis is on performance, which entails striking
a balance between forging new ground and making use of existing knowledge [21].
Machine learning uses a variety of techniques to address data issues. The kind of algorithm
used depends on the type of problem we are trying to answer, how many variables there are,
what kind of model will work best, and other factors.
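The supervised learning procedure described above (predict, compare with the known correct output, revise the model to shrink the discrepancy) can be illustrated with a toy single-parameter example. The sketch below is illustrative only and is not part of the thesis methodology:

```python
# Toy supervised learning: fit y = w * x by repeatedly reducing the
# gap between the model's output and the known correct output.
def train(samples, lr=0.01, epochs=200):
    w = 0.0  # initial guess for the single model parameter
    for _ in range(epochs):
        for x, y_true in samples:
            y_pred = w * x            # model output for a known input
            error = y_pred - y_true   # discrepancy with the correct output
            w -= lr * error * x       # revise the model to reduce it
    return w

# Known input/output pairs generated by y = 2x
data = [(1, 2), (2, 4), (3, 6)]
w = train(data)
print(round(w, 2))  # converges toward 2.0
```

Each pass nudges the weight toward the value that reproduces the known outputs, which is exactly the revise-until-the-discrepancy-shrinks loop described above.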
1. Artificial Neural Networks (ANN)
ANN has been employed in the Architecture, Engineering, and Construction (AEC) industry
since the early 1990s to deal with practical CM problems that are challenging to handle
using standard modeling and analytical techniques. Previous research has demonstrated
that ANN can significantly influence prediction, optimization, categorization, and decision-
making in CM practice. From the planning stage to the operation and maintenance stage,
it has successfully aided in resolving specific issues throughout the project’s life cycle [22].
Input, hidden, and output layers are the three different types of neuron layers that make
up the fundamental architecture. In feed-forward networks, the signal flow is strictly in the
feed-forward direction from input to output units. There are no feedback links, but the data
processing can span several (layers of) units. Feedback links are seen in recurrent networks
[23]. The following three layers could be found in a neural network:
1. Input Layer:
In an input layer, there are typically as many input nodes as there are explanatory variables.
The network receives the patterns from the input layer and transmits them to one or more
hidden layers via the network.
2. Hidden Layer:
The input values inside the network are subjected to the modifications applied by the hidden
layers. This involves incoming arcs that come from input nodes connected to each node or
from other hidden nodes. It connects to output nodes or other hidden nodes using arcs that
are leaving the system. The actual processing is carried out through a system of weighted
connections in a hidden layer.
3. Output Layer:
An output layer is then connected to the hidden layers. The input layer or hidden layers
can send connections to the output layer. It provides an output value that is consistent with
the response variable's forecast. Most classification problems have a single output node. The
following relation determines the neuron output signal O:

O = f(net) = f( Σ_{j=1}^{n} wj xj )    (3.1)

where the function f(net) is referred to as an activation (transfer) function and w is the
weight vector. The scalar product of the weight and input vectors is known as the variable net:

net = wᵀx = w1 x1 + ... + wn xn    (3.2)

where T denotes the transpose of a matrix. In the most straightforward scenario, the output
value O is calculated as

O = f(net) = 1 if wᵀx ≥ θ, and 0 otherwise    (3.3)

where the threshold level is denoted by θ; this kind of node is known as a
linear threshold unit [24].
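The linear threshold unit of Eqs. (3.1)-(3.3) can be written directly in code. The weights, inputs, and threshold below are arbitrary toy values, not values from this study:

```python
# Linear threshold unit (Eqs. 3.1-3.3): a weighted sum of the inputs
# followed by a step activation with threshold theta.
def neuron_output(weights, inputs, theta):
    net = sum(w * x for w, x in zip(weights, inputs))  # net = w^T x
    return 1 if net >= theta else 0                    # step activation f(net)

print(neuron_output([0.5, 0.5], [1.0, 1.0], 0.8))  # net = 1.0 >= 0.8, so 1
print(neuron_output([0.5, 0.5], [1.0, 0.0], 0.8))  # net = 0.5 <  0.8, so 0
```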
2. Regression Analysis

In simple linear regression, the relationship between an input variable X and a target variable Y is modeled as

Y = β0 + β1 X + ϵ    (3.4)
where:
• β0 and β1 are the coefficients of the regression model. They represent the intercept and
slope of the line of best fit, respectively.
• ϵ is the random error or residual term, which captures the variation in the data that
cannot be explained by the model.
Regression aims to estimate the values of β0 and β1 using a sample of data. The line of best
fit is determined by these estimated coefficients. Once the relationship between the input
variable (X) and the target variable (Y ) is established, the model can be used to forecast
the values of Y for new or unknown inputs. By fitting a line to the data, regression helps us
understand if there is a link between the input and output variables and enables us to make
predictions based on that relationship [25], [26].
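A minimal sketch of fitting Eq. (3.4) by ordinary least squares, using the closed-form estimates of the slope and intercept. The toy data are noise-free, so the fit recovers the generating line exactly:

```python
# Ordinary least squares for Y = b0 + b1*X (Eq. 3.4), estimated from a
# sample via the closed-form slope and intercept formulas.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    b1 = num / den       # slope estimate
    b0 = my - b1 * mx    # intercept estimate
    return b0, b1

# Toy data generated by y = 3 + 2x (no noise)
xs = [1, 2, 3, 4]
ys = [5, 7, 9, 11]
b0, b1 = fit_line(xs, ys)
print(b0, b1)  # 3.0 2.0
```

With the coefficients estimated, predictions for new inputs are simply b0 + b1 * x, which is the forecasting use of the model described above.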
3. Support Vector Machine

Figure 3.4: Support Vector Machine
Some data points cannot be separated linearly: there is no line (separating hyperplane) that
performs well on the two classes, even if we use a soft margin classifier that allows for
misclassification. In this case, the function f(x) = β0 + Σ_{i∈S} ai ⟨x, xi⟩ can be
written using a kernel function as:

f(x) = β0 + Σ_{i∈S} ai K(x, xi)    (3.5)

Here, i ∈ S denotes the sum over the set of indices corresponding to the
support points, and ⟨x, xi⟩ represents the inner product between points x and xi, given by
⟨a, b⟩ = Σ_i ai bi [28].
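A sketch of the kernel prediction in Eq. (3.5). The support points, the coefficients, and the choice of an RBF kernel are hypothetical illustrations, not values from this study:

```python
# Kernel-based prediction (Eq. 3.5): f(x) = b0 + sum_i a_i * K(x, x_i),
# shown here with a radial basis function (RBF) kernel.
import math

def rbf(u, v, gamma=1.0):
    # K(u, v) = exp(-gamma * ||u - v||^2)
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def predict(x, support, alphas, b0=0.0, gamma=1.0):
    # Sum the kernel similarities to the support points, weighted by a_i
    return b0 + sum(a * rbf(x, s, gamma) for a, s in zip(alphas, support))

support = [(0.0, 0.0), (1.0, 1.0)]   # hypothetical support points
alphas = [1.0, -1.0]                 # hypothetical coefficients a_i
print(round(predict((0.0, 0.0), support, alphas), 4))
```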
4. Decision Tree Method

Figure 3.6: Decision Tree Method
There are three different kinds of nodes in this tree-structured model. The first, known as
the root node, represents the whole sample and can divide into further nodes. The data
set's characteristics are represented by the interior nodes, and the decision criteria are
represented by the branches. Lastly, the result is represented by the leaf nodes. This method
is incredibly helpful in resolving decision problems. A given data point is passed through
true-or-false questions all the way down the tree until it reaches a leaf node. The average of
the dependent variable's values at that specific leaf node represents the final prediction.
After going through several rounds, the tree is able to forecast an appropriate value for the
data point [30].
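The true/false descent to a leaf and the leaf-average prediction can be sketched with a depth-one regression tree (a stump). The data values and the split threshold are hypothetical:

```python
# Depth-one regression tree (stump): one true/false question at the
# root, leaf values equal to the average target on each side.
def fit_stump(xs, ys, threshold):
    left  = [y for x, y in zip(xs, ys) if x < threshold]
    right = [y for x, y in zip(xs, ys) if x >= threshold]
    return (sum(left) / len(left), sum(right) / len(right))

def stump_predict(x, threshold, leaves):
    # Answer the root's question, then return that leaf's average
    return leaves[0] if x < threshold else leaves[1]

xs = [1, 2, 8, 9]        # e.g. a size-like feature (toy values)
ys = [10, 12, 50, 54]    # e.g. costs (toy values)
leaves = fit_stump(xs, ys, threshold=5)
print(stump_predict(3, 5, leaves))   # left leaf: (10 + 12) / 2 = 11.0
print(stump_predict(8, 5, leaves))   # right leaf: (50 + 54) / 2 = 52.0
```

A full decision tree simply repeats this question-then-split step recursively inside each leaf.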
5. XGboost method
The gradient-boosted trees approach is implemented in the open-source software known
as XGBoost, which stands for extreme gradient boosting. Due to its accuracy and simplicity,
it has been one of the most used machine learning approaches. It is a supervised learning
method that may be applied to classification or regression problems. XGBoost has performed
very well on structured, tabular data, and on the whole it is quick, incredibly so compared
to other gradient-boosting implementations. When it comes to classification and regression
predictive modeling problems, XGBoost dominates on structured or tabular data sets.
A key strength is its effective handling of missing values, which enables it to handle
real-world data with missing values without intensive pre-processing. Additionally, it
includes integrated parallel processing capability, allowing models to be trained on huge
data sets quickly [31].
F(xi) = Σ_{k=1}^{K} fk(xi) + bias    (3.6)
Where:
fk (xi ) is the prediction made by the k-th tree for the instance xi .
The final prediction F (xi ) is the sum of predictions from all trees plus a bias term.
Boosting is a sequential strategy in machine learning that increases the accuracy of the
model by turning weak learners, or hypotheses, into strong learners, with an emphasis on
effectiveness, computational speed, and model performance. A machine learning model that
outperforms random guessing only marginally is called a weak learner. Extreme Gradient
Boosting (XGBoost) is a scalable and enhanced variant of the gradient boosting technique.
XGBoost is a superb combination of hardware and software capabilities that improves on
current boosting approaches in precision with the least amount of time. XGBoost creates a
strong learner by merging many weak learners: weak learners may be combined into a stronger,
far more accurate learner. To operate, XGBoost trains several decision trees. The final
forecast is formed by combining the predictions made by each tree, each of which is trained
on a portion of the data. It is a step up from the GBM algorithm; the primary distinction is
that overfitting is less likely with XGBoost, as it employs a more regularized model [19].
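The sequential residual-fitting idea behind gradient boosting (Eq. 3.6) can be sketched as follows. This is a simplified illustration, not XGBoost itself: the weak learner is a fixed-threshold stump, and the data and threshold are hypothetical toy values:

```python
# Gradient-boosting sketch (Eq. 3.6): start from a constant bias, then
# repeatedly fit a weak learner (a one-split stump) to the residuals
# and add its scaled prediction to the ensemble.
def fit_residual_stump(xs, rs, threshold):
    left  = [r for x, r in zip(xs, rs) if x < threshold] or [0.0]
    right = [r for x, r in zip(xs, rs) if x >= threshold] or [0.0]
    return sum(left) / len(left), sum(right) / len(right)

def boost(xs, ys, threshold=5, rounds=20, lr=0.5):
    bias = sum(ys) / len(ys)          # initial constant prediction
    preds = [bias] * len(ys)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        leaves = fit_residual_stump(xs, residuals, threshold)
        # Add the weak learner's (scaled) correction to each prediction
        preds = [p + lr * (leaves[0] if x < threshold else leaves[1])
                 for x, p in zip(xs, preds)]
    return bias, preds

xs = [1, 2, 8, 9]
ys = [10.0, 12.0, 50.0, 54.0]
bias, preds = boost(xs, ys)
print([round(p, 1) for p in preds])  # [11.0, 11.0, 52.0, 52.0]
```

Each round shrinks the remaining residuals, so the sum of all rounds plus the bias converges to the leaf averages.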
6. Random Forest Method
Random forest is a supervised learning algorithm. An ensemble of decision trees, often
trained using the bagging approach, makes up the "forest" that it constructs. The bagging
method's general premise is that combining learning models improves the end outcome. It
constructs decision trees on various samples and uses their majority vote for classification
and their average for regression. The Random Forest algorithm's ability to handle data sets
with continuous variables, as in regression, and categorical variables, as in classification, is
one of its most crucial qualities. It performs well on both classification and regression tasks.
The method excels at handling complicated data sets and minimizing overfitting, making it a
helpful tool for a variety of machine learning predictive applications [32].
F(xi) = (1/T) Σ_{t=1}^{T} ft(xi)    (3.7)
Where:
ft (xi ) is the prediction made by the t-th decision tree for the instance xi .
The final prediction F (xi ) is the average of predictions from all trees.
Each decision tree ft is a function that takes the features xi as inputs and returns a prediction based on those features.
Each decision tree in the Random Forest model is built using a subset of features and a
subset of data points. To put it simply, m features and n random records are selected from a
data collection containing k records. Every sample has a different decision tree generated
for it. Every decision tree will provide a result. The final product is evaluated using either
regression or classification-based majority voting or averaging [33].
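The bootstrap-and-average idea of Eq. (3.7) can be sketched as follows. The "trees" here are one-split stumps and the data are hypothetical, so this illustrates the mechanism rather than a full random forest:

```python
# Random-forest-style averaging (Eq. 3.7): each tree is trained on a
# bootstrap sample of the records, and the regression output is the
# average of the trees' predictions.
import random

def fit_stump(sample, threshold=5):
    left  = [y for x, y in sample if x < threshold] or [0.0]
    right = [y for x, y in sample if x >= threshold] or [0.0]
    return sum(left) / len(left), sum(right) / len(right)

def fit_forest(data, n_trees=25, seed=0):
    rng = random.Random(seed)
    # Each tree sees a bootstrap sample (drawn with replacement)
    return [fit_stump([rng.choice(data) for _ in data])
            for _ in range(n_trees)]

def forest_predict(x, forest, threshold=5):
    preds = [l if x < threshold else r for l, r in forest]
    return sum(preds) / len(preds)   # average over all trees

data = [(1, 10.0), (2, 12.0), (8, 50.0), (9, 54.0)]
forest = fit_forest(data)
print(forest_predict(3, forest))   # averaged left-leaf prediction
```

A real random forest also samples a random subset of features at each split; that step is omitted here since the toy data has a single feature.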
7. Extra Tree Method

The Extra Trees (extremely randomized trees) method likewise averages the predictions of an
ensemble of trees, but chooses split points at random rather than searching for the best split:

F(xi) = (1/T) Σ_{t=1}^{T} ft(xi)    (3.8)

Where:
ft(xi) is the prediction made by the t-th randomized tree for the instance xi, and the final
prediction F(xi) is the average of predictions from all trees.
More specifically, Extra Trees would be a better option than other ensemble tree-based models
when developing models with significant feature engineering/feature selection pre-modelling
procedures and when computational cost is a concern [35]. The number of decision trees in the
ensemble, the number of input characteristics to choose at random and consider for
each split point, and the minimum number of samples needed in a node to establish a new
split point are the three primary hyperparameters in the method that need to be tuned [36].
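The distinguishing ingredient of Extra Trees, drawing split points at random instead of searching for the best split, can be sketched as follows (hypothetical data; one-split stumps stand in for full trees):

```python
# Extra-Trees-style sketch: each tree draws its split threshold at
# random, and predictions are averaged over the ensemble.
import random

def fit_random_stump(data, rng, lo=0.0, hi=10.0):
    t = rng.uniform(lo, hi)          # random split point, no search
    left  = [y for x, y in data if x < t] or [0.0]
    right = [y for x, y in data if x >= t] or [0.0]
    return t, sum(left) / len(left), sum(right) / len(right)

def ensemble_predict(x, ensemble):
    preds = [l if x < t else r for t, l, r in ensemble]
    return sum(preds) / len(preds)   # average over all randomized trees

rng = random.Random(1)
data = [(1, 10.0), (2, 12.0), (8, 50.0), (9, 54.0)]
ensemble = [fit_random_stump(data, rng) for _ in range(50)]
print(ensemble_predict(1, ensemble))
```

Because no split search is performed, each tree is cheap to fit, which is the computational-cost advantage noted above; the randomness is averaged out across the ensemble.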
8. Voting Regression
A sort of ensemble approach called the voting ensemble method integrates the results of
various models by voting. By pooling the knowledge and experience of various experts, the
voting ensemble approach can be utilized to produce predictions that are more accurate than
those made by any one model. The concept is to lower the variance and prevent overfitting by
combining the predictions of various models. In situations when there are numerous models
with various configurations or methods, the voting ensemble method is frequently employed.
The classifier ensemble shown below was developed utilizing models trained using various
machine learning algorithms, including logistic regression, SVM, random forest, and others
[37].
Figure 3.9: Voting Regression
For a given instance xi with features xi1, xi2, ..., xin, the prediction F(xi) made by a
voting regressor can be represented as follows:

F(xi) = (1/M) Σ_{m=1}^{M} fm(xi)    (3.9)
Where:
fm (xi ) is the prediction made by the m-th individual regression model for the instance xi .
The final prediction F (xi ) is the average of predictions from all individual models.
First, voting will not be hampered by significant mistakes or incorrect classifications from a
single model, because it depends on the performance of several models. Having several models
that can make the right forecast reduces the chance of one model producing an
incorrect one when combining models to produce a prediction. The estimator may be made
more resilient and less prone to overfitting with this method, as strong performance from
other models can compensate for a poor performance from one [26], [38].
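Eq. (3.9) amounts to averaging the outputs of several fitted models. In the sketch below, the three base models are hypothetical stand-ins for regressors that have already been fitted:

```python
# Voting regression (Eq. 3.9): average the outputs of several
# different fitted models for the same instance.
def linear_model(x):   # stand-in for a fitted regression line
    return 5.0 * x + 2.0

def stump_model(x):    # stand-in for a fitted decision tree
    return 10.0 if x < 3 else 40.0

def knn_model(x):      # stand-in for a nearest-neighbour estimate
    return 5.2 * x

def voting_predict(x, models):
    return sum(m(x) for m in models) / len(models)

models = [linear_model, stump_model, knn_model]
print(round(voting_predict(2.0, models), 1))  # (12.0 + 10.0 + 10.4) / 3 = 10.8
```

A large error from any single model is diluted by the average, which is the robustness argument made above.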
9. Stacking
Stacking, which consists of two-layer estimators, is a method of assembling classification or
regression models. All the baseline models that are used to forecast the results on the test
data sets are contained in the first layer. The meta-classifier or meta-regressor in the second
layer uses all the predictions from the baseline models as input to produce new predictions [39].
F(x_i) = \text{meta-model}\big(\text{prediction}_1(x_i), \text{prediction}_2(x_i), \ldots, \text{prediction}_N(x_i)\big)    (3.10)

Where:
the meta-model is the model that combines the predictions of the base models, and
\text{prediction}_1(x_i), \ldots, \text{prediction}_N(x_i) are the predictions made by the base models.
The training set is divided into two folds. L weak learners are selected and each is fitted
to the data of the first fold; the weak learners' predictions on the second fold are then
used as inputs to fit the meta-model [17].
F(x_i) = \sum_{t=1}^{T} \gamma_t h_t(x_i)    (3.11)
Where:
h_t(x_i) is the prediction made by the t-th tree for the instance x_i, and γ_t is the weight assigned to it.
CHAPTER 4
METHODOLOGY
Figure 4.1: Workflow diagram of the cost estimation using machine learning models
Figure 4.1 shows the flow of the overall steps involved in the machine learning models.
The topic “A Comparative Study of the Performance of Different Machine
Learning Algorithms In Estimating the Preliminary Costs of Building Construction Projects
Specifically In Nepal” was selected for this thesis because of its relevance and practicality
within the context of the Nepalese construction industry. By learning from the data of
past projects, machine learning models can produce estimates that are faster and more
accurate than those obtained with conventional software tools and human expertise, which
are tedious and time-consuming.
The input factors were gathered from the literature review. Expert opinion was then taken
to filter the factors and make them relevant to buildings in Nepal. The questionnaire
was filled out by 5 experts, including both contractors and consultants. The
following criteria were used to select the experts:
• working as Consultant/Contractor.
A pilot survey is a survey that the researcher conducts with a smaller sample in collaboration
with the experts. The gathered responses help to decide whether to move forward in the research
or to change the questionnaire. They also help to discover challenges that could affect the main
data collection process [40]. The questionnaire developed is attached in the Appendix. A
pilot test was conducted with 3 respondents to check the clarity and comprehensibility of
the questionnaire. The respondents easily understood the questionnaire; hence, there was no
difficulty in filling it out. The minimum time taken to fill out the questionnaire
was around 10 minutes and the maximum was almost 20 minutes. A summary
of the pilot test respondents is given in the table below:
S.N | Name of the Company | Name of the respondent | Level of Education | Position | Time taken
1 | Shitprava Architect and Engineering Consultancy Pvt. Ltd | Er. Ganesh Sapkota | Masters Degree | Project Consultant | almost 10 minutes
2 | Dream Height Engineering and Consultancy Pvt. Ltd | Er. Sagar Bista | Masters Degree | Project Manager | almost 12 minutes
3 | Seismo-Tech Engineering Consultancy Pvt. Ltd | Er. Suraj Bhattarai | Masters Degree | Project Consultant | almost 20 minutes

Table 4.2: Pilot survey respondents' information
Building projects' structural data, together with the final costs of the projects, were gathered
from various construction firms and consultancies. Data were collected from the Department of
Urban Development and Building Construction (DUDBC), consultancies, and contractors.
The data collection process was very difficult because the cost and bidding amounts are
confidential for contractors.
Figure 4.3: Types of building
• Separated Numerical and Categorical Features (excluding the total cost of the project).
• Plotted individual scatter plots for numerical features to visualize the data and identify
outliers.
• Plotted scatter plots between numerical features and the total cost of the project to
analyze their relationship.
• Plotted a normal distribution graph for numerical features to analyze the distribution
of the data.
• Calculated mean, median, mode, and variance for individual numeric features.
• Replaced missing values in numerical features with the mean, median, and mode, and
data was saved in separate Excel sheets.
• Based on the analysis of variance, replaced missing values with mean as there is less
variance in data when replaced with the mean value.
• Plotted histograms for categorical features to find whether values are unique or in some
order.
• Replaced missing values in categorical features with the mode (i.e., repeated values)
and again plotted histograms.
• Features were encoded using one-hot encoding, since they represent distinct categories
without any inherent order.
• One-hot encoding leaves all data values in numeric form, hence suitable for further
analysis.
• The final data set was then split into training and testing sets in a
ratio of 80:20 for implementation in the models.
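The imputation, encoding, and splitting steps above can be sketched as follows. The column names and values are illustrative stand-ins, not the thesis data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# A toy frame mirroring the structure of the collected data (hypothetical values).
df = pd.DataFrame({
    "Floor Area(sqm)": [120.0, np.nan, 340.0, 210.0, 180.0, np.nan],
    "Number of Floors": [2.0, 3.0, np.nan, 4.0, 2.0, 3.0],
    "Type of Building": ["Residential", "Commercial", np.nan,
                         "Residential", "Commercial", "Residential"],
    "Total Cost": [1.2e7, 2.5e7, 4.1e7, 3.0e7, 1.8e7, 2.2e7],
})

numeric = ["Floor Area(sqm)", "Number of Floors"]
df[numeric] = df[numeric].fillna(df[numeric].mean())          # mean imputation
df["Type of Building"] = df["Type of Building"].fillna(
    df["Type of Building"].mode()[0])                         # mode imputation
df = pd.get_dummies(df, columns=["Type of Building"])         # one-hot encoding

X = df.drop(columns="Total Cost")
y = df["Total Cost"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)                     # 80:20 split
```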
In this work, models such as the Linear Regressor, Decision Tree method, Random Forest method,
Artificial Neural Network, Support Vector Machine, XGBoost method, Extra Trees method,
Voting Regression, and Stacking method are implemented.
Linear Regression (LR) is a basic regression model used to establish a linear relationship
between independent and dependent variables. Decision Tree Regressor (DT) partitions data
into subsets based on features for predictions. Random Forest Regressor (RF), an ensemble
method, combines predictions from multiple decision trees. The Neural Network (NN)
comprises several dense layers with varying activations trained using the ’Adam’ optimizer for
100 epochs to minimize mean squared error. XGBoost Regressor (XGB) is a gradient-boosting
algorithm that combines weak learners to boost predictive performance. Support Vector
Machine (SVM) with a linear kernel is used for regression. Extra Trees Regressor (ET)
is similar to Random Forest but employs random thresholds for feature splitting. Voting
Regressor (Voting) amalgamates LR, DT, and RF models. Stacking Regressor (Stacking)
combines LR, DT, RF, ET, and Gradient Boosting models via a meta-regressor. Each model
showcases unique methodologies and predictive strengths tailored to the task at hand.
• Linear Regression, Decision Tree, Random Forest, Extra Trees, Voting Regressor:
scikit-learn
Setting a seed or random_state parameter ensures reproducibility of results. When a particular
number, such as 42, is specified as the random state or seed, it initializes the
random number generator so that each execution of the code with the same
seed produces identical random values. For instance, the random_state=42 parameter is
used in DecisionTreeRegressor, RandomForestRegressor, ExtraTreesRegressor, and
GradientBoostingRegressor, which initializes these models with the same random seed.
This initialization guarantees consistency in their behavior across different runs. Similarly,
in XGBoost, random_state=42 is utilized to ensure reproducibility in the behavior of the
XGBoost regressor.
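A minimal sketch of this reproducibility guarantee, using hypothetical data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Illustrative data; the point is the effect of the seed, not the fit quality.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=100)

# Two fits with random_state=42 are bit-for-bit identical on every run, because
# the bootstrap sampling and feature selection draw from the same seed.
model_a = RandomForestRegressor(n_estimators=50, random_state=42).fit(X, y)
model_b = RandomForestRegressor(n_estimators=50, random_state=42).fit(X, y)
same = np.array_equal(model_a.predict(X), model_b.predict(X))
```

Omitting `random_state` would instead let the bootstrap sampling differ between fits.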
Model Architectures:
• Linear Regression (LR): Utilizes a simple linear model to establish a linear relation-
ship between input features and the target variable. No hidden layers are involved.
• Random Forest (RF) Regressor: Comprises an ensemble of decision trees to enhance
prediction accuracy by averaging the outputs of multiple decision trees.
2. Dense layer with 256 neurons and ’relu’ (Rectified Linear Unit) activation
function.
Input shape:
– The input shape is determined by the number of features in the training data.
It is specified as (X_train.shape[1], ), which indicates the number of columns or
features in the input data.
Model compilation:
– The model is compiled using the 'adam' optimizer and the loss function set to
'mean_squared_error'.
Training:
– The model is trained using the fit method on the X_train and
y_train data.
– The training is performed for 100 epochs with a batch size of 32.
– The verbose parameter set to 0 implies that no output will be printed during
training.
– Validation data (X_test, y_test) is used to validate the model's performance after
each epoch.
Predictions:
– After training, the model is used to make predictions on the test data (X_test),
and the predictions are stored in nn_predictions.
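The thesis trains this network with Keras; as a self-contained sketch, the same architecture can be approximated with scikit-learn's MLPRegressor (a stand-in, not the authors' exact stack). The feature count and data here are hypothetical.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(42)
X_train = rng.uniform(size=(200, 14))      # hypothetical feature matrix
y_train = X_train.sum(axis=1) + rng.normal(scale=0.05, size=200)

# One dense hidden layer of 256 'relu' units, 'adam' optimiser, squared-error
# loss, mini-batches of 32, and at most 100 passes over the data.
nn_model = MLPRegressor(hidden_layer_sizes=(256,), activation="relu",
                        solver="adam", batch_size=32, max_iter=100,
                        random_state=42)
nn_model.fit(X_train, y_train)
nn_predictions = nn_model.predict(X_train[:5])
```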
• Stacking Regressor: Combines predictions from multiple base estimators (LR, DT,
RF, ET, GB) using a meta-estimator (LR) to produce final predictions.
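This two-layer setup can be sketched with scikit-learn's StackingRegressor, using the base estimators named above and a linear meta-regressor; the data are synthetic.

```python
import numpy as np
from sklearn.ensemble import (ExtraTreesRegressor, GradientBoostingRegressor,
                              RandomForestRegressor, StackingRegressor)
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(size=(150, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=150)

# First layer: the five base estimators; second layer: a linear meta-regressor
# fitted on their cross-validated predictions.
stack = StackingRegressor(
    estimators=[
        ("lr", LinearRegression()),
        ("dt", DecisionTreeRegressor(random_state=42)),
        ("rf", RandomForestRegressor(n_estimators=50, random_state=42)),
        ("et", ExtraTreesRegressor(n_estimators=50, random_state=42)),
        ("gb", GradientBoostingRegressor(random_state=42)),
    ],
    final_estimator=LinearRegression(),
)
stack.fit(X, y)
stack_predictions = stack.predict(X[:3])
```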
Calculation of mean square error, mean absolute error, and R square [41].
a. Mean Squared Error (MSE): The Mean Squared Error (MSE) assesses the average
squared differences between predicted (Pj ) and actual (Tj ) values in the dataset. The formula
for MSE is given by:
\mathrm{MSE} = \frac{1}{N} \sum_{j=1}^{N} (T_j - P_j)^2
Where:
• Lower MSE values indicate better agreement between predicted and actual values, with
a perfect model yielding an MSE of 0.
b. Mean Absolute Error (MAE): The Mean Absolute Error (MAE) measures the average
absolute differences between predicted (Pj ) and actual (Tj ) values in the dataset. The formula
for MAE is given by:
\mathrm{MAE} = \frac{1}{N} \sum_{j=1}^{N} |T_j - P_j|
Where:
• MAE provides a measure of the average magnitude of errors between predicted and
actual values. It is less sensitive to outliers compared to Mean Squared Error (MSE).
c. Coefficient of Determination (R2): The R2 score measures the proportion of the
variance in the target variable that is explained by the model and is computed as
R^2 = 1 - \mathrm{SSE}/\mathrm{SST}.
Where:
• SSE denotes the Sum of Squares of Residuals, reflecting the difference between predicted
and actual values.
• SST represents the Total Sum of Squares, indicating the total variance in the target
variable.
d. Root Mean Square Error: Root Mean Square Error (RMSE) is a commonly used
metric in the field of statistics and machine learning to evaluate the accuracy of a predictive
model. It is a measure of the average magnitude of the errors between predicted and actual
values. The RMSE is calculated by taking the square root of the mean of the squared
differences between predicted and actual values. Here’s the formula:
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}    (4.1)
Where:
y_i is the actual value and \hat{y}_i is the predicted value for the i-th sample.
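The four metrics can be computed with scikit-learn as sketched below; the actual (T) and predicted (P) values shown are illustrative, not thesis results.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Illustrative actual (T) and predicted (P) values.
T = np.array([10.0, 12.0, 9.0, 15.0])
P = np.array([11.0, 11.5, 9.5, 14.0])

mse = mean_squared_error(T, P)     # average squared difference
mae = mean_absolute_error(T, P)    # average absolute difference
rmse = np.sqrt(mse)                # square root of the MSE, Eq. (4.1)
r2 = r2_score(T, P)                # 1 - SSE/SST
```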
The code is written in Python and uses various libraries for data pre-processing and machine
learning. The Google Colaboratory platform was used for the training part of the experiment. It
is a cost-free, cloud-based Python environment in which a user can execute code on
powerful computing resources that train complex machine learning models faster than
other general-purpose computers. PyTorch version 1.13, an open-source platform
for machine learning-related projects, was taken into consideration to implement the different
architectures. The Python Imaging Library (PIL 7.0.0) and Matplotlib were used to perform
image processing and computer vision tasks.
• pandas (pd): This library is used for data manipulation and analysis. It provides
data structures such as DataFrames that allow one to work with structured data efficiently.
• sklearn.model_selection: This module within the scikit-learn library provides tools
for splitting data sets into train and test sets, and for cross-validation.
• sklearn.impute: This module provides classes for imputing (filling in) missing values
in data sets.
CHAPTER 5
RESULTS AND DISCUSSION
As per the experts, laboratory tests, building code use, consulting fees, area of formwork,
market condition, solid waste management, roof type, insurance of staff, material, and
equipment, and waterproofing were the least important factors, as shown by the numeric
values in Table 5.1. Aspects with high counts for ”Yes” were retained, while aspects with
high counts for ”No” were eliminated. Additional factors were also suggested by the experts,
as shown in Table 5.2. After eliminating and adding some factors, a new questionnaire was
prepared, which is shown in the Appendix. The building attributes finally considered for
further processing are listed. The attributes include details such as the name of the project,
location of the building, type of building, construction completion year, site/geographic
conditions, access to the site, site area, type of foundation, plinth area, floor area, floor
height, number of floors, number of columns, number of rooms, number of bathrooms, number
of kitchens, number of lifts/elevators, number of basements, use of building code, type of
Aspect | Yes | No
Location of Building | 5 | 0
Type of Building | 5 | 0
Site/Geographic Conditions | 5 | 0
Access to Site | 4 | 1
Site Area | 3 | 2
Plinth Area | 5 | 0
Floor Area | 5 | 0
Floor Height | 4 | 1
Number of Storeys | 4 | 1
Number of Columns | 3 | 2
Number of Rooms | 4 | 1
Number of Bathrooms | 3 | 2
Number of Beams | 3 | 2
Type of Foundation | 5 | 0
Roof Type | 2 | 3
Number of Lifts/Elevator | 4 | 1
Basement | 5 | 0
Building Code Used | 2 | 3
Laboratory Tests | 1 | 4
Consulting Fees | 2 | 3
Insurance of Staff, Material, Equipment | 3 | 2
Waterproofing | 2 | 3
Aluminum and Railing Works | 5 | 0
Wood Works | 5 | 0
Type of Flooring Works | 5 | 0
External/Internal Finishing | 4 | 1
Area of Formwork | 1 | 4
HVAC Work | 5 | 0
Water Supply & Drainage System | 5 | 0
Solid Waste Management | 4 | 1
Water Treatment, Septic Tank, Soak Pit | 3 | 2
Electrical System | 5 | 0
CCTV, AC & Ventilation System, Solar | 4 | 1
Landscaping | 5 | 0
Road Works/River Training Works | 3 | 2

Table 5.1: Count of Responses from Experts
window, type of door, type of flooring works, external painting, internal finishing, HVAC
work, sanitary works, electrical works, landscaping, and road works/river training works.
There are 0 to 2 basements, and the number of storeys ranges from 1 to 13. The buildings'
overall structural costs are above 1 crore.
Table 5.2: Additional Factors from Experts
• Construction year
Among all attributes, there are 17 numeric features including ’Site Area’, ’Plinth Area’, ’Floor
Area’, ’Floor Height’, ’Number of Floors’, ’Number of Beams’, ’Number of Columns’, ’Number
of Rooms’, ’Number of Bathrooms’, ’Number of Lifts/Elevator’, ’Number of Basements’,
’External Painting (Percentage of Total cost)’, ’Internal Finishing (Percentage of Total
cost)’, ’HVAC work (Percentage of Total cost)’, ’Sanitary Works (Percentage of Total cost)’,
’Electrical Works (Percentage of Total cost)’, and ’Total Final Cost of the Project including
VAT’. Similarly, there are 12 categorical features including ’Name of project’, ’Location of
Building’, ’Type of Building’, ’Construction Year’, ’Site/geographic Conditions’, ’Access to
Site’, ’Type of Foundation’, ’Type of Window’, ’Type of Door’, ’Type of Flooring Works’,
’Landscaping’, and ’Road works/River Training Works’.
Scatter plots are a fundamental tool in data exploration and can provide valuable insights
into the relationships between numerical variables in a dataset. Scatter plots can reveal
trends or patterns in the data. For example, if the points form an upward or downward slope,
it indicates a positive or negative linear trend between the variables. Outliers, which are
data points that significantly deviate from the main cluster of points, are easily identified
in scatter plots. This can help in identifying complex patterns and interactions in the data.
Scatter plots can also reveal non-linear relationships between variables: if the points form a
curve or some other non-linear shape, it suggests a non-linear relationship.
Figure 5.3: Scatter plot of features vs Total cost
The scatter plots in Figures 5.2 and 5.3 show a random distribution of points, which suggests
that there is no discernible pattern or relationship between the variables being plotted. A
random distribution typically indicates that there is no correlation or association between the
variables: each data point is scattered across the plot without following any specific
trend, slope, or pattern. This pattern-less arrangement implies that changes in one variable do
not have a consistent effect or influence on the other variable. In such cases, non-linear
models may be needed to explore other potential relationships or factors that could affect
the variables under consideration.
A normal distribution graph describes how data is distributed when many independent,
random factors contribute to an outcome. Deviations from the normal distribution can
indicate outliers or anomalies in data. Detecting outliers is crucial in data analysis.
This distribution appears with a tail stretching towards the right side of the curve. It indicates
that the data has a longer right tail compared to the left side. In such cases, the mean tends
to be larger than the median, and most data points cluster towards the left side. A tail on
the right side of a distribution indicates a right-skewed pattern, and outliers situated near
this tail represent unusually high values that might have a significant impact on statistical
measures and require thorough examination during data analysis.
Figure 5.5: Normal Distribution Graph for numerical features
The data had missing values, so preprocessing had to be done: some values were extremely
high, null, disordered text, or noisy. Data preprocessing was thus carried out
to address this issue. The mean, median, mode, and variance were calculated for each numeric
feature, and missing values in numerical features were replaced in turn with the
mean, median, and mode. Table 5.3 uses the variance as the measure for deciding between
mean and median imputation: the results after imputation with the mean showed
a lower variance compared to imputation with the median. Lower variance signifies less
dispersion of data points from the mean value, implying that the dataset tends to be more
tightly clustered around the mean. Comparing the variance values across different attributes
between the raw data, mean-imputed data, and median-imputed data, mean imputation
tends to preserve the original variability of the dataset better than median imputation. Hence,
with variance as the criterion, mean imputation seems to be more aligned with
maintaining the original data's variability compared to median imputation, and all numeric
features had their missing values replaced by the mean. The decision is also justified by Table
5.3.
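The variance criterion can be illustrated on a small skewed series; the values below are hypothetical, not the thesis data.

```python
import numpy as np
import pandas as pd

# A skewed numeric column with gaps (illustrative values).
raw = pd.Series([100.0, 120.0, 110.0, np.nan, 900.0, 130.0, np.nan])

mean_filled = raw.fillna(raw.mean())       # mean of observed values = 272
median_filled = raw.fillna(raw.median())   # median of observed values = 120

# Filling with the mean adds points at the centre of mass, so the resulting
# variance is never larger than with any other constant fill value.
var_mean = mean_filled.var()
var_median = median_filled.var()
```

On skewed data like this, mean imputation yields the lower post-imputation variance, which is the property the comparison in Table 5.3 relies on.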
Histograms were plotted for categorical features to find whether values are unique or in some
order. Missing values in categorical features were replaced with the mode, i.e., the most
frequent value.
Table 5.3: Comparison of Raw Data, Mean-Imputed Data, and Median-Imputed Data based
on variance
Figure 5.7: Adding missed values with mean
Regarding replacing missing values in categorical features with the mode (most frequent
value): it is a common approach and often a reasonable strategy, especially when dealing
with categorical data. Imputing missing categorical values with the mode preserves the overall
distribution of the categories and minimizes the potential impact of missing data on the
analysis.
Figure 5.9: Replaced categorical values with mode
The values in the total final cost of the project including VAT were replaced with their natural
logarithms, as shown in Figure 5.10. Taking the logarithm can normalize the distribution and
reduce the impact of extreme values, making the data more suitable for data analysis.
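A minimal sketch of the log transform and its effect on spread; the cost figures are hypothetical.

```python
import numpy as np
import pandas as pd

# Project costs spanning orders of magnitude (illustrative figures in NPR).
cost = pd.Series([1.2e7, 2.5e7, 8.0e7, 4.1e8])
log_cost = np.log(cost)                   # natural logarithm

# The spread between the largest and smallest value shrinks dramatically,
# damping the influence of extreme values on later analysis.
spread_raw = cost.max() / cost.min()
spread_log = log_cost.max() / log_cost.min()
original = np.exp(log_cost)               # the transform is reversible
```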
Calculating the correlation matrix for the Data Frame: The correlation matrix shows
how each numerical column in the Data Frame is related to every other numerical column by
calculating Pearson correlation coefficients. The Pearson correlation coefficient ranges from
-1 (perfect negative correlation) to 1 (perfect positive correlation), with 0 indicating no linear
correlation. Positive values indicate a positive correlation, while negative values indicate a
negative correlation. It is useful for feature selection or understanding the data’s pattern.
Figure 5.11: Replacing final cost with their natural logarithms
Dropping Plinth Area(sqm) and Number of Bathrooms: These two features are being
removed because they exhibit a high correlation with other variables (’Floor Area(sqm)’ and
’Number of Rooms’ respectively) beyond a predefined threshold of 0.70. Due to their strong
correlation with other variables, it’s assumed that they might not provide additional signifi-
cant information for the analysis or modeling and could potentially lead to multicollinearity
issues.
Dropping Construction Year: This feature is being dropped because its correlation with
the target variable (’Total final cost of the project including VAT’) is lower than a specified
threshold of 0.70, specifically correlating to 0.091. A correlation below this threshold suggests
a weak linear relationship between ’Construction Year’ and the target variable, which might
not significantly contribute to explaining the variability in the target.
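The two dropping rules can be sketched as follows, with hypothetical columns constructed to trigger each rule:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
rooms = rng.integers(2, 20, size=60).astype(float)

# Hypothetical columns reproducing the two situations described above.
df = pd.DataFrame({
    "Number of Rooms": rooms,
    "Number of Bathrooms": 0.5 * rooms + rng.normal(0, 0.3, size=60),
    "Construction Year": rng.integers(2060, 2079, size=60).astype(float),
    "Total Cost": 1.5e6 * rooms + rng.normal(0, 2.0e6, size=60),
})

THRESHOLD = 0.70
corr = df.corr()
to_drop = []
# Rule 1: drop a feature highly correlated with another feature.
if abs(corr.loc["Number of Rooms", "Number of Bathrooms"]) > THRESHOLD:
    to_drop.append("Number of Bathrooms")
# Rule 2: drop a feature weakly correlated with the target.
if abs(corr.loc["Construction Year", "Total Cost"]) < THRESHOLD:
    to_drop.append("Construction Year")
reduced = df.drop(columns=to_drop)
```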
Figure 5.12: One hot encoding of categorical variables
Cleaning the columns: The column Location of Building was cleaned. Since the dataset is
limited to certain places only, the data were categorized as either inside Kathmandu or
outside Kathmandu. Once the location of the building has been cleaned into Inside Valley
and the type of foundation has been cleaned into individual foundation types, these two
original columns can be dropped.
The features were encoded using one-hot encoding as shown in Figure 5.13, transforming
the categorical variables into a numerical format. All variables that are either highly
correlated with each other or weakly correlated with the target variable, the total final
cost of the project including VAT, were then dropped.
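One-hot encoding with pandas can be sketched on a hypothetical slice of the categorical features named in this chapter:

```python
import pandas as pd

# A hypothetical slice of the categorical features (values illustrative).
df = pd.DataFrame({
    "Type of Foundation": ["mat", "isolated", "raft", "isolated", "pile"],
    "Inside Valley": ["Yes", "No", "Yes", "Yes", "No"],
})

# Each category becomes its own indicator column (e.g. 'Type of Foundation_mat'),
# so no artificial ordering is imposed on the categories.
encoded = pd.get_dummies(df, columns=["Type of Foundation", "Inside Valley"])
```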
Regarding the comparison of the models, the Decision Tree, Random Forest, ExtraTree,
Voting, and Stacking models exhibit relatively better performance in terms of MSE, MAE,
RMSE, and R2 . Among these, the Decision Tree, ExtraTree, and Voting models demonstrate
particularly strong performance across multiple metrics. The Decision Tree or ExtraTree
model is considered the best choice based on the provided metrics, as they seem to have
lower errors and higher R2 values compared to other models.
Figure 5.14: Mean square error
Figure 5.16: R square
(a) Linear Regressor (b) Decision Tree Regressor
(i) Stacking Regressor
CHAPTER 6
CONCLUSION
The final input features are 8 numerical features (Site Area, Floor Height, Number of
Floors, Number of Beams, Number of Columns, Number of Rooms, Number of Lifts, Number
of Basements) and 6 categorical features (Landscaping (Yes/No), Type of Flooring (marble,
granite, punning, terrazzo), Type of Door (UPVC, Sal wood), Type of Foundation (mat, isolated,
combined, raft, strap, pile), Electrical Works Luxury (Yes/No), and Inside Valley (Yes/No)).
In the evaluation of various regression models, three stand out as the most promising for
predicting the target variable. The Decision Tree model exhibited remarkable performance
with an MSE (Mean Squared Error) of 0.088575, an MAE (Mean Absolute Error) of 0.104625,
an RMSE (Root Mean Squared Error) of 0.297615, and an R2 (Coefficient of Determination)
of 0.876170. Similarly, the ExtraTree model closely followed with an MSE of 0.088601, an
MAE of 0.102909, an RMSE of 0.297659, and an R2 of 0.876134. The Voting model followed
with an MSE of 0.105035, an MAE of 0.222807, an RMSE of 0.324091, and an R2 of 0.853159.
The Decision Tree or ExtraTree model is considered the best choice based on these
metrics, as they show lower errors and higher R2 values than the other models.
CHAPTER 7
FUTURE RECOMMENDATIONS
Based on the findings and the analysis conducted in this thesis, there are several recommen-
dations for future research and practical applications:
A. Enhanced Data Collection: Gathering more diverse and extensive data sets could
provide a more comprehensive understanding of the relationships between features and
building costs, and would also help to build more accurate models.
C. External Validation: Validate the developed models using external data sets from
different geographic locations or periods to ensure the reliability of models.
F. Long-Term Cost Analysis: Extend the analysis to include the assessment of long-term
cost implications and factors influencing ongoing maintenance and operational expenses
post-construction.
REFERENCES
[1] Dipendra Gautam, Hugo Rodrigues, Krishna Kumar Bhetwal, Pramod Neupane, and
Yashusi Sanada. Common structural and construction deficiencies of nepalese buildings.
Innovative infrastructure solutions, 1:1–18, 2016.
[2] Hong-Gyu Cho, Kyong-Gon Kim, Jang-Young Kim, and Gwang-Hee Kim. A comparison
of construction cost estimation using multiple regression analysis and neural network
in elementary school project. Journal of the Korea Institute of Building Construction,
13(1):66–74, 2013.
[3] K Akalya, LK Rex, and D Kamalnataraj. Minimizing the cost of construction materials
through optimization techniques. IOSR Journal of Engineering, 2018.
[4] Seokheon Yun. Performance analysis of construction cost prediction using neural network
for multioutput regression. Applied Sciences, 12(19):9592, 2022.
[5] Shabniya Veliyampatt. Determination of efficacy of cost estimation models for building
projects using artificial neural network. International Research Journal of Engineering
and Technology (IRJET), 08(10), 2021.
[6] Abdelrahman Osman Elfaki, Saleh Alatawi, Eyad Abushandi, et al. Using intelligent
techniques in construction project cost estimation: 10-year survey. Advances in Civil
engineering, 2014, 2014.
[8] Gwang-Hee Kim, Jae-Min Shin, Sangyong Kim, and Yoonseok Shin. Comparison of
school building construction costs estimation methods using regression analysis, neural
network, and support vector machine. 2013.
[9] Batta Mahesh. Machine learning algorithms-a review. International Journal of Science
and Research (IJSR).[Internet], 9(1):381–386, 2020.
[10] Sevgi Zeynep Dogan. Using machine learning techniques for early cost prediction of
structural systems of buildings. Izmir Institute of Technology (Turkey), 2005.
[11] JF Beltman. Predicting construction costs in the program phase of the construction
process: a machine learning approach. B.S. thesis, University of Twente, 2021.
[12] Sunil M Dissanayaka and Mohan M Kumaraswamy. Comparing contributors to time and
cost performance in building projects. Building and Environment, 34(1):31–42, 1998.
[13] N. Seshadri Sekhar. A course material on estimation, costing and valuation. 2020.
[14] Gwang-Hee Kim, Sung-Hoon An, and Kyung-In Kang. Comparison of construction
cost estimating models based on regression analysis, neural networks, and case-based
reasoning. Building and environment, 39(10):1235–1242, 2004.
[15] ALIREZA Shojaei and AMIRSAMAN Mahdavian. Revisiting systems and applications
of artificial neural networks in construction engineering and managements. Proceedings
of the International Structural Engineering and Construction, Chicago, IL, USA, pages
20–25, 2019.
[16] Mohamed Badawy. A hybrid approach for a cost estimate of residential buildings in
egypt at the early stage. Asian Journal of Civil Engineering, 21(5):763–774, 2020.
[17] Uyeol Park, Yunho Kang, Haneul Lee, and Seokheon Yun. A stacking heterogeneous
ensemble learning method for the prediction of building construction project costs.
Applied Sciences, 12(19):9729, 2022.
[18] Rifat Sonmez. Conceptual cost estimation of building projects with regression analysis
and neural networks. Canadian Journal of Civil Engineering, 31(4):677–683, 2004.
[19] TQD Pham, T Le-Hong, and XV Tran. Efficient estimation and optimization of building
costs using machine learning. International Journal of Construction Management,
23(5):909–921, 2023.
[20] Joost N Kok, Egbert J Boers, Walter A Kosters, Peter Van der Putten, and Mannes Poel.
Artificial intelligence: definition, trends, techniques, and cases. Artificial intelligence,
1:270–299, 2009.
[21] Erik Matel, Faridaddin Vahdatikhaki, Siavash Hosseinyalamdary, Thijs Evers, and Hans
Voordijk. An artificial neural network approach for cost estimation of engineering services.
International journal of construction management, 22(7):1274–1287, 2022.
[22] Hongyu Xu, Ruidong Chang, Min Pan, Huan Li, Shicheng Liu, Ronald J Webber, Jian
Zuo, and Na Dong. Application of artificial neural networks in construction management:
A scientometric review. Buildings, 12(7):952, 2022.
[23] Ajith Abraham. Artificial neural networks. Handbook of measuring system design, 2005.
[24] Omer Tatari and Murat Kucukvar. Cost premium prediction of certified green buildings:
A neural network approach. Building and Environment, 46(5):1081–1086, 2011.
[26] David J Lowe, Margaret W Emsley, and Anthony Harding. Predicting construction
cost using multiple regression techniques. Journal of construction engineering and
management, 132(7):750–758, 2006.
[28] Bing Dong, Cheng Cao, and Siew Eang Lee. Applying support vector machines to predict
building energy consumption in tropical region. Energy and Buildings, 37(5):545–553,
2005.
[29] Wei-Yin Loh. Classification and regression trees. Wiley interdisciplinary reviews: data
mining and knowledge discovery, 1(1):14–23, 2011.
[30] Zhun Yu, Fariborz Haghighat, Benjamin CM Fung, and Hiroshi Yoshino. A decision tree
method for building energy demand modeling. Energy and Buildings, 42(10):1637–1646,
2010.
[31] Amal Asselman, Mohamed Khaldi, and Souhaib Aammou. Enhancing the prediction
of student performance based on the machine learning xgboost algorithm. Interactive
Learning Environments, pages 1–20, 2021.
[32] Mark R Segal. Machine learning benchmarks and random forest regression. 2004.
[33] Argaw Gurmu and Mani Pourdadash Miri. Machine learning regression for estimating
the cost range of building projects. Construction Innovation, 2023.
[34] Aakanksha Sharaff and Harshil Gupta. Extra-tree classifier with metaheuristics approach
for email classification. In Advances in Computer Communication and Computational
Sciences: Proceedings of IC4S 2018, pages 189–197. Springer, 2019.
[35] Sokratis Papadopoulos, Elie Azar, Wei-Lee Woon, and Constantine E Kontokosta.
Evaluation of tree-based ensemble learning algorithms for building energy performance
estimation. Journal of Building Performance Simulation, 11(3):322–332, 2018.
[36] Olga Kurasova, Virginijus Marcinkevičius, Viktor Medvedev, and Birutė Mockevičienė.
Early cost estimation in customized furniture manufacturing using machine learning.
International journal of machine learning and computing. Singapore: IJMLC, 2021, vol.
11, no. 1., 2021.
[37] Pyae-Pyae Phyo, Yung-Cheol Byun, and Namje Park. Short-term energy forecasting
using machine-learning-based ensemble voting regression. Symmetry, 14(1):160, 2022.
[38] Marzieh Khosravi, Sadman Bin Arif, Ali Ghaseminejad, Hamed Tohidi, and Hanieh
Shabanian. Performance evaluation of machine learning regressors for estimating real
estate house prices. 2022.
[39] Bohdan Pavlyshenko. Using stacking approaches for machine learning models. In 2018
IEEE Second International Conference on Data Stream Mining & Processing (DSMP),
pages 255–258. IEEE, 2018.
[40] CN Atapattu, ND Domingo, and M Sutrisna. Statistical cost modelling for preliminary
stage cost estimation of infrastructure projects. In IOP Conference Series: Earth and
Environmental Science, volume 1101, page 052031. IOP Publishing, 2022.
[41] Yasamin Ghadbhan Abed, Taha Mohammed Hasan, and Raquim Nihad Zehawi. Machine
learning algorithms for constructions cost prediction: A systematic review. International
Journal of Nonlinear Analysis and Applications, 13(2):2205–2218, 2022.
APPENDIX A
Expert opinion
Pilot Testing
Figure 7.3: Expert Opinion
Figure 7.5: Expert Opinion
Figure 7.7: Pilot Survey
Figure 7.9: Letter for Data Collection
Figure 7.10: Similarity Index