
TRIBHUVAN UNIVERSITY

INSTITUTE OF ENGINEERING
PULCHOWK CAMPUS

THESIS NO.: 078/MSCoM/002

A Comparative Study of the Performance of Different Machine Learning


Algorithms In Estimating the Preliminary Costs of Building Construction
Projects In Nepal

by
Anjuli Sapkota

A THESIS
SUBMITTED TO THE DEPARTMENT OF CIVIL ENGINEERING IN
PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE
OF MASTER IN CONSTRUCTION MANAGEMENT

DEPARTMENT OF CIVIL ENGINEERING


LALITPUR, NEPAL

December 2023
COPYRIGHT©

The author has agreed that the library, Department of Civil Engineering, Institute of
Engineering, Pulchowk Campus, may make this thesis freely available for inspection.
Moreover, the author has agreed that permission for extensive copying of this thesis work
for scholarly purposes may be granted by the professor(s) who supervised the thesis work
recorded herein or, in their absence, by the Head of the Department wherein this thesis was
done. It is understood that recognition will be given to the author of this thesis and to
the Department of Civil Engineering, Pulchowk Campus in any use of the material of this
thesis. Copying, publication, or other use of this thesis for financial gain without the approval
of the Department of Civil Engineering, Institute of Engineering, Pulchowk Campus and the
author’s written permission is prohibited.

Request for permission to copy or to make any use of the material in this thesis in whole or
part should be addressed to:

.....................................................................
Head of Department of Civil Engineering
Institute of Engineering, Pulchowk Campus
Pulchowk, Lalitpur, Nepal

DECLARATION

I hereby declare that the work hereby submitted for the degree of Master of Science in
Construction Management (MSCoM) at IOE, Pulchowk Campus, entitled “A Comparative
Study of the Performance of Different Machine Learning Algorithms In Estimating the
Preliminary Costs of Building Construction Projects Specifically In Nepal” is my original
work and has not been previously submitted by me at any university for any academic award.

I authorize IOE, Pulchowk Campus to lend this thesis to other institutions or individuals for
scholarly research.

......................
Anjuli Sapkota
078MSCoM002

RECOMMENDATION

The undersigned certify that we have read and recommended to the Department of Civil
Engineering for acceptance, a thesis entitled “A Comparative Study of the Performance
of Different Machine Learning Algorithms In Estimating the Preliminary Costs
of Building Construction Projects Specifically In Nepal”, submitted by Anjuli
Sapkota in partial fulfillment of the requirement for the award of the degree of “Master of
Science in Construction Management”.

..........................................................................
Supervisor: Er. Samrakshya Karki
Building Design Authority (P). Ltd.
Project Planning Department

..................................................................................
External Examiner: Er. Krishna Singh Basnet
Former Executive Director
Road Board Nepal

.................................................................................................
Program Coordinator: Asst. Professor Mahendra Raj Dhital
M.Sc. in Construction Management
Department of Civil Engineering
IOE, Pulchowk Campus

Date: December 2023

ACKNOWLEDGEMENT

I express my deep sense of gratitude to the Department of Civil Engineering, IOE, Pulchowk
Campus for providing me with an opportunity to work on this thesis as part of the coursework
for my Master of Science in Construction Management (MSCoM). I owe a special debt of
gratitude to the coordinator, Mahendra Raj Dhital, Associate Professor, Institute of
Engineering, Pulchowk Campus, Tribhuvan University, and to Er. Samrakshya Karki, my thesis
supervisor, Building Design Authority, for their encouragement, suggestions, and continuous
guidance in the thesis. My sincere thanks also go to my mother, Mrs. Bishnu Devi Sapkota,
and my husband, Er. Bishwas Pokharel, Masters student in Computer Systems and Knowledge
Engineering, for their continuous guidance and motivation.

Any further suggestions or criticisms for the improvement will be highly appreciated.

Sincerely,
Anjuli Sapkota
078MSCoM002

ABSTRACT

Construction cost estimation is crucial to project success, and it is challenging to accurately
predict the cost at an early stage. Traditional methods are being used for preliminary cost
estimation in the construction industry, yet there still exists the problem of cost overruns and
time delays due to incorrect cost budgeting. This study aims to analyze a modern method
of preliminary cost estimation to prove its efficiency over the traditional method. Models
such as Linear Regressor, Decision Tree Method, Random Forest method, Artificial Neural
Networks, Support Vector Machine, XGboost method, Extra tree method, Voting Regression,
and Stacking method are used. Regarding the data sets, the buildings used comprise 12
educational, 3 commercial, 6 hospital, 18 residential, 13 public, 17 official, and 3 hotel
buildings, having 0 to 2 basements and estimated costs above 1 crore. The input features
were taken from the literature review and validated by expert opinion, and after successfully
conducting pilot testing, the survey questionnaire was
distributed among contractors and consultants. Data preprocessing was carried out and
training and testing data sets were developed. The model was developed for nine algorithms.
Mean absolute error, Mean square error, Root mean square error, and R square value were
used as evaluation metrics. In the evaluation of various regression models, three stand out as
the most promising for predicting the target variable. The Decision Tree model exhibited
remarkable performance with an MSE (Mean Squared Error) of 0.088575, an MAE (Mean
Absolute Error) of 0.104625, an RMSE (Root Mean Squared Error) of 0.297615, and an R2
(Coefficient of Determination) of 0.876170. Similarly, the ExtraTree model closely followed
with an MSE of 0.088601, an MAE of 0.102909, an RMSE of 0.297659, and an R2 of 0.876134.
The Voting model followed with an MSE of 0.105035, an MAE of 0.222807, an RMSE of
0.324091, and an R2 of 0.853159.
and motivates us to follow the trends of machine learning in the present era.

Keywords: Construction Management, Building, Cost Estimation, Machine Learning, Pilot
Testing

Contents

COPYRIGHT iii

DECLARATION iv

RECOMMENDATION v

ACKNOWLEDGEMENT i

ABSTRACT ii

Contents iii

List of Figures v

List of Tables vii

1 INTRODUCTION 1

1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.3 Main Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.3.1 Specific Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.4 Scope of the Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.5 Contributions of Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2 LITERATURE REVIEW 4

2.1 Building definition and types of Building . . . . . . . . . . . . . . . . . . . . 4

2.2 Factors affecting Building project . . . . . . . . . . . . . . . . . . . . . . . . 4

2.3 Cost estimate and types of Cost Estimate . . . . . . . . . . . . . . . . . . . 4

2.4 Application of Machine Learning Algorithms in Cost Prediction . . . . . . . 5

3 THEORETICAL BACKGROUND 9

3.1 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

3.2 Machine Learning Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . 10

4 METHODOLOGY 21

4.1 Workflow Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

4.1.1 Topic Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

4.1.2 Expert Opinion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

4.1.3 Pilot Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

4.1.4 Data collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

4.1.5 Data pre-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

4.1.6 Models Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 25

4.1.7 Performance Metrics for Models . . . . . . . . . . . . . . . . . . . . . 28

4.1.8 Tools and Experimental setup . . . . . . . . . . . . . . . . . . . . . . 30

5 RESULTS AND DISCUSSION 31

5.1 Raw Data from Source . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

5.2 Filtering the Input factor from expert opinion . . . . . . . . . . . . . . . . . 31

5.3 Data pre-processing results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

5.4 Results of Models implementation . . . . . . . . . . . . . . . . . . . . . . . . 43

6 CONCLUSION 48

7 FUTURE RECOMMENDATIONS 49

REFERENCES 53

APPENDIX A 54

List of Figures

3.1 Machine learning Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

3.2 Artificial Neural Network Model . . . . . . . . . . . . . . . . . . . . . . . . . 12

3.3 Regression Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

3.4 Support Vector Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

3.5 Support Vector Machine Model . . . . . . . . . . . . . . . . . . . . . . . . . 14

3.6 Decision Tree Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

3.7 Random forest Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

3.8 Extra tree Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

3.9 Voting Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

3.10 Stacking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

4.1 Workflow diagram of the cost estimation using machine learning models . . . 21

4.2 Data Entry in excel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

4.3 Types of building . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

5.1 raw data from different sources. . . . . . . . . . . . . . . . . . . . . . . . . . 31

5.2 Scatter plot of features vs Total cost . . . . . . . . . . . . . . . . . . . . . . 34

5.3 Scatter plot of features vs Total cost . . . . . . . . . . . . . . . . . . . . . . 35

5.4 Normal Distribution Graph for numerical features . . . . . . . . . . . . . . . 36

5.5 Normal Distribution Graph for numerical features . . . . . . . . . . . . . . . 37

5.6 Missing values of numerical features . . . . . . . . . . . . . . . . . . . . . . . 38

5.7 Adding missed values with mean . . . . . . . . . . . . . . . . . . . . . . . . . 39

5.8 Missing categorical values . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

5.9 Replaced categorical values with mode . . . . . . . . . . . . . . . . . . . . . 40

5.10 Replacing final cost with their natural logarithms . . . . . . . . . . . . . . . 40

5.11 Replacing final cost with their natural logarithms . . . . . . . . . . . . . . . 41

5.12 One hot encoding of categorical variables . . . . . . . . . . . . . . . . . . . . 42

5.13 Correlation Matrix Heatmap for Final features . . . . . . . . . . . . . . . . . 42

5.14 Mean square error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

5.15 Mean absolute error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

5.16 R square . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

5.17 Root Mean Square Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

5.18 Plot of Actual vs Predicted for all models . . . . . . . . . . . . . . . . . . . 47

7.1 Expert Opinion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

7.2 Expert Opinion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

7.3 Expert Opinion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

7.4 Expert Opinion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

7.5 Expert Opinion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

7.6 Pilot Survey . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

7.7 Pilot Survey . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

7.8 Pilot Survey . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

7.9 Letter for Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

7.10 Similarity Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

7.11 Paper Acceptance Mail . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

List of Tables

2.1 Machine Learning Models Used in Construction Cost Estimation . . . . . . . 7

4.1 List of Experts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

4.2 Pilot survey respondent’s information . . . . . . . . . . . . . . . . . . . . . . 23

5.1 Count of Response from Experts . . . . . . . . . . . . . . . . . . . . . . . . . 32

5.2 Additional Factors from Experts . . . . . . . . . . . . . . . . . . . . . . . . . 33

5.3 Comparison of Raw Data, Mean-Imputed Data, and Median-Imputed Data


based on variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

5.4 Model Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

CHAPTER 1
INTRODUCTION

1.1 Background

Building construction projects play a crucial part in the Nepalese economy. Villages in Nepal
are more likely to have adobe constructions, wooden-framed homes, and rubble stone masonry
structures, while the bulk of metropolitan areas and suburbs have stone or brick masonry
[1]. Twenty percent (20%) of buildings are made up of reinforced concrete (RC). Reasonably
predicting construction expenses is crucial in the early phases of a building project [2]. Cost
is seen as a standard indicator of the resources used on a project [3]. Cost estimation is a
critical aspect of any construction project, as it provides an initial budgetary framework
and assists stakeholders in making informed decisions. Accurate cost estimation can lead to
potential cost savings and improved time efficiency during project execution, making it an
attractive proposition for the construction industry in Nepal.
In Nepal, like many other developing countries, construction projects often face budget
overruns and delays due to inaccurate early cost estimates. Nepal’s construction sector
faces specific challenges such as limited resources, topographical constraints, and varying
socioeconomic conditions across the region. Despite the growing popularity of machine
learning in various industries, its application to the construction sector in Nepal has been
relatively limited. Machine learning methods forecast the cost of a building by using
historical records [4]. Traditional cost estimation methods might not fully account for these
complexities, making machine learning an appealing option to develop context-aware and
more accurate cost estimation models. Quantity Rate Analysis is the primary conventional
method commonly utilized for estimating costs [5].

1.2 Problem Statement

Construction cost estimation is an example of a knowledge-intensive engineering task, that
is, it relies on the knowledge of a human expert. Engineers and professionals carry it out
by measuring quantities from drawings and multiplying them by their rates. Drawings are
prepared in AutoCAD and the estimation file is made in Excel. Then, after adding the
individual rates of all items together with the charges for the labour, tools, and equipment
required, the preliminary cost is found. In actuality, it takes engineers several years to acquire
the skills required to carry out
the cost estimation procedure. The fundamental issue here is that the engineers' experience
is frequently not verified or documented. This skill is therefore vulnerable to subjectivity
and bias. On the other hand, incorrect cost estimation causes a variety of issues, including
modification orders and delays in the construction process. These two elements, the difficulty
of performing cost estimation manually and the consequences of inaccurate cost prediction,
encourage researchers and construction businesses to look for new, creative solutions to the
cost estimation challenge [6]. Traditional technologies are unable to process and evaluate the
vast volume of data generated by the construction sector, resulting in the loss of a significant
amount of data. The availability of drawings and information is quite limited during the
early stages of project development; thus, cost estimation plays a crucial role in investors'
decision-making [7]. Because of the limited project information accessible in the early stages,
numerous methodologies have been developed to predict construction costs, and one of them
is machine learning [8]. Sometimes, even after viewing the data, we are unable to evaluate or
extrapolate the information; machine learning is used in that situation [9]. With the ability
to identify patterns and relationships between input factors and cost, machine learning (ML)
technologies can help cost prediction in an early phase of design [10]. A more precise
estimation of construction costs can be made as more cost-related data becomes accessible [11].
For this, data is gathered first; machine learning models can then automatically identify
relevant features from a large set of data, allowing the model to focus on the most important
factors influencing cost, and they can capture nonlinear relationships between variables, which
traditional methods may struggle to handle. Supervised learning algorithms such as linear
regression and decision trees can be trained on historical project data to predict costs from
various inputs, and ML models can continuously learn and adapt as new data becomes
available, leading to continuous improvement in cost estimation accuracy over time.

1.3 Main Objective

To compare the performance of different machine learning algorithms in estimating the
preliminary costs of building construction projects specifically in Nepal.

1.3.1 Specific Objectives

• To identify the most significant features/input for cost prediction models of buildings.

• To develop and compare the performance of different machine learning models.

1.4 Scope of the Work

The research focuses on nine machine learning algorithms (Artificial Neural Networks,
Regression Analysis, Support Vector Machine, Decision Tree method, XGBoost method,
Random Forest method, Extra Trees method, Voting Regression, and Stacking) for preliminary
cost estimation only. The study is focused on projects whose estimated cost is above 1 crore
(NPR 10 million). This research selected relevant features that can impact cost estimation
accuracy by identifying factors from the literature review, further validated by expert opinion.
Data was collected from different residential, commercial, hospital, office, and public buildings.

1.5 Contributions of Thesis

The key contributions of the thesis:


Data Set: Data collection is a very difficult task in Nepal; the data were collected from
different residential, commercial, hospital, office, and public buildings. The publication of
this data set will be a significant contribution to further research.

Evaluation and study of different models: A detailed performance study of several
machine learning models helps the construction industry gain tentative insights into which
factors, if considered, determine the cost of building construction.

CHAPTER 2
LITERATURE REVIEW

2.1 Building definition and types of Building

The building is defined as any structure made of any material, whether or not it is inhabited
by people, and which includes the foundation, plinth, walls, floors, roofs, and building
services. Tents, tarpaulin shelters, and other transient structures are not to be regarded as
buildings. All governmental, non-governmental, and private structures that offer general
public services, amenities, opportunities, and products are referred to as public buildings.
Based on occupancy buildings are classified as Residential, Assembly, Educational, Hospitals
and clinics, Commercial, office, industries, and storage. Based on Storey and height buildings
are classified as General Buildings (1 to 5 Stories or below 16m), Medium Rise (6 to 8 Stories
or between 16m to below 25m), High Rise (9 to 39 Stories or 25m to below 100m), Skyscrapers
(40 Stories and above or above 100m) (NBC 206:2015). The Nepal National Building Code
(NBC) code is effectively implemented in buildings in Nepal and was formulated in 1994.

2.2 Factors affecting Building project

The variables affecting the building project are Project Characteristics (Building Type, Num-
ber of Storeys, Number of Blocks, Project Complexity Representative value, Programmed
Duration, Original Cost Estimate), Procurement System (Functional Grouping, Payment
Method, Contract Conditions), Project Team Performance (Contractor, Design Team, Man-
agement Team), Client: Client Representatives Characteristics (Client Type, Client Priority,
Client Source of Finance, Client Characteristics Representative Value ), Contractor Charac-
teristics (Contractor Characteristics Representatives Value), Design Team Characteristics
(Design Team Characteristics Representative Value), External Conditions (External Condi-
tions Representative Value)[12].

2.3 Cost estimate and types of Cost Estimate

Cost estimate provides a general concept of the cost of the work, allowing for the determination
of the project’s viability, or whether it might be completed within the allocated budget.
There are mainly three types [13]:

• A preliminary estimate is a rough one that is typically based on an approximate rate
per square foot. The measurements and areas in this estimate are provided for
illustrative purposes only. Sometimes the price can vary by up to 50%.

• A detailed estimate is a comprehensive estimate which includes material specifications,
the proposed method of completion, as well as precise measurements and drawings.
The quantities of the work may vary by up to 10%.

• An abstract estimate solely contains the entire quantities of the items of work, rates
determined by the schedule or market values, and the project’s overall cost. An
estimate that includes updated quantities, specifications, and rates is known as a
revised estimate.

2.4 Application of Machine Learning Algorithms in Cost Prediction

The research done by [2] showed that the Artificial Neural network model had a lower
error rate than the multiple regression model of projected building costs. In this study, the
multiple regression model and an artificial neural network model were contrasted using cost
information kept by a provincial office of education on primary schools built between 2004
and 2007. A total of 96 historical data points were divided into 20 for comparing the built
regression model with the artificial neural network model and 76 for building the models.
The artificial neural network model was shown to be superior in terms of average error rate
and standard deviation by comparing the estimated values of the two models.
[8] used 197 cases for model construction and validation, with the remaining 20 instances
used for testing, and discovered that the NN model provided more accurate estimation results
than the RA and SVM models. As a result, it was decided that the NN model was most
suited for determining the cost of school construction projects.
A data set with 530 historical cost records was employed by [14]. Compared to the CBR
or MRA models, the best NN model produced more precise estimates. The lengthy trial-and-
error procedure, however, made it difficult to find the optimal NN model. In comparison to
the other models, the CBR model was more effective concerning these tradeoffs, particularly
its clarity of explanation when calculating construction costs. The ability to update the

building cost model easily and maintain consistency in the variables contained are key aspects
of the model’s long-term use.
[15] described how ANN offers solutions for difficult problems where conventional modeling
techniques frequently fall short. For instance, ANN succeeds where many conventional modeling
techniques fail in capturing nonlinear and intricate interactions between the variables.
They do, however, have their restrictions. They frequently require a specific set of inputs
and outputs and can only be trained for that problem. As a result, any modification that
calls for updating the network’s architecture cannot be carried out automatically and must
instead include human involvement.
[16] conducted research where 174 actual residential projects in Egypt provided the source of
the statistics. To come to an understanding of the crucial elements influencing early-stage cost
assessment, the Delphi method was employed. Regression techniques such as gamma, Poisson,
and multiple linear regression were utilized. Using multiple linear regression, the suggested
hybrid model was derived from the ANN model and the regression models. In comparison
to the ANN model and regression models, the hybrid model’s mean absolute percentage
error was 10.64%, which is lower. The hybrid model’s results show that it was effective at
estimating the cost of residential projects and that it would be helpful to decision-makers in
the construction sector.
The research done by [17] showed that the R2 values of RF, SVM, and CatBoost were calculated
to be 0.900, 0.897, and 0.906, respectively. When comparing the performances of the different
models, the Stacking model was the best.
[18] discovered that regression models, on the other hand, typically required fewer model
parameters than neural networks, which led to greater prediction performance if the relation-
ships between the variables were well stated. Comparison of the regression model’s results
with those from the neural network model may help to determine whether the regression
model needs nonlinear or interaction terms.
[19] used 10,000 parametric building configurations. Among the 13 ML regression algorithms
used, the Artificial Neural Network (ANN), Gradient Boosting, and XGBoost models appeared
to be the most suitable to estimate the building costs and the required resources, with an
accuracy of 99% within less than a second of training time.

Table 2.1: Machine Learning Models Used in Construction Cost Estimation

1. (Kim G. H., 2004). Input parameters: Year, Gross floor area, Storeys, Total unit, Duration, Roof types, FDN, Usage of the basement, Finishing grades. Models: Regression analysis, neural networks, and case-based reasoning.

2. (Kim S. Y., 2005). Input parameters: Location, Area, Storey, Roof types, Unit per storey, Average area of unit, Foundation type, Usage of units, Finishing grades, Duration (months). Models: Case-Based Reasoning, Artificial Neural Network, Random Forest, Extra Tree.

3. (Kim G. H., 2013). Input parameters: Budget, School levels, Land acquisition, Class number, Building area, Gross floor area, Storey, Basement floor, Floor height. Models: Regression Analysis, Support Vector Machine and Neural Network, XGBoost.

4. (Roxas, 2014). Input parameters: Ground floor area, Typical floor area, Number of storeys, Number of columns, Type of footings, Number of elevators, Number of rooms. Models: Artificial Neural Network, Voting Regression.

5. (Yadav, 2016). Input parameters: Cement, Sand, Steel, Aggregate, Mason, Skilled worker, Non-skilled worker, Contractor per square foot of construction. Models: Artificial Neural Network, Regression.

6. (Badra, 2020). Input parameters: Slab type, Floor area, Electro-mechanical works, Number of floors, Number of elevators, Internal finishing, External finishing. Models: Regression Analysis.

7. (Badawy, 2020). Input parameters: Area of the floors, Number of floors, Type of external finishing, Type of interior finishing. Models: ANN model and regression models (multiple linear regression, polynomial regression, gamma regression, and Poisson regression).

8. (E. Chandgude, 2020). Input parameters: Number of storeys, Number of basements, Floor area, Volume of concrete, Area of formworks, Weight of reinforcing steel. Models: Artificial Neural Network, Support Vector Machine.

9. (Chandanshive, 2021). Input parameters: Ground floor area, Typical floor, Number of floors, Structural parking area, Quantity of elevator wall, Quantity of exterior wall, Quantity of exterior plaster, Area of flooring, Number of columns, Type of foundation, Number of householders. Models: SVM, Gradient Boosting.

10. (Veliyampatt, 2021). Input parameters: Geographic conditions, Ground conditions, Type of foundation, Type of building, Market conditions, Design complexity, Quality of work, Changes in material, Unforeseen items/conditions. Models: Artificial Neural Networks, Fuzzy Inference System and Regression Analysis.

11. (Uyeol Park, 2022). Input parameters: Gross floor area, Building area, Building height, Number of floors, Number of basement floors, Number of parking spaces. Models: Bagging, Boosting, Stacking.

CHAPTER 3
THEORETICAL BACKGROUND

3.1 Machine Learning

Artificial intelligence (AI) is a sub-field of computer science (CS), which implies the use of
computers and related technology to make a machine replicate or duplicate human behavior.
The goal of Artificial Intelligence is to create machines that can think like humans do,
including through learning, reasoning, and self-correction [20], [16]. Large amounts of data
from prior tenders are used by AI techniques, which then employ a self-learning process
to find patterns or links in the data sets. The identified relationships are not subject to
the subjectivity of estimators, and the utilization of AI methods reduces the effect of the
varying levels of expertise that estimators have on the accuracy of an estimate. The extensive
database of estimations and actual expenses that are recorded for prior projects is used by
these AI approaches, which also make use of implicit information about project execution[21].
Machine learning is a method that creates a model from data. In the field of machine learning,
data from the past is utilized to anticipate future results. The machine learning technique is
first applied to a training data set, and following the learning process, a model is created. The
final result of the machine learning process is this model, which may be applied in real-world
situations. Given unknown input data, the model can then produce an output based on the patterns
or relationships found in the training set. Data is what drives machine learning techniques.
Typically, data sets are gathered by humans and used for training. Depending on the training
approach, three different categories can be applied to machine learning techniques. The
learning process in a supervised machine learning technique is dependent on data sets that
offer both input and output values. Because the patterns in the data are identified using
the accurate output values of the input values, the process is referred to as supervised. The
supervised learning method is comparable to how people learn. Humans solve problems using
current knowledge, and then contrast the result with the original solution. If the response is
incorrect, the present understanding is changed to approach the issue more effectively the
following time. This is accomplished in supervised learning by repeatedly revising a model to
minimize the discrepancy between the correct output and the model’s output for the same
input. Unsupervised learning is applied when there is no known correct solution and relies

solely on the input values. This method looks for patterns in the data and responds based on
whether or not those patterns are present in each new piece of data. Unsupervised learning is
frequently used in clustering techniques. Unsupervised learning’s main benefit is that it can
uncover hidden data structures and learn features of the data that were previously unknown.
Sets of inputs, some outputs, and grades are used as training data in reinforcement learning. It
is typically employed when optimal interaction is needed, such as in control and games. The
main focus is on handling issues in a dynamic setting where a given circumstance necessitates
a given course of action. The emphasis is on performance, which entails striking a balance
between exploring new ground and exploiting existing knowledge [21].

3.2 Machine Learning Algorithms

Machine learning uses a variety of techniques to address data issues. The kind of algorithm
used depends on the type of problem we are trying to answer, how many variables there are,
what kind of model will work best, and other factors.

Figure 3.1: Machine learning Algorithms


1. Artificial Neural Networks (ANN)
ANN has been employed in the Architecture, Engineering, and Construction (AEC) industry
since the early 1990s to deal with practical CM problems that are challenging to handle
using standard modeling and analytical techniques. Previous research has demonstrated
that ANN can significantly influence prediction, optimization, categorization, and decision-
making in CM practice. From the planning stage to the operation and maintenance stage,
it has successfully aided in resolving specific issues throughout the project’s life cycle [22].
Input, hidden, and output layers are the three different types of neuron layers that make
up the fundamental architecture. In feed-forward networks, the signal flow is strictly in the
feed-forward direction from input to output units. There are no feedback links, but the data
processing can span several (layers of) units. Feedback links are seen in recurrent networks
[23]. The following three layers could be found in a neural network:
1. Input Layer:
In an input layer, there are typically as many input nodes as there are explanatory variables.
The network receives the patterns from the input layer and transmits them to one or more
hidden layers via the network.
2. Hidden Layer:
The input values inside the network are subjected to the modifications applied by the hidden
layers. This involves incoming arcs that come from input nodes connected to each node or
from other hidden nodes. It connects to output nodes or other hidden nodes using arcs that
are leaving the system. The actual processing is carried out through a system of weighted
connections in a hidden layer.
3. Output Layer:
An output layer is then connected to the hidden layers. The input layer or hidden layers
can send connections to the output layer. It provides an output value that is consistent with
the response variable’s forecast. Most categorization issues have a single output node. The
following connection determines the neuron output signal O:
O = f(net) = f( Σ_{j=1}^{n} w_j x_j )   (3.1)

where the function f(net) is referred to as an activation (transfer) function and w is the
weight vector. The weight and input vectors are combined to create a scalar product known
as the variable net:

net = w^T x = w_1 x_1 + . . . + w_n x_n   (3.2)

where T denotes the matrix transpose. In the most straightforward scenario, the output
value O is calculated as

O = f(net) = { 1 if w^T x ≥ θ; 0 otherwise }   (3.3)

where θ denotes the threshold level; this kind of node is known as a linear threshold unit [24].
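To make the linear threshold unit concrete, a minimal sketch in Python with NumPy is shown below; the input, weights, and threshold are arbitrary illustrative values, not taken from the thesis.

import numpy as np

# Minimal sketch of the linear threshold unit in Eqs. (3.1)-(3.3).
# The input, weights, and threshold below are arbitrary example values.
def neuron_output(x, w, theta):
    net = np.dot(w, x)                   # net = w^T x (Eq. 3.2)
    return 1.0 if net >= theta else 0.0  # step activation (Eq. 3.3)

x = np.array([0.5, 1.2, -0.3])   # input vector
w = np.array([0.4, 0.6, 0.1])    # weight vector
print(neuron_output(x, w, theta=0.5))  # prints 1.0, since net = 0.89 >= 0.5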

Figure 3.2: Artificial Neural Network Model



2. Linear Regression (LR)


By relating the observed data to a linear equation, linear regression attempts to describe the
connection between two variables. There are two kinds of variables: the independent
(explanatory) variable and the dependent (response) variable. Any variable
whose values we seek to predict or explain is referred to as a dependent variable. Since we
utilize this variable as the foundation for predictions, it may also be referred to as a predicted
variable. The variable that helps forecast the value of the output variable is known as the
input or predictor variable. It is frequently called X. The variable we wish to forecast is the
output or target variable. It is frequently called Y. Regression is represented by the equation:

Y = β0 + β1 X + ϵ (3.4)

where:

• Y is the target variable or the dependent variable.

• X is the input or the independent variable.

• β0 and β1 are the coefficients of the regression model. They represent the intercept and
slope of the line of best fit, respectively.

• ϵ is the random error or residual term, which captures the variation in the data that
cannot be explained by the model.

Regression aims to estimate the values of β0 and β1 using a sample of data. The line of best
fit is determined by these estimated coefficients. Once the relationship between the input
variable (X) and the target variable (Y ) is established, the model can be used to forecast
the values of Y for new or unknown inputs. By fitting a line to the data, regression helps us
understand if there is a link between the input and output variables and enables us to make
predictions based on that relationship [25], [26].
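As an illustration, the following minimal sketch fits Eq. (3.4) with scikit-learn; the toy data are invented for illustration and are not from the thesis data set.

import numpy as np
from sklearn.linear_model import LinearRegression

# Minimal sketch of Eq. (3.4): estimating beta_0 and beta_1 from sample data.
X = np.array([[1.0], [2.0], [3.0], [4.0]])   # input/predictor variable
y = np.array([2.1, 4.2, 5.9, 8.1])           # target variable

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_[0])      # estimates of beta_0 and beta_1
print(model.predict([[5.0]]))                # forecast Y for a new input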

Figure 3.3: Regression Analysis



3. Support Vector Machine


Support Vector Machine, sometimes known as SVM, is a linear model used to solve classi-
fication and regression issues. It works well in cases with complex patterns in the data, data
sets where the number of features is high, and small data sets, as in situations where the
available data for cost estimation might be limited. The SVM concept
is straightforward: A line or a hyperplane that divides the data into classes is produced by
the algorithm. To achieve the greatest feasible separation between the two classes, SVM
attempts to create a decision boundary [27].

Figure 3.4: Support Vector Machine

Sometimes the data points cannot be separated linearly, i.e., there is no line (separating
hyperplane) that performs well on the two classes, even if we use a soft margin classifier that
allows for misclassification. In this case, the function f(x) = β0 + Σ_{i∈S} (a_i ⟨x, x_i⟩) can
be written using the kernel function as:

f(x) = β0 + Σ_{i∈S} (a_i K(x, x_i))   (3.5)

Here, i ∈ S denotes the sum over the set of indices corresponding to the support points, and
⟨x, x_i⟩ represents the inner product between points x and x_i, given by ⟨a, b⟩ = Σ_i (a_i b_i) [28].
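A hedged sketch of kernel-based SVM regression with scikit-learn follows; the RBF kernel stands in for K(x, x_i), and the synthetic data are for illustration only.

import numpy as np
from sklearn.svm import SVR

# Illustrative sketch of kernel-based SVM regression (Eq. 3.5); the RBF
# kernel replaces the plain inner product. Toy data, not the thesis data.
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(50, 2))          # 50 samples, 2 features
y = X[:, 0] * 1.5 + np.sin(X[:, 1]) + rng.normal(0, 0.1, 50)

svr = SVR(kernel="rbf", C=10.0, epsilon=0.1)  # K(x, x_i) is the RBF kernel
svr.fit(X, y)
print(svr.predict(X[:3]))                     # predictions for sample inputs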

Figure 3.5: Support Vector Machine Model

4. Decision Tree method


Each internal node of a decision tree represents a test on a feature (such as whether a
coin will land on its head or tail), each leaf node represents a class label (decision made
after computing all features), and branches represent conjunctions of features that result in
those class labels. Classification rules are represented by the routes from root to leaf. The
fundamental decision-making loop is shown in the picture below with the labels Rain(Yes),
and No Rain(No) [29].

Figure 3.6: Decision Tree Method

There are three different kinds of nodes in this tree-structured classifier. The first node,
known as the Root Node, represents the whole sample and has the potential to divide into
further nodes. A data set’s characteristics are represented by the Interior Nodes, and the
decision criteria are represented by the branches. Lastly, the result is represented by the Leaf
Nodes. This method is incredibly helpful in resolving decision problems. When a certain
data point is passed through, True or False questions are answered all the way down the tree
until it reaches a leaf node. The average of the dependent variable’s values at that specific
leaf node represents the final prediction. The tree is thus able to forecast an appropriate
value for the data point after going through several rounds [30].
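The following minimal sketch illustrates this behaviour with scikit-learn's DecisionTreeRegressor; each leaf predicts the average target value of the training points that reach it. The toy data are illustrative only.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Sketch of a regression tree: each leaf stores the average of the target
# values of the training points that reach it. Toy data for illustration.
X = np.array([[80], [120], [150], [200], [260], [300]])  # e.g., floor area
y = np.array([1.2, 1.8, 2.1, 3.0, 3.9, 4.5])             # e.g., cost

tree = DecisionTreeRegressor(max_depth=2, random_state=42)
tree.fit(X, y)
print(tree.predict([[170]]))  # True/False splits are followed down to a leaf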

5. XGboost method
The gradient-boosted trees approach is implemented using the open-source software known
as XGBoost, which stands for extreme gradient boosting. Due to its accuracy and simplicity,
it has been one of the most used machine learning approaches. It is a method for supervised
learning that may be applied to applications involving classification or regression. For orga-
nized, tabular data, XGBoost has performed quite well. On the whole, it is fast, and incredibly
fast compared to other gradient-boosting solutions. When it comes to classification and
regression predictive modeling problems, XGBoost dominates structured or tabular data sets.
Its key strength is its effective handling of missing values, which enables it to handle
real-world missing-value data without the need for intensive pre-processing. Additionally, it
includes integrated parallel processing capability, allowing models to be trained on huge data
sets quickly [31].

F(x_i) = Σ_{k=1}^{K} f_k(x_i) + bias   (3.6)

Where:

K is the number of trees (boosting rounds) in the ensemble.

fk (xi ) is the prediction made by the k-th tree for the instance xi .

The final prediction F (xi ) is the sum of predictions from all trees plus a bias term.

Boosting is a sequential strategy in machine learning that increases the accuracy of the
model by turning weak learners, or hypotheses, into strong learners, with an emphasis on
effectiveness, computational speed, and model performance. A machine learning model that
outperforms random guessing only marginally is called a weak learner. Extreme Gradient
Boosting (XGBoost) is a scalable and enhanced variant of the gradient boosting technique.
XGBoost is a superb combination of hardware and software capabilities that improves on
current boosting approaches in precision with the least amount of time. XGBoost creates a
strong, far more accurate learner by merging many weak learners. To operate, XGBoost
trains several decision trees, each on a portion of the data, and the final forecast is formed
by combining the predictions made by each tree. It is a step up from the GBM algorithm;
the primary distinction is that overfitting is less likely with XGBoost, as it employs a more
regularised model [19].
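A short illustrative sketch of this boosting procedure with the xgboost library follows; the synthetic data and hyperparameter values are assumptions for demonstration, not the thesis settings.

import numpy as np
from xgboost import XGBRegressor

# Sketch of sequential boosting (Eq. 3.6): each of n_estimators trees is
# fitted to correct the errors of the previous ones. Synthetic data only.
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(200, 4))
y = 3 * X[:, 0] + X[:, 1] ** 2 + rng.normal(0, 0.05, 200)

xgb = XGBRegressor(
    n_estimators=100,   # K, the number of boosting rounds
    learning_rate=0.1,  # shrinks each tree's contribution
    max_depth=3,        # keeps each weak learner simple
    random_state=42,
)
xgb.fit(X, y)
print(xgb.predict(X[:3]))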
6. Random Forest Method
A supervised learning algorithm is a random forest. An ensemble of decision trees, often
trained using the bagging approach, make up the ”forest” that it constructs. The bagging
method’s general premise is that combining learning models improves the end outcome. On
various samples, it constructs decision trees and uses their average for classification and
majority vote for regression. The Random Forest Algorithm’s ability to handle data sets
with continuous variables, as in regression, and categorical variables, as in classification, is
one of its most crucial qualities. For classification and regression tasks, it performs better.
The method excels at handling complicated data sets and minimizing overfitting, making it a
helpful tool for a variety of machine learning predictive applications [32].
F(x_i) = (1/T) Σ_{t=1}^{T} f_t(x_i)   (3.7)
Where:

T is the number of trees in the Random Forest ensemble.

ft (xi ) is the prediction made by the t-th decision tree for the instance xi .

The final prediction F (xi ) is the average of predictions from all trees.

Each decision tree f_t is a function that takes the features x_i as inputs and returns a prediction based on those features.

Figure 3.7: Random forest Method

Each decision tree in the Random Forest model is built using a subset of features and a
subset of data points. To put it simply, m features and n random records are selected from a
data collection containing k records. Every sample has a different decision tree generated
for it. Every decision tree will provide a result. The final output is evaluated using majority
voting for classification or averaging for regression [33].

7. Extra Tree Method


Extra Trees is considered when greater precision than a generic model is required; as a result,
it gives low variance. Additionally, it assigns feature importance. In contrast to Random
Forest, the Extra Trees method does not use bootstrap aggregation: to put it simply, it takes
a random subset of data without replacement. Moreover, rather than searching for the best
splits, nodes are split at random. Therefore, in Extra Trees, randomization originates from
the data’s random splits rather than from bootstrap aggregating. Extra Trees is considerably
quicker than Random Forest in terms of computational cost, because Extra Trees uses a
random algorithm to choose the value at which to split features, as opposed to the Random
Forest strategy [34].
F(x_i) = (1/T) Σ_{t=1}^{T} f_t(x_i)   (3.8)
Where:

T is the number of trees in the Extra Trees ensemble.

ft (xi ) is the prediction made by the t-th decision tree for the instance xi .

The final prediction F (xi ) is the average of predictions from all trees.

Figure 3.8: Extra tree Method

More specifically, Extra Trees would be a better option than other ensemble tree-based models
when developing models with significant feature engineering or feature selection pre-modelling
procedures, and when computational cost is a concern [35]. The number of decision trees in the
ensemble, the number of input features to choose at random and consider for each split point,
and the minimum number of samples needed in a node to establish a new split point are the
three primary hyperparameters in the method that need to be tuned [36].
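The sketch below contrasts the two ensembles using scikit-learn, reflecting Eqs. (3.7) and (3.8); by default ExtraTreesRegressor trains each tree on the whole sample with random split thresholds, while RandomForestRegressor uses bootstrap samples and best splits. The synthetic data are illustrative only.

import numpy as np
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor

# Random Forest: bootstrap samples, best splits.
# Extra Trees: whole sample by default, random split thresholds.
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(300, 5))
y = 2 * X[:, 0] + X[:, 1] * X[:, 2] + rng.normal(0, 0.05, 300)

rf = RandomForestRegressor(n_estimators=100, random_state=42)  # T = 100 trees
et = ExtraTreesRegressor(n_estimators=100, random_state=42)
rf.fit(X, y)
et.fit(X, y)
print(rf.predict(X[:2]), et.predict(X[:2]))  # averages over all trees
print(et.feature_importances_)               # Extra Trees also ranks features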

8. Voting Regression
A sort of ensemble approach called the voting ensemble method integrates the results of
various models by voting. By pooling the knowledge and experience of various experts, the
voting ensemble approach can be utilized to produce predictions that are more accurate than
those made by any one model. The concept is to lower the variance and prevent overfitting by
combining the predictions of various models. In situations when there are numerous models
with various configurations or methods, the voting ensemble method is frequently employed.
The classifier ensemble shown below was developed utilizing models trained using various
machine learning algorithms, including logistic regression, SVM, random forest, and others
[37].

Figure 3.9: Voting Regression

For a given instance x_i with features x_{i1}, x_{i2}, . . . , x_{in}, the prediction F(x_i) made by a
voting regression can be represented as follows:

F(x_i) = (1/M) Σ_{m=1}^{M} f_m(x_i)   (3.9)
Where:

M is the number of individual regression models in the ensemble.

fm (xi ) is the prediction made by the m-th individual regression model for the instance xi .

The final prediction F (xi ) is the average of predictions from all individual models.

First off, voting won’t be hampered by significant mistakes or incorrect classifications from a
single model, because it depends on the performance of several models. Having several models
that can make the right forecast helps reduce the chance of the combined prediction being
incorrect when one model fails. The estimator can be made more resilient and less prone
to overfitting using this method. Strong performance from other models can compensate for
poor performance from one [26], [38].
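A minimal sketch of Eq. (3.9) with scikit-learn's VotingRegressor follows; the base models and toy data are illustrative choices, not the exact thesis configuration.

import numpy as np
from sklearn.ensemble import VotingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Sketch of Eq. (3.9): the final prediction is the average of the
# individual models' predictions. Toy data for illustration only.
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(100, 3))
y = X.sum(axis=1) + rng.normal(0, 0.05, 100)

voter = VotingRegressor(estimators=[
    ("lr", LinearRegression()),
    ("dt", DecisionTreeRegressor(random_state=42)),
    ("rf", RandomForestRegressor(n_estimators=50, random_state=42)),
])
voter.fit(X, y)
print(voter.predict(X[:3]))  # mean of the three models' outputs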

9. Stacking
Stacking, which consists of two-layer estimators, is a method of assembling classification or
regression models. All the baseline models that are used to forecast the results on the test
data sets are contained in the first layer. The Meta-Classifier or Regressor in the second layer
uses all the predictions from the baseline models as input to produce new predictions [39].

F(x_i) = meta-model(prediction_1(x_i), prediction_2(x_i), . . . , prediction_N(x_i))   (3.10)

Where:

meta-model is the model that combines the predictions of the base models.

prediction_1(x_i), prediction_2(x_i), . . . , prediction_N(x_i) are the predictions made by the base models.

Figure 3.10: Stacking

The training set is divided into two folds; L weak learners are selected and fitted to the first
fold’s data, and the weak learners’ predictions on the second fold are then used as inputs to
fit the meta-model [17].

F(x_i) = Σ_{t=1}^{T} γ_t h_t(x_i)   (3.11)
Where:

T is the number of boosting iterations (trees) in the LightGBM model.

γt is the weight assigned to the t-th tree.

ht (xi ) is the prediction made by the t-th tree for the instance xi .
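As an illustration of the two-layer idea, the sketch below uses scikit-learn's StackingRegressor, whose internal cross-validation plays the role of the second fold described above; the base models and toy data are illustrative assumptions.

import numpy as np
from sklearn.ensemble import StackingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# First layer: base models; second layer: a meta-regressor that combines
# their predictions (Eq. 3.10). Toy data for illustration only.
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(120, 3))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(0, 0.05, 120)

stack = StackingRegressor(
    estimators=[
        ("dt", DecisionTreeRegressor(random_state=42)),
        ("rf", RandomForestRegressor(n_estimators=50, random_state=42)),
    ],
    final_estimator=LinearRegression(),  # the meta-model
)
stack.fit(X, y)
print(stack.predict(X[:3]))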

CHAPTER 4
METHODOLOGY

4.1 Workflow Diagram

Figure 4.1: Workflow diagram of the cost estimation using machine learning models

Figure 4.1 shows the flow of the overall steps involved in the machine learning models.

4.1.1 Topic Selection

The selection of the topic “A Comparative Study of the Performance of Different Machine
Learning Algorithms In Estimating the Preliminary Costs of Building Construction Projects

Specifically In Nepal” for the thesis was driven by its profound relevance and practicality
within the context of the Nepalese construction industry. By considering the data from
past projects, computers can learn and make accurate and fast estimations as compared to
software tools and human expertise which is tedious and time-consuming.

4.1.2 Expert Opinion

The input factors were gathered from the literature review. Expert opinion was taken to
filter the factors and make them relevant for buildings in Nepal. The questionnaire was filled
out by 5 experts, including both contractors and consultants. The following criteria were
used to qualify an expert:

• more than 12 years of experience in this construction field.

• must have a relevant educational background.

• working as Consultant/Contractor.

Expert; Company Name; Designation/Position; Experience (years); Experience in Building (years)
Rajan Shrestha; BKOI Builders Pvt. Ltd; Senior Engineer / Contract Manager; 20; 15
Laxmi Bhakta Maharjan; Boddhi Engineering Consultancy; Proprietor / Senior Engineer; 18; 18
Sabir Baidya; Civil Homes; Project Manager; 14; 12
Suraj Bhattarai; Seismo-tech Engineering Consultancy Pvt. Ltd; Managing Director; 12+; 12+
Sunil Raj Acharya; Building Design Authority Pvt. Ltd; Senior Civil Engineer; 20; 15
Table 4.1: List of Experts

4.1.3 Pilot Testing

A pilot survey is a survey that the researcher conducts with a smaller data size in collaboration
with the experts. The gathered responses help guide whether to move forward with the
research or change the questionnaire. It also helps to discover challenges that can affect the main
data collection process [40]. The questionnaire developed is attached in an Appendix. A
pilot test was conducted with 3 respondents to check the clarity and comprehensibility of
the questionnaire. The respondents easily understood the questionnaire; hence, there was no

difficulty in filling up the questionnaire. The minimum time taken to fill the questionnaire
was around 10 minutes and the maximum time taken was almost 20 minutes. The summary
of the respondents of the pilot test is given in the table below:

S.N.; Name of the Company; Name of the respondent; Level of Education; Position; Time taken
1; Shitprava Architect and Engineering Consultancy Pvt. Ltd; Er. Ganesh Sapkota; Masters Degree; Project Consultant; almost 10 minutes
2; Dream Height Engineering and Consultancy Pvt. Ltd; Er. Sagar Bista; Masters Degree; Project Manager; almost 12 minutes
3; Seismo-Tech Engineering Consultancy Pvt. Ltd; Er. Suraj Bhattarai; Masters Degree; Project Consultant; almost 20 minutes
Table 4.2: Pilot survey respondents’ information

4.1.4 Data collection

Building projects’ structural data were gathered from various construction firms and con-
sultancies with their final cost of projects. Data were collected from the Department of
Urban Development and Building Construction (DUDBC), Consultancies, and Contractors.
The data collection process was very tough, as the bidding amount is confidential for
contractors.

Figure 4.2: Data Entry in Excel

Figure 4.3: Types of building

4.1.5 Data pre-processing

• Separated Numerical and Categorical Features (excluding the total cost of the project).

• Read the Excel file into a Data Frame (df).

• Counted the number of numerical features.

• Plotted individual scatter plots for numerical features to visualize the data and identify
outliers.

• Plotted scatter plots between numerical features and the total cost of the project to
analyze their relationship.

• Plotted a normal distribution graph for numerical features to analyze the distribution
of the data.

• Calculated mean, median, mode, and variance for individual numeric features.

• Replaced missing values in numerical features with the mean, median, and mode, and
data was saved in separate Excel sheets.

• Normal distribution graph was plotted for all metrics.

• Based on the analysis of variance, replaced missing values with mean as there is less
variance in data when replaced with the mean value.

• Counted the number of missing categorical features.

• Plotted histograms for categorical features to find whether values are unique or in some
order.

• Replaced missing values in categorical features with the mode (i.e., repeated values)
and again plotted histograms.

• Categorical features were encoded using one-hot encoding, since they represent distinct
project names without any inherent order.

• Encoding the categorical features with one-hot encoding gives all the data values as
numeric features, hence suitable for further analysis.

• There are no missing categorical data after these steps.

• The final data set is ready and is split into training and testing sets in a ratio of 80:20
for implementation in the models; a condensed sketch of the whole pipeline follows.
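A condensed sketch of this pipeline in Python is given below; the file name and column names ("data.xlsx", "Total Cost") are hypothetical placeholders, since the actual sheet layout is not reproduced here.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Sketch of the pre-processing steps; "data.xlsx" and the column name
# "Total Cost" are hypothetical placeholders.
df = pd.read_excel("data.xlsx")

num_cols = df.select_dtypes(include="number").columns.drop("Total Cost")
cat_cols = df.select_dtypes(exclude="number").columns

df[num_cols] = df[num_cols].fillna(df[num_cols].mean())  # mean imputation
for col in cat_cols:
    df[col] = df[col].fillna(df[col].mode()[0])          # mode imputation

df["Total Cost"] = np.log(df["Total Cost"])              # natural logarithm

df = pd.get_dummies(df, columns=list(cat_cols))          # one-hot encoding

X = df.drop(columns="Total Cost")
y = df["Total Cost"]
X_train, X_test, y_train, y_test = train_test_split(     # 80:20 split
    X, y, test_size=0.2, random_state=42)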

4.1.6 Models Implementation

In this work models such as Linear Regressor, Decision Tree Method, Random Forest method,
Artificial Neural Networks, Support Vector Machine, XGboost method, Extra tree method,
Voting Regression, and Stacking method are implemented.

Linear Regression (LR) is a basic regression model used to establish a linear relationship
between independent and dependent variables. Decision Tree Regressor (DT) partitions data
into subsets based on features for predictions. Random Forest Regressor (RF), an ensemble
method, combines predictions from multiple decision trees. The Neural Network (NN)

comprises several dense layers with varying activations trained using the ’Adam’ optimizer for
100 epochs to minimize mean squared error. XGBoost Regressor (XGB) is a gradient-boosting
algorithm that combines weak learners to boost predictive performance. Support Vector
Machine (SVM) with a linear kernel is used for regression. Extra Trees Regressor (ET)
is similiar to Random Forest but employs random thresholds for feature splitting. Voting
Regressor (Voting) amalgamates LR, DT, and RF models. Stacking Regressor (Stacking)
combines LR, DT, RF, ET, and Gradient Boosting models via a meta-regressor. Each model
showcases unique methodologies and predictive strengths tailored to the task at hand.

• Linear Regression, Decision Tree, Random Forest, Extra Trees, Voting Regressor:
scikit-learn

• Neural Network: Keras with TensorFlow backend

• XGBoost: xgboost library

• Support Vector Machine: scikit-learn

• Gradient Boosting: scikit-learn

Setting a seed or random state parameter ensures reproducibility of results. When specifying a particular number, such as 42, as the random state or seed, the random number generator is initialized so that each execution of the code with the same seed produces identical random values. For instance, the random_state=42 parameter is used in DecisionTreeRegressor, RandomForestRegressor, ExtraTreesRegressor, and GradientBoostingRegressor, which initializes these models with the same random seed and guarantees consistent behavior across different runs. Similarly, in XGBoost, random_state=42 is used to ensure reproducibility in the behavior of the XGBoost regressor; a minimal sketch follows.
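For illustration, a sketch of fixing the seed across the tree-based models and XGBoost:

    from sklearn.tree import DecisionTreeRegressor
    from sklearn.ensemble import (RandomForestRegressor, ExtraTreesRegressor,
                                  GradientBoostingRegressor)
    from xgboost import XGBRegressor

    # The same seed makes every run reproduce identical splits and trees
    dt = DecisionTreeRegressor(random_state=42)
    rf = RandomForestRegressor(random_state=42)
    et = ExtraTreesRegressor(random_state=42)
    gb = GradientBoostingRegressor(random_state=42)
    xgb = XGBRegressor(random_state=42)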

Model Architectures:

• Linear Regression (LR): Utilizes a simple linear model to establish a linear relation-
ship between input features and the target variable. No hidden layers are involved.

• Decision Tree (DT) Regressor: Employs a decision tree-based model to make predictions using a tree-like graph, consisting of nodes representing features, branches, and leaf nodes containing the predicted values.

• Random Forest (RF) Regressor: Comprises an ensemble of decision trees to enhance
prediction accuracy by averaging the outputs of multiple decision trees.

• Neural Network (NN) Regressor: Implements a feedforward neural network with four layers: an input layer with the number of features as neurons, followed by three hidden layers having 2048, 256, and 64 neurons respectively, and an output layer with one neuron for prediction.

– The neural network consists of a Sequential model, indicating a linear stack of layers.

– There are three hidden Dense layers:

1. Dense layer with 2048 neurons and ’softmax’ activation function.

2. Dense layer with 256 neurons and ’relu’ (Rectified Linear Unit) activation
function.

3. Dense layer with 64 neurons and ’relu’ activation function.

– Finally, there is an output layer with a single neuron.

Input shape:

– The input shape is determined by the number of features in the training data. It is specified as (X_train.shape[1], ), which indicates the number of columns or features in the input data.

Model compilation:

– The model is compiled using the ’adam’ optimizer with the loss function set to ’mean_squared_error’.

Training:

– The model is trained using the fit method on the X_train and y_train data.

– The training is performed for 100 epochs with a batch size of 32.

– The verbose parameter set to 0 implies that no output will be printed during
training.

– Validation data (X_test, y_test) is used to validate the model’s performance after each epoch.

Predictions:

– After training, the model is used to make predictions on the test data (X_test), and the predictions are stored in nn_predictions; a minimal sketch of this network follows.
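A minimal Keras sketch of this network, assuming X_train, y_train, X_test, and y_test are already prepared; the layer sizes, activations, optimizer, loss, epochs, and batch size follow the description above:

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense

    model = Sequential([
        Dense(2048, activation="softmax", input_shape=(X_train.shape[1],)),
        Dense(256, activation="relu"),
        Dense(64, activation="relu"),
        Dense(1),  # single output neuron for the predicted cost
    ])
    model.compile(optimizer="adam", loss="mean_squared_error")
    model.fit(X_train, y_train,
              epochs=100, batch_size=32, verbose=0,
              validation_data=(X_test, y_test))
    nn_predictions = model.predict(X_test)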

• XGBoost Regressor: Deploys an XGBoost-based ensemble model using gradient boosting that sequentially builds multiple decision trees to predict the target variable.

• Support Vector Machine (SVM) Regressor: Uses a support vector machine algorithm to find the hyperplane that best fits the data points within a margin of tolerance.

• Extra Trees (ET) Regressor: Functions as an ensemble model using extremely randomized trees, which are an extension of Random Forests.

• Voting Regressor: Creates an ensemble by combining the predictions from multiple base estimators (LR, DT, RF) and generates a final prediction based on the aggregated results.

• Stacking Regressor: Combines predictions from multiple base estimators (LR, DT, RF, ET, GB) using a meta-estimator (LR) to produce final predictions. A sketch of both ensemble wrappers follows.
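A sketch of how the two ensemble wrappers can be assembled in scikit-learn; the base-model settings shown are assumptions:

    from sklearn.linear_model import LinearRegression
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.ensemble import (RandomForestRegressor, ExtraTreesRegressor,
                                  GradientBoostingRegressor, VotingRegressor,
                                  StackingRegressor)

    voting = VotingRegressor(estimators=[
        ("lr", LinearRegression()),
        ("dt", DecisionTreeRegressor(random_state=42)),
        ("rf", RandomForestRegressor(random_state=42)),
    ])

    stacking = StackingRegressor(
        estimators=[
            ("lr", LinearRegression()),
            ("dt", DecisionTreeRegressor(random_state=42)),
            ("rf", RandomForestRegressor(random_state=42)),
            ("et", ExtraTreesRegressor(random_state=42)),
            ("gb", GradientBoostingRegressor(random_state=42)),
        ],
        final_estimator=LinearRegression(),  # meta-estimator as described above
    )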

4.1.7 Performance Metrics for Models

The models are evaluated by calculating the mean squared error, mean absolute error, R-squared, and root mean squared error [41].

a. Mean Squared Error (MSE): The Mean Squared Error (MSE) assesses the average
squared differences between predicted (Pj ) and actual (Tj ) values in the dataset. The formula
for MSE is given by:
MSE = \frac{1}{N} \sum_{j=1}^{N} (T_j - P_j)^2
Where:

• N represents the total number of data points.

• Lower MSE values indicate better agreement between predicted and actual values, with
a perfect model yielding an MSE of 0.

b. Mean Absolute Error (MAE): The Mean Absolute Error (MAE) measures the average
absolute differences between predicted (Pj ) and actual (Tj ) values in the dataset. The formula
for MAE is given by:
MAE = \frac{1}{N} \sum_{j=1}^{N} |T_j - P_j|
Where:

• N represents the total number of data points.

• MAE provides a measure of the average magnitude of errors between predicted and
actual values. It is less sensitive to outliers compared to Mean Squared Error (MSE).

c. R-Squared (Coefficient of Determination): The R2 coefficient of determination quantifies the proportion of variance in the target variable (T_j) that is predictable from the independent variables. The R2 value is calculated as:

R^2 = 1 - \frac{SSE}{SST}
Where:

• SSE denotes the Sum of Squares of Residuals, reflecting the difference between predicted
and actual values.

• SST represents the Total Sum of Squares, indicating the total variance in the target
variable.

d. Root Mean Square Error: Root Mean Square Error (RMSE) is a commonly used
metric in the field of statistics and machine learning to evaluate the accuracy of a predictive
model. It is a measure of the average magnitude of the errors between predicted and actual
values. The RMSE is calculated by taking the square root of the mean of the squared
differences between predicted and actual values. Here’s the formula:

RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}    (4.1)

Where:

n is the number of observations or data points,

y_i is the actual or observed value for the i-th data point,

ŷ_i is the predicted value for the i-th data point.

A sketch of computing these metrics follows.
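A sketch of computing all four metrics with scikit-learn and NumPy, assuming y_test holds the actual values and predictions the model outputs:

    import numpy as np
    from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

    mse = mean_squared_error(y_test, predictions)
    mae = mean_absolute_error(y_test, predictions)
    rmse = np.sqrt(mse)  # RMSE is the square root of MSE
    r2 = r2_score(y_test, predictions)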

4.1.8 Tools and Experimental setup

The code is written in Python and uses various libraries for data pre-processing and machine learning. The Google Colaboratory platform was used for the training part of the experiment. It is a cost-free, cloud-based Python environment in which a user can execute code on powerful computing resources that train complex machine learning models faster than general-purpose computers. PyTorch version 1.13, an open-source platform for machine learning projects, was also considered for implementing the different architectures. The Python Imaging Library (PIL 7.0.0) and Matplotlib were used to perform image processing and computer vision tasks.

• pandas (pd): This library is used for data manipulation and analysis. It provides data structures like DataFrames that allow users to work with structured data efficiently.

• sklearn.model_selection: This module within the scikit-learn library provides tools for splitting data sets into train and test sets, and for cross-validation.

• sklearn.impute: This module provides classes for imputing (filling in) missing values
in data sets.

• sklearn.preprocessing: This module provides functions for preprocessing data before feeding it into machine learning models. In this context, it includes Label Encoding, which is used to transform categorical labels into numerical values.

• sklearn.ensemble: This module provides ensemble methods for machine learning, such as Random Forest Regressor, which is a type of ensemble model for regression tasks.

• sklearn.metrics: This module provides various metrics to evaluate the performance of machine learning models; mean_squared_error is a common metric to measure the quality of a regression model’s predictions. A consolidated import sketch of these modules follows.
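A consolidated import sketch corresponding to the modules listed above:

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import LabelEncoder
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_squared_error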

CHAPTER 5
RESULTS AND DISCUSSION

5.1 Raw Data from Source

Figure 5.1: Raw data from different sources.

5.2 Filtering the Input factor from expert opinion

As per the experts, laboratory tests, use of building code, consulting fees, area of formwork, market condition, solid waste management, roof type, insurance of staff, material, and equipment, and waterproofing were the least important factors, as shown by the response counts in Table 5.1. Factors with a high count for the "Yes" aspect were retained, while factors with a high count for the "No" aspect were eliminated. Additional factors were also suggested by the experts, as shown in Table 5.2. After eliminating and adding some factors, a new questionnaire was prepared, which is shown in the Appendix. The building attributes finally considered for further processing include the name of the project, location of the building, type of building, construction completion year, site/geographic conditions, access to the site, site area, type of foundation, plinth area, floor area, floor height, number of floors, number of columns, number of rooms, number of bathrooms, number of kitchens, number of lifts/elevators, number of basements, use of building code, type of window, type of door, type of flooring works, external painting, internal finishing, HVAC work, sanitary works, electrical works, landscaping, and road works/river training works. The buildings have 0 to 2 basements and range from 1 to 13 storeys, and their overall structural costs exceed 1 crore.

Aspect | Yes | No
Location of Building | 5 | 0
Type of Building | 5 | 0
Site/Geographic Conditions | 5 | 0
Access to Site | 4 | 1
Site Area | 3 | 2
Plinth Area | 5 | 0
Floor Area | 5 | 0
Floor Height | 4 | 1
Number of Storeys | 4 | 1
Number of Columns | 3 | 2
Number of Rooms | 4 | 1
Number of Bathrooms | 3 | 2
Number of Beams | 3 | 2
Type of Foundation | 5 | 0
Roof Type | 2 | 3
Number of Lifts/Elevator | 4 | 1
Basement | 5 | 0
Building Code Used | 2 | 3
Laboratory Tests | 1 | 4
Consulting Fees | 2 | 3
Insurance of Staff, Material, Equipment | 3 | 2
Waterproofing | 2 | 3
Aluminum and Railing Works | 5 | 0
Wood Works | 5 | 0
Type of Flooring Works | 5 | 0
External/Internal Finishing | 4 | 1
Area of Formwork | 1 | 4
HVAC Work | 5 | 0
Water Supply & Drainage System | 5 | 0
Solid Waste Management | 4 | 1
Water Treatment, Septic Tank, Soak Pit | 3 | 2
Electrical System | 5 | 0
CCTV, AC & Ventilation System, Solar | 4 | 1
Landscaping | 5 | 0
Road Works/River Training Works | 3 | 2
Market Condition | 4 | 1

Table 5.1: Count of Responses from Experts


Table 5.2: Additional Factors from Experts

Expert no. 1:
• Availability of skilled labor
• Specific materials/equipment from foreign countries
• Brand of materials/equipment
• Quality of materials
• Distance of source of materials from the project site
• Construction year
• Level of safety requirements

Expert no. 2:
• Interior design
• Rate of construction materials
• Construction year
• Rainwater harvesting

Expert no. 3:
• Rate of skilled labor
• Specification of works
• Construction material
• Construction year
• Proper discussion in finishing materials and qualities

Expert no. 4:
• Quality of materials
• Final drawings

Among all attributes, there are 17 numeric features including ’Site Area’, ’Plinth Area’, ’Floor
Area’, ’Floor Height’, ’Number of Floors’, ’Number of Beams’, ’Number of Columns’, ’Number
of Rooms’, ’Number of Bathrooms’, ’Number of Lifts/Elevator’, ’Number of Basements’,
’External Painting (Percentage of Total cost)’, ’Internal Finishing (Percentage of Total
cost)’, ’HVAC work (Percentage of Total cost)’, ’Sanitary Works (Percentage of Total cost)’,
’Electrical Works (Percentage of Total cost)’, and ’Total Final Cost of the Project including
VAT’. Similarly, there are 12 categorical features including ’Name of project’, ’Location of
Building’, ’Type of Building’, ’Construction Year’, ’Site/geographic Conditions’, ’Access to
Site’, ’Type of Foundation’, ’Type of Window’, ’Type of Door’, ’Type of Flooring Works’,
’Landscaping’, and ’Road works/River Training Works’.
Scatter plots are a fundamental tool in data exploration and can provide valuable insights
into the relationships between numerical variables in a dataset. Scatter plots can reveal
trends or patterns in the data. For example, if the points form an upward or downward slope,
it indicates a positive or negative linear trend between the variables. Outliers, which are
data points that significantly deviate from the main cluster of points, are easily identified
in scatter plots. This can help in identifying complex patterns and interactions in the data.

Scatter plots can also reveal non-linear relationships between variables. If the points form a curve or some other non-linear shape, it suggests a non-linear relationship.

Figure 5.2: Scatter plot of features vs Total cost

Figure 5.3: Scatter plot of features vs Total cost

The scatter plots in Figures 5.2 and 5.3 show a random distribution of points, which suggests that there is no discernible pattern or relationship between the variables being plotted. A
random distribution typically indicates that there is no correlation or association between the
variables. Each data point appears scattered across the plot without following any specific
trend, slope, or pattern. This pattern-less arrangement implies that changes in one variable do
not have a consistent effect or influence on the other variable. In such cases, many non-linear
models might be needed to explore other potential relationships or factors that could affect
the variables under consideration.

A normal distribution graph describes how data is distributed when many independent,
random factors contribute to an outcome. Deviations from the normal distribution can
indicate outliers or anomalies in data. Detecting outliers is crucial in data analysis.

Figure 5.4: Normal Distribution Graph for numerical features

This distribution appears with a tail stretching towards the right side of the curve. It indicates
that the data has a longer right tail compared to the left side. In such cases, the mean tends
to be larger than the median, and most data points cluster towards the left side. A tail on
the right side of a distribution indicates a right-skewed pattern, and outliers situated near
this tail represent unusually high values that might have a significant impact on statistical
measures and require thorough examination during data analysis.

Figure 5.5: Normal Distribution Graph for numerical features

5.3 Data pre-processing results

The data had missing values, so preprocessing was required: some values were extremely high, while others were null, disordered text, or otherwise noisy. The mean, median, mode, and variance were calculated for each numeric feature, and missing values in numerical features were replaced with the mean, the median, and the mode in turn. Table 5.3 considers the variance as a measure to decide between
using mean or median for imputation, the results after imputation with the mean have shown
a lower variance compared to imputation with the median. Lower variance signifies less
dispersion of data points from the mean value, implying that the dataset tends to be more
tightly clustered around the mean. Comparing the variance values across different attributes
between the raw data, mean-imputed data, and median-imputed data, mean imputation
tends to preserve the original variability of the dataset better than median imputation. Hence,
based on the variance as the criterion, mean imputation seems to be more aligned with
maintaining the original data’s variability compared to median imputation, and all numeric
features were replaced by the mean in this dataset. The decision is also justified by Table 5.3; a sketch of the variance comparison follows.
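A sketch of the comparison, assuming df is the raw DataFrame and num_cols the list of its numeric columns:

    import pandas as pd

    raw_var = df[num_cols].var()
    mean_var = df[num_cols].fillna(df[num_cols].mean()).var()
    median_var = df[num_cols].fillna(df[num_cols].median()).var()

    # Side-by-side view of the variances, as in Table 5.3
    comparison = pd.DataFrame({"raw": raw_var,
                               "mean-imputed": mean_var,
                               "median-imputed": median_var})
    print(comparison)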

Histograms were plotted for categorical features to determine whether values are unique or follow some order. Missing values in categorical features were replaced with the mode, i.e., the most frequent value.

Table 5.3: Comparison of Raw Data, Mean-Imputed Data, and Median-Imputed Data based on variance

Attribute | Raw Data | Mean-Imputed Data | Median-Imputed Data
Site Area (sqm) | 3204085 | 2931397 | 2966409
Plinth Area (sqm) | 1467124 | 1123755 | 1142837
Floor Area (sqm) | 5227659 | 4782752 | 4808721
Floor Height | 0 | 0 | 0
Number of Floors | 3 | 3 | 3
Number of Beams | 15 | 14 | 14
Number of Columns | 797 | 746 | 754
Number of Rooms | 947 | 746 | 762
Number of Bathrooms | 203 | 181 | 184
Number of Lifts/Elevator | 1 | 1 | 1
Number of Basement | 0 | 0 | 0
External Painting (% of Total cost) | 7 | 6 | 6
Internal Finishing (% of Total cost) | 58 | 44 | 45
HVAC work (% of Total cost) | 5 | 4 | 5
Sanitary Works (% of Total cost) | 8 | 7 | 7
Electrical Works (% of Total cost) | 7 | 7 | 7
Total final cost of project including VAT | 56577389761916752 | 56577389761916752 | 56577389761916752

Figure 5.6: Missing values of numerical features

Figure 5.7: Imputing missing values with the mean

Figure 5.8: Missing categorical values

Replacing missing values in categorical features with the mode (most frequent value) is a common approach and often a reasonable strategy, especially when dealing with categorical data. Imputing missing categorical values with the mode preserves the overall distribution of the categories and minimizes the potential impact of missing data on the analysis.

Figure 5.9: Replaced categorical values with mode

The values in the Total final cost of the project including VAT were replaced with their natural logarithms, as shown in Figure 5.10. Taking the logarithm can normalize the distribution and reduce the impact of extreme values, making the data more suitable for analysis; a minimal sketch follows the figure.

Figure 5.10: Replacing final cost with their natural logarithms
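A minimal sketch of the transformation, assuming the DataFrame df and the column name shown (which may differ from the actual dataset):

    import numpy as np

    cost_col = "Total final cost of project including VAT"  # assumed column name
    df[cost_col] = np.log(df[cost_col])  # natural logarithm of the target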

Calculating the correlation matrix for the Data Frame: The correlation matrix shows
how each numerical column in the Data Frame is related to every other numerical column by
calculating Pearson correlation coefficients. The Pearson correlation coefficient ranges from
-1 (perfect negative correlation) to 1 (perfect positive correlation), with 0 indicating no linear
correlation. Positive values indicate a positive correlation, while negative values indicate a negative correlation. The matrix is useful for feature selection and for understanding patterns in the data; a minimal sketch follows.
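A minimal sketch of computing and visualizing the matrix; seaborn is assumed to be available for the heatmap:

    import matplotlib.pyplot as plt
    import seaborn as sns

    # Pearson correlation matrix over the numeric columns (pandas default)
    corr = df.select_dtypes(include="number").corr()

    sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
    plt.show()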

Figure 5.11: Replacing final cost with their natural logarithms

Dropping Plinth Area(sqm) and Number of Bathrooms: These two features are being
removed because they exhibit a high correlation with other variables (’Floor Area(sqm)’ and
’Number of Rooms’ respectively) beyond a predefined threshold of 0.70. Due to their strong
correlation with other variables, it’s assumed that they might not provide additional signifi-
cant information for the analysis or modeling and could potentially lead to multicollinearity
issues.
Dropping Construction Year: This feature is dropped because its correlation with the target variable (’Total final cost of the project including VAT’) is lower than the specified threshold of 0.70, at only 0.091. A correlation below this threshold suggests a weak linear relationship between ’Construction Year’ and the target variable, which might not contribute significantly to explaining the variability in the target. A sketch of both dropping rules follows.
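A sketch of both dropping rules; the column names follow the text, and the exact labels in the dataset may differ:

    # Highly inter-correlated predictors (|r| > 0.70 with another feature):
    df = df.drop(columns=["Plinth Area(sqm)", "Number of Bathrooms"])

    # Weakly target-correlated feature (r = 0.091 with the total cost):
    df = df.drop(columns=["Construction Year"])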

Figure 5.12: One hot encoding of categorical variables

Figure 5.13: Correlation Matrix Heatmap for Final features

Cleaning the columns: The Location of Building column was cleaned. Since the dataset is limited to certain places, each record was categorized as either inside or outside of Kathmandu. Because Location of Building was reduced to an inside-valley indicator and Type of Foundation was split into individual foundation types, these two original columns could be dropped. The categorical features were encoded using one-hot encoding, as shown in Figure 5.12; one-hot encoding transforms categorical variables into a numerical format. All variables that were either highly correlated with each other or weakly correlated with the target variable, the Total final cost of the project including VAT, were dropped. A minimal sketch of this cleaning and encoding follows.
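A minimal sketch of this cleaning and encoding, assuming df is the DataFrame and that inside-valley records mention Kathmandu in their location labels:

    import pandas as pd

    # Binarize location into an inside/outside-valley flag (assumed labels)
    df["Inside Valley"] = df["Location of Building"].str.contains("Kathmandu", case=False)

    # One-hot encode foundation types into individual indicator columns
    df = pd.get_dummies(df, columns=["Type of Foundation"])

    # The original location column is no longer needed
    df = df.drop(columns=["Location of Building"])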

5.4 Results of Models implementation

Table 5.4: Model Evaluation Metrics

Model MSE MAE RMSE R2


Linear Regression 0.238824 0.426620 0.488696 0.666118
Decision Tree 0.088575 0.104625 0.297615 0.876170
Random Forest 0.113138 0.216848 0.336360 0.841830
Neural Network 0.423127 0.466034 0.650482 0.408457
XGBoost 0.353340 0.230873 0.594424 0.506021
SVM 4.100760 1.112391 2.025033 0.473297
ExtraTree 0.088601 0.102909 0.297659 0.876134
Voting 0.105035 0.222807 0.324091 0.853159
Stacking 0.135783 0.225357 0.368487 0.810172

Regarding the comparison of the models, the Decision Tree, Random Forest, ExtraTree,
Voting, and Stacking models exhibit relatively better performance in terms of MSE, MAE,
RMSE, and R2 . Among these, the Decision Tree, ExtraTree, and Voting models demonstrate
particularly strong performance across multiple metrics. The Decision Tree and ExtraTree models are considered the best choices based on these metrics, as they have the lowest errors and the highest R2 values among the models compared.

Figure 5.14: Mean square error

Figure 5.15: Mean absolute error

Figure 5.16: R square

Figure 5.17: Root Mean Square Error

(a) Linear Regressor (b) Decision Tree Regressor

(c) Random Forest Regressor (d) Artificial Neural Network Regressor

(e) XGBoost Regressor (f ) SVM Regressor

(g) Extra Tree Regressor (h) Voting Regressor

(i) Stacking Regressor

Figure 5.18: Plot of Actual vs Predicted for all models

CHAPTER 6
CONCLUSION

The final input features comprise 8 numerical features (Site Area, Floor Height, Number of Floors, Number of Beams, Number of Columns, Number of Rooms, Number of Lifts, Number of Basements) and 6 categorical features (Landscaping (Yes/No), Type of Flooring (marble, granite, punning, terrazzo), Type of Door (UPVC, sal wood), Type of Foundation (mat, isolated, combined, raft, strap, pile), Electrical Works Luxury (Yes/No), and Inside Valley (Yes/No)).
In the evaluation of various regression models, three stand out as the most promising for
predicting the target variable. The Decision Tree model exhibited remarkable performance
with an MSE (Mean Squared Error) of 0.088575, an MAE (Mean Absolute Error) of 0.104625,
an RMSE (Root Mean Squared Error) of 0.297615, and an R2 (Coefficient of Determination)
of 0.876170. Similarly, the ExtraTree model closely followed with an MSE of 0.088601, an
MAE of 0.102909, an RMSE of 0.297659, and an R2 of 0.876134. The Voting model followed with an MSE of 0.105035, an MAE of 0.222807, an RMSE of 0.324091, and an R2 of 0.853159.
The Decision Tree and ExtraTree models are considered the best choices based on these metrics, as they have the lowest errors and the highest R2 values among the models compared.

CHAPTER 7
FUTURE RECOMMENDATIONS

Based on the findings and the analysis conducted in this thesis, there are several recommen-
dations for future research and practical applications:

A. Enhanced Data Collection: Gathering more diverse and extensive data sets could provide a more comprehensive understanding of the relationships between features and building costs and would also help build more accurate models.

B. Feature Engineering: Explore more advanced feature engineering and selection techniques that enhance the predictive power of the models.

C. External Validation: Validate the developed models using external data sets from
different geographic locations or periods to ensure the reliability of models.

D. Domain Expert Involvement: Engage domain experts, including architects, engineers, and construction professionals, to refine the feature selection process and improve model interpretability.

E. Real-Time Cost Prediction: Develop real-time cost prediction tools or software applications based on these models to assist construction companies, contractors, and developers in estimating project costs accurately, which would also broaden the applicability of this research.

F. Long-Term Cost Analysis: Extend the analysis to include the assessment of long-term
cost implications and factors influencing ongoing maintenance and operational expenses
post-construction.

By addressing these recommendations, future research can contribute to the advancement of accurate cost prediction models in the construction domain, thereby assisting stakeholders in making informed decisions and improving overall project management efficiency.

REFERENCES

[1] Dipendra Gautam, Hugo Rodrigues, Krishna Kumar Bhetwal, Pramod Neupane, and
Yashusi Sanada. Common structural and construction deficiencies of nepalese buildings.
Innovative infrastructure solutions, 1:1–18, 2016.

[2] Hong-Gyu Cho, Kyong-Gon Kim, Jang-Young Kim, and Gwang-Hee Kim. A comparison
of construction cost estimation using multiple regression analysis and neural network
in elementary school project. Journal of the Korea Institute of Building Construction,
13(1):66–74, 2013.

[3] K Akalya, LK Rex, and D Kamalnataraj. Minimizing the cost of construction materials
through optimization techniques. IOSR Journal of Engineering, 2018.

[4] Seokheon Yun. Performance analysis of construction cost prediction using neural network
for multioutput regression. Applied Sciences, 12(19):9592, 2022.

[5] Shabniya Veliyampatt. Determination of efficacy of cost estimation models for building projects using artificial neural network. International Research Journal of Engineering and Technology (IRJET), 08(10), 2021.

[6] Abdelrahman Osman Elfaki, Saleh Alatawi, Eyad Abushandi, et al. Using intelligent
techniques in construction project cost estimation: 10-year survey. Advances in Civil
engineering, 2014, 2014.

[7] Viren B Chandanshive and Ajaykumar R Kambekar. Prediction of building construction project cost using support vector machine. Industrial Engineering and Strategic Management, 1(1):31–42, 2021.

[8] Gwang-Hee Kim, Jae-Min Shin, Sangyong Kim, and Yoonseok Shin. Comparison of
school building construction costs estimation methods using regression analysis, neural
network, and support vector machine. 2013.

[9] Batta Mahesh. Machine learning algorithms-a review. International Journal of Science
and Research (IJSR).[Internet], 9(1):381–386, 2020.

[10] Sevgi Zeynep Dogan. Using machine learning techniques for early cost prediction of
structural systems of buildings. Izmir Institute of Technology (Turkey), 2005.

[11] JF Beltman. Predicting construction costs in the program phase of the construction
process: a machine learning approach. B.S. thesis, University of Twente, 2021.

[12] Sunil M Dissanayaka and Mohan M Kumaraswamy. Comparing contributors to time and
cost performance in building projects. Building and Environment, 34(1):31–42, 1998.

[13] Dr. N Seshadri sekhar. A course material on estimation, costing and valuation. 2020.

[14] Gwang-Hee Kim, Sung-Hoon An, and Kyung-In Kang. Comparison of construction
cost estimating models based on regression analysis, neural networks, and case-based
reasoning. Building and environment, 39(10):1235–1242, 2004.

[15] ALIREZA Shojaei and AMIRSAMAN Mahdavian. Revisiting systems and applications
of artificial neural networks in construction engineering and managements. Proceedings
of the International Structural Engineering and Construction, Chicago, IL, USA, pages
20–25, 2019.

[16] Mohamed Badawy. A hybrid approach for a cost estimate of residential buildings in
egypt at the early stage. Asian Journal of Civil Engineering, 21(5):763–774, 2020.

[17] Uyeol Park, Yunho Kang, Haneul Lee, and Seokheon Yun. A stacking heterogeneous
ensemble learning method for the prediction of building construction project costs.
Applied Sciences, 12(19):9729, 2022.

[18] Rifat Sonmez. Conceptual cost estimation of building projects with regression analysis
and neural networks. Canadian Journal of Civil Engineering, 31(4):677–683, 2004.

[19] TQD Pham, T Le-Hong, and XV Tran. Efficient estimation and optimization of building
costs using machine learning. International Journal of Construction Management,
23(5):909–921, 2023.

[20] Joost N Kok, Egbert J Boers, Walter A Kosters, Peter Van der Putten, and Mannes Poel.
Artificial intelligence: definition, trends, techniques, and cases. Artificial intelligence,
1:270–299, 2009.

[21] Erik Matel, Faridaddin Vahdatikhaki, Siavash Hosseinyalamdary, Thijs Evers, and Hans
Voordijk. An artificial neural network approach for cost estimation of engineering services.
International journal of construction management, 22(7):1274–1287, 2022.

[22] Hongyu Xu, Ruidong Chang, Min Pan, Huan Li, Shicheng Liu, Ronald J Webber, Jian
Zuo, and Na Dong. Application of artificial neural networks in construction management:
A scientometric review. Buildings, 12(7):952, 2022.

[23] Ajith Abraham. Artificial neural networks. Handbook of measuring system design, 2005.

[24] Omer Tatari and Murat Kucukvar. Cost premium prediction of certified green buildings:
A neural network approach. Building and Environment, 46(5):1081–1086, 2011.

[25] Alan O Sykes. An introduction to regression analysis. 1993.

[26] David J Lowe, Margaret W Emsley, and Anthony Harding. Predicting construction
cost using multiple regression techniques. Journal of construction engineering and
management, 132(7):750–758, 2006.

[27] S Patel. Svm (support vector machine)—theory-machine learning 101-medium. 2017.

[28] Bing Dong, Cheng Cao, and Siew Eang Lee. Applying support vector machines to predict
building energy consumption in tropical region. Energy and Buildings, 37(5):545–553,
2005.

[29] Wei-Yin Loh. Classification and regression trees. Wiley interdisciplinary reviews: data
mining and knowledge discovery, 1(1):14–23, 2011.

[30] Zhun Yu, Fariborz Haghighat, Benjamin CM Fung, and Hiroshi Yoshino. A decision tree
method for building energy demand modeling. Energy and Buildings, 42(10):1637–1646,
2010.

[31] Amal Asselman, Mohamed Khaldi, and Souhaib Aammou. Enhancing the prediction
of student performance based on the machine learning xgboost algorithm. Interactive
Learning Environments, pages 1–20, 2021.

[32] Mark R Segal. Machine learning benchmarks and random forest regression. 2004.

[33] Argaw Gurmu and Mani Pourdadash Miri. Machine learning regression for estimating
the cost range of building projects. Construction Innovation, 2023.

[34] Aakanksha Sharaff and Harshil Gupta. Extra-tree classifier with metaheuristics approach
for email classification. In Advances in Computer Communication and Computational
Sciences: Proceedings of IC4S 2018, pages 189–197. Springer, 2019.

[35] Sokratis Papadopoulos, Elie Azar, Wei-Lee Woon, and Constantine E Kontokosta.
Evaluation of tree-based ensemble learning algorithms for building energy performance
estimation. Journal of Building Performance Simulation, 11(3):322–332, 2018.

[36] Olga Kurasova, Virginijus Marcinkevičius, Viktor Medvedev, and Birutė Mockevičienė.
Early cost estimation in customized furniture manufacturing using machine learning.
International journal of machine learning and computing. Singapore: IJMLC, 2021, vol.
11, no. 1., 2021.

[37] Pyae-Pyae Phyo, Yung-Cheol Byun, and Namje Park. Short-term energy forecasting
using machine-learning-based ensemble voting regression. Symmetry, 14(1):160, 2022.

[38] Marzieh Khosravi, Sadman Bin Arif, Ali Ghaseminejad, Hamed Tohidi, and Hanieh
Shabanian. Performance evaluation of machine learning regressors for estimating real
estate house prices. 2022.

[39] Bohdan Pavlyshenko. Using stacking approaches for machine learning models. In 2018
IEEE Second International Conference on Data Stream Mining & Processing (DSMP),
pages 255–258. IEEE, 2018.

[40] CN Atapattu, ND Domingo, and M Sutrisna. Statistical cost modelling for preliminary
stage cost estimation of infrastructure projects. In IOP Conference Series: Earth and
Environmental Science, volume 1101, page 052031. IOP Publishing, 2022.

[41] Yasamin Ghadbhan Abed, Taha Mohammed Hasan, and Raquim Nihad Zehawi. Machine
learning algorithms for constructions cost prediction: A systematic review. International
Journal of Nonlinear Analysis and Applications, 13(2):2205–2218, 2022.

APPENDIX A

Expert opinion
Pilot Testing

Figure 7.1: Expert Opinion

Figure 7.2: Expert Opinion

Figure 7.3: Expert Opinion

Figure 7.4: Expert Opinion

Figure 7.5: Expert Opinion

Figure 7.6: Pilot Survey

Figure 7.7: Pilot Survey

Figure 7.8: Pilot Survey

Figure 7.9: Letter for Data Collection

Figure 7.10: Similarity Index

Figure 7.11: Paper Acceptance Mail

