12 Machine Learning Model To Predict Construction Duration

This document discusses the development of a machine learning model to predict the construction duration of tall building projects. The researchers developed twelve machine learning models using algorithms like multi-linear regression, k-nearest neighbors, artificial neural networks, support vector machines, and ensemble methods. The best performing model was an ensemble method that combined the outputs of the machine learning algorithms using an artificial neural network, achieving a correlation coefficient of 0.69, root mean squared error of 301.72, and mean absolute percentage error of 18%. The study aims to enhance time performance in tall building projects through the adoption of machine learning technologies.

Journal of Construction Engineering, Management & Innovation

2021 Volume 4 Issue 1 Pages 022-036


https://doi.org/10.31462/jcemi.2021.01022036
www.goldenlightpublish.com

RESEARCH ARTICLE

Developing a machine learning model to predict the construction duration of tall building projects

Muizz O. Sanni-Anibire *1, Rosli Mohamad Zin 2, Sunday Olusanya Olatunji 3

1 King Fahd University of Petroleum and Minerals, Dammam Community College, Dhahran, Kingdom of Saudi Arabia
2 Universiti Teknologi Malaysia (UTM), School of Civil Engineering, Faculty of Engineering, Johor, Malaysia
3 Imam Abdulrahman Bin Faisal University, College of Computer Science and Information Technology, Department of Computer Science, Dammam, Kingdom of Saudi Arabia

Abstract
The construction industry is witnessing a rapid rise in tall building projects due to an anticipated urban
population explosion. However, this building typology has been subject to time overruns and total
abandonment due to an underestimation of the project duration. Consequently, this paper presents the
development of a model to predict the construction duration of tall building projects. In developing the
model, a suite of machine learning algorithms was adopted including Multi-Linear Regression Analysis
(MLRA), k-Nearest Neighbors (KNN), Artificial Neural Networks (ANN), Support Vector Machines
(SVM), and Ensemble Methods. Thus, twelve models were developed in the process, and the most efficient
model was selected. The procedure described in this study presents researchers and practitioners with a
strategy to enhance the time performance of tall building projects through the adoption of modern digital
technologies such as machine learning. The proposed model was based on an ensemble method using ANN
as the combiner, with a Correlation Coefficient (R2) of 0.69, Root Mean Squared Error (RMSE) of 301.72,
and Mean Absolute Percentage Error (MAPE) of 18%.
Keywords
Duration prediction; Regression; k nearest neighbour; Neural networks; Support vector machines; Ensemble
methods
Received: 12 January 2021; Accepted: 22 March 2021
ISSN: 2630-5771 (online) © 2021 Golden Light Publishing All rights reserved.

* Corresponding author. Email: [email protected]

1. Introduction

The 21st century is witnessing a rising complexity in buildings, embodied in the rapid growth of tall buildings in urban centers globally. These projects are, however, characterized by uncertainties that affect the success of the project, usually expressed in cost, time, and quality [1,2]. Experts are of the opinion that large variances between the estimated and actual duration of construction projects due to underestimation are one of the prevalent problems in the industry [3]. Bromilow [4] suggested that only one-eighth of building contracts were completed within the scheduled completion dates and that the average time overrun exceeded 40%. Likewise, Alzara et al. [5] reported delays in the range of 50% to 150%.

Particularly, tall building projects are notorious for their delayed completion times. Interestingly, the Council on Tall Buildings and Urban Habitat (CTBUH) [6], in its report “Dream Deferred: Unfinished Tall Buildings”, noted the alarming rate of increase of “never completed” tall buildings. Previous researchers have suggested that a reliable prediction of the duration of construction projects is crucial to avoiding construction delays [7,8,3,9]. Traditional methods such as the Critical Path Method (CPM) or the Program Evaluation and Review Technique (PERT) have been shown to consistently underestimate the actual project duration [10]. Typical considerations may include the client’s time constraints, budget, or conducting a detailed analysis subject to the skill, experience, and individual intuition of the project engineer. Therefore, there is a high level of subjectivity in the process, which ultimately yields high levels of uncertainty [11].

In this regard, some research works have sought to apply Artificial Intelligence (AI) and Machine Learning (ML) to the duration prediction of construction projects [11-22]. These studies are, however, limited in the techniques used, as they have focused on one or two algorithms, without exploring ensemble methods to achieve improved performance. Moreover, despite the rapid growth of tall building construction, and the recurring time overruns of such projects, there is a dearth of research on the subject of its duration estimation.

In light of the foregoing, research to develop a model for the estimation of the duration of tall building projects based on ML has been conceptualized. Historical data on the construction duration of tall building projects has been obtained. The dataset was further used to develop duration prediction models based on popular machine learning algorithms such as Multi-Linear Regression Analysis (MLRA), k-Nearest Neighbors (KNN), Support Vector Machines (SVM), Artificial Neural Networks (ANN), and Ensemble Methods. The performance of these models was evaluated based on the Correlation Coefficient (R2), Root Mean Squared Error (RMSE), and Mean Absolute Percentage Error (MAPE). The outcome of the systematic model development process described in this study is the proposed ML model for the duration prediction of tall building projects. The model can be described as an ensemble method that combines the outputs of the ML algorithms considered in this study using ANN as the combiner.

2. Literature review

The following sections present an overview of relevant background on construction duration estimation. Firstly, traditional approaches, as well as modern trends in construction duration estimation, are reviewed. Subsequently, previous studies related to the development of mathematical models as well as the application of artificial intelligence and machine learning techniques are presented.

2.1. Approaches to construction duration estimation

The duration of an activity is simply the length of time or period it takes to complete that activity. This is typically measured in hours, days, weeks, months, or years. Determining task durations utilizing detailed analysis is dependent on the required human and material resources, as well as the productivity rates of these resources. Traditionally, there are two modeling techniques used in construction project scheduling: the Critical Path Method (CPM) and the Program Evaluation and Review Technique (PERT). The CPM schedule assumes the duration of work items is known with some level of certainty. On the other hand, PERT considers the uncertainty in determining the duration of work items. Hence, PERT is based on a "three-time estimate", i.e. the optimistic estimate, the most likely estimate, and the pessimistic estimate. The average of the "three-time estimate" is adopted as the duration [23]. Regardless of the methods applied, the calculated values remain approximate and are characterized by high levels of uncertainty. The estimator's background and experience are highly correlated to the accuracy of the estimation. Lack of adequate experience and thorough understanding of the projects' scope of work will lead to poor estimations. Additionally, there exists the problem of material and labor price variations/fluctuations and inflation which are characteristic of construction projects. To solve the problem of uncertainty, which may be due to insufficient information, variations, and human error, researchers have sought to employ more intelligent methods. Though research interests in duration estimation can be traced back to the 1960s, the past few decades have witnessed a resurgence [24,25]. The investigated approaches can be summarily classified into three groups: Artificial Intelligence (AI)-based scheduling, which includes Knowledge-Based Scheduling, Expert Systems and Case-Based Reasoning (CBR), Genetic Algorithms and Neural Networks; simulation-based scheduling; and integrated BIM-based scheduling [24].

2.2. Previous studies on the development of models for duration prediction

Bromilow [4] is credited with developing the first empirical model that establishes the relationship between cost and time. Bromilow’s Time-Cost (BTC) model is based on historical data from 309 building projects completed in Australia between July 1964 and July 1967. The BTC model has been the subject of many other studies to re-calibrate and test the performance of the model in other locations and various project types. Further developments to Bromilow's model were made by Chan and Kumaraswamy [26] to combine the cost and floor area in a similar model. Other researchers studied the model further and included other variables in the equation. This could be seen as the foundation for later studies in developing mathematical models with the aid of the multi-linear regression method. Interestingly, the late 80s witnessed the acceptance of more intelligent methods such as AI for solving the inherent construction problem of estimating durations. Mohan [27] outlined 37 expert-system applications in the field of construction engineering and management. Five decades after Bromilow's initial model, a lot of technological advancements can be witnessed in construction project management, planning, and scheduling. However, construction projects continue to suffer from performance and productivity issues [28]. The following sections provide a non-exhaustive review of literature related to the application of artificial intelligence and machine learning techniques to duration estimation.

2.2.1. Knowledge-based expert system

Knowledge-based expert systems are computer programs originally developed in the field of Artificial Intelligence (AI) and designed to reach the level of performance of a human expert in some specialized problem-solving domain. Hendrickson et al. [29] presented a framework for modifying standard work productivities for activity duration estimation. The study proposed an expert system “MASON”. Moselhi and Nicholas [30] also presented ESCHEDULER, a prototype system for precedence setting and modifying durations. Also, Shaked and Warszawski [31] developed HISCHED, which is a knowledge-based expert system for the construction planning of buildings.

2.2.2. Linear regression analysis

Hoffman et al. [32] identified the factors influencing construction duration through an assessment of 856 facility projects. The study compared the results of a multiple linear regression model with the BTC model and concluded that multiple linear regression provided a more accurate prediction. Yeom et al. [21] presented a multiple linear regression model that facilitates accurate prediction (94.72%) of construction durations of general office buildings in Korea. Blyth et al. [15] presented a multiple linear regression analysis which showed that the twenty-one most influential project variables could accurately predict construction duration for buildings in the UK. The developed model was further validated with a new set of data which showed that the absolute percentage error for the overall duration varied between 0.38% and 6.68%. Lin et al. [33] developed a regression model for predicting the construction duration of steel-reinforced concrete building projects in Taiwan. Khosrowshahi and Kaka [12] proposed two models for cost and duration with an adjusted coefficient of determination of 81.4% and 92.7% respectively.

Chan and Kumaraswamy [13] developed models to estimate the duration of various work packages based on data obtained from 15 case studies of residential buildings in Hong Kong and showed that the percentage error was about ±10% for overall construction durations. Abu Hammad et al. [8] utilized data from 140 projects in Jordan to develop regression models and concluded that there is a 95% probability that the proposed models could accurately predict project cost and duration with a precision of ±0.035% of the mean cost and time.

2.2.3. Neural networks (NNs)

Mensah et al. [18] utilized the historical data of 30 completed bridge projects in Ghana to develop a model for the prediction of construction duration. The study compared the stepwise regression method and artificial neural network (ANN), with the regression model having a MAPE of 25%, and the ANN model a MAPE of 26%. Attal [16] also compared the performance of regression analysis and ANN for predicting the duration of highway projects, with ANN having higher accuracy and reliability. Peško et al. [20] carried out a study which combined two popular artificial intelligence techniques, i.e. ANN and SVM, for the estimation of costs and duration in construction projects. Both techniques displayed approximately equal performance, with a MAPE of 22.77% for SVM and 26.26% for ANN.

2.2.4. Case-based reasoning (CBR)

Jin et al. [19] developed a CBR model for estimating construction duration based on 83 multi-housing projects. The results, based on a MAPE of 5.74%-9.88%, suggested the reliability of the model. Li et al. [11] established a revised CBR model to estimate the duration of skyscrapers in China. The results showed an accuracy of 69%. Koo et al. [17] utilized 101 completed multi-family housing projects to develop a CBR hybrid model with which to predict the construction duration and cost of a project in its early stage. The hybrid model features case-based reasoning, multiple regression analysis, artificial neural networks, genetic algorithm, and Monte-Carlo simulation.

2.2.5. Discussion on the previous studies

The extant literature reveals that the construction industry has evolved from its early adoption of Bromilow’s Time-Cost model and its variants to more robust methods. It is observed, however, that there is a dearth of literature on the application of other machine learning algorithms. While linear regression and neural networks have dominated the discourse, little attention has been paid to the performance of algorithms such as k Nearest Neighbours and Support Vector Machines, as well as ensemble techniques. This may be partly attributed to barriers such as the huge quantity and confidentiality of data required. Data needed for machine learning applications will need to be systematically documented by potential stakeholders. Another observation is that tall buildings have not received noteworthy attention in terms of machine learning applications to solve the problem of time overruns, despite the significance of such projects in the urban habitat of the 21st century. Though the study by Li et al. [11] focused on the application of Case-Based Reasoning and k Nearest Neighbours to skyscrapers in China, it is also limited in its approach, while further investigation has the potential to provide better results. Therefore, the current study seeks to investigate the performance of selected machine learning algorithms in developing duration prediction models, as well as investigate the performance of ensemble models to seek an improved performance of the final prediction model.

3. Methodology

3.1. Dataset establishment

The primary source of the dataset used in this study is the Mega Project Case Study Center of China at http://www.mpcsc.org/case_search.htm. The dataset was further corroborated with information from CTBUH's skyscraper center available at http://www.skyscrapercenter.com/country/china. A sample size of 35 projects was identified with construction completion dates between 1993 and 2015. Remarkably, all the projects contained in the dataset are from China, which according to CTBUH [34] accounts for 61.5% of 200-meter-tall buildings in the world in 2018, and has maintained its role as the most prolific country in tall building construction for over two decades.

3.2. Data pre-processing

Since the dataset has been obtained from the real world, it may exhibit characteristics not ideal for ML modeling and thus require pre-processing and re-shaping. In this study, the Waikato Environment for Knowledge Analysis (Weka 3.8.3) has been used. This is an open-source machine learning software written in Java and developed at the University of Waikato, New Zealand [35]. Table 1 provides descriptive statistics of the numerical features of the dataset, while Table 2 describes the non-numerical features of the dataset and their conversion to dummy variables.

3.3. Views of the dataset

These are copies of the dataset in addition to the original dataset. They are created based on some system such as normalization and standardization. Evaluating the performance of algorithms on various views of the dataset will provide a general idea of the views that are better for the machine-learning problem [36]. Generally, the best algorithm to be used to solve an ML problem is usually not known beforehand. Experts have suggested that common ML algorithms should firstly be explored, especially those common in the field of the ML problem at hand [36,37]. In this study, four ML algorithms have been considered: Multi-Linear Regression Analysis (MLRA), Artificial Neural Networks (ANN), K-Nearest Neighbors (KNN), and Support Vector Machines (SVM). The dataset was split into a train-test ratio of 66% to 34%. Additionally, five views were considered as follows:

• Raw dataset: original dataset as described in Table 1.
• Normalized view (input features only): rescaling values in the input dataset to a range of 0 to 1, such that the largest value for each feature is 1 and the lowest is 0. Normalization is a good technique to use when the distribution of the data is either unknown, or is Gaussian (i.e. bell curve). The formula for normalization is expressed in Eq. 1:

$$x_{new} = \frac{x - x_{min}}{x_{max} - x_{min}} \qquad (1)$$

Table 1. Descriptive statistics of the dataset

| Feature | Mean | Standard deviation | Maximum | Minimum | Missing values |
|---|---|---|---|---|---|
| GDP (bill USD) | 302.91 | 108.22 | 446.31 | 80.77 | 0 |
| # of elevators | 50.66 | 31.31 | 130 | 6 | 6 |
| Building area (m2) | 289364.06 | 152174.12 | 602401 | 91600 | 1 |
| Floor area (m2) | 30569.68 | 45035.52 | 197000 | 4126 | 6 |
| Height to tip (m) | 386.18 | 113.64 | 636 | 237.5 | 0 |
| # of floors above GF | 76.97 | 23.22 | 128 | 37 | 0 |
| Height of occupied floors (m) | 339.30 | 115.88 | 610 | 213.9 | 1 |
| # of total floors | 80.91 | 23.49 | 133 | 39 | 0 |
| # of basement floors | 6.31 | 5.204 | 30 | 2 | 6 |
| # of parking spaces | 1058.32 | 619.86 | 2702 | 128 | 7 |
| Cost (bill Yuan) | 8.34 | 8.53 | 30 | 0.38 | 5 |
| Duration (days) | 1783 | 682.16 | 4555 | 730 | 0 |
Table 2. Description of the non-numeric features of the dataset

| Features | Description | Conversion to dummy variables | Missing values |
|---|---|---|---|
| Facility type | O/Office; BOH/Business, office, hotel; ROH/Residential, office, hotel; BO/Business, office; BOR/Business, office, residential | O = 1; BOH = 2; ROH = 3; BO = 4; BOR = 5 | 0 |
| Structural form | T-T/Tube in Tube; D/Diagrid; C-T/Core-Tube; T/Tubular | T-T = 1; D = 2; C-T = 3; T = 4 | 9 |
| Structural material | RC/Reinforced concrete; RCS/Reinforced concrete and steel; S/Steel; C/Composite | RC = 1; RCS = 2; S = 3; C = 4 | 0 |
| Commencement period | Summer, autumn, winter and spring | Summer = 1; Autumn = 2; Winter = 3; Spring = 4 | 0 |

• Normalized view (entire dataset): input and output values are converted to a range of 0 to 1. In this case, a further step was required to de-normalize the output data.
• Standardized view (input features): input values are rescaled such that the mean is set at 0 and the standard deviation is 1. This technique is more useful if the dataset has a Gaussian (bell curve) distribution [36]. The process is executed according to Eq. 2 (where µ represents the arithmetic mean and σ the standard deviation):

$$x_{new} = \frac{x - \mu}{\sigma} \qquad (2)$$

• Replace missing values: datasets for machine learning usually contain missing values that need to be treated by removing or replacing the missing values [36]. The ReplaceMissingValues filter in Weka was used to create this view, where the missing values are set equal to the mean of the distribution for numerical features, and the mode for categorical features.

3.4. Feature selection

The best view of the dataset determined from the previous section was selected for further processing. The “CorrelationAttributeEval” technique was used to determine the most relevant attributes contributing to the predictive performance. The correlation of the various features in the dataset to the prediction output is firstly determined and subsequently ranked. Furthermore, attribute selection was based on the Recursive Feature Elimination (RFE) procedure [38]. In RFE, the entire feature set (V), ranked according to the correlation coefficient, is split in half to derive the best V/2 features, and the worst V/2 features are eliminated. The splitting process continues recursively until only one best feature is left. Thereafter, the feature subset that achieved the best accuracy or the best performance measure is finally chosen as the best subset to be used.

3.5. Hyperparameter optimization

The performance of ML algorithms is dependent on the tuning of optimal hyperparameters. It involves searching for the hyperparameters that result in the best performance of an algorithm given a set of data. The ML algorithms used in this study are described in the Weka environment as follows: MLRA: “LinearRegression”, k-NN: “IBk”, ANN: “Multilayer Perceptron”, and SVM: “SMOReg”. In determining the hyperparameters that yield optimal model performance, Weka was used to execute a modified systematic search, i.e. a range of randomly spaced values is searched first, and then the range that performs best is zoomed in on for further investigation. The hyperparameters optimized for KNN are the k value (search range 1 to 30), as well as the search and distance function, while ANN depends on the learning rate (search range 0.1 to 0.3), hidden layers (search range 1 to 4) and number of nodes (search range 1 to 4). SVM optimization depends on the regularization factor C (search range 1 to 1000), the type of kernel function, as well as the epsilon parameter (search range 0.1 to 0.00001).

3.6. Performance measurement

In measuring the performance of the algorithms employed, the Correlation Coefficient (R2), Root Mean Squared Error (RMSE), and Mean Absolute Percentage Error (MAPE) have been used. The mathematical expressions for these measures are presented in Eqs. 3-5 as follows:

$$R^2 = \frac{\sum (y_a - y'_a)(y_p - y'_p)}{\sqrt{\sum (y_a - y'_a)^2 \sum (y_p - y'_p)^2}} \qquad (3)$$

where $y_a$ and $y_p$ are the actual and predicted values while $y'_a$ and $y'_p$ are the means of the actual and predicted values.

$$RMSE = \sqrt{\frac{(y_a - y_p)^2 + (y_a - y_p)^2 + \dots + (y_a - y_p)^2}{n}} \qquad (4)$$

where $(y_a - y_p)$ is the difference between the actual and predicted values, with one squared term per observation, and $n$ is the size of the dataset used.

$$MAPE = \frac{1}{n} \sum_{t=1}^{n} \left| \frac{y_a - y_p}{y_p} \right| \qquad (5)$$

where $y_a$ is the actual value, $y_p$ is the predicted value, and $n$ is the size of the dataset used.

3.7. Combining algorithms

To improve the performance of the techniques used, an ensemble method was used. This is an approach that combines the prediction outcomes of a set of algorithms with the same or different sets of features. This can be achieved through averaging (fixed rules) and stacking (trained rules) [39,40]. Averaging is a simple aggregation of the predictions of other models based on a fixed rule such as the mean, maximum and minimum values. Stacking is an extension of averaging which allows another algorithm to learn how best to combine the predictions of other models [36]. The systematic procedure followed in this study is further summarized and illustrated in Fig. 1.

4. Results and findings

4.1. Comparison of various views of the dataset

In this study, five views of the dataset were prepared for comparison. The datasets have been evaluated with the four selected ML algorithms (MLRA, ANN, KNN & SVM), and the results showed that all algorithms performed best when the entire view of the dataset was normalized, i.e. including the output feature (Table 3). While the RMSE values of the remaining views differed only slightly from one another, the normalized view of the entire dataset displayed exemplary performance. A decrease in RMSE of at least 51% for MLRA, 46% for ANN, 43% for KNN, and 55% for SVM was observed when comparing the normalized view of the entire dataset with any of the other views; these minimum figures correspond to the best-performing alternative view in Table 3, so the decrease relative to the worst-performing view is larger.

4.2. Performance of machine learning algorithms

To develop the initial models for which the performance of ML algorithms will be evaluated, the best combination of features that yields optimum performance was determined. This was achieved using the “CorrelationAttributeEval” and RFE techniques discussed previously in the “feature selection” section. Thus, in addition to a dataset containing all features, four more feature sets were developed as described in Table 4. It can be observed from the table that the most important feature influencing the duration of tall building projects is the number of total floors, followed by the number of floors above the ground floor. As shown in Table 5, the performance of the four algorithms varied with different sets of features. MLRA exhibited the best performance with the best two features and best feature respectively i.e. “# of total floors” and “# of floors above ground floors”. It is also observed that both feature sets (best V/8 and best feature) exhibit similar performance for all algorithms except KNN. This suggests that both features are collinear, and one of them could satisfactorily replace the other.

Fig. 1. Methodology for developing the proposed ML duration prediction model

Table 3. Performance (RMSE) of ML algorithms for various views of the dataset

| ML Algorithm | Raw data set | Replace missing values | Normalized view (input features only) | Normalized view (entire dataset) | Standardized view (input features) |
|---|---|---|---|---|---|
| MLRA (LinearRegression) | 1365.52 | 1332.52 | 1724.82 | 652.47 | 1365.52 |
| ANN (Multilayer Perceptron) | 1652.56 | 1495.74 | 1652.57 | 800.47 | 1652.56 |
| KNN (IBk) | 1067.58 | 1507.96 | 1067.58 | 611.67 | 1067.58 |
| SVM (SMOReg) | 1231.49 | 1195.16 | 1229.36 | 540.39 | 1228.87 |
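The views compared in Table 3 rest on a few simple transformations (Eqs. 1 and 2, mean imputation, and the de-normalization step needed when the output is also normalized). The following is a minimal NumPy sketch of those transformations, not the Weka filters used in the study; the function names are illustrative, and the use of the population standard deviation in `standardize` is an assumption that may differ from Weka's implementation.

```python
import numpy as np

def min_max_normalize(x):
    """Eq. 1: rescale a 1-D array to the range 0 to 1 (NaN entries are ignored)."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = np.nanmin(x), np.nanmax(x)
    return (x - x_min) / (x_max - x_min)

def de_normalize(x_new, x_min, x_max):
    """Invert Eq. 1 so that normalized predictions are expressed in days again."""
    return np.asarray(x_new, dtype=float) * (x_max - x_min) + x_min

def standardize(x):
    """Eq. 2: rescale to zero mean and unit standard deviation (population std assumed)."""
    x = np.asarray(x, dtype=float)
    return (x - np.nanmean(x)) / np.nanstd(x)

def replace_missing_with_mean(x):
    """Fill NaN entries of a numerical feature with the mean of the observed values."""
    x = np.asarray(x, dtype=float)
    return np.where(np.isnan(x), np.nanmean(x), x)

# Hypothetical durations in days; only the minimum (730) and maximum (4555)
# are taken from Table 1, the middle value is made up for illustration.
durations = np.array([730.0, 2000.0, 4555.0])
normalized = min_max_normalize(durations)
restored = de_normalize(normalized, durations.min(), durations.max())
```

When the output feature is normalized, the stored minimum and maximum of the training durations are what make the de-normalization step possible, which is why `de_normalize` takes them as explicit arguments.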

Table 4. Description of selected feature subsets based on CorrelationAttributeEval

| RFE process | No. of features | Description |
|---|---|---|
| Best V/2 features | 8 | # of total floors; # of floors above ground floors; # of parking spaces; cost; building area; height to tip; floor area; # of elevators |
| Best V/4 features | 4 | # of total floors; # of floors above ground floors; # of parking spaces; cost |
| Best V/8 features | 2 | # of total floors; # of floors above ground floors |
| Best feature | 1 | # of total floors |
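The subsets in Table 4 follow from ranking the features by correlation with the duration and then repeatedly keeping the better half. The sketch below illustrates that idea in plain NumPy; it is not Weka's CorrelationAttributeEval, and the function names are hypothetical. Rounding the half up is an assumption, chosen because the 15 input features of Tables 1 and 2 would then give subsets of 8, 4, 2 and 1 features, as reported in Table 4.

```python
import numpy as np

def rank_by_correlation(X, y, names):
    """Rank features by absolute Pearson correlation with the target, best first."""
    scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    order = np.argsort(scores)[::-1]
    return [names[j] for j in order]

def halving_subsets(ranked):
    """Keep the better half (rounded up) of the ranked features until one is left."""
    subsets = []
    current = list(ranked)
    while len(current) > 1:
        current = current[: (len(current) + 1) // 2]
        subsets.append(current)
    return subsets

# Hypothetical placeholder names standing in for the 15 ranked input features.
ranked_features = [f"feature_{i}" for i in range(1, 16)]
subset_sizes = [len(s) for s in halving_subsets(ranked_features)]
```

Each subset would then be evaluated with every algorithm, and the subset with the best performance measure retained, which is the step that produces Table 5.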

Table 5. Performance (RMSE) of ML algorithms for various feature subsets

| ML Algorithm | All features | Best V/2 | Best V/4 | Best V/8 | Best feature |
|---|---|---|---|---|---|
| MLRA (LinearRegression) | 652.47 | 509.67 | 1154.69 | 538.32 | 538.32 |
| ANN (Multilayer Perceptron) | 800.47 | 371.78 | 208.61 | 248.79 | 248.62 |
| KNN (IBk) | 611.67 | 225.89 | 957.16 | 392.76 | 369.39 |
| SVM (SMOReg) | 540.39 | 293.14 | 261.19 | 299.69 | 299.69 |
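The relative error reductions implied by Table 5 can be checked with a few lines of arithmetic. The sketch below uses only RMSE values copied from Table 5 (all features versus the lowest RMSE in each row); the helper name `percent_decrease` is illustrative.

```python
def percent_decrease(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before

# RMSE with all features, and the lowest RMSE per algorithm, both from Table 5.
rmse_all_features = {"MLRA": 652.47, "ANN": 800.47, "KNN": 611.67, "SVM": 540.39}
rmse_best_subset = {"MLRA": 509.67, "ANN": 208.61, "KNN": 225.89, "SVM": 261.19}

reductions = {
    name: round(percent_decrease(rmse_all_features[name], rmse_best_subset[name]))
    for name in rmse_all_features
}
```

For ANN this gives a 74% reduction (800.47 down to 208.61).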

ANN performed best with the best V/4 features (described in Table 4), with an RMSE of 208.61, a 74% decrease in error compared to the case where all features were used (RMSE of 800.47), as shown in Table 5. KNN performed best with the best V/2 features (RMSE of 225.89), while SVM performed best with the best V/4 features (RMSE of 261.19). The results presented in Table 5 formed the basis for the development of the initial models for further investigation through hyperparameter tuning and optimization.

Based on the performance of various combinations of ML algorithms and feature sets, five initial models have been selected. The highlighted figures in Table 5 indicate the preferred configuration for the models. To further optimize the performance of ML algorithms, tuning their hyperparameters becomes necessary. Table 6 presents the developed models showing the combination of ML algorithms, feature sets, and optimal hyperparameters. The performance of the various models is presented in Table 7. To determine the models with the best performance, the models with the lowest RMSE and MAPE values, as well as the highest R2 values, are considered first. It can be observed that MOD1, MOD2, and MOD4 performed best when the RMSE, MAPE, and R2 results are compared. Though MOD3 had a better correlation coefficient (R2) compared to MOD4, a cross plot of the actual versus predicted values presented in Fig. 2 shows that the predicted values for MOD3 were simply a general average and thus did not reflect a realistic prediction, which is evident in the high inaccuracies from the RMSE and MAPE values.

4.3. Performance of ensemble methods

To seek further improvement in the predictive performance, an ensemble method was adopted. The three best performing models (MOD1, MOD2 & MOD4) were combined using fixed and trained rules, also referred to as averaging and stacking respectively. Thus, seven more models were created for further investigation as shown in Table 8.

4.3.1. Averaging (fixed rules)

To combine the selected best-performing models through a fixed rule system, the mean, maximum and minimum values of the models’ predicted outputs were considered. This formed MOD6, MOD7 & MOD8. The performance results are presented in Table 9. It can be observed that, among the fixed-rule models, MOD8 performed best with the least RMSE and MAPE values, as well as the highest R2 value.
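The fixed-rule combination described above (mean, maximum, and minimum of the base models' predictions) can be sketched as follows. The prediction values are hypothetical stand-ins, not outputs of MOD1, MOD2, or MOD4, and `combine_fixed` is an illustrative name rather than a Weka function.

```python
import numpy as np

def combine_fixed(predictions, rule):
    """Combine per-model predictions (rows = models, columns = projects) with a fixed rule."""
    rules = {"mean": np.mean, "maximum": np.max, "minimum": np.min}
    return rules[rule](np.asarray(predictions, dtype=float), axis=0)

# Hypothetical predicted durations (days) from three base models for four projects.
base_predictions = [
    [1500.0, 2100.0, 900.0, 3000.0],   # stand-in for MOD1
    [1400.0, 2300.0, 1000.0, 2800.0],  # stand-in for MOD2
    [1600.0, 2000.0, 950.0, 3100.0],   # stand-in for MOD4
]
mean_rule = combine_fixed(base_predictions, "mean")        # analogous to MOD6
maximum_rule = combine_fixed(base_predictions, "maximum")  # analogous to MOD7
minimum_rule = combine_fixed(base_predictions, "minimum")  # analogous to MOD8
```

No training is involved in any of the three rules, which is what distinguishes them from the stacked (trained-rule) combiners.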

Table 6. ML models (combinations of algorithms, optimized hyperparameters and selected feature sets)

| Model | ML Algorithm | Selected features | Optimization hyperparameters |
|---|---|---|---|
| MOD1 | ANN (Multilayer Perceptron) | # of total floors; # of floors above ground floors; # of parking spaces; cost | 0.3 learning rate; one hidden layer with four nodes |
| MOD2 | KNN (IBk) | # of total floors; # of floors above ground floors; # of parking spaces; cost; building area; height to tip; floor area; # of elevators | Nearest Neighbor: LinearNN; Distance function: Manhattan Distance; K: 1 |
| MOD3 | KNN (IBk) | # of total floors | Nearest Neighbor: LinearNN; Distance function: Manhattan Distance; K: 18 |
| MOD4 | SVM (SMOReg) | # of total floors; # of floors above ground floors; # of parking spaces; cost | Kernel: Polykernel; Cost function, C: 1; Epsilon: 1E-12; Epsilon parameter: 1E-3 |
| MOD5 | SVM (SMOReg) | # of total floors | Kernel: Pearson VII function based universal kernel (PUK); Cost function, C: 100; Epsilon: 1E-12; Epsilon parameter: 1E-4 |

Table 7. Performance of initial ML models

| Performance measure | MOD1 | MOD2 | MOD3 | MOD4 | MOD5 |
|---|---|---|---|---|---|
| RMSE | 356.26 | 380.79 | 477.91 | 446.06 | 481.17 |
| R² | 0.53 | 0.64 | 0.54 | 0.49 | 0.47 |
| MAPE | 0.22 | 0.22 | 0.31 | 0.27 | 0.32 |
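The three performance measures reported in Table 7 can be computed as in the following Python sketch. It assumes the standard definitions of RMSE and MAPE (as a fraction, so 0.22 means 22%) and computes the Pearson correlation coefficient between actual and predicted values. Whether the reported R² is this coefficient or its square is not restated in this section, so that choice is an assumption; the function names and sample values are hypothetical.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error, in the units of the target (duration)."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def mape(actual, predicted):
    """Mean absolute percentage error as a fraction (0.22 means 22%)."""
    n = len(actual)
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / n

def pearson_r(actual, predicted):
    """Pearson correlation between actual and predicted values.

    Assumption: square this value if R² is meant as a squared correlation.
    """
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)

# Hypothetical actual and predicted durations (illustrative only)
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]
print(rmse(actual, predicted), mape(actual, predicted), pearson_r(actual, predicted))
```

Lower RMSE and MAPE and a higher correlation indicate a better model, which is the ordering used to rank the models in Table 7.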

Fig. 2. Cross-plots of the actual vs. predicted duration values for selected models

Table 8. ML models based on Ensemble Methods

| Model | Ensemble method (input models: MOD1; MOD2; MOD4) |
|---|---|
| MOD6 | Mean |
| MOD7 | Maximum |
| MOD8 | Minimum |
| MOD9 | MLRA |
| MOD10 | ANN (0.3 learning rate; one hidden layer with four nodes) |
| MOD11 | KNN (Nearest Neighbor: LinearNN; Distance function: Euclidean Distance; K: 2) |
| MOD12 | SVM (Kernel: Pearson VII function based universal kernel (PUK); Cost function, C: 5; Epsilon: 1E-12; Epsilon parameter: 1E-5) |
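The trained-rule entries in Table 8 (MOD9 to MOD12) use the outputs of MOD1, MOD2 and MOD4 as inputs to a second-level learner. The Python sketch below illustrates the idea with a least-squares linear combiner, which is closest in spirit to the MLRA combiner of MOD9; it is not the ANN combiner of MOD10 and not the study's Weka setup. All names and numbers are hypothetical, and for brevity the combiner is fitted on the same cases it is applied to, which is an assumption of this sketch rather than a description of the study's validation scheme.

```python
def solve(a, b):
    """Solve a·x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit_linear_combiner(base_preds, actual):
    """Least-squares weights (intercept first) over base-model outputs."""
    rows = [[1.0, *case] for case in zip(*base_preds)]
    k = len(rows[0])
    # Normal equations: (X^T X) w = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, actual)) for i in range(k)]
    return solve(xtx, xty)

def combine(weights, base_preds):
    """Apply the fitted combiner to base-model predictions."""
    return [weights[0] + sum(w * p for w, p in zip(weights[1:], case))
            for case in zip(*base_preds)]

# Hypothetical base-model predictions and observed durations (illustrative only)
mod1 = [900.0, 1200.0, 1500.0, 700.0, 1000.0]
mod2 = [950.0, 1100.0, 1600.0, 800.0, 1050.0]
mod4 = [870.0, 1250.0, 1450.0, 650.0, 1100.0]
observed = [920.0, 1180.0, 1540.0, 730.0, 1040.0]

weights = fit_linear_combiner([mod1, mod2, mod4], observed)
stacked = combine(weights, [mod1, mod2, mod4])
print(weights, stacked)
```

Replacing the linear combiner with an ANN, KNN or SVM regressor trained on the same three inputs would correspond to the other trained-rule rows of Table 8.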

Table 9. Performance of ML models based on Ensemble Method

| Performance measure | MOD6 | MOD7 | MOD8 | MOD9 | MOD10 | MOD11 | MOD12 |
|---|---|---|---|---|---|---|---|
| RMSE | 338.67 | 473.23 | 331.43 | 372.63 | 301.76 | 310.13 | 349.59 |
| R² | 0.64 | 0.59 | 0.67 | 0.64 | 0.69 | 0.71 | 0.69 |
| MAPE | 0.21 | 0.30 | 0.17 | 0.21 | 0.18 | 0.18 | 0.19 |
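Tables 7 and 9 together allow the relative improvement of the lowest-RMSE ensemble model (MOD10) over the weakest initial model (MOD5) to be checked by hand. The short Python sketch below performs that arithmetic using only the values printed in the two tables; the helper name `relative_gain` is hypothetical, and the gain is taken relative to the MOD5 value.

```python
def relative_gain(baseline, improved, higher_is_better):
    """Improvement of `improved` over `baseline`, as a fraction of baseline."""
    change = improved - baseline if higher_is_better else baseline - improved
    return change / baseline

mod5 = {"RMSE": 481.17, "R2": 0.47, "MAPE": 0.32}   # from Table 7
mod10 = {"RMSE": 301.76, "R2": 0.69, "MAPE": 0.18}  # from Table 9

gains = {
    "R2": relative_gain(mod5["R2"], mod10["R2"], higher_is_better=True),
    "MAPE": relative_gain(mod5["MAPE"], mod10["MAPE"], higher_is_better=False),
    "RMSE": relative_gain(mod5["RMSE"], mod10["RMSE"], higher_is_better=False),
}
for name, gain in gains.items():
    print(f"{name}: {gain:.0%}")
# prints:
# R2: 47%
# MAPE: 44%
# RMSE: 37%
```

For example, the RMSE gain is (481.17 - 301.76) / 481.17, which is approximately 0.373.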

4.3.2. Stacking (trained rules)

The three best performing models (MOD1, MOD2 & MOD4) were also combined using trained rules, with each of the four algorithms considered in the study employed in turn as the combiner system. The details of the optimal hyperparameters for the combiner system are presented in Table 8. Thus, four more models were developed, labeled MOD9, MOD10, MOD11 & MOD12. As shown in Table 9, the best performing model was considered to be MOD10 based on its low RMSE and MAPE values. It can be observed that the correlation coefficients (R²) of some of the models, specifically MOD8, MOD10, MOD11, and MOD12, were approximately the same. Consequently, the best performance was decided based on the reduced error observed in the RMSE and MAPE values. It may also be inferred that further improvement in predictive performance may require other strategies, such as establishing a larger dataset, investigating other machine learning algorithms, or adopting automated approaches to hyperparameter tuning.

5. Discussion

Tall building construction is rapidly developing in the urban context as a sustainable solution to an impending housing crisis and urban population explosion. The complexity involved in the design and construction of many tall buildings has resulted in notorious time overruns, incompletion, and total abandonment [6,9,38]. Time overruns in tall building projects could lead to dissatisfied stakeholders, litigation, project abandonment, and ultimately a failure to fulfil their intended purposes. Therefore, previous studies have suggested that the use of mathematical models, as well as data mining/machine learning, to predict construction duration is a viable mitigation strategy [41]. The studies that dominate the research arena are limited in the techniques employed, especially concerning tall building projects.

In light of the foregoing, this study developed a machine learning model based on a systematic investigation and further combination of various machine learning algorithms. The study firstly established a dataset and subsequently carried out pre-processing of the data. The results showed that the view of the dataset which enhanced the
performance of machine learning algorithms was the normalized view, where all features (input and output) were re-scaled to range between 0 and 1. The study further explored various feature sets that contribute to the performance of the algorithm. This was to identify and select the features that contribute to improving the predictive performance of the ML algorithm. Additionally, feature selection helps to control the “curse of dimensionality”, which is a phenomenon characteristic of real data. In this study, the most relevant feature that correlates with the output for prediction (i.e. duration) was the number of total floors. This is a logical outcome when considering the nature of the study’s focus (i.e. tall buildings). Further to that was the algorithm evaluation process, where the hyperparameters of the selected algorithms were adjusted to determine the optimal values for the initial duration estimation models (MOD1, MOD2, MOD3, MOD4, and MOD5). The initial duration estimation models were further combined through ensemble techniques, i.e. fixed and trained rules. The final result from the overall process was the selection of a model based on a combination of three initial models using ANN as the combiner. The best performing model in this study, MOD10, had an R² of 0.69, MAPE of 0.18, and RMSE of 301.76.

The level of accuracy of MOD10 suggests that it could be recommended as a decision support tool in estimating the duration of tall building projects. A comparison of the performance of MOD10 with the poorest performing model in this study (i.e. MOD5) revealed a performance gain of 47% in the correlation coefficient (R²), 44% in MAPE, and 37% in RMSE. Likewise, MOD10 arguably outperforms the CBR model developed by Li et al. [11] for skyscrapers. The R² value obtained in this study for MOD10 was 69%, while the CBR model developed by Li et al. [11] achieved 62%, which was only improved to 69% when some poorly predicted cases were deleted from the CBR model. Thus, the superiority of MOD10 is apparent. The limitations of this study may be reflected in the source and size of the dataset used. The dataset in this study contains about 35 cases, which may limit the predictive performance of the model built, given the limited amount of data available for training and testing. Similarly, previous research works in duration prediction have relied on datasets of similar size. It may thus be concluded that the construction industry is deficient in recording and publishing data suitable for ML applications for duration prediction. Furthermore, the limitation of the dataset being sourced from China may be mitigated by China being the major driver of tall building construction globally [34]. Also, the dataset contains the GDP of the cities where the building projects are located and thus may provide an opportunity for extension to other construction climates globally based on the GDP of a city. The significance of this study is reflected in its addressing a current trend in the construction industry, which is the exponential rise of tall building projects in urban centers across the globe. These projects are known to be characterized by their delayed completion times.

6. Conclusion

This study demonstrated the significance of leveraging the capabilities of machine learning for enhanced time performance in the construction industry. Specifically, the literature reveals that there is a dearth of studies in the construction domain on the time performance of tall buildings. Tall building projects have become a dominant building typology of the modern urban habitat. Furthermore, it is now widely considered an important area of construction engineering and management research. Many factors may justify studying this specific building typology separately from other/horizontal building types. For instance, this study reveals that the most significant factor influencing the construction duration is the total number of floors, an intrinsic attribute of tall buildings. Other potential factors may include the structural systems used. Furthermore, tall buildings have specific characteristics that may lead to their delayed completion times, such as the complexity in design and construction, as well as the large number of professionals involved.

Notably, tall buildings are also considered a viable strategy towards sustainable urban development. Poor time performance of such projects will defeat their intended purpose of providing adequate urban space for the inevitable population. Therefore, accurate estimation of the duration of such projects based on historic data is of potential value to the broader society. Thus, the contribution of this study to the global community is shown in facilitating timely delivery of tall building projects as a sustainable strategy to an impending housing crisis. As regards the study’s contribution to the construction community, it enhances time performance through the adoption of modern digital technology in a rarely researched domain such as tall buildings. It is widely acknowledged that time performance is a crucial measure of project success in the construction industry.

This study achieved this contribution through a systematic approach in developing models through the application of established machine learning algorithms as well as combinations of the same. The study did not intend to develop new machine learning algorithms. However, the study has methodically applied existing algorithms in developing predictive models for estimating the duration of tall building projects. The model thus proposed in this study was based on an ensemble method using ANN as the combiner. Remarkably, the model’s accuracy, which is comparable to similar studies, suggests its suitable adoption as a decision support tool. Convincingly, the application of machine learning has the potential to make the process of duration estimation smarter and more efficient. The model proposed in this study may be limited in its level of generalization, as is the case with data-driven models. However, the systematic procedure described herein could be adapted to other datasets, while the dataset used in the current study could be expanded for enhanced performance and applicability. Forthcoming research will seek to incorporate such predictive models into computing tools used in construction project management, and also make comparative assessments with traditional methods.

Data availability statement

Some or all data, models, or code that support the findings of this study are available from the corresponding author upon reasonable request.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

[1] Hamta N, Ehsanifar M, Sarikhani J (2018) Presenting a goal programming model in the time-cost-quality trade-off. International Journal of Construction Management, 21(1): 1-11.
[2] Williamson M, Ganah A, John GA (2019) Barriers to adopting modern methods of construction in the UK. Journal of Construction Engineering, 2(1): 30-39.
[3] Sanni-Anibire MO, Mohamad Zin R, Olatunji SO (2020) Causes of delay in the global construction industry: a meta analytical review. International Journal of Construction Management, doi:10.1080/15623599.2020.1716132.
[4] Bromilow FJ (1969) Contract time performance expectations and the reality. In Building forum, 1(3): 70-80.
[5] Alzara M, Kashiwagi J, Kashiwagi D, Al-Tassan A (2016) Using PIPS to minimize causes of delay in Saudi Arabian construction projects: university case study. Procedia Engineering, 145: 932-939.
[6] CTBUH (2014) Dreams deferred: unfinished tall buildings. CTBUH Journal, 4: 46-47.
[7] Abdul-Rahman H, Berawi MA, Berawi AR, Mohamed O, Othman M, Yahya IA (2006) Delay mitigation in the Malaysian construction industry. Journal of Construction Engineering and Management, 132(2): 125-133.
[8] Abu Hammad AAA, Ali SMA, Sweis GJ, Bashir A (2008) Prediction model for construction cost and duration in Jordan. Jordan Journal of Civil Engineering, 2(3): 250-266.
[9] Sanni-Anibire MO, Zin RM, Olatunji SO (2020) Causes of Delay in Tall Building Projects in GCC Countries. Proceedings of the 8th International Conference on Construction Engineering and Project Management, Dec. 7-8, 2020, Hong Kong SAR.
[10] Ballesteros-Pérez P, Larsen GD, González-Cruz MC (2018) Do projects really end late? On the shortcomings of the classical scheduling techniques. JOTSE: Journal of Technology and Science Education, 8(1): 17-33.
[11] Li Y, Lu K, Lu Y (2016) Project schedule forecasting for skyscrapers. Journal of Management in Engineering, 33(3): 05016023.
[12] Khosrowshahi F, Kaka AP (1996) Estimation of project total cost and duration for housing projects in the UK. Building and Environment, 31(4): 375-383.
[13] Chan DWM, Kumaraswamy MM (1999) Forecasting construction durations for public housing projects: a Hong Kong perspective. Building and Environment, 34(5): 633-646.
[14] Skitmore RM, Ng ST (2003) Forecast models for actual construction time and cost. Building and Environment, 38(8): 1075-1083.
[15] Blyth K, Lewis J, Kaka A (2004) Predicting project and activity duration for buildings in the UK. Journal of Construction Research, 5(02): 329-347.
[16] Attal A. Development of neural network models for prediction of highway construction cost and project duration. Doctoral dissertation, Ohio University, 2010.
[17] Koo C, Hong T, Hyun C, Koo K (2010) A CBR-based hybrid model for predicting a construction duration and cost based on project characteristics in multi-family housing projects. Canadian Journal of Civil Engineering, 37(5): 739-752.
[18] Mensah I, Nani G, Adjei-Kumi T (2016) Development of a Model for Estimating the Duration of Bridge Construction Projects in Ghana. International Journal of Construction Engineering and Management, 5(2): 55-64.
[19] Jin R, Han S, Hyun C, Cha Y (2016) Application of case-based reasoning for estimating preliminary duration of building projects. Journal of Construction Engineering and Management, 142(2): 04015082.
[20] Peško I, Mučenski V, Šešlija M, Radović N, Vujkov A, Bibić D, Krklješ M (2017) Estimation of costs and durations of construction of urban roads using ANN and SVM. Complexity, Article ID: 2450370.
[21] Yeom DJ, Seo HM, Kim YJ, Cho CS, Kim Y (2018) Development of an approximate construction duration prediction model during the project planning phase for general office buildings. Journal of Civil Engineering and Management, 24(3): 238-253.
[22] Sağlam B, Bettemir ÖH (2018) Estimation of duration of earthwork with backhoe excavator by Monte Carlo Simulation. Journal of Construction Engineering, Management & Innovation, 1(2): 85-94.
[23] Hinze J. Construction Planning and Scheduling. Pearson/Prentice Hall, 2011.
[24] Faghihi V, Nejat A, Reinschmidt KF, Kang JH (2015) Automation in construction scheduling: a review of the literature. The International Journal of Advanced Manufacturing Technology, 81(9-12): 1845-1856.
[25] Liu H, Al-Hussein M, Lu M (2015) BIM-based integrated approach for detailed construction scheduling under resource constraints. Automation in Construction, 53: 29-43.
[26] Chan DW, Kumaraswamy MM (1995) A study of the factors affecting construction durations in Hong Kong. Construction Management and Economics, 13(4): 319-333.
[27] Mohan S (1990) Expert systems applications in construction management and engineering. Journal of Construction Engineering and Management, 116(1): 87-99.
[28] Al-Kofahi ZG, Mahdavian A, Oloufa A (2020) System dynamics modeling approach to quantify change orders impact on labor productivity 1: principles and model development comparative study. International Journal of Construction Management, doi:10.1080/15623599.2020.1711494.
[29] Hendrickson C, Zozaya-Gorostiza C, Rehak D, Baracco-Miller E, Lim P (1987) Expert system for construction planning. Journal of Computing in Civil Engineering, 1(4): 253-269.
[30] Moselhi O, Nicholas MJ (1990) Hybrid expert system for construction planning and scheduling. Journal of Construction Engineering and Management, 116(2): 221-238.
[31] Shaked O, Warszawski A (1995) Knowledge-based system for construction planning of high-rise buildings. Journal of Construction Engineering and Management, 121(2): 172-182.
[32] Hoffman GJ, Thal Jr AE, Webb TS, Weir JD (2007) Estimating performance time for construction projects. Journal of Management in Engineering, 23(4): 193-199.
[33] Lin MC, Tserng HP, Ho SP, Young DL (2011) Developing a construction-duration model based on a historical dataset for building project. Journal of Civil Engineering and Management, 17(4): 529-539.
[34] CTBUH (2018) CTBUH Year in Review: Tall Trends of 2018. https://www.skyscrapercenter.com/year-in-review/2018
[35] Witten IH, Frank E, Hall MA, Pal CJ. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 2016.
[36] Brownlee J (2018) Machine learning mastery with Weka. https://machinelearningmastery.com/machine-learning-mastery-weka/
[37] Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Yu PS, Zhou ZH (2008) Top 10 algorithms in data mining. Knowledge and Information Systems, 14: 1-37.
[38] Sanni-Anibire MO, Zin RM, Olatunji SO (2020) Machine learning model for delay risk assessment in tall building projects. International Journal of Construction Management, doi:10.1080/15623599.2020.1768326.
[39] Xia R, Zong C, Li S (2011) Ensemble of feature sets and classification algorithms for sentiment classification. Information Sciences, 181(6): 1138-1152.
[40] Kuncheva LI. Combining Pattern Classifiers: Methods and Algorithms. John Wiley & Sons, 2014.
[41] Gunduz M, Nielsen Y, Ozdemir M (2015) Fuzzy Assessment Model to Estimate the Probability of Delay in Turkish Construction Projects. Journal of Management in Engineering, 31(4): 04014055.
