To cite this article: T. Q. D. Pham, T. Le-Hong & X. V. Tran (2021): Efficient estimation and
optimization of building costs using machine learning, International Journal of Construction
Management, DOI: 10.1080/15623599.2021.1943630
ABSTRACT

This study provides a fast and accurate Machine learning (ML) and optimization framework, which allows a quick estimate for building costs, hence improving the operational efficiency and competitiveness of a construction company. A dataset composed of 10,000 parametric building configurations, collected from end-to-end real-world activities in our partner company, was used to train and validate the ML models to perform multiple tasks. Among the 13 ML regression algorithms used, the Artificial Neural Network (ANN), Gradient Boosting, and XGBoost models appear to be the most suitable to estimate the building costs and the required resources with an accuracy of 99% within less than a second of training time. ANN models are also developed to identify available options of the building features under a given budget. The optimization problem under constraints is solved, helping clients determine the optimal building costs according to their preferences. Besides, the optimized building costs obtained by this study are 7% smaller than those of the actual data, hence improving the company's competitiveness. This study showcases that ML models can be efficiently used in the construction sector to optimize the workflow for cost savings and provides some practical implications for data-driven management.

KEYWORDS: Building cost; maintenance cost; machine learning; regression analysis; neural networks; operational efficiency
help to streamline operations, improve design and engineering cost savings as well as risk management efficiency.
4. Finally, in order to take into account the very diverse preferences of customers, an optimization tool under constraints needs to be developed to estimate the building costs in different scenarios.

It should be noted that these four tasks can be done and continuously improved by the current practices. This paper aims to demonstrate how a construction company can exploit the available data to perform the above-mentioned tasks with better efficiency.

Supervised machine learning (ML) models, based on self-learning algorithms using a set of training data, are widely used in different technology sectors (Kim et al. 2004; 2004; 2013). As discussed in detail in Section literature review, the application of ML for building cost prediction using the available real-world engineering data from a construction company has not yet become extensive in construction management and operation applications (Kim et al. 2004; 2004; 2013). Also, it appears that few papers have reported a systematic deployment of a dozen ML models to perform all the above-mentioned four tasks, to identify their suitability, and to provide some practical implications. More precisely, solving an inverse prediction model to determine the range of feature options under a given budget, and optimization problems with constraints to save costs, have not gained extensive interest. These problems are investigated in this paper in order to provide fast and accurate ML-based models that can be used to predict and optimize the building costs under different scenarios.

In this study, based on the actual data obtained from a long-established construction company, an ML-based estimation tool is developed to provide a fast and efficient estimation of the building and maintenance costs directly from the customer requests (end-to-end services). More precisely, the laborious and time-consuming steps performed by the back-end offices are to be integrated into an automatic tool that the front-end company can employ to quickly estimate the building costs and better discuss them with customers. The ML-based model is expected to link the client requests and the final cost estimation within a short time, improving the client-builder relationship. Besides, repeated designs, as well as resource estimations, can be avoided, leading to significantly enhanced work efficiency. In addition, for a set of feature constraints required by the customer preferences, the building cost needs to be minimized to increase the company's profits. Thus, the tool developed in this study will help the company save costs, increase revenues, and improve customer satisfaction.

In this paper, Section literature review is dedicated to a literature review. The data description, data processing techniques, and ML models are presented in Section dataset and methodology. In Section results and discussions, the results obtained from the ML models performing the four tasks mentioned previously, including the cost prediction for a given set of building features, the identification of the possible building features for a given budget, the prediction of the required resources for a given set of building features, and the optimization of the building costs with and without constraints on building features, are discussed in detail prior to conclusions in the last section.

Literature review

It should be noted that we focus on a medium-size construction company that is working on many relatively small projects, such as few-storey buildings, and there is constant and open interaction between the house owner and builder to finalize the building features. The purpose of the building cost prediction is to identify the end-to-end relationship between the building features (i.e., number of floors) requested by the customer and the building costs computed by the company to set the offered cost estimation. Focussing on the four tasks described in Section introduction, this section aims to review articles related to building cost estimation using ML models.

Note that the objective of this study is to predict the building cost from the construction company's perspective rather than the house price from the clients' one. However, as the building costs and the house price are somehow interlinked, both these perspectives are reviewed. In the housing sector, ML is used to predict house prices from mortgage data based on house features, for example, location, state, useful surface, garden … (Limsombunchai 2004; Park and Bae 2015; Gao et al. 2019; Madhuri et al. 2019), and mainly for the demand side once the houses are already built (Limsombunchai 2004; Gao et al. 2019). From the supply side of house building companies, advanced data analytics has recently gained interest. Le et al. (2019) performed an in-depth literature review of civil integrated management data. They concluded that data sharing and integration were critical for successfully implementing integrated civil infrastructure management.

Ghosalkar and Dhage (2018) predicted the building cost using multiple linear regression ML algorithms. The three most important factors influencing the building cost were physical conditions, concepts, and location. Their research proved that the linear regression model predicted the building cost with a mean squared error of 0.3713, ensuring a good predictive model. Besides, Bhagat et al. (2016) developed a purely statistical framework to predict efficient house pricing for real estate customers with respect to their budgets and priorities using a linear regression algorithm. Based on a dataset composed of 286 building configurations collected in the United Kingdom, Lowe et al. (2006) predicted the total client costs given the construction and client costs with an R2 (coefficient of determination) value of 0.661 using multiple linear regression. Recently, Huang and Hsieh (2020) proposed a simple linear regression framework to predict the labour cost using 19 completed Building Information Modelling (BIM) projects from a leading construction company in Taiwan. Their results indicated that the linear regression model was the most stable, compared to the Random Forest model, as the number of projects increased.

Regarding the nonlinear ML methods, Chen et al. (2017) adopted a novel approach based on the Support Vector Machine (Suykens and Vandewalle 1999) algorithm to predict residential house prices. Their results showed that the maximum accuracy (R2 value) could reach 0.72. Phan (2018) developed a framework using a polynomial regression method to forecast house prices in Australia with an accuracy (R2) of 0.84. Besides, the Artificial Neural Network (ANN) has also been used in the construction sector, for example, to predict engineering service costs (Matel et al. 2019) or to classify the delay risk in tall building projects (Sanni-Anibire et al. 2020). These studies suggested that the ANN model can reach a high accuracy of more than 93%.

As for the boosting-based ML algorithms, Ceh et al. (2018) evaluated Random Forest's performance versus that of multiple linear regression when predicting the prices of 7407 apartments using data from 2008-2013. The authors concluded that the Random Forest outperformed the other regression method on all performance measures. The Random Forest has also been used to predict
the building cost in many studies, as in (Afonso et al. 2019; Hong et al. 2020; Jui et al. 2020). Several frameworks related to building cost prediction were recently developed using the XGBoost method (Chen et al. 2015). Peng et al. (2019) used this technique to analyze 35,417 pieces of data captured by the Chengdu HOME LINK network. The results indicated that the XGBoost prediction accuracy was the highest, with an R2 value reaching about 0.93. In a similar study, Truong et al. (2020) concluded that XGBoost appeared to be the best algorithm to predict the house price.

It should be noted that the above-mentioned studies (Suykens and Vandewalle 1999; Lowe et al. 2006; Chen et al. 2015; Bhagat et al. 2016; Chen et al. 2017; Ceh et al. 2018; Ghosalkar and Dhage 2018; Phan 2018; Afonso et al. 2019; Matel et al. 2019; Peng et al. 2019; Hong et al. 2020; Huang and Hsieh 2020; Jui et al. 2020; Sanni-Anibire et al. 2020; Truong et al. 2020) mainly focussed on the estimation of the building costs using data from one or a few internal services, without going through an end-to-end workflow from customer requests to the offered building price. Regarding cost optimization used to improve profits, few studies have been reported in the literature. Kravanja and Zula (2010) presented a simultaneous cost, topology, and standard cross-section optimization of industrial steel building structures using a mixed-integer nonlinear programming approach. Similarly, Risbeck et al. (2015) developed a framework to optimize the combined building heating/cooling equipment cost. Besides, Lesic et al. (2017) proposed a modular energy cost optimization for buildings with an integrated microgrid. The results showed a considerable cost saving of modular energy in different configurations.

Based on this review, it appears that the use of advanced data analytics for real-world data collected from the end-to-end activities of a construction company to perform the four tasks described in Section introduction has not been extensively studied. Also, most of the published works focussed on predictive models to estimate the costs from the building features. However, limited attention has been given to an inverse approach that identifies the features under a given budget. Besides, the optimization of the building costs performed in the literature was usually subjected to constraints on the input, which will be considered in this paper. Additionally, our study aims to demonstrate the suitability of different ML regression methods and optimization under constraints using real-world data from the end-to-end activities of a construction company in order to provide some practical implications for construction management.

Dataset and methodology

Data description

In this study, a dataset composed of 10,000 parametric building configurations collected from the end-to-end activities of a construction company was used to train, validate, and optimize ML models in order to estimate building costs and required resources. The dataset consists of 24 variables divided into three specific groups, including building features, costs, and required resources. The building features, denoted by the vector X, have 11 features named X1, X2, … , X11. They consist of key characteristics of a building, which are provided by the client, such as the number of floors, the building depth, the width of windows, the floor height, etc. The costs, denoted by the vector Z, have 2 components, Z1 and Z2, corresponding to the building and maintenance cost per unit surface, respectively. The required resources, denoted by the vector Y, are composed of 11 parameters, named Y1, Y2, … , Y11. Internal to the company, these variables represent the amount of cement, sand, etc., estimated by the company, which serve to calculate costs and prepare the necessary resources for construction.

The actual dataset is included in the Appendix. This dataset is a matrix of 10,000 lines and 24 columns formed by the standardized data. The first 11 columns represent the independent variables of the building features X. The next two columns represent the costs Z, which are functions of the building features X. The last 11 columns are the required resources Y estimated by different company departments, which are also functions of X. Each line in the data represents a configuration of the building to be built, the associated costs estimated beforehand, and the required resources. They were obtained across the internal services and departments of the company.

Figure 1. Statistical distributions of the variables X, Z, and Y from the actual dataset.

Figure 1 shows the distributions of the variables X, Z, and Y from the actual dataset. Note that Figure 1 is presented in logarithmic scale without normalization, and the negative values of X1 are excluded. As can be seen in the figure, each feature
exhibits a different distribution. These differences could be very sensitive and considerably affect the algorithms that consider the Euclidean distance, such as the K-means algorithm. Additionally, this significantly increases the ML model's training time, in particular that of the ANN.

Regression frameworks

Figure 2 represents the workflow for the ML regression framework used in this study. It includes data processing, the ML regression algorithm, and model evaluation. The actual data, defined as those collected from the company without any modification, were first analyzed using several processing methods, such as data normalization, data checking, and Principal Component Analysis (PCA) (Wold et al. 1987). They were then split into training and test datasets with a ratio of 80% and 20%, respectively. The training dataset was used as the input for the ML models, which were then validated using the test dataset. During this process, cross-validation and parameter tuning were performed to improve/maximize the model accuracy.

Data processing

Data normalization

As shown in Figure 1, due to the diverse intrinsic characteristics and properties of the features X from the actual data, their scales are relatively large, with different orders of magnitude and statistical distributions. This significantly slows down the computation process and negatively affects the model performance. In order to deal with these issues, the data of each column in the actual dataset were normalized to the range of 0 and 1 using the following equation:

X_{scaled} = \frac{X - X_{min}}{X_{max} - X_{min}}    (1)

where X_scaled is the normalized value of each variable of X from the actual dataset, and X_max and X_min represent its actual maximum and minimum values. A similar normalization was performed on the Y and Z variables. The distributions of the normalized data are shown in Figure 3. In this paper, the normalized values were used for the training, testing, and validation processes, as well as for result illustration, except for the optimization task under constraints presented in Section optimization of the costs Z under constraints on the building features X.

Data checking and Principal Component Analysis

Figure 4 shows the correlation coefficients between the building features X before and after applying the Principal Component Analysis (PCA), using Pearson correlation analysis. It can be seen that without the PCA, some features are strongly correlated, in particular X9 or X10 with other features, with absolute values of the correlation coefficient of 0.4 or 0.5, as shown in Figure 4. The correlations
Figure 4. Correlation matrices for the building features X before (left) and after (right) applying the PCA by Pearson correlation analysis.
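As a rough illustration of this data-processing step, the min-max normalization of Eq. (1) and the Pearson correlation check visualized in Figure 4 could be reproduced with pandas and scikit-learn along the following lines. This is a minimal sketch, not the authors' original script: the file name and the column labels X1–X11 are assumptions, since the published dataset carries no explicit variable labels.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical loading of the standardized 10,000 x 24 dataset described in the
# Appendix; file name and column labels are assumed for illustration only.
data = pd.read_csv("building_dataset.csv")        # columns X1..X11, Z1, Z2, Y1..Y11
feature_cols = [f"X{i}" for i in range(1, 12)]

# Eq. (1): rescale each column to [0, 1] using its own minimum and maximum.
scaler = MinMaxScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(data[feature_cols]),
                        columns=feature_cols)

# Pearson correlation matrix of the building features (cf. Figure 4, left panel).
corr = X_scaled.corr(method="pearson")
print(corr.round(2))
```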
Table 1. Analysis results of the PCA method for the dataset (the values higher than 0.5 are marked in bold).
                         PC1      PC2      PC3      PC4      PC5      PC6      PC7      PC8
X1                       7e-5     2.1e-4   8.9e-4   1.1e-3   8.1e-5   6.9e-3   1.9e-3   1.2e-3
X2                       2.5e-4   7.8e-4   1.5e-3   7.6e-4   5.1e-3   1.1e-3   0.02     0.012
X3                       -0.94    0.23     0.012    5.6e-3   0.12     0.01     0.10     0.15
X4                       0.28     0.87     0.021    4.7e-3   0.13     5e-3     0.044    0.28
X5                       0.073    0.086    -0.55    0.7      0.37     0.051    0.15     0.11
X6                       0.079    0.077    -0.55    -0.7     0.38     0.074    0.14     0.12
X7                       0.014    0.055    0.082    0.061    0.19     0.74     0.49     0.21
X8                       0.015    0.053    0.074    0.065    0.21     -0.66    0.61     0.19
X9                       5.9e-3   0.024    0.043    1.7e-5   0.11     0.042    0.37     0.20
X10                      0.054    0.049    0.46     4.1e-3   0.52     0.027    0.15     0.68
X11                      0.10     0.039    0.32     3.2e-3   -0.55    0.033    0.38     0.52
Variance explained (%)   62       18       9        7        2        0.7      0.5      0.3
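The PCA reduction summarized in Table 1 could be sketched with scikit-learn as below. This is an illustrative reconstruction under the assumption that the normalized feature matrix X_scaled from the previous sketch is available; the choice of 8 retained components follows the reduction from 11 to 8 features reported later for the optimization task.

```python
from sklearn.decomposition import PCA

# Decorrelate the 11 building features and keep 8 principal components.
pca = PCA(n_components=8)
X_pca = pca.fit_transform(X_scaled)

# Percentage of variance explained by each component
# (compare with the last row of Table 1).
print((100 * pca.explained_variance_ratio_).round(1))

# The loadings in pca.components_ map components back to the original
# 11 features, which is how optimized candidates can later be re-expressed
# in the original feature space.
X_back = pca.inverse_transform(X_pca)
```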
X8. A more detailed analysis of each feature's influence on the building costs will be shown in Section prediction of the building and maintenance costs Z as a function of the features X and Section prediction of the required resources Y as a function of the building features X.

Machine learning regression algorithms

In this study, 13 ML algorithms were applied to identify the most suitable models to estimate the costs Z and the required resources Y with respect to the building features X. As a basic ML algorithm, regression is a training process to label the input independent variables with the output dependent variables. It becomes a supervised learning problem and can be solved using various ML algorithms. The next section provides a brief description of the three families of ML regression algorithms.

Linear regression-based ML models
Linear regression is a supervised learning algorithm that uses a linear approach for a prediction problem. In the context of this study, the values of the independent features X are associated with those of the dependent variables Z by a linear relationship as follows:

Z = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_{11} X_{11}    (3)

As mentioned above, Z may be either the building cost per unit surface Z1 or the maintenance cost per unit surface Z2. The total budget (Z1+Z2) is also considered. Four linear regression-based algorithms, namely Linear Regression (Seber and Lee 2012), Lasso Regression (Tibshirani 1996), Ridge Regression (Hoerl and Kennard 1970), and Elastic-Net Regression (Zou and Hastie 2005), were tested in this study.

Nonlinear regression-based ML models
The nonlinear regression-based ML models h are extended versions of the linear regression-based ML models, which can be expressed as follows:

Z = h(X_1, X_2, \ldots, X_{11})    (4)

In this method, the data are fitted using a method of successive approximation for the model's hyperparameters. Decision Tree (Safavian and Landgrebe 1991), K-Nearest Neighbour (Keller et al. 1985), Support Vector Machine (Suykens and Vandewalle 1999), and Artificial Neural Network (Jain et al. 1996) were tested in this study.

Boosting regression-based ML models
Boosting regression-based ML models, or Ensemble regression models, aim to obtain a simple function with small residuals at almost every point in a sufficiently large sample. In this study, five models, namely AdaBoost (Roe et al. 2005), Extra Tree (Maier et al. 2015), Random Forest (Aldous 1993), Gradient Boosting (Friedman 2002), and XGBoost (Chen et al. 2015), were tested.

Evaluation metrics

In order to assess the performance of the building cost prediction model using ML, two error metrics, the Mean Squared Error (MSE) and the coefficient of determination R2, were used in this study. The MSE measures how close the estimated data are to the actual data by calculating the average of the squared deviation between the estimated values (obtained from the ML model) and the actual values by the following formula:

MSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2}    (5)

where n, \hat{y}, and y are the number of samples, the predicted values from the regression, and the actual values, respectively. Hence, the lower the value of MSE, the better the model predicts.

The coefficient of determination R2 represents a measure of how much the model replicates the actual values on the basis of the set of various errors:

SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2    (6-a)
SST = \sum_{i=1}^{n} (y_i - \bar{y})^2    (6-b)
R^2 = 1 - \frac{SSE}{SST}    (6-c)

where SSE, SST, and \bar{y} represent the residual sum of squares, the total sum of squares, and the mean of the actual values, respectively. Thus, the closer to 1 the value of R2 is, the better the model predicts and the higher its accuracy.

Genetic algorithm for optimization models

A Genetic Algorithm (Goldberg 2006) is a computational method for solving constrained and unconstrained optimization problems, which is based on the natural selection process that mimics biological evolution. This technique repeatedly modifies a population of individual solutions, i.e., candidate inputs that minimize the optimization function, to search for the optimal one created through the propagation of positive traits and mutation. Each solution is rated by a target function named the fitness function; only the solutions with the highest fitness score are allowed to move on from one generation to the next. The whole algorithm can be defined as a 4-step process: population representation and initialization, objective and fitness functions, selection and crossover, and mutation, as shown in Figure 5.

Results and discussions

In this section, the results of the ML models for the different tasks are presented. First, an overview of the four tasks described in Section introduction and inspired by real-world scenarios is discussed to recall the ML models' goals. Then, the results for these four tasks are detailed. Finally, practical implications and contributions to construction management are discussed.

To develop the ML-based models in this study, Python scripts were implemented using several Python libraries, such as Numpy (numerical computing) (Van Der Walt et al. 2011), Scipy (scientific and technical computing) (Jones et al. 2001), Matplotlib (visualization) (Hunter 2007), Pandas (data loading and analysis) (McKinney 2011), Scikit-learn (ML-based regression models) (Pedregosa et al. 2011), and Keras (artificial neural networks) (Gulli and Pal 2017). The training time was obtained on an Intel Core i5 laptop equipped with 6 GB RAM and 2 CPU cores.
Figure 5. Genetic algorithm framework for the building cost optimization with and without constraints.
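A minimal sketch of the four-step loop in Figure 5 is given below, using the crossover and mutation rules quoted later in the optimization section (a child is the average of two parents, and mutation adds a random value in [-0.1, 0.1]) and a population of 100 individuals evolved over 100 generations. The function names are assumptions; in the paper the fitness function is the trained ANN predicting the cost Z from the features X.

```python
import numpy as np

rng = np.random.default_rng(0)

def genetic_minimize(cost_fn, n_features, bounds=(0.0, 1.0),
                     pop_size=100, n_generations=100):
    """Minimize cost_fn over [bounds]^n_features with a simple Genetic Algorithm.

    cost_fn maps an array of shape (n_features,) to a scalar cost; in the paper
    this role is played by the trained ANN cost model.
    """
    low, high = bounds
    # 1) Population representation and initialization.
    pop = rng.uniform(low, high, size=(pop_size, n_features))
    for _ in range(n_generations):
        # 2) Objective and fitness: a lower cost means a higher fitness.
        costs = np.array([cost_fn(ind) for ind in pop])
        order = np.argsort(costs)
        parents = pop[order[: pop_size // 2]]        # keep the fittest half
        # 3) Selection and crossover: each child averages two random parents.
        idx_a = rng.integers(0, len(parents), size=pop_size)
        idx_b = rng.integers(0, len(parents), size=pop_size)
        children = 0.5 * (parents[idx_a] + parents[idx_b])
        # 4) Mutation: add a random value in [-0.1, 0.1] to each individual.
        children += rng.uniform(-0.1, 0.1, size=children.shape)
        pop = np.clip(children, low, high)
    costs = np.array([cost_fn(ind) for ind in pop])
    best = int(np.argmin(costs))
    return pop[best], costs[best]

# Example usage with an assumed trained Keras cost model:
# best_x, best_cost = genetic_minimize(
#     lambda x: float(cost_model.predict(x[None, :], verbose=0)[0, 0]),
#     n_features=11)
```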
Overview of tasks performed by machine learning models

In this section, the four main tasks previously described in Section introduction were performed using ML-based models. The first task was to predict the building cost per unit surface (Z1) and the maintenance cost per unit surface (Z2) for a given set of features X. This task aims to provide the company with a fast and accurate tool to estimate the building costs based on the actual dataset. The second task was to identify potential options of the building features X that a client could afford for a given cost (budget) Z (= Z1 + Z2). This task aims to guide the company to the available building options to communicate with the client effectively. The third task was to predict the required resources Y for a given set of features X so that all necessary materials and labour can be stocked and planned before construction. This task aims to ensure that the risks of delays due to resource shortage can be limited, and the costs can be accurately estimated beforehand. The fourth task was to minimize the costs Z under a given set of constrained features X according to the clients' preferences. This task aims to ensure that the company's competitive but profitable prices are always offered to the client.

Prediction of the building and maintenance costs Z as a function of the features X

In order to evaluate the performance of the selected ML-based models described in Section machine learning regression algorithms, the dataset was split into two parts for training and testing. The test dataset was used for the model validation using the R2 metric in Eq. (6-c). For the Boosting methods, 2000 estimators were used during aggregation to maximize the model performance. The ANN model was built from three layers, with 11 nodes in the first layer corresponding to the input features, five nodes in the second layer, named the hidden layer, and one node in the last layer representing the cost. The ANN model architecture is shown in Figure 6. The model hyperparameters, such as the number of neurons in the hidden layers, the number of hidden layers, and the activation function, were selected using the Bayesian optimization algorithm from the scikit-learn library (Aldous 1993).

Table 2 lists the R2 metrics obtained from the 13 ML regression algorithms mentioned in Section machine learning regression algorithms. It can be seen that the Boosting regression-based models and the ANN model exhibit the best accuracy when performing the cost prediction, in particular Gradient Boosting, XGBoost, and ANN with a very high R2 value of 0.99. In other words, the building and maintenance costs, as well as the total cost, can be predicted very well for a given set of building features X. This observation is expected and can be explained by the fact that the Boosting regression techniques and the ANN model use gradient descent to minimize the error function. However, these methods need much more time to converge compared to the others. Regarding the linear regression-based models, except the multiple linear regression, which can reach an R2 of 0.94 after only about 0.002 s of training time, the Lasso, Ridge, and Elastic-Net regression models show a relatively low value of R2. This lack of accuracy may be due to the high correlation between the variables. In this case, these algorithms will only retain one variable and set the other correlated variables to zero. Thus, the dataset will lose information, resulting in lower model performance. It can be noticed that all the linear regression models converge rapidly, after only 0.002-0.003 s. This can be straightforwardly explained by the fact that solving a linear equation is much easier and less time-consuming than a nonlinear Boosting equation. In terms of accuracy and training time, the ANN model appears to be the
most suitable to predict the building costs, as shown in Table 2.

Figure 6. Architecture of the Artificial Neural Network for task 1.

Table 2. Values of R2 for different ML-based regression models.

                           R2                        Training time (s)
Regression model           Z1     Z2     (Z1+Z2)     Z1      Z2      (Z1+Z2)
Multiple Linear            0.94   0.94   0.94        0.002   0.003   0.003
Lasso                      0.13   0.12   0.13        0.003   0.003   0.004
Ridge                      0.12   0.13   0.13        0.003   0.003   0.004
Elastic-Net                0.15   0.13   0.13        0.003   0.003   0.003
Decision Tree              0.95   0.83   0.9         0.03    0.03    0.03
KNN                        0.82   0.79   0.8         0.3     0.3     0.4
SVM                        0.93   0.91   0.97        0.09    0.1     0.25
AdaBoost                   0.92   0.86   0.89        14.87   14.9    16.12
Gradient Boosting          0.99   0.99   0.99        12.08   12.0    12.86
Extra Tree                 0.97   0.92   0.95        48.9    48.2    49.5
Random Forest              0.97   0.92   0.95        44.94   46.3    47.7
XGBoost                    0.99   0.99   0.99        16.87   18.21   18.62
ANN                        0.99   0.99   0.99        2.79    2.83    2.94

Figure 7 shows the actual and predicted results using the linear regression and ANN models with cross-validation. Note that cross-validation was only performed for the linear regression and ANN models due to their best quality in both training time and accuracy. In the figure, the colour bar on the right side represents the density of points. Since the linear regression shows a lower R2 than the ANN model, cross-validation presents the same trend in the 3 cases Z1, Z2, and Z1+Z2, but with more scattering around the y = x line (dashed red line in Figure 7), where x and y represent the actual and predicted values of Z, respectively.

Figure 7. Scatter plot of the cross-validation result for (a) the linear regression models, and (b) the ANN models (the dashed red lines represent the y = x line).

As shown in Figure 7, the ANN model's predicted results are in excellent agreement with the actual data. As a result, the ANN was used to perform the remaining studies in this paper, thanks to its accuracy and fast training time. This result helps to establish the link between client requests and the final cost described in Figure 2 within a few seconds, instead of many days of work performed by different services of a construction company as in current practice. This can result in a more efficient and smoother interaction between the company and clients, enhancing customer satisfaction. Simultaneously, the costs incurred by the internal services can be avoided; the company's experience and knowledge are streamlined and exploited, leading to a better estimate of the building costs.

In order to investigate the influence of each building feature on the building and maintenance costs, the linear regression model was used to find the parameters with the highest coefficients \beta_i, as shown in Table 3. Note that the actual data were normalized using Eq. (1); thus, the coefficients \beta_i are called the normalized regression coefficients. According to statistical theory, the coefficient with the highest value corresponds to the most influential parameter. Also, a positive coefficient indicates an increasing contribution of the feature to the output, while a negative coefficient tends to decrease the output as the input feature increases. As listed in Table 3, among the 11 features, X4 and X9 exhibit the highest regression coefficient values in absolute terms, leading to the most influence on the building cost per unit surface Z1. Indeed, X4 and X9 correspond to the floor depth and the floor height, respectively. Thus, it is reasonable that these key geometrical building parameters can directly increase the building cost. In the case of the maintenance cost per unit surface Z2, according to the regression coefficient values shown in Table 3, the two most important features are X9 (floor height) and X10 (kitchen width). Similar to the building cost Z1, it appears that X4 and X9 are the two most influential features in the case of the total cost (Z1+Z2). These results are beneficial for the sales representatives when choosing the appropriate features to meet as much as possible the client's requests with a
reasonable price. Also, the client and seller can quickly identify the critical features that need to be modified, but always under a given budget. The customer's waiting time to receive the quotes after such changes can be minimized, improving customer satisfaction.

Identification of the features X for a given value of the cost (budget) Z

This task aims to perform a multi-output regression with only one independent variable (input). Three types of input variables were investigated, including the building cost per unit surface Z1, the maintenance cost per unit surface Z2, and the total cost (Z1+Z2). The ANN regression model was employed for this task because of its high quality in training time and accuracy, as shown in Section prediction of the building and maintenance costs Z as a function of the features X. In addition, it appears that the ANN model is the most suitable for this task since there is only one input variable with multiple outputs. The ANN architecture was selected as 1-3-5-7-11, and it contains five layers densely connected to one another, as shown in Figure 8. The first layer has one node representing the single input. The final layer represents the 11 features corresponding to the input features of Task 1.

The ReLU activation function (Li and Yuan 2017) and the Adam optimizer (Kingma and Ba 2014) were used in the inverse regression model for this task. The loss, computed by the Mean Squared Error (MSE) for training and validation (see Eq. (5)), is shown in Figure 9. Note that an epoch is defined as one cycle in which the network looks through the full training dataset. It can be seen that the models rapidly converge after about 150 epochs. Additionally, no significant difference in the loss can be observed for the three types of costs Z1, Z2, and (Z1+Z2) after model convergence.

Table 4 shows that the values of MSE obtained by the ANN regression model reach only about 0.0577, 0.058, and 0.0638 for Z1, Z2, and (Z1+Z2), respectively. For the training time, the models can converge after only a few tens of seconds. Besides, there is no significant difference among the three cases. The result indicates that the ANN model is capable of predicting the inverse task, consisting of a set of building features (X1 to X11) as the outputs with one input, in a short period of time. In practice, this information can allow the company to quickly select each feature's value for a given budget, then communicate it to the client. For example, under a given amount of budget, the model is able to indicate that the client can afford a building with a specific number of floors and windows, etc.

Prediction of the required resources Y as a function of the building features X

The required resources Y are critical information for company operation, which can be used to prepare the needed resources and estimate the costs for a construction project. Table 5 lists the regression results for two cases. Similar to Tasks 1 and 2 for the costs Z, shown in Sections prediction of the building and maintenance costs Z as a function of the features X and identification of the features X for a given value of the cost (budget) Z, the first case consists of predicting the required resources Y from the input features X. The inverse case aims to determine the building features X as a function of the input variables corresponding to the required resources Y. Contrary to Task 1 (see Section prediction of the building and maintenance costs Z as a function of the features X), in which the output contains only one variable, the first problem investigated in this part consists of a multi-output regression task, knowing that the number of required resources is 11, similar to the input features X. The first problem can be summarized as follows:

+ Train on the dataset {(X_1, X_2, \ldots, X_{11}), (Y_1, Y_2, \ldots, Y_{11})}, with Y_i \in R^{10000 \times 1}.
+ Predict the vector Y = (Y_1, Y_2, \ldots, Y_{11}) for a given X.

It is observed that the R2 values, describing the model accuracy, obtained by the ANN models for the two problem cases can reach 0.99 and 0.985, as shown in Table 5. This result is similar to that of the one-output regression in Task 1 for the costs Z (see Section prediction of the building and maintenance costs Z as a function of the features X). Due to the complicated structure
Figure 9. Epoch-dependent MSE loss of the training and validation for (a) Z1, (b) Z2, (c) (Z1þZ2).
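A hedged sketch of this 1-3-5-7-11 inverse network is given below in Keras, with the ReLU activation, the Adam optimizer, and a mean-squared-error loss. The epoch count, batch size, and variable names are assumptions for illustration rather than the authors' exact training setup.

```python
from tensorflow import keras
from tensorflow.keras import layers

# 1-3-5-7-11 inverse network for Task 2: one normalized cost input
# (e.g. Z1, Z2, or Z1+Z2) mapped to the 11 building features.
inverse_model = keras.Sequential([
    layers.Input(shape=(1,)),
    layers.Dense(3, activation="relu"),
    layers.Dense(5, activation="relu"),
    layers.Dense(7, activation="relu"),
    layers.Dense(11, activation="linear"),   # the 11 features X1..X11
])
inverse_model.compile(optimizer="adam", loss="mse")

# z_train: normalized costs, shape (n, 1); X_train: normalized features, shape (n, 11).
history = inverse_model.fit(z_train, X_train, validation_split=0.2,
                            epochs=200, batch_size=64, verbose=0)
# history.history["loss"] and ["val_loss"] give the epoch-wise curves of Figure 9.
```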
Table 4. MSE and training time obtained by the ANN model for task 2: identification of the features X as a function of the costs Z.

Input case    Mean squared error (MSE)    Training time (s)
Z1            0.0577                      19.7
Z2            0.0580                      20.3
Z1+Z2         0.0638                      17.6

Table 5. ANN model's results for the two regression tasks: required resources Y vs. building features X, and the inverse problem.

                                            R2       Training time (s)
Prediction of the required resources Y      0.990    21.2
Prediction of the building features X       0.985    23.4
Table 6. Maximum and minimum values of the building features X after standardization.
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11
Max 1.55 1.21 1.47 5.42 1.69 1.68 1.52 1.53 1.05 1.10 1.28
Min 1.54 1.97 3.27 0.93 2.46 2.47 1.92 1.94 3.24 4.41 2.81
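As a purely illustrative extension of the GA sketch given after Figure 5, the box constraints of the constrained cases could be handled by penalizing candidates that leave the admissible feature range, with the trained cost model acting as the fitness function. The names X_scaled, cost_model, and genetic_minimize refer to the earlier sketches and are assumptions; the exact constraint definitions used in the paper (Table 7 and the quantile trimming described below) may differ.

```python
import numpy as np

# Per-feature box constraints in the spirit of case 2: tighten the admissible
# range of every feature by 10% at both ends (an illustrative choice only).
x_min = X_scaled.min(axis=0).to_numpy()
x_max = X_scaled.max(axis=0).to_numpy()
span = x_max - x_min
low, high = x_min + 0.1 * span, x_max - 0.1 * span

def constrained_cost(x):
    # Reject candidates outside the allowed range with a large penalty,
    # otherwise evaluate the trained cost model used as the GA fitness.
    if np.any(x < low) or np.any(x > high):
        return 1e6
    return float(cost_model.predict(x[None, :], verbose=0)[0, 0])

best_x, best_cost = genetic_minimize(constrained_cost, n_features=11)
print("optimized normalized cost under constraints:", best_cost)
```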
some building features are highly correlated to one another, the Principal Component Analysis (PCA) was used in this task.

In this study, three ranges of features X were selected, corresponding to 100% (case 1, Xmin ≤ X ≤ Xmax), 80% (case 2, 1.1Xmin ≤ X ≤ 0.9Xmax), and 60% (case 3, 1.2Xmin ≤ X ≤ 0.8Xmax) of the minimum and maximum values of the actual dataset, as listed in Table 7. Case 1 can be considered equivalent to a non-constrained optimization problem. Note that this section aims to demonstrate that an optimization problem under constraints can help to take into account customers' preferences. For the sake of simplification, the building features were obtained by eliminating the top and bottom 10% quantiles from the dataset for case 2, and 20% for case 3, to facilitate discussions. For customers' real-world requirements, the constraints may be different, and solving these real requirements is straightforward.

For this task, the fitness function was obtained from the output of the ANN model that was developed in Section prediction of the building and maintenance costs Z as a function of the features X. The ANN model took the 11 building features as the input (X1 to X11). The output consists of the three types of building costs Z. As a consequence, three Genetic Algorithm frameworks were separately developed to consider the three types of building costs, including the building cost per unit surface Z1, the maintenance cost per unit surface Z2, and the total cost (Z1+Z2). The GA algorithm was performed for 100 generations, corresponding to the initial population's size. Besides, the crossover is defined as the average of each pair of individuals, and the mutation is the addition of a random value from the range [-0.1, 0.1] to each individual. Using the PCA, the number of building features was reduced from 11 to 8, as described previously in Section data processing, leading to a new matrix of 10,000 rows and 8 columns. This matrix was then multiplied by the inverse of the conjunction matrix of 11 rows × 8 columns, transforming the building features back into the original data matrix of 10,000 rows × 11 columns. Note that the lowest value of the actual cost after normalization is 0; thus, the obtained actual cost after optimization is expected to be negative in Case 1.

For case 1, the optimized values are equal to -0.12, -0.06, and -0.08 for Z1, Z2, and (Z1+Z2), respectively. Besides, the actual normalized minimum costs are 0 for all three cases. After reconversion into their original space of 11 dimensions, the minimum costs are reduced by 7%, 1%, and 1% for Z1, Z2, and (Z1+Z2), respectively. Regarding the optimal values of the building features for the costs, the value of the feature X3 is reduced compared to its actual value. Besides, the feature X4 increases in comparison with its actual value. Note that X3 denotes the number of floors, which can be considered one of the most important building features, since a lower value of X3 can lead to a lower building cost. In addition, when the number of floors X3 decreases, the depth of the building X4 should increase to ensure its balance (negative correlation between these two features, as shown in Figure 6). That is why the value of X4 increases compared to its actual value for the three types of costs. In brief, using the ML-based optimization, the minimum costs can be reduced by 7% at best compared to the actual values. This finding helps the company identify the optimal features for more profitable prices.

For case 2, the range of the features was reduced by removing the 10% of maximum and minimum values from the dataset. In this case, the actual normalized minimum costs become 0.11, 0.04, and 0.09 for the three types of costs, respectively. The optimized values are 0.04, 0.03, and 0.03 for the three costs Z1, Z2, and (Z1+Z2), respectively. As listed in Table 7, the optimization process can reduce up to 6%, 1%, and 4% of the minimum costs for Z1, Z2, and (Z1+Z2) compared to the actual values. Similar to case 1, a lower value of X3 and a higher value of X4 can lower the costs. Different from case 1, in case 2, X8 and X11 are decreased compared to their real data, while the others are increased. This may be due to the large range of devices to be maintained in the building, knowing that X8 and X11 correspond to the number of facilities. As a consequence, the building cost will be decreased.

Another range of the boundary values was considered in case 3, in which the building features were limited to 80% of the maximum and 120% of the minimum. Similar to the other cases,
the optimization result shows that a decrease in X3, X8, and X11 can reduce the three types of costs, as listed in Table 7. Besides, the optimization algorithm reduces 1%, 5%, and 2% of the actual minimum costs for Z1, Z2, and (Z1+Z2). It is noted that the (Z1+Z2) value is the total budget that the client needs to pay for owning the building. The result obtained by the optimization process will help the client select the most influential variables to reduce the building cost according to his/her budget. Similarly, the company can quickly identify the critical features to change according to the client's feature constraints. All these functions will help to improve operational efficiency and customer satisfaction, and to save costs.

In summary, about 1 to 7% of the three types of costs can be reduced in comparison with the actual values using the Genetic Algorithm with the fitness function calculated from the Artificial Neural Network. Furthermore, it shows that the more the features are constrained, the smaller the minimum cost reduction that can be achieved. This result may be explained by the fact that the minimum value is searched for within a smaller space in the case of constraints than in the space without constraints. In case 1, the highest reduction in the minimum cost is observed for Z1. However, in case 2 and case 3, Z2 is the most reduced by the optimization process. Practical constraints by an expert will be provided in the implementation phase of this project.

Practical implications and contributions to construction management

As discussed in the previous sections, this paper aims to demonstrate how ML could be used in a construction company to exploit real-world operational data. Some practical implications can be drawn from this study as follows:

- In this paper, four tasks are identified by asking the right and interesting questions stemming from real-world activities. This process is of great importance to assure that the dataset and suitable algorithms can be selected to deliver the expected value of the data projects. Regularly identifying the pain points and opportunities for improvement can help the company figure out the tasks with a potentially good return on investment.
- It is shown in this paper that once a sizable dataset of high quality from end-to-end activities is available, ML is a powerful tool to streamline operations, save design and engineering costs, and improve customer experience. Also, the self-improvement of the ML-based models will help to integrate skills and knowledge of different services within the company to better predict the building costs when the models are implemented and more data are collected. Data collection plays a vital role in any data-driven project. The company should adopt a centralized and integrated approach to assure that both the quality and quantity of data across different services are collected, shared, and exploited.
- The paper provides a systematic application of 13 ML models and employs different optimization techniques in order to perform the four tasks stemming from real-world activities. Not only should the hyperparameters of the ML models be carefully computed, but the accurate interpretation of the prediction and performance of the results obtained from the ML models is also essential.
- The fourth task, involving the optimization of the costs (building and maintenance) Z under constrained building features X, can be extended to constraints on the required resources Y, as both variables Y and Z are functions of X. This task represents an important mission to plan potential scenarios for risk management, especially when facing a world of uncertainty, such as the labour shortage due to the pandemic or the lack of certain materials resulting from supply chain disruption.

It should be noted that the above implications may only be valid for a medium-sized partner company that builds thousands of relatively similar small projects, such as few-storey buildings, where the owner often directly contacts the company for a proposal without a need to go through a formal procurement process, and where the building is co-designed by the owner and the building company.

Conclusions

This study provides a fast and accurate Machine learning (ML) and optimization framework, which allows a quick estimate for building costs, hence improving the operational efficiency and competitiveness of a construction company.

A dataset composed of 10,000 parametric building configurations, collected from end-to-end real-world activities in the company, was used to train and validate the ML models to perform multiple tasks. First, the costs and required resources were predicted for a given set of building features. Then, the possible building features were identified for a given budget. Finally, the optimization of the building costs with feature constraints was conducted. The numerical data were carefully processed, and the Principal Component Analysis method was applied to remove the correlation of the data features. Multiple algorithms were tested to explore the suitability of ML to estimate the building costs. Among the 13 ML regression algorithms used, the Artificial Neural Network, Gradient Boosting, and XGBoost models appear to be the most suitable to estimate the building costs and the required resources with an accuracy of 99% within less than a second of training time. The number of floors, the floor depth, the floor height, and the kitchen's width are found to be the most influential in estimating the building costs. Artificial Neural Network models were also developed to identify options that a client can afford under a given budget. The optimization problem under constraints was successfully solved, helping clients determine the optimal building costs according to their preferences. The optimized building costs obtained by the optimization framework using ML models are 7% smaller than those of the actual data, hence improving the company's competitiveness.

This study showcases that ML models can be efficiently used in the construction sector to optimize the workflow for cost savings and provides some practical implications for data-driven construction management.

For future study, the optimization with a constrained set of required resources will be performed in order to exploit the datasets in depth. Also, further investigation on the sensitivity of the ML models will be performed.

Acknowledgement

The support of Thu Dau Mot University for this project is greatly appreciated.
Disclosure statement

No potential conflict of interest was reported by the author(s).

References

Afonso B, Melo L, Oliveira W, Sousa S, Berton L. 2019. Housing prices prediction with a deep learning and random forest ensemble. In Anais do XVI Encontro Nacional de Inteligência Artificial e Computacional. SBC; p. 389–400.
Aldous D. 1993. The continuum random tree II. Ann Probab. 167:248–289.
Bhagat N, Mohokar A, Mane S. 2016. House price forecasting using data mining. IJCA. 152(2):23–26.
Ceh M, Kilibarda M, Lisec A, Bajat B. 2018. Estimating the performance of random forest versus multiple regression for predicting prices of the apartments. IJGI. 7(5):168.
Chen JH, Ong CF, Zheng L, Hsu SC. 2017. Forecasting spatial dynamics of the housing market using support vector machine. Int J Strategic Prop Manag. 21(3):273–283.
Chen T, He T, Benesty M, Khotilovich V, Tang Y, Cho H. 2015. Xgboost: extreme gradient boosting. R package version 0.4-2, 1(4).
Friedman JH. 2002. Stochastic gradient boosting. Comput Stat Data Anal. 38(4):367–378.
Gao G, Bao Z, Cao J, Qin AK, Sellis T, Wu Z. 2019. Location-centered house price prediction: a multi-task learning approach. arXiv preprint arXiv:1901.01774.
García de Soto B, Agustí-Juan I, Joss S, Hunhevicz J. 2019. Implications of construction 4.0 to the workforce and organizational structures. Int J Constr Manag. 1–13. DOI: 10.1080/15623599.2019.1616414.
Ghosalkar NN, Dhage SN. 2018. Real estate value prediction using linear regression. In 2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA). IEEE; p. 1–5.
Goldberg DE. 2006. Genetic algorithms. USA: Springer.
Gulli A, Pal S. 2017. Deep learning with Keras. Birmingham, England: Packt Publishing Ltd.
Hoerl AE, Kennard RW. 1970. Ridge regression: biased estimation for nonorthogonal problems. Technometrics. 12(1):55–67.
Hong J, Choi H, Kim WS. 2020. A house price valuation based on the random forest approach: the mass appraisal of residential property in South Korea. Int J Strategic Prop Manag. 24(3):140–152.
Huang CH, Hsieh SH. 2020. Predicting BIM labor cost with random forest and simple linear regression. Autom Constr. 118:103280.
Hunter JD. 2007. Matplotlib: a 2D graphics environment. IEEE Ann Hist Comput. 9(03):90–95.
Jain AK, Mao J, Mohiuddin KM. 1996. Artificial neural networks: a tutorial. Computer. 29(3):31–44.
Jones E, Oliphant T, Peterson P. 2001. SciPy: open source scientific tools for Python.
Jui JJ, Molla MI, Bari BS, Rashid M, Hasan MJ. 2020. Flat price prediction using linear and random forest regression based on machine learning techniques. In Embracing Industry 4.0. Singapore: Springer; p. 205–217.
Keller JM, Gray MR, Givens JA. 1985. A fuzzy k-nearest neighbor algorithm. IEEE Trans Syst, Man Cybern. SMC-15(4):580–585.
Kim GH, An SH, Kang KI. 2004. Comparison of construction cost estimating models based on regression analysis, neural networks, and case-based reasoning. Build Environ. 39(10):1235–1242.
Kim GH, Shin JM, Kim S, Shin Y. 2013. Comparison of school building construction costs estimation methods using regression analysis, neural network, and support vector machine. JBCPR. 01(01):1–7.
Kim GH, Yoon JE, An SH, Cho HH, Kang KI. 2004. Neural network model incorporating a genetic algorithm in estimating construction costs. Build Environ. 39(11):1333–1340.
Kingma DP, Ba J. 2014. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Kravanja S, Zula T. 2010. Cost optimization of industrial steel building structures. Adv Eng Softw. 41(3):442–450.
Le T, Hassan F, Le C, Jeong HD. 2019. Understanding dynamic data interaction between civil integrated management technologies: a review of use cases and enabling techniques. Int J Constr Manag. 1–22. DOI: 10.1080/15623599.2019.1678863.
Lesic V, Martincevic A, Vasak M. 2017. Modular energy cost optimization for buildings with integrated microgrid. Appl Energy. 197:14–28.
Li Y, Yuan Y. 2017. Convergence analysis of two-layer neural networks with relu activation. In Advances in neural information processing systems; p. 597–607.
Limsombunchai V. 2004. House price prediction: hedonic price model vs. artificial neural network. In New Zealand agricultural and resource economics society conference; p. 25–26.
Lowe DJ, Emsley MW, Harding A. 2006. Predicting construction cost using multiple regression techniques. J Constr Eng Manage. 132(7):750–758.
Madhuri CR, Anuradha G, Pujitha MV. 2019, March. House price prediction using regression techniques: A comparative study. In 2019 IEEE International Conference on Smart Structures and Systems (ICSSS); p. 1–5.
Maier O, Wilms M, von der Gablentz J, Krämer UM, Münte TF, Handels H. 2015. Extra tree forests for sub-acute ischemic stroke lesion segmentation in MR sequences. J Neurosci Methods. 240:89–100.
Matel E, Vahdatikhaki F, Hosseinyalamdary S, Evers T, Voordijk H. 2019. An artificial neural network approach for cost estimation of engineering services. Int J Constr Manag. 1–14. DOI: 10.1080/15623599.2019.1692400.
McKinney W. 2011. pandas: a foundational Python library for data analysis and statistics. Python High Performance Scientific Comput. 14(9):1–9.
Park B, Bae JK. 2015. Using machine learning algorithms for housing price prediction: The case of Fairfax County, Virginia housing data. Expert Syst Appl. 42(6):2928–2934.
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, … Duchesnay E. 2011. Scikit-learn: Machine learning in Python. J Machine Learn Res. 12:2825–2830.
Peng Z, Huang Q, Han Y. 2019. Model research on forecast of second-hand house price in Chengdu based on xgboost algorithm. In 2019 IEEE 11th International Conference on Advanced Infocomm Technology (ICAIT). IEEE; p. 168–172.
Phan TD. 2018. Housing price prediction using machine learning algorithms: The case of Melbourne city, Australia. In 2018 International Conference on Machine Learning and Data Engineering (iCMLDE). IEEE; p. 35–42.
Rardin RL. 1998. Optimization in operations research (Vol. 166). Upper Saddle River, NJ: Prentice Hall.
Risbeck MJ, Maravelias CT, Rawlings JB, Turney RD. 2015. Cost optimization of combined building heating/cooling equipment via mixed-integer linear programming. In 2015 American Control Conference (ACC). IEEE; p. 1689–1694.
Roe BP, Yang HJ, Zhu J, Liu Y, Stancu I, McGregor G. 2005. Boosted decision trees as an alternative to artificial neural networks for particle identification. Nucl Instrum Methods Phys Res, Sect A. 543(2-3):577–584.
Safavian SR, Landgrebe D. 1991. A survey of decision tree classifier methodology. IEEE Trans Syst, Man, Cybern. 21(3):660–674.
Sanni-Anibire MO, Zin RM, Olatunji SO. 2020. Machine learning model for delay risk assessment in tall building projects. Int J Constr Manag. 1–10. DOI: 10.1080/15623599.2020.1768326.
Seber GA, Lee AJ. 2012. Linear regression analysis. Vol. 329. New Jersey, USA: John Wiley & Sons.
Sha'ar KZ, Assaf SA, Bambang T, Babsail M, Fattah AAE. 2017. Design–construction interface problems in large building construction projects. Int J Constr Manag. 17(3):238–250.
Suykens JA, Vandewalle J. 1999. Least squares support vector machine classifiers. Neural Process Lett. 9(3):293–300.
Tepeli E, Taillandier F, Breysse D. 2019. Multidimensional modelling of complex and strategic construction projects for a more effective risk management. Int J Constr Manag. 1–22. DOI: 10.1080/15623599.2019.1606493.
Tibshirani R. 1996. Regression shrinkage and selection via the lasso. J Stat Soc Ser B Methodol. 58(1):267–288.
Truong Q, Nguyen M, Dang H, Mei B. 2020. Housing Price Prediction via Improved Machine Learning Techniques. Procedia Comput Sci. 174:433–442.
Van Der Walt S, Colbert SC, Varoquaux G. 2011. The NumPy array: a structure for efficient numerical computation. Comput Sci Eng. 13(2):22–30.
Wold S, Esbensen K, Geladi P. 1987. Principal component analysis. Chemometrics Intell Lab Syst. 2(1-3):37–52.
Zou H, Hastie T. 2005. Regression shrinkage and selection via the elastic net, with applications to microarrays. J Royal Statistical Soc B. 67(2):301–320.

Appendix

Due to competitive advantage constraints and confidentiality concerns, the attached dataset was standardized, and no explicit labels of the variables are listed. Interested readers are invited to contact the authors for access to the actual dataset.