Guidelines For Applied Machine Learning in Construction Industry-A Case of Profit Estimation PDF

Advanced Engineering Informatics 43 (2020) 101013
Contents lists available at ScienceDirect
Advanced Engineering Informatics

journal homepage: www.elsevier.com/locate/aei
Full length article
Guidelines for applied machine learning in construction industry—A case of T

profit margins estimation
Muhammad Bilal , Lukumon O. Oyedele
⁎
Big Data Analytics and Artificial Intelligence Lab (BDAL), Bristol Business School University of West of the England, Bristol, United Kingdom
ARTICLE INFO ABSTRACT
Keywords: The progress in the field of Machine Learning (ML) has enabled the automation of tasks that were considered
Applied machine learning impossible to program until recently. These advancements today have incited firms to seek intelligent solutions
Profit margin forecasting as part of their enterprise software stack. Even governments across the globe are motivating firms through
Construction simulation tool policies to tape into ML arena as it promises opportunities for growth, productivity and efficiency. In reflex,
Interpretable machine learning
many firms embark on ML without knowing what it entails. The outcomes so far are not as expected because the
Predictive modelling
ML, as hyped by tech firms, is not the silver bullet. However, whatever ML offers, firms urge to capitalise it for
their competitive advantage. Applying ML to real-life construction industry problems goes beyond just proto-
typing predictive models. It entails intensive activities which, in addition to training robust ML models, provides
a comprehensive framework for answering questions asked by construction folks when intelligent solutions are
getting deployed at their premises to substitute or facilitate their decision-making tasks. Existing ML guidelines
used in the IT industry are vastly restricted to training ML models. This paper presents guidelines for Applied
Machine Learning (AML) in the construction industry from training to operationalising models, which are drawn
from our experience of working with construction folks to deliver Construction Simulation Tool (CST). The
unique aspect of these guidelines lies not only in providing a novel framework for training models but also
answering critical questions related to model confidence, trust, interpretability, bias, feature importance and
model extrapolation capabilities. Generally, ML models are presumed black boxes; hence argued that nobody
knows what a model learns and how it generates predictions. Even very few ML folks barely know approaches to
answer questions asked by the end users. Without explaining the competence of ML, the broader adoption of
intelligent solutions in the construction industry cannot be attained. This paper proposed a detailed process for
AML to develop intelligent solutions in the construction industry. Most discussions in the study are elaborated in
the context of profit margin estimation for new projects.
1. Introduction roadmap; hence leads to unfavourable outcomes as we witnessed in

predictive policing [34]. Seemingly perfect ML models fail during the
Machine learning (ML) has been a renaissance in the IT industry. production that leads to investment loss, increased scepticism and lack
Today sectors like manufacturing are utilising ML algorithms from of future support. Unless a systematic methodology is exercised for
design to delivery of great consumer products [24]. This success has training and operationalising models, ML projects will tend to fail fre-
lured other sectors of the economy like construction industry into the quently. In true essence, ML has enormous potential to transform the
adoption of intelligence algorithms in their enterprise software [15]. construction industry by automating many perennial problems related
The latest progress in ML algorithms has enabled the automation of to project planning and control [5].
those non-trivial tasks which, a decade ago, were deemed impossible Since majority of ML solutions tend to either automate or facilitate
[23,36,4]. However, ML is not a silver bullet. These algorithms are far the decision-making process, it is imperative for application users to
fetched to achieve human-level performance on many tasks [25]. trust the capabilities of ML models [11]. It requires diverse knowledge
Merely, embarking on ML projects without the knowledge about the of the domain, the data and ML algorithms to train models that are
strengths and limitations of ML algorithms often lead to failures [30]. reliable, transparent and trusted. So the scope of Applied Machine
Even ML on simple tasks becomes difficult without a structured Learning (AML) goes beyond merely prototyping ML models. Currently,
⁎
Corresponding author.
E-mail address: [email protected] (M. Bilal).
https://doi.org/10.1016/j.aei.2019.101013
Received 1 March 2019; Received in revised form 31 July 2019; Accepted 8 November 2019
Available online 02 December 2019
1474-0346/ © 2019 Elsevier Ltd. All rights reserved.
M. Bilal and L.O. Oyedele Advanced Engineering Informatics 43 (2020) 101013
ML engineers undertake most modelling decisions based on their do- courtesy bids or if a project involves more risks. The relationship be-
main knowledge expertise and intuitions, which are occasionally tween profitability performance and project-specific factors is not
proven to be wrong and biased. AML emphasises to engaging the in- clearly understood. Estimators adjust margins based on their intuitions
dustry experts at early stages of the process where the domain is pro- or use a uniform rate to guide their decisions, which are sometimes
gressively learned from the data through ML models and corroborated wrong or misleading. Therefore, a project started with a planned
by the industry experts [32,29]. This early engagement of users into the margin, eventually finishing with entirely another margin. The adverse
AML process will provide the domain experts enough time to under- implications of poorly designed projects are enormous as one such
stand the strengths and weaknesses of algorithms, whereas will also project can ruin the cash flow of the firm, and can lead to its bankruptcy
enable the ML engineers to learn the domain as the process progresses. [18]. Devising a systematic approach to explore the variability in profit
AML is, therefore, a more involved task where ML models are devel- margins by project attributes is indispensable. This knowledge is crucial
oped iteratively by gradually incorporating domain knowledge from to enable estimators to understand the profitability performance across
industry experts. Such ML models not only have great predictive ac- an array of projects and estimate margins correctly for new projects.
curacy and generalizability but also key to understand non-trivial re- The estimators are hard to convince to use ML for profitability
lationships between large volumes of data [6]. forecasting outrightly as they have their reasons. While they are keen to
The profit margin estimation for construction projects is an im- see ML is useful for their tasks, they usually seek clarifications about the
portant decision that business development teams in construction firms strengths and weaknesses of ML algorithms before they trust or let these
undertake during early design stages. However, accurately forecasting algorithms to influence their decisions. ML interpretability for adoption
profit margins is difficult since many factors influence this choice [19]. of intelligent systems in construction industry is crucial. In profitability
At the moment, project teams decide profit margins based on intuitions forecasting, estimators hold many assumptions about how project
or uniform rates, which are unreliable methods [2]. Margins often specific attributes drive profit margins. Some estimators believe that
erode as projects progress. The projects started with planned margins projects in specific regions or from particular clients shall always be
end up with entirely different margins at completion [3]. The profit- estimated at higher rates [33]. We need to confirm these assumption
ability performance of firms can be significantly improved if margins using ML. The estimators mostly claim their assumptions stem from
are rightly set based on project-specific details; thereby reducing the their experience. However, not all assumptions are right or wrong un-
overall bankruptcy ratio of construction firms [33,1]. Most firms posses less proven otherwise. Besides, most estimators follow heuristics such
data about historical projects under the compliance for the Building as uniform rates for overhead costs and profit margins [1]. One lim-
Information Modelling (BIM). This data reside in relational, CSV or itation of these approaches is their inability to reflect variations that
XLS data sources. Most importantly, this data contains every detail of can be induced by project-related risks or contingencies. Profit margins
when and how margins change in projects. ML algorithms can be uti- shall be derived to include the negative as well as the positive impact of
lised to understand the relationship between project attributes and project specific attributes [18]. Such provisions call for advanced di-
profit margins. Such models have the potential to improve the accuracy gital technologies to support estimators to testify such assumptions
of profit margin estimation significantly. quickly. ML algorithms have a huge role to play to enable compre-
This paper aims to propose the AML process for guiding the devel- hension of large amounts of data as well as to train highly reliable
opment of useful ML models for the construction industry. The process predictive models. If ML is not devised with interpretability in mind, it
is gleaned from leading the development of the Construction Simulation will be hard to operationalise such algorithms regardless of whatever
Tool (CST), which is project analytics software, utilising data to its ML algorithm we employ or whatever accuracy we claim. The industry
fullest for facilitating the decision making across numerous activities has a record for the moderate intake of emerging technologies.
involved in the construction project lifecycle. The process provides a Several researchers have developed ML models to forecast the
roadmap for executing ML projects in the industry. Most discussions are profitability performance of construction projects. A major drawback of
focused for construction IT folks. In this article, the AML process is these models is their only inclusion for high-level project attributes
broken down into tasks, and the description of each task along with which significantly affect their predictive accuracy for new projects.
technical details, are provided. The aim is to ensure robust sanity These algorithms fall into two categories. The first category uses ML
checks are exercised before rolling out ML models into the production. models for classifying projects into Bid/No-Bid classes
These guidelines are general in the sense that these can be used to [12,22,13,35,38,37,8]. The second class harnesses ML to regress profit
develop ML models for the majority of construction industry tasks. margin for new projects [7,31,21,20,27,10,28]. Table 1 shows popular
However, discussions are contextualised with a real-life example of ML works used in the industry in the domain of profitability perfor-
profit margin estimation to solidify the concepts and showcase how mance. It is obvious that these approaches use rudimentary ML ap-
various design choices are undertaken and how domain knowledge is proaches and used fewer data for training. This paper demonstrates the
progressively learned as the process progresses. development of an ML model for profit margin estimation. While we
This paper is organised as follows: Section 2 describes literature presented a significant performance increase through better training
review on the need for profit margin estimation in construction pro- algorithms and more data, the main focus of this work is set to highlight
jects. Section 3 and Section 4 present the process and guidelines for the key steps for developing robust ML models. These steps are presented as
AML with practical demonstrations from profit margin estimation. a unified Applied Machine Learning (AML) process. A main caveat is
Lastly, Section 5 provides concluding remarks, discusses lessons learnt that we do not expect ML folks to have prior domain knowledge as it
and future directions of the work. can bias their judgment, and lead to wrong modelling decisions. They
acquire knowledge progressively as the process advances.
2. Literature review
3. Applied machine learning process
The construction industry is a highly competitive landscape and
continually fluctuating industry. Firms strive to deliver projects with The guidelines for applied machine learning (AML) are described in
great profitability performance. This zeal puts these firms in constant this section. These guidelines are arranged into a structured process,
pursuit of maximising profits on awarded contracts, whereas lowering consisting of several tasks. Fig. 1 shows an overview of the proposed
margins on new bids to get more work. Getting the profit margins right AML process. Various tasks in the process are gleaned from our ex-
is paramount to making many decisions at various project lifecycle perience of developing numerous ML models in the CST. We chose
stages. Several external, organisational and project related factors affect those tasks which are often found useful in producing highly robust ML
profit margins [33]. The contractors generally set higher margins on models. In most cases, the final model obtained by following these steps
2
was a better fit for production deployment in the CST.

Literature Source(s)
The AML process consists of six well-defined tasks (see Fig. 1). After
defining the ML problem and establishing the performance metrics,
access to the data is obtained, which are then integrated and curated for
[20,21]
robust ML training. Next, several standard data pre-processing activities
[31]
[27]
[28]
[7]
are recommended to get the data into the right format. After that, the
baseline model is trained using various design strategies, which, in our
Model is hard to generalise as high-level project attributes of
example of profit margin estimation using the random forest, are sub-
Model contains one layer which clearly can’t capture non-
Fewer data used in training and evaluation method is not

sampling and hyperparameter tuning. Once we attain a reasonable level
linearity between project attributes and profit margins.
It is hard to use this sytem as firms can rarely access

of accuracy for the baseline model, we then move onto the interpretable
Model is specific to geographic location and hard to
machine learning step. This step is crucial as it enables ML engineers

not only to understand the domain progressively but also better explore
the competencies of the baseline model. Several revisions are made to
the data and predictive models during this step. Critical insights related
fewer projects are used for training.
to Trust, Bias, Transparency, Confidence and Interpretability of ML are

investigated. The early engagement of domain experts enables them to
trust these models by knowing their capabilities and limitations. Fi-
nally, other ML algorithms are also used to train models for the given
problem to see if a better alternative can be found. Once the best
competitors’ data.
learning algorithm is decided, we train the final model using all the
Key Limitations
data and using the optimal hyperparameters. The resulting model is

generalise.
then deployed in the enterprise solution for real-life scoring. These AML
robust.
tasks are further explained in the subsequent sections.
3.1. ML task definition & data selection

Thirty (35) projects
Twenty-five (25)
Twenty-six (26)
Total Projects
Not available
According to Mitchell [26], ML programs learn from experience (E)

Twenty (20)
with respect to task (T) and performance measure (P) if its performance
projects
projects
projects
(P) improves at task in (T) with experience (E). In the above formalism,
terms (T), (P) and (E) are the first things to define in the AML process.
generate rules for the neural network. RMSE is used rightarrow evaluate model performance.
regression neural networks (GRNN), radial basis function neural network (RBFNN) and back
Bid Expert integrates FaRM with cost estimation model for predicting margins. Tool used past
This system uses a hybrid approach where fuzzy clustering and genetic algorithm are used for
classifying projects, and support vector regressor (SVR) is employed for training the ML model
ANN is used to develop profit forecasting model where de-compositional KT-1 algorithm to
Support vector machine (SVM)-based algorithm is used to develop a model for forecasting
3.1.1. Defining task & evaluation metrics

for profit margin forecasting. System outperformed other ML algorithms like generalised
DBID uses ANN to forecast the profit margins by harnessing ANN with GA. Monte Carlo
profit margins. R DTREG package is used for training the model. R2 and Mean Absolute
simulation is used for sensitivity analysis to assess the winning probability concerning
The first step and arguably the most important aspect of ML is to

clearly define the ML task and its intended use. The results of ML model
would be worthless if the task is wrongly defined. It involves describing
the problem and listing all assumptions. More importantly, we need to
explain intended use of ML, including its implications for productivity
and efficiency for target users. In the case of CST, we set the goal of ML
task to predict the profit margin for new opportunity based on project
attributes (see Table 2). As discussed briefly, profit margin estimation is
tricky task as many factors influence its choice. Most estimators de-
Percentage Error (MAPE) are used for model evaluation.
termine margins based on their intuition or uniform rate which are not
reliable approaches. The intended use is to employ this model to predict
profit margin during early estimation stage. We assume that detailed
design has been completed and we have detailed project specific in-
formation for all attributes to make such prediction.
propagation neural network (BPNN).
Approach or Methodology Employed
data of firm and of its competitors.
The next step in the process is to specify performance metric for

List of ML models used in profitability performance estimation.
evaluating ML models. A performance metric reveals how effectively a

variability in profit margins.
model captures the relationship between input and the target features
in the given data. It provides a uniform basis for comparing different
ML algorithms. The type of ML task influences the type of evaluation
metric employed. Supervised ML tasks fall into two categories [14]. A
regression task involves the prediction of a continuous value, whereas a
classification task aims to predict a label. We first determine the ML
task type and then specify the performance metric. In the CST case, our
task is multivariate regression problem where ML will predict profit
margins, i.e. a continuous value. The most common metric for regres-
Project Profit Forecasting
SVM Forecasting System
sion tasks is Root Mean Squared Error (RSME), which computes the
difference between the predicted value and actual value. In the case of
CST, we used Root Mean Squared Log Error ((RMSLE), which is the log
difference between actual and predicted profit margins. The reason
Tool Name
Bid Expert
behind this choice is the fact that we don’t want the performance metric
System
InMES
DBID
to penalize huge differences in the predicted and true values when these
values are huge numbers. The profit of 10, 000 from 100, 000 worth
Table 1
project is same as the profit of 100, 000 from 1, 000, 000 worth project.
Sr.#
In such scenarios like profit estimation, only the percentage differences
3
Fig. 1. An overview of applied machine learning process.
matter, so instead of RMSE we selected RMSLE to observe percentile 437, 000 construction projects, completed in the last 20 years. We used
differences. Eq. (1) shows the formula for RMSLE. all data for training except for the last six months that are used to create
n
validation set.
1 Once data is available, the next step is to load it into the analytical
RMSLE = (log (y + 1) log (y + 1))2
n i=1 (1) toolbox of your choice. During the process, data is split into various
datasets like one for training models and one or more for validation to
We also used R to keep track of the accuracy of the ML model. R2
2
see how model performs on unseen data. These datasets shall be pre-
explains the ratio of the accuracy of the predictive model (SSR) to that pared in the same way. One caveat is to follow resource-naming
of baseline model SST. Eq. (2) shows the formula for R2. strategy and stick to it throughout the AML process. We will see that
SSR several copies of data are created during the process, and if we do not
R2 = 1 adhere to naming conventions, it leads to strange modelling errors. In
SST (2)
this study, we used Anaconda as analytical toolbox, where Jupyter is
One useful trick we found handy is to define a function print_- used for writing code scripts for data processing and model develop-
score() that takes the model as input and prints a tuple containing ment. We used Python programming language and employed read_csv
Training Error, Validation Error, Training Accuracy, and () method from pandas library to load our data from Hadoop into
Validation Accuracy. This function is generally invoked each time X_raw data frame. We split X_raw into two data frames X and y. The
we revise our model to see its performance in response to a change in data frame X holds the input features, whereas y holds the target at-
the data or hyperparameters. The usability of print_score() greatly tribute. These data frames are further split into training and validation
depends on naming conventions used for the datasets. The function will sets. We used terms X_train, X_valid, y_train and y_valid for
not be useful if dataset names change frequently. these data frames. Test data is generally advised to set aside at the
beginning in X_test data frame until the final model gets ready, and
3.1.2. Data selection, integration, naming & loading use it for final evaluation. The files on Hadoop are named as
The last part of our formalism is the experience E that means data Train.csv, Test.csv, and Valid.csv. After preprocessing is per-
for training and evaluating ML models. The availability of enough formed on data, it is good practice to store data into python feather
quality data is crucial for training reliable ML models [17]. With recent format for later use.
advances in Big Data, firms have started to store all sorts of data. In the
construction sector, firms store data to comply with Building Informa- 3.2. Data preparation and pre-processing
tion Modelling (BIM), but these data reside in different data sources.
One of the challenging tasks in AML is to identify reliable sources of Data plays the most crucial role in ML projects and would rarely be
data. Once sources are identified, we integrated them into a Common in the format required by ML algorithms. Oftentimes, data shall be
Data Environment (CDE) where technologies like Hadoop are vastly cleansed, filtered, normalised, sampled, and transformed in various
used. In the case of CST, several sources were identified including ways before algorithms can leverage it for learning. ML engineers spend
Oracle Financials, Business Objects, Google Earth, Key- most of their time and efforts in data preparation [16]. It is always
hole Markup Langauge (KML), Oracle CRM, and many non-stan- recommended to glance through the data and check for both its format
dardised Excel and CSV files. We used SQL for data curation and in- (i.e. structure and data types) and contents [39,14]. Data may not be
tegration. The data is first stored into staging tables in Hadoop and then what you expect even if you have read its descriptions. Many trans-
transported to the relational database. Data elements for the same formations are normally recommended before ML models are trained.
project across different data sources were joined using unique project Following sections explain essential tasks that are performed on most
identifier which stayed immutable in all systems. The data contained datasets.
4
Table 2
List of project attributes for profit margin estimation.
Sr.# Attribute Name Attribute Description
1 Opportunity ID Unique code attributed to the opportunity (8 characters long)

2 Opportunity title Unique code attributed to the opportunity (8 characters long)
3 Client The customer whom the opportunity is from
4 Market Workstream category to which the opportunity belongs to
5 Region Part of the country opportunity belongs to as defined by BB
6 kV Voltage (s) along the route of opportunity
7 km Total length of the route of the opportunity
8 Project type The type of opportunity based on BB defined categories
9 Start date Proposed construction start date of the opportunity
10 End date Proposed construction finish date of the opportunity
11 Duration (days) Proposed total duration of the opportunity
12 Outage (s) Total duration of all proposed outage (s) on the opportunity
13 Key milestones Key milestones on the project
14 Duration (days) Duration of key milestones on the project
15 Termination Points of termination at each end of the route
16 Contract type Client’s chosen contract
17 ITT date Date/Proposed date which we were invited to tender for the opportunity
18 Tender submitted Date/Proposed date that the tender is submitted to the cl
19 Contract awarded Date/proposed date that the client offers the opportunity
20 Mobilisation date Date/proposed date for initial mobilisation on to project
21 BB role The role of BB in this partnership
22 Scope & complexity Scope of the project
23 Scope Scope of the project
24 % of work % of BB work within scope
25 Ground type Ground conditions for the route area
26 Route length Length per ground type
27 HDDs in river HDDs in river
28 Total length in river (m) Total length in river (m)
29 HDDs in rail HDDs in rail
30 Total length in rail (m) Total length in rail (m)
31 HDDs in road HDDs in road
32 Total length in road (m) Total length in road (m)
33 HDDs through utilities HDDs through utilities
34 Total length in utilities (m) Total length in utilities (m)
35 HDDs in other HDDs in other
36 Total length in other (m) Total length in other (m)
37 Joint bays No of points where two cable ends are jointed
38 Material cost Material attributes that influence project profitability
39 No of circuits Number of circuits
40 No of phases Number of phases
41 Conductor The type and size of conductor within the power cable
42 Cable length (m) Total length of the power cable throughout the route
43 Duct size Power cable duct size
44 Duct length Power cable duct length
45 Fibre cable length Fibre cable length
46 Duct size Fibre cable duct size
47 Duct length Fibre cable duct length
48 Pilot cable Pilot cable length
49 Duct size Pilot cable duct size
50 Duct length Pilot cable duct length
51 Straight joints Total amount of joint accessory between the same cables
52 Transition joints Total amount of joint accessory between two different ca
53 Tiles No of tiles for the entire route of the opportunity
54 JB tiles No of JB tiles for the entire route of the opportunity
55 Tapes No of tapes for the entire route of the opportunity
56 Cable markers No of cable markers for the entire route of the opportunity
57 Link boxes No of link boxes on the entire route of the opportunity
58 Supplier The attributes below are associated with materials supply managem
59 Power cables Desired supplier (s) for power cable (please choose client if
60 Joints Desired supplier (s) for cable joints (please choose client if free
61 Terminations Desired supplier (s) for the terminations used at each end of
62 Backfill materials Desired supplier (s) for the backfill materials (please
63 Subcontractors The attributes below are associated with subcontractor work
64 Scaffold Desired subcontractor (s) for scaffold works
65 Labour Desired subcontractor (s) for agency labour
66 HDD Desired subcontractor (s) for Horizontal directional drilling
67 Testing Desired subcontractor (s) to carry out route testing
68 Access Desired subcontractor (s) for access works
69 Resource The attributes below are associated with the key resources on a project
70 Bid manager Proposed bid manager for the opportunity
71 CM Proposed commercial manager for the opportunity
72 Estimator Proposed estimator for the tender stage
73 PM Proposed project manager for the delivery stage
74 QS Proposed quantity surveyor for the delivery of the opportunity
(continued on next page)
5
Table 2 (continued)
Sr.# Attribute Name Attribute Description
75 Planner Proposed planner for the delivery of the opportunity

76 PDM Proposed design manager for the opportunity
77 PE Proposed project engineer for the delivery of the opportunity
78 SA Proposed safety advisor for the delivery of the opportunity
79 Others Other attributes of a project
80 Variations/Compensation events Any disruption to the agreed workflow which is not the contractor’s fault and has cost implication
81 Risks Potential mishaps during the delivery of the project
82 Opportunity/innovation An instance where an estimated cost is partially reduced or totally avoided
3.2.1. Data consistency verification and elapsed_time. Similar transformations are also desirable for
Several issues may arise when data is moved between the systems. It geospatial attributes to derive attributes like distances, etc. Once at-
is essential to ensure that you are working on a consistent copy of the tributes are derived and appended, original attributes are dropped from
data. Therefore, after data is loaded into data frames, we shall check its the data frames.
consistency. It is always suggested to explore data and at least reconcile In our CST example, we created derived attributes from several
the total number of columns and rows. This verification shall not be date attributes such as project_start_date and pro-
confused with Exploratory Data Analysis (EDA), which is suggested in ject_end_date. For these two attributes, 26 derived attributes are
most of the ML textbooks [9]. Experts in ML recently noticed that we appended to X_raw data frame. We omitted time attributes like hour,
shall avoid EDA before predictive modelling as it can bias our judge- minute and second as our problem does not require granularity up to
ment. ML engineers make decisions like adding or discarding attributes that level. In addition, there were no geospatial attributes in our
based on EDA are sometimes found wrong & misleading. They shall dataset, so no geospatial transformations are performed. We created
progressively acquire domain knowledge during the AML process and several python functions such as expand_date_attr() and ex-
make critical modelling decisions informed by their implication on pand_gis_attr() to automatically extract derived columns for input
model performance, not judgments that are build through the EDA and date and geospatial attributes.
intuition. In the case of CST, we loaded data into X_raw data frame and
checked it’s structure, columns, data types and number of rows using 3.2.4. Data types conversion
python commands display(X_raw), X_raw.columns(), X_raw.- The next step in the AML process is data types transformation. Data
dtypes() and len(X_raw). We compared the output of these com- types describe the domain of attributes. In ML, there are generally two
mands with statistics from actual data stored on Hadoop. There were no main types of data. 1) Categorical data that stores values corre-
inconsistencies found in our data frame. In addition, we also checked sponding to discrete categories for a columns. For example,
the top and bottomed 10 rows to ensure character encodings of the Workstream attributes is categorical and holds values like Cabling,
columns have no issues. Substation, or Overhead lines. 2) Numerical attributes store
numerical values, which can be integers or real numbers like costs,
3.2.2. Target attribute transformation margin and net sales value (NSV). When we load our data, most
The first data transformation begins with modifying the target at- analytical toolboxes interpret numerals properly. However, these tools
tribute (y) based on the chosen performance metrics. As explained in often misinterpret categorical attributes into character data. ML en-
the earlier section, performance metric plays a vital role in producing gineers shall manually transform character data back to categorical
reliable ML models. We can transform (y) on the fly through data data for the model to harness it for learning underlying patterns and
augmentation that is computationally intensive due to repeated pro- insights.
cessing in each epoch or use data preprocessing once before we start In our CST example, we noticed several attributes like Region and
training our models. A good practice is to avoid compute-intensive Workstream were stored as character objects in X_raw data frame. We
operations to save time and resources. Therefore, the y attribute shall created a python function to_cats() for automatically converting
be transformed once at the source than repeatedly inside the cost character attributes into pandas library category objects. This
function during training, which is a computationally-intensive and in- function uses python utility function as_type(‘category’) for data
feasible approach. types conversion. For ordinal categories, we can specify the order of
In our CST example, we intend to use RMSLE as the performance values inside the categorical attributes if it is necessary. By default,
metric, which is the log difference of actual and predicted profits on pandas maps textual descriptions with numerical categories in
projects. We performed this computation at once on the y_train and Jupyter but it treats these columns as numerals internally. Optionally,
y_valid data frames. We used the log() function from numpy library these attributes can be replaced with numerical codes outrightly.
for this transformation. This way we speed up various tasks of the AML However, it will adversely affect ML engineers result-interpretation
process through intelligent design choices. abilities since they would be required to remember all categories be-
hind these codes.
3.2.3. Feature extraction
Feature engineering is one of the most crucial tasks in the AML 3.2.5. Fix missing data
process. It involves a series of transformations to our data to enable the A common problem in almost every dataset is the issue of missing
algorithm to quickly learn underlying insights. Our approach to feature data. There are numerous reasons—ranging from a human error to
engineering is slightly different. We barely perform feature engineering incorrect sensor readings to a software bug that causes a value to be
at this stage of the AML process except for the date and geospatial missing. In some cases, missing data is absolutely no issue since it will
attributes. This section explains commonly used transformations for get populated at a later stage in the business process. A simple solution
date attributes for enabling the model to understand temporal de- to deal with missing values is to delete all rows containing nulls.
pendencies in the data. We usually transform date attributes into However, it will drastically shrink the dataset size if the majority of
several derived attributes, including day, week, month, year, rows constitute nulls. Sometimes, the presence of nulls can be a pattern
quarter, month_first_day, month_last_day, quarter, weekday, of interest and can provide additional insights to ML algorithms for
weekend, quarter_start_day, quarter_end_day, week_number, learning the relationship between input and the target attributes.
6
Therefore, missing values must be dealt with great caution. Some 3.3. Training baseline estimator
analytical libraries like pandas in python automatically create an ad-
ditional category of 1 for null entries if we change an attribute data So far, all categories in the data are replaced with numerical codes,
type to category. However, ML engineers need to fix missing values in nulls are imputed, and our data is split into the input and target attri-
all continuous attributes. The most vastly used approach for dealing butes. The next step of the AML process is to train the baseline esti-
with nulls is imputation, which replaces nulls with an estimate. A mator. This section provides guidelines for creating baseline estimators.
common imputation technique is Mean imputation where mean of an Though it involves extensive experimentation to find one with rea-
attribute substitutes all nulls in that column. sonable accuracy. The baseline estimator sets the stage for the next AML
In our CST example, we used mean imputation to populate nulls in task, where we use ML to guide engineers in making critical design
all continuous attributes. In addition, we also included additional columns decisions in an informed way. More importantly, feature engineering
to the data frame x_raw for all columns containing nulls to retain insights where baseline estimator informs ML engineers what to include or drop
that can be revealed from the pattern of nulls in those attributes. These from the feature list, based on the model’s accuracy. The following
new attributes are named by suffixing column name with text _na. For sections explain these steps in much detail.
example, the nulls in route_length attribute are substituted with mean
of route_length, and a new column route_lenght_na is appended to 3.3.1. Choose first ML algorithm
the data frame, storing the digit 1 for all rows with nulls otherwise 0. We Our first step to developing the baseline estimator is to decide the
shall also preserve these mean values as we will need these in future for kind of ML algorithm we would employ. A vast majority of ML tasks can
fixing nulls in the validation as well as test sets. Otherwise, the validation be modelled using two main classes of algorithms. Firstly, the
and test sets will not be fit for evaluating the ML models. Decision Tree Ensembles (DTE), including Random Forests or
Gradient Boosting Machines. DTEs are suitable for tasks involving
3.2.6. Scale transformation of continuous data structured data representing different facts like a relational table.
Another useful data preparation technique is scaling or normal- Secondly, the Deep Neural Networks (DNN) trained with Stochastic
isation of all continuous attributes in the dataset. Most ML engineers Gradient Descent (SGD) which work great for tasks involving un-
feel confused with scaling and normalisation as both transform our structured data like audio, vision, and natural language.
data. The key difference is that scaling modifies the range of data to In our CST example, we chose the random forest algorithm for
make it fit a given scale like 0 100 or 0 10 . Normalisation, on the training our baseline estimator. A random forest is a bunch of decision
other hand, shifts data distribution such that it can be described as the trees, containing a bunch of decisions to classify data into several
normal distribution. Scaling is a good choice for techniques like sup- clusters. We used popular python library sklearn for this purpose and
port vector machines (SVM) or k-nearest neighbours (KNN). imported the RandomForestRegressor class to start training our
Normalisation works best with t-tests, ANOVA, linear regres- model. We began with all data and random forest with ten decision
sion, linear discriminant analysis (LDA) and Gaussian trees. The results were remarkable. The accuracy (R2 ) for the model was
naive Bayes. 96.72% . However, we do not know whether the estimator is good, or it is
In our CST example, we did not apply any scaling due to our choice merely overfitting our data which is a key challenge for ML engineers.
of the algorithm, i.e. random forest. The tree-based ML algorithms This issue can be elaborated using Fig. 2 where three models are de-
have no requirements as such of the data to be normally distributed or picted. The first model fits the data using a straight line (linear & biased
normalised. However, we applied Box-Cox transformation for all case), and the third model fits through every data point, making a
continuous attributes when we fitted the deep neural network for profit curved line (variance). These two models are not considered reliable as
margin estimation. they are either unable to learn the relationship due to underfitting or
instead memorised the data entirely due to overfitting. Such models
3.2.7. Encoding the categorical attributes perform poorly on the unseen real-life data and are not desirable. The
The categorical attributes contain non-numerical values, which second model though looks better in terms of learning abilities and
sometimes are not suitable for most of the ML algorithms. We can generalisability, so we intend to get models with similar characteristics.
transform categorical attributes in two ways. Firstly, integer-en-
coding where unique numbers are assigned to categories as we did 3.3.2. Crafting great validation sets
during the data type conversion task. This approach suits scenarios The issue of bias and variance of a model leads us to the issue of
where categories contain inherent order like regions. Integer-en- creating great validation sets. There is no way to know the learning
coding does not work for categorical attributes with no ordinal re- abilities of an estimator without a good validation set. We shall not
lationship. We are likely to get unexpected results from our analysis if confuse validation sets with test sets. Validation sets are generally
we treat nominal attributes as ordinal. Secondly, one-hot encoding carved out of the training set whereas test sets are suggested to set aside
is another alternative for categorical transformation, where one binary at the beginning of the AML process and shall not be used during
attribute is added to the dataset for each category in the column and a training phase until we are ready to test our final estimator. This
value of 1 is recorded for respective category in the row otherwise 0. strategy is to ensure two-staged evaluation of estimators and has been
These newly created attributes are sometimes called dummy attribute. found phenomenal to achieve several benefits. More importantly, it
In our CST example, we introduced the idea of a threshold max_- enables ML engineers to understand the capabilities of models in terms
num_cats to guide the algorithm to employ integer encoding if of overfitting and underfitting. The role of validation sets is also crucial
categories are above the max_num_cats otherwise use one-hot en- to find out the most optimal hyperparameters of an ML algorithm for
coded for categorical encoding. After some experimentations, we the given dataset. We usually merge training and validation sets before
found max_num_cats of five (5) a good threshold for choosing between training our final estimator. So, carving out a robust validation set is a
categorical encoding options for this dataset. The value of max_num_- key stage in the AML process. A simple caveat is to choose those data
cats shall be stored for future use to transform the validation and test that resemble closely to the real-life scenarios if model will be deployed
datasets. All basic data preparation steps end here. We usually split our in the enterprise solutions. We usually create several validation sets and
data at this stage into an input matrix X and a target vector y. All data then perform some experimentation to decide about a good validation
shall be in the numerical format and stored in feather format using set.
python to_feather() function. Feather is fast on-disk format for data In our CST example, we created three validation sets from our
frames and is used by ML engineers for data exchange and interoper- training data. The first validation set was a random sample of 1% pro-
ability of datasets between analytical tools. jects from our training data. The second validation set entails all
7
Fig. 2. Underfitting and overfitting of estimator.
projects that were completed during the last month in our data. The last Where n is the number of random samples used to train each
one involved projects which were completed in the last quarter in the tree. We performed a bit of experimentation to obtain a good
data. We then trained several models and checked their performance on sample size for this data set. Our experimentation revealed n of
these validation sets. We choose the validation set that holds a linear size 50, 000 a good option. Using the subsampling, the accuracy
relationship between the training and validation scores for all models. of baseline estimator (R2) went to 87.93% and 86.19% for training
In our CST example, the validation set containing projects from the last and validation scores. This model is more reliable than our first
month demonstrated linear relationship between training and valida- model with training and validation accuracies (R2) of 97.13%
tion scores. This validation set is chosen for subsequent experimenta- and 86.04% , respectively. There is a huge drop in performance of
tion in our study. As discussed earlier, we always stick to the naming the earlier baseline estimator due to overfitting.
conventions for data sets since they are used internally by our generic ii). Hyperparameter Tuning: Hyperparameters are the knobs for ML
print_score() function to check the errors and accuracies of models engineer to adjust ML algorithms for the given task. Their values
during our experimentation. cannot be learned from data; instead, ML engineers need to decide
these values before the training begins. Their right choice can
3.3.3. Devising performance tuning strategy greatly boost the learning abilities of ML models. The whole idea of
The next step in the AML process is to formulate a tuning strategy to hyperparameter tuning is to explore the search space for various
improve the performance of our initial estimator. There is no single best combinations of parameters that would yield superior performance
strategy to tune the estimator. However, a structured approach to for ML models. It can be extremely computation-intensive operation
performance tuning is key to conduct an effective experimentation. if we check for all permutations. Approaches like Grid Search and
Most modelling decisions at this task depends on our choice of ML al- Randomised Search are vastly used techniques by ML engineers.
gorithms. We chose the random forest in this study. Two areas require Over the period, engineers get familiar with these hyperparameters
careful consideration during performance tuning of random forests. and mostly know when they would work. Hyperparameters vary
These include subsampling to quickly conduct experimentation and from one algorithm to another. Once we find out good parameter
hyperparameter tuning to understand what options would enhance values, we start training more ML models and perform extensive
the estimator learning abilities on the given task. feature engineering until we get to our final model. In our CST
example, we tuned the model for n_estimator, max_depth,
i). Subsampling: It is generally not advisable to utilise all training min_samples_leaf, max_features and oob_score. These hy-
data for the baseline estimator at first. Otherwise, our experi- perparameters are only relevant to the random forest algorithm.
mentation will take considerable time, which is not good to carry The n_estimator specifies the number of trees in the random
out an effective exploration of different design strategies. In the case forest. The key idea behind random forest to combine several weak
of a random forest with ten decision trees (estimators), it will take estimators to get one powerful estimator. So it always worth ex-
considerable amounts of time to train each estimator with all data. ploring different number of trees and see what would work. The
So, we pick different random subsets of data to construct each de- values like 1, 5, 10, 25, 40, 80 , and 100 are tried. The n_estima-
cision tree during our training phase. In addition to reducing tors = 80 achieved model’s performance for training and valida-
training time, this strategy also introduces randomness by picking tion accuracies to 95.80% and 88.37% , respectively. The min_sam-
different subsets of data for estimators, which is key for random ples_leaf specifies the minimum number of samples required at
forest algorithm to learn complex domains. The more the trees vary each node. The values of 1, 3 and 5 are tested, and model with
in a forest, the better would be the predictive accuracy of the model. min_samples_leaf = 3 gained better training and validation ac-
This strategy of using a small subset of data during training is curacies of 97.20% and 90.19% , respectively. The max_depth spe-
generally called subsampling. In our CST example, the scikit- cifies maximum depth of the tree in random forest. While we tried
learn library does not provide any provision to customise its de- several values, this parameter didn’t contribute much to the accu-
fault mechanism which uses entire data during training the model. racy of our model. The max_feature specifies the maximum
We wrote the following code to override the default behaviour in
scikit-learn library to achieve subsampling:
8
features to be included at each split for the best fit. Values of none, (but are not limited to):
0.5, sqrt, and log2 are tested. The training and validation per-
formance of 97.02% and 90.66% respectively is achieved with a. How confident are you of the predictions?
max_feature = 0.5. b. What attributes drive the predictions?
c. How attributes interact with others to drive the predictions?
d. How much attributes contribute to the predictions?
3.3.4. Choosing baseline estimator e. How good the model can extrapolate unseen data?
This is the last step in training our baseline estimator task. We
carried out extensive experimentation on hyperparameter tuning and Most tasks in interpretable machine learning require ML engineers
subsampling to learn which configurations of the random forest will to work with domain experts actively during the AML process. It en-
yield us the best estimator. Once these steps are complete, we train our ables the ML engineers to progressively learn the domain and make
baseline estimator and evaluate its predictive performance. It is the informed decisions to revise data and models. Tasks in this section are
baseline estimator obtained from best design strategies and hy- vital for many ML operations ranging from debugging models to feature
perparameters is used for interpretable machine learning. In our CST engineering to future data collection to facilitate human decision
example, we trained baseline model using random forest for profit making and building trust. Interpretable machine learning is the cor-
margin estimation with subsampling (n = 50,000), n_estima- nerstone of our proposed AML process. We will see several interpretable
tors = 80, max_features = 0.5, min_samples_leaf = 3, and machine learning techniques for answering the questions above.
oob_score = True. The baseline estimator has the training and vali-
dation accuracies of 97.02% and 91.67% , respectively. Algorithm 1. Predictions Confidence Assessment Algorithm
3.4. Interpretable machine learning 3.4.1. Checking predictions confidence of baseline estimator
A common aphorism in statistics is that all models are wrong, but
Machine learning (ML) is a great technology, but industry folks will some are useful. Nobody can claim their model is 100% accurate re-
never take it for our word. They ask a series of questions out of their gardless of what data or algorithm we choose. This inability is partly
curiosity to understand ML capabilities and weaknesses as intelligent because real-world phenomena have many perspectives which are hard
solutions are getting deployed in their line of work software. They will to capture in one model holistically. This aphorism brings us to an
not use ML with confidence until they fully trust it, which requires important inquiry which we mostly received from end-users. They want
significant verifications. A piece of simple advice for ML folks is to let to know the competencies of our ML models. In our CST example, we
the model corroborate what industry folks already know. People in the were often asked about how the confidence of models varies across
industry will start trusting our models even if they do not know much different types of projects. To elaborate on this analysis, we will first
about ML technology. For this reason, our applied machine learning explain the concept of confidence in random forests. Random forest is
(AML) process put special emphasis on explaining our models which go an ensemble of trees where trees make predictions, and random forest
beyond just making good predictions. It is about enabling models to aggregates their predictions into the outcome. So, if each tree predicts a
answer the critical question of the end-users of the system. Otherwise, slightly different profit margin, it indicates higher confidence.
the wider adoption of ML into the construction industry will not be Otherwise, huge disagreement in the predictions from trees is a sign
achieved. This field of exploring the strengths and weaknesses of ML that the model is less confident. The only reason predictions of trees
models by analysing their internals is called interpretable machine vary in random forest is if projects, we are predicting for, are either
learning, which will be the focus of this section.
A common misconception about contemporary ML is the assump-
tion that these models are black boxes. Nobody knows what they learn Table 3
and how they make predictions. Their predictions are great, but what Confidence of estimator by project size.
logic drove that prediction. Surprisingly, most ML engineers are barely Project Size Confidence Score
aware of the approaches for analysing ML models. The entire field of
interpretable machine learning is to understand the competencies of Large 0.034508
Small 0.029514
our ML models. This field is vast. We intend to cover a few techniques
Medium large 0.027730
to answer some key questions asked by the industry folks. The chosen Medium 0.026984
questions are the ones asked by most of the industry folks whenever we Mini 0.026122
deployed ML algorithms as part of the CST. These questions include
9
importance of attributes. Attribute importance is an excellent topic in

ML, which reveals the significance of an attribute on model perfor-
mance. Several algorithms are devised to perform attribute importance.
The popular ones include permutation importance and model reliance.
We used permutation importance in this study because it is fast and
model agnostic approach. One benefit of permutation importance is
that we compute it without training ML models on every subset of our
data. There is no need for training models as it is always performed
after the model is trained. This concept of permutation importance is
straightforward. The importance of an attribute can be determined from
the increase in the model prediction error after we shuffle data in that
attribute. We leave rest of attributes intact. The prediction error of the
model shall increase if that attribute is vital to the model. The shuffling
of data is performed to break the actual relationship between that at-
Fig. 3. Data distribution by project size. tribute and the target attribute. The higher the prediction error go, the
more important the attribute is to the model. Algorithm 2 illustrates key
non-existent in our data, or are sparse. Random forest has no rules to steps involved in the permutation importance.
make consistent predictions; thereby, those projects end up in entirely
Algorithm 2. Permutation Importance
leaf nodes in different trees. This variation of individual tree predictions
can be analysed using standard deviation (SD). We can figure out the In our CST example, we employed PermutationImportance
confidence for our models from that ratio of SD of predictions. The class from python eli5.sklearn library. Fig. 4 show the importance
model will be confident of its predictions if SD across all tree predic- of all attributes used for predicting the profit margins. Attributes with
tions is low otherwise less/no confidence. There is no built-in library to higher importance scores are at the bottom of the plot. The model finds
check the predictions confidence of ML models. We performed several these attributes important in driving its predictions. The attributes
steps, as shown by the Algorithm 1, to perform this analysis. having lower scores are least significant to the baseline model. Since
In our CST example, we explored the predictions confidence of our there is a degree of randomness to the exact performance by shuffling
baseline estimator for different types of projects. For the sake of brevity, data, this library shuffles data several times to ensure that the real re-
we will only discuss this analysis for one categorical attribute, i.e. lationship between attributes and target is fully broken. These scores
project size. In real-life, we analyse predictions confidence across can be negative for some attributes, which reveals that the model at-
all categorical attributes to truly understand the strengths and weak- tributes no significance to these attributes. Occasionally, it is worth
nesses of our models. Confidence score in the algorithm is the ratio of removing such those attributes.
the standard deviation of tree-level predictions over the predicted profit Fig. 4 shows that project complexity in terms of risks, opportunities
margin. Table 3 shows the confidence score for all categories in the and the distance which construction route travels through the rivers,
project size attribute. The higher the confidence score, the lesser roads, rails and utilities, are amongst the most important attributes.
the confident the model is on predictions for that category. In this case, Besides, the allocation of resources in terms of the project manager
large projects category has the confidence score of 0.0345 which (PM), quantity surveyor (QS), commercial manager, design manager,
shows that our baseline estimator is less confident for predictions on suppliers and subcontractors are also meaningful in predicting profit
large projects. There are several reasons for this lack of confidence. margins. Materials, though, are not as significantly important to the
One major reason is the uneven distribution of projects in our data set. model. At this stage, the ML engineers work in tandem with domain
Fig. 3 shows the distribution of projects in our data by the project experts to discuss results and learn the required domain knowledge.
size. The distribution clearly shows that large projects are fewer in When we shared these results with estimators, they agreed with the
our data which has caused a drop in the confidence of our model. Based attribute importance chart. Our model was able to verify the knowledge
on this insight, we requested additional data with a predictive theory estimators had, which helped them to gain confidence in our model.
that more data will improve predictions confidence of our model for While attribute importance is a great tool for ML engineers to explain
large projects. Once additional data for large projects is received the importance of attributes, it is also useful to drive several feature
and re-trained our baseline model. We found that the confidence score engineering steps. It is worth recognising that our proposed AML pro-
of our baseline model has slightly improved. In this way, we were able cess delayed feature engineering so far. In the beginning, we performed
to use insights to inform our modelling strategies, which was the future a minimum of necessary feature engineering but not based on domain
data collection requirements. relevance, which most people suggest. The chart in Fig. 4 shows that
many attributes fall on the long tail are least important and could be
deleted. With several experimentations, we found that we can delete
3.4.2. Checking attributes driving predictions
almost half of attributes without significantly losing the model
The next question often asked of ML engineers is about the
10
Fig. 4. Attribute importance chart.
performance. We discussed our findings with end-users and deleted adversely affects the performance of ML models but also obscure some
some of those attributes as soon as industry folks agreed. Most of these important attributes by revealing similar information during the
attributes had the importance of less than 0.05. The removal of attri- training process. It incapacitates the model’s abilities to understand
butes slightly changed the importance of other attributes. This is partly attributes importance. The performance of models can be significantly
due to the correlation between various attributes. When we deleted the improved by removing such confounding attributes from our dataset.
least significant attributes, the remaining correlating attributes became As a result, we will achieve simpler but more robust models which will
more visible and their importance scores improve. This type of deletion be based on fewer input attributes. Such models are always efficient
enables us to develop efficient models with fewer attributes. and are hard to overfit. However, the choice of removing attributes
shall not be entirely subjected to the intuitions of ML engineers or
3.4.3. Identifying multicollinearity & removing attributes domain experts. Instead, their judgements shall be informed with some
In the previous section, we slightly talked about the issue of mul- objective evidence. To this end, there are several statistical approaches
ticollinearity and its effect on model when we delete attributes. to find out the similarity between attributes in datasets. Unsupervised
Multicollinearity occurs when input attributes correlate with each ML techniques like clustering can play a significant role to inform ML
other. In simple words, a subset of attributes is likely to supply over- engineering during the decision of removing attributes. Feature im-
lapping information to our estimator. Multicollinearity not only portance, performed in the previous section, is useful to have a feel of
11
Fig. 5. Agglomerative clustering for attributes similarity.
least important attributes and remove them from our data. Identifying In our CST example, we used hierarchical agglomerative clustering
correlated attributes from the dataset are a key step in our proposed for finding the similarity between input attributes. We calculated the
AML process. Supervised ML approaches like clustering are specialised Spearman ranked correlation and turned it into a distance matrix.
for this type of exploration. Their accuracy exceeds other traditional ML Distance matrix provides necessary information to construct the hier-
approaches. archy between attributes based on their similarity. Fig. 5 shows the
results of agglomerative clustering using dendrogram, where attributes
Algorithm 3. Multi-Collinearity Assessment Algorithm
12
are divided into various groups. We can quickly spot similar attributes attributes are involves in rules in tree nodes. Temporal attributes like
in the dendrogram as they fall closer to each other under a single year is one example of attributes that can restrict a model to perform
parent. While clustering provides an abstract grouping of attributes that reliably in real-life scoring. A tree model cannot accurately predict for
worth investigation for multicollinearity. This information shall still be projects which are from a year that is not available in the training data.
verified from domain experts. We shared these clusters with our in- The deletion of such features is one of the key step in the AML process
dustry folks who acknowledged the results of the clustering algorithm. to overcome overfitting and achieving robust ML models.
The clustering algorithm suggested eleven (10) groups where one or In our CST example, there were several attributes with date ele-
several attributes are collinear. For example, the clustering algorithm ments such as Start date, End date, ITT date, Tender submis-
revealed that attributes Total length in river (m), Total length sion date, Contract award date and Mobilisation dates.
in rail (m), Total length in road (m), Total length in Besides, attributes like Opportunity id and Opportunity title
utilities and Total length in other (m) in cluster 3 are are sort of unique identifiers and have no relevance for predicting the
correlating with Horizontal Directional Drilling (HDD) in profit margin of a construction project. We performed thorough ex-
river, HDD in rail, HDD in road, HDD in utilities and HDD in perimentation to see how the removal of these type of attributes affects
other, respectively. We then used Algorithm 3 to find out the feasi- the performance of our model. In theory, the validation score of the
bility of removing attributes from these group. The overall aim is to model shall improve if we remove attributes that hinder the gen-
remove those attributes where the accuracy of the model does not drop eralisability of our model. We derived Project Duration from at-
too much. According to experts, horizontal directional drilling (HDD) is tributes like Start date and End date and then dropped those at-
a specialised activity and one of the highly risky activity on a project. It tributes from our training data. All unique identifiers are also removed
is a method of creating cable trenches underneath rivers, rail, bridges from the data. The predictive accuracy of the model is checked before
and public roads such that the path of the cable continues. The system the deletion of all temporal or unique identifiers.
dictates that ‘no of HDDs in…’ is not as strong as ‘total length of HDDs In most cases, the model was able to achieve higher accuracy on
in…’. The is backed up by the industrial experts, and the argument is validation data. The final model has fewer attributes but has a higher
that the total length of HDDs in any of the above-mentioned scenario accuracy of 95.23% and 93.44% on training and validation data, re-
allows the project team to develop a fail-safe mitigation strategy to spectively. Our model is intended to be used for real-life scoring on
performing this task. The total length of HDDs translates to risk levels, future projects. So, we removed the time-dependent attributes and
risk pot percentages, subcontractor hire, drill and drill bit hire, permit unique identifiers from our data. In this way, we were able to increase
orders and environmental distress. The more there is to do the more the extrapolation of our ML model for future projects and tackle the
risks there are. Therefore, we dropped Horizontal Directional issue of overfitting.
Drilling (HDD) in river, HDD in rail, HDD in road, HDD in
utilities and HDD in other and left their counterparts.
Similarly, the route length of a project was found to have several 3.4.5. Model interpreter
attributes correlated to it such as the power cable length, no of circuits, Another common end-users requirement is to show the breakdown
no of phases, no of joint bays, fibre cable length etc. Experts explained predictions. In our CST example, how the model factored out project-
that the route length plus the no of circuits and the no of phases come specific nuances to build the profit margin. The end-users this ex-
together to give the total length of power cable needed. As you can see planation for clients to explain the output is higher or low than their
here, there are two correlation levels between the route length and expectations. There are several techniques to breakdown prediction for
cable length and between no of circuits and phases and cable length. ML models. In the case of random forest, each tree node contains a
But the ultimate parent attribute to all is the route length. Likewise, value which is the sum of the target feature for all the rows contained in
there is one correlation between the route length and total length of that specific node. So, when we traverse the tree from top to down, this
fibre and route length and pilot cables to be used on the project. value fluctuates on every split of the node. This fluctuation can be used
Whereas in the first example of the HDDs and total length in HDDs to calculate the contributions of attributes to final predictions.
areas where some attributes were completely removed from the system, In our CST example, we used python package treeinterpreter
the system shows that in this instance between route lengths and its for this purpose. This package has a predict() function which accepts
correlated and sub correlated attributes, some attributes still demon- the model and the row for which we want to predict the profit margin
strate very strong contributions to the profitability such as the cable and returns predictions, bias and contributions of each attribute to-
length and no of joint bays. Therefore, not all child-attributes were wards the prediction. We used a power transmission and distribution
deleted in this example, only those whose contributions were com- project as a case study. The scope of the project was to design, excavate,
pletely mundane and those that industry experts have acknowledged lay, joint and terminate 2500mm2 copper cables through a mildly rural
their redundancy were deleted. After we dropped several attributes and urban area between two power poles. The model estimated a
from each group, the training and validation accuracy of our model margin of 17.86% and displayed a rare functionality of ML applications
slightly dropped from 93.33% and 91.32% to 89.89% and 87.12% respec- by revealing the process to calculate the project margin, as shown in
tively. Fig. 6 using waterflow chart. The industry folks realised that the model
has learnt that a 3.5 km, 110 kV cabling project with just one outage is
3.4.4. Extrapolating estimator widespread and the firm makes a profit from these. Whereas with at-
Most ML models are excellent at interpolation, which means that tributes such as HDD in rail, and the firm rarely approve projects that
they can predict with higher accuracy what is known to them (i.e. data require directional drilling through a rail track because of the com-
from training sets). However, another important feature of robust ML plexity and social and economic distress associated with it. Therefore,
models is extrapolation, which is about making predictions for what is the system is right to push the margin of this attribute higher than the
outside of these known, i.e. validation or unseen real-life data. others. Likewise, the system evaluated the possible profitable variations
Extrapolation reveals the generalisability of a model. Overfitting is a to be made from the project and understand that this is too low for this
major cause of restricting the extrapolation capabilities of ML models. type of project; hence, it raises it’s associated margin higher. These
However, we can disclose the overfitting of a model by devising good experts explored each percentage against their knowledge and data to
validation sets. Oftentimes, ML engineers use attributes during the validate the process the system has adopted. At the end of the ex-
model training, which restrict models from extrapolation. This issue ploration, the experts praised the model and expressed their support for
becomes more prominent for tree-based models as their predictive ac- the further development of the tool.
curacy drastically declines during production on unseen data if such
13
Fig. 6. Tree interpretability for profit margin forecasting.
3.5. The final model (X) and test (X_test) sets. This accuracy is reasonably high. In addi-
tion, the generalisability of the model is significant for the unseen data
So far, we have performed a lot of experimentation to achieve two from the X_test set. The test data is eventually merged and the final
major things. Firstly, we explored the hyperparameters space to see model is trained. We stored the model into .pkl file to be utilised in
what values would enhance the learning abilities of the underlying our opportunities analytics dashboard for production deployment.
algorithm for the given dataset. Secondly, we examined the data
through Interpretable Machine Learning to understand the suitability of 3.6. Production deployment
input attributes for the ML task. During the AML process, we realised
that some attributes are insignificant to our modelling task, which were The last step in our proposed AML process is the production deploy-
discarded. We also observed that some attributes correlate with others ment, where we put our models for real-life scoring. ML technology is
and need to be removed as they confuse the model. Therefore, we re- elegant, but through enterprise solutions, we take it to the next level of
moved such attributes. ML engineers shall have a good idea at this AML producing actionable insights. These models often form a part of complex
process stage about what data and hyperparameters would work for the software. There are several technological challenges associated with pro-
given task. We start training our final estimator at this stage. We duction deployment. For prototyping our models, we tend to use python or
slightly talked about the test set at the beginning that it is advised to R languages. However, the technologies behind enterprise applications
hold it out for later evaluation. Test sets become relevant at this stage to may use entirely different programming languages. We need to integrate
know the final accuracy of our estimator before the final model is ML models into these systems to accrue the real benefits of automation.
trained. To this end, we first merge the training and validation sets. There are two widely used approaches to integrate ML models into en-
Then use the test set to evaluate our trained model using merged data. terprise solutions. The first is to rewrite the code for the model into the
However, we train the final model on all datasets, including training, language of the system. This sounds interesting, but it entails enormous
validation and test sets. There is a substantial reason to utilise entire programming efforts. Besides, most programming languages are not sui-
data for training the final model. The model will not be able to learn table to efficiently perform heavy computation required by these models.
patterns of variations from the validation and test sets if we skip those Secondly, we can use web services for ML models to ensure language-
datasets during the training. Consequently, the predictive accuracy of agnostic deployment. This is the most popular option. Most software en-
the model will drift significantly for unseen data due to gaps in his- gineers use web services for invoking ML models from their enterprise
torical data used for its training. We shall ensure to combine all datasets applications. The first approach is more suitable for onboard AI systems
before start training the final estimator. The training of the final model where devices have limitations for speed, memory and connectivity,
relatively takes more time. We shall stick to the best values of hy- whereas the second approach is vastly adopted approach in most en-
perparameters that were discovered during the process. terprise business applications.
In our CST example, we combined two datasets X_train and In our CST example, we aimed to utilised our ML model in the
X_valid into X tensor and then trained the random forest model. The Construction Simulation Tool (CST). The model discussed in this study
model has attained the accuracy of 96.77% and 94.91% on the training is used for profit margin estimation on ML-based opportunities
14
Fig. 7. Machine learning (ML)-driven opportunities analytics dashboard.
analytics dashboard. Fig. 7 shows the screenshot of the opportunities neural network for training ML models for these tasks. The discussion of
analytics dashboard. This is one of the key analytics dashboards pro- which algorithm is superior over the other is beyond the scope of this
vided in the CST. This dashboard contains all key information for un- study. These classes of algorithms have been successfully employed in
derstanding the suitability of project opportunities. Main users of this diverse applications across different industries. Our focus in this section
dashboard are bid managers and estimators. This dashboard gets access is to showcase the effectiveness of our proposed AML process towards
to a large number of predictive models for various tasks. During the creating robust ML models. Overall, these algorithms were able to train
model invocation, CST passes the key details of an opportunity to models with reasonable accuracies. For complex tasks, like predicting
predictive models which predict key commercial information of the the contingency pot which is the additional profit that can be made
project including costs, margin, cash flow, project plan and risks to through effective handling of contingencies in the project, we were able
name a few. Most ML models are designed to guide estimators about to train the deep neural network with predictive accuracies of 73.34%
specific tasks. CST is underpinned by a comprehensive benchmarking and 70.09%, respectively. Whereas, some easier tasks such as Bid/No Bid
system which highlights different values into red, amber and green where model informs the decisions of whether the firm shall bid the
colours. The overall aim is to highlight weak aspect of the given op- given project or not, we attained 98.45% and 97.56% accuracies on
portunity quickly. CST automatically fetches design details of the con- training and test sets. In most cases, we were able to train models with
struction project and formulate input query for the ML model without accuracies above 80% on relatively challenging tasks. While we had
any human intervention. We deployed the profit margin estimation access to vast amounts of relevant past data along with domain ex-
model as a Flask-based web service. CST uses REST API based business pertise, the most distinguished aspect of the work was the proposed
service to invoke this model. The result from the web services are ob- AML process. It is clear from the results that the difference between the
tained as JavaScript Object Notations (JSON) files, which are then training and validation scores for both classes of algorithms is low. The
deciphered by Structured Query Language (SQL) and shown it on the average difference between the training and validation scores for the
dashboard. The final decision is in the hands of the users who can deep neural networks is 3.28%, whereas that for the random forest is
modify or entirely override predictions of the model. In case of revi- 3.35%. This minor difference reveals the robustness of our ML models,
sions to models predictions, CST logs these changes and use them to which is hugely desirable for models intended to be used in real-life
refine its predictions through lifelong learning. applications.
4. Evaluation of applied machine learning process 5. Conclusions
We evaluated the proposed AML process by creating several ma- In this paper, we presented guidelines for Applied Machine Learning
chine learning models for various tasks involved in construction project (AML) for the construction industry. A common experience in most
planning and delivery. Table 4 shows the list of those tasks along with industries is that seemingly impressive ML models fail when deployed
their descriptions. We primarily employed random forest and deep to real-life applications. The fallout includes losing the support for ML-
15
based automation as people start suspecting ML capabilities and re-
Validation Accuracy
luctant to pursue it further. Surprisingly, ML engineers, as well as
construction folks, seldom know tools and techniques for developing
87.23
91.97
92.78
87.17
71.45
60.55
82.56
84.01
60.33
94.91
86.19
80.83
89.1
robust ML models. ML engineers often make modelling choices about
their data and algorithms based on intuition, which are often bias and
Random Forest
misleading. As a result, ML models are not up to the expectations as ML

steps were either omitted or not executed in proper way. While ML is an
Training Accuracy
engineering task, there are several ways to create robust ML systems.

Currently, there is no literature source that provides construction IT
90.44
94.56
95.53
89.44
90.67
74.31
74.23
84.56
85.44
63.45
96.77
87.56
85.78
folks about guidelines for developing great predictive models. This
paper fills this void and presents the AML process in detail, which we
learned over several years while developing the Construction
Simulation Tool (CST). The process is elaborated in the context of profit
Validation Accuracy
margin estimation. Our AML process is evaluated for several ML tasks to

ensure it works.
Deep Neural Network (DNN)
89.42
83.56
86.51
81.05
90.11
82.34
78.34
76.56
82.08
70.09
84.34
91.07
97.56
Interpretable Machine Learning is desirable for using these models
in enterprise solutions. This explanation is key for end-users to trust ML
systems. Besides, users require to understand the limitations of models
so that they can manually override in case ML is making wrong pre-
Training Accuracy
dictions. Our proposed process included interpretable machine learning

for ML engineers as key step to expose the internals of models. The ML
94.73
86.61
88.01
84.06
91.34
83.54
81.45
85.87
84.61
73.34
89.34
94.33
98.45
engineers are not expected of prior domain knowledge; they learn it

through ML models as they progress through the process. The process
stimulates critical modelling tasks like feature engineering as ML en-
gineers learn more about the domain from industry folks. This study is a
Predicting additional profit that can be made by bringing innovation in terms of materials or design strategy to the
Predcting the %age of labour cost with respect to net sales value (NSV) for the project to achieve planned margins
Predicting the profit margin that can be made keeping in view project complexity, resourcing, supply chain and
Predcting the %age of general expenses with respect to net sales value (NSV) for the project to achieve planned
Predcting the %age of subcontract cost with respect to net sales value (NSV) for the project to achieve planned
Predcting the %age of plants & equipment cost with respect to net sales value (NSV) for the project to achieve
part of developing a machine learning-based construction simulation

Predcting the %age of materials cost with respect to net sales value (NSV) for the project to achieve planned
tool which aims to harness historical data to automate or facilitate

Predicting additional profit that can be made through effective handling of contingencies in the project
activities across construction activities involving opportunity selection,

design optimisation, construction estimating and project execution. CST
Predicting the %age of payments that the firm shall let retain the client until the project finishes.
has employed a large number of ML models to support users’ tasks at

various stages of the lifecycle. This paper explains our proposed AML
Predicting additional profit that can made through effective mitigation of identified risks
process, which was key to developing robust ML models in CST.
Acknowledgments
Predictng the first day when firm will start making profit from the project
The authors would like to express their sincere gratitude to Innovate

UK through grant application number 54832–413479; file number
Predicting whether the company shall bid for or not the given project
102473 and Engineering and Physical Science Research Council

(EPSRC) through grant reference number EP/S031480/1 for providing
financial support to carryout this study.
To predict the win probability of a project opportunity.
References
[1] A. Akintoye, M. Skitmore, Profitability of uk construction contractors, Construct.

Manage. Econ. 9 (1991) 311–325.
[2] A. Bagies, C. Fortune, Bid/no-bid decision modelling for construction projects,
Procs 22nd Annual ARCOM Conference, Birmingham, 2006, pp. 511–521.
Machine Learning Task Description
[3] N. Banaitiene, A. Banaitis, Risk management in construction projects, in: Risk

Management-Current Issues and Challenges, IntechOpen, 2012.
[4] D. Bau, J.-Y. Zhu, H. Strobelt, B. Zhou, J.B. Tenenbaum, W.T. Freeman, A. Torralba,
Visualizing and understanding generative adversarial networks, arXiv preprint
arXiv:1901.09887, 2019.
[5] M. Bilal, L.O. Oyedele, J. Qadir, K. Munir, S.O. Ajayi, O.O. Akinade, H.A. Owolabi,
planned margins
H.A. Alaka, M. Pasha, Big data in the construction industry: A review of present
other attributes
Evaluation of AML process for several ML tasks.
status, opportunities, and future trends, Adv. Eng. Informat. 30 (2016) 500–521.
[6] I. Bose, R.K. Mahapatra, Business data mining—a machine learning perspective,
margins
margins
margins
Informat. Manage. 39 (2001) 211–225.

project
[7] P.-T. Chang, L.-T. Hung, P.-F. Pai, K.-P. Lin, Improving project-profit prediction
using a two-stage forecasting system, Comput. Ind. Eng. 66 (2013) 800–807.
[8] S. Christodoulou, Bid mark-up selection using artificial neural networks and an
Machine Learning Task
entropy metric, Eng. Construct. Archit. Manage. 17 (2010) 424–439.

[9] V. Cox, Exploratory data analysis, Translating Statistics to Make Decisions,
Springer, 2017, pp. 47–74.
Margin start day
Subcontract cost
Contingency pot
Innovation pot
Retention (%)
Materials cost
[10] I. Dikmen, M.T. Birgonul, Neural network model to support international market
Profit margin
General cost
Labour cost
Bid/No Bid
entry decisions, J. Construct. Eng. Manage. 130 (2004) 59–66.

Plant cost
[11] F. Doshi-Velez, B. Kim, Towards a rigorous science of interpretable machine

Risk pot
Win(%)
learning, arXiv preprint arXiv:1702.08608, 2017.

[12] M.F. Dulaimi, H.G. Shan, The factors influencing bid mark-up decisions of large-and
medium-size contractors in singapore, Construct. Manage. Econ. 20 (2002)
Task No.
601–610.
Table 4
[13] A. Enshassi, S. Mohamed, A. El Karriri, Factors affecting the bid/no bid decision in
10
11
12
13
1
2
3
4
8
9
the palestinian construction industry, J. Financ. Manage. Property Construct. 15
16
(2010) 118–142. [27] O. Moselhi, T. Hegazy, P. Fazio, Dbid: analogy-based dss for bidding in construc-
[14] J. Friedman, T. Hastie, R. Tibshirani, The elements of statistical learning volume 1. tion, J. Construct. Eng. Manage. 119 (1993) 466–479.
Springer series in statistics New York, 2001. [28] O. Moselhi, T. Hegazy, P. Fazio, Expert: an expert system for strategic bidding, in:
[15] P. Gomber, R.J. Kauffman, C. Parker, B.W. Weber, On the fintech revolution: in- Proceedings of Annual Conference of Canadian Society for Civil Engineering,
terpreting the forces of innovation, disruption, and transformation in financial Fredericton, NB, Canada, 1993.
services, J. Manage. Informat. Syst. 35 (2018) 220–265. [29] S. Mullainathan, J. Spiess, Machine learning: an applied econometric approach, J.
[16] G. Hulten, Machine learning intelligence, Building Intelligent Systems, Springer, Econ. Perspect. 31 (2017) 87–106.
2018, pp. 245–261. [30] A. Ng, AI Transformation Playbook: How to lead your company into the AI era,
[17] M.I. Jordan, T.M. Mitchell, Machine learning: Trends, perspectives, and prospects, Mcgraw-Hill, New York, 2018.
Science 349 (2015) 255–260. [31] S. Petruseva, P. Sherrod, V.Z. Pancovska, A. Petrovski, Predicting bidding price in
[18] S.-Y. Kim, T.-A. Huynh, et al., Improving project management performance of large construction using support vector machine, TEM J. (2016).
contractors using benchmarking approach, Int. J. Project Manage. 26 (2008) [32] C. Rudin, G.-Y. Vahn, The big data newsvendor: Practical insights from machine
758–769. learning, 2014.
[19] H. Leon, H. Osman, M. Georgy, M. Elsaid, System dynamics approach for fore- [33] Z. Rui, F. Peng, K. Ling, H. Chang, G. Chen, X. Zhou, Investigation into the per-
casting performance of construction projects, J. Manage. Eng. 34 (2017) 04017049. formance of oil and gas projects, J. Natural Gas Sci. Eng. 38 (2017) 12–20.
[20] H. Li, P.E. Love, Combining rule-based expert systems and artificial neural networks [34] A. Shapiro, Reform predictive policing, Nature News 541 (2017) 458.
for mark-up estimation, Construct. Manage. Econ. 17 (1999) 169–176. [35] A.A. Shash, Factors considered in tendering decisions by top uk contractors,
[21] H. Li, L. Shen, P. Love, Ann-based mark-up estimation system with self-explanatory Construct. Manage. Econ. 11 (1993) 111–118.
capacities, J. Construct. Eng. Manage. 125 (1999) 185–189. [36] A. Toshev, C. Szegedy, Deeppose: Human pose estimation via deep neural networks,
[22] C.-T. Lin, Y.-T. Chen, Bid/no-bid decision-making–a fuzzy linguistic approach, Int. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
J. Project Manage. 22 (2004) 585–593. 2014, pp. 1653–1660.
[23] M.-T. Luong, Q.V. Le, I. Sutskever, O. Vinyals, L. Kaiser, Multi-task sequence to [37] M. Wanous, A. Boussabaine, J. Lewis, To bid or not to bid: a parametric solution,
sequence learning, arXiv preprint arXiv:1511.06114, 2015. Construct. Manage. Econ. 18 (2000) 457–466.
[24] J. Manyika, A future that works: Ai, automation, employment, and productivity, [38] M. Wanous, H.A. Boussabaine, J. Lewis, A neural network bid/no bid model: the
McKinsey Global Institute Research, Tech. Rep., 2017. case for contractors in syria, Construct. Manage. Econ. 21 (2003) 737–744.
[25] E.B. Mendelson, Artificial intelligence in breast imaging: Potentials and limitations, [39] I.H. Witten, E. Frank, M.A. Hall, C.J. Pal, Data Mining: Practical Machine Learning
Am. J. Roentgenol. 212 (2019) 293–299. Tools and Techniques, Morgan Kaufmann, 2016.
[26] T.M. Mitchell, Does machine learning really work? AI Magazine 18 (1997) 11.
17

Guidelines For Applied Machine Learning in Construction Industry-A Case of Profit Estimation PDF

Uploaded by

Copyright:

Available Formats

Guidelines For Applied Machine Learning in Construction Industry-A Case of Profit Estimation PDF

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Guidelines For Applied Machine Learning in Construction Industry-A Case of Profit Estimation PDF

Uploaded by

Copyright:

Available Formats

Advanced Engineering Informatics 43 (2020) 101013

Contents lists available at ScienceDirect

Advanced Engineering Informatics

Full length article

Guidelines for applied machine learning in construction industry—A case of T

ARTICLE INFO ABSTRACT

1. Introduction roadmap; hence leads to unfavourable outcomes as we witnessed in

was a better fit for production deployment in the CST.

Model contains one layer which clearly can’t capture non-

Fewer data used in training and evaluation method is not

linearity between project attributes and profit margins.

It is hard to use this sytem as firms can rarely access

machine learning step. This step is crucial as it enables ML engineers

to Trust, Bias, Transparency, Confidence and Interpretability of ML are

data and using the optimal hyperparameters. The resulting model is

tasks are further explained in the subsequent sections.

3.1. ML task definition & data selection

According to Mitchell [26], ML programs learn from experience (E)

3.1.1. Defining task & evaluation metrics

The first step and arguably the most important aspect of ML is to

data of firm and of its competitors.

The next step in the process is to specify performance metric for

evaluating ML models. A performance metric reveals how effectively a

SVM Forecasting System

In such scenarios like profit estimation, only the percentage differences

Fig. 1. An overview of applied machine learning process.

1 Opportunity ID Unique code attributed to the opportunity (8 characters long)

Sr.# Attribute Name Attribute Description

75 Planner Proposed planner for the delivery of the opportunity

Fig. 2. Underfitting and overfitting of estimator.

importance of attributes. Attribute importance is an excellent topic in

Fig. 4. Attribute importance chart.

Fig. 5. Agglomerative clustering for attributes similarity.

Fig. 6. Tree interpretability for profit margin forecasting.

Fig. 7. Machine learning (ML)-driven opportunities analytics dashboard.

4. Evaluation of applied machine learning process 5. Conclusions

based automation as people start suspecting ML capabilities and re-

misleading. As a result, ML models are not up to the expectations as ML

engineering task, there are several ways to create robust ML systems.

margin estimation. Our AML process is evaluated for several ML tasks to

dictions. Our proposed process included interpretable machine learning

engineers are not expected of prior domain knowledge; they learn it

part of developing a machine learning-based construction simulation

tool which aims to harness historical data to automate or facilitate

activities across construction activities involving opportunity selection,

has employed a large number of ML models to support users’ tasks at

process, which was key to developing robust ML models in CST.

The authors would like to express their sincere gratitude to Innovate

102473 and Engineering and Physical Science Research Council

[1] A. Akintoye, M. Skitmore, Profitability of uk construction contractors, Construct.

[3] N. Banaitiene, A. Banaitis, Risk management in construction projects, in: Risk

Informat. Manage. 39 (2001) 211–225.

entropy metric, Eng. Construct. Archit. Manage. 17 (2010) 424–439.

entry decisions, J. Construct. Eng. Manage. 130 (2004) 59–66.

[11] F. Doshi-Velez, B. Kim, Towards a rigorous science of interpretable machine

learning, arXiv preprint arXiv:1702.08608, 2017.

the palestinian construction industry, J. Financ. Manage. Property Construct. 15

You might also like