Sharpening the BLADE: Missing Data Imputation Using Supervised Machine Learning
A
Department of Industry, Science, Energy and Resources
B
Data61
July 2020
Abstract
Incomplete data are quite common and can degrade statistical inference, often
affecting evidence-based policymaking. A typical example is the Business
Longitudinal Analysis Data Environment (BLADE), the Australian Government's
national data asset. In this paper, motivated by helping BLADE practitioners
select and implement advanced imputation methods with a solid understanding
of the impact different methods will have on data accuracy and reliability, we
implement and examine the performance of data imputation techniques based
on 12 machine learning algorithms, ranging from linear regression to neural
networks. We compare the performance of
these algorithms and assess the impact of various settings, including the number of input features
and the length of time spans. To examine generalisability, we also impute two features with distinct
characteristics. Experimental results show that three ensemble algorithms
(extra trees regressor, bagging regressor and random forest) consistently
maintain high imputation performance over the
benchmark linear regression across a range of performance metrics. Among them, we would
recommend the extra trees regressor for its accuracy and computational efficiency.
JEL Codes: C15, C55, C63
Keywords: artificial intelligence, machine learning, data imputation and government administrative
data
For further information on this research paper please contact:
Marcus Suresh
Email: [email protected]
Creative Commons
Attribution 4.0 International Licence
CC BY 4.0
All material in this publication is licensed under a Creative Commons Attribution 4.0 International
Licence, with the exception of:
Creative Commons Attribution 4.0 International Licence is a standard form licence agreement
that allows you to copy, distribute, transmit and adapt this publication provided you attribute the
work. A summary of the licence terms is available from
https://creativecommons.org/licenses/by/4.0/
Wherever a third party holds copyright in material contained in this publication, the copyright
remains with that party. Their permission may be required to use the material. Please contact
them directly.
Attribution
Disclaimer
The views expressed in this report are those of the author(s) and do not necessarily reflect those
of the Australian Government or the Department of Industry, Innovation and Science.
This publication is not legal or professional advice. The Commonwealth of Australia does not
guarantee the accuracy or reliability of the information and data in the publication. Third parties
rely upon this publication entirely at their own risk.
For more information on Office of the Chief Economist research papers please access the
Department’s website at: www.industry.gov.au/OCE
Key points
1. We employ artificial intelligence to facilitate the generation of
synthetic data for the purposes of imputing high-value targets in the
Business Longitudinal Analysis Data Environment (BLADE).
1. Introduction
On a daily basis, a multiplicity of important decisions affecting human lives are
made. However, in nearly all instances, real-world data are incomplete and
suffer from varying degrees of sparsity. This can degrade statistical
inference and affect evidence-based policymaking. This is traditionally
addressed by dropping missing data, but this leads to unreliable outcomes if
the residual data are not representative of the whole dataset. A popular and
cost-effective remedy is to impute synthetic data; however, current methods
usually remain rudimentary (Bakhtiari, 2019) and inconsistent across agencies
and datasets.
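The cost of the traditional approach can be seen in a toy example of list-wise deletion (the data below are invented for illustration, not BLADE records):

```python
import numpy as np
import pandas as pd

# Toy firm-level table with scattered missing values.
df = pd.DataFrame({
    "turnover": [1.2, np.nan, 3.4, 2.2, np.nan, 5.1],
    "wages":    [0.4, 0.9, np.nan, 0.8, 1.1, 2.0],
})

# List-wise deletion: every row with any missing entry is dropped,
# which can leave an unrepresentative residual sample.
complete_cases = df.dropna()
print(len(df), "->", len(complete_cases))  # 6 -> 3
```

Here half the rows are discarded even though every row still carries some information, which is precisely what imputation tries to preserve.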
2. Related Work
Most statistical and machine learning algorithms cannot handle incomplete
datasets directly (Khan, Ahmad, & Mihailidis, 2019). As such, a plethora of
strategies has been developed to cope with missing values. Some
researchers suggest directly modelling datasets with missing values (Bakar &
Jin, 2019). However, this means that for every dataset, and for most statistical
inference, we need to build sophisticated models, which is labour-intensive
and often computation-intensive. Alternatively, practitioners often use a
two-phase procedure: first obtain a complete dataset (or subset), then apply
conventional methods to analyse it. There are roughly three classes
of methods:
Simple single-value methods, such as mean imputation, ignore the dependence
between the features and lead to poor imputation (Little & Rubin,
2014).
The third class of methods builds statistical or machine learning models
based on data or domain knowledge to impute missing values. They usually
take into account various covariance structures, such as temporal dependence
for time series or longitudinal data, and cross-variable dependence (Jin, Wong,
& Leung, 2005 and Little & Rubin, 2014). These methods impute missing
values based on a distribution conditional on other features and often have the
best performance. In this paper, we focus on these model-based methods.
Based on the MAR assumption, there are several other more robust statistical
imputation methods, ranging from hot/cold deck imputation, maximum
likelihood, expectation maximisation (EM) (Jin, Wong, & Leung, 2005 and
Rubin, 1976), multivariate imputation by chained equations, to Bayes
imputation (Little & Rubin, 2014). These methods are often restricted to
relatively small datasets. For example, Khan et al. (2019) performed an
extensive evaluation of ensemble strategies on 8 datasets by varying the
missingness ratio. Their results showed that bootstrapping was the most robust
method followed by multiple imputation using EM. Bakar and Jin (2019)
proposed Bayesian spatial generalised linear models to infill values for all the
statistical areas (Level 2) in Australia.
Machine learning and data mining techniques are capable of extracting useful
and often previously unknown knowledge from Big Data. Recently, Yoon et al.
(2018) designed a novel method for imputing missing values by adapting the
Generative Adversarial Nets (GAN) architecture where they trained two
models: a generative model and a discriminative model, and used a two-player
minimax game. It is worth noting we cannot evaluate deep learning methods
due to security restrictions in the current ABS computing environment, but they
remain a possibility in the future.
Surveying the related work reveals that imputation strategies range from simple
list-wise deletion to sophisticated neural networks. To date, no study has used
the Australian Government's national statistical asset to evaluate supervised
machine learning methods for imputation.
3. The BLADE dataset and Missing Values
BLADE is the Australian Government's national statistical asset which
combines business tax data and information from ABS surveys with data about
the use of government programs on all active Australian businesses from
FY2001-02 to FY2015-16.
Figure 3.1 is a snapshot of the entire BLADE extract for FY2015-16 using a
nullity matrix. The nullity matrix converts tabular data matrices into boolean
masks based on whether individual entries contain data (which evaluates to
true) or are left blank (which evaluates to false). The Indicative Data Items are
observed largely in their entirety because this information is compulsory, as
illustrated by the dense vectors. Data sourced from the BAS and PAYG fields
appear more sparse given that they only apply to certain types of firms such as
those that are employing staff or engaging in exports.
Source: BLADE
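The boolean mask underlying a nullity matrix can be sketched with pandas (the column names and values below are illustrative, not actual BLADE fields):

```python
import numpy as np
import pandas as pd

# Illustrative extract: a compulsory field is dense, BAS/PAYG-style
# fields are sparse because they only apply to certain firms.
df = pd.DataFrame({
    "industry_code": ["A", "B", "C", "D"],             # compulsory -> dense
    "export_sales":  [np.nan, 120.0, np.nan, np.nan],  # sparse
    "payg_withheld": [15.0, np.nan, 22.0, np.nan],     # sparse
})

# The nullity matrix is simply this boolean mask: True where a value is
# present, False where the entry is blank.
nullity = df.notna()
print(nullity["industry_code"].all())   # fully observed column
print(nullity["export_sales"].mean())   # share of observed entries: 0.25
```

Column means of the mask give per-feature completeness rates, which is what the dense and sparse vectors in Figure 3.1 visualise.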
Figure 3.2: Correlation Heat map
Source: BLADE
Figure 3.3: Data nullity correlations using hierarchical clustering algorithms
Source: BLADE
Figure 3.3 presents a dendrogram that clusters features by the Euclidean
distance between their nullity patterns. The more monotone the set of
features, the closer their total distance is to zero, and the closer their average
distance (the y-axis) is to zero. Cluster leaves that link together at a
distance of zero fully predict one another's presence. In this specific example
the dendrogram glues together the features which are required and therefore
present in every record. The 3 broad clusters discovered resemble the
underlying structure of Figure 3.2. In the first cluster, features from the
Indicative Data Items are fully observed, followed by features from BAS and
the PAYG Withholding Tax Statement.
4. Methodology
4.1 Process
Figure 4.1: Repeated K-Fold cross-validation
We briefly describe the 12 learning algorithms (Pedregosa et al., 2011) below.
They were run with the Scikit-learn v0.20.3 default hyper-parameters.
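The repeated K-fold protocol of Figure 4.1 can be sketched as follows; the synthetic data, the subset of algorithms shown and the fold counts are illustrative, not the paper's exact configuration:

```python
import numpy as np
from sklearn.ensemble import (BaggingRegressor, ExtraTreesRegressor,
                              RandomForestRegressor)
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic stand-in for the design matrix (sizes are illustrative).
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 3))
y = X @ np.array([0.5, 1.5, -0.7]) + 0.1 * rng.normal(size=300)

# Repeated K-fold: each repeat reshuffles the data into K folds.
cv = RepeatedKFold(n_splits=5, n_repeats=2, random_state=0)
for name, model in [
    ("Linear Regression", LinearRegression()),
    ("Extra Trees", ExtraTreesRegressor(random_state=0)),   # defaults otherwise
    ("Bagging", BaggingRegressor(random_state=0)),
    ("Random Forest", RandomForestRegressor(random_state=0)),
]:
    scores = cross_val_score(model, X, y, cv=cv,
                             scoring="neg_mean_absolute_error")
    print(f"{name}: MAE = {-scores.mean():.3f}")
```

Leaving each estimator at its library defaults, as the paper does, keeps the comparison reproducible across environments.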
Bayesian Ridge: A ridge regression technique using uninformative priors such
as a spherical Gaussian on w:

p(w | λ) = N(w | 0, λ⁻¹Ip)
Figure 4.2: Multi-layer Perceptron
R² = 1 − Σ(yi − ŷi)² / Σ(yi − y̅)², with sums over i = 1, …, n,

where n is the number of observations, yi is the i-th observed value, ŷi is its
predicted value and y̅ is the mean of y.
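The performance metrics can be computed with scikit-learn and NumPy; note that sMAPE has several variants in the literature, and the one below is one common choice that may differ from the paper's exact formula:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Small worked example (invented values).
y_true = np.array([2.0, 3.5, 4.0, 5.5])
y_pred = np.array([2.1, 3.2, 4.4, 5.0])

mae  = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
r2   = r2_score(y_true, y_pred)

# One common sMAPE variant (symmetric denominator, expressed in %).
smape = 100 * np.mean(2 * np.abs(y_pred - y_true)
                      / (np.abs(y_true) + np.abs(y_pred)))

print(f"MAE={mae:.3f} RMSE={rmse:.3f} R2={r2:.3f} sMAPE={smape:.2f}%")
```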
4.4 Conditions
1. The 3 input features are Capital Expenditure, Wages and FTE/Turnover (depending on the target
feature). The 7 input features include the preceding features in addition to Export Sales, Imported
Goods with Deferred GST, Non-Capital Purchases and Headcount. The 14 input features include
all preceding features and GST on Purchases, GST on Sales, Other GST-free sales, Amount
Withheld from Salary, PAYG Tax Withheld, Amount Withheld from
Payments and Amount Withheld from Investments.
Table 4.1: Experiment conditions
5. Experiment Evaluation
5.1 Algorithm comparisons for Turnover
We first examine Turnover as a target feature, comparing the results of all
algorithms, input features and time spans, as shown in Table 5.1. In all cases,
the set of 14 features performs better than 7 features, which in turn performs
better than 3 features. This applies to all algorithms and metrics. For this reason, we
present results from the 14 feature set and examine the impact of the number
of input features on performance.
Using our performance metrics, the ensemble algorithms provide clearly better
results than the other types of regressors. In particular, the Bagging Regressor
(BR) and Random Forest Regressor (RF) exhibit the lowest MAE at 0.060,
closely followed by the Extra Trees Regressor (ETR) at 0.063. The errors are
an order of magnitude lower than for most linear methods for which the best
MAE is 0.253, for our baseline Linear Regression (LR). The Multi-layer
Perceptron's (MLP's) MAE is larger than that of the ensemble methods, yet
competitive at 0.078. It is well ahead of the Generalised Additive Models (GAM)
at 0.134.
Looking at RMSE, the trends are confirmed and the same three ensemble
methods again perform best. This time the ETR exhibits the lowest error at
0.174, but BR and RF are very close with 0.177. Again, the MLP's performance
is inferior but reasonably close at 0.185, followed by GAM at 0.244. The linear
methods are clearly inferior, and the LR's best RMSE is at 0.381.
As expected, these trends are replicated for sMAPE and MSE, preserving the
same rank ordering observed previously. In terms of R2, the ETR is the best at
93.9%, closely followed by RF and BR, confirming the results from the
individual metrics through strong correlation.
Based on these results, the rest of this paper will focus on the top 3 performing
algorithms -- BR, RF and ETR -- and refer to LR as a baseline.
Table 5.1: Results for Turnover, 3FY (2014-16)

Algorithm           Features  MAE    RMSE   sMAPE  MSE    R²      Time (s)
Linear Regression   14        0.253  0.381  4.62%  0.145  70.82%  333
Decision Tree       14        0.071  0.236  1.39%  0.056  88.79%  2,003
Ridge Regression    14        0.253  0.381  4.62%  0.145  70.82%  58
Bayesian Ridge      14        0.253  0.381  4.62%  0.145  70.82%  416
LassoCV             14        0.253  0.381  4.62%  0.145  70.82%  1,407
OMPursuitCV         14        0.262  0.392  4.79%  0.154  69.05%  672
Random Forest       14        0.060  0.177  1.16%  0.031  93.70%  17,527
ML Perceptron       14        0.078  0.185  1.48%  0.034  93.35%  85,805
GAM                 14        0.134  0.244  2.47%  0.060  87.98%  9,472
5.2 Impact of Input Features
Focusing on the top 3 algorithms and the LR as the baseline, we now compare
the relative performances corresponding to the 3 input feature conditions. In
the base condition, we only use 3 features from the dataset, then increase to 7
and finally 14. We use domain knowledge in the selection of features that reflect
well-established drivers of productivity growth (Solow, 1956), being capital and
labour inputs in the base condition. In the second condition, we retain the
features from the base condition and expand the set to include imports,
exports and other expenditures. In the third condition, we use all
continuous features as inputs. While the MAE decreases only slightly for the
LR baseline, by 5.3% from 3 to 7 features and 16.8% from 3 to 14 features, the
improvements are more dramatic for the ensemble regressors, as shown in
Figure 5.1. They register error reductions of 45.8-47.0% when moving from 3
to 7 input features, and 80.3-80.6% when moving from 3 to 14 input features.
As expected, the trends are very similar for RMSE, as shown in Figure 5.2.
The improvements for LR are 4.3% from 3 to 7 features, and 12.8% from 3 to
14 features. While more moderate for RMSE than for MAE, the ensemble
methods display again a strong improvement as the number of features
increases, in the range 32.5-33.7% from 3 to 7 features, and 60.0-61.7% from
3 to 14 features.
Figure 5.1: MAE of Turnover prediction
Source: BLADE
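The nested feature conditions can be emulated with a simple loop; the synthetic data and the particular column indices below are illustrative only:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in: 14 candidate inputs, of which the first 7 carry signal.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 14))
beta = np.array([0.5, 0.5, 0.5, 1.0, 1.0, 1.0, 1.0])
y = X[:, :7] @ beta + 0.1 * rng.normal(size=500)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
maes = {}
for k in (3, 7, 14):  # mirror the nested 3/7/14 feature conditions
    scores = cross_val_score(ExtraTreesRegressor(random_state=0),
                             X[:, :k], y, cv=cv,
                             scoring="neg_mean_absolute_error")
    maes[k] = -scores.mean()
print(maes)  # error drops as more informative features are included
```

In this toy setup the 3-feature condition omits informative inputs, so the cross-validated MAE falls noticeably when moving to 7 features, echoing the pattern in Figures 5.1 and 5.2.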
Figure 5.2: RMSE of Turnover prediction
Source: BLADE
In some cases, only a single year of data may be available to impute missing
data, which precludes algorithms from potentially learning from prior knowledge
(time series patterns). We examine this by reporting the MAE and RMSE
metrics for all algorithms over a single financial year, FY2016, in
Table 5.2.
Table 5.2: Results for Turnover, 1FY (2016)
The same experiment was carried out using FTE as the target, as it is one of
the most sparse vectors in the entire dataset and has a substantially different
distribution to Turnover.
As illustrated in Figure 5.3, the differences between algorithms are smaller than
for Turnover. Performance still increases as more input features are used, with
the best result achieved by ETR with 14 input features registering a MAE of
0.060. This value is very close to ETR's performance on Turnover with 14 input
features (0.063). However, with only 3 input features, ETR's MAE on FTE
(0.079) is far superior to its MAE on Turnover (0.316).
The same pattern applies to most algorithms and can be looked at in terms of
improvement as more features are added. For BR, ETR and RF, moving from
3 to 7 features improves MAE by 8.4-11.5%, while from 3 to 14 features
improves MAE by 20.5-24.2%. These ranges are much lower than those
observed for the same algorithms applied to Turnover (45.8-47.0% and
80.3-80.6%), as we have seen earlier. The improvement for LR is also very modest
this time, 0.8% from 3 to 7 features, and 3.0% from 3 to 14 features.
The differences in results obtained across targets with different distributions
help us assess the resilience of the algorithms and hence their potential
applicability to other microdata sets. In essence, the best performing algorithms
manage to reach similar levels of performance as more features are added,
indicating that using more features is indeed useful. However, in some cases,
the gain in performance may be modest, in which case fewer features may be
used to decrease processing time.
Figure 5.3: MAE of FTE prediction
Source: BLADE
Figure 5.4: Processing time (log-seconds)
Source: BLADE
6. Discussion
The experiment presented in this paper demonstrates the benefits of using
machine learning-based imputation algorithms on national microdata sets such
as BLADE. The high-performance outcomes achieved should encourage
statistical and government agencies to reliably improve their imputation for
greater data coverage. Our results help practitioners make the best decisions
in terms of algorithms and input features, based on their dataset and analysis
needs, while understanding the impact of different imputation methods.
We also quantified how more input features could substantially improve the
imputation performance. Interestingly, the benefits were less pronounced for
FTE, possibly because (i) less training data are available, only about a third of
Turnover; and (ii) FTE has a more complicated, non-linear relationship to the
input features, because part-time effort may not be reflected linearly in Turnover.
The main limitation of our work stems from keeping the process simple to
ensure easy adoption and higher generalisability. However, tuning the
algorithms' hyper-parameters to each dataset could substantially improve
imputation performance. It may also dramatically reduce processing time.
Another potential limitation lies in using a logarithmic transformation to
address data skewness. Practitioners will need to adapt scaling techniques to
the characteristics of their data.
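One common realisation of such a transformation uses log1p/expm1, which handles zero values gracefully; the figures below are invented for illustration:

```python
import numpy as np

# Right-skewed, non-negative values (e.g. dollar amounts) are compressed
# with log1p before modelling, then mapped back with expm1 afterwards.
turnover = np.array([0.0, 1_000.0, 25_000.0, 4_000_000.0])

log_scaled = np.log1p(turnover)        # train/impute on this scale
recovered  = np.expm1(log_scaled)      # invert after imputation

print(np.allclose(recovered, turnover))  # round trip is lossless
```

Whether log1p, a Box-Cox transform or no scaling at all is appropriate depends on the skew and zero-inflation of the target at hand.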
7. Conclusion
We conducted a comprehensive experimental evaluation of machine learning-
based imputation algorithms on the Australian Government's national statistical
asset -- BLADE. Using two target features with distinct characteristics,
Turnover and FTE, we compared 12 machine learning-based imputation
algorithms and found that the extra trees regressor, bagging regressor and
random forest consistently maintain high imputation performance over the
benchmark linear regression across the performance metrics outlined at
Section 4.3.
8. References
Australian Bureau of Statistics. (2019). The Business Longitudinal Analysis
Data Environment (BLADE).
Bakar, K., & Jin, H. (2019). Areal prediction of survey data using Bayesian
spatial generalised linear models. Communications in Statistics - Simulation
and Computation, 1-16.
Jin, H., Wong, M., & Leung, K. (2005). Scalable model-based clustering for
large databases based on data summarization. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 27(11), 1710-1719.
Khan, S., Ahmad, A., & Mihailidis, A. (2019). Bootstrapping and multiple
imputation ensemble approaches for missing data. Journal of Intelligent and
Fuzzy Systems.
Little, R., & Rubin, D. (2014). Statistical analysis with missing data. New York:
John Wiley.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O.,
. . . Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of
Machine Learning Research, 12, 2825-2830.
Yoon, J., Jordon, J., & van der Schaar, M. (2018). GAIN: Missing data
imputation using generative adversarial nets. In Proceedings of the 35th
International Conference on Machine Learning.
ABS Disclaimer
The results of this study are based, in part, on ABR data supplied by the Registrar
to the ABS under A New Tax System (Australian Business Number) Act 1999
and tax data supplied by the ATO to the ABS under the Taxation Administration
Act 1953. These require that such data is only used for the purpose of carrying
out functions of the ABS. No individual information collected under the Census
and Statistics Act 1905 is provided back to the Registrar or ATO for
administrative or regulatory purposes. Any discussion of data limitations or
weaknesses is in the context of using the data for statistical purposes, and is
not related to the ability of the data to support the ABR or ATO’s core
operational requirements. Legislative requirements to ensure privacy and
secrecy of this data have been followed. Only people authorised under the
Australian Bureau of Statistics Act 1975 have been allowed to view data about
any particular firm in conducting these analyses. In accordance with the
Census and Statistics Act 1905, results have been confidentialised to ensure
that they are not likely to enable identification of a particular person or
organisation.