Automated Data Preprocessing For Machine Learning Based Analyses

Abstract—Data preprocessing is crucial for Machine Learning (ML) analyses, as the quality of the data can strongly influence model performance. In recent years, we have witnessed numerous works on performance enhancement, such as AutoML libraries for tabular datasets; however, the field of data preprocessing has not seen major advancement. AutoML libraries and baseline models like RandomForest are known for their easy-to-use implementation, with data cleaning and categorical encoding as the only required steps. In this paper, we investigate advanced preprocessing steps such as feature engineering, feature selection, target discretization, and sampling for analyses on tabular datasets. Furthermore, we propose an automated pipeline for these advanced preprocessing steps, which is validated using RandomForest as well as AutoML libraries. The proposed preprocessing pipeline can be used with any ML algorithm and can be bundled into a Python package. The pipeline also includes a novel sampling method, "Bin-Based sampling", which can be used for general-purpose data sampling. The validity of these preprocessing methods has been assessed on OpenML datasets using appropriate metrics such as Kullback-Leibler (KL) divergence, accuracy score, and r2-score. Experimental results show significant performance improvements when modeling with baseline models such as RandomForest and marginal improvements when modeling with AutoML libraries.

Index Terms—AutoML; Preprocessing; Feature Engineering; Feature Generation; Feature Selection; Sampling.

I. INTRODUCTION

Data preprocessing is a crucial step in Machine Learning (ML), as the quality of the data can have a significant influence on model performance. Preprocessing is performed to prepare a compatible dataset for analysis as well as to improve the performance of the ML model. Preprocessing steps can be roughly categorized into two types: model-compatible preprocessing (Type 1) and quality-enhancement preprocessing (Type 2). A common example of a model-compatible preprocessing step is the encoding of string values to either Label-Encoded or One-Hot-Encoded values, based on the model requirements. Preprocessing steps like data cleaning and missing-value imputation fall into the Type 1 category, while other generic preprocessing steps like standardization, normalization, and cyclic transformation fall into the Type 2 category. The focus of this paper is Type 2 preprocessing for supervised learning on tabular datasets.

In recent years, the ML field for tabular datasets has been heavily researched for performance enhancement of ML-based models, especially automated Machine Learning (AutoML) [6]-[8]. However, only a few papers [1][2] have investigated advanced Type 2 preprocessing steps (mainly feature generation), and the validation of these preprocessing steps together with AutoML libraries has not been studied yet. We have considered three relevant AutoML libraries, namely AutoGluon [8], AutoSklearn [6], and H2O [7]. The preprocessing steps supported by these AutoML libraries are summarized in Table I, from which it can be inferred that advanced preprocessing steps like feature engineering and feature selection have not been implemented in these libraries. In this paper, we first investigate some advanced Type 2 preprocessing steps and later propose an automated preprocessing pipeline based on our research. A validation study of the proposed pipeline is conducted on both the baseline model and the AutoML libraries.

The automated preprocessing pipeline is designed with the main objective of automatically generating new features from existing input features to improve the performance metric. If the dataset is large, feature engineering can take a long computational time; the need for research on sampling is therefore evident, and we propose the Bin-Based sampling method as an alternative to Random Sampling. Before proceeding with feature engineering, unnecessary, irrelevant, and highly insignificant features are removed, since these features have neutral or significantly low information gain for the target variable. These three techniques, viz. feature engineering, feature selection, and sampling, are the main aspects of this paper. Therefore, in Section II, related work for these three techniques is briefly presented. In Section III, the methodology is presented. In Section IV, the experiments and the results obtained are tabulated. We conclude the work in Section V.

II. RELATED WORK

A. Feature Engineering

Feature engineering is the process of generating new features with the help of domain knowledge. The construction of novel features for the enhancement of predictive learning is time-intensive and often requires field expertise. With the appropriate addition of features, predictive models can show significant performance improvement. Cognito by Khurana et al. [1] demonstrated a novel method for automated feature engineering in supervised learning. Cognito performs row-wise transforms over instances for all valid features, each
TABLE I
PREPROCESSING STEPS INCLUDED IN DIFFERENT AUTOML SOLUTIONS
producing a new column or columns. The number of possible transformations is an unbounded space, considering the various combinations of features; these function transforms can be unary, binary, or multiple transforms [1]. As the number of transforms can grow exponentially with the number of input columns, Cognito includes a pruning step for feature selection to keep the dataset at a manageable size.

Katz et al. introduced the framework ExploreKit [2] for automated feature generation. They demonstrate the discovery of new features using operators such as inverse, addition, multiplication, and division, as well as higher-order operators. The huge number of features generated by ExploreKit is pruned and validated using a Ranking Model: in the proposed two-step approach, the features generated in the first step are ranked based on meta-features in the second step.

Galhotra et al. [3] focus on an automated method that utilizes domain-structured knowledge to perform feature addition. They further developed the tool KAFE (Knowledge Aided Feature Engineering) to attain knowledge about similar analyses from 25 million tables available on the internet. Hoag et al. presented a neural network approach to generate new features from relational databases [4]; they use a set of Recurrent Neural Networks (RNNs) that takes a sequence of vectors as input and outputs a vector of newly generated features. The Data Science Machine [5] implements a deep feature engineering algorithm for relational databases and cannot be generalized to tabular datasets.

B. Feature selection

In contrast to feature engineering, where new features are generated from the existing ones, feature selection implies selecting useful features from the available set of input features, i.e., a subset of features. Tereno et al. [10] consider various search strategies for feature selection, namely heuristic and probabilistic, and illustrate the importance of removing unnecessary features based on class-separability measures. The elimination of non-relevant features, or features with negligible importance, can significantly reduce computation time and resources [10]. Aliferis et al. [11] introduced an algorithmic framework to learn the local causal structure around the target variable, which is later used to select features. The most popular approaches for feature selection include correlation, the Bayesian error rate, information gain, entropy measures, etc. [12]. Elssied et al. [13] demonstrate the use of a one-way Analysis of Variance (ANOVA) F-test for feature selection in the context of email spam classification.

C. Sampling

ML algorithms should ideally be trained on the complete dataset, because a larger amount of data can improve performance; a sample set, however, can help to get a quick overview of data quality as well as to determine its characteristics. The popular approach to sampling is Random Sampling, with or without replacement [15]. The literature on sampling dates back to 1980, when Cochran [14] first introduced the concept of stratified sampling. Stratified sampling divides the data into homogeneous subgroups, from which samples are later drawn at random. Rojas et al. [15] concluded in their survey that the majority of data scientists use random sampling, stratified sampling, or sampling by hand. Section III-B elaborates on the stratified sampling technique in detail.

III. METHODOLOGY

In this paper, we present an automated preprocessing pipeline that includes advanced preprocessing methods which are generally not available in AutoML libraries. The aim of this research is to develop an automated pipeline that can improve the performance of predictive modeling on tabular datasets. The proposed pipeline can be used with any ML algorithm as well as with AutoML libraries. The novel aspects of this paper are as follows:

• Hybrid Feature Engineering (HFE) method
• Generalized automated preprocessing pipeline
• Sampling technique

The proposed preprocessing pipeline is validated by analyzing its implementation on OpenML datasets [18]. This section consists of four preprocessing steps, each representing an element of the proposed pipeline, namely feature selection, sampling, target discretization, and feature engineering. These steps are described in detail below, together with their pseudo-algorithms.

A. Feature selection

Feature selection implies selecting important features from the list of input features or, in other words, eliminating less significant features. As mentioned in Section II, a popular approach to feature selection is the correlation coefficient. In this paper, we present a mixture of variance and correlation analysis for feature selection. Inspired by [10], feature selection has been categorized into three parts:

• removal of redundant features

Fig. 1. Block diagram of the preprocessing pipeline together with the integration of the AutoML module
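As an illustration, the variance-and-correlation analysis described above can be sketched as follows; the helper name `select_features` and the threshold values are illustrative assumptions, not the exact settings used in the proposed pipeline.

```python
import numpy as np
import pandas as pd

def select_features(df: pd.DataFrame,
                    var_threshold: float = 0.0,
                    corr_threshold: float = 0.95) -> pd.DataFrame:
    """Sketch of a variance + correlation filter (thresholds are assumptions)."""
    num = df.select_dtypes(include=np.number)
    # Variance analysis: drop near-constant columns, which carry
    # almost no information gain for the target variable.
    num = num[num.columns[num.var() > var_threshold]]
    # Correlation analysis: drop one column from every highly
    # correlated pair, since such features are redundant.
    corr = num.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    redundant = [c for c in upper.columns if (upper[c] > corr_threshold).any()]
    return num.drop(columns=redundant)

# Example: 'b' is perfectly correlated with 'a', and 'c' is constant.
data = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [5, 5, 5, 5]})
print(list(select_features(data).columns))  # → ['a']
```

The same two-stage idea (cheap variance filter first, pairwise correlation pruning second) keeps the quadratic correlation step small, which matters for the wide datasets the pipeline targets.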
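Similarly, the stratified sampling technique summarized in Section II-C can be sketched with pandas; the column name, sampling fraction, and seed below are illustrative assumptions.

```python
import pandas as pd

def stratified_sample(df: pd.DataFrame, by: str, frac: float,
                      seed: int = 0) -> pd.DataFrame:
    """Draw the same fraction at random from each homogeneous subgroup."""
    return df.groupby(by).sample(frac=frac, random_state=seed)

# Example: 60 rows of class 'x' and 40 of class 'y'; a 50% stratified
# sample preserves the 60/40 class ratio exactly.
df = pd.DataFrame({"label": ["x"] * 60 + ["y"] * 40, "value": range(100)})
sample = stratified_sample(df, by="label", frac=0.5)
print(sample["label"].value_counts().to_dict())  # → {'x': 30, 'y': 20}
```

Plain random sampling of 50 rows would only preserve this ratio in expectation; stratification guarantees it per subgroup, which is why it is a natural baseline to compare any new sampling method against.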