
COLLA 2022 : The Twelfth International Conference on Advanced Collaborative Networks, Systems and Applications

Automated Data Preprocessing for Machine Learning Based Analyses

Akshay Paranjape, Praneeth Katta, and Markus Ohlenforst
IconPro GmbH, Aachen, Germany
[email protected], [email protected], [email protected]

Abstract—Data preprocessing is crucial for Machine Learning (ML) analysis, as the quality of data can highly influence model performance. In recent years, we have witnessed numerous literature works on performance enhancement, such as AutoML libraries for tabular datasets; however, the field of data preprocessing has not seen major advancement. AutoML libraries and baseline models like RandomForest are known for their easy-to-use implementation, with data cleaning and categorical encoding as the only required steps. In this paper, we investigate advanced preprocessing steps such as feature engineering, feature selection, target discretization, and sampling for analyses on tabular datasets. Furthermore, we propose an automated pipeline for these advanced preprocessing steps, which is validated using RandomForest as well as AutoML libraries. The proposed preprocessing pipeline can be used with any ML-based algorithm and can be bundled into a Python package. The pipeline also includes a novel sampling method, "Bin-Based sampling", which can be used for general-purpose data sampling. The validity of these preprocessing methods has been assessed on OpenML datasets using appropriate metrics such as Kullback-Leibler (KL) divergence, accuracy score, and r2-score. Experimental results show significant performance improvement when modeling with baseline models such as RandomForest and marginal improvements when modeling with AutoML libraries.

Index Terms—AutoML; Preprocessing; Feature Engineering; Feature Generation; Feature Selection; Sampling.

I. INTRODUCTION

Data preprocessing is a crucial step in Machine Learning (ML), as the quality of data can have a significant influence on model performance. Preprocessing is performed to prepare a compatible dataset for analysis as well as to improve the performance of the ML model. Preprocessing steps can be roughly categorized into two types: model-compatible preprocessing (Type 1) and quality-enhancement preprocessing (Type 2). A common example of a model-compatible preprocessing step is the encoding of string values to either Label-Encoded values or One-Hot-Encoded values, depending on the model requirements. Preprocessing steps like data cleaning and missing-value imputation fall into the Type 1 category, while other generic preprocessing steps like standardization, normalization, and cyclic transformation fall into the Type 2 category. The focus of this paper is Type 2 preprocessing for supervised learning on tabular datasets.
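To make the Type 1 / Type 2 distinction concrete, the following minimal Python sketch applies a Type 1 step (label encoding of a string column) and a Type 2 step (standardization of a numerical column); the toy DataFrame and its column names are invented for illustration and are not from the paper.

```python
# Minimal illustration of Type 1 vs. Type 2 preprocessing.
# The DataFrame and column names are hypothetical.
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

df = pd.DataFrame({
    "machine_no": ["A", "B", "A", "C"],        # categorical feature
    "temperature": [20.1, 25.3, 22.7, 30.2],   # numerical feature
})

# Type 1 (model compatibility): encode strings as integers so that
# the model can consume the column at all.
df["machine_no"] = LabelEncoder().fit_transform(df["machine_no"])

# Type 2 (quality enhancement): standardize a numerical feature; the
# model would also run without this step, but may perform better with it.
df[["temperature"]] = StandardScaler().fit_transform(df[["temperature"]])
print(df)
```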


In recent years, the ML field for tabular datasets has been heavily researched for performance enhancement of ML-based models, especially automated Machine Learning (AutoML) [6]-[8]. However, only a few papers [1][2] have investigated advanced Type 2 preprocessing steps (mainly feature generation), and the validation of these preprocessing steps together with AutoML libraries has not been studied yet. We have considered three relevant AutoML libraries, namely AutoGluon [8], AutoSklearn [6], and H2O [7]. The different preprocessing steps supported by AutoML libraries are summarized in Table I. It can be inferred from Table I that advanced preprocessing steps like feature engineering and feature selection have not been implemented in these AutoML libraries. In this paper, we first investigate some advanced Type 2 preprocessing steps and later propose an automated preprocessing pipeline based on our research. A validation study of the proposed pipeline is conducted on both the baseline model and AutoML libraries.

The automated preprocessing pipeline is designed with the main objective of automatically generating new features from existing input features to improve the performance metric. If the dataset size is large, feature engineering can take a long computational time; research in the sampling field is therefore required, for which we propose the Bin-Based sampling method as an alternative to Random Sampling. Before proceeding with feature engineering, unnecessary, irrelevant, and highly insignificant features are removed, since these features have neutral or significantly low information gain for the target variable. These three techniques, viz. feature engineering, feature selection, and sampling, are the main aspects of this paper. Therefore, in Section II, related work for these three techniques is briefly presented. In Section III, the methodology is presented. In Section IV, the experiments and the obtained results are tabulated. We conclude the work in Section V.

TABLE I
PREPROCESSING STEPS INCLUDED IN DIFFERENT AUTOML SOLUTIONS

Step                          | AutoSklearn                                            | AutoKeras                     | TPOT              | AutoGluon        | H2O
Balancing                     | yes                                                    | no                            | no                | yes              | yes
Categorical encoding          | yes                                                    | yes                           | yes               | yes              | yes
Imputation                    | yes                                                    | yes                           | no                | yes              | yes
Standardization/Normalization | yes                                                    | yes                           | no                | yes              | yes
Others                        | Densifier, PCA, minority coalescence, select percentile | Data augmentation             | Introduce "unknown category" | Feature selector | None

II. RELATED WORK

A. Feature Engineering

Feature engineering is the process of generating new features with the help of domain knowledge. The construction of novel features for the enhancement of predictive learning is time-intensive and often requires field expertise. With the appropriate addition of features, predictive models can show significant performance improvement. Cognito by Khurana et al. [1] demonstrated a novel method for automated feature engineering in supervised learning. Cognito performs row-wise transforms over instances for all valid features, each producing one or more new columns. The number of possible transformations forms an unbounded space, considering the various combinations of features; these function transforms can be unary, binary, or multiple transforms [1]. As the number of transforms can increase exponentially with the number of input columns, Cognito includes a pruning step for feature selection to ensure a manageable dataset size.

Katz et al. introduced the framework ExploreKit [2] for automated feature generation. They demonstrate the discovery of new features by using unary operators such as inverse, addition, multiplication, division, etc., as well as higher-order operators. The huge number of features generated in ExploreKit is pruned and validated using a Ranking Model. ExploreKit proposes a two-step approach in which the features generated in the first step are ranked based on meta-features in the second step.

Galhotra et al. [3] focus on an automated method that utilizes structured domain knowledge to perform feature addition. They further developed the tool KAFE (Knowledge Aided Feature Engineering) to attain knowledge about similar analyses from 25 million tables available on the internet. Lam et al. presented a neural network approach to generate new features from relational databases [4]; they use a set of Recurrent Neural Networks (RNNs) that take as input a sequence of vectors and output a vector of newly generated features. The Data Science Machine [5] implements a deep feature synthesis algorithm for relational databases and cannot be generalized to tabular datasets.

B. Feature Selection

In contrast to feature engineering, where new features are generated from existing ones, feature selection means selecting useful features from the available set of input features, i.e., a subset of features. Terano et al. [10] consider various search strategies for feature selection, namely heuristic and probabilistic, and illustrate the importance of removing unnecessary features based on class-separability measures. The elimination of non-relevant features or features with negligible importance can significantly reduce computation time and resources [10]. Aliferis et al. [11] introduced an algorithmic framework that learns the local causal structure around the target variable, which is later used to select features. Popular approaches for feature selection include correlation, Bayesian error rate, information gain, entropy measures, etc. [12]. Elssied et al. [13] demonstrate the use of a one-way Analysis of Variance (ANOVA) F-test for feature selection in the context of email spam classification.

C. Sampling

Ideally, ML algorithms should be trained on the complete dataset, because a higher amount of data can improve performance; a sample set, however, helps to get a quick overview of data quality and to determine its characteristics. The most popular approach is random sampling with or without replacement [15]. The sampling literature goes back to Cochran [14], who formalized the concept of stratified sampling. Stratified sampling divides the population into homogeneous subgroups, and the data is then randomly sampled from these subgroups. Rojas et al. [15] concluded in their survey that the majority of data scientists use random sampling, stratified sampling, or sampling by hand. Section III-B elaborates on the stratified sampling technique in detail.

III. METHODOLOGY

In this paper, we present an automated preprocessing pipeline that includes advanced preprocessing methods which are generally not available in AutoML libraries. The aim of this research is to develop an automated pipeline that can improve the performance of predictive modeling on tabular datasets. The proposed pipeline can be used with any ML algorithm as well as with AutoML libraries. The novel aspects of this paper are:
• the Hybrid Feature Engineering (HFE) method,
• a generalized automated preprocessing pipeline, and
• a new sampling technique (Bin-Based sampling).
The proposed preprocessing pipeline is validated by analyzing its implementation on OpenML datasets [18]. This section consists of four preprocessing steps, each representing an element of the proposed pipeline, namely feature selection, sampling, target discretization, and feature engineering. These steps are described in detail below together with their pseudo-algorithms.


Fig. 1. Block diagram of the preprocessing pipeline together with the integration of the AutoML module.

A. Feature Selection

Feature selection means selecting important features from the list of input features or, in other words, eliminating less significant features. As mentioned in Section II, a popular approach for feature selection is the correlation coefficient. In this paper, we present a mixture of variance and correlation analysis for feature selection. Inspired by [10], feature selection has been categorized into three parts:
• removal of redundant features,
• removal of highly correlated features, and
• elimination of insignificant features, with a one-way ANOVA F-test for classification and the correlation coefficient for regression.

1) Redundant features: For each categorical input feature, the number of categories is measured. Two extreme cases are eliminated based on the number of unique values:
• constant features: single-category features, e.g., Machine No., Nationality;
• features with the number of categories equal to the number of samples, e.g., Name, Email Id.
In both cases, the information gain is zero and hence the features are removed.

2) Correlation Threshold: In this step, a one-dimensional correlation among the input features is calculated. Pearson's correlation coefficient [19], as shown in (1), is computed for all input features, and features are merged based on their correlation coefficient: features with a correlation coefficient close to 1 provide similar information for modeling and can thus be removed to save computational power. Equation (1) shows the correlation between two features denoted as x and y:

    r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}    (1)

where x_i, y_i are the i-th elements of features x and y.
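As an illustration, this correlation-threshold step could be implemented as follows. This is a minimal sketch; the 0.95 cutoff is an assumed value for a coefficient "close to 1", not one prescribed by the paper.

```python
# Sketch of the correlation-threshold step: compute pairwise Pearson
# correlations and drop one feature of every highly correlated pair.
import numpy as np
import pandas as pd

def drop_correlated_features(df: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    corr = df.corr(numeric_only=True).abs()
    # Keep only the upper triangle (k=1) so every pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)
```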
3) Analysis of Variance: After the elimination of insignificant features in steps 1 and 2, the remaining features are analyzed with a one-way ANOVA test for classification tasks. Each feature is divided into subgroups corresponding to its target value, and the one-way analysis of variance is carried out with these subgroups to find the F-value, as described in (2):

    SSR = \sum_{i=1}^{k} n_i (\bar{X}_o - \bar{X}_i)^2
    SSE = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (\bar{X}_i - X_{ij})^2
    SST = SSR + SSE
    df_r = k - 1
    df_e = \sum_{i=1}^{k} n_i - k
    MSR = SSR / df_r
    MSE = SSE / df_e
    F\text{-value} = MSR / MSE    (2)

where k is the number of groups, n_i is the number of samples in group i, \bar{X}_o is the global mean over all samples of the feature, \bar{X}_i is the mean of the samples in subgroup i, df_r is the regression degree of freedom, and df_e is the error degree of freedom. MSR is the regression mean square, MSE is the error mean square, SSR is the regression sum of squares, SSE is the error sum of squares, and SST is the total sum of squares.

For regression datasets, the F-value is calculated via a univariate linear regression test. Once the F-values are available for all features, the features are ranked based on their p-values, as suggested by Poole [9]. Experimentally, it is found that a relative comparison of p-values is a better evaluation of feature significance. The features are arranged in decreasing order of -log(p-value), and the maximum drop of -log(p-value) over the features is located by computing second-order gradients; this maximum drop is taken as the threshold for that particular dataset, and features whose -log(p-value) lies below it are considered insignificant. In [9], a p-value of 0.05 is considered an appropriate threshold for flagging insignificant features, but experimentally it is found that a few datasets have many features with p-values > 0.05. To avoid removing too many features because of such a static threshold, the second-order gradient method is used: it detects the maximum drop of the -log(p-value) curve, so that the threshold for the p-value is set dynamically. The least important features are identified with the threshold as follows:

    \text{threshold} = \arg\max \nabla^2\!\left(-\log(p\text{-values}_{\text{sorted}})\right)
    \text{features} = \text{features}(p\text{-value} < \text{threshold})    (3)

Based on the above three methods, the most important features are selected for further processing.
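One possible realization of this dynamic threshold is sketched below: the per-feature p-values come from scikit-learn's one-way ANOVA F-test (f_classif; f_regression would play the same role for regression), and the cutoff is located at the steepest drop of the sorted -log(p) curve via a twice-applied gradient. The exact gradient handling is an assumption, since the paper gives no implementation details.

```python
# Sketch of feature-significance ranking with a dynamically set threshold.
import numpy as np
from sklearn.feature_selection import f_classif  # f_regression for regression

def select_significant(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    _, pvals = f_classif(X, y)                     # one-way ANOVA per feature
    neg_log_p = -np.log(np.clip(pvals, 1e-300, None))
    order = np.argsort(neg_log_p)[::-1]            # most significant first
    curve = neg_log_p[order]
    # A second-order gradient locates the sharpest drop in the sorted curve.
    drop = np.argmax(np.gradient(np.gradient(curve)))
    return order[: drop + 1]                       # indices of kept features
```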


B. Sampling

The quality of a sampling method can be assessed based on the divergence of the sampled dataset from the original distribution. As mentioned in Section II, the most widely used sampling techniques are random sampling, stratified sampling, and hand-picked sampling [15]. Stratified sampling is the basis for our proposed approach, Bin-Based sampling.

1) Stratified Sampling: Stratified sampling [14] by Cochran is a well-known sampling technique that closely preserves the statistics of the original distribution. In stratified sampling, a population is divided into sub-populations (strata), followed by random sampling from these strata. The proportionate allocation of sampling steps for the creation of strata is summarized in Algorithm 1; the sub-populations are formed by nested grouping of the columns. One disadvantage of stratified sampling is its impractical runtime, which makes it difficult to use in real-time applications. Therefore, we propose a new sampling technique, Bin-Based sampling. Its two major advantages are that it is
• faster than stratified sampling, and
• preserves the original distribution better than random sampling.

Algorithm 1: Stratified sampling
    Function Subsets(Data):
        for category/bin in feature_i from Data do
            if number of features in Data > 1 then
                Strata <- Subsets(Data.drop(feature_i))
            else
                Strata <- category/bin
        return Strata

    Strata <- Subsets(Data)
    for stratum in Strata do
        stratumSamples <- RandomSample(stratum)
        StratifiedSample <- Concat(stratumSamples)
    Result: StratifiedSample

2) Bin-Based sampling: The motivation behind Bin-Based sampling is to reduce the time complexity while maintaining the original population statistics. To achieve this, the input features are divided into different bins based on their distribution. After the binning process, random samples are drawn from each bin for every feature, and the union of the sample collections from every feature forms the Bin-Based-sampled population. Figure 2 shows the histogram of a feature before and after sampling: the distribution is preserved with Bin-Based sampling, and the method simplifies the probability of choosing a sample by lowering the number of possibilities. The joint conditional probabilities of stratified sampling are simplified into the conditional probabilities of each feature, as shown in (4):

    P_{\text{random}}(s) = \frac{1}{N}
    P_{\text{bin-based}}(s) = \frac{P(s \mid b_i)\, P(b_i)}{P(b_i \mid s)}
    P_{\text{bin-based}}(s) = P(s \mid b_i) = \frac{1}{\text{size}(b_i)}    (4)

where N denotes the number of samples, s is a sample point, and b_i denotes the i-th bin. The pseudo-algorithm for Bin-Based sampling is given in Algorithm 2.

Fig. 2. Bin-Based sampling: comparison of the distribution of a feature before and after Bin-Based sampling (left: original distribution; right: sampled distribution).

Algorithm 2: Bin-Based sampling
    for Feature_i in Features do
        if Feature_i is categorical then
            for Category in Categories(Feature_i) do
                Sample <- RandomSample(Category)
            end
            featureSample <- Concat(Sample)
        else
            Bins <- Discretize(Feature_i)
            for Bin in Bins do
                Sample <- RandomSample(Bin)
            end
            featureSample <- Concat(Sample)
        end
    end
    BinBasedSample <- Concat(featureSample)
    Result: BinBasedSample

3) Sampling size: An optimal sampling size should ensure that the information loss is minimal. Cochran [14] states an optimal size for the sampled population based on the size of the population:

    n_0 = \frac{Z^2\, p\,(1 - p)}{e^2}    (5)

where e is the desired level of precision (i.e., the margin of error), p is the estimated proportion of the population, and Z is the z-score value, defaulting to the 95% level (normal-table area 0.475). For Bin-Based sampling, a sample size at this 95% level is chosen, as suggested by Cochran [14].
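The following sketch implements Bin-Based sampling in the spirit of Algorithm 2, together with Cochran's sample-size formula (5). The bin count, the per-bin sample size, and the use of the conventional z = 1.96 for the 95% level (the 0.475 in the text is the corresponding normal-table area) are assumptions made for the demo.

```python
# Sketch of Bin-Based sampling (Algorithm 2) plus Cochran's formula (5).
import numpy as np
import pandas as pd

def cochran_size(e: float = 0.05, p: float = 0.5, z: float = 1.96) -> int:
    """n0 = Z^2 * p * (1 - p) / e^2, cf. equation (5)."""
    return int(np.ceil(z ** 2 * p * (1 - p) / e ** 2))

def bin_based_sample(df: pd.DataFrame, n_bins: int = 10,
                     per_bin: int = 5, seed: int = 0) -> pd.DataFrame:
    rng = np.random.default_rng(seed)
    chosen = set()
    for col in df.columns:
        if df[col].dtype == object:      # categories act as their own bins
            groups = df.groupby(col).groups
        else:                            # numerical feature: equal-width bins
            groups = df.groupby(pd.cut(df[col], bins=n_bins), observed=True).groups
        for idx in groups.values():      # random draw from every bin
            take = min(per_bin, len(idx))
            chosen.update(rng.choice(idx, size=take, replace=False))
    return df.loc[sorted(chosen)]        # union of samples over all features
```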


C. Target Discretization

With the help of target discretization, a numerical output feature can be converted into categorical values, thereby transforming a regression task into a classification task. Based on the baseline regression model, a regression task is converted into a classification task if the regression r2-score is significantly low or unacceptable. The prediction of categorical values has fewer degrees of freedom than the prediction of numerical values; taking advantage of this fact, a classification analysis with AutoML might give a reasonable classification accuracy rather than the dataset being declared unsuitable for analysis. Here, each data point in the continuous domain is converted into a discrete class domain. Different types of target discretization methods can be considered based on domain expertise. As an automated solution, we consider the discretization of the target variable based on its z-score values.
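A minimal sketch of such z-score-based discretization is given below; the class boundaries at plus/minus one standard deviation and the class labels are assumed choices, as the paper does not specify the bin edges.

```python
# Sketch of z-score based target discretization (Section III-C).
import numpy as np
import pandas as pd

def discretize_target(y: pd.Series) -> pd.Series:
    z = (y - y.mean()) / y.std()             # standardize the target
    edges = [-np.inf, -1.0, 1.0, np.inf]     # assumed class boundaries
    return pd.cut(z, bins=edges, labels=["low", "mid", "high"])
```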
D. Feature Engineering

Feature engineering is the process of generating new features or transforming existing features using domain knowledge, and it is performed to leverage the performance of the ML model. Since domain knowledge is not always available, we suggest an automated approach to feature engineering, Hybrid Feature Engineering, inspired by Cognito [1] and ExploreKit [2]. These are briefly described below.

1) Cognito: As mentioned in Section II, Cognito generates new features by performing unary and binary operations on the input features. As the number of input features and transform operators increases, the number of newly generated features grows exponentially, in the order of O(f * d^{k+1}), where f is the number of features, d is the number of combinations, and k is the number of transforms. For a feature D and transforms tau_1 and tau_2, the features tau_1(D), tau_2(D), and tau_1 tau_2(D) are generated. These generated features have to be pruned to reduce the complexity of the problem; feature selection is done on the generated features using information gain as a proxy measure of accuracy.

2) ExploreKit: ExploreKit [2] takes a similar approach to feature engineering. Together with unary and binary operators, ExploreKit considers higher-order operators. Each generated feature is added to the dataset, and the rank of the feature is determined with a Ranking Model, which ranks the newly generated features based on either the accuracy score (classification) or the r2-score (regression). Low-rank features are removed and high-rank features are added to the original dataset so that more information can be extracted.

3) Hybrid Feature Engineering: Inspired by Cognito and ExploreKit, we propose a new Hybrid Feature Engineering approach, which can be run in real-time thanks to a modified ranking algorithm and the additional use of the Bin-Based sampling method. In Hybrid Feature Engineering, new features are generated by applying a single transformation to one or two features at a time; the transformation is either a unary or a binary operator. For a feature D and a unary transform tau, the feature tau(D) is generated; for features D_1, D_2 and a binary transform tau, the feature tau(D_1, D_2) is generated. These features are subjected to feature selection as described in Section III-A instead of being passed directly to the Ranking Model. After feature selection, most of the insignificant features are eliminated and we are left with a relatively small number of features. These features are then ranked with the Ranking Model, and high-ranked features are selected and added to the original dataset.

Consider F = {f_1, f_2, ..., f_n} a set of features and T = {tau_1, tau_2, ..., tau_m} a set of transform functions; F' = F x T denotes the set of newly generated features. With the help of the feature selection technique, the most significant features I, a subset of F x T, can be selected from the generated features. After feature selection, the set I is fine-tuned with the Ranking Model R (cf. Algorithm 3 and Algorithm 4):

    I \leftarrow R(F \times T)    (6)

Algorithm 3: Hybrid Feature Engineering
    for Operator in unary/binary operators do
        for Feature in numerical features do
            NewFeatures <- Operator(Feature)
        end
        allNewFeatures_Operator <- FeatureSelection(NewFeatures)
    end
    allNewFeatures <- RankingModel(allNewFeatures)
    Result: generated features

Algorithm 4: Ranking Model
    threshold_f <- baseModel(Dataset)
    RankedFeatures <- []
    for i = 0 to len(allNewFeatures) do
        Dataset.append(allNewFeatures[i])
        featureScore <- baseModel(Dataset)
        if featureScore <= threshold_f then
            continue
        else
            RankedFeatures.append(allNewFeatures[i])
    end
    return RankedFeatures
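As an illustration of Algorithm 4, the sketch below scores each candidate feature by appending it to the dataset and re-evaluating a 30-estimator RandomForest with 4-fold cross-validation (the baseline settings reported in Section IV). Treating the baseline score as the acceptance threshold follows the algorithm; the remaining scoring details are assumptions.

```python
# Sketch of the Ranking Model (Algorithm 4) for a classification task.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def rank_new_features(X: np.ndarray, y: np.ndarray, candidates: dict) -> list:
    """candidates maps a feature name to its generated column."""
    model = RandomForestClassifier(n_estimators=30, random_state=0)
    threshold = cross_val_score(model, X, y, cv=4).mean()   # baseline score
    ranked = []
    for name, col in candidates.items():
        X_aug = np.column_stack([X, col])                   # append candidate
        gain = cross_val_score(model, X_aug, y, cv=4).mean() - threshold
        if gain > 0:                                        # keep only improvers
            ranked.append((gain, name))
    ranked.sort(reverse=True)                               # best gain first
    return [name for _, name in ranked]
```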


IV. EXPERIMENTS AND RESULTS

This section describes the various experiments that were conducted and their validation results. The goal of this validation study is to ensure that the proposed methods have a positive influence on the datasets. Validation is done on OpenML datasets [18]. A RandomForest model is used as the baseline model. The proposed auto-preprocessing pipeline is also benchmarked against top-performing AutoML libraries [17], namely AutoSklearn, AutoGluon, and H2O. The results are consistently compared with (w) and without (w/o) the auto-preprocessing pipeline. Figure 1 illustrates the flow of the preprocessing pipeline.

A. Experimental Setup

For each individual experiment, a RandomForest model with 30 estimators was chosen for both classification and regression tasks. K-fold cross-validation was used to benchmark the results. All experiments were conducted on a Linux Virtual Machine with 16 GB of RAM and 4 cores. Experiments were conducted without using dask multiprocessing for the AutoML libraries.

B. Datasets

OpenML datasets were used for benchmarking the results [18]. A comparative study shows that Hybrid Feature Engineering (HFE) performs better than modeling without HFE in around 35% of the cases; no change in performance was observed for the rest of the datasets. This trend can be observed for both classification and regression tasks. It has to be noted that an increase in performance can be expected only if new features remain after the pruning method explained in Section III; only 35% of the OpenML datasets reported new features after the pruning step. The results for these datasets are provided in Tables II-VIII.

C. Feature Selection

A comparative study was conducted to compare the performance with and without feature selection. For all datasets, the performance remains the same after the elimination of less significant features, as described in Section III-A. The reduction in training time is not very significant for the baseline model, as the model has only 30 estimators and the datasets are comparatively small. For MNIST, a total of 64 out of 784 features were eliminated while maintaining the same accuracy. Feature selection is the first step in the preprocessing pipeline. The results of the analysis are summarized in Tables II and III.

TABLE II
VALIDATION OF THE FEATURE SELECTION TECHNIQUE FOR THE CLASSIFICATION TASK

OpenML dataset         | Accuracy w/o feature selection | Accuracy with feature selection | Features removed | Difference in accuracy
11                     | 0.607 | 0.607 | 3 | 0
54                     | 0.753 | 0.753 | 0 | 0
188                    | 0.579 | 0.579 | 0 | 0
333                    | 0.908 | 0.908 | 3 | 0
335                    | 0.977 | 0.977 | 2 | 0
470                    | 0.661 | 0.661 | 4 | 0
1459                   | 0.588 | 0.588 | 0 | 0
1461                   | 0.692 | 0.692 | 2 | 0
23381                  | 0.560 | 0.560 | 5 | 0
amazon-employee-access | 0.943 | 0.943 | 3 | 0
australian             | 0.857 | 0.857 | 2 | 0
bank-marketing         | 0.692 | 0.692 | 2 | 0
credit-g               | 0.761 | 0.761 | 2 | 0
sylvine                | 0.941 | 0.941 | 7 | 0

TABLE III
VALIDATION OF THE FEATURE SELECTION TECHNIQUE FOR THE REGRESSION TASK

OpenML dataset | r2-score w/o feature selection | r2-score with feature selection | Features removed | Difference in r2-score
537 | 0.484 | 0.484 | 0 | 0
495 | 0.616 | 0.616 | 5 | 0
344 | 0.999 | 0.999 | 2 | 0
215 | 0.948 | 0.948 | 1 | 0
189 | 0.579 | 0.579 | 0 | 0
507 | 0.391 | 0.390 | 0 | 0

D. Bin-Based Sampling

Bin-Based sampling is used to reduce the computation time of the baseline model used for feature engineering. Stratified sampling is known to produce a good representation of a population, but at the cost of computation time, with a time complexity of O(n^2), where n is the number of input features. Random sampling, on the other hand, has a complexity of O(1), and Bin-Based sampling, as explained in Section III-B, of O(n/3). The sampling technique is validated using the Kullback-Leibler (KL) divergence (7), considering the original distribution as the reference distribution [16]:

    D_{KL}(P \,\|\, Q) = \int_{-\infty}^{\infty} P(x) \log\frac{P(x)}{Q(x)}\, dx    (7)

where P(x) is the original distribution and Q(x) is the sampled distribution. We can observe that the KL divergence for Bin-Based sampling is much lower than for random sampling. Stratified sampling has the lowest KL divergence, but roughly ten times the computation time, as can be inferred from Table IV. These experiments were conducted on a Linux VM with 256 GB of RAM. The experiments also revealed that Bin-Based sampling fails to perform well for datasets of smaller size: performing a binning operation and extracting random samples from each bin can cause a loss of information. Therefore, sampling is not useful for such small datasets.

TABLE IV
SAMPLING COMPARISON ON OPENML DATASETS, CALCULATED OVER 100 TRIALS

OpenML dataset         | Mean KL-div. (Bin-Based) | Mean KL-div. (Stratified) | Mean KL-div. (Random) | Time in s (Bin-Based) | Time in s (Stratified)
183                    | 0.017 | 0.173  | 0.057 | 0.359 | 5.230
223                    | 0.067 | 0.079  | 0     | 0.273 | 7.600
287                    | 0.076 | 0.356  | 0.027 | 0.399 | 4.807
307                    | 0.0   | 0.097  | 0.006 | 0.214 | 7.572
528                    | 0.0   | 0.0215 | 0.0   | 0.054 | 0.489
537                    | 0.190 | 0.886  | 1.160 | 2.052 | 133.939
550                    | 0.0   | 0.011  | 0.004 | 0.302 | 0.738
amazon-employee-access | 0.019 | 0.460  | 0.753 | 0.466 | 2.112
blood-transfusion      | 0.065 | 0.002  | 0.001 | 0.062 | 0.069
phoneme                | 0.0   | 0.168  | 0.143 | 0.580 | 1.383

E. Hybrid Feature Engineering

Significant performance improvements are achieved for a few OpenML datasets; these are summarized in Tables V and VI. A performance improvement for 35% of the datasets is observed when testing over 50+ OpenML datasets. All tests are performed on a cross-validation split. It should be noted that a performance improvement is achieved only if new significant features are generated after the pruning step. The performance with HFE is higher or similar; in no case did we encounter a decrease in performance.

TABLE V
HYBRID FEATURE ENGINEERING FOR CLASSIFICATION DATASETS WITH THE BASELINE MODEL

OpenML dataset | Features | Classes | Accuracy before | Accuracy after | Gain (%) | New features
188  | 14 | 5  | 0.466 | 0.506 | 8.386 | 2
1461 | 7  | 2  | 0.692 | 0.718 | 4.748 | 2
1459 | 7  | 10 | 0.588 | 0.635 | 7.952 | 1
54   | 18 | 4  | 0.753 | 0.759 | 0.786 | 2

TABLE VI
HYBRID FEATURE ENGINEERING FOR REGRESSION DATASETS WITH THE BASELINE MODEL

OpenML dataset | Features | r2-score before | r2-score after | Gain (%) | New features
189 | 8  | 0.579 | 0.615 | 6.227 | 1
507 | 6  | 0.390 | 0.411 | 5.361 | 1
537 | 8  | 0.484 | 0.494 | 2.000 | 1
495 | 13 | 0.616 | 0.632 | 2.700 | 2

F. Overall Pipeline

The auto-preprocessing pipeline, as shown in Figure 1, is used together with the AutoML libraries AutoGluon, AutoSklearn, and H2O for benchmarking. For the datasets of Tables V and VI, the improvements achieved with the AutoML libraries are not as significant as those over the baseline model; the results with the AutoML libraries are shown in Tables VII and VIII. The stopping criterion for training the AutoML libraries is a runtime limit: benchmarking is done on a single core, the runtime across all AutoML libraries is set to 10 minutes, and the results are 4-fold cross-validated. Overall, the combination of auto-preprocessing and AutoML libraries performs better than or similar to "only AutoML libraries". As shown in Table I, the AutoML libraries do not include feature engineering in their preprocessing.

TABLE VII
OVERALL PREPROCESSING PIPELINE PERFORMANCE COMPARED WITH AUTOML LIBRARIES (CLASSIFICATION, ACCURACY; w = WITH, w/o = WITHOUT AUTO-PREPROCESSING)

OpenML dataset | AutoGluon w/o | AutoGluon w | AutoSklearn w/o | AutoSklearn w | H2O w/o | H2O w | RandomForest w/o | RandomForest w
188  | 0.728 | 0.726 | 0.674 | 0.696 | 0.717 | 0.739 | 0.466 | 0.506
1461 | 0.914 | 0.914 | 0.906 | 0.906 | 0.907 | 0.907 | 0.692 | 0.718
1459 | 0.815 | 0.82  | 0.919 | 0.919 | 0.922 | 0.927 | 0.588 | 0.635
54   | 0.858 | 0.857 | 0.839 | 0.839 | 0.707 | 0.708 | 0.753 | 0.759

TABLE VIII
OVERALL PREPROCESSING PIPELINE PERFORMANCE COMPARED WITH AUTOML LIBRARIES (REGRESSION, R2-SCORE; w = WITH, w/o = WITHOUT AUTO-PREPROCESSING)

OpenML dataset | AutoGluon w/o | AutoGluon w | AutoSklearn w/o | AutoSklearn w | H2O w/o | H2O w | RandomForest w/o | RandomForest w
189 | 0.913 | 0.913 | 0.902 | 0.903 | 0.913 | 0.918 | 0.579 | 0.615
507 | 0.731 | 0.741 | 0.753 | 0.753 | 0.762 | 0.761 | 0.390 | 0.411
537 | 0.815 | 0.821 | 0.862 | 0.865 | 0.861 | 0.869 | 0.484 | 0.494
495 | 0.496 | 0.495 | 0.494 | 0.494 | 0.441 | 0.442 | 0.616 | 0.632
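The KL-divergence criterion of (7), used for Table IV, can be estimated empirically from shared histogram bins, as in the sketch below; the bin count and the smoothing constant are assumptions.

```python
# Sketch of an empirical KL divergence between the original and the
# sampled distribution of a single feature, cf. equation (7).
import numpy as np

def kl_divergence(original: np.ndarray, sampled: np.ndarray,
                  n_bins: int = 20, eps: float = 1e-9) -> float:
    edges = np.histogram_bin_edges(original, bins=n_bins)
    p, _ = np.histogram(original, bins=edges)
    q, _ = np.histogram(sampled, bins=edges)
    p = p / p.sum() + eps                 # normalize and smooth zero bins
    q = q / q.sum() + eps
    return float(np.sum(p * np.log(p / q)))
```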


V. CONCLUSION

A significant amount of recent work has been done in the field of automated Machine Learning, but the same has not been the case for data preprocessing. This paper reviews and suggests some advanced preprocessing steps that can either be used individually or combined into a pipeline. Although performance improvements cannot be guaranteed for all datasets, datasets with inter-feature dependencies can be observed to perform better. For example, the length and width of a workpiece can be combined to form a new feature, "area of the workpiece", which can have a significant impact on the ML-based model; the proposed method does this without domain knowledge, in an automated manner. This paper also introduces a new sampling method that can be used for general applications as well as for ML-based modeling. We used the Bin-Based sampling method during the feature engineering step to generate new features and select them using a Ranking Model; the use of sampled data for feature engineering significantly reduced the preprocessing time. It can be concluded that a significant performance improvement of around 4-7% is observed for the analyses conducted with the baseline model on OpenML datasets. For the same set of datasets, a marginal improvement was observed for the analyses with the AutoML libraries. The proposed pipeline is currently not parallelized; parallelization can significantly reduce the time for feature engineering, and this is what we would like to focus on in future work.

REFERENCES

[1] U. Khurana, D. Turaga, H. Samulowitz, and S. Parthasrathy, "Cognito: Automated Feature Engineering for Supervised Learning," 2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW), 2016, pp. 1304-1307.
[2] G. Katz, E. C. R. Shin, and D. Song, "ExploreKit: Automatic Feature Generation and Selection," 2016 IEEE 16th International Conference on Data Mining (ICDM), 2016, pp. 979-984, doi: 10.1109/ICDM.2016.0123.
[3] S. Galhotra, U. Khurana, O. Hassanzadeh, K. Srinivas, H. Samulowitz, and M. Qi, "Automated Feature Enhancement for Predictive Modeling using External Knowledge," 2019 International Conference on Data Mining Workshops (ICDMW), 2019, pp. 1094-1097, doi: 10.1109/ICDMW.2019.00161.
[4] H. T. Lam, T. N. Minh, M. Sinn, B. Buesser, and M. Wistuba, "Neural Feature Learning From Relational Database," arXiv: Artificial Intelligence, 2018.


[5] J. M. Kanter and K. Veeramachaneni, "Deep feature synthesis: Towards automating data science endeavors," 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA), 2015, pp. 1-10.
[6] M. Feurer, A. Klein, K. Eggensperger, J. T. Springenberg, M. Blum, and F. Hutter, "Efficient and Robust Automated Machine Learning," NIPS 2015.
[7] E. LeDell and S. Poirier, "H2O AutoML: Scalable Automatic Machine Learning," 7th ICML Workshop on Automated Machine Learning (AutoML), July 2020.
[8] N. Erickson et al., "AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data," arXiv:2003.06505, 2020.
[9] C. Poole, "Low P-values or narrow confidence intervals: which are more durable?" Epidemiology 12, 2001, pp. 291-294.
[10] T. Terano, H. Liu, and A. L. P. Chen (eds.), Knowledge Discovery and Data Mining, 4th Pacific-Asia Conference, PAKDD 2000, Kyoto, Japan, 2000.
[11] C. F. Aliferis, A. Statnikov, I. Tsamardinos, S. Mani, and X. D. Koutsoukos, "Local Causal and Markov Blanket Induction for Causal Discovery and Feature Selection for Classification Part I: Algorithms and Empirical Evaluation," J. Mach. Learn. Res. 11, 2010, pp. 171-234.
[12] C. Jie, L. Jiawei, W. Shulin, and Y. Sheng, "Feature selection in machine learning: A new perspective," Neurocomputing 300, 2018, pp. 70-79.
[13] N. O. F. Elssied, O. Ibrahim, and A. H. Osman, "A Novel Feature Selection Based on One-Way ANOVA F-Test for E-Mail Spam Classification," Research Journal of Applied Sciences, Engineering and Technology 7, 2014, pp. 625-638.
[14] W. Cochran, Sampling Techniques, 3rd edition, John Wiley and Sons, 1978.
[15] J. A. R. Rojas, M. B. Kery, S. Rosenthal, and A. Dey, "Sampling techniques to improve big data exploration," 2017 IEEE 7th Symposium on Large Data Analysis and Visualization (LDAV), 2017, pp. 26-35.
[16] J. M. Joyce, "Kullback-Leibler Divergence," International Encyclopedia of Statistical Science, 2011.
[17] P. Gijsbers, E. LeDell, J. K. Thomas, S. Poirier, B. Bischl, and J. Vanschoren, "An Open Source AutoML Benchmark," arXiv:1907.00909, 2019.
[18] J. Vanschoren, J. N. van Rijn, B. Bischl, and L. Torgo, "OpenML: networked science in machine learning," SIGKDD Explorations 15(2), 2013, pp. 49-60.
[19] W. Kirch (ed.), "Pearson's Correlation Coefficient," Encyclopedia of Public Health, Springer, Dordrecht, 2008.
