DiffPrep Differentiable Data Preprocessing Pipeline Search For Learning Over Tabular Data
DiffPrep Differentiable Data Preprocessing Pipeline Search For Learning Over Tabular Data
ABSTRACT
Data preprocessing is a crucial step in the machine learning process
that transforms raw data into a more usable format for downstream
ML models. However, it can be costly and time-consuming, of-
ten requiring the expertise of domain experts. Existing automated
machine learning (AutoML) frameworks claim to automate data
preprocessing. However, they often use a restricted search space
of data preprocessing pipelines which limits the potential perfor-
Figure 1: A Typical ML Development Workflow
mance gains, and they are often too slow as they require training
the ML model multiple times. In this paper, we propose DiffPrep, a
method that can automatically and efficiently search for a data pre-
processing pipeline for a given tabular dataset and a differentiable 1 INTRODUCTION
ML model such that the performance of the ML model is maximized. Machine learning (ML), in particular, supervised ML is increasingly
We formalize the problem of data preprocessing pipeline search as a being used for solving challenging real-world problems in a wide
bi-level optimization problem. To solve this problem efficiently, we range of fields, such as medicine [19], finance [9], politics [34], etc.
transform and relax the discrete, non-differential search space into The workflow of developing an ML application may vary from
a continuous and differentiable one, which allows us to perform the projects but it typically involves four stages as shown in Figure 1,
pipeline search using gradient descent with training the ML model including data acquisition, data preprocessing, model training, and
only once. Our experiments show that DiffPrep achieves the best model evaluation [16].
test accuracy on 15 out of the 18 real-world datasets evaluated and Data preprocessing is an essential step in a typical ML workflow
improves the model’s test accuracy by up to 6.6 percentage points. because in practice, the raw data collected often contain data issues
and can rarely be directly used by ML models [13]. For example,
KEYWORDS data errors such as missing values and outliers will significantly
reduce the ML model performance if not cleaned [24]; certain mod-
data cleaning, data preprocessing, automated machine learning
els like k-nearest neighbors (KNN) [32] expect data in different
columns to have a similar range, which requires the raw data to be
ACM Reference Format:
Peng Li, Zhiyi Chen, Xu Chu, and Kexin Rong. 2023. DiffPrep: Differen-
normalized. In addition, real-world data often have multiple such
tiable Data Preprocessing Pipeline Search for Learning over Tabular Data. data issues [2]: a dataset may contain missing values, outliers, and
In Proceedings of 2023 International Conference on Management of Data a large discrepancy on feature scales. To handle this case, we may
(SIGMOD ’23). ACM, New York, NY, USA, Article 183, 16 pages. https: first impute missing values with mean imputation, then remove out-
//doi.org/10.1145/3589328 liers using the Z-score method, and finally normalize the data using
standardization. As a result, data preprocessing usually involves
multiple operators organized using a data preprocessing pipeline,
where the operators in the pipeline are applied sequentially and
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed each operator transforms the data to tackle a specific data issue.
for profit or commercial advantage and that copies bear this notice and the full citation Designing data preprocessing pipelines is challenging for data
on the first page. Copyrights for components of this work owned by others than the scientists as it involves many design decisions on transformation
author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or
republish, to post on servers or to redistribute to lists, requires prior specific permission types, order, and operators [7, 14]. First, data scientists must decide
and/or a fee. Request permissions from [email protected]. which types of transformations (e.g., outlier removal, discretization,
SIGMOD ’23, June 18–23, 2023, Seattle, WA normalization) are needed and the order of different transforma-
© 2023 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-x-xxxx-xxxx-x/YY/MM. tions in the pipeline. For example, outlier removal may or may not
https://doi.org/10.1145/3589328 be needed and can be applied before or after normalization. For
SIGMOD ’23, June 18–23, 2023, Seattle, WA Peng Li, Zhiyi Chen, Xu Chu, and Kexin Rong
each type of transformation, there are multiple choices of operators– and order. Azure would search for the best operator for some trans-
for example, standardization, min-max scaling, and robust scaling formations, such as normalization, but the transformation types
are all commonly used operators for normalization. Data scien- and order are fixed. Auto-Sklearn [11] can determine the types of
tists need to decide which operator to use for each transformation. transformation needed and select suitable operators, but it has a
Furthermore, different features may need to be preprocessed differ- fixed order of applying transformations. Learn2Clean [5] considers
ently, which requires data scientists to design a feature-wise pipeline operators, types and order in its search space, but it uses the same
rather than using the same preprocessing pipeline for all features. pipeline to preprocess all features. Therefore, existing AutoML sys-
For example, when repairing missing values in a dataset, some fea- tems have only explored a small portion of the entire design space
tures may prefer mean imputation while others may prefer median of data preprocessing pipelines, which limits the performance gains
imputation, depending on their distributions. of downstream ML models.
Even for experienced data scientists, it is usually not clear how Limitation 2: Low efficiency on optimization methods. As the
to design a preprocessing pipeline that will lead to the best perfor- search space becomes larger, it is increasingly important that the
mance. Making such decisions heavily relies on domain knowledge, optimization methods can efficiently identify parameters with good
including the characteristics of the data, the types of downstream performance. Unfortunately, most existing AutoML systems use
ML models, and the data scientists’ experience. Traditional data optimization methods like random search or Bayesian optimization
cleaning works usually seek to design pipelines that optimize data (Figure 2), which require training the ML model multiple times and
quality independently of downstream applications [7]. However, do not scale well with the search space. Specifically, random search
since the ground-truth clean data of real-world datasets are rarely randomly samples parameters from the entire search space and
available, the data quality may not be accessible or accurately es- trains the ML model with each set of sampled parameters; Bayesian
timated. Moreover, previous works have shown that data clean- optimization builds a probabilistic model that maps parameters
ing or preprocessing without considering downstream ML models to model performance, which requires iteratively training the ML
can sometimes negatively impact the performance of ML mod- model with new optimal parameters and updating the probabilistic
els [24, 27]. In practice, data scientists often use the time-consuming model with the model performance. The optimization becomes
trial-and-error method to design preprocessing pipelines, which is particularly challenging if we want to support a larger search space
reported to account for 80% of data scientists’ time [35], or simply in data preprocessing, such as using feature-wise pipelines. For
use some default configurations which usually result in suboptimal example, if there are 𝑛 possible pipelines for one feature, the number
performance. of possible pipelines for data with 𝑐 columns is 𝑛𝑐 , which means
To reduce human effort in ML development, extensive study the space grows exponentially with the number of features. When
has been made on automated machine learning (AutoML) systems. there are many features or the ML model is large, random search
Existing AutoML systems like Azure AutoML [4] and H2O.ai [23] or Bayesian optimization can be computationally expensive and
can automatically perform data preprocessing and model training time-consuming.
without too much human involvement. Most AutoML systems con-
sist of a search space and some optimization methods [16]. The Our proposal. In this work, we propose DiffPrep, an automatic
search space is defined by a set of possible parameters (choices) in data preprocessing method that can efficiently select high-quality
ML development workflows. Our discussions in this paper focus data preprocessing pipelines for any given dataset and differen-
on data preprocessing-related parameters, such as the choices of tiable ML model. Unlike traditional data preprocessing or cleaning
transformation types and operators. The optimization methods are methods that focus on improving data quality independently of the
used to automatically select a combination of parameters from the downstream applications [7], DiffPrep co-optimizes data prepro-
search space that leads to the highest model performance. With re- cessing with model training: the goal is to select data preprocessing
spect to data preprocessing pipelines, we find that existing AutoML pipelines that maximize the validation accuracy (or minimize the val-
solutions show limitations in both the size of search space and the idation loss) of the ML model trained on the dataset. Since the model
efficiency of the optimization method: training aims at minimizing the training loss, while the pipeline
selection aims at minimizing the validation loss, we end up with a
Limitation 1: Limited search space for data preprocessing so-called bi-level optimization problem [26].
pipelines. The search space of an AutoML system determines the Different from existing AutoML systems, DiffPrep considers
upper bound of its model performance. In general, larger search feature-wise data preprocessing pipelines in addition to transforma-
spaces are more likely to contain high-performing configurations, tion type, operators, and order. Feature-wise pipelines significantly
which could lead to better performance [8]. Despite the importance broaden the search space for data preprocessing compared to exist-
of data preprocessing in ML workflows, we found that existing ing AutoML systems, since the number of possible pipelines grows
AutoML frameworks only consider a limited search space of pa- exponentially with the number of features. We consider two usage
rameters for data preprocessing pipelines. In Figure 2, we rank scenarios with DiffPrep. If users provide a pre-defined order, our
existing AutoML systems according to the size of the search space method (DiffPrep-Fix) will fix the order and automatically select
they considered, which is determined by four variables: the choices the types and operators to generate a pipeline for each feature. If
of transformation types, the choices of transformation order, the no order is provided, our method (DiffPrep-Flex) will automatically
choices of transformation operators, and whether the same pipeline select the order, type and operator to generate a pipeline for each
is used for each data feature. H2O only supports a single default feature. It would explore the entire design space of data preprocess-
preprocessing pipeline with fixed transformation operators, types ing pipelines, which is the largest search space in Figure 2. In both
DiffPrep: Differentiable Data Preprocessing Pipeline Search for Learning over Tabular Data SIGMOD ’23, June 18–23, 2023, Seattle, WA
Learn2Clean ✓ ✓ ✓ × Q-Learning
scenarios, we have empirically found that the larger search space operator), is a function defined by 𝑓 : X1 → X2 , where an original
improves the quality of the resulting pipelines. feature in feature space X1 , is mapped to a transformed feature in
Efficiently searching over this large search space is challenging: feature space X2 . In this paper, we focus on TF operators where
since the search space is discrete and non-differentiable, we need both input and output are scalars, which we refer to as feature-
to enumerate each option and train ML models repeatedly to find wise transformation operators. Other TF operators that transform a
high-quality pipelines. Our key insight is to make the ML model vector into another vector, such as principal component analysis,
performance differentiable with respect to preprocessing pipeline and entity resolution operators [6, 42] that take a vector as input,
choices so that we can leverage efficient optimization methods are not considered in this paper. We note that most TF operators
like gradient descent. To do so, we first parameterize the search used in practice fall into this category. For example, among the TF
space of data preprocessing pipelines such that each choice in operators provided by scikit-learn [31] in sklearn.preprocessing
the pipeline can be represented using binary parameters. We then module, 14 out of 18 are feature-wise TF operators. Some TF opera-
relax the search space to be continuous using the softmax function tors have parameters. For example, to use the Z-Score method, we
and Sinkhorn normalization to make the pipeline differentiable. need to specify the threshold 𝑘 to determine whether a value is an
This allows DiffPrep to use gradient descent as the optimization outlier (e.g., k = 3).
method to solve the bi-level optimization problem, which allows us Transformation Types. Based on the purpose of the transforma-
to optimize the pipeline and model simultaneously with training tion, TF operators can be grouped into different types, which we
the ML model only once. refer to as transformation types (TF types). For example, missing
Contributions. We make the following contributions in this paper. value imputation is a TF type that consists of TF operators such
• We propose DiffPrep, the first automatic data preprocessing as mean imputation and median imputation. Table 1 shows some
method to consider the design space of transformation types, examples of TF types and the TF operators for each TF type. For-
operators, order and feature-wise pipelines. mally, we define a TF type 𝐹 as a set of TF operators, denoted by
• We formalize the problem of automatic data preprocessing as a 𝐹 = {𝑓𝑖 : 𝑖 = 1, 2, ...}.
bi-level optimization problem and use gradient descent to solve Design Considerations for Transformation Operators. The
the bi-level optimization problem efficiently. possibilities for transformation operators are much greater than
• We conduct experiments on 18 real-world datasets to evaluate the what we cover in this paper. For example, it is possible to define
effectiveness of our method. The results show that our method custom conditions for selecting specific subsets of the dataset and
achieves the best test accuracy on 15 out of 18 datasets and to use custom functions to transform them. However, it is diffi-
improves the test accuracy by up to 6.6 percentage points. cult to search in such a vast search space. Furthermore, when co-
Organization. The rest of this paper is organized as follows. Sec- optimizing data preprocessing and model training, using a large
tion 2 introduces preliminaries for constructing a data preprocess- number of operators that can modify data arbitrarily can increase
ing pipeline and formally defines our studied problem. In Section 3, the risk of model overfitting, as observed in prior work [20]. There-
we discuss our method to automatically search a data preprocessing fore, DiffPrep focuses on a limited set of operators that are com-
pipeline given a fixed order of transformations. In Section 4, we monly supported by ML frameworks (e.g. scikit-learn) and widely
present our approach with flexible order of transformations. In used. We have found empirically that this set of operators already
Section 5, we show the experimental results. We discuss the related leads to improved performance on many real-world datasets. Ad-
work in Section 6 and conclude the paper in Section 7. ditionally, the operators supported by DiffPrep are designed for
general purpose to transform data based on some prior knowledge
rather than modifying the dataset arbitrarily. For instance, oper-
2 PRELIMINARY ators for outlier removal and missing value imputation can only
2.1 Transformation Operators and Types affect specific (usually small) subsets of the dataset, which are usu-
ally prone to be dirty data, while normalization and discretization
Transformation Operators. Generally, a data preprocessing op-
erator/algorithm, which we term as a transformation operator (TF
SIGMOD ’23, June 18–23, 2023, Seattle, WA Peng Li, Zhiyi Chen, Xu Chu, and Kexin Rong
Table 1: Example of TF Types and TF operators Note that in the literature, the notion of data preprocessing
Transformation Types Transformation Operators prototype is also known as logical pipeline plan [37] or pipeline
Mean prototype [33], which is defined as a directed acyclic graph of TF
Numerical
Median types. This definition suggests that there is no repetition of a TF
Missing Value Features
Mode type in a prototype [14]. This is because in practice, a type of
Imputation
Categorical Most Frequent Value
Features Dummy Variable transformation would rarely be repeated (e.g., it does not make
Standardization sense to impute missing values twice). Following this widely-used
Normalization
Min-Max Scaling definition of prototype, we require that ∀𝑝 ≠ 𝑞,𝑇𝑝 ≠ 𝑇𝑞 . We want
Robust Scaling to point out that this assumption does not prevent users from using
Max Absolute Scaling
Z-Score (𝑘)
a TF type multiple times. If repetition is really needed, we can
Outlier define two TF types 𝑇1 and 𝑇2 on the same transformation and
MAD (𝑘)
Removal
IQR (𝑘) consider them as two different TF types. This allows for repeated
Uniform (𝑛) transformation types in a prototype.
Discretization
Quantile (𝑛)
Step 3: Operator selection. The third step is operator selection,
where given a data preprocessing prototype, data scientists would
select a specific TF operator for each TF type in the prototype. For
operators can only scale and shift the distribution of entire features, example, we use mean imputation to impute missing values, use the
but not completely distort the data distribution. Z-score approach to remove outliers and perform standardization
to normalize data. To select suitable operators for TF types, data
2.2 Data Preprocessing Pipeline Construction scientists have to consider not only the characteristics of data but
As we mentioned before, data preprocessing usually involves multi- also the downstream ML model. For example, for normalization, if
ple transformations that are combined in a pipeline and each feature the downstream model is k-nearest neighbors, min-max scaling is
can use a different pipeline. For simplicity, let us first assume that usually preferable as it transforms all the columns into the same
the data only contain one feature and we want to construct a data scale. In contrast, if the model is logistic regression, standardization
preprocessing pipeline for it. Figure 3 shows a typical data scien- may be better as it makes convergence faster. However, such heuris-
tists’ workflow to construct a data preprocessing pipeline, which tic rules may not work for every dataset, and in practice, selecting
involves the following steps. suitable operators usually requires trial-and-error. By selecting a TF
operator for each TF type, we can instantiate a data preprocessing
Step 1: Data exploration. The initial step in creating a data pre-
pipeline, which is formally defined as follows.
processing pipeline is usually data exploration. During this step,
data scientists examine the data to understand its characteristics
and identify any potential or existing issues. Some issues, such as Definition 2.2. (Data Preprocessing Pipeline). Given a data pre-
missing values, are relatively easy to detect, while others, such as processing prototype T , we define a data preprocessing pipeline
outliers, may require more involved data analysis [15]. GT as GT = {𝑔𝑖 : 𝑖 = 1, 2, ...}, where 𝑔𝑖 ∈ 𝑇𝑖 is the TF operator
selected for 𝑇𝑖 in the prototype and the 𝑖-th operator to be applied
Step 2: Prototype selection. The second step is called prototype in the pipeline.
selection, where based on the data issues, data scientists would select
the TF types to be involved and decide the order of TF types to be
applied [14, 33, 37]. For example, we choose to first impute missing Step 4: Pipeline evaluation. The final step is pipeline evaluation,
values, then remove outliers and finally perform data normalization. where data scientists would use the pipeline to transform the raw
Note that in a pipeline, the input of one operator is the output of its data and evaluate the performance of the pipeline by training and
previous operator. Therefore, different orders of TF types can result testing the end ML model on the transformed data. To transform
in totally different outputs. For example, if outlier removal occurs the raw data with the pipeline, we can sequentially apply the TF
before normalization, not only the input data for normalization are operator in the pipeline. Formally, let 𝑥 denote the raw feature of
changed, but also the statistics used to normalize the data (e.g., the one example and 𝑥𝑖 be the transformed feature after 𝑖 steps transfor-
minimum and maximum value of the column) are affected due to mation, where 𝑥 0 = 𝑥. Then, 𝑥𝑖 can be computed by applying 𝑔𝑖 (the
the removal of outliers, which can yield a significantly different 𝑖-th TF operator in the pipeline) on its input 𝑥𝑖 −1 as: 𝑥𝑖 = 𝑔𝑖 (𝑥𝑖 −1 ).
output compared with having outlier removal after normalization. The output of the pipeline is the output of the last TF operator
However, in practice, it is usually not clear how to decide the order in the pipeline. Formally, let 𝑠 denote the number of TF operators
of TF types and data scientists would simply choose some default in the pipeline and GT (𝑥) denote the final output of the pipeline
order based on their experience. The outcome of this step is a data GT . Then, we have GT (𝑥) = 𝑥𝑠 .
preprocessing prototype, which is an ordered sequence of TF types After pipeline evaluation, data scientists may refine the proto-
formally defined as follows. type and the pipeline based on the evaluation results. For example,
they may add/delete TF types in the prototype, change the order of
Definition 2.1. (Data Preprocessing Prototype). Let S = {𝐹 } de- TF types, replace the TF operator for some TF types, etc. The above
note the space of TF types. We define a data preprocessing prototype process will be repeated until the end ML model achieves desired
T as T = {𝑇𝑖 ∈ S : 𝑖 = 1, 2, ...}, where 𝑇𝑖 is the 𝑖-th TF type to be performance. As a result, constructing data preprocessing pipelines
applied and selected from the space. is an iterative process that requires substantial domain knowledge
DiffPrep: Differentiable Data Preprocessing Pipeline Search for Learning over Tabular Data SIGMOD ’23, June 18–23, 2023, Seattle, WA
and heavily relies on human experts to make decisions, which can The most straightforward way to solve the above bi-level opti-
be time-consuming and costly. mization problem is to train the downstream ML model by mini-
mizing 𝐿𝑡𝑟𝑎𝑖𝑛 (GT 1 , ..., G𝑐 , 𝒘) for each possible data preprocessing
T
2.3 Problem Statement pipeline in the space, and then select the one that has the minimal
validation loss 𝐿𝑣𝑎𝑙 (GT 1 , ..., G𝑐 , 𝒘 ∗ ). This naive approach is com-
The core idea of our approach is to formulate the decision-making T
process of data scientists as an optimization problem. Assume that putationally expensive as it requires training the downstream ML
each data example has 𝑐 features denoted by 𝒙 = [𝑥 1, 𝑥 2, ...𝑥 𝑐 ]. Let models as many times as the number of possible data preprocessing
T 𝑖 , G𝑖T denote the prototype and the pipeline for the 𝑖-th feature. pipelines. Consider a prototype with 𝑠 TF types, where each TF
Let 𝒘 denote the parameters of the ML model ℎ 𝒘 : X𝑐 → Y. The type consists of 𝑚 TF operators. Under this prototype, there are
model will take features of an example as input and generate a 𝑚𝑠 possible pipelines by choosing different TF operators for each
prediction 𝑦ˆ = ℎ 𝒘 (𝑥 1, 𝑥 2, ...𝑥 𝑐 ). Let 𝑙𝑜𝑠𝑠 (𝑦,
ˆ 𝑦) be the loss function TF type. Since different features can use different pipelines, with 𝑐
that returns a loss score given the prediction 𝑦ˆ and the ground features, we have 𝑚𝑠𝑐 possible pipelines, which is exponential to
truth label 𝑦. Then, the training loss and validation loss on the the number of TF types and the number of features. This is only
transformed data can be computed as: the number of possible pipelines under one prototype. If we con-
∑︁ sider using different prototypes, the space will become even larger.
1 1
𝐿𝑡𝑟𝑎𝑖𝑛 (GT , ..., G𝑐T , 𝒘) = 𝑙𝑜𝑠𝑠 (ℎ 𝒘 (GT (𝑥 1 ), ..., G𝑐T (𝑥 𝑐 )), 𝑦) Therefore, this naive approach is infeasible in practice.
𝒙,𝑦 ∈𝐷𝑡𝑟𝑎𝑖𝑛
∑︁ 3 DATA PREPROCESSING WITH FIXED
1 1
𝐿𝑣𝑎𝑙 (GT , ..., G𝑐T , 𝒘) = 𝑙𝑜𝑠𝑠 (ℎ 𝒘 (GT (𝑥 1 ), ..., G𝑐T (𝑥 𝑐 )), 𝑦) PROTOTYPE
𝒙,𝑦 ∈𝐷 𝑣𝑎𝑙
Let us first assume that we have a fixed pre-defined data processing
Problem Statement. Given a training set 𝐷𝑡𝑟𝑎𝑖𝑛 and a validation prototype T = {𝑇1, ...,𝑇𝑠 } that is used for all features. This is a com-
set 𝐷 𝑣𝑎𝑙 , a space of TF operators and TF types S, and a set of ML mon scenario where data scientists would like to skip the prototype
model parameters 𝒘, we would like to find a pipeline G𝑖T (with selection step and use a default prototype so that they can spend
more time on operator selection. This is also the setup of many
its prototype T 𝑖 ) from the space for each feature 𝑥 𝑖 , such that the
existing systems [5, 10, 33, 40] for automatic data preprocessing,
performance of the ML model trained and evaluated on the trans-
where the prototype is pre-defined by users and fixed in advance.
formed data is maximized. This data preprocessing pipeline search
Given this prototype, we need to assign each TF type in the
problem (DPPS) can be formulated as an optimization problem:
prototype with a TF operator to generate a pipeline and we need
min 1
𝐿𝑣𝑎𝑙 (GT , ..., G𝑐T , 𝒘 ∗ ) to generate a pipeline for each feature. As the number of possible
GT1 ,...,GT𝑐 choices is exponentially large, the problem is: how to efficiently
s.t. 𝒘 ∗ = arg min 𝐿𝑡𝑟𝑎𝑖𝑛 (GT
1
, ..., G𝑐T , 𝒘) find the optimal assignment for each feature such that the validation
𝒘 loss is minimized? This search problem is challenging because the
This is called a bi-level optimization problem in which one optimiza- search space is discrete and non-differentiable, which means we
tion problem is embedded within another [26]. In the inner opti- have to enumerate every possible case to find the optimal one. To
mization, we fix the preprocessing pipeline parameters GT 1 , ..., G𝑐 solve this problem, we first parameterize the search space such
T that each assignment of TF operators (i.e., each choice of pipelines)
and focus on finding the best model parameters 𝒘 ∗ to minimize
the training loss on the transformed training data. In the outer can be represented using binary pipeline parameters (Section 3.1).
optimization, we fix the model parameters and focus on finding Then, we relax the search space to be continuous so that the model
the best pipeline parameters to minimize the validation loss on loss is differentiable w.r.t the pipeline parameters (Section 3.2). The
the transformed validation data. An alternative problem formu- relaxation enables us to solve the bi-level optimization problem
lation is to optimize both the pipeline and model parameters to efficiently using gradient descent (Section 3.3).
minimize the training loss, which becomes a one-level optimiza-
tion. However, previous works (e.g., DARTS [25]) have shown that 3.1 Parameterization
one-level optimization would make it easier to overfit the training Without loss of generality, assume that each TF type 𝑇𝑖 in the pro-
data. Therefore, in this work, we use bi-level optimization to reduce totype consists of 𝑚 TF operators denoted by 𝑇𝑖 = {𝑓𝑖1, 𝑓𝑖2, ...𝑓𝑖𝑚 }
the risk of overfitting. In our experiments, we empirically verified (if different TF types contain a different number of operators, 𝑚
that the bi-level approach helps improve the model test accuracy denotes the maximum number of TF operators). To instantiate
compared to using a one-level optimization (Section 5.4). a pipeline, we need to select one specific TF operator for each
SIGMOD ’23, June 18–23, 2023, Seattle, WA Peng Li, Zhiyi Chen, Xu Chu, and Kexin Rong
TF type. Each selection can be represented using a 𝑠 × 𝑚 matrix We can now rewrite the DPPS problem statement with a fixed
𝜷 = {𝛽𝑖 𝑗 }, 𝛽𝑖 𝑗 ∈ {0, 1} is defined as follows: prototype using 𝜷 matrices as follows, where the search space of
( pipelines is converted into a space of 𝜷 matrices.
1 𝑓𝑖 𝑗 is selected
𝛽𝑖 𝑗 = (1) min 𝐿𝑣𝑎𝑙 (𝜷 1, ...𝜷 𝑐 , 𝒘 ∗ ) (5)
0 Otherwise 𝜷 1 ,...,𝜷 𝑐
In other words, 𝛽𝑖 𝑗 is 1 if we select 𝑓𝑖 𝑗 for 𝑇𝑖 , otherwise it is 0. Note s.t. 𝒘 ∗ = argmin𝒘 𝐿𝑡𝑟𝑎𝑖𝑛 (𝜷 1, ..., 𝜷 𝑐 , 𝒘) (6)
that because only one TF operator is selected for each TF type,
there is exactly one element each row in the matrix to be 1 and 0s Architecture. Figure 4 shows the architecture of DiffPrep with a
fixed prototype. For each input feature, DiffPrep will learn a set of
Í
elsewhere. Hence, we have 𝑗 𝛽𝑖 𝑗 = 1.
𝜷 parameters that defines a data preprocessing pipeline. Each 𝛽𝑖 𝑗
Example 3.1. Assume that the given prototype consists of three parameter is attached with a TF operator indicating whether this
TF types, i.e., T = {𝑇1,𝑇2,𝑇3 }. Each TF type consists of four possible operator is selected. The final output of DiffPrep pipelines will be
TF operators, i,e., 𝑇𝑖 = {𝑓𝑖1, 𝑓𝑖2, 𝑓𝑖3, 𝑓𝑖4 }. Consider the data prepro- the preprocessed features parameterized by 𝜷 parameters, which
cessing pipeline GT = {𝑓12, 𝑓21, 𝑓34 }. The selection of TF operators will be fed into ML models for training and evaluation.
that generates this pipeline can be represented using the following
matrix. 3.2 Differentiable Relaxation
0 1 0 0
𝜷 = 1 0 0 0
Although we have parameterized the search space of pipelines
0 0 0 1 using 𝜷 matrices, the search space is still discrete as 𝜷 matrices
are binary parameters, i.e., 𝛽𝑖 𝑗 ∈ {0, 1}. To make the search space
Each 𝜷 matrix uniquely defines a data preprocessing pipeline continuous, we relax 𝜷 to be continuous parameters that can take
under the given prototype and we can use 𝜷 matrix to compute the continuous values from 0 to 1, i.e., 𝛽𝑖 𝑗 ∈ [0, 1]. However, we need
output of the pipeline. Let 𝑥 denote the raw feature of one example to retain the constraints between parameters in the original binary
and 𝑥𝑖 be the transformed feature after 𝑖 steps transformation,
Í
matrix ( 𝑗 𝛽𝑖 𝑗 = 1) such that these parameters are semantically
where 𝑥 0 = 𝑥. Then, we can compute 𝑥𝑖 as: meaningful. To enforce that each row sums up to 1, we can define
𝑚
∑︁ 𝛽𝑖 𝑗 using a softmax function as:
𝑥𝑖 = 𝛽𝑖 𝑗 𝑓𝑖 𝑗 (𝑥𝑖 −1 ) (2)
exp(𝜏𝑖 𝑗 )
𝑗=1 𝛽𝑖 𝑗 = Í (7)
𝑘 exp(𝜏𝑖𝑘 )
Since only 𝛽𝑖 𝑗 associated with the selected TF operator is equal
to 1 and others are 0, Equation 2 returns exactly the output of the where 𝝉 = {𝜏𝑖 𝑗 } ∈ R𝑠 ×𝑚 are underlying parameters.
selected TF operator. The final output of the pipeline 𝑥𝑠 becomes a By enforcing these constraints, we can interpret 𝛽𝑖 𝑗 (Equation 1)
variable parameterized by 𝜷 matrix denoted by 𝑥𝑠 = 𝑥 (𝜷). Assume as the probability that 𝑓𝑖 𝑗 is selected for 𝐹𝑖 . We can still use Equa-
that the dataset has 𝑐 features and let 𝜷 𝑖 denote the parameters that tion 2 to compute the transformed data, which can be interpreted
defines the pipeline for the 𝑖-th feature 𝑥 𝑖 . Then, the training loss as the expected value of the transformed data. Similarly, we can
and the validation loss also become variables parameterized by all compute the expected value of training loss and validation loss
𝜷 matrices, which can be represented as: using Equation 3 and Equation 4.
∑︁
𝐿𝑡𝑟𝑎𝑖𝑛 (𝜷 1, ..., 𝜷 𝑐 , 𝒘) = 𝑙𝑜𝑠𝑠 (ℎ 𝒘 (𝑥 1 (𝜷 1 ), ..., 𝑥 𝑐 (𝜷 𝒄 )), 𝑦) 3.3 Bi-level Optimization
𝒙,𝑦 ∈𝐷𝑡𝑟𝑎𝑖𝑛
Making 𝜷 continuous allows us to solve the bi-level optimization
(3)
∑︁ problem efficiently using gradient descent instead of enumerating
1 𝑐 1 1 𝑐 𝒄 all possible 𝜷. To minimize the validation loss (Equation 5), we can
𝐿𝑣𝑎𝑙 (𝜷 , ..., 𝜷 , 𝒘) = 𝑙𝑜𝑠𝑠 (ℎ 𝒘 (𝑥 (𝜷 ), ..., 𝑥 (𝜷 )), 𝑦) (4)
𝒙,𝑦 ∈𝐷 𝑣𝑎𝑙 iteratively update the underlying parameters 𝝉 using the gradient
DiffPrep: Differentiable Data Preprocessing Pipeline Search for Learning over Tabular Data SIGMOD ’23, June 18–23, 2023, Seattle, WA
Algorithm 1 Solving Bi-level Optimization with Gradient Descent ∇𝜷 𝐿𝑡𝑟𝑎𝑖𝑛 (𝜷, 𝒘 + ), ∇𝜷 𝐿𝑡𝑟𝑎𝑖𝑛 (𝜷, 𝒘 − ) are the gradients of the training
1: Initialize 𝝉 and 𝒘 loss w.r.t 𝜷, where we consider 𝒘 + and 𝒘 − as constant. They can be
2: while not converged do derived from Equation 3 or as we will show later using backpropa-
3: Update 𝝉 : 𝝉 = 𝝉 − 𝜂 1 ∇𝝉 𝐿𝑣𝑎𝑙 (𝜷 (𝝉 ), 𝒘 − 𝜂 2 ∇𝒘 𝐿𝑡𝑟𝑎𝑖𝑛 (𝜷 (𝝉 ), 𝒘 ) ) gation through the pipeline. Therefore, to update the underlying
4: Update 𝒘: 𝒘 = 𝒘 − 𝜂 2 ∇𝒘 𝐿𝑡𝑟𝑎𝑖𝑛 (𝜷 (𝝉 ), 𝒘 ) parameters 𝝉, we need to compute three gradients w.r.t. 𝜷 with
5: return 𝜷 (𝝉 ), 𝒘 different model parameters: ∇𝜷 𝐿𝑡𝑟𝑎𝑖𝑛 (𝜷, 𝒘 + ), ∇𝜷 𝐿𝑡𝑟𝑎𝑖𝑛 (𝜷, 𝒘 − ) and
∇𝜷 𝐿𝑣𝑎𝑙 (𝜷, 𝒘 ′ ).
of the validation loss with respect to 𝝉 as follows: Backpropagation through the pipeline. The gradients of loss
w.r.t. 𝜷 can be computed using backward propagation through the
𝝉 = 𝝉 − 𝜂 1 ∇𝝉 𝐿𝑣𝑎𝑙 (𝜷 (𝝉), 𝒘 ∗ ) (8) 𝜕𝐿 be the derivative of the loss w.r.t. the input of the
pipeline. Let 𝜕𝑥 𝑠
where 𝜂 1 is the learning rate for updating 𝝉. However, to obtain the ML model or the final output of the preprocessing pipeline. From
optimal model 𝒘 ∗ in Equation 8, we need to solve the inner opti- Equation 2, using the chain rule, we have
mization problem (Equation 6) by completely training an ML model 𝑚
until convergence for every update of 𝝉, which is computationally 𝜕𝐿 𝜕𝐿 𝜕𝐿 𝜕𝐿 ∑︁ 𝜕𝑓𝑖 𝑗
= 𝑓𝑖 𝑗 (𝑥𝑖 −1 ), = 𝛽𝑖 𝑗 (12)
expensive. To solve this issue, instead of finding the optimal 𝒘 ∗ , 𝜕𝛽𝑖 𝑗 𝜕𝑥𝑖 𝜕𝑥𝑖 −1 𝜕𝑥𝑖 𝑗=1 𝜕𝑥𝑖 −1
we approximate it by doing only a single training step, which is
𝜕𝑓
a one-step gradient descent on the current model parameters 𝒘 However, in this equation, the gradient term 𝜕𝑥𝑖𝑖−1 𝑗
, namely, the
(denoted by 𝒘 ′ ) as: gradient of the output of a TF operator w.r.t. its input depends on the
𝒘 ∗ ≈ 𝒘 ′ = 𝒘 − 𝜂 2 ∇𝒘 𝐿𝑡𝑟𝑎𝑖𝑛 (𝜷 (𝝉), 𝒘) (9) internals of a TF operator, which may not be easy to compute. Also,
we expect users to add their own customized TF operators, but users
where 𝜂 2 is the learning rate for model training 1 . Similar approx- may not be able to derive the gradients. Therefore, we assume that
imation can also be found in previous work [12, 25] for solving
all TF operators are black-box functions, where we only have access
other bi-level optimization problems. We summarize the procedure
of solving the bi-level optimization (Equation 5 and Equation 6) to the output and input without knowing their internal algorithms,
using gradient descent in Algorithm 1, where the pipeline param- and we approximate the gradients using numerical differentiation.
eters 𝜷 (𝝉) and model parameters 𝒘 are alternately updated using Let 𝜖 be a small scalar. This gradient can be estimated as:
gradient descent until convergence. 𝜕𝑓𝑖 𝑗 𝑓𝑖 𝑗 (𝑥𝑖 −1 + 𝜖) − 𝑓𝑖 𝑗 (𝑥𝑖 −1 − 𝜖)
≈ (13)
Gradient Computation. In Algorithm 1, ∇𝒘 𝐿𝑡𝑟𝑎𝑖𝑛 (𝜷 (𝝉), 𝒘) can 𝜕𝑥𝑖 −1 2𝜖
be derived from Equation 3, where 𝜷 can be considered as constant.
Then, we can use Equation 12 to backpropagate the gradients of
However, since 𝜷 (Equation 7) and 𝒘 ′ (Equation 9) are variables of
the loss w.r.t. the output of each TF operator and each 𝛽𝑖 𝑗 .
𝝉, to compute the gradients ∇𝝉 𝐿𝑣𝑎𝑙 (𝜷 (𝝉), 𝒘 ′ (𝝉)), we need to use
the chain rule 2 : Implementation with automatic differentiation. Many auto-
matic differentiation engines (e.g. Pytorch [30], TensorFlow [1])
∇𝝉 𝐿𝑣𝑎𝑙 (𝜷 (𝝉), 𝒘 ′ (𝝉)) = ∇𝜷 𝐿𝑣𝑎𝑙 (𝜷, 𝒘 ′ ) · ∇𝝉 𝜷 can perform backpropagation automatically and efficiently to com-
− 𝜂 2 ∇𝒘 ′ 𝐿𝑣𝑎𝑙 (𝜷, 𝒘 ′ ) · ∇𝒘,𝜷 𝐿𝑡𝑟𝑎𝑖𝑛 (𝜷, 𝒘) · ∇𝝉 𝜷 (10) pute the gradients. However, these tools usually require that all
the computations in the forward propagation are differentiable and
We can decompose Equation 10 into three parts: 𝐷 1 , 𝐷 2 , 𝐷 3 , where
implemented using their frameworks. Since we consider TF oper-
𝐷 1 = ∇𝝉 𝜷 ators as black-box functions of which the internal algorithms are
𝐷 2 = ∇𝜷 𝐿𝑣𝑎𝑙 (𝜷, 𝒘 ′ ) unknown, we cannot leverage these tools directly to backpropa-
gate the preprocessing pipeline. To solve this issue, we modify the
𝐷 3 = ∇𝒘 ′ 𝐿𝑣𝑎𝑙 (𝜷, 𝒘 ′ ) · ∇𝒘,𝜷 𝐿𝑡𝑟𝑎𝑖𝑛 (𝜷, 𝒘)
original forward propagation of the pipeline (Equation 2) to be:
The gradient 𝐷 1 can be derived from Equation 7. The gradient 𝐷 2 𝑚 𝑚 𝑚
is also easy to compute, where we consider 𝒘 ′ as constant and
∑︁ ∑︁ ∑︁
𝑥𝑖 = 𝛽𝑖 𝑗 𝑜˜𝑖 𝑗 + 𝑥𝑖 −1 𝛽˜𝑖 𝑗 𝑑˜𝑖 𝑗 − 𝑥˜𝑖 −1 𝛽˜𝑖 𝑗 𝑑˜𝑖 𝑗 (14)
compute the gradient of validation loss w.r.t. 𝜷. This can be derived 𝑗=1 𝑗=1 𝑗=1
from Equation 4 or as we will show later, using backpropagation
through the pipeline. However, computing 𝐷 3 is difficult since it where 𝑜˜𝑖 𝑗 = 𝑓𝑖 𝑗 (𝑥𝑖 −1 ) is the output of the black-box TF opera-
tor; 𝑑˜𝑖 𝑗 = 𝑖 𝑗 is the numerical derivative computed using Equa-
𝜕𝑓
involves a second-order derivative and matrix-vector product com- 𝜕𝑥𝑖 −1
putation. Following the previous work on neural network architec- tion 13; 𝑥˜𝑖 −1 is a snapshot of 𝑥𝑖 −1 , which is a constant number
ture search [25], we approximate it using numerical differentiation. that has the same value as 𝑥𝑖 −1 but does not require gradient (e.g.,
Let 𝜖 be a small scalar and 𝒘 ± = 𝒘 ± 𝜖∇𝒘 ′ 𝐿𝑣𝑎𝑙 (𝜷, 𝒘 ′ ), then 𝐷 3 can 𝑥˜𝑖 −1 = 𝑥𝑖 −1 .detach() in PyTorch or 𝑥˜𝑖 −1 = stop_gradient(𝑥𝑖 −1 )
be estimated as: in TensorFlow); 𝛽˜𝑖 𝑗 is a snapshot of 𝛽𝑖 𝑗 . Note that all the variables
∇𝜷 𝐿𝑡𝑟𝑎𝑖𝑛 (𝜷, 𝒘 + ) − ∇𝜷 𝐿𝑡𝑟𝑎𝑖𝑛 (𝜷, 𝒘 − ) with tilde are constant numbers that do not have gradients.
𝐷3 ≈ (11)
2𝜖 In the forward pass, Equation 14 yields exactly the same outputs
1 In
as Equation 2, but it only requires the output of the TF operators
practice, to avoid tuning two learning rates, we can simply set 𝜂 1 = 𝜂 2 [25].
2 Chain rules for multivariable functions: Suppose that 𝑥 (𝑡 ) and 𝑦 (𝑡 ) are differentiable
and the numerical derivatives computed using the input and output
functions of 𝑡 and 𝑧 = 𝑓 (𝑥 (𝑡 ), 𝑦 (𝑡 ) ) is a differentiable function of 𝑥 and 𝑦 . Then the of TF operators. Therefore, the internal implementation of TF oper-
chain rule states 𝑑𝑧 𝜕𝑧 𝜕𝑥 𝜕𝑧
𝑑𝑡 = 𝜕𝑥 𝜕𝑡 + 𝜕𝑦 𝜕𝑡 .
𝜕𝑦
ators is not involved, which enables the automatic backpropagation
SIGMOD ’23, June 18–23, 2023, Seattle, WA Peng Li, Zhiyi Chen, Xu Chu, and Kexin Rong
Algorithm 2 DiffPrep-Fix 𝜷 (𝝉) (Line 4 - 7) with a mini-batch randomly sampled from the
Require: Space of TF types and operators 𝑆, pre-defined prototype T, training/validation set, instead of using all training/validation exam-
training set 𝐷𝑡𝑟𝑎𝑖𝑛 , validation set 𝐷 𝑣𝑎𝑙 ples, which is similar to using stochastic gradient descent in place
Ensure: Optimal pipeline parameters 𝜷 and model parameters 𝒘 of gradient descent. We apply this strategy in our experiments.
1: Initialize 𝝉 and 𝒘
2: while not converged do 4 DATA PREPROCESSING WITH FLEXIBLE
3: Fit TF operators on the transformed training data
4: Forward Propagation: Compute 𝐿𝑡𝑟𝑎𝑖𝑛 (𝜷, 𝒘 + ), 𝐿𝑡𝑟𝑎𝑖𝑛 (𝜷, 𝒘 − ),
PROTOTYPE
𝐿𝑣𝑎𝑙 (𝜷, 𝒘 ′ ) We now discuss the more general and flexible case, where the data
5: Backward Propagation: Compute ∇𝜷 𝐿𝑡𝑟𝑎𝑖𝑛 (𝜷, 𝒘 + ), preprocessing prototype is not pre-defined. In such a case, prior to
∇𝜷 𝐿𝑡𝑟𝑎𝑖𝑛 (𝜷, 𝒘 − ), ∇𝜷 𝐿𝑣𝑎𝑙 (𝜷, 𝒘 ′ ) operator selection using the method we introduced in Section 3, we
6: Compute ∇𝝉 𝐿𝑣𝑎𝑙 (𝜷, 𝒘 ′ ) need to select a prototype from space for each feature. A prototype is
7: Update 𝝉 : 𝝉 = 𝝉 − 𝜂 1 ∇𝝉 𝐿𝑣𝑎𝑙 (𝜷, 𝒘 ′ ) an ordered sequence of TF types. Therefore, for prototype selection,
8: Update 𝒘: 𝒘 = 𝒘 − 𝜂 2 ∇𝒘 𝐿𝑡𝑟𝑎𝑖𝑛 (𝜷 (𝝉 ), 𝒘 ) we need to determine (1) the TF types that are included in the
9: return 𝜷 (𝝉 ), 𝒘
prototype and (2) the order of TF types in the sequence.
To simplify this search problem, we introduce an identity transfor-
mation operator (𝐼 ) into each TF type, which is defined as 𝐼 (𝑥) = 𝑥.
to be performed. In the backward pass, the automatic differentiation
The identity TF operator is a function that maps data to itself, so
engines will compute the gradients as:
selecting the identity TF operator for a TF type during operator se-
𝑚
𝜕𝐿 𝜕𝐿 𝜕𝐿 𝜕𝐿 ∑︁ ˜ ˜ lection is equivalent to dropping this TF type in the prototype. This
= 𝑜˜𝑖 𝑗 , = 𝛽𝑖 𝑗 𝑑𝑖 𝑗 (15) allows us to include all TF types in the prototype and only focus
𝜕𝛽𝑖 𝑗 𝜕𝑥𝑖 𝜕𝑥𝑖 −1 𝜕𝑥𝑖 𝑗=1
on their order, rather than deciding which to include or exclude. A
This yields exactly the same gradients as Equation 12 with the gradi- prototype becomes a permutation of all TF types with the option
ents of black-box TF operators replaced by approximate numerical to exclude any by selecting the identity operator for it.
gradients. Therefore, this allows us to backpropagate the prepro- Now, the problem is: how to find the optimal prototype for each
cessing pipeline correctly and automatically using any automatic feature such that the optimal pipeline under the optimal prototype
differentiation engines. minimizes the validation loss? A naive approach is to enumerate
Algorithm. Algorithm 2 shows the pseudocode of DiffPrep with every possible prototype and use the method introduced in Section
a fixed prototype. We first initialize the underlying pipeline pa- 3 to find the optimal pipeline for each prototype. However, this
rameters and model parameters (Line 1). Most TF operators need approach is not feasible because the search space of prototypes
to be fitted before performing transformation. For example, stan- is large. Given a space that contains 𝑠 TF types, the number of
dardization needs to compute the mean and standard deviation possible permutations is 𝑠!. Since different features can use different
of the input data. Therefore, at each iteration, we first fit all the prototypes, for a dataset with 𝑐 features, there are 𝑠!𝑐 possible
TF operators using the transformed training data (Line 3). Note combinations, which is exponential to the number of features.
that each TF operator should be fitted on its input training data, The hardness of the above search problem is again because the
which is the output data of its previous step in the pipeline (i.e., the search space of prototypes is discrete and non-differentiable. To
operator 𝑓𝑖 𝑗 should be fitted on 𝑥𝑖 −1 over all training data). Since solve this problem efficiently, we use a method similar to that
𝑥𝑖 −1 depends on 𝜷 parameters (Equation 14), it will be changed introduced in Section 3: we first parameterize the search space using
every iteration as 𝜷 parameters are updated. Therefore, we need a set of binary prototype parameters (Section 4.1) and then make
to refit TF operators at the beginning of every iteration. Then we the search space to be continuous and differentiable (Section 4.2).
can perform forward propagation using Equation 14 (Line 4) and Finally, using the method introduced in Section 3.3, we can solve
backward propagation using automatic differentiation (Line 5). We the bi-level optimization using gradient descent, which enables our
compute the gradients needed for updating underlying pipeline pa- method to learn the prototype parameters, pipeline parameters and
rameters 𝝉 using Equation 10 and Equation 11 (Line 6). We update model parameters simultaneously and efficiently in one training
the underlying pipeline parameters and model parameters alterna- loop.
tively using gradient descent (Line 7 - 8). This will be repeated until
convergence. 4.1 Parameterization
Complexity. The number of 𝜷 parameters needed is 𝑠 × 𝑚 × 𝑐, By introducing identity TF operators, given a space of TF types S =
where 𝑠 is the number of TF types, 𝑚 is the number of TF operators {𝐹 1, 𝐹 2, ...𝐹𝑠 }, a prototype T = {𝑇1,𝑇2, ...𝑇𝑠 } would be a permutation
for each TF type and 𝑐 is the number of features. The running time of S. Hence, we can represent a prototype using a 𝑠 × 𝑠 permutation
is dominated by the forward propagation and backward propaga- matrix 𝜶 = {𝛼𝑖 𝑗 }, 𝛼𝑖 𝑗 ∈ {0, 1} defined as follows.
tion. Without DiffPrep, the training process only needs to update
(
1 𝑇𝑖 = 𝐹 𝑗
the model parameters (Line 8), which requires one pass of forward 𝛼𝑖 𝑗 = (16)
0 Otherwise
propagation and one pass of backpropagation. With DiffPrep, up-
dating 𝜷 needs three more passes of forward propagation (Line 4) In other words, 𝛼𝑖 𝑗 is 1 if the 𝑖-th TF type in the prototype is 𝐹 𝑗 ,
and three more passes of backpropagation (Line 5). To improve the otherwise it is 0. Note that the permutation matrix has exactly one
efficiency of the algorithm, we fit TF operators (Line 3) and update element of 1 each row and each column and 0s elsewhere, i.e., the
DiffPrep: Differentiable Data Preprocessing Pipeline Search for Learning over Tabular Data SIGMOD ’23, June 18–23, 2023, Seattle, WA
sum of each row and each column is exactly 1. Hence, we have pipelines will be the preprocessed features parameterized by 𝜶 , 𝜷
Í Í
𝑖 𝛼𝑖 𝑗 = 1 and 𝑗 𝛼𝑖 𝑗 = 1. parameters, which will be fed into ML models.
Example 4.1. Assume that the space consists of three TF types
𝑆 = {𝐹 1, 𝐹 2, 𝐹 3 }. Consider the prototype T = {𝐹 2, 𝐹 3, 𝐹 1 }. We have 4.2 Differentiable Relaxation
𝛼 12 = 1, 𝛼 23 = 1, 𝛼 31 = 1 and 0s elsewhere. Hence, This prototype
We can reuse the method described in Section 3.2 to relax 𝜷 pa-
can be represented using the following 𝜶 matrix.
rameters. For 𝜶 matrix, similar to relaxing 𝜷 matrix, we can relax
0 1 0 𝜶 to be continuous variables 𝛼𝑖 𝑗 ∈ [0, 1], but we need to retain
𝜶 = 0 0 1 the constraints between parameters in the original binary matrix
1 0 0 Í Í
( 𝑖 𝛼𝑖 𝑗 = 1, 𝑗 𝛼𝑖 𝑗 = 1). In other words, 𝜶 matrix requires to be a
To generate a pipeline, in addition to the prototype, we also need non-negative squared matrix with both rows and columns summing
to select a TF operator for each TF type. Let 𝐹𝑖 = {𝑓𝑖1, 𝑓𝑖2 ...𝑓𝑖𝑚 }, up to 1, which is so-called a doubly stochastic matrix (DSM). To en-
where 𝑓𝑖 𝑗 is the 𝑗-th TF operator for the TF type 𝐹𝑖 . Then, we can force that 𝜶 is a DSM, we generate it using Sinkhorn normalization.
still use Equation 1 to define the 𝜷 matrix, which represents the Sinkhorn [38, 39] showed that any non-negative square matrix can
result of operator selection. An 𝜶 matrix and a 𝜷 matrix together be converted into a DSM by repeatedly and alternatively normal-
uniquely defines a data preprocessing pipeline. The data after 𝑖 izing its rows and columns. Cruz et al. [36] introduced Sinkhorn
steps transformation can be computed as: Layer that converts CNN predictions to a DSM using Sinkhorn
normalizations. Following Adams et al. [3], we define Sinkhorn
𝑠 ∑︁
𝑚
∑︁ normalization over any squared matrix 𝑋 as:
𝑥𝑖 = 𝛼𝑖 𝑗 𝛽 𝑗𝑘 𝑓 𝑗𝑘 (𝑥𝑖 −1 ) (17)
𝑗=1 𝑘=1
Note that in Equation 17, only the TF type 𝐹 𝑗 selected for the 𝑖-th 𝑋𝑖 𝑗 𝑋𝑖 𝑗
step has 𝛼𝑖 𝑗 = 1 and only the TF operator selected for 𝐹 𝑗 has 𝛽 𝑗𝑘 = 1. 𝐶𝑖 𝑗 (𝑋 ) = Í , 𝑅𝑖 𝑗 (𝑋 ) = Í , 𝑆 (𝑋 ) = lim 𝑆 𝑙 (𝑋 )
𝑖 𝑋𝑖 𝑗 𝑗 𝑋𝑖 𝑗 𝑙→∞
Therefore, the summation is exactly the output of the selected TF (
operator for the 𝑖-th TF type in the prototype. 𝑋𝑖 𝑗 𝑙 =1
𝑆 𝑙 (𝑋 ) =
Similar to Equation 3 and 4, we can also compute the training loss 𝑅(𝐶 (𝑆 𝑙 −1 (𝑋 )) 𝑙 > 1
𝐿𝑡𝑟𝑎𝑖𝑛 (𝜶 1, ..., 𝜶 𝑐 , 𝜷 1, ..., 𝜷 𝑐 , 𝒘) and validation loss 𝐿𝑣𝑎𝑙 (𝜶 1, ..., 𝜶 𝑐 ,
𝜷 1, ..., 𝜷 𝑐 , 𝒘) using Equation 17, which will be parameterized by 𝜶 ,
𝜷 matrices. We can also rewrite the DPPS problem statement using
where 𝑅 and 𝐶 are the row and column normalization. 𝑆 (𝑋 ) is the
𝜶 , 𝜷 matrices as:
result of Sinkhorn normalization, which is a DSM.
min 𝐿𝑣𝑎𝑙 (𝜶 1, ...𝜶 𝑐 , 𝜷 1, ...𝜷 𝑐 , 𝒘 ∗ ) (18) Inspired by these work, we define 𝜶 matrix as 𝜶 = 𝑆 (𝜽 ), where
𝜶 1 ,...,𝜶 𝑐 ,𝜷 1 ,...,𝜷 𝑐 𝜽 = {𝜃𝑖 𝑗 } ∈ R𝑠+×𝑠 are non-negative underlying parameters. Note
s.t. 𝒘 ∗ = argmin𝒘 𝐿𝑡𝑟𝑎𝑖𝑛 (𝜶 1, ...𝜶 𝑐 , 𝜷 1, ..., 𝜷 𝑐 , 𝒘) that this normalization is a differentiable process, which allows
(19) us to compute the gradients w.r.t the underlying parameters. By
enforcing these constraints, we can interpret 𝛼𝑖 𝑗 (Equation 16)
Architecture. Figure 5 shows the architecture of DiffPrep with a as the probability that 𝐹 𝑗 is the 𝑖-th transformation type in the
flexible prototype. For each input feature, DiffPrep will learn a set prototype, and 𝛽𝑖 𝑗 (Equation 1) as the probability that 𝑓𝑖 𝑗 is selected
of 𝜶 and 𝜷 parameters that defines a data preprocessing pipeline. for 𝐹𝑖 . In this sense, Equation 17 will return the expected value of
The 𝜶 parameters define the prototype and the 𝜷 parameters define the transformed data, which can be used to compute the expected
the results of operator selection. The final output of DiffPrep-Flex training and validation loss.
SIGMOD ’23, June 18–23, 2023, Seattle, WA Peng Li, Zhiyi Chen, Xu Chu, and Kexin Rong
Algorithm 3 DiffPrep-Flex that are widely used in practice and provided by scikit-learn [31]
Require: Space of TF types and operators 𝑆, training set 𝐷𝑡𝑟𝑎𝑖𝑛 , validation as shown in Table 1. For operators that have parameters, we dis-
set 𝐷 𝑣𝑎𝑙 cretize them with different parameters. For example, for the Z-score
Ensure: Optimal prototype parameters 𝜶 , pipeline parameters 𝜷, and method, we consider Z-score(2), Z-score(3) and Z-score(4) as 3 dif-
model parameters 𝒘 ferent TF operators. We also add an identity operator to each TF
1: Initialize 𝜽 , 𝝉 and 𝒘 type which amounts to a skipping operator. Since most TF opera-
2: while not converged do tors provided by scikit-learn require the input data to contain no
3: Fit TF operators on the transformed training data
missing values, we do not add the identity operator to the miss-
4: Forward Propagation: Compute 𝐿𝑡𝑟𝑎𝑖𝑛 (𝜶 , 𝜷, 𝒘 + ),
ing value imputation. We enforce the missing value imputation
𝐿𝑡𝑟𝑎𝑖𝑛 (𝜶 , 𝜷, 𝒘 − ), 𝐿𝑣𝑎𝑙 (𝜶 , 𝜷, 𝒘 ′ )
5: Backward Propagation: Compute ∇𝜶 𝐿𝑡𝑟𝑎𝑖𝑛 (𝜶 , 𝜷, 𝒘 + ), to be the first TF type in the prototype so that missing values are
∇𝜶 𝐿𝑡𝑟𝑎𝑖𝑛 (𝜶 , 𝜷, 𝒘 − ), ∇𝜶 𝐿𝑣𝑎𝑙 (𝜶 , 𝜷, 𝒘 ′ ), ∇𝜷 𝐿𝑡𝑟𝑎𝑖𝑛 (𝜶 , 𝜷, 𝒘 + ), imputed in the first step. In addition, many operators only accept
∇𝜷 𝐿𝑡𝑟𝑎𝑖𝑛 (𝜶 , 𝜷, 𝒘 − ), ∇𝜷 𝐿𝑣𝑎𝑙 (𝜶 , 𝜷, 𝒘 ′ ) numerical input data. Therefore, after missing value imputation,
6: Compute ∇𝝉 𝐿𝑣𝑎𝑙 (𝜶 , 𝜷, 𝒘 ′ ), ∇𝜽 𝐿𝑣𝑎𝑙 (𝜶 , 𝜷, 𝒘 ′ ) we transform all categorical features into numerical features using
7: Update 𝝉, 𝜽 : 𝝉 = 𝝉 − 𝜂 1 ∇𝝉 𝐿𝑣𝑎𝑙 (𝜶 , 𝜷, 𝒘 ′ ), one-hot encoding.
𝜽 = 𝜽 − 𝜂 1 ∇𝜽 𝐿𝑣𝑎𝑙 (𝜶 , 𝜷, 𝒘 ′ )
Methods Compared. We compare the following data preprocess-
8: Update 𝒘: 𝒘 = 𝒘 − 𝜂 2 ∇𝒘 𝐿𝑡𝑟𝑎𝑖𝑛 (𝜶 , 𝜷, 𝒘 )
9: return 𝜶 (𝜽 ), 𝜷 (𝝉 ), 𝒘
ing approaches.
data preprocessing. Since we focus on data preprocessing and connect-4 may benefit from cleaning outliers properly as it has
feature engineering is not performed for all other methods, for a great number of outliers. On this dataset, we found DP-Fix
a fair comparison, we turn off the feature processor and only selects Z-score methods for some features, while selecting skip
use the data processor. We will report the results of AS with its operators for some other features. This shows the benefits of
feature processor later in Section 5.6, where we will show that DiffPrep by using feature-wise pipelines, where we can clean
our methods can also be combined with a feature processor. true outliers in some features without affecting other features
• Learn2Clean (LC) [5]. This is a reinforcement-learning-based data that may not contain true outliers. We will explain the con-
cleaning and preparation method proposed in the literature. It tribution of each component of DiffPrep in more detail in the
uses Q-learning to select the optimal preprocessing pipeline that ablation study (Section 5.4).
maximizes the ML model performance. The TF types, operators (3) DiffPrep-Fix (DP-Fix) and DiffPrep-Flex (DP-Flex) respectively
and order are all flexible, but it only generates one pipeline to outperform the best baseline methods on 11 and 13 datasets.
preprocess all the features in the same way. We modify the search Comparing DiffPrep-Flex with DiffPrep-Fix, DiffPrep-Flex out-
space of TF operators to be the same as ours (Table 1). performs DiffPrep-Fix on 9 out of 18 datasets, but the two meth-
• BoostClean (BC) [20]. This is an automatic data cleaning approach ods differ by less than 2% on 17 datasets. DiffPrep-Flex does
that uses boosting to select an ensemble of data cleaning opera- not significantly improve upon DiffPrep-Fix despite the larger
tors (or pipelines) from a pre-defined candidate set to maximize search space, because the fixed order we used for DiffPrep-Fix
validation accuracy. BoostClean combines the ML models trained is already the optimal order for many datasets. Hence, changing
on different transformed data induced by different pipelines, thus the order in DiffPrep-Flex may not lead to further improvement.
it needs to train multiple models on different candidate pipelines. Also, as the gradient descent is not always guaranteed to find
BoostClean does not support feature-wise pipelines. To generate the global optimum, it is possible to find a less optimal solution,
candidate pipelines, we use the same transformation order as especially for DiffPrep-Flex which has a harder optimization
DiffPrep-Fix and randomly sample 50 pipelines from the space problem due to its flexibility. However, DiffPrep-Fix with de-
(Table 1). BoostClean is designed for binary classification but can fault order is insufficient as it may not work well for all datasets.
be adapted for multi-class classification by breaking it down into As we will show in the ablation study, if the order is not set
multiple binary tasks using the one-vs-all method. We adopt this properly, DiffPrep-Fix may perform worse, while DiffPrep-Flex
method in our experiments. We set its ensemble size 𝐵 = 5, the can avoid such cases. In addition, the default order will be in-
best setup it reported. valid if users add custom TF types. The benefit of DiffPrep-Flex
is that it does not require and is not affected by the pre-defined
Training Process. We use an SGD optimizer to optimize the model transformation order.
parameters and use an Adam optimizer to optimize the pipeline (4) RandomSearch (RS) performs better than Default on 15 out
parameters. The learning rate is tuned using the validation set. The of 18 datasets. Especially on wall-robot-nav and run_or_walk,
batch size is 512 and the number of training epochs is 1000. We the accuracy is improved by 17.5% and 11%, respectively. This
keep track of the validation loss during training and report the indicates that using the same default preprocessing pipeline for
result at the epoch with the minimum validation loss. The training all datasets is generally not a good strategy, despite its wide
and evaluation are implemented using PyTorch [30], which utilizes adoption in practice.
parallelism by default. (5) BoostClean (BC) performs worse than DiffPrep (DP-Fix and DP-
Evaluation Metrics. Our goal is to automatically and efficiently compose a data preprocessing pipeline that maximizes downstream ML model performance. Therefore, we use the test accuracy of the model and the end-to-end running time as the evaluation metrics.

5.2 Performance Comparison

Table 2: Comparison of model test accuracy on 18 real-world datasets using different data preprocessing pipelines. DiffPrep (DP-Fix and DP-Flex combined) achieves the best test accuracy on 15 out of 18 datasets.

Accuracy Comparison. Table 2 shows the test accuracy of the ML model on 18 real-world datasets with different preprocessing methods. We have the following observations from Table 2:
(1) Different preprocessing pipelines can lead to significantly different model performance. For example, on wall-robot-nav, the accuracy differs by more than 20% across pipelines.
(2) DiffPrep (DP-Fix and DP-Flex combined) achieves the best test accuracy on 15 out of 18 datasets. Our methods surpass the best baseline method by more than 1% on 9 datasets. In particular, on run_or_walk, obesity and connect-4, our methods outperform the best baseline method by 6.6%, 5.5% and 4.2%, respectively. This shows the effectiveness of DiffPrep and the significant improvement in model performance that comes from using a larger search space. The reasons for the gains vary by dataset.
(3) DiffPrep-Flex does not consistently outperform DiffPrep-Fix despite its larger search space, because the fixed order we used for DiffPrep-Fix is already the optimal order for many datasets; hence, changing the order in DiffPrep-Flex may not lead to further improvement. Also, as gradient descent is not guaranteed to find the global optimum, it is possible to find a less optimal solution, especially for DiffPrep-Flex, which has a harder optimization problem due to its flexibility. However, DiffPrep-Fix with the default order is not sufficient on its own, as it may not work well for all datasets. As we will show in the ablation study, if the order is not set properly, DiffPrep-Fix may perform worse, while DiffPrep-Flex can avoid such cases. In addition, the default order becomes invalid if users add custom TF types. The benefit of DiffPrep-Flex is that it does not require and is not affected by the pre-defined transformation order.
(4) RandomSearch (RS) performs better than Default on 15 out of 18 datasets. In particular, on wall-robot-nav and run_or_walk, the accuracy is improved by 17.5% and 11%, respectively. This indicates that using the same default preprocessing pipeline for all datasets is generally not a good strategy, despite its wide adoption in practice.
(5) BoostClean (BC) performs worse than DiffPrep (DP-Fix and DP-Flex combined) on 17 out of 18 datasets. Comparing BC with RandomSearch (RS): on binary classification tasks, BoostClean outperforms RS on 6 out of 9 datasets. This is because BoostClean uses an ensemble of models and statistical boosting to improve model performance. Also, BoostClean has a larger candidate set, with 30 additional pipelines compared to RandomSearch, increasing its chance of finding a better pipeline. However, on datasets with more than 2 classes, BoostClean only outperforms RandomSearch on 3 out of 9 datasets. In particular, on obesity and abalone, BoostClean is 18.9% and 7.5% worse than RandomSearch. This is because BoostClean uses the one-vs-all method to handle multi-class classification, which may cause class imbalance issues and thus lead to worse model performance when the number of classes is large.
(6) Although Auto-Sklearn and RandomSearch use different optimization methods, they perform close to each other on most datasets: the difference is less than 2% on 14 out of 18 datasets. This is because they use a similar search space. In comparison, our methods consider a larger space with feature-wise pipelines and flexible order, which further improves performance.
(7) Learn2Clean (LC) performs worst on average and outperforms Default on only 6 datasets. Although Learn2Clean has a larger search space than the other baseline methods, its Q-learning method is not effective at finding a good pipeline, so it performs worse than the other methods.

Running Time Comparison. To compare running time, we categorize all the datasets into bins based on the number of rows in each dataset (e.g., 0-10K, 10-20K) and report the average end-to-end running time of the different methods over the datasets in each bin, as shown in Figure 6(a). As expected, Default takes the shortest time since it only trains the model once with the fixed pipeline. RandomSearch (RS) is about 20 times slower than Default, since it simply trains the model 20 times with 20 randomly sampled pipelines. Auto-Sklearn (AS) constantly takes about 1 hour, as it keeps searching for the best result within the given 1-hour time limit. BoostClean (BC) is about 50 times slower than Default on binary-class datasets, as it needs to train a model on every candidate pipeline, and it is much slower on multi-class datasets, as it uses the one-vs-all method and needs to rerun the whole algorithm for each subclass. Additionally, while both RandomSearch and BoostClean can utilize parallelism to decrease running time by training multiple models simultaneously on multiple machines, this study only considers the single-machine setup, which is more common for general users. Overall, DiffPrep-Fix takes about half the time of RandomSearch, which amounts to 10 times slower than Default, and it is generally faster than Learn2Clean. DiffPrep-Flex takes a similar amount of time as RandomSearch and is generally faster than Auto-Sklearn. Also, the running time of our methods grows linearly with the size of the dataset, showing their scalability.

Note that although DiffPrep only requires training the model once, it incurs additional overhead from the forward/backward propagation through the pipelines. Recall that DiffPrep uses “dynamic” pipelines that change at every iteration; therefore, we need to reinvoke TF operators to transform the data at every iteration. The ratio between our methods and the Default method depends on the time it takes to propagate through the pipelines and through the model. Our experiments use logistic regression, for which propagation through the model is relatively fast. However, as the model size increases, the ratio between our methods and Default goes down. For example, Figure 6(b) shows that with a larger ML model (a two-layer neural network with 100 neurons), DiffPrep-Fix is 2-3 times slower and DiffPrep-Flex is 6-7 times slower than Default, both of which are much faster than the other baseline methods.

Figure 6: Comparison of end-to-end running time of different preprocessing methods. (a) Logistic Regression; (b) Two-layer Neural Network.
5.3 Sensitivity Analysis

Non-linear ML models. One concern for our methods is that the approximate gradient computation (e.g., Equation 11) may not work well for non-linear ML models. To understand the impact of the end ML model, we test our methods with a non-linear model: a two-layer neural network with ReLU activation and 100 neurons in the hidden layer. As shown in Table 3, compared with the other methods, DiffPrep (DP-Fix and DP-Flex combined) achieves the best test accuracy on 10 out of 18 datasets. In particular, on pbcseq and avila, our method outperforms the best baseline method by 3.1% and 3%, respectively. This shows that our method remains effective with non-linear models in practice. This corroborates the findings in DARTS [25], which uses a similar approximate gradient computation for neural network architecture search and empirically works for non-linear models.
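For reference, a minimal PyTorch definition of the non-linear end model described above (two layers, ReLU activation, 100 hidden neurons) could look like the sketch below; `num_features` and `num_classes` are dataset-dependent placeholders.

```python
import torch.nn as nn

def two_layer_net(num_features, num_classes, hidden=100):
    # Two-layer neural network with ReLU activation and 100 hidden neurons.
    return nn.Sequential(
        nn.Linear(num_features, hidden),
        nn.ReLU(),
        nn.Linear(hidden, num_classes),
    )
```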
Training/validation split. A small training set can result in overfitting issues, and using a 60:20 train/validation split (i.e., a 25% validation ratio) is good enough in practice. The results on other datasets and on DiffPrep-Flex reveal similar findings and thus are omitted here.

Table 4: Comparison of model accuracy and data cleaning quality on synthetic datasets with injected errors.

Model Accuracy
Dataset         DEF    RS     AS     LC     BC     DP-Fix  DP-Flex
avila           0.521  0.577  0.575  0.536  0.546  0.630   0.614
eeg             0.577  0.646  0.637  0.521  0.641  0.661   0.663
wall-robot-nav  0.66   0.863  0.818  0.646  0.862  0.892   0.888

Imputation Quality (RMSE)
Dataset         DEF       RS        AS        LC        DP-Fix    DP-Flex
avila           1.043     1.05      1.043     1.05      0.696     0.8
eeg             1787.601  1788.85   1787.601  1788.85   2662.249  2000.434
wall-robot-nav  1.25      0.965     1.25      1.312     1.273     1.427
Alternative Optimization Objective. To understand the benefits of bi-level optimization, we experiment with an alternative, one-level optimization objective, where both the pipeline parameters (𝜶, 𝜷) and the model parameters (𝒘) are learned to minimize the training loss. This optimization problem can be solved by computing the gradient of the training loss w.r.t. each parameter and updating 𝜶, 𝜷, 𝒘 simultaneously using gradient descent. We run DiffPrep-Fix using this alternative objective, and the results are shown in Table 2 as DP-Fix (train-opt). Compared with this alternative method, our original DiffPrep-Fix (DP-Fix) performs better on 12 out of 18 datasets and has the same accuracy on 2 datasets. We hypothesize that this is because optimizing pipelines and models jointly on the training data makes it easier to overfit the training data and thus reduces test accuracy. In addition, although our method achieves better accuracy in most cases, the difference between the two methods is less than 1% on 17 out of 18 datasets. This is because, even though the alternative objective only involves the training set, we tune hyperparameters (e.g., learning rates) using the validation set and adopt early stopping to mitigate the overfitting. Similar findings are also observed in DARTS [25], which uses a validation set to select the network architecture.
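Written out, the two objectives differ only in which loss the pipeline parameters minimize. Using the paper's notation (𝜶, 𝜷 for pipeline parameters, 𝒘 for model parameters), they can be sketched as follows; the exact loss symbols and any regularization terms are assumptions and omitted details, respectively.

```latex
% Bi-level objective (DP-Fix): pipeline parameters minimize the validation
% loss, while the model parameters minimize the training loss.
\min_{\boldsymbol{\alpha},\, \boldsymbol{\beta}}
    \; \mathcal{L}_{\mathrm{val}}\bigl(\boldsymbol{w}^{*}(\boldsymbol{\alpha}, \boldsymbol{\beta}),\, \boldsymbol{\alpha},\, \boldsymbol{\beta}\bigr)
\quad \text{s.t.} \quad
\boldsymbol{w}^{*}(\boldsymbol{\alpha}, \boldsymbol{\beta})
    = \arg\min_{\boldsymbol{w}} \; \mathcal{L}_{\mathrm{train}}(\boldsymbol{w}, \boldsymbol{\alpha}, \boldsymbol{\beta})

% One-level alternative (DP-Fix (train-opt)): all parameters are learned
% jointly to minimize the training loss.
\min_{\boldsymbol{\alpha},\, \boldsymbol{\beta},\, \boldsymbol{w}}
    \; \mathcal{L}_{\mathrm{train}}(\boldsymbol{w}, \boldsymbol{\alpha}, \boldsymbol{\beta})
```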
5.5 Case Study with Synthetic Datasets

Traditional data cleaning or preprocessing methods focus on improving data quality without considering downstream ML models. However, in the context of data cleaning for ML, prior works [24, 27] have shown that this may not improve, and can even degrade, ML model performance. This motivates us to take ML models into consideration and use model accuracy as guidance to select preprocessing pipelines. To justify this motivation, we randomly inject 10% missing values into three datasets. Since the ground truth of the synthetic datasets is known, we can evaluate the quality of missing value imputation using the root mean squared error (RMSE) between the imputed values and the ground truth values. Table 4 shows the model accuracy and the quality of missing value imputation using different preprocessing methods. Note that we omit data quality for BoostClean (BC) as it uses an ensemble of ML models trained on differently imputed data. The results show that DiffPrep consistently achieves the best model accuracy on all synthetic datasets. However, the imputation methods selected by our methods are not necessarily those with the best data quality. For example, on wall-robot-nav, the imputation quality of DiffPrep-Flex is the worst, yet it achieves better accuracy than the other baseline methods. This indicates that selecting operators purely by data quality may not lead to the best model performance. Instead, we need to jointly consider the other operators in the pipeline and the ML model to achieve the best result.
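For completeness, the imputation-quality numbers in Table 4 are standard RMSE values; a minimal sketch is shown below, assuming NumPy arrays and a boolean `injected_mask` that marks the cells where missing values were injected (restricting the error to those cells is our assumption about the setup).

```python
import numpy as np

def imputation_rmse(ground_truth, imputed, injected_mask):
    """RMSE between the imputed values and the ground-truth values,
    computed over the cells where missing values were injected."""
    diff = imputed[injected_mask] - ground_truth[injected_mask]
    return float(np.sqrt(np.mean(diff ** 2)))
```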
5.6 Beyond Data Preprocessing

While DiffPrep focuses on optimizing data preprocessing-related parameters, it can be easily integrated with AutoML methods that target other components of the ML pipeline, such as feature selection, feature embedding and hyperparameter tuning. In the previous experiments, we have already combined DiffPrep with learning rate tuning. In this example, we demonstrate how DiffPrep can work with feature extraction. Specifically, we train a random forest classifier on the raw input data as a feature extractor, where we use the tree node embeddings and predicted probabilities as the extracted features. We append the extracted features to the original data and
feed the data into DiffPrep for data preprocessing and model training. Table 5 shows the changes in the number of features and the test accuracy of DP-Fix with and without the feature extractor. The accuracy is significantly improved on 16 out of 18 datasets thanks to feature extraction. For reference, we report the performance change of Auto-Sklearn (AS) with its feature processor turned on, which performs feature extraction and selection on top of data preprocessing. Its accuracy is significantly improved on 11 out of 18 datasets. Note that the performance on some datasets becomes worse, as enabling the feature processor can make the search harder due to the increase in training time and search space.
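A minimal sketch of the feature extractor described above is given below, using scikit-learn: `apply()` returns the leaf index each sample reaches in each tree (the tree-node embedding), and `predict_proba()` provides the predicted class probabilities. The sketch assumes the raw features are numeric and complete; how leaf indices are encoded (e.g., whether they are one-hot encoded) is left open and may differ from our actual setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def extract_features(X_train, y_train, X):
    """Fit a random forest on the raw training data and append its
    tree-node embeddings and predicted class probabilities to the
    original features."""
    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    rf.fit(X_train, y_train)
    leaf_ids = rf.apply(X)        # shape: (n_samples, n_trees)
    probs = rf.predict_proba(X)   # shape: (n_samples, n_classes)
    return np.hstack([X, leaf_ids, probs])
```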
6 RELATED WORK

Automated Machine Learning. The traditional workflow of developing ML models requires significant domain knowledge and human effort, especially for data preprocessing and model training, which can be time-consuming and expensive. Automated machine learning, also known as AutoML, aims to reduce the need for human involvement by automating the whole process of ML model development. Our work provides an effective and efficient solution to automate the process of data preprocessing.

Many existing AutoML tools enable automation in the process of data preprocessing and model training. Azure [4] is an AutoML system developed by Microsoft. It uses matrix factorization and Bayesian optimization to automatically select models and tune hyperparameters. However, for data preprocessing, it only considers tuning the normalization methods. For other transformations, it simply uses pre-defined default methods, such as mean imputation for missing value imputation. The order of transformations is fixed, and different features are preprocessed in the same way. Auto-Sklearn [10] is an open-source AutoML package. It also relies on Bayesian optimization to automate data preprocessing and model training. Specifically, it adopts random-forest-based sequential model-based optimization to explore the search space. For data preprocessing, Auto-Sklearn considers tuning a larger set of transformations than Azure. However, it also uses a fixed order of transformations and preprocesses all features in the same way. In comparison, our methods allow a flexible order of transformations, and different features can use different pipelines. Instead of Bayesian optimization, we use bi-level optimization with gradient descent, which is faster and only needs to train the model once.

Data Cleaning for ML. Our work builds upon a line of data cleaning methods that incorporate signals from the downstream ML models into the design of cleaning objectives [27]. DiffML [17] is our closest work in this area, since it adopts a similar idea of making ML pipelines differentiable so that preprocessing steps can be jointly learned with the ML model using backpropagation. However, DiffML considers each preprocessing step separately, while we consider a pipeline of preprocessing steps and formulate the ordering using Sinkhorn. In addition, DiffML minimizes the training loss with one-level optimization, while we minimize the validation loss with bi-level optimization, which reduces the risk of overfitting. BoostClean [20] automatically selects cleaning algorithms from the search space via statistical boosting to maximize the ML model's validation accuracy. It supports conditional cleaning operations defined by a combination of custom detection and repair functions. However, only 192 to 1976 operators were used in its evaluation, which is much smaller than the number of pipelines evaluated in our search space, and BoostClean does not consider feature-wise pipelines. AlphaClean [22] is a similar automatic data cleaning pipeline generation system that finds cleaning pipelines from data cleaning libraries to maximize user-defined quality measures. AlphaClean uses tree search algorithms and learns pruning heuristics to reduce the search space. Learn2Clean [5] uses reinforcement learning to select a sequence of data preprocessing operators such that the ML model performance is maximized. However, reinforcement learning needs to train and evaluate models many times to obtain rewards, which could lead to scalability challenges. ActiveClean [21] focuses on gradient-based models and prioritizes cleaning examples with higher gradients, which are likely to have large impacts on the model. CPClean [18] quantifies the impact of data cleaning on ML models using the uncertainty of predictions and prioritizes cleaning examples that would lead to the maximum reduction in prediction uncertainty. Both ActiveClean and CPClean address which examples should be cleaned, while our method is designed to select appropriate preprocessing/cleaning operators. Data Cleaning and AutoML [28] investigated the impact of data cleaning on AutoML systems and found that data errors did not affect AutoML performance significantly, as AutoML can select robust models and adjust ML pipelines to handle data errors properly. This verifies our motivation to take downstream models into consideration when handling data errors rather than isolating data cleaning as a separate process.

Network Architecture Search. Network architecture search (NAS) aims to automate the design of neural network architectures. Although our work is not directly related to NAS, many of our ideas are inspired by DARTS [25], a differentiable NAS method. To search network architectures efficiently, DARTS first relaxes the categorical choice of network operators using continuous architecture parameters, which is similar to the pipeline parameters we use for the choices of TF operators. It then formulates the architecture search as a bi-level optimization problem, which can be solved by updating the architecture and model parameters alternately using gradient descent. Although DARTS and DiffPrep are similar in their methodology, there are two major differences. First, most architecture operators (e.g., convolution, ReLU) are differentiable and their gradients can be easily computed using automatic differentiation engines (e.g., PyTorch). However, the TF operators for data preprocessing can be complex and their gradients may not be easily computed. Second, the same type of TF operator rarely appears more than once in a pipeline, and the order of TF operator types is usually flexible. In comparison, architecture operators usually repeat multiple times in an architecture, but their order is usually fixed (e.g., DARTS uses ReLU-Conv-BN).
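To illustrate the analogy, the sketch below shows the two kinds of continuous relaxation mentioned in this section at the level of a single step: a DARTS-style softmax mixture over candidate TF operators, and a Sinkhorn normalization [38, 39] that turns a score matrix into a (near) doubly stochastic matrix, which can act as a soft, differentiable ordering of TF types. This is a schematic illustration only; DiffPrep's actual parameterization is defined earlier in the paper and may differ in its details, and the function names and shapes here are assumptions.

```python
import torch
import torch.nn.functional as F

def mix_tf_operator_outputs(candidate_outputs, alpha):
    """DARTS-style relaxation of a categorical choice: instead of picking a
    single TF operator, return a softmax-weighted sum of all candidate
    operator outputs. candidate_outputs: (num_ops, n, d); alpha: (num_ops,)."""
    weights = F.softmax(alpha, dim=0)
    return (weights.view(-1, 1, 1) * candidate_outputs).sum(dim=0)

def sinkhorn(beta, num_iters=20):
    """Sinkhorn normalization: alternately normalize the rows and columns of
    exp(beta) so the result is approximately doubly stochastic."""
    P = torch.exp(beta)
    for _ in range(num_iters):
        P = P / P.sum(dim=1, keepdim=True)   # normalize rows
        P = P / P.sum(dim=0, keepdim=True)   # normalize columns
    return P
```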
7 CONCLUSION

In this paper, we propose DiffPrep, a method that can automatically and efficiently search the optimal data preprocessing pipeline for a given tabular dataset and an ML model such that the performance of the ML model is maximized. We formalize the problem of automatic data preprocessing as a bi-level optimization problem. We then relax the discrete search space using continuous parameters, which
enables us to search for optimal preprocessing pipelines efficiently using gradient descent. The experiments show that our method achieves the best test accuracy on 15 out of 18 real-world datasets and improves the model's test accuracy by up to 6.6 percentage points. We note that our proposed differentiable search strategy can only search for optimal data preprocessing pipelines when the end model is differentiable. We leave to future work how to search for optimal data preprocessing pipelines for non-differentiable end models, such as random forests.

ACKNOWLEDGMENTS

We thank the many members of the Georgia Tech Database Group for their valuable feedback on this work. This research was supported in part by Bosch Research North America.
REFERENCES
[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. TensorFlow: a system for large-scale machine learning. In OSDI, Vol. 16. Savannah, GA, USA, 265–283.
[2] Ziawasch Abedjan, Xu Chu, Dong Deng, Raul Castro Fernandez, Ihab F Ilyas, Mourad Ouzzani, Paolo Papotti, Michael Stonebraker, and Nan Tang. 2016. Detecting data errors: Where are we and what needs to be done? Proceedings of the VLDB Endowment 9, 12 (2016), 993–1004.
[3] Ryan Prescott Adams and Richard S Zemel. 2011. Ranking via Sinkhorn propagation. arXiv preprint arXiv:1106.1925 (2011).
[4] Microsoft Azure. [n. d.]. Azure AutoML. https://azure.microsoft.com/en-us/services/machine-learning/automatedml/. Accessed: August 23, 2023.
[5] Laure Berti-Equille. 2019. Learn2Clean: Optimizing the sequence of tasks for web data preparation. In The World Wide Web Conference. 2580–2586.
[6] Riccardo Cappuzzo, Paolo Papotti, and Saravanan Thirumuruganathan. 2020. Creating embeddings of heterogeneous relational datasets for data integration tasks. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. 1335–1349.
[7] Xu Chu, Ihab F Ilyas, Sanjay Krishnan, and Jiannan Wang. 2016. Data cleaning: Overview and emerging challenges. In Proceedings of the 2016 International Conference on Management of Data. ACM, 2201–2206.
[8] Yuanzheng Ci, Chen Lin, Ming Sun, Boyu Chen, Hongwen Zhang, and Wanli Ouyang. 2021. Evolving search space for neural architecture search. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 6659–6669.
[9] Yue Deng, Feng Bao, Youyong Kong, Zhiquan Ren, and Qionghai Dai. 2016. Deep direct reinforcement learning for financial signal representation and trading. IEEE Transactions on Neural Networks and Learning Systems 28, 3 (2016), 653–664.
[10] Matthias Feurer, Katharina Eggensperger, Stefan Falkner, Marius Lindauer, and Frank Hutter. 2020. Auto-Sklearn 2.0: Hands-free AutoML via Meta-Learning. arXiv:2007.04074 [cs.LG] (2020).
[11] Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Springenberg, Manuel Blum, and Frank Hutter. 2015. Efficient and Robust Automated Machine Learning. In Advances in Neural Information Processing Systems 28 (2015). 2962–2970.
[12] Luca Franceschi, Michele Donini, Paolo Frasconi, and Massimiliano Pontil. 2017. Forward and reverse gradient-based hyperparameter optimization. In International Conference on Machine Learning. PMLR, 1165–1173.
[13] Salvador García, Julián Luengo, and Francisco Herrera. 2015. Data Preprocessing in Data Mining. Springer.
[14] Joseph Giovanelli, Besim Bilalli, and Alberto Abelló Gamazo. 2021. Effective data pre-processing for AutoML. In Proceedings of the 23rd International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data (DOLAP), co-located with the 24th International Conference on Extending Database Technology and the 24th International Conference on Database Theory (EDBT/ICDT 2021), Nicosia, Cyprus, March 23, 2021. CEUR-WS.org, 1–10.
[15] Simon Hawkins, Hongxing He, Graham Williams, and Rohan Baxter. 2002. Outlier detection using replicator neural networks. In International Conference on Data Warehousing and Knowledge Discovery. Springer, 170–180.
[16] Xin He, Kaiyong Zhao, and Xiaowen Chu. 2021. AutoML: A survey of the state-of-the-art. Knowledge-Based Systems 212 (2021), 106622.
[17] Benjamin Hilprecht, Christian Hammacher, Eduardo Reis, Mohamed Abdelaal, and Carsten Binnig. 2022. DiffML: End-to-end Differentiable ML Pipelines. arXiv preprint arXiv:2207.01269 (2022).
[18] Bojan Karlaš, Peng Li, Renzhi Wu, Nezihe Merve Gürel, Xu Chu, Wentao Wu, and Ce Zhang. 2021. Nearest neighbor classifiers over incomplete information: From certain answers to certain predictions. In PVLDB, Vol. 14.
[19] Konstantina Kourou, Themis P Exarchos, Konstantinos P Exarchos, Michalis V Karamouzis, and Dimitrios I Fotiadis. 2015. Machine learning applications in cancer prognosis and prediction. Computational and Structural Biotechnology Journal 13 (2015), 8–17.
[20] Sanjay Krishnan, Michael J Franklin, Ken Goldberg, and Eugene Wu. 2017. BoostClean: Automated error detection and repair for machine learning. arXiv preprint arXiv:1711.01299 (2017).
[21] Sanjay Krishnan, Jiannan Wang, Eugene Wu, Michael J Franklin, and Ken Goldberg. 2016. ActiveClean: Interactive data cleaning for statistical modeling. Proceedings of the VLDB Endowment 9, 12 (2016), 948–959.
[22] Sanjay Krishnan and Eugene Wu. 2019. AlphaClean: Automatic generation of data cleaning pipelines. arXiv preprint arXiv:1904.11827 (2019).
[23] Erin LeDell and Sebastien Poirier. 2020. H2O AutoML: Scalable automatic machine learning. In Proceedings of the AutoML Workshop at ICML, Vol. 2020.
[24] Peng Li, Xi Rao, Jennifer Blase, Yue Zhang, Xu Chu, and Ce Zhang. 2021. CleanML: A study for evaluating the impact of data cleaning on ML classification tasks. In 2021 ICDE. IEEE, 13–24.
[25] Hanxiao Liu, Karen Simonyan, and Yiming Yang. 2018. DARTS: Differentiable architecture search. arXiv preprint arXiv:1806.09055 (2018).
[26] Risheng Liu, Jiaxin Gao, Jin Zhang, Deyu Meng, and Zhouchen Lin. 2021. Investigating bi-level optimization for learning and vision from a unified perspective: A survey and beyond. IEEE Transactions on Pattern Analysis and Machine Intelligence (2021).
[27] Felix Neutatz, Binger Chen, Ziawasch Abedjan, and Eugene Wu. 2021. From Cleaning before ML to Cleaning for ML. Data Engineering (2021), 24.
[28] Felix Neutatz, Binger Chen, Yazan Alkhatib, Jingwen Ye, and Ziawasch Abedjan. 2022. Data Cleaning and AutoML: Would an optimizer choose to clean? Datenbank-Spektrum (2022), 1–10.
[29] Randal S Olson and Jason H Moore. 2016. TPOT: A tree-based pipeline optimization tool for automating machine learning. In Workshop on Automatic Machine Learning. PMLR, 66–74.
[30] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. (2017).
[31] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.
[32] Leif E Peterson. 2009. K-nearest neighbor. Scholarpedia 4, 2 (2009), 1883.
[33] Alexandre Quemy. 2020. Two-stage optimization for machine learning workflow. Information Systems 92 (2020), 101483.
[34] Jyoti Ramteke, Samarth Shah, Darshan Godhia, and Aadil Shaikh. 2016. Election result prediction using Twitter sentiment analysis. In 2016 International Conference on Inventive Computation Technologies (ICICT), Vol. 1. IEEE, 1–5.
[35] Barna Saha and Divesh Srivastava. 2014. Data quality: The other face of big data. In 2014 ICDE. IEEE, 1294–1297.
[36] Rodrigo Santa Cruz, Basura Fernando, Anoop Cherian, and Stephen Gould. 2017. DeepPermNet: Visual permutation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3949–3957.
[37] Zeyuan Shang, Emanuel Zgraggen, Benedetto Buratti, Ferdinand Kossmann, Philipp Eichmann, Yeounoh Chung, Carsten Binnig, Eli Upfal, and Tim Kraska. 2019. Democratizing data science through interactive curation of ML pipelines. In Proceedings of the 2019 International Conference on Management of Data. 1171–1188.
[38] Richard Sinkhorn. 1964. A relationship between arbitrary positive matrices and doubly stochastic matrices. The Annals of Mathematical Statistics 35, 2 (1964), 876–879.
[39] Richard Sinkhorn and Paul Knopp. 1967. Concerning nonnegative matrices and doubly stochastic matrices. Pacific J. Math. 21, 2 (1967), 343–348.
[40] Chris Thornton, Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown. 2013. Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 847–855.
[41] Joaquin Vanschoren, Jan N. van Rijn, Bernd Bischl, and Luis Torgo. 2013. OpenML: Networked Science in Machine Learning. SIGKDD Explorations 15, 2 (2013), 49–60. https://doi.org/10.1145/2641190.2641198
[42] Zixuan Zhao and Raul Castro Fernandez. 2022. Leva: Boosting Machine Learning Performance with Relational Embedding Data Augmentation. In Proceedings of the 2022 International Conference on Management of Data. 1504–1517.

Received October 2022; revised January 2023; accepted February 2023