Deep Neural Networks and Tabular Data A Survey
Deep Neural Networks and Tabular Data A Survey
applications [38], [39]. Thus, to tackle the data preprocessing neural networks. An overview of explanation mechanisms
and privacy challenges, probabilistic tabular data generation for deep models for tabular data is presented in Section VI.
is essential. Finally, with stricter data protection laws such as In Section VII, we provide an extensive empirical comparison
California Consumer Privacy Act (CCPA) [40] and the Euro- of machine and deep learning methods on real-world data,
pean General Data Protection Regulation (EU GDPR) [41], which also involves model size, runtime, and interpretability.
which both mandate a right to explanations for automated In Section VIII, we summarize the state of the field and give
decision systems (e.g., in the form or recourse [42]), inter- future perspectives. Finally, we outline several open research
pretability is becoming a key aspect for predictive models used questions before concluding in Section IX.
for tabular data [43], [44]. During deployment, interpretability
methods also serve as a valuable tool for model debugging II. R ELATED W ORK
and auditing [45]. To the best of our knowledge, there is no study dedicated
Evidently, apart from the core challenges of inference, gen- exclusively to the application of deep neural networks to
eration, and interpretability, there are several other important tabular data, spanning the areas of supervised and unsuper-
subfields, such as working with data streams, distribution vised learning, data synthesis, and interpretability. Prior works
shifts, as well as privacy and fairness considerations that cover some of these aspects, but none of them systematically
should not be neglected. Nevertheless, to navigate the vast discusses the existing approaches in the broadness of this
body of literature, we focus on the identified core problems survey.
and thoroughly review the state of the art in this work. We will However, there are some works that cover parts of the
briefly discuss the remaining topics at the end of this survey. domain. There is a comprehensive analysis of common
Beyond reviewing current literature, we think that an approaches for categorical data encoding as a preprocessing
exhaustive comparison between existing deep learning step for deep neural networks by Hancock and Khoshgof-
approaches for heterogeneous tabular data is necessary to put taar [47]. The authors compared existing methods for cate-
reported results into context. The variety of benchmarking gorical data encoding on various tabular datasets and different
datasets and the different setups often prevent the comparison deep learning architectures. We discuss the key categorical
of results across papers. In addition, important aspects of data encoding methods in Section IV-A1.
deep learning models, such as training and inference time, A recent survey by Sahakyan et al. [43] summarizes expla-
model size, and interpretability, are usually not discussed. nation techniques in the context of tabular data. Hence, we do
We aim to bridge this gap by providing a comparison of not provide a detailed discussion of explainable machine
the surveyed inference approaches with classical—yet very learning for tabular data in this article. However, for the sake
strong—baselines such as XGBoost [46]. We open-source of completeness, we present some of the most relevant works
our code, allowing researchers to reproduce and extend our in Section VI and highlight open challenges in this area.
findings. Gorishniy et al. [48] empirically evaluated a large number of
In summary, the aims of this survey are to provide the state-of-the-art deep learning approaches for tabular data on a
following: wide range of datasets. He et al. [49] demonstrated that a tuned
deep neural network model with a ResNet-like architecture
1) a thorough review of existing scientific literature on deep shows comparable performance to some state-of-the-art deep
learning for tabular data; learning approaches for tabular data.
2) a taxonomic categorization of the available approaches Recently, Shwartz-Ziv and Armon [8] published a study
for classification and regression tasks on heterogeneous on several different deep models for tabular data, including
tabular data; TabNet [6], NODE [7], and Net-DNF [50]. In addition,
3) a presentation of the state of the art and promising paths they compared deep learning approaches to gradient boosting
toward tabular data generation; decision tree (GBDT) algorithms regarding accuracy, training
4) an overview of existing explanation approaches for deep effort, inference efficiency, and hyperparameter optimization
models for tabular data; time. They observed that deep models had the best results
5) an extensive empirical comparison of traditional on their chosen datasets, and however, not one single deep
machine learning methods and deep learning models on model could outperform all the others in general. The deep
multiple real-world heterogeneous tabular datasets; models were challenged by GBDTs, leading the authors to
6) a discussion on the main reasons for the limited success conclude that efficient tabular data modeling using deep neural
of deep learning on tabular data; networks is still an open research problem. In the face of
7) a list of open challenges related to deep learning for this evidence, we aim to integrate the necessary background
tabular data. for future research on the inference problem and on the
Accordingly, this survey is structured as follows. We dis- intertwined challenges of generation and explainability into
cuss related works in Section II. To introduce the reader to a single work.
the field, in Section III, we provide definitions of the key
III. TABULAR DATA AND D EEP N EURAL N ETWORKS
terms, a brief outline of the domain’s history, and propose
a unified taxonomy of current approaches to deep learning A. Definitions
with tabular data. Section IV covers the main methods for In this section, we give definitions for central terms used in
modeling tabular data using deep neural networks. Section V this work. We also provide pointers to the original works for
presents an overview on tabular data generation using deep more detailed explanations of the methods.
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
TABLE I
E XAMPLE OF A H ETEROGENEOUS TABULAR D ATASET. H ERE , W E S HOW
F IVE S AMPLES W ITH S ELECTED VARIABLES F ROM THE A DULT
D ATASET [54]. S ECTION VII-A P ROVIDES F URTHER
D ETAILS ON T HIS D ATASET
improve the performance of deep neural networks on tabular to information loss, leading to a reduction in predictive
data [10]. This has led to an intensification of research on performance [76].
regularization approaches. 4) Importance of Single Features: While typically changing
Due to the tremendous success of attention-based the class of an image requires a coordinated change in
approaches such as transformers on textual [64] and visual many features, i.e., pixels, the smallest possible change
data [65], [66], researchers have recently also started applying of a categorical (or binary) feature can entirely flip a
attention-based methods and self-supervised learning tech- prediction on tabular data [63]. In contrast to deep neural
niques to tabular data. After the introduction of transformer networks, decision-tree algorithms can handle varying
architectures to the field of tabular data [6], a lot of research feature importance exceptionally well by selecting a
effort has focused on transformer architectures that can be single feature and appropriate threshold (i.e., splitting)
successfully applied to very large tabular datasets. values and “ignoring” the rest of the data sample. Shavitt
and Segal [63] have argued that individual weight reg-
ularization may mitigate this challenge and motivated
C. Challenges of Learning With Tabular Data
more work in this direction [10].
As we have mentioned in Section II, deep neural networks With these four fundamental challenges in mind, we continue
often perform less favorably compared to more traditional by organizing and discussing the strategies developed to
machine learning methods (e.g., tree-based methods) when address them. We start by developing a suitable taxonomy.
dealing with tabular data. However, it is often unclear why
deep learning cannot achieve the same level of predictive
quality as in other domains such as image classification and D. Unified Taxonomy
natural language processing. In the following, we identify and In this section, we introduce a taxonomy of approaches that
discuss four possible reasons. allows for a unified view of the field. We divide the works
1) Low-Quality Training Data: Data quality is a common from the deep learning with tabular data literature into three
issue with real-world tabular datasets. They often include main categories: data transformation methods, specialized
missing values [34], extreme data (outliers) [67], and architectures, and regularization models. In Fig. 1, we provide
erroneous or inconsistent data [68] and have a small an overview of our taxonomy of deep learning methods for
overall size relative to the high-dimensional feature tabular data.
vectors generated from the data [69]. Also, due to the 1) Data Transformation Methods: The methods in the first
expensive nature of data collection, tabular data are group transform categorical and numerical data. This is usually
frequently class-imbalanced. These challenges affect all done to enable deep neural network models to better extract
machine learning algorithms; however, most of the mod- the information signal. Methods from this group do not require
ern decision tree-based algorithms can handle missing new architectures or adaptations of the existing data processing
values or different/extreme variable ranges internally pipeline. Nevertheless, the transformation step comes at the
by looking for appropriate approximations and split cost of an increased preprocessing time. This might be an
values [46], [70], [71]. issue for high-load systems [77], particularly in the presence
2) Missing or Complex Irregular Spatial Dependencies: of categorical variables with high cardinality and growing
There is often no spatial correlation between the vari- dataset size. We can further subdivide this area into single-
ables in tabular datasets [72] or the dependencies dimensional encodings and multidimensional encodings. The
between features are rather complex and irregular. When former encodings are employed to transform each feature
working with tabular data, the structure and relationships independently while the latter encoding methods map an entire
between its features have to be learned from scratch. record to another representation.
Thus, the inductive biases used in popular models for 2) Specialized Architectures: The biggest share of works
homogeneous data, such as convolutional neural net- investigates specialized architectures and suggests that a dif-
works, are unsuitable for modeling this data type [50], ferent deep neural network architecture is required for tabular
[73], [74]. data. Two types of architectures are of particular importance:
3) Dependency on Preprocessing: A key advantage of hybrid models fuse classical machine learning approaches
deep learning on homogeneous data is that it includes (e.g., decision trees) with neural networks, while transformer-
an implicit representation learning step [2], so only a based models rely on attention mechanisms.
minimal amount of preprocessing or explicit feature con- 3) Regularization Models: Finally, the group of regular-
struction is required. However, for tabular data and deep ization models claims that one of the main reasons for the
neural networks, the performance may strongly depend moderate performance of deep learning models on tabular data
on the selected preprocessing strategy [75]. Handling is their extreme nonlinearity and model complexity. Therefore,
the categorical features remains particularly challenging strong regularization schemes are proposed as a solution. They
[47] and can easily lead to a very sparse feature matrix are mainly implemented in the form of special-purpose loss
(e.g., by using a one-hot encoding scheme) or introduce functions.
a synthetic ordering of previously unordered values (e.g., We believe that our taxonomy may help practitioners find
by using an ordinal encoding scheme). Finally, pre- the methods of choice that can be easily integrated into their
processing methods for deep neural networks may lead existing tool chain. For instance, applying data transformations
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
can result in performance improvements while maintaining approach is also used in the CatBoost framework [71], a state-
the current model architecture. Conversely, using specialized of-the-art machine learning library for heterogeneous tabular
architectures, the data preprocessing pipeline can be kept data based on the gradient boosting algorithm [95].
intact. A different strategy is hash-based encoding. Every category
is transformed into a fixed-size value via a deterministic hash
IV. D EEP N EURAL N ETWORKS FOR TABULAR DATA function. The output size is not directly dependent on the
In this section, we discuss the use of deep neural networks number of input categories but can be chosen manually.
on tabular data for classification and regression tasks according 2) Multidimensional Encoding: A first automatic encoding
to the taxonomy presented in Section III. We provide an strategy is the value imputation and mask estimation (VIME)
overview of existing deep learning approaches in this area approach [79]. The authors propose a self-supervised and
of research in Table II and examine the three methodolog- semisupervised deep learning framework for tabular data that
ical categories in detail: data transformation methods (see trains an encoder in a self-supervised fashion by using two
Section IV-A), architecture-based methods (see Section IV-B), pretext tasks. Those tasks are independent of the concrete
and regularization-based models (see Section IV-C). downstream task that the predictor has to solve. The first
task of VIME is called mask vector estimation; its goal is
to determine which values in a sample are corrupted. The
A. Data Transformation Methods second task, i.e., feature vector estimation, is to recover the
Most traditional approaches for deep neural networks on original values of the sample. The encoder itself is a simple
tabular data fall into this group. Interestingly, data preprocess- multilayer perceptron. This automatic encoding makes use of
ing plays a relatively minor role in computer vision, even the fact that there is often much more unlabeled than labeled
though the field is currently dominated by deep learning solu- data. The encoder learns how to construct an informative
tions [2]. There are many different possibilities to transform homogeneous representation of the raw input data. In the
tabular data, and each may have a different impact on the semisupervised step, a predictive model, which is also a
learning results [47]. deep neural network model, is trained using the labeled and
1) Single-Dimensional Encoding: One of the critical obsta- unlabeled data transformed by the encoder. For the encoder,
cles for deep learning with tabular data is categorical variables. a novel data augmentation method is used, corrupting an unla-
Since neural networks only accept real number vectors as beled data point multiple times with different masks. On the
inputs, these values must be transformed before a model can predictions from all augmented samples from one original data
use them. Therefore, the first class of methods attempts to point, a consistency loss can be computed, which rewards
encode categorical variables in a way suitable for deep learning similar outputs. To summarize, the VIME network trains an
models. encoder, which is responsible to transform the categorical and
Approaches in this group [47] are divided into deterministic numerical features into a new homogeneous and informative
techniques, which can be used before training the model, and representation. This transformed feature vector is used as an
more complicated automatic techniques that are part of the input to the predictive model. For the encoder itself, the
model architecture. There are many ways for deterministic data categorical data can be transformed by a simple one-hot encod-
encoding; hence, we restrict ourselves to the most common ing and binary encoding. The experimental results highlight
ones without the claim of completeness. how the self-supervised and semisupervised variants of the
The simplest data encoding technique might be ordinal or VIME framework can boost the performance over that of other
label encoding. Every category is just mapped to a discrete baselines such as XGBoost. Even in the absence of unlabeled
numeric value, e.g., {Apple, Banana} are encoded as {0, 1}. data, learning encodings in the proposed manner is shown to
One drawback of this method may be that it introduces an be beneficial for downstream performance.
artificial order to previously unordered categories. Another Another stream of research aims at transforming the tabular
straightforward method that does not induce any order is the input into a more homogeneous format. Since the revival
one-hot encoding. One additional column for each unique of deep learning, convolutional neural networks have shown
category is added to the data. Only the column corresponding tremendous success in computer vision tasks. Therefore, Sun
to the observed category is assigned the value one, with the et al. [78] proposed the SuperTML method, which is a data
other values being zero. In our example, Apple could be conversion technique to transform tabular data into an image
encoded as (1,0) and Banana as (0,1). In the presence data format (2-D matrices), i.e., black-and-white images.
of a diverse set of categories in the data, this method can lead On three datasets, SuperTML shows performance comparable
to high-dimensional sparse feature vectors and exacerbate the with or superior to XGBoost.
“curse of dimensionality” problem. The image generator for tabular data (IGTD) in [72] follows
One approach that needs no extra columns and does not an idea similar to SuperTML. The IGTD framework converts
include any artificial order is the so-called leave-one-out tabular data into images to make use of classical convolutional
encoding. It is based on the target encoding technique pro- architectures. As convolutional neural networks rely on spatial
posed in the work in [94], where every category is replaced dependencies, the transformation into images is optimized
with the mean of the target variable of that category. The leave- by minimizing the difference between the feature distance
one-out encoding excludes the current row when computing ranking of the tabular data and the pixel distance ranking of
the mean of the target variable to avoid overfitting. This the generated image. Every feature corresponds to one pixel,
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
TABLE II
OVERVIEW OF D EEP L EARNING A PPROACHES FOR TABULAR D ATA . W E O RGANIZE T HEM IN C ATEGORIES O RDERED C HRONOLOGICALLY I NSIDE THE
G ROUPS . T HE “I NTERPRETABILITY ” C OLUMN I NDICATES W HETHER THE A PPROACH O FFERS S OME F ORM I NTERPRETABILITY FOR THE M ODEL’ S
D ECISIONS . T HE K EY C HARACTERISTICS OF E VERY M ODEL A RE S UMMARIZED IN THE L AST C OLUMN
which leads to compact images with similar features close at 1) Hybrid Models: Most approaches for deep neural net-
neighboring pixels. Thus, IGDTs can be used in the absence of works on tabular data are hybrid models. They transform
domain knowledge. The authors show relatively solid results the data and fuse successful classical machine learning
for data with strong feature relationships, but the method approaches, often decision trees, with neural networks. We dis-
may fail if the features are independent or feature similarities tinguish between fully differentiable models, which can be
cannot characterize the relationships. In their experiments, differentiated with respect to all their parameters and partly
the authors used only gene expression profiles and molecular differentiable models.
descriptors of drugs as data. This kind of data may lead a) Fully differentiable models: The fully differentiable
to a favorable inductive bias, so the general viability of the models in this category offer a valuable property: They permit
approach remains unclear. end-to-end deep learning for training and inference by means
of gradient descent optimizers. Thus, they allow for highly
B. Specialized Architectures efficient implementations in modern deep learning frameworks
Specialized architectures form the largest group of that exploit GPU or TPU acceleration throughout the code.
approaches for deep tabular data learning. In this group, Popov et al. [7] proposed an ensemble of differentiable
the focus is on the development and investigation of novel oblivious decision trees [96]—also known as the NODE
deep neural network architectures designed specifically for framework for deep learning on tabular data. Oblivious deci-
heterogeneous tabular data. Guided by the types of available sion trees use the same splitting function for all nodes on the
models, we divide this group into two subgroups: hybrid same level and can therefore be easily parallelized. NODE is
models (presented in IV-B1) and transformer-based models inspired by the successful CatBoost [71] framework. To make
(discussed in IV-B2). the whole architecture fully differentiable and benefit from
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
end-to-end optimization, NODE utilizes the entmax transfor- The work by Cheng et al. [81] proposes a hybrid archi-
mation [97] and soft splits. In the original experiments, the tecture that consists of linear and deep neural network
NODE framework outperforms XGBoost and other GBDT models—Wide&Deep. A linear model that takes single fea-
models on many datasets. As NODE is based on decision tree tures and a wide selection of handcrafted logical expressions
ensembles, there is no preprocessing or transformation of the on features as an input is enhanced by a deep neural net-
categorical data necessary. Decision trees are known to handle work to improve the generalization capabilities. In addition,
discrete features well. In the official implementation, strings Wide&Deep learns an n-dimensional embedding vector for
are converted to integers using the leave-one-out encoding each categorical feature. All embeddings are concatenated
scheme. The NODE framework is widely used and provides resulting in a dense vector used as input to the neural net-
a sound implementation that can be readily deployed. work. The final prediction can be understood as a sum of
Frosst and Hinton [82] contributed another model relying both models. Experiments with a real-world system for app
on soft decision trees (SDTs) to make neural networks more recommendation confirmed that users installed apps suggested
interpretable. They investigated training a deep neural network by Wide&Deep were significantly more often than those
first, before using a mixture of its outputs and the ground-truth provided by the previous model. A similar work by Guo
labels to train the SDT model in a second step. The authors and Berkhahn [99] proposes an embedding using deep neural
showed that training a neural model first increases accuracy networks for categorical variables.
over SDTs that are directly learned from the data. However, Another contribution to the realm of Wide&Deep models is
their distilled trees still exhibit a performance gap to the neural DeepFM [15]. The authors demonstrate that it is possible to
networks that were fit in the initial step. Nevertheless, the replace the handcrafted feature transformations with learned
model itself shows a clear relationship among different classes factorization machines (FMs) [100]. The FM is an extension
in a hierarchical fashion. It groups different categorical values of a linear model designed to capture lower order interac-
based on the common patterns, e.g., digits 8 and 9 from tions between features within high-dimensional and sparse
the MNIST dataset [98]. To summarize, the proposed method data efficiently. Higher order interactions are modeled by
allows for high interpretability and efficient inference, at the a deep neural network. Similar to the original Wide&Deep
cost of slightly reduced accuracy. model, DeepFM also relies on the same embedding vectors
Follow-up work [89] extends this line of research to het- for its “wide” and “deep” parts. In contrast to the original
erogeneous tabular data and regression tasks and presents the Wide&Deep model, however, DeepFM alleviates the need for
SDT regressor (SDTR) framework. The SDTR is a neural manual feature engineering. The experimental results show
network, which imitates a binary decision tree. Therefore, all a solid improvement in CTR prediction tasks compared to
neurons, such as nodes in a tree, get the same input from the a variety of models relying on either low- or high-order
data instead of the output from previous layers. In the case of dependencies only and compared to other hybrid approaches.
deep networks, the SDTR could not beat other state-of-the-art Finally, network-on-network (NON) [86] is a classifica-
models, but it has shown promising results in a low-memory tion model for tabular data, which focuses on capturing
setting, where single tree models and shallow architectures the intrafeature information efficiently. It consists of three
were compared. components: a fieldwise network consisting of one unique
Katzir et al. [50] followed the related idea. Their Net-DNF deep neural network for every column to capture the column-
builds on the observation that every decision tree is merely specific information, an across-field network, which chooses
a form of a Boolean formula, more precisely a disjunctive the optimal operations based on the dataset, and an operation
normal form. They use this inductive bias to design the fusion network, connecting the chosen operations allowing for
architecture of a neural network, which is able to imitate the nonlinearities. As the optimal operations for the specific data
characteristics of the GBDT algorithm. The resulting Net-DNF are selected, the performance is considerably better than that
was tested for classification tasks on datasets with no missing of other deep learning models. However, decision trees, the
values, where it showed the results that are comparable to current state-of-the-art models for tabular data, were not listed
those of XGBoost [46]. However, the authors did not men- among the baselines. Also, training as many neural networks
tion how to handle high-cardinality categorical data, as the as columns and selecting the operations on the fly may lead
used datasets contained mostly numerical and few binary to a long computation time.
features. b) Partly differentiable models: This subgroup of hybrid
Linear models (e.g., linear and logistic regression) provide models aims at combining nondifferentiable approaches with
global interpretability but are inferior to complex deep neural deep neural networks. Models from this group usually utilize
networks. Usually, handcrafted feature engineering is required decision trees for the nondifferentiable part.
to improve the accuracy of linear models. Liu et al. [87] The DeepGBM model [62] combines the flexibility of
used a deep neural network to combine the features in a deep neural networks with the preprocessing capabilities of
possibly nonlinear way; the resulting combination of fea- GBDTs. DeepGBM consists of two neural networks—CatNN
tures then serves as input to the linear model. In their and GBDT2NN. While CatNN is specialized to handle sparse
approach—termed DDN2LR—this enhances the simple, inter- categorical features, GBDT2NN is specialized to deal with
pretable linear model. In experimental evaluations, DNN2LR dense numerical features.
can outperform other more complex DNN models while main- In the preprocessing step for the CatNN network, the cate-
taining some extent of interpretability. gorical data are transformed via ordinal encoding (to convert
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
uses self-attention-based transformers to map the categorical confirmed that this new approach can reach state-of-the-
features to contextual embedding. This embedding is more art results on most datasets by using intersample attention
robust to missing or noisy data and enables interpretability. mechanisms.
The embedded categorical features are then together with the
C. Regularization Models
numerical ones fed into a simple multilayer perceptron. If,
in addition, there is an extra amount of unlabeled data, unsu- The third group of approaches argues that extreme flexi-
pervised pretraining can improve the results, using masked bility of deep learning models for tabular data is one of the
language modeling or replacing token detection. Extensive main learning obstacles and strong regularization of learned
experiments show that TabTransformer matches the perfor- parameters may improve the overall performance.
mance of tree-based ensemble techniques, showing success One of the first methods in this category was the regu-
also when dealing with missing or noisy data. The TabTrans- larization learning network (RLN) proposed by Shavitt and
former network puts a significant focus on the categorical Segal [63], which uses a learned regularization scheme. The
features. It transforms the embedding of those features into main idea is based on the observation that features in tab-
contextual embedding, which is then used as input for the ular datasets have very different importances. Contrarily to
multilayer perceptron. This embedding is implemented by other data modalities data such as images or text, a single
different multihead attention-based transformers, which are tabular feature may change the entire prediction. Therefore,
optimized during training. the authors apply trainable regularization coefficients to each
ARM-net [91] is an adaptive neural network for relation single weight in a neural network, hence allowing high
modeling tailored to tabular data. The key idea of the ARM-net sensitivity with respect to certain inputs or network parts
framework is to model feature interactions with combined while being insensitive to others. To efficiently determine
features (feature crosses) selectively and dynamically by first the corresponding coefficients, the authors propose a novel
transforming the input features into exponential space and loss function termed “counterfactual loss.” The regularization
then determining the interaction order and interaction weights coefficients lead to a very sparse network, which also provides
adaptively for each feature cross. Furthermore, the authors the importance of the remaining input features.
propose a novel sparse attention mechanism to generate the In their experiments, RLNs outperform deep neural net-
interaction weights given the input data dynamically. Thus, works and obtain the results comparable to those of the GBDT
users can explicitly model feature crosses of arbitrary orders algorithm, but the evaluation relies on a dataset with mainly
with noisy features filtered selectively. On five real-world numerical data to compare the models. The RLN paper does
datasets, ARM-net shows its superior effectiveness in rep- not address the issues of categorical data. For the experiments
resenting feature interactions compared to various baselines, and the example implementation, datasets with exclusively
which model the feature interactions in different ways. numerical data (except for the gender attribute) were used.
Self-attention and intersample attention transformer A similar idea is proposed in [112], where regularization
(SAINT) [9] is a hybrid attention approach, combining coefficients are learned only in the first layer with a goal to
self-attention [5] with intersample attention over multiple extract feature importance.
rows. When handling missing or noisy data, this mechanism Kadra et al. [10] stated that simple multilayer percep-
allows the model to borrow the corresponding information trons can outperform state-of-the-art algorithms on tabular
from similar samples, which improves the model’s robustness. data if deep learning networks are properly regularized. The
The technique is reminiscent of nearest neighbor imputation. authors propose a “cocktail” of regularization with 13 different
In addition, all features are embedded into a combined dense techniques that are applied jointly. From those, the optimal
latent vector, enhancing existing correlations between values subset and their subsidiary hyperparameters are selected. They
from one data point. To exploit the presence of unlabeled data, demonstrate in extensive experiments that the regulariza-
a self-supervised contrastive pre-training can further improve tion “cocktails” can not only improve the performance of
the results, minimizing the distance between two views of the multilayer perceptrons but these simple models also outper-
same sample and maximizing the distance between different form tree-based architectures. On the downside, the extensive
ones. Like the VIME framework (Section IV-A1), SAINT per-dataset regularization and hyperparameter optimization
uses CutMix [110] to augment samples in the input space and take much more computation time than the GBDT algorithm.
uses mixup [111] in the embedding space. The experimental There are several other noteworthy works [113], [114],
results show that SAINT outperforms tree-based models [115], indicating that strong regularization of deep neural
like XGBoost as well as other deep learning approaches for networks can be beneficial for tabular data.
tabular data on average. When unlabeled data are available,
the performance can be improved further using the proposed V. TABULAR DATA G ENERATION
pretraining. For many applications, the generation of realistic tabular
Finally, even some new learning paradigms are being pro- data is fundamental. Three of the main purposes are data
posed. For instance, the nonparametric transformer (NPT) [92] augmentation [117], data imputation (i.e., the filling of missing
does not construct a mapping from individual inputs to outputs values) [118], [119], and rebalancing [36], [37], [120], [121].
but uses the entire dataset at once. By using attention between Another highly relevant topic is privacy-aware machine learn-
data points, relations between arbitrary samples can be mod- ing [38], [39], [122] where generated data can potentially be
eled and leveraged for classifying test samples. Experiments leveraged to overcome privacy concerns.
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
A. Methods to which extent the inductive bias used for images are suitable
While the generation of images and text is highly for tabular data.
explored [123], [124], [125], generating synthetic tabular data The approach by Xu et al. [130] focuses on the correlation
is a less frequent concern. The mixed structure of discrete and between the features of one data point. The authors first pro-
continuous features along with their different value distribu- pose the mode-specific normalization technique for data pre-
tions still poses a significant challenge. processing that allows to transform non-Gaussian distributions
Classical approaches for the data generation task include in the continuous columns. They express numeric values in
Copulas [126], [127] and Bayesian networks [128]. Among terms of a mixture component number and the deviation from
Bayesian networks, those based on the Chow–Liu approxima- that component’s center. This allows to represent multimodal
tion [129] are especially popular [38], [130], [131], [132]. and skewed distributions. Their generative solution, coined
In the deep learning era, generative adversarial networks CTGAN, uses the conditional GAN architecture to enforce
(GANs) [133] have proven highly successful for the generation learning proper conditional distributions for each column.
of images [123], [134]. GANs were recently introduced as To obtain categorical values and to allow for backpropagation
an original way to train a generative deep neural network in the presence of categorical values, the gumbel-softmax
model. They consist of two separate models: a generator trick [143] is utilized. The authors also propose a model based
G that generates samples from the data distribution and a on VAEs, named tabular VAE (TVAE), which outperforms
discriminator D that estimates the probability that a sample their suggested GAN approach. Both approaches can be con-
came from the ground-truth distribution. Both G and D are sidered state of the art.
usually chosen to be nonlinear functions such as multilayer While GANs and VAEs are prevalent, other recently
perceptrons. To learn a generator distribution pg over data proposed architectures include machine-learned causal mod-
x, the generator G(z; θg ) maps the samples from a noise els [144] and invertible flows [38]. When privacy is the main
distribution pz (z) (e.g., the Gaussian distribution) to the input factor of concern, models, such as PATE-GAN [145], provide
data space. The discriminator D(x; θd ) outputs the probability generative models with certain differential privacy guarantees.
that a data point x comes from the training data’s distribution Although very relevant for practical applications, such privacy
pdata rather than from the generator’s output distribution pg . guarantees and related federated learning approaches with
During joint training of G and D, G will start generating tabular data [146] are outside the scope of this review.
successively more realistic samples to fool the discriminator Fan et al. [122] compared a variety of different GAN archi-
D. For more details on GANs, we refer the interested reader tectures for tabular data synthesis and recommended using
to the original paper [133]. a simple, fully connected architecture with a vanilla GAN
In Table III, we provide an overview of tabular generation loss with minor changes to prevent mode collapse. They also
approaches that use deep learning techniques. Note that due use the normalization proposed in [130]. In their experiments,
to the enormous number of approaches, we list the most the WGAN loss or the use of convolutional architectures on
influential works that address the problem of data generation tabular data does boost the generative performance.
with a particular focus on tabular data. We exclude works that
are targeted toward highly domain-specific tasks.
Although it was found that GANs lag behind at the genera- B. Assessing Generative Quality
tion of discrete outputs such as natural language [125], they are To assess the quality of the generated data, several per-
still frequently chosen to generate tabular data. Vanilla GANs formance measures are used. The most common approach
or derivates, such as the Wasserstein GAN (WGAN) [135], is to define a proxy classification task and train one model
WGAN with gradient penalty (WGAN-GP) [136], Cramér for it on the real training set and another on the artificially
GAN [137], or the Boundary seeking GAN [138], which generated dataset. With a highly capable generator, the predic-
is designed to model discrete data, are commonly used tive performance of the artificial-data model on the real-data
in the literature to generate tabular data (cf. Table III). test set should be almost on par with its real-data counter-
Moreover, VeeGAN [139] is frequently used as a reference part. This measure is often referred to as machine learning
for tabular data generation [38], [130], [131]. Apart from efficacy and used in [39], [131], and [147]. In nonobvious
GANs, autoencoder-based architectures—in particular those classification tasks, an arbitrary feature can be used as a
relying on variational autoencoders (VAEs) [140]—have been label and predicted [39], [148], [149]. Another approach is
proposed [130], [141]. to visually inspect the modeled distributions per feature, e.g.,
In the following, we will briefly discuss the most rele- the cumulative distribution functions [117], or compare the
vant approaches that helped shape the domain. For example, expected values in scatter plots [39], [148]. A more quan-
MedGAN [39] was one of the first works and provides a deep titative approach is the use of statistical tests, such as the
learning model to generate patient records. As all the features Kolmogorov–Smirnov test [152], to assess the distributional
in their work are discrete, this model cannot be easily trans- difference [149]. On synthetic datasets, the output distribution
ferred to arbitrary tabular datasets. The table-GAN approach can be compared to the ground truth, e.g., in terms of log
in [142] adapts the deep convolutional GAN for tabular data. likelihood [130], [144]. Because overfitted models can also
Specifically, the features from one record are converted into a obtain good scores, Xu et al. [130] proposed evaluating the
matrix so that they can be processed by convolutional filters of likelihood of a test set under an estimate of the GAN’s
a convolutional neural network. However, it remains unclear output distribution. Especially in a privacy-preserving context,
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
TABLE III methods aim to highlight the influence of the inputs that have
G ENERATION OF TABULAR D ATA U SING D EEP N EURAL on the prediction by assigning importance scores to the input
N ETWORK M ODELS ( IN C HRONOLOGICAL O RDER )
features. Some popular approaches for model explanations aim
at constructing classification models that are explainable by
design [158], [159], [160]. This is often achieved by enforcing
the deep neural network model to be locally linear. Moreover,
if the model’s parameters are known and can be accessed,
then the explanation technique can use these parameters to
generate the model explanation. For such settings, relevance-
propagation-based methods, e.g., [161], [162], and gradient-
based approaches, e.g., [163], [164], [165], have been sug-
gested. In cases where the parameters of the neural network
cannot be accessed, model-agnostic approaches can prove
useful. This group of approaches seeks to explain a model’s
behavior locally by applying surrogate models [116], [166],
[167], [168], [169], which are interpretable by design and are
used to explain individual predictions of black-box machine
learning models. In order to test the performance of these
black-box explanations techniques, Liu et al. [170] suggested
a python-based benchmarking library.
B. Counterfactual Explanations
From the perspective of algorithmic recourse, the main pur-
pose of counterfactual explanations is to suggest constructive
interventions to the input of a deep neural network so that
the output changes to the advantage of an end user. In simple
terms, a minimal change to the feature vector that will flip
the classification outcome is computed and provided as an
explanation. By emphasizing both the feature importance and
the distribution of the distance to closest record (DCR) can
the recommendation aspect, counterfactual explanation meth-
be calculated and compared to the respective distances on
ods can be further divided into three different groups: works
the test set [142]. This measure is important to assess the
that assume that all features can be independently manipulated
extent of sample memorization. Overall, we conclude that
[171] and works that focus on manifold constraints to capture
a single measure is not sufficient to assess the generative
feature dependencies.
quality. For instance, a generative model that memorizes the
In the class of independence-based methods, where the input
original samples will score well in the machine learning
features of the predictive model are assumed to be indepen-
efficiency metric but fail the DCR check. Therefore, we highly
dent, some approaches use combinatorial solvers to generate
recommend using several evaluation measures that focus on
recourse in the presence of feasibility constraints [172], [173],
individual aspects of data quality.
[174], [175]. Another line of research deploys gradient-based
optimization to find low-cost counterfactual explanations in the
VI. E XPLANATION M ECHANISMS FOR D EEP
presence of feasibility and diversity constraints [176], [177].
L EARNING W ITH TABULAR DATA
The main problem with these approaches is that they abstract
Explainable machine learning is concerned with the prob- from input correlations.
lem of providing explanations for complex machine learn- To alleviate this problem and to suggest realistic-looking
ing models. With stricter regulations for automated decision- counterfactuals, researchers have suggested building recourse
making [41] and the adoption of machine learning models suggestions on generative models [178], [179], [180], [181],
in high-stakes domains such as finance and healthcare [45], [182]. The main idea is to change the geometry of the
[153], [154], interpretability is becoming a key concern. intervention space to a lower dimensional latent space, which
Toward this goal, various streams of research follow different encodes different factors of variation while capturing input
explainability paradigms. Among these, feature attribution dependencies. To this end, these methods primarily use (tabu-
methods and counterfactual explanations are two of the popu- lar data) VAEs [140], [183]. In particular, Mahajan et al. [181]
lar forms [155], [156], [157]. Because these techniques are demonstrated how to encode various feasibility constraints
gaining importance for researchers and practitioners alike, into such models. However, an extensive comparison across
we dedicate the following to reviewing these methods. this class of methods is still missing since it is difficult to
measure how realistic the generated data are in the context of
A. Feature Highlighting Explanations algorithmic recourse.
Local input attribution techniques seek to explain the behav- More recently, a few works have suggested to develop
ior of machine learning models instance by instance. Those counterfactual explanations that are robust to model shifts
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
VII. E XPERIMENTS
Although several experimental studies have been pub-
home equity. The task consists of using the information about
lished in recent years [8], [10], an exhaustive comparison
the applicant in their credit report to predict whether they will
between existing deep learning approaches for heterogeneous
repay their HELOC account within a two-year period.
tabular data is still missing in the literature. For example,
We further use the Adult Income dataset [54], which is
important aspects of deep learning models, such as training
among the most popular tabular datasets used in the surveyed
and inference time, model size, and interpretability, are not
work (five usages). It includes basic information about indi-
discussed.
viduals such as age, gender, and education. The target variable
To fill this gap, we present an extensive empirical com-
is binary; it represents high and low income.
parison of machine and deep learning methods on real-world
The largest tabular dataset in our study is HIGGS, which
datasets with varying characteristics in this section. We discuss
stems from particle physics. The task is to distinguish between
the dataset choice (VII-A), the results (VII-B), and present
signals with Higgs bosons (HIGGS) and a background
a comparison of the training and inference time for all the
process [192]. Monte Carlo simulations [193] were used to
machine learning models considered in this survey (VII-C).
produce the data. In the first 21 columns (columns 2-22), the
We also discuss the size of deep learning models. Finally,
particle detectors in the accelerator measure kinematic proper-
to the best of our knowledge, we present the first comparison
ties. In the last seven columns, these properties are analyzed.
of explainable deep learning methods for tabular data (VII-
In total, HIGGS includes 11 million rows. We also binarize the
D). We release the full source code of our experiments for
21st variable into a categorical variable with three groups since
maximum transparency.1
DeepFM, DeepGBM, TabTransformer, and SAINT models
require at least one categorical attribute, to benchmark the
A. Datasets method’s special functionality on large datasets.
In computer vision, there are many established datasets The Covertype dataset [54] is multiclassification dataset,
for the evaluation of new deep learning architectures such as which holds cartographic information about land cells (e.g.,
MNIST [98], CIFAR [189], and ImageNet [190]. On the con- elevation and slope). The goal is to predict which one out of
trary, there are no established standard heterogeneous datasets. seven forest cover types is present in the cell.
Carefully checking the works listed in Section IV, we iden- Finally, we utilize the California Housing dataset [194],
tified over 100 different datasets with different characteristics which contains information about a number of properties. The
in their respective experimental evaluation sections. We note prediction task (regression) is to estimate the price of the
that the small overlap between the mentioned works makes corresponding home.
it hard to compare the results across these works in general. The fundamental characteristics of the selected datasets are
Therefore, in this work, we deliberately select datasets cov- summarized in Table IV.
ering the entire range of characteristics, such as data domain
(e.g., finance, e-commerce, geography, and physics), different B. Open Performance Benchmark on Tabular Data
types of target variables (classification and regression), varying
1) Hyperparameter Selection: In order to do a fair eval-
number of categorical variables and continuous variables, and
uation, we use the Optuna library [199] with 100 iterations
differing sample sizes (small to large). Furthermore, most
for each model to tune hyperparameters. Each hyperparameter
of the selected datasets were previously featured in multiple
configuration was cross-validated with five folds. The hyper-
studies.
parameter ranges used are publicly available online along with
The first dataset of our study is the Home Equity Line of
our code. We laid out the search space based on the informa-
Credit (HELOC) dataset provided by FICO [191]. This dataset
tion given in the corresponding papers and recommendations
consists of anonymized information from real homeowners
from the framework’s authors.
who applied for home equity lines of credit. An HELOC is a
2) Data Preprocessing: We prepossessed the data in the
line of credit typically offered by a bank as a percentage of
same way for every machine learning model by applying zero-
1 Open benchmarking on tabular data for machine learning models: mean, unit-variance normalization to the numerical features
https://github.com/kathrinse/TabSurvey. and an ordinal encoding to the categorical ones using the
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
TABLE V
O PEN P ERFORMANCE B ENCHMARK R ESULTS BASED ON (S TRATIFIED ) F IVEFOLD C ROSS VALIDATION . W E U SE THE S AME F OLD S PLITTING S TRATEGY
FOR E VERY D ATASET. T HE T OP R ESULTS FOR E ACH D ATASET A RE IN B OLD , W E A LSO U NDERLINE THE S ECOND -B EST R ESULTS . T HE
M EAN AND S TANDARD D EVIATION VALUES A RE R EPORTED FOR E ACH BASELINE M ODEL . M ISSING R ESULTS I NDICATE T HAT THE
C ORRESPONDING M ODEL C OULD N OT B E A PPLIED TO THE TASK T YPE (R EGRESSION OR M ULTICLASS C LASSIFICATION )
alphabetical order. According to Hancock and Khoshgof- learning approaches. This suggests that for very large tabu-
taar [47], the chosen encoding strategy shows comparable lar datasets with predominantly continuous features, modern
performance to more advanced methods. The missing values neural network architectures may have an advantage over
were substituted with zeros for the linear regression and classical approaches after all. In general, however, our results
models based on pure neural networks since these methods are consistent with the inferior performance of deep learning
cannot accept them otherwise. We explicitly specify categor- techniques in comparison to approaches based on decision tree
ical features for LightGBM, DeepFM, DeepGBM, TabNet, ensembles (such as GBDT) on tabular data that were observed
TabTransformer, and SAINT since these approaches provide in various Kaggle competitions [201].
special functionality dedicated to categorical values, e.g., Considering only deep learning approaches, we observe that
learning an embedding of the categories. As we noted in SAINT provided competitive results across datasets. However,
Section III-C, the results of experiments may be affected by for the other models, the performance was highly dependent on
the data preprocessing. the chosen dataset. DeepFM performed best (among the deep
3) Reproducibility and Extensibility: For maximum repro- learning models) on the Adult dataset and second-best on the
ducibility, we run all experiments in a docker container [200]. California Housing dataset, but returned only weak results on
We underline again that our full code is publicly released so the HELOC dataset.
that the experiments can be replicated. The mentioned datasets
are also publicly available and can be used as a benchmark
C. Run Time Comparison
for novel methods. We would highly welcome contributed
implementations of additional methods from the data science We also analyze the training and inference time of
community. the models in comparison to their performance. We plot
4) Results: The results of our experiments are shown in the time–performance characteristic for the models in
Table V. They draw a different picture than many recent Figs. 3 and 4 for the Adult and the HIGGS dataset, respec-
research papers may suggest: for all but the very large HIGGS tively. While the training time of gradient boosting-based
dataset, the best scores are still obtained by boosted decision models is lower than that of most deep neural network-based
tree ensembles. XGBoost and CatBoost outperform all deep methods, their inference time on the HIGGS dataset with
learning-based approaches on the small and medium datasets, 11 million samples is significantly higher: for XGBoost, the
the regression dataset, and the multiclass dataset. For the inference time amounts to 5995 s, whereas inference times
large-scale HIGGS, SAINT outperforms the classical machine for MLP and SAINT are 10.18 and 282 s, respectively. All
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
Fig. 3. Train (left) and inference (right) time benchmarks for selected methods on the Adult dataset with 32.561 samples. The circle size reflects the accuracy
standard deviation.
Fig. 4. Train (left) and inference (right) time benchmarks for selected methods on the HIGGS dataset with 11 million samples. The circle size reflects the
accuracy standard deviation.
TABLE VI
S PEARMAN R ANK C ORRELATION OF THE P ROVIDED ATTRIBUTION W ITH
K ERNEL SHAP VALUES AS G ROUND T RUTH . R ESULTS W ERE
C OMPUTED ON 750 R ANDOM S AMPLES
F ROM THE A DULT D ATASET
Therefore, some deep learning solutions transform them into a homogeneous representation more suitable to neural networks. While the additional overhead is small, such transforms can boost performance considerably and should thus be among the first strategies applied in real-world scenarios.

4) Architectures for Deep Learning on Tabular Data: In terms of architecture, there has been a clear trend toward transformer-based solutions (Section IV-B2) in recent years. These approaches offer multiple advantages over standard neural network architectures, for instance, learning with attention over both categorical and numerical features. Moreover, self-supervised or unsupervised pretraining that leverages unlabeled tabular data to train parts of the deep learning model is gaining popularity, not only among transformer-based approaches. In terms of performance, multiple independent evaluations demonstrate that deep neural network methods from the hybrid (Section IV-B1) and transformer-based (Section IV-B2) groups exhibit superior predictive performance compared to plain deep neural networks on various datasets [9], [48], [62], [84]. This underlines the importance of special-purpose architectures for tabular data.

5) Deep Generative Models for Tabular Data: Powerful tabular data generation is essential for the development of high-quality models, particularly in a privacy context. With suitable data generators at hand, developers can use large, synthetic, and yet realistic datasets to develop better models, while not being subject to privacy concerns [145]. Unfortunately, the generation task is as hard as inference in predictive models, so progress in both areas will likely go hand in hand.

6) Interpretable Deep Learning Models for Tabular Data: Interpretability is undoubtedly desirable, particularly for tabular data models frequently applied to personal data, e.g., in healthcare and finance. An increasing number of approaches offer it out-of-the-box, but most current deep neural network models are still mainly concerned with the optimization of a chosen error metric. Therefore, extending existing open-source libraries (see [157], [170]) aimed at interpreting black-box models helps advance the field. Moreover, interpretable deep tabular learning is essential for understanding model decisions and results, especially for life-critical applications. However, much of the state-of-the-art recourse literature does not offer easy support of heterogeneous tabular data and lacks metrics to evaluate the quality of heterogeneous data recourse. Finally, model explanations can be used to identify and mitigate potential unwanted biases and eliminate unfair discrimination [204], [205].

7) Learning From Evolving Data Streams: Many modern applications are subject to continuously evolving data streams, e.g., social media, online retail, or healthcare. Streaming data are usually heterogeneous and potentially unlimited. Therefore, observations must be processed in a single pass and cannot be stored. Indeed, online learning models can only access a fraction of the data at each time step. Furthermore, they have to deal with limited resources and shifting data distributions (i.e., concept drift) [206]. Hence, hyperparameter optimization and model selection, as typically involved in deep learning, are usually not feasible in a data stream. For this reason, despite the success of deep learning in other domains, less complex methods, such as incremental decision trees [207], [208], are often preferred in online learning applications.

B. Open Research Questions

Several open problems need to be addressed in future research. In this section, we list those we deem fundamental to the domain.

1) Information-Theoretic Analysis of Encodings: Encoding methods are highly popular when dealing with tabular data. However, the majority of data preprocessing approaches for deep neural networks are lossy in terms of information content. Therefore, it is challenging to achieve an efficient, almost lossless transformation of heterogeneous tabular data into homogeneous data. Nevertheless, the information-theoretic view on these transformations remains to be investigated in detail and could shed light on the underlying mechanisms.

2) Computational Efficiency in Hybrid Models: The work by Shwartz-Ziv and Armon [8] suggests that the combination of a GBDT and deep neural networks may improve the predictive performance of a machine learning system. However, it also leads to growing complexity. Training or inference times, which far exceed those of classical machine learning approaches, are a recurring problem when developing hybrid models. We conclude that the integration of state-of-the-art approaches from classical machine learning and deep learning has not been conclusively resolved yet, and future work should investigate how to mitigate the tradeoff between predictive performance and computational complexity.

3) Individual Regularizations: We applaud recent research on individual regularization methods, in which we see a promising direction to tackle the problem of highly sensitive features. We believe that accounting for the dominant influence of certain features is crucial to success. Whether context- and architecture-specific regularizations for tabular data can be found remains an open question. In addition, it is relevant to explore the theoretical constraints that govern the success of regularization on tabular data more profoundly.

4) Novel Processes for Tabular Data Generation: For tabular data generation, modified GANs and VAEs are prevalent. However, the modeling of dependencies and categorical distributions remains the key challenge. Novel architectures in this area, such as diffusion models, have not been adapted to the domain of tabular data. Furthermore, the definition of an entirely new generative process particularly focused on tabular data might be worth investigating.

5) Interpretability: Going forward, counterfactual explanations for deep tabular learning can be used to improve the perceived fairness in human–artificial intelligence (AI) interaction scenarios and to enable personalized decision-making [188]. However, the heterogeneity of tabular data poses problems for counterfactual explanation methods to be reliably deployed in practice. The problem of efficiently handling heterogeneous tabular data in the presence of feasibility constraints remains unsolved [157].

6) Transfer of Deep Learning Methods to Data Streams: Recent work shows that some of the limitations of neural networks in an evolving data stream can be overcome [25], [209]. Conversely, changes in the parameters of a neural
network may be effectively used to weigh the importance of input features over time [210] or to detect concept drift [211]. Accordingly, we argue that deep learning for streaming data, in particular strategies for dealing with evolving and heterogeneous tabular data, should receive more attention in the future.

7) Transfer Learning for Tabular Data: Reusing knowledge gained from solving one problem and applying it to a different task is the research problem addressed by transfer learning. While transfer learning is successfully used in computer vision and natural language processing applications [212], there are no efficient and generally accepted ways to do transfer learning for tabular data. Hence, a general research question is how to share knowledge between multiple (related) tabular datasets efficiently.

8) Data Augmentation for Tabular Data: Data augmentation has proven highly effective at preventing overfitting, especially in computer vision [213]. While some data augmentation techniques for tabular data exist, e.g., SMOTE-NC [214], simple models fail to capture the dependency structure of the data. Therefore, generating additional samples in a continuous latent space is a promising direction. This was investigated by Darabi and Elor [37] for minority oversampling. Nevertheless, the reported improvements are only marginal. Thus, future work is required to find simple, yet effective random transformations to enhance tabular training sets.

9) Self-Supervised Learning: Large-scale labeled data are usually required to train deep neural networks; however, data labeling is an expensive task. To avoid this expensive step, self-supervised methods propose to learn general feature representations from available unlabeled data. These methods have shown impressive results in computer vision and natural language processing [215], [216]. Only a few recent works in this direction [79], [80], [217] deal with heterogeneous data. Hence, novel self-supervised learning approaches dedicated to tabular data might be worth investigating.

IX. CONCLUSION

This survey is the first work to systematically explore deep neural network approaches for heterogeneous tabular data. In this context, we highlighted the main challenges and research advances in modeling, generating, and explaining tabular data. We introduced a unified taxonomy that categorizes deep learning approaches for tabular data into three branches: data transformation methods, specialized architectures, and regularization models. We believe that our taxonomy will help catalog future research and better understand and address the remaining challenges in applying deep learning to tabular data. We hope that it will help researchers and practitioners to find the most appropriate strategies and methods for their applications.

In addition, we conducted an unbiased evaluation of state-of-the-art deep learning approaches on multiple real-world datasets. Deep neural network-based methods for heterogeneous tabular data are still inferior to machine learning methods based on decision tree ensembles for small- and medium-sized datasets (less than ∼1M samples). Only on the very large dataset consisting mainly of continuous numerical variables did the deep learning model SAINT outperform these classical approaches. Furthermore, we assessed the explanation properties of deep learning models with the self-attention mechanism. Although the TabNet model shows promising explanatory capabilities, inconsistencies between the explanations remain an open issue.

Due to the importance of tabular data to industry and academia, new ideas in this area are in high demand and can have a significant impact. With this review, we hope to provide interested readers with the references and insights they need to address open challenges and effectively advance the field.

REFERENCES

[1] J. Schmidhuber, "Deep learning in neural networks: An overview," Neural Netw., vol. 61, pp. 85–117, May 2015.
[2] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA, USA: MIT Press, 2016.
[3] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[4] K. Greff, R. K. Srivastava, J. Koutnìk, B. R. Steunebrink, and J. Schmidhuber, "LSTM: A search space Odyssey," IEEE Trans. Neural Netw. Learn. Syst., vol. 28, no. 10, pp. 2222–2232, Oct. 2017.
[5] A. Vaswani et al., "Attention is all you need," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 5998–6008.
[6] S. O. Arik and T. Pfister, "TabNet: Attentive interpretable tabular learning," 2019, arXiv:1908.07442.
[7] S. Popov, S. Morozov, and A. Babenko, "Neural oblivious decision ensembles for deep learning on tabular data," 2019, arXiv:1909.06312.
[8] R. Shwartz-Ziv and A. Armon, "Tabular data: Deep learning is not all you need," 2021, arXiv:2106.03253.
[9] G. Somepalli, M. Goldblum, A. Schwarzschild, C. B. Bruss, and T. Goldstein, "SAINT: Improved neural networks for tabular data via row attention and contrastive pre-training," 2021, arXiv:2106.01342.
[10] A. Kadra, M. Lindauer, F. Hutter, and J. Grabocka, "Well-tuned simple nets excel on tabular datasets," in Proc. Adv. Neural Inf. Process. Syst., 2021, pp. 1–14.
[11] D. Ulmer, L. Meijerink, and G. Cinà, "Trust issues: Uncertainty estimation does not enable reliable OOD detection on medical tabular data," in Proc. Mach. Learn. Health NeurIPS Workshop, 2020, pp. 341–354.
[12] S. Somani et al., "Deep learning and the electrocardiogram: Review of the current state-of-the-art," EP Europace, vol. 23, no. 8, pp. 1179–1191, Aug. 2021.
[13] V. Borisov, E. Kasneci, and G. Kasneci, "Robust cognitive load detection from wrist-band sensors," Comput. Hum. Behav. Rep., vol. 4, Aug. 2021, Art. no. 100116.
[14] J. M. Clements, D. Xu, N. Yousefi, and D. Efimov, "Sequential deep learning for credit risk monitoring with tabular financial data," 2020, arXiv:2012.15330.
[15] H. Guo, R. Tang, Y. Ye, Z. Li, and X. He, "DeepFM: A factorization-machine based neural network for CTR prediction," 2017, arXiv:1703.04247.
[16] Z. Shuai, L. Yao, A. Sun, and T. Yi, "Deep learning based recommender system: A survey and new perspectives," ACM Comput. Surv., vol. 52, no. 1, pp. 1–38, 2017.
[17] Q. Zhang, L. Cao, C. Shi, and Z. Niu, "Neural time-aware sequential recommendation by jointly modeling preference dynamics and explicit feature couplings," IEEE Trans. Neural Netw. Learn. Syst., vol. 33, no. 10, pp. 5125–5137, Oct. 2022.
[18] M. Ahmed, H. Afzal, A. Majeed, and B. Khan, "A survey of evolution in predictive models and impacting factors in customer churn," Adv. Data Sci. Adapt. Anal., vol. 9, no. 3, Jul. 2017, Art. no. 1750007.
[19] A. L. Buczak and E. Guven, "A survey of data mining and machine learning methods for cyber security intrusion detection," IEEE Commun. Surveys Tuts., vol. 18, no. 2, pp. 1153–1176, 2nd Quart., 2016.
[20] F. Cartella, O. Anunciação, Y. Funabiki, D. Yamaguchi, T. Akishita, and O. Elshocht, "Adversarial attacks for tabular data: Application to fraud detection and imbalanced data," in Proc. CEUR Workshop, vol. 2808, 2021, pp. 1–9.
[21] C. J. Urban and K. M. Gates, "Deep learning: A primer for psychologists," Psychol. Methods, vol. 26, no. 6, pp. 743–773, 2021.
[22] G. Pang, C. Aggarwal, C. Shen, and N. Sebe, "Editorial deep learning for anomaly detection," IEEE Trans. Neural Netw. Learn. Syst., vol. 33, no. 6, pp. 2282–2286, Jun. 2022.
[23] S. Wang et al., “Multiview deep anomaly detection: A systematic [50] L. Katzir, G. Elidan, and R. El-Yaniv, “Net-DNF: Effective deep
exploration,” IEEE Trans. Neural Netw. Learn. Syst., early access, modeling of tabular data,” in Proc. Int. Conf. Learn. Represent., 2021,
Jun. 26, 2022, doi: 10.1109/TNNLS.2022.3184723. pp. 1–16.
[24] V. Škvára, J. Francå, M. Zorek, T. Pevnỳ, and V. Šmídl, “Comparison of [51] R. U. David and M. Lane, Introduction to Statistics. 2003. [Online].
anomaly detectors: Context matters,” IEEE Trans. Neural Netw. Learn. Available: http://onlinestatbook.com/
Syst., vol. 33, no. 6, pp. 2494–2507, Jun. 2022. [52] M. Ryan, Deep Learning With Structured Data. New York, NY, USA:
[25] D. Sahoo, Q. Pham, J. Lu, and S. C. H. Hoi, “Online deep learning: Simon & Schuster, 2020.
Learning deep neural networks on the fly,” 2017, arXiv:1711.03705. [53] M. W. Cvitkovic et al., “Deep learning in unconventional domains,”
[26] X. He, K. Zhao, and X. Chu, “AutoML: A survey of the state-of-the- Ph.D. dissertation, California Inst. Technol., Pasadena, CA, USA,
art,” Knowl.-Based Syst., vol. 212, Jan. 2021, Art. no. 106622. 2020.
[27] P. Yin, G. Neubig, W.-T. Yih, and S. Riedel, “TaBERT: Pretrain- [54] D. Dua and C. Graff. (2017). UCI Machine Learning Repository.
ing for joint understanding of textual and tabular data,” 2020, [Online]. Available: http://archive.ics.uci.edu/ml
arXiv:2005.08314. [55] A. J. Miles, “The sunstroke epidemic of Cincinnati, Ohio, during
[28] Z. Wang, Q. She, and T. E. Ward, “Generative adversarial networks in the summer of 1881,” Public Health Papers Rep., vol. 7, no. 1,
computer vision: A survey and taxonomy,” 2019, arXiv:1906.01529. pp. 293–304, 1881.
[29] D. Lichtenwalter, P. Burggräf, J. Wagner, and T. Weißer, “Deep [56] R. A. Fisher, “The use of multiple measurements in taxonomic prob-
multimodal learning for manufacturing problem solving,” Proc. CIRP, lems,” Ann. Eugenics, vol. 7, no. 2, pp. 179–188, Aug. 1936.
vol. 99, pp. 615–620, 2021. [57] D. A. Jdanov, D. Jasilionis, V. M. Shkolnikov, and M. Barbieri, “Human
[30] T. Baltrušaitis, C. Ahuja, and L.-P. Morency, “Multimodal machine mortality database,” in Encyclopedia Gerontology Population Aging,
learning: A survey and taxonomy,” IEEE Trans. Pattern Anal. Mach. D. Gu and M. E. Dupre, Eds. Cham, Switzerland: Springer, 2020.
Intell., vol. 41, no. 2, pp. 423–443, Feb. 2019. [58] E. Fix, Discriminatory Analysis: Nonparametric Discrimination, Con-
[31] D. Medvedev and A. D’yakonov, “New properties of the data distilla- sistency Properties. Wright-Patterson AFB, OH, USA: USAF school
tion method when working with tabular data,” 2020, arXiv:2010.09839. of Aviation Medicine, 1951.
[32] J. Li, Y. Li, X. Xiang, S.-T. Xia, S. Dong, and Y. Cai, “TNT: An [59] C. L. Giles, C. B. Miller, D. Chen, H. H. Chen, G. Z. Sun, and
interpretable tree-network-tree learning framework using knowledge Y. C. Lee, “Learning and extracting finite state automata with second-
distillation,” Entropy, vol. 22, no. 11, p. 1203, Oct. 2020. order recurrent neural networks,” Neural Comput., vol. 4, no. 3,
[33] D. Roschewitz, M.-A. Hartley, L. Corinzia, and M. Jaggi, “IFedAvg: pp. 393–405, May 1992.
Interpretable data-interoperability for federated learning,” 2021, [60] L. Willenborg and T. De Waal, Statistical Disclosure Control in
arXiv:2107.06580. Practice, vol. 111. New York, NY, USA: Springer, 1996.
[34] A. Sánchez-Morales, J.-L. Sancho-Gómez, J.-A. Martínez-García, and [61] M. Richardson, E. Dominowska, and R. Ragno, “Predicting clicks:
A. R. Figueiras-Vidal, “Improving deep learning performance with Estimating the click-through rate for new ads,” in Proc. 16th Int. Conf.
missing values via deletion and compensation,” Neural Comput. Appl., World Wide Web (WWW), 2007, pp. 521–530.
vol. 32, no. 17, pp. 13233–13244, Sep. 2020. [62] G. Ke, Z. Xu, J. Zhang, J. Bian, and T.-Y. Liu, “DeepGBM: A deep
learning framework distilled by GBDT for online prediction tasks,” in
[35] M. Abroshan, K. H. Yip, C. Tekin, and M. Van Der Schaar, “Con-
Proc. 25th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining,
servative policy construction using variational autoencoders for logged
Jul. 2019, pp. 384–394.
data with missing values,” IEEE Trans. Neural Netw. Learn. Syst., early
access, Jan. 10, 2022, doi: 10.1109/TNNLS.2021.3136385. [63] I. Shavitt and E. Segal, “Regularization learning networks: Deep
learning for tabular datasets,” in Proc. Adv. Neural Inf. Process. Syst.,
[36] J. Engelmann and S. Lessmann, “Conditional Wasserstein GAN-based
2018, pp. 1379–1389.
oversampling of tabular data for imbalanced learning,” Expert Syst.
[64] T. B. Brown et al., “Language models are few-shot learners,” 2020,
Appl., vol. 174, Jul. 2021, Art. no. 114582.
arXiv:2005.14165.
[37] S. Darabi and Y. Elor, “Synthesising multi-modal minority samples for
[65] A. Dosovitskiy et al., “An image is worth 16 × 16 words: Transformers
tabular data,” 2021, arXiv:2105.08204.
for image recognition at scale,” in Proc. Int. Conf. Learn. Represent.,
[38] S. Kamthe, S. Assefa, and M. Deisenroth, “Copula flows for synthetic 2021, pp. 1–11.
data generation,” 2021, arXiv:2101.00598. [66] S. Khan, M. Naseer, M. Hayat, S. Waqas Zamir, F. Shahbaz Khan, and
[39] E. Choi, S. Biswal, B. Malin, J. Duke, W. F. Stewart, and J. Sun, M. Shah, “Transformers in vision: A survey,” 2021, arXiv:2101.01169.
“Generating multi-label discrete patient records using generative adver- [67] G. Pang, C. Shen, L. Cao, and A. V. D. Hengel, “Deep learning for
sarial networks,” in Proc. 2nd Mach. Learn. Healthcare Conf., 2017, anomaly detection: A review,” ACM Comput. Surv., vol. 54, no. 2,
pp. 286–305. pp. 1–38, Mar. 2021.
[40] State of California, Department of Justice. (2018). California Consumer [68] A. F. Karr, A. P. Sanil, and D. L. Banks, “Data quality: A statistical
Privacy Act (CCPA). Accessed: Dec. 20, 2022. [Online]. Available: perspective,” Stat. Methodol., vol. 3, no. 2, pp. 137–173, 2006.
https://oag.ca.gov/privacy/ccpa [69] L. Xu and K. Veeramachaneni, “Synthesizing tabular data using gen-
[41] GDPR. (2016). Regulation (EU) 2016/679 of the European Parliament erative adversarial networks,” 2018, arXiv:1811.11264.
and of the Council. Official Journal of the European Union. [Online]. [70] G. Ke et al., “LightGBM: A highly efficient gradient boosting
Available: http://www.privacyregulation.eu/en/13.htm decision tree,” in Proc. Adv. Neural Inf. Process. Syst., 2017,
[42] P. Voigt and A. Von Dem Bussche, “The EU general data protection pp. 3146–3154.
regulation (GDPR),” in A Practical Guide, vol. 10, 1st ed. Cham, [71] L. Prokhorenkova, G. Gusev, A. Vorobev, A. V. Dorogush, and
Switzerland: Springer, 2017, Art. no. 3152676. A. Gulin, “CatBoost: Unbiased boosting with categorical features,” in
[43] M. Sahakyan, Z. Aung, and T. Rahwan, “Explainable artificial Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 6638–6648.
intelligence for tabular data: A survey,” IEEE Access, vol. 9, [72] Y. Zhu et al., “Converting tabular data into images for deep learning
pp. 135392–135422, 2021. with convolutional neural networks,” Sci. Rep., vol. 11, no. 1, pp. 1–11,
[44] B. I. Grisci, M. J. Krause, and M. Dorn, “Relevance aggregation for May 2021.
neural networks interpretability and knowledge discovery on tabular [73] N. Rahaman et al., “On the spectral bias of neural networks,” in Proc.
data,” Inf. Sci., vol. 559, pp. 111–129, Jun. 2021. Int. Conf. Mach. Learn., 2019, pp. 5301–5310.
[45] U. Bhatt et al., “Explainable machine learning in deployment,” in Proc. [74] B. R. Mitchell et al., “The spatial inductive bias of deep learning,”
Conf. Fairness, Accountability, Transparency, Jan. 2020, pp. 648–657. Ph.D. dissertation, Johns Hopkins Univ., Baltimore, MD, USA, 2017.
[46] T. Chen and C. Guestrin, “XGBoost: A scalable tree boosting system,” [75] Y. Gorishniy, I. Rubachev, and A. Babenko, “On embeddings for
in Proc. 22nd ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, numerical features in tabular deep learning,” 2022, arXiv:2203.05556.
Aug. 2016, pp. 785–794. [76] E. Fitkov-Norris, S. Vahid, and C. Hand, “Evaluating the impact of
[47] J. T. Hancock and T. M. Khoshgoftaar, “Survey on categorical data for categorical data encoding and scaling on neural network classification
neural networks,” J. Big Data, vol. 7, no. 1, pp. 1–41, Dec. 2020. performance: The case of repeat consumption of identical cultural
[48] Y. Gorishniy, I. Rubachev, V. Khrulkov, and A. Babenko, “Revisiting goods,” in Proc. Int. Conf. Eng. Appl. Neural Netw. Cham, Switzerland:
deep learning models for tabular data,” 2021, arXiv:2106.11959. Springer, 2012, pp. 343–352.
[49] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for [77] D. Baylor et al., “TFX: A TensorFlow-based production-scale machine
image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. learning platform,” in Proc. 23rd ACM SIGKDD Int. Conf. Knowl.
(CVPR), Jun. 2016, pp. 770–778. Discovery Data Mining, 2017, pp. 1387–1395.
[78] B. Sun et al., “SuperTML: Two-dimensional word embedding for [103] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and S. Y. Philip,
the precognition on structured tabular data,” in Proc. IEEE/CVF “A comprehensive survey on graph neural networks,” IEEE Trans.
Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2019, Neural Netw. Learn. Syst., vol. 32, no. 1, pp. 4–24, Mar. 2020.
pp. 1–9. [104] C. Wang, M. Li, and A. J. Smola, “Language models with transform-
[79] J. Yoon, Y. Zhang, J. Jordon, and M. van der Schaar, “VIME: ers,” 2019, arXiv:1904.09408.
Extending the success of self- and semi-supervised learning to tabular [105] A. F. T. Martins and R. Fernandez Astudillo, “From softmax to
domaindim,” in Proc. Adv. Neural Inf. Process. Syst., vol. 33, 2020, sparsemax: A sparse model of attention and multi-label classification,”
pp. 1–11. 2016, arXiv:1602.02068.
[80] D. Bahri, H. Jiang, Y. Tay, and D. Metzler, “SCARF: Self- [106] G. Van Rossum and F. L. Drake, Jr., Python Reference Manual.
supervised contrastive learning using random feature corruption,” 2021, Amsterdam, The Netherlands: Centrum voor Wiskunde en Informatica,
arXiv:2106.15147. 1995.
[81] H.-T. Cheng et al., “Wide & deep learning for recommender sys- [107] M. Joseph, “PyTorch tabular: A framework for deep learning with
tems,” in Proc. 1st Workshop Deep Learn. Recommender Syst., 2016, tabular data,” 2021, arXiv:2104.13638.
pp. 7–10. [108] S. Boughorbel, F. Jarray, and A. Kadri, “Fairness in TabNet model
[82] N. Frosst and G. Hinton, “Distilling a neural network into a soft by disentangled representation for the prediction of hospital no-show,”
decision tree,” 2017, arXiv:1711.09784. 2021, arXiv:2103.04048.
[83] J. Lian, X. Zhou, F. Zhang, Z. Chen, X. Xie, and G. Sun, “XDeepFM: [109] N. Mehrabi, F. Morstatter, N. Saxena, K. Lerman, and A. Galstyan,
Combining explicit and implicit feature interactions for recommender “A survey on bias and fairness in machine learning,” ACM Comput.
systems,” in Proc. 24th ACM SIGKDD Int. Conf. Knowl. Discovery Surv., vol. 54, no. 6, pp. 1–35, Jul. 2021.
Data Mining, Jul. 2018, pp. 1754–1763. [110] S. Yun, D. Han, S. Chun, S. J. Oh, Y. Yoo, and J. Choe, “CutMix:
[84] G. Ke, J. Zhang, Z. Xu, J. Bian, and T.-Y. Liu. (2018). TabNN: Regularization strategy to train strong classifiers with localizable fea-
A Universal Neural Network Solution for Tabular Data. [Online]. tures,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019,
Available: https://openreview.net/forum?id=r1eJssCqY7 pp. 6023–6032.
[85] R. Agarwal et al., “Neural additive models: Interpretable machine [111] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “Mixup:
learning with neural nets,” 2020, arXiv:2004.13912. Beyond empirical risk minimization,” 2017, arXiv:1710.09412.
[86] Y. Luo, H. Zhou, W.-W. Tu, Y. Chen, W. Dai, and Q. Yang, “Network [112] V. Borisov, J. Haug, and G. Kasneci, “CancelOut: A layer for feature
on network for tabular data classification in real-world applications,” selection in deep neural networks,” in Proc. Int. Conf. Artif. Neural
in Proc. 43rd Int. ACM SIGIR Conf. Res. Develop. Inf. Retr., Jul. 2020, Netw. Cham, Switzerland: Springer, 2019, pp. 72–83.
pp. 2317–2326. [113] G. Valdes, W. Arbelo, Y. Interian, and J. H. Friedman, “Lockout: Sparse
[87] Z. Liu, Q. Liu, H. Zhang, and Y. Chen, “DNN2LR: Interpretation- regularization of neural networks,” 2021, arXiv:2107.07160.
inspired feature crossing for real-world tabular data,” 2020, [114] J. Fiedler, “Simple modifications to improve tabular neural networks,”
arXiv:2008.09775. 2021, arXiv:2108.03214.
[88] S. Ivanov and L. Prokhorenkova, “Boost then Convolve: Gradient [115] K. Lounici, K. Meziani, and B. Riu, “Muddling label regularization:
boosting meets graph neural networks,” in Proc. Int. Conf. Learn. Deep learning for tabular datasets,” 2021, arXiv:2106.04462.
Represent., 2021, pp. 1–16. [116] S. Lundberg and S.-I. Lee, “A unified approach to interpreting model
[89] H. Luo, F. Cheng, H. Yu, and Y. Yi, “SDTR: Soft decision tree regressor predictions,” in Proc. NeurIPS, 2017, pp. 1–10.
for tabular data,” IEEE Access, vol. 9, pp. 55999–56011, 2021. [117] H. Chen, S. Jajodia, J. Liu, N. Park, V. Sokolov, and
[90] X. Huang, A. Khetan, M. Cvitkovic, and Z. Karnin, “TabTrans- V. S. Subrahmanian, “FakeTables: Using GANs to generate functional
former: Tabular data modeling using contextual embeddings,” 2020, dependency preserving tables with bounded real data,” in Proc. Twenty-
arXiv:2012.06678. Eighth Int. Joint Conf. Artif. Intell., Aug. 2019, pp. 2074–2080.
[91] S. Cai, K. Zheng, G. Chen, H. V. Jagadish, B. C. Ooi, and M. Zhang, [118] L. Gondara and K. Wang, “MIDA: Multiple imputation using denoising
“ARM-Net: Adaptive relation modeling network for structured data,” autoencoders,” in Proc. Pacific–Asia Conf. Knowl. Discovery Data
in Proc. Int. Conf. Manage. Data, Jun. 2021, pp. 207–220. Mining. Cham, Switzerland: Springer, 2018, pp. 260–272.
[119] R. D. Camino et al., “Working with deep generative models and tabular
[92] J. Kossen, N. Band, C. Lyle, A. Gomez, T. Rainforth, and Y. Gal,
data imputation,” in Proc. ICML Artemiss Workshop, 2020, pp. 1–6.
“Self-attention between datapoints: Going beyond individual input-
output pairs in deep learning,” in Proc. Adv. Neural Inf. Process. Syst., [120] M. Quintana and C. Miller, “Towards class-balancing human comfort
A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, Eds., 2021, datasets with GANs,” in Proc. 6th ACM Int. Conf. Syst. Energy-Efficient
pp. 28742–28756. Buildings, Cities, Transp., Nov. 2019, pp. 391–392.
[121] A. Koivu, M. Sairanen, A. Airola, and T. Pahikkala, “Synthetic minority
[93] Y. Yamada, O. Lindenbaum, S. Negahban, and Y. Kluger, “Feature
oversampling of vital statistics data with generative adversarial net-
selection using stochastic gates,” in Proc. Mach. Learn. Syst., 2020,
works,” J. Amer. Med. Inform. Assoc., vol. 27, no. 11, pp. 1667–1674,
pp. 8952–8963.
Nov. 2020.
[94] D. Micci-Barreca, “A preprocessing scheme for high-cardinality cat-
[122] J. Fan, J. Chen, T. Liu, Y. Shen, G. Li, and X. Du, “Relational
egorical attributes in classification and prediction problems,” ACM
data synthesis using generative adversarial networks: A design space
SIGKDD Explor. Newslett., vol. 3, no. 1, pp. 27–32, Jul. 2001.
exploration,” Proc. VLDB Endowment, vol. 13, no. 12, pp. 1962–1975,
[95] J. H. Friedman, “Stochastic gradient boosting,” Comput. Statist. Data Aug. 2020.
Anal., vol. 38, no. 4, pp. 367–378, 2002.
[123] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila,
[96] P. Langley and S. Sage, “Oblivious decision trees and abstract cases,” “Analyzing and improving the image quality of StyleGAN,” in Proc.
in Proc. Work. Notes AAAI Workshop Case-Based Reasoning. Seattle, IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020,
WA, USA, 1994, pp. 113–117. pp. 8110–8119.
[97] B. Peters, V. Niculae, and A. F. T. Martins, “Sparse Sequence-to- [124] K. Lin, D. Li, X. He, Z. Zhang, and M.-T. Sun, “Adversarial ranking
Sequence models,” 2019, arXiv:1905.05702. for language generation,” in Proc. Adv. Neural Inf. Process. Syst., 2017,
[98] Y. LeCun and C. Cortes. (2010). MNIST Handwritten Digit Database. pp. 1–11.
[Online]. Available: http://yann.lecun.com/exdb/mnist/ [125] S. Subramanian, S. Rajeswar, F. Dutil, C. Pal, and A. Courville,
[99] C. Guo and F. Berkhahn, “Entity embeddings of categorical variables,” “Adversarial generation of natural language,” in Proc. 2nd Workshop
2016, arXiv:1604.06737. Represent. Learn. NLP, 2017, pp. 241–251.
[100] S. Rendle, “Factorization machines,” in Proc. IEEE Int. Conf. Data [126] N. Patki, R. Wedge, and K. Veeramachaneni, “The synthetic data vault,”
Mining, Dec. 2010, pp. 995–1000. in Proc. IEEE Int. Conf. Data Sci. Adv. Analytics (DSAA), Oct. 2016,
[101] F. Moosmann, B. Triggs, and F. Jurie, “Fast discriminative visual pp. 399–410.
codebooks using randomized clustering forests,” in Proc. 20th Annu. [127] Z. Li, Y. Zhao, and J. Fu, “SynC: A copula based framework for
Conf. Neural Inf. Process. Syst. (NIPS). Cambridge, MA, USA: MIT generating synthetic data from aggregated sources,” in Proc. Int. Conf.
Press, 2006, pp. 985–992. Data Mining Workshops (ICDMW), Nov. 2020, pp. 571–578.
[102] X. He et al., “Practical lessons from predicting clicks on ads at [128] J. Zhang, G. Cormode, C. M. Procopiuc, D. Srivastava, and X. Xiao,
Facebook,” in Proc. 8th Int. Workshop Data Mining Online Advertising, “PrivBayes: Private data release via Bayesian networks,” ACM Trans.
2014, pp. 1–9. Database Syst., vol. 42, no. 4, pp. 1–41, Oct. 2017.
[129] C. Chow and C. Liu, “Approximating discrete probability distributions [153] E. Tjoa and C. Guan, “A survey on explainable artificial intelligence
with dependence trees,” IEEE Trans. Inf. Theory, vol. IT-14, no. 3, (XAI): Toward medical XAI,” IEEE Trans. Neural Netw. Learn. Syst.,
pp. 462–467, May 1968. vol. 32, no. 11, pp. 4793–4813, Nov. 2021.
[130] L. Xu, M. Skoularidou, A. Cuesta-Infante, and K. Veeramachaneni, [154] J. Kauffmann, M. Esders, L. Ruff, G. Montavon, W. Samek, and
“Modeling tabular data using conditional GAN,” in Proc. Adv. Neural K.-R. Müller, “From clustering to cluster explanations via neural
Inf. Process. Syst., vol. 33, 2019, pp. 1–11. networks,” IEEE Trans. Neural Netw. Learn. Syst., early access,
[131] L. V. H. Vardhan and S. Kok, “Generating privacy-preserving synthetic Jul. 7, 2022, doi: 10.1109/TNNLS.2022.3185901.
tabular data using oblivious variational autoencoders,” in Proc. Work- [155] R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, and
shop Econ. Privacy Data Labor 37th Int. Conf. Mach. Learn., 2020, D. Pedreschi, “A survey of methods for explaining black box models,”
pp. 1–8. ACM Comput. Surv., vol. 51, no. 5, pp. 1–42, Sep. 2019.
[132] M. Baak, S. Brugman, I. F. Rojas, L. Dalmeida, R. E. Urlus, and [156] K. Gade, S. C. Geyik, K. Kenthapadi, V. Mithal, and A. Taly, “Explain-
J.-B. Oger, “Synthsonic: Fast, probabilistic modeling and synthesis able AI in industry,” in Proc. 25th ACM SIGKDD Int. Conf. Knowl.
of tabular data,” in Proc. Int. Conf. Artif. Intell. Statist., 2022, Discovery Data Mining, Jul. 2019, pp. 3203–3204.
pp. 4747–4763. [157] M. Pawelczyk, S. Bielawski, J. Van Den Heuvel, T. Richter, and
[133] I. J. Goodfellow et al., “Generative adversarial networks,” 2014, G. Kasneci, “CARLA: A Python library to benchmark algorithmic
arXiv:1406.2661. recourse and counterfactual explanation algorithms,” in Proc. Adv.
[134] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation Neural Inf. Process. Syst. (NeurIPS) Benchmark Datasets Track, 2021,
learning with deep convolutional generative adversarial networks,” in pp. 1–22.
Proc. 4th Int. Conf. Learn. Represent. (ICLR), 2016, pp. 1–16. [158] Y. Lou, R. Caruana, and J. Gehrke, “Intelligible models for classifica-
[135] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein generative adver- tion and regression,” in Proc. 18th ACM SIGKDD Int. Conf. Knowl.
sarial networks,” in Proc. Int. Conf. Mach. Learn., 2017, pp. 214–223. Discovery Data Mining (KDD), 2012, pp. 150–158.
[136] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville, [159] D. Alvarez-Melis and T. S. Jaakkola, “Towards robust interpretabil-
“Improved training of Wasserstein GANs,” in Proc. 31st Int. Conf. ity with self-explaining neural networks,” in Proc. NeurIPS, 2018,
Neural Inf. Process. Syst., 2017, pp. 5769–5779. pp. 1–10.
[137] M. G. Bellemare et al., “The Cramér distance as a solution to biased [160] D. Wang, Q. Yang, A. Abdul, and B. Y. Lim, “Designing theory-driven
Wasserstein gradients,” 2017, arXiv:1705.10743. user-centric explainable AI,” in Proc. CHI, 2019, pp. 1–15.
[138] R. D. Hjelm, A. P. Jacob, T. Che, A. Trischler, K. Cho, and Y. Bengio, [161] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, and
“Boundary-seeking generative adversarial networks,” in Proc. Int. Conf. W. Samek, “On pixel-wise explanations for non-linear classifier deci-
Learn. Represent., 2018, pp. 1–17. sions by layer-wise relevance propagation,” PLoS ONE, vol. 10, no. 7,
[139] A. Srivastava, L. Valkov, C. Russell, M. U. Gutmann, and C. Sutton, Jul. 2015, Art. no. e0130140.
“VEEGAN: Reducing mode collapse in GANs using implicit varia- [162] G. Montavon, A. Binder, S. Lapuschkin, W. Samek, and K.-R. Müller,
tional learning,” in Proc. 31st Int. Conf. Neural Inf. Process. Syst., “Layer-wise relevance propagation: An overview,” in Explainable
2017, pp. 3310–3320. AI: Interpreting, Explaining and Visualizing Deep Learning. Cham,
[140] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” in Switzerland: Springer, 2019, pp. 193–209.
Proc. 2nd Int. Conf. Learn. Represent. (ICLR), Conf. Track, 2014, [163] G. Kasneci and T. Gottron, “LICON: A linear weighting scheme for
pp. 1–14. the contribution ofInput variables in deep artificial neural networks,” in
[141] C. Ma, S. Tschiatschek, R. Turner, J. M. Hernández-Lobato, and Proc. 25th ACM Int. Conf. Inf. Knowl. Manage., Oct. 2016, pp. 45–54.
C. Zhang, “VAEM: A deep generative model for heterogeneous mixed [164] M. Sundararajan, A. Taly, and Q. Yan, “Axiomatic attribution for deep
type data,” in Proc. Adv. Neural Inf. Process. Syst., vol. 33, 2020, networks,” in Proc. Int. Conf. Mach. Learn., 2017, pp. 3319–3328.
pp. 1–11. [165] A. Chattopadhay, A. Sarkar, P. Howlader, and V. N. Balasubramanian,
[142] N. Park, M. Mohammadi, K. Gorde, S. Jajodia, H. Park, and Y. Kim, “Grad-CAM++: Generalized gradient-based visual explanations for
“Data synthesis based on generative adversarial networks,” Proc. VLDB deep convolutional networks,” in Proc. IEEE Winter Conf. Appl.
Endowment, vol. 11, no. 10, pp. 1071–1083, Jun. 2018. Comput. Vis. (WACV), Mar. 2018, pp. 1–9.
[143] E. Jang, S. Gu, and B. Poole, “Categorical reparameterization [166] M. T. Ribeiro, S. Singh, and C. Guestrin, “‘Why should i trust
with Gumbel-Softmax,” in Proc. Int. Conf. Learn. Represent., 2017, you?’: Explaining the predictions of any classifier,” in Proc. 22nd
pp. 1–13. ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2016,
[144] B. Wen, L. O. Colon, K. P. Subbalakshmi, and R. Chandramouli, pp. 1135–1144.
“Causal-TGAN: Generating tabular data using causal generative adver- [167] M. T. Ribeiro, S. Singh, and C. Guestrin, “Anchors: High-precision
sarial networks,” 2021, arXiv:2104.10680. model-agnostic explanations,” in Proc. AAAI, 2018, pp. 1–9.
[145] J. Jordon, J. Yoon, and M. Van Der Schaar, “PATE-GAN: Generating [168] S. M. Lundberg et al., “From local explanations to global understanding
synthetic data with differential privacy guarantees,” in Proc. Int. Conf. with explainable AI for trees,” Nature Mach. Intell., vol. 2, pp. 56–67,
Learn. Represent., 2018, pp. 1–21. Jan. 2020.
[146] N. M. Jebreel, J. Domingo-Ferrer, A. Blanco-Justicia, and D. Sánchez, [169] J. Haug, S. Zürn, P. El-Jiz, and G. Kasneci, “On baselines for local
“Enhanced security and privacy via fragmented federated learning,” feature attributions,” 2021, arXiv:2101.00905.
IEEE Trans. Neural Netw. Learn. Syst., early access, Oct. 19, 2022, [170] Y. Liu, S. Khandagale, C. White, and W. Neiswanger, “Synthetic
doi: 10.1109/TNNLS.2022.3212627. benchmarks for scientific research in explainable machine learning,” in
[147] A. Mottini, A. Lheritier, and R. Acuna-Agost, “Airline passenger Proc. Adv. Neural Inf. Process. Syst. (NeurIPS) Benchmark Datasets
name record generation using generative adversarial networks,” 2018, Track, 2021, pp. 1–25.
arXiv:1807.06657. [171] S. Wachter, B. Mittelstadt, and C. Russell, “Counterfactual explanations
[148] R. Camino, C. Hammerschmidt, and R. State, “Generating multi- without opening the black box: Automated decisions and the GDPR,”
categorical samples with generative adversarial networks,” in Proc. Harvard J. Law Technol., vol. 31, no. 2, p. 841, 2018.
ICML Workshop Theor. Found. Appl. Deep Generative Models, 2018, [172] B. Ustun, A. Spangher, and Y. Liu, “Actionable recourse in linear
pp. 1–7. classification,” in Proc. Conf. Fairness, Accountability, Transparency,
[149] M. K. Baowaly, C.-C. Lin, C.-L. Liu, and K.-T. Chen, “Synthesizing Jan. 2019, pp. 10–19.
electronic health records using improved generative adversarial net- [173] C. Russell, “Efficient search for diverse coherent explanations,” in Proc.
works,” J. Amer. Med. Inform. Assoc., vol. 26, no. 3, pp. 228–241, Conf. Fairness, Accountability, Transparency, Jan. 2019, pp. 20–28.
Mar. 2019. [174] K. Rawal and H. Lakkaraju, “Beyond individualized recourse: Inter-
[150] Z. Zhao, A. Kunar, H. Van der Scheer, R. Birke, and pretable and interactive summaries of actionable recourses,” in Proc.
L. Y. Chen, “CTAB-GAN: Effective table data synthesizing,” 2021, NeurIPS, 2020, pp. 12187–12198.
arXiv:2102.08369. [175] A.-H. Karimi, G. Barthe, B. Balle, and I. Valera, “Model-agnostic
[151] V. Borisov, K. Seßler, T. Leemann, M. Pawelczyk, and G. Kasneci, counterfactual explanations for consequential decisions,” in Proc. Int.
“Language models are realistic tabular data generators,” 2022, Conf. Artif. Intell. Statist., 2020, pp. 895–905.
arXiv:2210.06280. [176] A. Dhurandhar et al., “Explanations based on the missing: Towards
[152] F. J. Massey, Jr., “The Kolmogorov-Smirnov test for goodness of fit,” contrastive explanations with pertinent negatives,” in Proc. Adv. Neural
J. Amer. Statist. Assoc., vol. 46, no. 253, pp. 68–78, 1951. Inf. Process. Syst. (NeurIPS), 2018, pp. 1–12.
[177] B. Mittelstadt, C. Russell, and S. Wachter, “Explaining explanations in [198] W. S. McCulloch and W. Pitts, “A logical calculus of the ideas
AI,” in Proc. Conf. Fairness, Accountability, Transparency, Jan. 2019, immanent in nervous activity,” Bull. Math. Biophys., vol. 5, no. 4,
pp. 279–288. pp. 115–133, 1943.
[178] M. Pawelczyk, K. Broelemann, and G. Kasneci, “Learning model- [199] T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama, “Optuna: A
agnostic counterfactual explanations for tabular data,” in Proc. Web next-generation hyperparameter optimization framework,” in Proc. 25th
Conf., Apr. 2020, pp. 3126–3132. ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Jul. 2019,
[179] M. Downs, J. L. Chu, Y. Yacoby, F. Doshi-Velez, and W. Pan, “CRUDS: pp. 1–10.
Counterfactual recourse using disentangled subspaces,” in Proc. ICML [200] D. Merkel, “Docker: Lightweight Linux containers for consistent
Workshop Hum. Interpretability Mach. Learn. (WHI), 2020, 1–23. development and deployment,” Linux J., vol. 2014, no. 239, p. 2, 2014.
[180] S. Joshi, O. Koyejo, W. Vijitbenjaronk, B. Kim, and J. Ghosh, “Towards [201] C. S. Bojer and J. P. Meldgaard, “Kaggle forecasting competitions: An
realistic individual recourse and actionable explanations in black-box overlooked learning opportunity,” Int. J. Forecasting, vol. 37, no. 2,
decision making systems,” 2019, arXiv:1907.09615. pp. 587–603, Apr. 2021.
[181] D. Mahajan, C. Tan, and A. Sharma, “Preserving causal constraints [202] Y. Rong, T. Leemann, V. Borisov, G. Kasneci, and E. Kasneci,
in counterfactual explanations for machine learning classifiers,” 2019, “A consistent and efficient evaluation strategy for attribution methods,”
arXiv:1912.03277. in Proc. Int. Conf. Mach. Learn., 2022, pp. 18770–18795.
[182] M. Pawelczyk, K. Broelemann, and G. Kasneci, “On counterfactual [203] R. Tomsett, D. Harborne, S. Chakraborty, P. Gurram, and A. Preece,
explanations under predictive multiplicity,” in Proc. Conf. Uncertainty “Sanity checks for saliency metrics,” in Proc. AAAI Conf. Artif. Intell.,
Artif. Intell. (UAI), 2020, pp. 809–818. vol. 34, no. 4, 2020, pp. 6021–6029.
[183] A. Nazábal, P. M. Olmos, Z. Ghahramani, and I. Valera, “Han- [204] E. Ntoutsi et al., “Bias in data-driven artificial intelligence systems-an
dling incomplete heterogeneous data using VAEs,” Pattern Recognit., introductory survey,” Wiley Interdiscipl. Reviews: Data Mining Knowl.
vol. 107, Nov. 2020, Art. no. 107501. Discovery, vol. 10, no. 3, p. e1356, 2020.
[184] S. Upadhyay, S. Joshi, and H. Lakkaraju, “Towards robust and reli- [205] A. Giloni et al., “BENN: Bias estimation using a deep neural network,”
able algorithmic recourse,” in Proc. Adv. Neural Inf. Process. Syst. IEEE Trans. Neural Netw. Learn. Syst., early access, May 11, 2022,
(NeurIPS), vol. 34, 2021, pp.16926–16937. doi: 10.1109/TNNLS.2022.3172365.
[185] R. Dominguez-Olmedo, A.-H. Karimi, and B. Schölkopf, “On the [206] Y. Sun, K. Tang, Z. Zhu, and X. Yao, “Concept drift adaptation by
adversarial robustness of causal algorithmic recourse,” in Proc. Int. exploiting historical knowledge,” IEEE Trans. Neural Netw. Learn.
Conf. Mach. Learn. (ICML), 2022, pp. 5324–5342. Syst., vol. 29, no. 10, pp. 4822–4832, Oct. 2018.
[186] M. Pawelczyk, T. Datta, J. Van-Den-Heuvel, G. Kasneci, and [207] P. Domingos and G. Hulten, “Mining high-speed data streams,” in Proc.
H. Lakkaraju, “Probabilistically robust recourse: Navigating the trade- 6th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining (KDD),
offs between costs and robustness in algorithmic recourse,” 2022, 2000, pp. 71–80.
arXiv:2203.06768. [208] C. Manapragada, G. I. Webb, and M. Salehi, “Extremely fast decision
[187] A.-H. Karimi, G. Barthe, B. Schölkopf, and I. Valera, “A survey tree,” in Proc. 24th ACM SIGKDD Int. Conf. Knowl. Discovery Data
of algorithmic recourse: Definitions, formulations, solutions, and Mining, Jul. 2018, pp. 1953–1962.
prospects,” 2020, arXiv:2010.04050. [209] P. Duda, M. Jaworski, A. Cader, and L. Wang, “On training deep neural
[188] S. Verma, J. Dickerson, and K. Hines, “Counterfactual explanations for networks using a streaming approach,” J. Artif. Intell. Soft Comput.
machine learning: A review,” 2020, arXiv:2010.10596. Res., vol. 10, no. 1, pp. 15–26, Jan. 2020.
[189] A. Krizhevsky, “Learning multiple layers of features from tiny images,” [210] J. Haug, M. Pawelczyk, K. Broelemann, and G. Kasneci, “Leveraging
Univ. Toronto, Toronto, ON, Canada, Tech. Rep., 2009. [Online]. Avail- model inherent variable importance for stable online feature selection,”
able: https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf in Proc. 26th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining,
[190] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, Aug. 2020, pp. 1478–1502.
“ImageNet: A large-scale hierarchical image database,” in Proc. IEEE [211] J. Haug and G. Kasneci, “Learning parameter distributions to detect
Conf. Comput. Vis. Pattern Recognit., Jun. 2009, pp. 248–255. concept drift in data streams,” in Proc. 25th Int. Conf. Pattern Recognit.
[191] FICO. (2019). Home Equity Line of Credit (HELOC) (ICPR), Jan. 2021, pp. 9452–9459.
Dataset. Accessed: Jun. 15, 2022. [Online]. Available: [212] C. Tan, F. Sun, T. Kong, W. Zhang, C. Yang, and C. Liu, “A survey on
https://community.fico.com/s/explainable-machine-learning-challenge deep transfer learning,” in Proc. Int. Conf. Artif. Neural Netw. Cham,
[192] P. Baldi, P. Sadowski, and D. Whiteson, “Searching for exotic particles Switzerland: Springer, 2018, pp. 270–279.
in high-energy physics with deep learning,” Nature Commun., vol. 5, [213] C. Shorten and T. M. Khoshgoftaar, “A survey on image data aug-
no. 1, pp. 1–9, Sep. 2014. mentation for deep learning,” J. Big Data, vol. 6, no. 1, pp. 1–48,
[193] C. Z. Mooney, Monte Carlo Simulation. Newbury Park, CA, USA: Dec. 2019.
SAGE, 1997. [214] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer,
[194] R. K. Pace and R. Barry, “Sparse spatial autoregressions,” Statist. “SMOTE: Synthetic minority over-sampling technique,” J. Artif. Intell.
Probab. Lett., vol. 33, pp. 291–297, May 1997. Res., vol. 16, no. 1, pp. 321–357, Jan. 2002.
[195] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, [215] L. Jing and Y. Tian, “Self-supervised visual feature learning with deep
Classification Regression Trees. Evanston, IL, USA: Routledge, neural networks: A survey,” IEEE Trans. Pattern Anal. Mach. Intell.,
2017. vol. 43, no. 11, pp. 4037–4058, Nov. 2021.
[196] L. Breiman, “Random forests,” Mach. Learn., vol. 45, no. 1, pp. 5–32, [216] X. Liu et al., “Self-supervised learning: Generative or contrastive,”
2001. IEEE Trans. Knowl. Data Eng., vol. 35, no. 1, pp. 857–876, Jan. 2021.
[197] K. Broelemann and G. Kasneci, “A gradient-based split criterion for [217] T. Ucar, E. Hajiramezanali, and L. Edwards, “SubTab: Subsetting
highly accurate and transparent model trees,” in Proc. IJCAI, 2019, features of tabular data for self-supervised representation learning,” in
pp. 1–8. Proc. Adv. Neural Inf. Process. Syst., vol. 34, 2021, pp. 1–13.