Deep Neural Networks and Tabular Data A Survey
Deep Neural Networks and Tabular Data A Survey
applications [38], [39]. Thus, to tackle the data preprocessing neural networks. An overview of explanation mechanisms
and privacy challenges, probabilistic tabular data generation for deep models for tabular data is presented in Section VI.
is essential. Finally, with stricter data protection laws such as In Section VII, we provide an extensive empirical comparison
California Consumer Privacy Act (CCPA) [40] and the Euro- of machine and deep learning methods on real-world data,
pean General Data Protection Regulation (EU GDPR) [41], which also involves model size, runtime, and interpretability.
which both mandate a right to explanations for automated In Section VIII, we summarize the state of the field and give
decision systems (e.g., in the form or recourse [42]), inter- future perspectives. Finally, we outline several open research
pretability is becoming a key aspect for predictive models used questions before concluding in Section IX.
for tabular data [43], [44]. During deployment, interpretability
methods also serve as a valuable tool for model debugging II. R ELATED W ORK
and auditing [45]. To the best of our knowledge, there is no study dedicated
Evidently, apart from the core challenges of inference, gen- exclusively to the application of deep neural networks to
eration, and interpretability, there are several other important tabular data, spanning the areas of supervised and unsuper-
subfields, such as working with data streams, distribution vised learning, data synthesis, and interpretability. Prior works
shifts, as well as privacy and fairness considerations that cover some of these aspects, but none of them systematically
should not be neglected. Nevertheless, to navigate the vast discusses the existing approaches in the broadness of this
body of literature, we focus on the identified core problems survey.
and thoroughly review the state of the art in this work. We will However, there are some works that cover parts of the
briefly discuss the remaining topics at the end of this survey. domain. There is a comprehensive analysis of common
Beyond reviewing current literature, we think that an approaches for categorical data encoding as a preprocessing
exhaustive comparison between existing deep learning step for deep neural networks by Hancock and Khoshgof-
approaches for heterogeneous tabular data is necessary to put taar [47]. The authors compared existing methods for cate-
reported results into context. The variety of benchmarking gorical data encoding on various tabular datasets and different
datasets and the different setups often prevent the comparison deep learning architectures. We discuss the key categorical
of results across papers. In addition, important aspects of data encoding methods in Section IV-A1.
deep learning models, such as training and inference time, A recent survey by Sahakyan et al. [43] summarizes expla-
model size, and interpretability, are usually not discussed. nation techniques in the context of tabular data. Hence, we do
We aim to bridge this gap by providing a comparison of not provide a detailed discussion of explainable machine
the surveyed inference approaches with classical—yet very learning for tabular data in this article. However, for the sake
strong—baselines such as XGBoost [46]. We open-source of completeness, we present some of the most relevant works
our code, allowing researchers to reproduce and extend our in Section VI and highlight open challenges in this area.
findings. Gorishniy et al. [48] empirically evaluated a large number of
In summary, the aims of this survey are to provide the state-of-the-art deep learning approaches for tabular data on a
following: wide range of datasets. He et al. [49] demonstrated that a tuned
deep neural network model with a ResNet-like architecture
1) a thorough review of existing scientific literature on deep shows comparable performance to some state-of-the-art deep
learning for tabular data; learning approaches for tabular data.
2) a taxonomic categorization of the available approaches Recently, Shwartz-Ziv and Armon [8] published a study
for classification and regression tasks on heterogeneous on several different deep models for tabular data, including
tabular data; TabNet [6], NODE [7], and Net-DNF [50]. In addition,
3) a presentation of the state of the art and promising paths they compared deep learning approaches to gradient boosting
toward tabular data generation; decision tree (GBDT) algorithms regarding accuracy, training
4) an overview of existing explanation approaches for deep effort, inference efficiency, and hyperparameter optimization
models for tabular data; time. They observed that deep models had the best results
5) an extensive empirical comparison of traditional on their chosen datasets, and however, not one single deep
machine learning methods and deep learning models on model could outperform all the others in general. The deep
multiple real-world heterogeneous tabular datasets; models were challenged by GBDTs, leading the authors to
6) a discussion on the main reasons for the limited success conclude that efficient tabular data modeling using deep neural
of deep learning on tabular data; networks is still an open research problem. In the face of
7) a list of open challenges related to deep learning for this evidence, we aim to integrate the necessary background
tabular data. for future research on the inference problem and on the
Accordingly, this survey is structured as follows. We dis- intertwined challenges of generation and explainability into
cuss related works in Section II. To introduce the reader to a single work.
the field, in Section III, we provide definitions of the key
III. TABULAR DATA AND D EEP N EURAL N ETWORKS
terms, a brief outline of the domain’s history, and propose
a unified taxonomy of current approaches to deep learning A. Definitions
with tabular data. Section IV covers the main methods for In this section, we give definitions for central terms used in
modeling tabular data using deep neural networks. Section V this work. We also provide pointers to the original works for
presents an overview on tabular data generation using deep more detailed explanations of the methods.
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
TABLE I
E XAMPLE OF A H ETEROGENEOUS TABULAR D ATASET. H ERE , W E S HOW
F IVE S AMPLES W ITH S ELECTED VARIABLES F ROM THE A DULT
D ATASET [54]. S ECTION VII-A P ROVIDES F URTHER
D ETAILS ON T HIS D ATASET
improve the performance of deep neural networks on tabular to information loss, leading to a reduction in predictive
data [10]. This has led to an intensification of research on performance [76].
regularization approaches. 4) Importance of Single Features: While typically changing
Due to the tremendous success of attention-based the class of an image requires a coordinated change in
approaches such as transformers on textual [64] and visual many features, i.e., pixels, the smallest possible change
data [65], [66], researchers have recently also started applying of a categorical (or binary) feature can entirely flip a
attention-based methods and self-supervised learning tech- prediction on tabular data [63]. In contrast to deep neural
niques to tabular data. After the introduction of transformer networks, decision-tree algorithms can handle varying
architectures to the field of tabular data [6], a lot of research feature importance exceptionally well by selecting a
effort has focused on transformer architectures that can be single feature and appropriate threshold (i.e., splitting)
successfully applied to very large tabular datasets. values and “ignoring” the rest of the data sample. Shavitt
and Segal [63] have argued that individual weight reg-
ularization may mitigate this challenge and motivated
C. Challenges of Learning With Tabular Data
more work in this direction [10].
As we have mentioned in Section II, deep neural networks With these four fundamental challenges in mind, we continue
often perform less favorably compared to more traditional by organizing and discussing the strategies developed to
machine learning methods (e.g., tree-based methods) when address them. We start by developing a suitable taxonomy.
dealing with tabular data. However, it is often unclear why
deep learning cannot achieve the same level of predictive
quality as in other domains such as image classification and D. Unified Taxonomy
natural language processing. In the following, we identify and In this section, we introduce a taxonomy of approaches that
discuss four possible reasons. allows for a unified view of the field. We divide the works
1) Low-Quality Training Data: Data quality is a common from the deep learning with tabular data literature into three
issue with real-world tabular datasets. They often include main categories: data transformation methods, specialized
missing values [34], extreme data (outliers) [67], and architectures, and regularization models. In Fig. 1, we provide
erroneous or inconsistent data [68] and have a small an overview of our taxonomy of deep learning methods for
overall size relative to the high-dimensional feature tabular data.
vectors generated from the data [69]. Also, due to the 1) Data Transformation Methods: The methods in the first
expensive nature of data collection, tabular data are group transform categorical and numerical data. This is usually
frequently class-imbalanced. These challenges affect all done to enable deep neural network models to better extract
machine learning algorithms; however, most of the mod- the information signal. Methods from this group do not require
ern decision tree-based algorithms can handle missing new architectures or adaptations of the existing data processing
values or different/extreme variable ranges internally pipeline. Nevertheless, the transformation step comes at the
by looking for appropriate approximations and split cost of an increased preprocessing time. This might be an
values [46], [70], [71]. issue for high-load systems [77], particularly in the presence
2) Missing or Complex Irregular Spatial Dependencies: of categorical variables with high cardinality and growing
There is often no spatial correlation between the vari- dataset size. We can further subdivide this area into single-
ables in tabular datasets [72] or the dependencies dimensional encodings and multidimensional encodings. The
between features are rather complex and irregular. When former encodings are employed to transform each feature
working with tabular data, the structure and relationships independently while the latter encoding methods map an entire
between its features have to be learned from scratch. record to another representation.
Thus, the inductive biases used in popular models for 2) Specialized Architectures: The biggest share of works
homogeneous data, such as convolutional neural net- investigates specialized architectures and suggests that a dif-
works, are unsuitable for modeling this data type [50], ferent deep neural network architecture is required for tabular
[73], [74]. data. Two types of architectures are of particular importance:
3) Dependency on Preprocessing: A key advantage of hybrid models fuse classical machine learning approaches
deep learning on homogeneous data is that it includes (e.g., decision trees) with neural networks, while transformer-
an implicit representation learning step [2], so only a based models rely on attention mechanisms.
minimal amount of preprocessing or explicit feature con- 3) Regularization Models: Finally, the group of regular-
struction is required. However, for tabular data and deep ization models claims that one of the main reasons for the
neural networks, the performance may strongly depend moderate performance of deep learning models on tabular data
on the selected preprocessing strategy [75]. Handling is their extreme nonlinearity and model complexity. Therefore,
the categorical features remains particularly challenging strong regularization schemes are proposed as a solution. They
[47] and can easily lead to a very sparse feature matrix are mainly implemented in the form of special-purpose loss
(e.g., by using a one-hot encoding scheme) or introduce functions.
a synthetic ordering of previously unordered values (e.g., We believe that our taxonomy may help practitioners find
by using an ordinal encoding scheme). Finally, pre- the methods of choice that can be easily integrated into their
processing methods for deep neural networks may lead existing tool chain. For instance, applying data transformations
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
can result in performance improvements while maintaining approach is also used in the CatBoost framework [71], a state-
the current model architecture. Conversely, using specialized of-the-art machine learning library for heterogeneous tabular
architectures, the data preprocessing pipeline can be kept data based on the gradient boosting algorithm [95].
intact. A different strategy is hash-based encoding. Every category
is transformed into a fixed-size value via a deterministic hash
IV. D EEP N EURAL N ETWORKS FOR TABULAR DATA function. The output size is not directly dependent on the
In this section, we discuss the use of deep neural networks number of input categories but can be chosen manually.
on tabular data for classification and regression tasks according 2) Multidimensional Encoding: A first automatic encoding
to the taxonomy presented in Section III. We provide an strategy is the value imputation and mask estimation (VIME)
overview of existing deep learning approaches in this area approach [79]. The authors propose a self-supervised and
of research in Table II and examine the three methodolog- semisupervised deep learning framework for tabular data that
ical categories in detail: data transformation methods (see trains an encoder in a self-supervised fashion by using two
Section IV-A), architecture-based methods (see Section IV-B), pretext tasks. Those tasks are independent of the concrete
and regularization-based models (see Section IV-C). downstream task that the predictor has to solve. The first
task of VIME is called mask vector estimation; its goal is
to determine which values in a sample are corrupted. The
A. Data Transformation Methods second task, i.e., feature vector estimation, is to recover the
Most traditional approaches for deep neural networks on original values of the sample. The encoder itself is a simple
tabular data fall into this group. Interestingly, data preprocess- multilayer perceptron. This automatic encoding makes use of
ing plays a relatively minor role in computer vision, even the fact that there is often much more unlabeled than labeled
though the field is currently dominated by deep learning solu- data. The encoder learns how to construct an informative
tions [2]. There are many different possibilities to transform homogeneous representation of the raw input data. In the
tabular data, and each may have a different impact on the semisupervised step, a predictive model, which is also a
learning results [47]. deep neural network model, is trained using the labeled and
1) Single-Dimensional Encoding: One of the critical obsta- unlabeled data transformed by the encoder. For the encoder,
cles for deep learning with tabular data is categorical variables. a novel data augmentation method is used, corrupting an unla-
Since neural networks only accept real number vectors as beled data point multiple times with different masks. On the
inputs, these values must be transformed before a model can predictions from all augmented samples from one original data
use them. Therefore, the first class of methods attempts to point, a consistency loss can be computed, which rewards
encode categorical variables in a way suitable for deep learning similar outputs. To summarize, the VIME network trains an
models. encoder, which is responsible to transform the categorical and
Approaches in this group [47] are divided into deterministic numerical features into a new homogeneous and informative
techniques, which can be used before training the model, and representation. This transformed feature vector is used as an
more complicated automatic techniques that are part of the input to the predictive model. For the encoder itself, the
model architecture. There are many ways for deterministic data categorical data can be transformed by a simple one-hot encod-
encoding; hence, we restrict ourselves to the most common ing and binary encoding. The experimental results highlight
ones without the claim of completeness. how the self-supervised and semisupervised variants of the
The simplest data encoding technique might be ordinal or VIME framework can boost the performance over that of other
label encoding. Every category is just mapped to a discrete baselines such as XGBoost. Even in the absence of unlabeled
numeric value, e.g., {Apple, Banana} are encoded as {0, 1}. data, learning encodings in the proposed manner is shown to
One drawback of this method may be that it introduces an be beneficial for downstream performance.
artificial order to previously unordered categories. Another Another stream of research aims at transforming the tabular
straightforward method that does not induce any order is the input into a more homogeneous format. Since the revival
one-hot encoding. One additional column for each unique of deep learning, convolutional neural networks have shown
category is added to the data. Only the column corresponding tremendous success in computer vision tasks. Therefore, Sun
to the observed category is assigned the value one, with the et al. [78] proposed the SuperTML method, which is a data
other values being zero. In our example, Apple could be conversion technique to transform tabular data into an image
encoded as (1,0) and Banana as (0,1). In the presence data format (2-D matrices), i.e., black-and-white images.
of a diverse set of categories in the data, this method can lead On three datasets, SuperTML shows performance comparable
to high-dimensional sparse feature vectors and exacerbate the with or superior to XGBoost.
“curse of dimensionality” problem. The image generator for tabular data (IGTD) in [72] follows
One approach that needs no extra columns and does not an idea similar to SuperTML. The IGTD framework converts
include any artificial order is the so-called leave-one-out tabular data into images to make use of classical convolutional
encoding. It is based on the target encoding technique pro- architectures. As convolutional neural networks rely on spatial
posed in the work in [94], where every category is replaced dependencies, the transformation into images is optimized
with the mean of the target variable of that category. The leave- by minimizing the difference between the feature distance
one-out encoding excludes the current row when computing ranking of the tabular data and the pixel distance ranking of
the mean of the target variable to avoid overfitting. This the generated image. Every feature corresponds to one pixel,
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
TABLE II
OVERVIEW OF D EEP L EARNING A PPROACHES FOR TABULAR D ATA . W E O RGANIZE T HEM IN C ATEGORIES O RDERED C HRONOLOGICALLY I NSIDE THE
G ROUPS . T HE “I NTERPRETABILITY ” C OLUMN I NDICATES W HETHER THE A PPROACH O FFERS S OME F ORM I NTERPRETABILITY FOR THE M ODEL’ S
D ECISIONS . T HE K EY C HARACTERISTICS OF E VERY M ODEL A RE S UMMARIZED IN THE L AST C OLUMN
which leads to compact images with similar features close at 1) Hybrid Models: Most approaches for deep neural net-
neighboring pixels. Thus, IGDTs can be used in the absence of works on tabular data are hybrid models. They transform
domain knowledge. The authors show relatively solid results the data and fuse successful classical machine learning
for data with strong feature relationships, but the method approaches, often decision trees, with neural networks. We dis-
may fail if the features are independent or feature similarities tinguish between fully differentiable models, which can be
cannot characterize the relationships. In their experiments, differentiated with respect to all their parameters and partly
the authors used only gene expression profiles and molecular differentiable models.
descriptors of drugs as data. This kind of data may lead a) Fully differentiable models: The fully differentiable
to a favorable inductive bias, so the general viability of the models in this category offer a valuable property: They permit
approach remains unclear. end-to-end deep learning for training and inference by means
of gradient descent optimizers. Thus, they allow for highly
B. Specialized Architectures efficient implementations in modern deep learning frameworks
Specialized architectures form the largest group of that exploit GPU or TPU acceleration throughout the code.
approaches for deep tabular data learning. In this group, Popov et al. [7] proposed an ensemble of differentiable
the focus is on the development and investigation of novel oblivious decision trees [96]—also known as the NODE
deep neural network architectures designed specifically for framework for deep learning on tabular data. Oblivious deci-
heterogeneous tabular data. Guided by the types of available sion trees use the same splitting function for all nodes on the
models, we divide this group into two subgroups: hybrid same level and can therefore be easily parallelized. NODE is
models (presented in IV-B1) and transformer-based models inspired by the successful CatBoost [71] framework. To make
(discussed in IV-B2). the whole architecture fully differentiable and benefit from
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
end-to-end optimization, NODE utilizes the entmax transfor- The work by Cheng et al. [81] proposes a hybrid archi-
mation [97] and soft splits. In the original experiments, the tecture that consists of linear and deep neural network
NODE framework outperforms XGBoost and other GBDT models—Wide&Deep. A linear model that takes single fea-
models on many datasets. As NODE is based on decision tree tures and a wide selection of handcrafted logical expressions
ensembles, there is no preprocessing or transformation of the on features as an input is enhanced by a deep neural net-
categorical data necessary. Decision trees are known to handle work to improve the generalization capabilities. In addition,
discrete features well. In the official implementation, strings Wide&Deep learns an n-dimensional embedding vector for
are converted to integers using the leave-one-out encoding each categorical feature. All embeddings are concatenated
scheme. The NODE framework is widely used and provides resulting in a dense vector used as input to the neural net-
a sound implementation that can be readily deployed. work. The final prediction can be understood as a sum of
Frosst and Hinton [82] contributed another model relying both models. Experiments with a real-world system for app
on soft decision trees (SDTs) to make neural networks more recommendation confirmed that users installed apps suggested
interpretable. They investigated training a deep neural network by Wide&Deep were significantly more often than those
first, before using a mixture of its outputs and the ground-truth provided by the previous model. A similar work by Guo
labels to train the SDT model in a second step. The authors and Berkhahn [99] proposes an embedding using deep neural
showed that training a neural model first increases accuracy networks for categorical variables.
over SDTs that are directly learned from the data. However, Another contribution to the realm of Wide&Deep models is
their distilled trees still exhibit a performance gap to the neural DeepFM [15]. The authors demonstrate that it is possible to
networks that were fit in the initial step. Nevertheless, the replace the handcrafted feature transformations with learned
model itself shows a clear relationship among different classes factorization machines (FMs) [100]. The FM is an extension
in a hierarchical fashion. It groups different categorical values of a linear model designed to capture lower order interac-
based on the common patterns, e.g., digits 8 and 9 from tions between features within high-dimensional and sparse
the MNIST dataset [98]. To summarize, the proposed method data efficiently. Higher order interactions are modeled by
allows for high interpretability and efficient inference, at the a deep neural network. Similar to the original Wide&Deep
cost of slightly reduced accuracy. model, DeepFM also relies on the same embedding vectors
Follow-up work [89] extends this line of research to het- for its “wide” and “deep” parts. In contrast to the original
erogeneous tabular data and regression tasks and presents the Wide&Deep model, however, DeepFM alleviates the need for
SDT regressor (SDTR) framework. The SDTR is a neural manual feature engineering. The experimental results show
network, which imitates a binary decision tree. Therefore, all a solid improvement in CTR prediction tasks compared to
neurons, such as nodes in a tree, get the same input from the a variety of models relying on either low- or high-order
data instead of the output from previous layers. In the case of dependencies only and compared to other hybrid approaches.
deep networks, the SDTR could not beat other state-of-the-art Finally, network-on-network (NON) [86] is a classifica-
models, but it has shown promising results in a low-memory tion model for tabular data, which focuses on capturing
setting, where single tree models and shallow architectures the intrafeature information efficiently. It consists of three
were compared. components: a fieldwise network consisting of one unique
Katzir et al. [50] followed the related idea. Their Net-DNF deep neural network for every column to capture the column-
builds on the observation that every decision tree is merely specific information, an across-field network, which chooses
a form of a Boolean formula, more precisely a disjunctive the optimal operations based on the dataset, and an operation
normal form. They use this inductive bias to design the fusion network, connecting the chosen operations allowing for
architecture of a neural network, which is able to imitate the nonlinearities. As the optimal operations for the specific data
characteristics of the GBDT algorithm. The resulting Net-DNF are selected, the performance is considerably better than that
was tested for classification tasks on datasets with no missing of other deep learning models. However, decision trees, the
values, where it showed the results that are comparable to current state-of-the-art models for tabular data, were not listed
those of XGBoost [46]. However, the authors did not men- among the baselines. Also, training as many neural networks
tion how to handle high-cardinality categorical data, as the as columns and selecting the operations on the fly may lead
used datasets contained mostly numerical and few binary to a long computation time.
features. b) Partly differentiable models: This subgroup of hybrid
Linear models (e.g., linear and logistic regression) provide models aims at combining nondifferentiable approaches with
global interpretability but are inferior to complex deep neural deep neural networks. Models from this group usually utilize
networks. Usually, handcrafted feature engineering is required decision trees for the nondifferentiable part.
to improve the accuracy of linear models. Liu et al. [87] The DeepGBM model [62] combines the flexibility of
used a deep neural network to combine the features in a deep neural networks with the preprocessing capabilities of
possibly nonlinear way; the resulting combination of fea- GBDTs. DeepGBM consists of two neural networks—CatNN
tures then serves as input to the linear model. In their and GBDT2NN. While CatNN is specialized to handle sparse
approach—termed DDN2LR—this enhances the simple, inter- categorical features, GBDT2NN is specialized to deal with
pretable linear model. In experimental evaluations, DNN2LR dense numerical features.
can outperform other more complex DNN models while main- In the preprocessing step for the CatNN network, the cate-
taining some extent of interpretability. gorical data are transformed via ordinal encoding (to convert
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
uses self-attention-based transformers to map the categorical confirmed that this new approach can reach state-of-the-
features to contextual embedding. This embedding is more art results on most datasets by using intersample attention
robust to missing or noisy data and enables interpretability. mechanisms.
The embedded categorical features are then together with the
C. Regularization Models
numerical ones fed into a simple multilayer perceptron. If,
in addition, there is an extra amount of unlabeled data, unsu- The third group of approaches argues that extreme flexi-
pervised pretraining can improve the results, using masked bility of deep learning models for tabular data is one of the
language modeling or replacing token detection. Extensive main learning obstacles and strong regularization of learned
experiments show that TabTransformer matches the perfor- parameters may improve the overall performance.
mance of tree-based ensemble techniques, showing success One of the first methods in this category was the regu-
also when dealing with missing or noisy data. The TabTrans- larization learning network (RLN) proposed by Shavitt and
former network puts a significant focus on the categorical Segal [63], which uses a learned regularization scheme. The
features. It transforms the embedding of those features into main idea is based on the observation that features in tab-
contextual embedding, which is then used as input for the ular datasets have very different importances. Contrarily to
multilayer perceptron. This embedding is implemented by other data modalities data such as images or text, a single
different multihead attention-based transformers, which are tabular feature may change the entire prediction. Therefore,
optimized during training. the authors apply trainable regularization coefficients to each
ARM-net [91] is an adaptive neural network for relation single weight in a neural network, hence allowing high
modeling tailored to tabular data. The key idea of the ARM-net sensitivity with respect to certain inputs or network parts
framework is to model feature interactions with combined while being insensitive to others. To efficiently determine
features (feature crosses) selectively and dynamically by first the corresponding coefficients, the authors propose a novel
transforming the input features into exponential space and loss function termed “counterfactual loss.” The regularization
then determining the interaction order and interaction weights coefficients lead to a very sparse network, which also provides
adaptively for each feature cross. Furthermore, the authors the importance of the remaining input features.
propose a novel sparse attention mechanism to generate the In their experiments, RLNs outperform deep neural net-
interaction weights given the input data dynamically. Thus, works and obtain the results comparable to those of the GBDT
users can explicitly model feature crosses of arbitrary orders algorithm, but the evaluation relies on a dataset with mainly
with noisy features filtered selectively. On five real-world numerical data to compare the models. The RLN paper does
datasets, ARM-net shows its superior effectiveness in rep- not address the issues of categorical data. For the experiments
resenting feature interactions compared to various baselines, and the example implementation, datasets with exclusively
which model the feature interactions in different ways. numerical data (except for the gender attribute) were used.
Self-attention and intersample attention transformer A similar idea is proposed in [112], where regularization
(SAINT) [9] is a hybrid attention approach, combining coefficients are learned only in the first layer with a goal to
self-attention [5] with intersample attention over multiple extract feature importance.
rows. When handling missing or noisy data, this mechanism Kadra et al. [10] stated that simple multilayer percep-
allows the model to borrow the corresponding information trons can outperform state-of-the-art algorithms on tabular
from similar samples, which improves the model’s robustness. data if deep learning networks are properly regularized. The
The technique is reminiscent of nearest neighbor imputation. authors propose a “cocktail” of regularization with 13 different
In addition, all features are embedded into a combined dense techniques that are applied jointly. From those, the optimal
latent vector, enhancing existing correlations between values subset and their subsidiary hyperparameters are selected. They
from one data point. To exploit the presence of unlabeled data, demonstrate in extensive experiments that the regulariza-
a self-supervised contrastive pre-training can further improve tion “cocktails” can not only improve the performance of
the results, minimizing the distance between two views of the multilayer perceptrons but these simple models also outper-
same sample and maximizing the distance between different form tree-based architectures. On the downside, the extensive
ones. Like the VIME framework (Section IV-A1), SAINT per-dataset regularization and hyperparameter optimization
uses CutMix [110] to augment samples in the input space and take much more computation time than the GBDT algorithm.
uses mixup [111] in the embedding space. The experimental There are several other noteworthy works [113], [114],
results show that SAINT outperforms tree-based models [115], indicating that strong regularization of deep neural
like XGBoost as well as other deep learning approaches for networks can be beneficial for tabular data.
tabular data on average. When unlabeled data are available,
the performance can be improved further using the proposed V. TABULAR DATA G ENERATION
pretraining. For many applications, the generation of realistic tabular
Finally, even some new learning paradigms are being pro- data is fundamental. Three of the main purposes are data
posed. For instance, the nonparametric transformer (NPT) [92] augmentation [117], data imputation (i.e., the filling of missing
does not construct a mapping from individual inputs to outputs values) [118], [119], and rebalancing [36], [37], [120], [121].
but uses the entire dataset at once. By using attention between Another highly relevant topic is privacy-aware machine learn-
data points, relations between arbitrary samples can be mod- ing [38], [39], [122] where generated data can potentially be
eled and leveraged for classifying test samples. Experiments leveraged to overcome privacy concerns.
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
A. Methods to which extent the inductive bias used for images are suitable
While the generation of images and text is highly for tabular data.
explored [123], [124], [125], generating synthetic tabular data The approach by Xu et al. [130] focuses on the correlation
is a less frequent concern. The mixed structure of discrete and between the features of one data point. The authors first pro-
continuous features along with their different value distribu- pose the mode-specific normalization technique for data pre-
tions still poses a significant challenge. processing that allows to transform non-Gaussian distributions
Classical approaches for the data generation task include in the continuous columns. They express numeric values in
Copulas [126], [127] and Bayesian networks [128]. Among terms of a mixture component number and the deviation from
Bayesian networks, those based on the Chow–Liu approxima- that component’s center. This allows to represent multimodal
tion [129] are especially popular [38], [130], [131], [132]. and skewed distributions. Their generative solution, coined
In the deep learning era, generative adversarial networks CTGAN, uses the conditional GAN architecture to enforce
(GANs) [133] have proven highly successful for the generation learning proper conditional distributions for each column.
of images [123], [134]. GANs were recently introduced as To obtain categorical values and to allow for backpropagation
an original way to train a generative deep neural network in the presence of categorical values, the gumbel-softmax
model. They consist of two separate models: a generator trick [143] is utilized. The authors also propose a model based
G that generates samples from the data distribution and a on VAEs, named tabular VAE (TVAE), which outperforms
discriminator D that estimates the probability that a sample their suggested GAN approach. Both approaches can be con-
came from the ground-truth distribution. Both G and D are sidered state of the art.
usually chosen to be nonlinear functions such as multilayer While GANs and VAEs are prevalent, other recently
perceptrons. To learn a generator distribution pg over data proposed architectures include machine-learned causal mod-
x, the generator G(z; θg ) maps the samples from a noise els [144] and invertible flows [38]. When privacy is the main
distribution pz (z) (e.g., the Gaussian distribution) to the input factor of concern, models, such as PATE-GAN [145], provide
data space. The discriminator D(x; θd ) outputs the probability generative models with certain differential privacy guarantees.
that a data point x comes from the training data’s distribution Although very relevant for practical applications, such privacy
pdata rather than from the generator’s output distribution pg . guarantees and related federated learning approaches with
During joint training of G and D, G will start generating tabular data [146] are outside the scope of this review.
successively more realistic samples to fool the discriminator Fan et al. [122] compared a variety of different GAN archi-
D. For more details on GANs, we refer the interested reader tectures for tabular data synthesis and recommended using
to the original paper [133]. a simple, fully connected architecture with a vanilla GAN
In Table III, we provide an overview of tabular generation loss with minor changes to prevent mode collapse. They also
approaches that use deep learning techniques. Note that due use the normalization proposed in [130]. In their experiments,
to the enormous number of approaches, we list the most the WGAN loss or the use of convolutional architectures on
influential works that address the problem of data generation tabular data does boost the generative performance.
with a particular focus on tabular data. We exclude works that
are targeted toward highly domain-specific tasks.
Although it was found that GANs lag behind at the genera- B. Assessing Generative Quality
tion of discrete outputs such as natural language [125], they are To assess the quality of the generated data, several per-
still frequently chosen to generate tabular data. Vanilla GANs formance measures are used. The most common approach
or derivates, such as the Wasserstein GAN (WGAN) [135], is to define a proxy classification task and train one model
WGAN with gradient penalty (WGAN-GP) [136], Cramér for it on the real training set and another on the artificially
GAN [137], or the Boundary seeking GAN [138], which generated dataset. With a highly capable generator, the predic-
is designed to model discrete data, are commonly used tive performance of the artificial-data model on the real-data
in the literature to generate tabular data (cf. Table III). test set should be almost on par with its real-data counter-
Moreover, VeeGAN [139] is frequently used as a reference part. This measure is often referred to as machine learning
for tabular data generation [38], [130], [131]. Apart from efficacy and used in [39], [131], and [147]. In nonobvious
GANs, autoencoder-based architectures—in particular those classification tasks, an arbitrary feature can be used as a
relying on variational autoencoders (VAEs) [140]—have been label and predicted [39], [148], [149]. Another approach is
proposed [130], [141]. to visually inspect the modeled distributions per feature, e.g.,
In the following, we will briefly discuss the most rele- the cumulative distribution functions [117], or compare the
vant approaches that helped shape the domain. For example, expected values in scatter plots [39], [148]. A more quan-
MedGAN [39] was one of the first works and provides a deep titative approach is the use of statistical tests, such as the
learning model to generate patient records. As all the features Kolmogorov–Smirnov test [152], to assess the distributional
in their work are discrete, this model cannot be easily trans- difference [149]. On synthetic datasets, the output distribution
ferred to arbitrary tabular datasets. The table-GAN approach can be compared to the ground truth, e.g., in terms of log
in [142] adapts the deep convolutional GAN for tabular data. likelihood [130], [144]. Because overfitted models can also
Specifically, the features from one record are converted into a obtain good scores, Xu et al. [130] proposed evaluating the
matrix so that they can be processed by convolutional filters of likelihood of a test set under an estimate of the GAN’s
a convolutional neural network. However, it remains unclear output distribution. Especially in a privacy-preserving context,
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
TABLE III methods aim to highlight the influence of the inputs that have
G ENERATION OF TABULAR D ATA U SING D EEP N EURAL on the prediction by assigning importance scores to the input
N ETWORK M ODELS ( IN C HRONOLOGICAL O RDER )
features. Some popular approaches for model explanations aim
at constructing classification models that are explainable by
design [158], [159], [160]. This is often achieved by enforcing
the deep neural network model to be locally linear. Moreover,
if the model’s parameters are known and can be accessed,
then the explanation technique can use these parameters to
generate the model explanation. For such settings, relevance-
propagation-based methods, e.g., [161], [162], and gradient-
based approaches, e.g., [163], [164], [165], have been sug-
gested. In cases where the parameters of the neural network
cannot be accessed, model-agnostic approaches can prove
useful. This group of approaches seeks to explain a model’s
behavior locally by applying surrogate models [116], [166],
[167], [168], [169], which are interpretable by design and are
used to explain individual predictions of black-box machine
learning models. In order to test the performance of these
black-box explanations techniques, Liu et al. [170] suggested
a python-based benchmarking library.
B. Counterfactual Explanations
From the perspective of algorithmic recourse, the main pur-
pose of counterfactual explanations is to suggest constructive
interventions to the input of a deep neural network so that
the output changes to the advantage of an end user. In simple
terms, a minimal change to the feature vector that will flip
the classification outcome is computed and provided as an
explanation. By emphasizing both the feature importance and
the distribution of the distance to closest record (DCR) can
the recommendation aspect, counterfactual explanation meth-
be calculated and compared to the respective distances on
ods can be further divided into three different groups: works
the test set [142]. This measure is important to assess the
that assume that all features can be independently manipulated
extent of sample memorization. Overall, we conclude that
[171] and works that focus on manifold constraints to capture
a single measure is not sufficient to assess the generative
feature dependencies.
quality. For instance, a generative model that memorizes the
In the class of independence-based methods, where the input
original samples will score well in the machine learning
features of the predictive model are assumed to be indepen-
efficiency metric but fail the DCR check. Therefore, we highly
dent, some approaches use combinatorial solvers to generate
recommend using several evaluation measures that focus on
recourse in the presence of feasibility constraints [172], [173],
individual aspects of data quality.
[174], [175]. Another line of research deploys gradient-based
optimization to find low-cost counterfactual explanations in the
VI. E XPLANATION M ECHANISMS FOR D EEP
presence of feasibility and diversity constraints [176], [177].
L EARNING W ITH TABULAR DATA
The main problem with these approaches is that they abstract
Explainable machine learning is concerned with the prob- from input correlations.
lem of providing explanations for complex machine learn- To alleviate this problem and to suggest realistic-looking
ing models. With stricter regulations for automated decision- counterfactuals, researchers have suggested building recourse
making [41] and the adoption of machine learning models suggestions on generative models [178], [179], [180], [181],
in high-stakes domains such as finance and healthcare [45], [182]. The main idea is to change the geometry of the
[153], [154], interpretability is becoming a key concern. intervention space to a lower dimensional latent space, which
Toward this goal, various streams of research follow different encodes different factors of variation while capturing input
explainability paradigms. Among these, feature attribution dependencies. To this end, these methods primarily use (tabu-
methods and counterfactual explanations are two of the popu- lar data) VAEs [140], [183]. In particular, Mahajan et al. [181]
lar forms [155], [156], [157]. Because these techniques are demonstrated how to encode various feasibility constraints
gaining importance for researchers and practitioners alike, into such models. However, an extensive comparison across
we dedicate the following to reviewing these methods. this class of methods is still missing since it is difficult to
measure how realistic the generated data are in the context of
A. Feature Highlighting Explanations algorithmic recourse.
Local input attribution techniques seek to explain the behav- More recently, a few works have suggested to develop
ior of machine learning models instance by instance. Those counterfactual explanations that are robust to model shifts
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
VII. E XPERIMENTS
Although several experimental studies have been pub-
home equity. The task consists of using the information about
lished in recent years [8], [10], an exhaustive comparison
the applicant in their credit report to predict whether they will
between existing deep learning approaches for heterogeneous
repay their HELOC account within a two-year period.
tabular data is still missing in the literature. For example,
We further use the Adult Income dataset [54], which is
important aspects of deep learning models, such as training
among the most popular tabular datasets used in the surveyed
and inference time, model size, and interpretability, are not
work (five usages). It includes basic information about indi-
discussed.
viduals such as age, gender, and education. The target variable
To fill this gap, we present an extensive empirical com-
is binary; it represents high and low income.
parison of machine and deep learning methods on real-world
The largest tabular dataset in our study is HIGGS, which
datasets with varying characteristics in this section. We discuss
stems from particle physics. The task is to distinguish between
the dataset choice (VII-A), the results (VII-B), and present
signals with Higgs bosons (HIGGS) and a background
a comparison of the training and inference time for all the
process [192]. Monte Carlo simulations [193] were used to
machine learning models considered in this survey (VII-C).
produce the data. In the first 21 columns (columns 2-22), the
We also discuss the size of deep learning models. Finally,
particle detectors in the accelerator measure kinematic proper-
to the best of our knowledge, we present the first comparison
ties. In the last seven columns, these properties are analyzed.
of explainable deep learning methods for tabular data (VII-
In total, HIGGS includes 11 million rows. We also binarize the
D). We release the full source code of our experiments for
21st variable into a categorical variable with three groups since
maximum transparency.1
DeepFM, DeepGBM, TabTransformer, and SAINT models
require at least one categorical attribute, to benchmark the
A. Datasets method’s special functionality on large datasets.
In computer vision, there are many established datasets The Covertype dataset [54] is multiclassification dataset,
for the evaluation of new deep learning architectures such as which holds cartographic information about land cells (e.g.,
MNIST [98], CIFAR [189], and ImageNet [190]. On the con- elevation and slope). The goal is to predict which one out of
trary, there are no established standard heterogeneous datasets. seven forest cover types is present in the cell.
Carefully checking the works listed in Section IV, we iden- Finally, we utilize the California Housing dataset [194],
tified over 100 different datasets with different characteristics which contains information about a number of properties. The
in their respective experimental evaluation sections. We note prediction task (regression) is to estimate the price of the
that the small overlap between the mentioned works makes corresponding home.
it hard to compare the results across these works in general. The fundamental characteristics of the selected datasets are
Therefore, in this work, we deliberately select datasets cov- summarized in Table IV.
ering the entire range of characteristics, such as data domain
(e.g., finance, e-commerce, geography, and physics), different B. Open Performance Benchmark on Tabular Data
types of target variables (classification and regression), varying
1) Hyperparameter Selection: In order to do a fair eval-
number of categorical variables and continuous variables, and
uation, we use the Optuna library [199] with 100 iterations
differing sample sizes (small to large). Furthermore, most
for each model to tune hyperparameters. Each hyperparameter
of the selected datasets were previously featured in multiple
configuration was cross-validated with five folds. The hyper-
studies.
parameter ranges used are publicly available online along with
The first dataset of our study is the Home Equity Line of
our code. We laid out the search space based on the informa-
Credit (HELOC) dataset provided by FICO [191]. This dataset
tion given in the corresponding papers and recommendations
consists of anonymized information from real homeowners
from the framework’s authors.
who applied for home equity lines of credit. An HELOC is a
2) Data Preprocessing: We prepossessed the data in the
line of credit typically offered by a bank as a percentage of
same way for every machine learning model by applying zero-
1 Open benchmarking on tabular data for machine learning models: mean, unit-variance normalization to the numerical features
https://github.com/kathrinse/TabSurvey. and an ordinal encoding to the categorical ones using the
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
TABLE V
O PEN P ERFORMANCE B ENCHMARK R ESULTS BASED ON (S TRATIFIED ) F IVEFOLD C ROSS VALIDATION . W E U SE THE S AME F OLD S PLITTING S TRATEGY
FOR E VERY D ATASET. T HE T OP R ESULTS FOR E ACH D ATASET A RE IN B OLD , W E A LSO U NDERLINE THE S ECOND -B EST R ESULTS . T HE
M EAN AND S TANDARD D EVIATION VALUES A RE R EPORTED FOR E ACH BASELINE M ODEL . M ISSING R ESULTS I NDICATE T HAT THE
C ORRESPONDING M ODEL C OULD N OT B E A PPLIED TO THE TASK T YPE (R EGRESSION OR M ULTICLASS C LASSIFICATION )
alphabetical order. According to Hancock and Khoshgof- learning approaches. This suggests that for very large tabu-
taar [47], the chosen encoding strategy shows comparable lar datasets with predominantly continuous features, modern
performance to more advanced methods. The missing values neural network architectures may have an advantage over
were substituted with zeros for the linear regression and classical approaches after all. In general, however, our results
models based on pure neural networks since these methods are consistent with the inferior performance of deep learning
cannot accept them otherwise. We explicitly specify categor- techniques in comparison to approaches based on decision tree
ical features for LightGBM, DeepFM, DeepGBM, TabNet, ensembles (such as GBDT) on tabular data that were observed
TabTransformer, and SAINT since these approaches provide in various Kaggle competitions [201].
special functionality dedicated to categorical values, e.g., Considering only deep learning approaches, we observe that
learning an embedding of the categories. As we noted in SAINT provided competitive results across datasets. However,
Section III-C, the results of experiments may be affected by for the other models, the performance was highly dependent on
the data preprocessing. the chosen dataset. DeepFM performed best (among the deep
3) Reproducibility and Extensibility: For maximum repro- learning models) on the Adult dataset and second-best on the
ducibility, we run all experiments in a docker container [200]. California Housing dataset, but returned only weak results on
We underline again that our full code is publicly released so the HELOC dataset.
that the experiments can be replicated. The mentioned datasets
are also publicly available and can be used as a benchmark
C. Run Time Comparison
for novel methods. We would highly welcome contributed
implementations of additional methods from the data science We also analyze the training and inference time of
community. the models in comparison to their performance. We plot
4) Results: The results of our experiments are shown in the time–performance characteristic for the models in
Table V. They draw a different picture than many recent Figs. 3 and 4 for the Adult and the HIGGS dataset, respec-
research papers may suggest: for all but the very large HIGGS tively. While the training time of gradient boosting-based
dataset, the best scores are still obtained by boosted decision models is lower than that of most deep neural network-based
tree ensembles. XGBoost and CatBoost outperform all deep methods, their inference time on the HIGGS dataset with
learning-based approaches on the small and medium datasets, 11 million samples is significantly higher: for XGBoost, the
the regression dataset, and the multiclass dataset. For the inference time amounts to 5995 s, whereas inference times
large-scale HIGGS, SAINT outperforms the classical machine for MLP and SAINT are 10.18 and 282 s, respectively. All
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
Fig. 3. Train (left) and inference (right) time benchmarks for selected methods on the Adult dataset with 32.561 samples. The circle size reflects the accuracy
standard deviation.
Fig. 4. Train (left) and inference (right) time benchmarks for selected methods on the HIGGS dataset with 11 million samples. The circle size reflects the
accuracy standard deviation.
TABLE VI
S PEARMAN R ANK C ORRELATION OF THE P ROVIDED ATTRIBUTION W ITH
K ERNEL SHAP VALUES AS G ROUND T RUTH . R ESULTS W ERE
C OMPUTED ON 750 R ANDOM S AMPLES
F ROM THE A DULT D ATASET
Therefore, some deep learning solutions transform them into a homogeneous representation more suitable to neural networks. While the additional overhead is small, such transforms can boost performance considerably and should thus be among the first strategies applied in real-world scenarios.

4) Architectures for Deep Learning on Tabular Data: In terms of architecture, there has been a clear trend toward transformer-based solutions (Section IV-B2) in recent years. These approaches offer multiple advantages over standard neural network architectures, for instance, learning with attention over both categorical and numerical features. Moreover, self-supervised or unsupervised pretraining that leverages unlabeled tabular data to train parts of the deep learning model is gaining popularity, not only among transformer-based approaches. In terms of performance, multiple independent evaluations demonstrate that deep neural network methods from the hybrid (Section IV-B1) and transformer-based (Section IV-B2) groups exhibit superior predictive performance compared to plain deep neural networks on various datasets [9], [48], [62], [84]. This underlines the importance of special-purpose architectures for tabular data.

5) Deep Generative Models for Tabular Data: Powerful tabular data generation is essential for the development of high-quality models, particularly in a privacy context. With suitable data generators at hand, developers can use large, synthetic, and yet realistic datasets to develop better models, while not being subject to privacy concerns [145]. Unfortunately, the generation task is as hard as inference in predictive models, so progress in both areas will likely go hand in hand.

6) Interpretable Deep Learning Models for Tabular Data: Interpretability is undoubtedly desirable, particularly for tabular data models frequently applied to personal data, e.g., in healthcare and finance. An increasing number of approaches offer it out-of-the-box, but most current deep neural network models are still mainly concerned with the optimization of a chosen error metric. Therefore, extending existing open-source libraries (see [157], [170]) aimed at interpreting black-box models helps advance the field. Moreover, interpretable deep tabular learning is essential for understanding model decisions and results, especially for life-critical applications. However, much of the state-of-the-art recourse literature does not offer easy support of heterogeneous tabular data and lacks metrics to evaluate the quality of heterogeneous data recourse. Finally, model explanations can be used to identify and mitigate potential unwanted biases and eliminate unfair discrimination [204], [205].

7) Learning From Evolving Data Streams: Many modern applications are subject to continuously evolving data streams, e.g., social media, online retail, or healthcare. Streaming data are usually heterogeneous and potentially unlimited. Therefore, observations must be processed in a single pass and cannot be stored. Indeed, online learning models can only access a fraction of the data at each time step. Furthermore, they have to deal with limited resources and shifting data distributions (i.e., concept drift) [206]. Hence, hyperparameter optimization and model selection, as typically involved in deep learning, are usually not feasible in a data stream. For this reason, despite the success of deep learning in other domains, less complex methods, such as incremental decision trees [207], [208], are often preferred in online learning applications.

B. Open Research Questions

Several open problems need to be addressed in future research. In this section, we list those we deem fundamental to the domain.

1) Information-Theoretic Analysis of Encodings: Encoding methods are highly popular when dealing with tabular data. However, the majority of data preprocessing approaches for deep neural networks are lossy in terms of information content. Therefore, it is challenging to achieve an efficient, almost lossless transformation of heterogeneous tabular data into homogeneous data. Nevertheless, the information-theoretic view on these transformations remains to be investigated in detail and could shed light on the underlying mechanisms.

2) Computational Efficiency in Hybrid Models: The work by Shwartz-Ziv and Armon [8] suggests that the combination of a GBDT and deep neural networks may improve the predictive performance of a machine learning system. However, it also leads to growing complexity. Training or inference times, which far exceed those of classical machine learning approaches, are a recurring problem when developing hybrid models. We conclude that the integration of state-of-the-art approaches from classical machine learning and deep learning has not been conclusively resolved yet, and future work should investigate how to mitigate the tradeoff between predictive performance and computational complexity.

3) Individual Regularizations: We applaud recent research on individual regularization methods, in which we see a promising direction to tackle the problem of highly sensitive features. We believe that accounting for the dominant influence of certain features is crucial to success. Whether context- and architecture-specific regularizations for tabular data can be found remains an open question. In addition, it is relevant to explore the theoretical constraints that govern the success of regularization on tabular data more profoundly.

4) Novel Processes for Tabular Data Generation: For tabular data generation, modified GANs and VAEs are prevalent. However, the modeling of dependencies and categorical distributions remains the key challenge. Novel architectures in this area, such as diffusion models, have not been adapted to the domain of tabular data. Furthermore, the definition of an entirely new generative process particularly focused on tabular data might be worth investigating.

5) Interpretability: Going forward, counterfactual explanations for deep tabular learning can be used to improve the perceived fairness in human–artificial intelligence (AI) interaction scenarios and to enable personalized decision-making [188]. However, the heterogeneity of tabular data poses problems for counterfactual explanation methods to be reliably deployed in practice. The problem of efficiently handling heterogeneous tabular data in the presence of feasibility constraints remains unsolved [157].

6) Transfer of Deep Learning Methods to Data Streams: Recent work shows that some of the limitations of neural networks in an evolving data stream can be overcome [25], [209]. Conversely, changes in the parameters of a neural
network may be effectively used to weigh the importance of input features over time [210] or to detect concept drift [211]. Accordingly, we argue that deep learning for streaming data, in particular strategies for dealing with evolving and heterogeneous tabular data, should receive more attention in the future.

7) Transfer Learning for Tabular Data: Reusing knowledge gained from solving one problem and applying it to a different task is the research problem addressed by transfer learning. While transfer learning is successfully used in computer vision and natural language processing applications [212], there are no efficient and generally accepted ways to do transfer learning for tabular data. Hence, a general research question is how to share knowledge between multiple (related) tabular datasets efficiently.

8) Data Augmentation for Tabular Data: Data augmentation has proven highly effective at preventing overfitting, especially in computer vision [213]. While some data augmentation techniques for tabular data exist, e.g., SMOTE-NC [214], simple models fail to capture the dependency structure of the data. Therefore, generating additional samples in a continuous latent space is a promising direction. This was investigated by Darabi and Elor [37] for minority oversampling. Nevertheless, the reported improvements are only marginal. Thus, future work is required to find simple, yet effective random transformations to enhance tabular training sets.

9) Self-Supervised Learning: Large-scale labeled data are usually required to train deep neural networks; however, data labeling is an expensive task. To avoid this expensive step, self-supervised methods propose to learn general feature representations from available unlabeled data. These methods have shown impressive results in computer vision and natural language processing [215], [216]. Only a few recent works in this direction [79], [80], [217] deal with heterogeneous data. Hence, novel self-supervised learning approaches dedicated to tabular data might be worth investigating.

IX. CONCLUSION

This survey is the first work to systematically explore deep neural network approaches for heterogeneous tabular data. In this context, we highlighted the main challenges and research advances in modeling, generating, and explaining tabular data. We introduced a unified taxonomy that categorizes deep learning approaches for tabular data into three branches: data transformation methods, specialized architectures, and regularization models. We believe that our taxonomy will help catalog future research and better understand and address the remaining challenges in applying deep learning to tabular data. We hope that it will help researchers and practitioners to find the most appropriate strategies and methods for their applications.

In addition, we conducted an unbiased evaluation of state-of-the-art deep learning approaches on multiple real-world datasets. Deep neural network-based methods for heterogeneous tabular data are still inferior to machine learning methods based on decision tree ensembles for small- and medium-sized datasets (less than ∼1M samples). Only on the very large dataset consisting mainly of continuous numerical variables did the deep learning model SAINT outperform these classical approaches. Furthermore, we assessed the explanation properties of deep learning models with the self-attention mechanism. Although the TabNet model shows promising explanatory capabilities, inconsistencies between the explanations remain an open issue.

Due to the importance of tabular data to industry and academia, new ideas in this area are in high demand and can have a significant impact. With this review, we hope to provide interested readers with the references and insights they need to address open challenges and effectively advance the field.

REFERENCES

[1] J. Schmidhuber, "Deep learning in neural networks: An overview," Neural Netw., vol. 61, pp. 85–117, May 2015.
[2] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA, USA: MIT Press, 2016.
[3] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[4] K. Greff, R. K. Srivastava, J. Koutnìk, B. R. Steunebrink, and J. Schmidhuber, "LSTM: A search space Odyssey," IEEE Trans. Neural Netw. Learn. Syst., vol. 28, no. 10, pp. 2222–2232, Oct. 2017.
[5] A. Vaswani et al., "Attention is all you need," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 5998–6008.
[6] S. O. Arik and T. Pfister, "TabNet: Attentive interpretable tabular learning," 2019, arXiv:1908.07442.
[7] S. Popov, S. Morozov, and A. Babenko, "Neural oblivious decision ensembles for deep learning on tabular data," 2019, arXiv:1909.06312.
[8] R. Shwartz-Ziv and A. Armon, "Tabular data: Deep learning is not all you need," 2021, arXiv:2106.03253.
[9] G. Somepalli, M. Goldblum, A. Schwarzschild, C. B. Bruss, and T. Goldstein, "SAINT: Improved neural networks for tabular data via row attention and contrastive pre-training," 2021, arXiv:2106.01342.
[10] A. Kadra, M. Lindauer, F. Hutter, and J. Grabocka, "Well-tuned simple nets excel on tabular datasets," in Proc. Adv. Neural Inf. Process. Syst., 2021, pp. 1–14.
[11] D. Ulmer, L. Meijerink, and G. Cinà, "Trust issues: Uncertainty estimation does not enable reliable OOD detection on medical tabular data," in Proc. Mach. Learn. Health NeurIPS Workshop, 2020, pp. 341–354.
[12] S. Somani et al., "Deep learning and the electrocardiogram: Review of the current state-of-the-art," EP Europace, vol. 23, no. 8, pp. 1179–1191, Aug. 2021.
[13] V. Borisov, E. Kasneci, and G. Kasneci, "Robust cognitive load detection from wrist-band sensors," Comput. Hum. Behav. Rep., vol. 4, Aug. 2021, Art. no. 100116.
[14] J. M. Clements, D. Xu, N. Yousefi, and D. Efimov, "Sequential deep learning for credit risk monitoring with tabular financial data," 2020, arXiv:2012.15330.
[15] H. Guo, R. Tang, Y. Ye, Z. Li, and X. He, "DeepFM: A factorization-machine based neural network for CTR prediction," 2017, arXiv:1703.04247.
[16] Z. Shuai, L. Yao, A. Sun, and T. Yi, "Deep learning based recommender system: A survey and new perspectives," ACM Comput. Surv., vol. 52, no. 1, pp. 1–38, 2017.
[17] Q. Zhang, L. Cao, C. Shi, and Z. Niu, "Neural time-aware sequential recommendation by jointly modeling preference dynamics and explicit feature couplings," IEEE Trans. Neural Netw. Learn. Syst., vol. 33, no. 10, pp. 5125–5137, Oct. 2022.
[18] M. Ahmed, H. Afzal, A. Majeed, and B. Khan, "A survey of evolution in predictive models and impacting factors in customer churn," Adv. Data Sci. Adapt. Anal., vol. 9, no. 3, Jul. 2017, Art. no. 1750007.
[19] A. L. Buczak and E. Guven, "A survey of data mining and machine learning methods for cyber security intrusion detection," IEEE Commun. Surveys Tuts., vol. 18, no. 2, pp. 1153–1176, 2nd Quart., 2016.
[20] F. Cartella, O. Anunciação, Y. Funabiki, D. Yamaguchi, T. Akishita, and O. Elshocht, "Adversarial attacks for tabular data: Application to fraud detection and imbalanced data," in Proc. CEUR Workshop, vol. 2808, 2021, pp. 1–9.
[21] C. J. Urban and K. M. Gates, "Deep learning: A primer for psychologists," Psychol. Methods, vol. 26, no. 6, pp. 743–773, 2021.
[22] G. Pang, C. Aggarwal, C. Shen, and N. Sebe, "Editorial deep learning for anomaly detection," IEEE Trans. Neural Netw. Learn. Syst., vol. 33, no. 6, pp. 2282–2286, Jun. 2022.
[23] S. Wang et al., “Multiview deep anomaly detection: A systematic [50] L. Katzir, G. Elidan, and R. El-Yaniv, “Net-DNF: Effective deep
exploration,” IEEE Trans. Neural Netw. Learn. Syst., early access, modeling of tabular data,” in Proc. Int. Conf. Learn. Represent., 2021,
Jun. 26, 2022, doi: 10.1109/TNNLS.2022.3184723. pp. 1–16.
[24] V. Škvára, J. Francå, M. Zorek, T. Pevnỳ, and V. Šmídl, “Comparison of [51] R. U. David and M. Lane, Introduction to Statistics. 2003. [Online].
anomaly detectors: Context matters,” IEEE Trans. Neural Netw. Learn. Available: http://onlinestatbook.com/
Syst., vol. 33, no. 6, pp. 2494–2507, Jun. 2022. [52] M. Ryan, Deep Learning With Structured Data. New York, NY, USA:
[25] D. Sahoo, Q. Pham, J. Lu, and S. C. H. Hoi, “Online deep learning: Simon & Schuster, 2020.
Learning deep neural networks on the fly,” 2017, arXiv:1711.03705. [53] M. W. Cvitkovic et al., “Deep learning in unconventional domains,”
[26] X. He, K. Zhao, and X. Chu, “AutoML: A survey of the state-of-the- Ph.D. dissertation, California Inst. Technol., Pasadena, CA, USA,
art,” Knowl.-Based Syst., vol. 212, Jan. 2021, Art. no. 106622. 2020.
[27] P. Yin, G. Neubig, W.-T. Yih, and S. Riedel, “TaBERT: Pretrain- [54] D. Dua and C. Graff. (2017). UCI Machine Learning Repository.
ing for joint understanding of textual and tabular data,” 2020, [Online]. Available: http://archive.ics.uci.edu/ml
arXiv:2005.08314. [55] A. J. Miles, “The sunstroke epidemic of Cincinnati, Ohio, during
[28] Z. Wang, Q. She, and T. E. Ward, “Generative adversarial networks in the summer of 1881,” Public Health Papers Rep., vol. 7, no. 1,
computer vision: A survey and taxonomy,” 2019, arXiv:1906.01529. pp. 293–304, 1881.
[29] D. Lichtenwalter, P. Burggräf, J. Wagner, and T. Weißer, “Deep [56] R. A. Fisher, “The use of multiple measurements in taxonomic prob-
multimodal learning for manufacturing problem solving,” Proc. CIRP, lems,” Ann. Eugenics, vol. 7, no. 2, pp. 179–188, Aug. 1936.
vol. 99, pp. 615–620, 2021. [57] D. A. Jdanov, D. Jasilionis, V. M. Shkolnikov, and M. Barbieri, “Human
[30] T. Baltrušaitis, C. Ahuja, and L.-P. Morency, “Multimodal machine mortality database,” in Encyclopedia Gerontology Population Aging,
learning: A survey and taxonomy,” IEEE Trans. Pattern Anal. Mach. D. Gu and M. E. Dupre, Eds. Cham, Switzerland: Springer, 2020.
Intell., vol. 41, no. 2, pp. 423–443, Feb. 2019. [58] E. Fix, Discriminatory Analysis: Nonparametric Discrimination, Con-
[31] D. Medvedev and A. D’yakonov, “New properties of the data distilla- sistency Properties. Wright-Patterson AFB, OH, USA: USAF school
tion method when working with tabular data,” 2020, arXiv:2010.09839. of Aviation Medicine, 1951.
[32] J. Li, Y. Li, X. Xiang, S.-T. Xia, S. Dong, and Y. Cai, “TNT: An [59] C. L. Giles, C. B. Miller, D. Chen, H. H. Chen, G. Z. Sun, and
interpretable tree-network-tree learning framework using knowledge Y. C. Lee, “Learning and extracting finite state automata with second-
distillation,” Entropy, vol. 22, no. 11, p. 1203, Oct. 2020. order recurrent neural networks,” Neural Comput., vol. 4, no. 3,
[33] D. Roschewitz, M.-A. Hartley, L. Corinzia, and M. Jaggi, “IFedAvg: pp. 393–405, May 1992.
Interpretable data-interoperability for federated learning,” 2021, [60] L. Willenborg and T. De Waal, Statistical Disclosure Control in
arXiv:2107.06580. Practice, vol. 111. New York, NY, USA: Springer, 1996.
[34] A. Sánchez-Morales, J.-L. Sancho-Gómez, J.-A. Martínez-García, and [61] M. Richardson, E. Dominowska, and R. Ragno, “Predicting clicks:
A. R. Figueiras-Vidal, “Improving deep learning performance with Estimating the click-through rate for new ads,” in Proc. 16th Int. Conf.
missing values via deletion and compensation,” Neural Comput. Appl., World Wide Web (WWW), 2007, pp. 521–530.
vol. 32, no. 17, pp. 13233–13244, Sep. 2020. [62] G. Ke, Z. Xu, J. Zhang, J. Bian, and T.-Y. Liu, “DeepGBM: A deep
learning framework distilled by GBDT for online prediction tasks,” in
[35] M. Abroshan, K. H. Yip, C. Tekin, and M. Van Der Schaar, “Con-
Proc. 25th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining,
servative policy construction using variational autoencoders for logged
Jul. 2019, pp. 384–394.
data with missing values,” IEEE Trans. Neural Netw. Learn. Syst., early
access, Jan. 10, 2022, doi: 10.1109/TNNLS.2021.3136385. [63] I. Shavitt and E. Segal, “Regularization learning networks: Deep
learning for tabular datasets,” in Proc. Adv. Neural Inf. Process. Syst.,
[36] J. Engelmann and S. Lessmann, “Conditional Wasserstein GAN-based
2018, pp. 1379–1389.
oversampling of tabular data for imbalanced learning,” Expert Syst.
[64] T. B. Brown et al., “Language models are few-shot learners,” 2020,
Appl., vol. 174, Jul. 2021, Art. no. 114582.
arXiv:2005.14165.
[37] S. Darabi and Y. Elor, “Synthesising multi-modal minority samples for
[65] A. Dosovitskiy et al., “An image is worth 16 × 16 words: Transformers
tabular data,” 2021, arXiv:2105.08204.
for image recognition at scale,” in Proc. Int. Conf. Learn. Represent.,
[38] S. Kamthe, S. Assefa, and M. Deisenroth, “Copula flows for synthetic 2021, pp. 1–11.
data generation,” 2021, arXiv:2101.00598. [66] S. Khan, M. Naseer, M. Hayat, S. Waqas Zamir, F. Shahbaz Khan, and
[39] E. Choi, S. Biswal, B. Malin, J. Duke, W. F. Stewart, and J. Sun, M. Shah, “Transformers in vision: A survey,” 2021, arXiv:2101.01169.
“Generating multi-label discrete patient records using generative adver- [67] G. Pang, C. Shen, L. Cao, and A. V. D. Hengel, “Deep learning for
sarial networks,” in Proc. 2nd Mach. Learn. Healthcare Conf., 2017, anomaly detection: A review,” ACM Comput. Surv., vol. 54, no. 2,
pp. 286–305. pp. 1–38, Mar. 2021.
[40] State of California, Department of Justice. (2018). California Consumer [68] A. F. Karr, A. P. Sanil, and D. L. Banks, “Data quality: A statistical
Privacy Act (CCPA). Accessed: Dec. 20, 2022. [Online]. Available: perspective,” Stat. Methodol., vol. 3, no. 2, pp. 137–173, 2006.
https://oag.ca.gov/privacy/ccpa [69] L. Xu and K. Veeramachaneni, “Synthesizing tabular data using gen-
[41] GDPR. (2016). Regulation (EU) 2016/679 of the European Parliament erative adversarial networks,” 2018, arXiv:1811.11264.
and of the Council. Official Journal of the European Union. [Online]. [70] G. Ke et al., “LightGBM: A highly efficient gradient boosting
Available: http://www.privacyregulation.eu/en/13.htm decision tree,” in Proc. Adv. Neural Inf. Process. Syst., 2017,
[42] P. Voigt and A. Von Dem Bussche, “The EU general data protection pp. 3146–3154.
regulation (GDPR),” in A Practical Guide, vol. 10, 1st ed. Cham, [71] L. Prokhorenkova, G. Gusev, A. Vorobev, A. V. Dorogush, and
Switzerland: Springer, 2017, Art. no. 3152676. A. Gulin, “CatBoost: Unbiased boosting with categorical features,” in
[43] M. Sahakyan, Z. Aung, and T. Rahwan, “Explainable artificial Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 6638–6648.
intelligence for tabular data: A survey,” IEEE Access, vol. 9, [72] Y. Zhu et al., “Converting tabular data into images for deep learning
pp. 135392–135422, 2021. with convolutional neural networks,” Sci. Rep., vol. 11, no. 1, pp. 1–11,
[44] B. I. Grisci, M. J. Krause, and M. Dorn, “Relevance aggregation for May 2021.
neural networks interpretability and knowledge discovery on tabular [73] N. Rahaman et al., “On the spectral bias of neural networks,” in Proc.
data,” Inf. Sci., vol. 559, pp. 111–129, Jun. 2021. Int. Conf. Mach. Learn., 2019, pp. 5301–5310.
[45] U. Bhatt et al., “Explainable machine learning in deployment,” in Proc. [74] B. R. Mitchell et al., “The spatial inductive bias of deep learning,”
Conf. Fairness, Accountability, Transparency, Jan. 2020, pp. 648–657. Ph.D. dissertation, Johns Hopkins Univ., Baltimore, MD, USA, 2017.
[46] T. Chen and C. Guestrin, “XGBoost: A scalable tree boosting system,” [75] Y. Gorishniy, I. Rubachev, and A. Babenko, “On embeddings for
in Proc. 22nd ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, numerical features in tabular deep learning,” 2022, arXiv:2203.05556.
Aug. 2016, pp. 785–794. [76] E. Fitkov-Norris, S. Vahid, and C. Hand, “Evaluating the impact of
[47] J. T. Hancock and T. M. Khoshgoftaar, “Survey on categorical data for categorical data encoding and scaling on neural network classification
neural networks,” J. Big Data, vol. 7, no. 1, pp. 1–41, Dec. 2020. performance: The case of repeat consumption of identical cultural
[48] Y. Gorishniy, I. Rubachev, V. Khrulkov, and A. Babenko, “Revisiting goods,” in Proc. Int. Conf. Eng. Appl. Neural Netw. Cham, Switzerland:
deep learning models for tabular data,” 2021, arXiv:2106.11959. Springer, 2012, pp. 343–352.
[49] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for [77] D. Baylor et al., “TFX: A TensorFlow-based production-scale machine
image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. learning platform,” in Proc. 23rd ACM SIGKDD Int. Conf. Knowl.
(CVPR), Jun. 2016, pp. 770–778. Discovery Data Mining, 2017, pp. 1387–1395.
[78] B. Sun et al., “SuperTML: Two-dimensional word embedding for [103] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and S. Y. Philip,
the precognition on structured tabular data,” in Proc. IEEE/CVF “A comprehensive survey on graph neural networks,” IEEE Trans.
Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2019, Neural Netw. Learn. Syst., vol. 32, no. 1, pp. 4–24, Mar. 2020.
pp. 1–9. [104] C. Wang, M. Li, and A. J. Smola, “Language models with transform-
[79] J. Yoon, Y. Zhang, J. Jordon, and M. van der Schaar, “VIME: ers,” 2019, arXiv:1904.09408.
Extending the success of self- and semi-supervised learning to tabular [105] A. F. T. Martins and R. Fernandez Astudillo, “From softmax to
domaindim,” in Proc. Adv. Neural Inf. Process. Syst., vol. 33, 2020, sparsemax: A sparse model of attention and multi-label classification,”
pp. 1–11. 2016, arXiv:1602.02068.
[80] D. Bahri, H. Jiang, Y. Tay, and D. Metzler, “SCARF: Self- [106] G. Van Rossum and F. L. Drake, Jr., Python Reference Manual.
supervised contrastive learning using random feature corruption,” 2021, Amsterdam, The Netherlands: Centrum voor Wiskunde en Informatica,
arXiv:2106.15147. 1995.
[81] H.-T. Cheng et al., “Wide & deep learning for recommender sys- [107] M. Joseph, “PyTorch tabular: A framework for deep learning with
tems,” in Proc. 1st Workshop Deep Learn. Recommender Syst., 2016, tabular data,” 2021, arXiv:2104.13638.
pp. 7–10. [108] S. Boughorbel, F. Jarray, and A. Kadri, “Fairness in TabNet model
[82] N. Frosst and G. Hinton, “Distilling a neural network into a soft by disentangled representation for the prediction of hospital no-show,”
decision tree,” 2017, arXiv:1711.09784. 2021, arXiv:2103.04048.
[83] J. Lian, X. Zhou, F. Zhang, Z. Chen, X. Xie, and G. Sun, “XDeepFM: [109] N. Mehrabi, F. Morstatter, N. Saxena, K. Lerman, and A. Galstyan,
Combining explicit and implicit feature interactions for recommender “A survey on bias and fairness in machine learning,” ACM Comput.
systems,” in Proc. 24th ACM SIGKDD Int. Conf. Knowl. Discovery Surv., vol. 54, no. 6, pp. 1–35, Jul. 2021.
Data Mining, Jul. 2018, pp. 1754–1763. [110] S. Yun, D. Han, S. Chun, S. J. Oh, Y. Yoo, and J. Choe, “CutMix:
[84] G. Ke, J. Zhang, Z. Xu, J. Bian, and T.-Y. Liu. (2018). TabNN: Regularization strategy to train strong classifiers with localizable fea-
A Universal Neural Network Solution for Tabular Data. [Online]. tures,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019,
Available: https://openreview.net/forum?id=r1eJssCqY7 pp. 6023–6032.
[85] R. Agarwal et al., “Neural additive models: Interpretable machine [111] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “Mixup:
learning with neural nets,” 2020, arXiv:2004.13912. Beyond empirical risk minimization,” 2017, arXiv:1710.09412.
[86] Y. Luo, H. Zhou, W.-W. Tu, Y. Chen, W. Dai, and Q. Yang, “Network [112] V. Borisov, J. Haug, and G. Kasneci, “CancelOut: A layer for feature
on network for tabular data classification in real-world applications,” selection in deep neural networks,” in Proc. Int. Conf. Artif. Neural
in Proc. 43rd Int. ACM SIGIR Conf. Res. Develop. Inf. Retr., Jul. 2020, Netw. Cham, Switzerland: Springer, 2019, pp. 72–83.
pp. 2317–2326. [113] G. Valdes, W. Arbelo, Y. Interian, and J. H. Friedman, “Lockout: Sparse
[87] Z. Liu, Q. Liu, H. Zhang, and Y. Chen, “DNN2LR: Interpretation- regularization of neural networks,” 2021, arXiv:2107.07160.
inspired feature crossing for real-world tabular data,” 2020, [114] J. Fiedler, “Simple modifications to improve tabular neural networks,”
arXiv:2008.09775. 2021, arXiv:2108.03214.
[88] S. Ivanov and L. Prokhorenkova, “Boost then Convolve: Gradient [115] K. Lounici, K. Meziani, and B. Riu, “Muddling label regularization:
boosting meets graph neural networks,” in Proc. Int. Conf. Learn. Deep learning for tabular datasets,” 2021, arXiv:2106.04462.
Represent., 2021, pp. 1–16. [116] S. Lundberg and S.-I. Lee, “A unified approach to interpreting model
[89] H. Luo, F. Cheng, H. Yu, and Y. Yi, “SDTR: Soft decision tree regressor predictions,” in Proc. NeurIPS, 2017, pp. 1–10.
for tabular data,” IEEE Access, vol. 9, pp. 55999–56011, 2021. [117] H. Chen, S. Jajodia, J. Liu, N. Park, V. Sokolov, and
[90] X. Huang, A. Khetan, M. Cvitkovic, and Z. Karnin, “TabTrans- V. S. Subrahmanian, “FakeTables: Using GANs to generate functional
former: Tabular data modeling using contextual embeddings,” 2020, dependency preserving tables with bounded real data,” in Proc. Twenty-
arXiv:2012.06678. Eighth Int. Joint Conf. Artif. Intell., Aug. 2019, pp. 2074–2080.
[91] S. Cai, K. Zheng, G. Chen, H. V. Jagadish, B. C. Ooi, and M. Zhang, [118] L. Gondara and K. Wang, “MIDA: Multiple imputation using denoising
“ARM-Net: Adaptive relation modeling network for structured data,” autoencoders,” in Proc. Pacific–Asia Conf. Knowl. Discovery Data
in Proc. Int. Conf. Manage. Data, Jun. 2021, pp. 207–220. Mining. Cham, Switzerland: Springer, 2018, pp. 260–272.
[119] R. D. Camino et al., “Working with deep generative models and tabular
[92] J. Kossen, N. Band, C. Lyle, A. Gomez, T. Rainforth, and Y. Gal,
data imputation,” in Proc. ICML Artemiss Workshop, 2020, pp. 1–6.
“Self-attention between datapoints: Going beyond individual input-
output pairs in deep learning,” in Proc. Adv. Neural Inf. Process. Syst., [120] M. Quintana and C. Miller, “Towards class-balancing human comfort
A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, Eds., 2021, datasets with GANs,” in Proc. 6th ACM Int. Conf. Syst. Energy-Efficient
pp. 28742–28756. Buildings, Cities, Transp., Nov. 2019, pp. 391–392.
[121] A. Koivu, M. Sairanen, A. Airola, and T. Pahikkala, “Synthetic minority
[93] Y. Yamada, O. Lindenbaum, S. Negahban, and Y. Kluger, “Feature
oversampling of vital statistics data with generative adversarial net-
selection using stochastic gates,” in Proc. Mach. Learn. Syst., 2020,
works,” J. Amer. Med. Inform. Assoc., vol. 27, no. 11, pp. 1667–1674,
pp. 8952–8963.
Nov. 2020.
[94] D. Micci-Barreca, “A preprocessing scheme for high-cardinality cat-
[122] J. Fan, J. Chen, T. Liu, Y. Shen, G. Li, and X. Du, “Relational
egorical attributes in classification and prediction problems,” ACM
data synthesis using generative adversarial networks: A design space
SIGKDD Explor. Newslett., vol. 3, no. 1, pp. 27–32, Jul. 2001.
exploration,” Proc. VLDB Endowment, vol. 13, no. 12, pp. 1962–1975,
[95] J. H. Friedman, “Stochastic gradient boosting,” Comput. Statist. Data Aug. 2020.
Anal., vol. 38, no. 4, pp. 367–378, 2002.
[123] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila,
[96] P. Langley and S. Sage, “Oblivious decision trees and abstract cases,” “Analyzing and improving the image quality of StyleGAN,” in Proc.
in Proc. Work. Notes AAAI Workshop Case-Based Reasoning. Seattle, IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020,
WA, USA, 1994, pp. 113–117. pp. 8110–8119.
[97] B. Peters, V. Niculae, and A. F. T. Martins, “Sparse Sequence-to- [124] K. Lin, D. Li, X. He, Z. Zhang, and M.-T. Sun, “Adversarial ranking
Sequence models,” 2019, arXiv:1905.05702. for language generation,” in Proc. Adv. Neural Inf. Process. Syst., 2017,
[98] Y. LeCun and C. Cortes. (2010). MNIST Handwritten Digit Database. pp. 1–11.
[Online]. Available: http://yann.lecun.com/exdb/mnist/ [125] S. Subramanian, S. Rajeswar, F. Dutil, C. Pal, and A. Courville,
[99] C. Guo and F. Berkhahn, “Entity embeddings of categorical variables,” “Adversarial generation of natural language,” in Proc. 2nd Workshop
2016, arXiv:1604.06737. Represent. Learn. NLP, 2017, pp. 241–251.
[100] S. Rendle, “Factorization machines,” in Proc. IEEE Int. Conf. Data [126] N. Patki, R. Wedge, and K. Veeramachaneni, “The synthetic data vault,”
Mining, Dec. 2010, pp. 995–1000. in Proc. IEEE Int. Conf. Data Sci. Adv. Analytics (DSAA), Oct. 2016,
[101] F. Moosmann, B. Triggs, and F. Jurie, “Fast discriminative visual pp. 399–410.
codebooks using randomized clustering forests,” in Proc. 20th Annu. [127] Z. Li, Y. Zhao, and J. Fu, “SynC: A copula based framework for
Conf. Neural Inf. Process. Syst. (NIPS). Cambridge, MA, USA: MIT generating synthetic data from aggregated sources,” in Proc. Int. Conf.
Press, 2006, pp. 985–992. Data Mining Workshops (ICDMW), Nov. 2020, pp. 571–578.
[102] X. He et al., “Practical lessons from predicting clicks on ads at [128] J. Zhang, G. Cormode, C. M. Procopiuc, D. Srivastava, and X. Xiao,
Facebook,” in Proc. 8th Int. Workshop Data Mining Online Advertising, “PrivBayes: Private data release via Bayesian networks,” ACM Trans.
2014, pp. 1–9. Database Syst., vol. 42, no. 4, pp. 1–41, Oct. 2017.
[129] C. Chow and C. Liu, “Approximating discrete probability distributions [153] E. Tjoa and C. Guan, “A survey on explainable artificial intelligence
with dependence trees,” IEEE Trans. Inf. Theory, vol. IT-14, no. 3, (XAI): Toward medical XAI,” IEEE Trans. Neural Netw. Learn. Syst.,
pp. 462–467, May 1968. vol. 32, no. 11, pp. 4793–4813, Nov. 2021.
[130] L. Xu, M. Skoularidou, A. Cuesta-Infante, and K. Veeramachaneni, [154] J. Kauffmann, M. Esders, L. Ruff, G. Montavon, W. Samek, and
“Modeling tabular data using conditional GAN,” in Proc. Adv. Neural K.-R. Müller, “From clustering to cluster explanations via neural
Inf. Process. Syst., vol. 33, 2019, pp. 1–11. networks,” IEEE Trans. Neural Netw. Learn. Syst., early access,
[131] L. V. H. Vardhan and S. Kok, “Generating privacy-preserving synthetic Jul. 7, 2022, doi: 10.1109/TNNLS.2022.3185901.
tabular data using oblivious variational autoencoders,” in Proc. Work- [155] R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, and
shop Econ. Privacy Data Labor 37th Int. Conf. Mach. Learn., 2020, D. Pedreschi, “A survey of methods for explaining black box models,”
pp. 1–8. ACM Comput. Surv., vol. 51, no. 5, pp. 1–42, Sep. 2019.
[132] M. Baak, S. Brugman, I. F. Rojas, L. Dalmeida, R. E. Urlus, and [156] K. Gade, S. C. Geyik, K. Kenthapadi, V. Mithal, and A. Taly, “Explain-
J.-B. Oger, “Synthsonic: Fast, probabilistic modeling and synthesis able AI in industry,” in Proc. 25th ACM SIGKDD Int. Conf. Knowl.
of tabular data,” in Proc. Int. Conf. Artif. Intell. Statist., 2022, Discovery Data Mining, Jul. 2019, pp. 3203–3204.
pp. 4747–4763. [157] M. Pawelczyk, S. Bielawski, J. Van Den Heuvel, T. Richter, and
[133] I. J. Goodfellow et al., “Generative adversarial networks,” 2014, G. Kasneci, “CARLA: A Python library to benchmark algorithmic
arXiv:1406.2661. recourse and counterfactual explanation algorithms,” in Proc. Adv.
[134] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation Neural Inf. Process. Syst. (NeurIPS) Benchmark Datasets Track, 2021,
learning with deep convolutional generative adversarial networks,” in pp. 1–22.
Proc. 4th Int. Conf. Learn. Represent. (ICLR), 2016, pp. 1–16. [158] Y. Lou, R. Caruana, and J. Gehrke, “Intelligible models for classifica-
[135] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein generative adver- tion and regression,” in Proc. 18th ACM SIGKDD Int. Conf. Knowl.
sarial networks,” in Proc. Int. Conf. Mach. Learn., 2017, pp. 214–223. Discovery Data Mining (KDD), 2012, pp. 150–158.
[136] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville, [159] D. Alvarez-Melis and T. S. Jaakkola, “Towards robust interpretabil-
“Improved training of Wasserstein GANs,” in Proc. 31st Int. Conf. ity with self-explaining neural networks,” in Proc. NeurIPS, 2018,
Neural Inf. Process. Syst., 2017, pp. 5769–5779. pp. 1–10.
[137] M. G. Bellemare et al., “The Cramér distance as a solution to biased [160] D. Wang, Q. Yang, A. Abdul, and B. Y. Lim, “Designing theory-driven
Wasserstein gradients,” 2017, arXiv:1705.10743. user-centric explainable AI,” in Proc. CHI, 2019, pp. 1–15.
[138] R. D. Hjelm, A. P. Jacob, T. Che, A. Trischler, K. Cho, and Y. Bengio, [161] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, and
“Boundary-seeking generative adversarial networks,” in Proc. Int. Conf. W. Samek, “On pixel-wise explanations for non-linear classifier deci-
Learn. Represent., 2018, pp. 1–17. sions by layer-wise relevance propagation,” PLoS ONE, vol. 10, no. 7,
[139] A. Srivastava, L. Valkov, C. Russell, M. U. Gutmann, and C. Sutton, Jul. 2015, Art. no. e0130140.
“VEEGAN: Reducing mode collapse in GANs using implicit varia- [162] G. Montavon, A. Binder, S. Lapuschkin, W. Samek, and K.-R. Müller,
tional learning,” in Proc. 31st Int. Conf. Neural Inf. Process. Syst., “Layer-wise relevance propagation: An overview,” in Explainable
2017, pp. 3310–3320. AI: Interpreting, Explaining and Visualizing Deep Learning. Cham,
[140] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” in Switzerland: Springer, 2019, pp. 193–209.
Proc. 2nd Int. Conf. Learn. Represent. (ICLR), Conf. Track, 2014, [163] G. Kasneci and T. Gottron, “LICON: A linear weighting scheme for
pp. 1–14. the contribution ofInput variables in deep artificial neural networks,” in
[141] C. Ma, S. Tschiatschek, R. Turner, J. M. Hernández-Lobato, and Proc. 25th ACM Int. Conf. Inf. Knowl. Manage., Oct. 2016, pp. 45–54.
C. Zhang, “VAEM: A deep generative model for heterogeneous mixed [164] M. Sundararajan, A. Taly, and Q. Yan, “Axiomatic attribution for deep
type data,” in Proc. Adv. Neural Inf. Process. Syst., vol. 33, 2020, networks,” in Proc. Int. Conf. Mach. Learn., 2017, pp. 3319–3328.
pp. 1–11. [165] A. Chattopadhay, A. Sarkar, P. Howlader, and V. N. Balasubramanian,
[142] N. Park, M. Mohammadi, K. Gorde, S. Jajodia, H. Park, and Y. Kim, “Grad-CAM++: Generalized gradient-based visual explanations for
“Data synthesis based on generative adversarial networks,” Proc. VLDB deep convolutional networks,” in Proc. IEEE Winter Conf. Appl.
Endowment, vol. 11, no. 10, pp. 1071–1083, Jun. 2018. Comput. Vis. (WACV), Mar. 2018, pp. 1–9.
[143] E. Jang, S. Gu, and B. Poole, “Categorical reparameterization [166] M. T. Ribeiro, S. Singh, and C. Guestrin, “‘Why should i trust
with Gumbel-Softmax,” in Proc. Int. Conf. Learn. Represent., 2017, you?’: Explaining the predictions of any classifier,” in Proc. 22nd
pp. 1–13. ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2016,
[144] B. Wen, L. O. Colon, K. P. Subbalakshmi, and R. Chandramouli, pp. 1135–1144.
“Causal-TGAN: Generating tabular data using causal generative adver- [167] M. T. Ribeiro, S. Singh, and C. Guestrin, “Anchors: High-precision
sarial networks,” 2021, arXiv:2104.10680. model-agnostic explanations,” in Proc. AAAI, 2018, pp. 1–9.
[145] J. Jordon, J. Yoon, and M. Van Der Schaar, “PATE-GAN: Generating [168] S. M. Lundberg et al., “From local explanations to global understanding
synthetic data with differential privacy guarantees,” in Proc. Int. Conf. with explainable AI for trees,” Nature Mach. Intell., vol. 2, pp. 56–67,
Learn. Represent., 2018, pp. 1–21. Jan. 2020.
[146] N. M. Jebreel, J. Domingo-Ferrer, A. Blanco-Justicia, and D. Sánchez, [169] J. Haug, S. Zürn, P. El-Jiz, and G. Kasneci, “On baselines for local
“Enhanced security and privacy via fragmented federated learning,” feature attributions,” 2021, arXiv:2101.00905.
IEEE Trans. Neural Netw. Learn. Syst., early access, Oct. 19, 2022, [170] Y. Liu, S. Khandagale, C. White, and W. Neiswanger, “Synthetic
doi: 10.1109/TNNLS.2022.3212627. benchmarks for scientific research in explainable machine learning,” in
[147] A. Mottini, A. Lheritier, and R. Acuna-Agost, “Airline passenger Proc. Adv. Neural Inf. Process. Syst. (NeurIPS) Benchmark Datasets
name record generation using generative adversarial networks,” 2018, Track, 2021, pp. 1–25.
arXiv:1807.06657. [171] S. Wachter, B. Mittelstadt, and C. Russell, “Counterfactual explanations
[148] R. Camino, C. Hammerschmidt, and R. State, “Generating multi- without opening the black box: Automated decisions and the GDPR,”
categorical samples with generative adversarial networks,” in Proc. Harvard J. Law Technol., vol. 31, no. 2, p. 841, 2018.
ICML Workshop Theor. Found. Appl. Deep Generative Models, 2018, [172] B. Ustun, A. Spangher, and Y. Liu, “Actionable recourse in linear
pp. 1–7. classification,” in Proc. Conf. Fairness, Accountability, Transparency,
[149] M. K. Baowaly, C.-C. Lin, C.-L. Liu, and K.-T. Chen, “Synthesizing Jan. 2019, pp. 10–19.
electronic health records using improved generative adversarial net- [173] C. Russell, “Efficient search for diverse coherent explanations,” in Proc.
works,” J. Amer. Med. Inform. Assoc., vol. 26, no. 3, pp. 228–241, Conf. Fairness, Accountability, Transparency, Jan. 2019, pp. 20–28.
Mar. 2019. [174] K. Rawal and H. Lakkaraju, “Beyond individualized recourse: Inter-
[150] Z. Zhao, A. Kunar, H. Van der Scheer, R. Birke, and pretable and interactive summaries of actionable recourses,” in Proc.
L. Y. Chen, “CTAB-GAN: Effective table data synthesizing,” 2021, NeurIPS, 2020, pp. 12187–12198.
arXiv:2102.08369. [175] A.-H. Karimi, G. Barthe, B. Balle, and I. Valera, “Model-agnostic
[151] V. Borisov, K. Seßler, T. Leemann, M. Pawelczyk, and G. Kasneci, counterfactual explanations for consequential decisions,” in Proc. Int.
“Language models are realistic tabular data generators,” 2022, Conf. Artif. Intell. Statist., 2020, pp. 895–905.
arXiv:2210.06280. [176] A. Dhurandhar et al., “Explanations based on the missing: Towards
[152] F. J. Massey, Jr., “The Kolmogorov-Smirnov test for goodness of fit,” contrastive explanations with pertinent negatives,” in Proc. Adv. Neural
J. Amer. Statist. Assoc., vol. 46, no. 253, pp. 68–78, 1951. Inf. Process. Syst. (NeurIPS), 2018, pp. 1–12.
[177] B. Mittelstadt, C. Russell, and S. Wachter, “Explaining explanations in [198] W. S. McCulloch and W. Pitts, “A logical calculus of the ideas
AI,” in Proc. Conf. Fairness, Accountability, Transparency, Jan. 2019, immanent in nervous activity,” Bull. Math. Biophys., vol. 5, no. 4,
pp. 279–288. pp. 115–133, 1943.
[178] M. Pawelczyk, K. Broelemann, and G. Kasneci, “Learning model- [199] T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama, “Optuna: A
agnostic counterfactual explanations for tabular data,” in Proc. Web next-generation hyperparameter optimization framework,” in Proc. 25th
Conf., Apr. 2020, pp. 3126–3132. ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Jul. 2019,
[179] M. Downs, J. L. Chu, Y. Yacoby, F. Doshi-Velez, and W. Pan, “CRUDS: pp. 1–10.
Counterfactual recourse using disentangled subspaces,” in Proc. ICML [200] D. Merkel, “Docker: Lightweight Linux containers for consistent
Workshop Hum. Interpretability Mach. Learn. (WHI), 2020, 1–23. development and deployment,” Linux J., vol. 2014, no. 239, p. 2, 2014.
[180] S. Joshi, O. Koyejo, W. Vijitbenjaronk, B. Kim, and J. Ghosh, “Towards [201] C. S. Bojer and J. P. Meldgaard, “Kaggle forecasting competitions: An
realistic individual recourse and actionable explanations in black-box overlooked learning opportunity,” Int. J. Forecasting, vol. 37, no. 2,
decision making systems,” 2019, arXiv:1907.09615. pp. 587–603, Apr. 2021.
[181] D. Mahajan, C. Tan, and A. Sharma, “Preserving causal constraints [202] Y. Rong, T. Leemann, V. Borisov, G. Kasneci, and E. Kasneci,
in counterfactual explanations for machine learning classifiers,” 2019, “A consistent and efficient evaluation strategy for attribution methods,”
arXiv:1912.03277. in Proc. Int. Conf. Mach. Learn., 2022, pp. 18770–18795.
[182] M. Pawelczyk, K. Broelemann, and G. Kasneci, “On counterfactual [203] R. Tomsett, D. Harborne, S. Chakraborty, P. Gurram, and A. Preece,
explanations under predictive multiplicity,” in Proc. Conf. Uncertainty “Sanity checks for saliency metrics,” in Proc. AAAI Conf. Artif. Intell.,
Artif. Intell. (UAI), 2020, pp. 809–818. vol. 34, no. 4, 2020, pp. 6021–6029.
[183] A. Nazábal, P. M. Olmos, Z. Ghahramani, and I. Valera, “Han- [204] E. Ntoutsi et al., “Bias in data-driven artificial intelligence systems-an
dling incomplete heterogeneous data using VAEs,” Pattern Recognit., introductory survey,” Wiley Interdiscipl. Reviews: Data Mining Knowl.
vol. 107, Nov. 2020, Art. no. 107501. Discovery, vol. 10, no. 3, p. e1356, 2020.
[184] S. Upadhyay, S. Joshi, and H. Lakkaraju, “Towards robust and reli- [205] A. Giloni et al., “BENN: Bias estimation using a deep neural network,”
able algorithmic recourse,” in Proc. Adv. Neural Inf. Process. Syst. IEEE Trans. Neural Netw. Learn. Syst., early access, May 11, 2022,
(NeurIPS), vol. 34, 2021, pp.16926–16937. doi: 10.1109/TNNLS.2022.3172365.
[185] R. Dominguez-Olmedo, A.-H. Karimi, and B. Schölkopf, “On the [206] Y. Sun, K. Tang, Z. Zhu, and X. Yao, “Concept drift adaptation by
adversarial robustness of causal algorithmic recourse,” in Proc. Int. exploiting historical knowledge,” IEEE Trans. Neural Netw. Learn.
Conf. Mach. Learn. (ICML), 2022, pp. 5324–5342. Syst., vol. 29, no. 10, pp. 4822–4832, Oct. 2018.
[186] M. Pawelczyk, T. Datta, J. Van-Den-Heuvel, G. Kasneci, and [207] P. Domingos and G. Hulten, “Mining high-speed data streams,” in Proc.
H. Lakkaraju, “Probabilistically robust recourse: Navigating the trade- 6th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining (KDD),
offs between costs and robustness in algorithmic recourse,” 2022, 2000, pp. 71–80.
arXiv:2203.06768. [208] C. Manapragada, G. I. Webb, and M. Salehi, “Extremely fast decision
[187] A.-H. Karimi, G. Barthe, B. Schölkopf, and I. Valera, “A survey tree,” in Proc. 24th ACM SIGKDD Int. Conf. Knowl. Discovery Data
of algorithmic recourse: Definitions, formulations, solutions, and Mining, Jul. 2018, pp. 1953–1962.
prospects,” 2020, arXiv:2010.04050. [209] P. Duda, M. Jaworski, A. Cader, and L. Wang, “On training deep neural
[188] S. Verma, J. Dickerson, and K. Hines, “Counterfactual explanations for networks using a streaming approach,” J. Artif. Intell. Soft Comput.
machine learning: A review,” 2020, arXiv:2010.10596. Res., vol. 10, no. 1, pp. 15–26, Jan. 2020.
[189] A. Krizhevsky, “Learning multiple layers of features from tiny images,” [210] J. Haug, M. Pawelczyk, K. Broelemann, and G. Kasneci, “Leveraging
Univ. Toronto, Toronto, ON, Canada, Tech. Rep., 2009. [Online]. Avail- model inherent variable importance for stable online feature selection,”
able: https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf in Proc. 26th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining,
[190] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, Aug. 2020, pp. 1478–1502.
“ImageNet: A large-scale hierarchical image database,” in Proc. IEEE [211] J. Haug and G. Kasneci, “Learning parameter distributions to detect
Conf. Comput. Vis. Pattern Recognit., Jun. 2009, pp. 248–255. concept drift in data streams,” in Proc. 25th Int. Conf. Pattern Recognit.
[191] FICO. (2019). Home Equity Line of Credit (HELOC) (ICPR), Jan. 2021, pp. 9452–9459.
Dataset. Accessed: Jun. 15, 2022. [Online]. Available: [212] C. Tan, F. Sun, T. Kong, W. Zhang, C. Yang, and C. Liu, “A survey on
https://community.fico.com/s/explainable-machine-learning-challenge deep transfer learning,” in Proc. Int. Conf. Artif. Neural Netw. Cham,
[192] P. Baldi, P. Sadowski, and D. Whiteson, “Searching for exotic particles Switzerland: Springer, 2018, pp. 270–279.
in high-energy physics with deep learning,” Nature Commun., vol. 5, [213] C. Shorten and T. M. Khoshgoftaar, “A survey on image data aug-
no. 1, pp. 1–9, Sep. 2014. mentation for deep learning,” J. Big Data, vol. 6, no. 1, pp. 1–48,
[193] C. Z. Mooney, Monte Carlo Simulation. Newbury Park, CA, USA: Dec. 2019.
SAGE, 1997. [214] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer,
[194] R. K. Pace and R. Barry, “Sparse spatial autoregressions,” Statist. “SMOTE: Synthetic minority over-sampling technique,” J. Artif. Intell.
Probab. Lett., vol. 33, pp. 291–297, May 1997. Res., vol. 16, no. 1, pp. 321–357, Jan. 2002.
[195] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, [215] L. Jing and Y. Tian, “Self-supervised visual feature learning with deep
Classification Regression Trees. Evanston, IL, USA: Routledge, neural networks: A survey,” IEEE Trans. Pattern Anal. Mach. Intell.,
2017. vol. 43, no. 11, pp. 4037–4058, Nov. 2021.
[196] L. Breiman, “Random forests,” Mach. Learn., vol. 45, no. 1, pp. 5–32, [216] X. Liu et al., “Self-supervised learning: Generative or contrastive,”
2001. IEEE Trans. Knowl. Data Eng., vol. 35, no. 1, pp. 857–876, Jan. 2021.
[197] K. Broelemann and G. Kasneci, “A gradient-based split criterion for [217] T. Ucar, E. Hajiramezanali, and L. Edwards, “SubTab: Subsetting
highly accurate and transparent model trees,” in Proc. IJCAI, 2019, features of tabular data for self-supervised representation learning,” in
pp. 1–8. Proc. Adv. Neural Inf. Process. Syst., vol. 34, 2021, pp. 1–13.