Tabular Data: Is Attention All You Need?

Abstract

Deep Learning has revolutionized the field of AI and led to remarkable achievements in applications involving image and text data. Unfortunately, [...]

[...] successfully applied to tabular data (Kadra et al., 2021; Arik & Pfister, 2021; Gorishniy et al., 2021; Hollmann et al., 2023). To clear the cloud of uncertainty and help the research community, [...]
[...] Grinsztajn et al., 2022; McElfresh et al., 2023). However, our experiments reveal that transformer-based architectures are not better than variants of MLP networks with residual connections, thereby questioning the ongoing trend of transformer-based methods for tabular data (Gorishniy et al., 2021; Huang et al., 2020; Song et al., 2019; Somepalli et al., 2021; Hollmann et al., 2023). We further present analyses showing that the choice of the experimental protocol is the source of these orthogonal conclusions, especially when the hyperparameters of neural networks are not tuned with a sufficient HPO budget.

Therefore, our work presents the following contributions:

• A fair and large-scale experimental protocol for comparing neural network variants against decision trees on tabular datasets;

• Empirical findings suggesting that neural networks are competitive against decision trees, and that transformers are not better than variants of traditional MLPs;

• An analysis of the influence of the HPO budget on the predictive quality of neural networks.

2. Related Work

Given the prevalence of tabular data in numerous areas, including healthcare, finance, psychology, and anomaly detection, as highlighted in various studies (Johnson et al., 2016; Ulmer et al., 2020; Urban & Gates, 2021; Chandola et al., 2009; Guo et al., 2017; A. & E., 2022), there has been significant research dedicated to developing algorithms that effectively address the challenges inherent in this domain.

Gradient Boosted Decision Trees (GBDTs) (Friedman, 2001), including popular implementations like XGBoost (Chen & Guestrin, 2016), LightGBM (Ke et al., 2017), and CatBoost (Prokhorenkova et al., 2018), are widely favored by practitioners for their robust performance on tabular datasets.
In terms of neural networks, prior work shows that meticulously searching for the optimal combination of regularization techniques in simple multilayer perceptrons (MLPs), called Regularization Cocktails (Kadra et al., 2021), can yield impressive results. Another recent paper proposes a notable adaptation of the renowned ResNet architecture for tabular data (Gorishniy et al., 2021). This version of ResNet, originally conceived for image processing (He et al., 2016), has been effectively repurposed for tabular datasets in their research. We demonstrate that with thorough hyperparameter tuning, a ResNet model tailored for tabular data rivals the performance of transformer-based architectures.

Reflecting their success in various domains, transformers have also garnered attention in the tabular data domain. TabNet (Arik & Pfister, 2021), an innovative model in this area, employs attention mechanisms sequentially to prioritize the most significant features. SAINT (Somepalli et al., 2021), another transformer-based model, draws inspiration from the seminal transformer architecture (Vaswani et al., 2017). It addresses tabular data challenges by applying attention both to rows and columns. The authors also offer a self-supervised pretraining phase, particularly beneficial when labels are scarce. The FT-Transformer (Gorishniy et al., 2021) is distinguished by its two-component structure: the Feature Tokenizer and the Transformer. The Feature Tokenizer is responsible for converting the input x (comprising both numerical and categorical features) into embeddings. These embeddings are then fed into the Transformer, forming the basis for subsequent processing. Moreover, TabPFN (Hollmann et al., 2023) stands out as a cutting-edge method in the realm of supervised classification for small tabular datasets.

Significant research has delved into understanding the contexts where Neural Networks (NNs) excel and where they fall short (Shwartz-Ziv & Armon, 2021; Borisov et al., 2022; Grinsztajn et al., 2022). The recent study by McElfresh et al. (2023) is closely related to ours in terms of research focus. However, the authors used only random search for tuning the hyperparameters of neural networks, whereas we employ the Tree-structured Parzen Estimator (TPE), which provides a more guided and efficient search strategy. Additionally, their study was limited to evaluating a maximum of 30 hyperparameter configurations, in contrast to our more extensive exploration of 100 configurations. Furthermore, despite using the validation set for hyperparameter optimization, they do not retrain the model on the combined training and validation data using the best-found configuration prior to evaluation on the test set. Our paper departs from prior studies by applying a methodologically correct experimental protocol involving thorough HPO for neural networks.

3. Research Questions

In a nutshell, we address the following research questions:

1. Are decision trees superior to neural networks in terms of predictive performance?

2. Do attention-based networks outperform multilayer perceptrons with residual connections (ResNets)?

3. How does the hyperparameter optimization (HPO) budget influence the performance of neural networks?

To address these questions, we carry out an extensive empirical assessment following the protocol of Section 5.
[Figure 1. Architecture of the adapted ResNeXt model: categorical features are embedded and concatenated with the numerical features, projected by a linear layer, passed through ResNeXt blocks 1 to N, and followed by normalization and a final linear output layer. Each ResNeXt block consists of several parallel Linear → Activation → Linear paths whose summed output is added to the residual input x.]
4. Revisiting MLPs with Residual Connections

Building on the success of ResNet on vision datasets, Gorishniy et al. (2021) introduced an adaptation of ResNet for tabular data, demonstrating its strong performance. A logical extension of this work is the exploration of a ResNeXt (Xie et al., 2017) adaptation for tabular datasets. In our adaptation, we introduce multiple parallel paths in the ResNeXt block, a key feature that distinguishes it from the traditional ResNet architecture. Despite this increase in architectural complexity, designed to capture more nuanced patterns in tabular data, the overall parameter count of our ResNeXt model remains comparable to that of the original ResNet model. This is achieved by a design choice similar to that in the original ResNeXt architecture, where the hidden units are distributed across multiple paths, each receiving a fraction determined by the cardinality. This aspect achieves a balance between architectural sophistication and model efficiency, without a substantial increase in the model's size or computational demands.

In this context, we present a straightforward adaptation of ResNeXt for tabular data. The empirical results of Section 6 indicate that this adaptation not only competes effectively with transformer-based models but also shows strong performance in comparison to Gradient Boosted Decision Tree (GBDT) models.

Our adaptation primarily involves the transformation of the architecture to handle the distinct characteristics of tabular datasets, which typically include a mix of numerical and categorical features. Key components of the adapted ResNeXt architecture are:

Input Handling: The model accommodates both numerical and categorical data. For categorical features, an embedding layer transforms these features into a continuous space, enabling the model to capture more complex relationships.

Normalization: In the architecture of our adapted ResNeXt model, we apply Batch Normalization (Ioffe & Szegedy, 2015) to normalize the outputs across the network's layers. This approach is critical in ensuring stable training and effective convergence, as it helps to standardize the inputs to each layer.

Cardinality: Perhaps the most important component of the ResNeXt architecture, cardinality refers to the number of parallel paths in the network. This concept, adapted from grouped convolutions in vision tasks, allows the model to learn more complex representations of tabular data by making the network wider.

Residual Connections: As in the original ResNeXt architecture and the ResNet architecture it builds on, residual connections are employed. These connections help mitigate the vanishing gradient problem and enable the training of deeper networks.

Dropout Layers: The architecture incorporates dropout layers, both in the hidden layers and the residual connections, to prevent overfitting by providing a form of regularization.

Figure 1 illustrates the architecture of the adapted ResNeXt model, including a detailed view of its characteristic ResNeXt block.
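To make these components concrete, the following is a minimal PyTorch sketch of the design described above and in Figure 1. The class and argument names (TabularResNeXtBlock, hidden_factor, etc.) are illustrative assumptions, not the exact open-sourced code.

```python
import torch
import torch.nn as nn


class TabularResNeXtBlock(nn.Module):
    """One residual block with `cardinality` parallel MLP paths.

    The total hidden width is split across the paths, which keeps the
    parameter count close to that of a plain ResNet block.
    """

    def __init__(self, dim: int, cardinality: int = 8,
                 hidden_factor: float = 2.0, hidden_dropout: float = 0.1,
                 residual_dropout: float = 0.1):
        super().__init__()
        # Each path receives a fraction of the hidden units,
        # determined by the cardinality.
        path_hidden = max(1, int(dim * hidden_factor) // cardinality)
        self.norm = nn.BatchNorm1d(dim)
        self.paths = nn.ModuleList([
            nn.Sequential(
                nn.Linear(dim, path_hidden),
                nn.ReLU(),
                nn.Dropout(hidden_dropout),
                nn.Linear(path_hidden, dim),
            )
            for _ in range(cardinality)
        ])
        self.residual_dropout = nn.Dropout(residual_dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        # Sum the parallel paths, then add the residual connection.
        h = torch.stack([path(h) for path in self.paths]).sum(dim=0)
        return x + self.residual_dropout(h)


class TabularResNeXt(nn.Module):
    """Embed categoricals, concatenate with numericals, apply N blocks."""

    def __init__(self, n_numerical: int, cat_cardinalities: list,
                 embed_dim: int, dim: int, n_blocks: int, n_classes: int):
        super().__init__()
        # One embedding table per categorical feature.
        self.embeddings = nn.ModuleList(
            [nn.Embedding(c, embed_dim) for c in cat_cardinalities])
        in_dim = n_numerical + embed_dim * len(cat_cardinalities)
        self.input_proj = nn.Linear(in_dim, dim)
        self.blocks = nn.Sequential(
            *[TabularResNeXtBlock(dim) for _ in range(n_blocks)])
        self.head = nn.Sequential(
            nn.BatchNorm1d(dim), nn.ReLU(), nn.Linear(dim, n_classes))

    def forward(self, x_num: torch.Tensor, x_cat: torch.Tensor) -> torch.Tensor:
        embedded = [emb(x_cat[:, i]) for i, emb in enumerate(self.embeddings)]
        x = torch.cat([x_num] + embedded, dim=1)
        return self.head(self.blocks(self.input_proj(x)))
```

Splitting the int(dim * hidden_factor) hidden units across the cardinality paths is what keeps the parameter count comparable to a plain ResNet block of the same total width.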
5. Experimental Protocol

In this study, we assess all the methods on OpenML-CC18, a popular, well-established tabular benchmark used to compare various methods in the community (Bischl et al., 2021), which comprises 72 diverse datasets¹.

¹ Due to memory issues encountered with several methods, we exclude four datasets from our analysis.
The datasets contain from 5 to 3073 features and from 500 to 100,000 instances, covering various binary and multi-class problems. The benchmark excludes artificial datasets, subsets or binarizations of larger datasets, and any dataset solvable by a single feature or a simple decision tree. For the full list of datasets used in our study, please refer to Appendix C.

Our evaluation employs a nested cross-validation approach. Initially, we partition the data into 10 folds. Nine of these folds are then used for hyperparameter tuning: each hyperparameter configuration is evaluated with 9-fold cross-validation, and the cross-validation results are used to estimate the performance of the model under that configuration.
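A compact sketch of this nested scheme is shown below (scikit-learn based). Here tune_with_inner_cv and fit_and_score are hypothetical placeholders for the HPO step described next and for the final retraining-and-scoring step.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold


def nested_evaluation(X, y, tune_with_inner_cv, fit_and_score):
    """Sketch of the nested cross-validation protocol described above.

    `tune_with_inner_cv` and `fit_and_score` are hypothetical callables:
    the former runs HPO with 9-fold inner cross-validation on the outer
    training split; the latter retrains on the combined training and
    validation data and scores the held-out test fold (ROC-AUC).
    """
    outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    fold_scores = []
    for train_idx, test_idx in outer.split(X, y):
        # Nine of the ten folds are used for hyperparameter tuning.
        best_config = tune_with_inner_cv(X[train_idx], y[train_idx],
                                         n_inner_folds=9)
        # Retrain with the best configuration, then evaluate on the
        # untouched outer test fold.
        fold_scores.append(fit_and_score(best_config,
                                         X[train_idx], y[train_idx],
                                         X[test_idx], y[test_idx]))
    # Reported performance: average ROC-AUC over the 10 outer folds.
    return float(np.mean(fold_scores))
```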
For hyperparameter optimization, we utilize Optuna (Akiba et al., 2019), a well-known HPO library, with the Tree-structured Parzen Estimator (TPE) algorithm, Optuna's default HPO method. The optimization is constrained by a budget of either 100 trials or a maximum duration of 23 hours. Upon determining the optimal hyperparameters with Optuna, we train the model on the combined training and validation folds. To enhance efficiency, we execute every outer fold in parallel across all datasets. All experiments are run on NVIDIA RTX 2080 Ti GPUs with 16 GB of memory. Our evaluation protocol dictates that for every algorithm, up to 68K different models will be evaluated, leading to a total of approximately 600K individual evaluations. As our study encompasses seven distinct methods, this methodology culminates in a substantial total of over 4M evaluations, involving more than 400K unique models.
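Below is a minimal sketch of one such inner HPO loop with Optuna's TPE sampler under this budget, sampling a subset of the ResNeXt search space from Table 1. The helper cross_validate_config is hypothetical; it stands for the mean validation ROC-AUC of a configuration under the 9-fold inner cross-validation.

```python
import optuna


def objective(trial: optuna.Trial) -> float:
    # A subset of the ResNeXt search space from Table 1.
    config = {
        "n_layers": trial.suggest_int("n_layers", 1, 8),
        "layer_size": trial.suggest_int("layer_size", 64, 1024),
        "lr": trial.suggest_float("lr", 1e-5, 1e-2, log=True),
        "weight_decay": trial.suggest_float("weight_decay", 1e-6, 1e-3,
                                            log=True),
        "cardinality": trial.suggest_categorical("cardinality",
                                                 [2, 4, 8, 16, 32]),
    }
    # Hypothetical helper: mean validation ROC-AUC of this configuration
    # under the 9-fold inner cross-validation.
    return cross_validate_config(config, n_folds=9)


# TPE is Optuna's default sampler; the budget is 100 trials or 23 hours,
# whichever is exhausted first.
study = optuna.create_study(direction="maximize",
                            sampler=optuna.samplers.TPESampler(seed=0))
study.optimize(objective, n_trials=100, timeout=23 * 3600)
best_config = study.best_params  # then retrain on training + validation
```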
Lastly, we report the model's performance as the average Area Under the Receiver Operating Characteristic curve (ROC-AUC) across the 10 outer test folds. Given the prevalence of imbalanced datasets in the OpenML-CC18 benchmark, we employ ROC-AUC as our primary metric. ROC-AUC quantifies the ability of a model to distinguish between classes, calculated as the area under the curve of the True Positive Rate (TPR) plotted against the False Positive Rate (FPR) at various threshold settings. This measure offers a more reliable assessment of model performance across varied class distributions, as it is less influenced by class imbalance in the dataset.
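For reference, a per-fold score under this metric can be computed with scikit-learn as in the sketch below; the one-vs-rest extension with macro averaging shown for the multi-class case is our illustrative assumption.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Toy stand-in for one outer fold of one dataset.
X, y = make_classification(n_samples=500, n_classes=3, n_informative=6,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)

# Multi-class ROC-AUC from predicted class probabilities.
print(roc_auc_score(y_te, proba, multi_class="ovr", average="macro"))
```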
In our study, we adhered to the official hyperparameter search spaces from the respective papers for tuning every method. The search space utilized for our adapted ResNeXt model is detailed in Table 1. For a detailed description of the hyperparameter search spaces of all other methods included in our analysis, we direct the reader to Appendix B. We open-source our code to promote reproducibility².

Parameter            Type         Range               Log Scale
Number of layers     Integer      [1, 8]
Layer size           Integer      [64, 1024]
Learning rate        Float        [10⁻⁵, 10⁻²]        ✓
Weight decay         Float        [10⁻⁶, 10⁻³]        ✓
Residual dropout     Float        [0, 0.5]
Hidden dropout       Float        [0, 0.5]
Dim. embedding       Integer      [64, 512]
Dim. hidden factor   Float        [1.0, 4.0]
Cardinality          Categorical  {2, 4, 8, 16, 32}

Table 1. Search space for the ResNeXt model

5.1. Baselines

In our experiments, we compare a range of methods categorized into three distinct groups:

Gradient Boosted Decision Trees: Initially, we consider XGBoost (Chen & Guestrin, 2016), a well-established GBDT library that uses asymmetric trees. The library does not natively handle categorical features, which is why we apply one-hot encoding, where each categorical feature is represented as a sparse vector in which only the entry corresponding to the current feature value is set to 1 (see the sketch at the end of this subsection). Moreover, we consider CatBoost, a well-known GBDT library that employs oblivious trees as weak learners and natively handles categorical features with various strategies. We utilize the official library released by the authors for our experiments (Prokhorenkova et al., 2018).

Traditional Deep Learning (DL) Methods: Recent works have shown that MLPs featuring residual connections outperform plain MLPs and make for very strong competitors to state-of-the-art architectures (Kadra et al., 2021; Gorishniy et al., 2021); as such, in our study we include the ResNet implementation provided in Gorishniy et al. (2021).

Transformer-Based Architectures: As state-of-the-art specialized deep learning architectures, we consider TabNet, which employs sequential attention to selectively utilize the most pertinent features at each decision step. For the implementation of TabNet, we use a well-maintained public implementation³. Moreover, we consider SAINT, which introduces a hybrid deep learning approach tailored for tabular data challenges. SAINT applies attention mechanisms across both rows and columns and integrates an advanced embedding technique. We use the official implementation for our experiments (Somepalli et al., 2021). Additionally, we consider FT-Transformer, an adaptation of the Transformer architecture for tabular data. It transforms categorical and numerical features into embeddings, which are then processed through a series of Transformer layers. For our experiments, we use the official implementation from the authors (Gorishniy et al., 2021). Lastly, we consider TabPFN, a meta-learned transformer architecture that performs in-context learning (ICL). We use the official implementation from the authors (Hollmann et al., 2023) for our experiments.

² https://github.com/releaunifreiburg/Revisiting-MLPs
³ https://github.com/dreamquark-ai/tabnet
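As a small illustration of the one-hot encoding applied for XGBoost above, consider the following sketch with toy data (not our preprocessing code):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy categorical column with three distinct values.
X = pd.DataFrame({"color": ["red", "green", "red", "blue"]})

encoder = OneHotEncoder()  # yields a scipy sparse matrix by default
X_onehot = encoder.fit_transform(X)

# Each row is a sparse vector with a single 1 at the position of its
# category: with categories (blue, green, red), "green" -> (0, 1, 0).
print(encoder.categories_)
print(X_onehot.toarray())

# The sparse matrix can then be passed to XGBoost directly, e.g.:
#   xgboost.XGBClassifier().fit(X_onehot, y)
```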
6. Experiments and Results

Research Question 1: Are decision trees superior to neural networks in terms of predictive performance?

[...] the difference in results is statistically significant. Next, Figure 2 (bottom) shows that when HPO is performed, the top-4 methods are consistent; however, the DL methods achieve a better rank compared to the GBDT methods. After performing HPO, the performance of XGBoost improves, and the differences in results among SAINT, XGBoost, FT-Transformer, CatBoost, ResNet, and ResNeXt are not statistically significant. Although the performance of TabNet improves with HPO, the method still achieves the worst performance compared to the other methods. [...]