niy et al., 2021) for better selective feature interactions. Although these tailored designs gained performance on supervised tabular data tasks, they still do not match GBDT approaches (e.g., XGboost) on a diverse array of datasets (Borisov et al., 2021).

Our work pushes this research envelope: We develop a new neural network that, for the first time, outperforms GBDTs on a wide range of public tabular datasets. This is achieved through the cooperation of a new tabular-data-tailored architecture called ExcelFormer and a bespoke training methodology, which jointly learn appropriate feature representation update functions and judicious feature interactions (satisfying the aforementioned (i) and (ii)). For better feature representations, we propose an attention module, called the attentive intra-feature update module (AiuM), which is more powerful than previous non-attentive representation update approaches (e.g., linear or non-linear projection networks). For feature interactions, we present a conservative approach based on a novel module called the directed inter-feature attention module (DiaM), which avoids compromising the semantics of critical features by only allowing features of lower importance to fetch information from those of higher importance. Our ExcelFormer is mainly built by alternately stacking these two types of modules. Since the main ingredients AiuM and DiaM are both flexible attention-based modules, our training methodology aims to prevent ExcelFormer from converging to an overly complicated representation function that overfits irregular target functions, and from introducing useless feature interactions that hurt generalization. At the start of training, a novel initialization approach assigns minuscule values to the weights of DiaM and AiuM, so as to attenuate the intra-feature representation updates and inter-feature interactions. During training, the effects of DiaM and AiuM then grow progressively to appropriate levels under the guidance of our new regularization schemes Feat-Mix and Hidden-Mix. Hidden-Mix and Feat-Mix are two variants of Mixup (Zhang et al., 2018) designed specifically for tabular data; they avoid the disadvantages of the original Mixup approach (to be discussed in Sec. 4) and respectively prioritize promoting the learning of feature representations and feature interactions.
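To make the attenuated initialization concrete, a minimal sketch is given below. It is only an illustrative assumption (the exact distribution and scaling used by ExcelFormer may differ): it shrinks the weights of all linear layers inside an AiuM or DiaM module by a small factor, so that, together with the residual shortcuts, both modules start close to identity mappings.

```python
import torch
import torch.nn as nn

def attenuated_init_(module: nn.Module, init_scale: float = 1e-4) -> None:
    """Assign minuscule initial weights to an AiuM or DiaM module, so that
    intra-feature updates and inter-feature interactions are attenuated at the
    start of training and can grow progressively as training proceeds."""
    for m in module.modules():
        if isinstance(m, nn.Linear):
            nn.init.xavier_uniform_(m.weight)
            with torch.no_grad():
                m.weight.mul_(init_scale)  # shrink the weights toward zero
            if m.bias is not None:
                nn.init.zeros_(m.bias)
```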
Our main contributions are summarized as follows.

• We present the first neural network that outperforms GBDTs (e.g., XGboost), which is verified by comprehensive experiments on 28 public tabular datasets.

• We identify two key capabilities of neural networks for effectively handling tabular data, which may inspire further research.

• To equip our ExcelFormer model with the two key capabilities, we develop new modules and a novel training methodology that cooperatively promote the model's effectiveness.

• We propose two tabular-data-specific Mixup variants, Hidden-Mix and Feat-Mix, which are superior to the vanilla input Mixup approach on tabular data.

2. Related Work

Supervised Tabular Learning. Since neural networks have been demonstrated to be effective on various data types (e.g., images (Khan et al., 2022)), plentiful efforts have been made to harness the power of neural networks on tabular data. However, so far GBDT approaches (e.g., XGboost) still remain the go-to choice (Katzir et al., 2020) for various supervised tabular tasks (Borisov et al., 2021; Grinsztajn et al., 2022), due to their superior performances on diverse tabular datasets. To achieve GBDT-level results, recent studies focused on devising sophisticated neural modules for heterogeneous feature interactions (Gorishniy et al., 2021; Chen et al., 2022; Yan et al., 2023), mimicking tree-like approaches (Katzir et al., 2020; Popov et al., 2019; Arik & Pfister, 2021) to find decision paths, or resorting to conventional approaches (Cheng et al., 2016; Guo et al., 2017). Apart from model designs, various data representation approaches, such as feature embedding (Gorishniy et al., 2022; Chen et al., 2023), discretization of continuous features (Guo et al., 2021; Wang et al., 2020), and Boolean algebra based methods (Wang et al., 2021), were applied to deal with irregular target patterns (Tancik et al., 2020; Grinsztajn et al., 2022). These attempts suggested the potential of neural networks, but still yielded inferior performances compared with GBDTs on a wide range of tabular datasets. Several challenges for neural networks on tabular data were summed up in (Grinsztajn et al., 2022), but no solutions were given, and these challenges still remain open. Besides, there were some attempts (Wang & Sun, 2022; Arik & Pfister, 2021; Yoon et al., 2020) to apply self-supervision to tabular datasets. However, these approaches are dataset- or domain-specific, and appear difficult to adopt widely due to the heterogeneity of tabular datasets.
Figure 1. Illustrating our proposed ExcelFormer model. AiuM and DiaM denote the attentive intra-feature update module and the directed inter-feature attention module, respectively. "Norm" denotes a LayerNorm layer (Ba et al., 2016). Before being fed into the model, the input features are sorted according to a feature importance metric (e.g., mutual information).
Mixup and Its Variants. The original Mixup (Zhang et al., 2018) generates new data by convex interpolation of two given samples, which has proved beneficial on various image datasets (Tajbakhsh et al., 2020; Touvron et al., 2021a) and some tabular datasets. However, we found that the original Mixup may conflict with irregular target patterns (to be discussed in Sec. 4) and hardly cooperates with cutting-edge models (Gorishniy et al., 2021; Somepalli et al., 2021). ManifoldMix (Verma et al., 2019) and FlowMixup (Chen et al., 2020) applied convex interpolations to the hidden states, which did not fundamentally alter the way new data are synthesized and exhibited similar characteristics to the vanilla input Mixup. The follow-up variants CutMix (Yun et al., 2019), AttentiveMix (Walawalkar et al., 2020), SaliencyMix (Uddin et al., 2020), ResizeMix (Qin et al., 2020), and PuzzleMix (Kim et al., 2020) splice two images spatially, which preserves local patterns of images but is not directly applicable to tabular data. Darabi et al. (2021) and Somepalli et al. (2021) applied Mixup and CutMix-like approaches in tabular data pre-training. It was shown (Kadra et al., 2021) that a search through regularization approaches could promote the performance of a simple neural network up to the XGboost level; however, time-consuming hyper-parameter tuning is a necessity in their setting, whereas XGboost and Catboost may not be extensively tuned. In contrast, our ExcelFormer models with fixed settings achieve GBDT-level performances without hyper-parameter tuning.
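For reference, the vanilla input Mixup discussed above can be sketched in a few lines for tabular rows. This is a generic illustration of the original scheme (feature vectors and one-hot or continuous targets are assumptions), not one of our proposed variants.

```python
import numpy as np

def input_mixup(x_a, x_b, y_a, y_b, alpha: float = 0.5):
    """Vanilla Mixup (Zhang et al., 2018): convex interpolation of two samples.

    x_a, x_b: feature vectors of shape (f,); y_a, y_b: one-hot or continuous targets.
    """
    lam = np.random.beta(alpha, alpha)      # mixing coefficient ~ Beta(alpha, alpha)
    x_new = lam * x_a + (1.0 - lam) * x_b   # interpolated features
    y_new = lam * y_a + (1.0 - lam) * y_b   # interpolated targets
    return x_new, y_new
```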
3. ExcelFormer

3.1. The Overall Architecture

Fig. 1 shows our proposed ExcelFormer model. ExcelFormer is built mainly on two simple ingredients, the attentive intra-feature update module (AiuM) and the directed inter-feature attention module (DiaM), which respectively conduct feature representation updates and feature interactions. During processing, the f features of an input data sample x ∈ R^f are first tokenized by a neural embedding layer into representations of size d each, denoted as z^(0) ∈ R^{f×d}. The result is then successively processed by L DiaMs and L AiuMs alternately. These two modules both have a LayerNorm head, and are accompanied by additive shortcut connections as illustrated in Fig. 1. Finally, a probability vector over C categories, p ∈ R^C (C > 2), for multi-class classification, or a scalar value p ∈ R^1 for regression and binary classification, is produced by a prediction head.
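A structural sketch of this layout is given below. It is only an illustration under simplifying assumptions: all f features are treated as numerical and embedded by a per-feature affine map, and the prediction head is a single linear layer over the flattened tokens; the actual embedding layer and head may differ. The DiaM and AiuM modules themselves are sketched after Secs. 3.2 and 3.3 and are passed in here as ready-made layers.

```python
import torch
import torch.nn as nn

class ExcelFormerSketch(nn.Module):
    """Embedding -> L x (DiaM -> AiuM), each with a LayerNorm head and an
    additive shortcut -> prediction head."""
    def __init__(self, n_features: int, d: int, d_out: int,
                 diam_layers: nn.ModuleList, aium_layers: nn.ModuleList):
        super().__init__()
        # Per-feature affine embedding (assumption): x_i -> x_i * w_i + b_i in R^d.
        self.emb_w = nn.Parameter(torch.randn(n_features, d) * 0.01)
        self.emb_b = nn.Parameter(torch.zeros(n_features, d))
        self.diams, self.aiums = diam_layers, aium_layers
        self.norms1 = nn.ModuleList([nn.LayerNorm(d) for _ in range(len(diam_layers))])
        self.norms2 = nn.ModuleList([nn.LayerNorm(d) for _ in range(len(aium_layers))])
        self.head = nn.Linear(n_features * d, d_out)  # simplified prediction head

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, f), with columns already sorted by descending importance.
        z = x.unsqueeze(-1) * self.emb_w + self.emb_b                # (batch, f, d)
        for diam, aium, n1, n2 in zip(self.diams, self.aiums, self.norms1, self.norms2):
            z = z + diam(n1(z))   # directed inter-feature attention (Sec. 3.3)
            z = z + aium(n2(z))   # attentive intra-feature update (Sec. 3.2)
        return self.head(z.flatten(1))  # logits p in R^C, or a scalar for regression
```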
3.2. Attentive Intra-feature Update Module (AiuM)

A possible conflict between the irregularity of target functions and the over-smooth solutions produced by neural networks was identified in (Grinsztajn et al., 2022). In existing Transformer-like models (Yan et al., 2023; Gorishniy et al., 2021), the commonly-used position-wise feed-forward network (FFN) (Vaswani et al., 2017) was employed for feature representation updates. However, we empirically discovered that the FFN, containing two linear projections and a ReLU activation, is not flexible enough to fit irregular target functions, and we hence design an attention approach to handle intra-feature representation updates:

z′ = tanh(zW_1^(l) + b_1^(l)) ⊙ (zW_2^(l) + b_2^(l)),    (1)

where W_1^(l) ∈ R^{d×d}, W_2^(l) ∈ R^{d×d}, b_1^(l) ∈ R^d, and b_2^(l) ∈ R^d are all learnable parameters of the l-th layer, ⊙ denotes the element-wise product, and z and z′ denote the input and output representations, respectively. Our experiments show that Eq. (1) is more powerful than the FFN with the same computational costs. Notably, the operations in Eq. (1) do not conduct any feature interactions.
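Eq. (1) translates almost directly into PyTorch. The sketch below assumes token representations of shape (batch, f, d), so both linear maps act on the last dimension only and no information crosses features.

```python
import torch
import torch.nn as nn

class AiuM(nn.Module):
    """Attentive intra-feature update (Eq. 1): z' = tanh(z W1 + b1) ⊙ (z W2 + b2)."""
    def __init__(self, d: int):
        super().__init__()
        self.lin1 = nn.Linear(d, d)  # W1, b1
        self.lin2 = nn.Linear(d, d)  # W2, b2

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, f, d); the gate and the value are both per-feature projections.
        return torch.tanh(self.lin1(z)) * self.lin2(z)
```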
3.3. Directed Inter-feature Attention Module (DiaM)

It was pointed out (Ng, 2004) that neural networks are inherently inefficient at organizing feature interactions, yet previous work empirically demonstrated the benefits of feature interactions (Chen et al., 2022; Cheng et al., 2016). Thus, we present a conservative approach for feature interactions that allows only less target-relevant features to gain access to the information of more target-relevant features. Before feeding features into ExcelFormer, we sort them in descending order of feature importance (we use mutual information in this paper) with respect to the targets in the training set. To handle feature interactions judiciously, we perform a special self-attention operation with an unoptimizable mask M, as:

z′ = σ(((zW_q)(zW_k)^T ⊕ M) / √d)(zW_v),    (2)

where W_q, W_k, W_v ∈ R^{d×d} are all learnable matrices, ⊕ is element-wise addition, and σ is the softmax operating along the last dimension. The elements in the lower triangle of M are set to 0 and the remaining elements to −∞, so that after the softmax each feature attends only to itself and to more important features.
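A sketch of Eq. (2) and the importance-based sorting is given below. The mutual-information helper (scikit-learn, classification targets) and the exact mask construction are illustrative assumptions consistent with the description above.

```python
import math
import numpy as np
import torch
import torch.nn as nn
from sklearn.feature_selection import mutual_info_classif

def sort_features_by_importance(X_train, y_train):
    """Column order that puts more target-relevant features first (classification)."""
    mi = mutual_info_classif(X_train, y_train)  # feature importance estimates
    return np.argsort(-mi)                      # indices in descending importance

class DiaM(nn.Module):
    """Directed inter-feature attention (Eq. 2) with a fixed, non-learnable mask M:
    with features sorted by descending importance, feature i may attend only to
    itself and to features j <= i (more important ones), never the reverse."""
    def __init__(self, d: int, n_features: int):
        super().__init__()
        self.q = nn.Linear(d, d, bias=False)  # W_q
        self.k = nn.Linear(d, d, bias=False)  # W_k
        self.v = nn.Linear(d, d, bias=False)  # W_v
        self.d = d
        # M: 0 on and below the diagonal, -inf strictly above it.
        mask = torch.full((n_features, n_features), float("-inf")).triu(diagonal=1)
        self.register_buffer("mask", mask)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, f, d)
        scores = self.q(z) @ self.k(z).transpose(-1, -2)             # (batch, f, f)
        attn = torch.softmax((scores + self.mask) / math.sqrt(self.d), dim=-1)
        return attn @ self.v(z)                                      # (batch, f, d)
```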
Figure: (a) an example operation of Hidden-Mix (after feature embedding), mixing the feature representations of samples z_1 and z_2 into a synthesized sample.
Table 1. Performance comparison of various ExcelFormer versions with extensively tuned XGboost and Catboost on 28 public datasets. The performances of ExcelFormer versions that outperform both XGboost and Catboost are marked in bold, while those that outperform either XGboost or Catboost are underlined. The performances of XGboost and Catboost are bold if they are the best results.

Datasets AN IS CP VI YP GE CH SU BA BR
XGboost -0.1076 95.78 -2.1370 -0.1140 -0.0275 68.75 85.66 -0.0177 88.97 -0.0769
Catboost -0.0929 95.26 -2.5160 -0.1181 -0.0275 66.54 86.62 -0.0220 89.16 -0.0931
ExcelFormer-Feat-Mix -0.0782 96.38 -2.6590 -1.6220 -0.0276 70.38 85.89 -0.0184 89.00 -0.1123
ExcelFormer-Hidden-Mix -0.0786 96.72 -2.2320 -0.2440 -0.0276 70.72 85.89 -0.0174 88.65 -0.0696
ExcelFormer (Mix Tuned) -0.0876 96.51 -2.2020 -0.1070 -0.0275 68.36 85.80 -0.0173 89.21 -0.0627
ExcelFormer (Fully Tuned) -0.0778 96.56 -2.1980 -0.0899 -0.0275 68.94 85.89 -0.0161 89.16 -0.0641
Datasets (Continued) EY MA AI PO BP CR CA HS HO
XGboost 72.88 93.69 -0.0001605 -4.331 99.96 85.11 -0.4359 -0.1707 -3.139
Catboost 72.41 93.66 -0.0001616 -4.622 99.95 85.12 -0.4359 -0.1746 -3.279
ExcelFormer-Feat-Mix 71.44 93.38 -0.0001689 -5.694 99.94 85.23 -0.4331 -0.1835 -3.305
ExcelFormer-Hidden-Mix 72.09 93.66 -0.0001627 -2.862 99.95 85.22 -0.4587 -0.1773 -3.147
ExcelFormer (Mix Tuned) 74.14 94.04 -0.0001615 -2.629 99.93 85.26 -0.4316 -0.1726 -3.159
ExcelFormer (Fully Tuned) 78.94 94.11 -0.0001612 -2.636 99.96 85.36 -0.4336 -0.1727 -3.214
Figure 5. Ablation study on our proposed ingredients of ExcelFormer using six datasets (CA, CP, HO, HE, GE, JA). "–" denotes removal and "+" denotes inclusion. Bars colored purple indicate worse performances than the baseline, while bars colored orange indicate better performances. Note that a lower "RMSE" is better and a higher "ACCURACY" is better.
5.3. Usage Suggestions

In practice, we suggest that a user employ ExcelFormer as follows: (1) first try ExcelFormer with fixed hyper-parameters, which meets the needs of most situations; (2) try the "Mix Tuned" setting if the fixed ExcelFormer versions are not satisfactory; (3) finally, tune all the hyper-parameters of ExcelFormer if better performances are desired. Fig. 4 gives performance comparisons on different types of tasks, based on which we offer two further suggestions. (i) If extremely high performance is desired, it is wise to tune ExcelFormer under the "Mix Tuned" or "Fully Tuned" setting, for any type of task. (ii) For a multi-class classification task, ExcelFormer should be the first choice, since it commonly outperforms GBDTs even without hyper-parameter tuning.

5.4. Ablation Study

We analyze the effects of our proposed ingredients empirically on 6 tabular datasets (we find that the conclusions on the other datasets are similar). We take the better-performing model of ExcelFormer-Feat-Mix and ExcelFormer-Hidden-Mix (without hyper-parameter tuning) as the baseline, and either remove or replace one ingredient at a time for comparison. Fig. 5 reports the performances of the following ExcelFormer versions: (1) He's initialization is used to replace our attenuated initialization approach for AiuM and DiaM, (2) a vanilla self-attention module (vanilla SA) is used to replace DiaM for heterogeneous feature interactions, (3) the linear feed-forward network (FFN) is used to replace AiuM for feature representation updates, (4) neither Feat-Mix nor Hidden-Mix is used, and (5) the input Mixup (Zhang et al., 2018) (α = 0.5) is used to replace our proposed Mixup schemes. One can see that the performances often decrease when an ingredient is removed or replaced, suggesting that all of our ingredients are beneficial in general. However, a compared version also performs better than the baseline on 1 or 2 of the 6 datasets, indicating that an ingredient may have a negative impact on some datasets. In the model development, we retain all these designs since they show positive impacts on most of the datasets. Notably, it is difficult to obtain a design that is always effective, since tabular data are highly diverse and our goal is to present a neural network that can accommodate as many tasks as possible.

Comparing the baseline with the versions using the input Mixup or no Mixup, it is clear that our proposed Mixup schemes are more suitable for tabular data, outperforming these two versions on 5 and 6 of the 6 datasets, respectively. Comparing the no-Mixup version and the version with the input Mixup, the input-Mixup version performs better on 4 of the 6 datasets, while the no-Mixup version is better on the other 2. These results further indicate that the input Mixup is not consistently effective across various tabular datasets, though it beats our proposed Mixup schemes on the GE dataset.

6. Conclusions

In this paper, we developed a new neural network model, ExcelFormer, for supervised tabular data tasks (e.g., classification and regression), and achieved performances beyond the level of GBDTs without bells and whistles.
Our proposed ExcelFormer can achieve competitive performances compared to the extensively tuned XGboost and Catboost even without hyper-parameter tuning, while hyper-parameter tuning can improve ExcelFormer's performances further. Such superiority is demonstrated by comprehensive experiments on 28 public tabular datasets, and is achieved by the cooperation of a simple but efficient model architecture and an accompanying training methodology. We expect that our ExcelFormer together with the training methodology will serve as an effective tool for supervised tabular data applications, and inspire future studies to develop better approaches for dealing with tabular data.

Acknowledgements

This research was partially supported by the National Key R&D Program of China under grant No. 2018AAA0102102 and the National Natural Science Foundation of China under grant No. 62132017.
References

Akiba, T., Sano, S., Yanase, T., Ohta, T., and Koyama, M. Optuna: A next-generation hyperparameter optimization framework. In The ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019.

Arik, S. Ö. and Pfister, T. TabNet: Attentive interpretable tabular learning. In The AAAI Conference on Artificial Intelligence, 2021.

Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

Bachlechner, T., Majumder, B. P., Mao, H., Cottrell, G., and McAuley, J. ReZero is all you need: Fast convergence at large depth. In Uncertainty in Artificial Intelligence, 2021.

Borisov, V., Leemann, T., Seßler, K., Haug, J., Pawelczyk, M., and Kasneci, G. Deep neural networks and tabular data: A survey. arXiv preprint arXiv:2110.01889, 2021.

Chen, J., Yu, H., Feng, R., Chen, D. Z., and Wu, J. Flow-Mixup: Classifying multi-labeled medical images with corrupted labels. In International Conference on Bioinformatics and Biomedicine, 2020.

Chen, J., Liao, K., Wan, Y., Chen, D. Z., and Wu, J. DANets: Deep abstract networks for tabular data classification and regression. In The AAAI Conference on Artificial Intelligence, 2022.

Chen, J., Liao, K., Fang, Y., Chen, D. Z., and Wu, J. TabCaps: A capsule neural network for tabular data classification with BoW routing. In International Conference on Learning Representations, 2023.

Chen, T. and Guestrin, C. XGBoost: A scalable tree boosting system. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016.

Cheng, H.-T., Koc, L., Harmsen, J., et al. Wide & deep learning for recommender systems. In Workshop on Deep Learning for Recommender Systems, 2016.

Darabi, S., Fazeli, S., Pazoki, A., Sankararaman, S., and Sarrafzadeh, M. Contrastive Mixup: Self- and semi-supervised learning for tabular domain. arXiv preprint arXiv:2108.12296, 2021.

Dong, L., Xu, S., and Xu, B. Speech-Transformer: A no-recurrence sequence-to-sequence model for speech recognition. In International Conference on Acoustics, Speech and Signal Processing, 2018.

Duan, T., Anand, A., Ding, D. Y., Thai, K. K., Basu, S., Ng, A., and Schuler, A. NGBoost: Natural gradient boosting for probabilistic prediction. In International Conference on Machine Learning, 2020.

Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics, 2010.

Gorishniy, Y., Rubachev, I., Khrulkov, V., and Babenko, A. Revisiting deep learning models for tabular data. In Advances in Neural Information Processing Systems, 2021.

Gorishniy, Y., Rubachev, I., and Babenko, A. On embeddings for numerical features in tabular deep learning. In Advances in Neural Information Processing Systems, 2022.

Grinsztajn, L., Oyallon, E., and Varoquaux, G. Why do tree-based models still outperform deep learning on typical tabular data? In Advances in Neural Information Processing Systems, 2022.

Guo, H., Tang, R., Ye, Y., Li, Z., and He, X. DeepFM: A factorization-machine based neural network for CTR prediction. In International Joint Conference on Artificial Intelligence, 2017.

Guo, H., Chen, B., Tang, R., Zhang, W., Li, Z., and He, X. An embedding learning framework for numerical features in CTR prediction. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2021.

He, K., Zhang, X., Ren, S., and Sun, J. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In International Conference on Computer Vision, 2015.

Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 1997.
Kadra, A., Lindauer, M., Hutter, F., and Grabocka, J. Well-tuned simple nets excel on tabular datasets. Advances in Neural Information Processing Systems, 2021.

Katzir, L., Elidan, G., and El-Yaniv, R. Net-DNF: Effective deep modeling of tabular data. In International Conference on Learning Representations, 2020.

Khan, S., Naseer, M., Hayat, M., Zamir, S. W., Khan, F. S., and Shah, M. Transformers in vision: A survey. ACM Computing Surveys, 2022.

Kim, J.-H., Choo, W., and Song, H. O. Puzzle Mix: Exploiting saliency and local statistics for optimal mixup. In International Conference on Machine Learning, 2020.

Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. In International Conference on Learning Representations, 2018.

Ng, A. Y. Feature selection, L1 vs. L2 regularization, and rotational invariance. In International Conference on Machine Learning, 2004.

Popov, S., Morozov, S., and Babenko, A. Neural oblivious decision ensembles for deep learning on tabular data. In International Conference on Learning Representations, 2019.

Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V., and Gulin, A. CatBoost: Unbiased boosting with categorical features. Advances in Neural Information Processing Systems, 2018.

Qin, J., Fang, J., Zhang, Q., Liu, W., Wang, X., and Wang, X. ResizeMix: Mixing data with preserved object information and true labels. arXiv preprint arXiv:2012.11101, 2020.

Radford, A., Narasimhan, K., et al. Improving language understanding by generative pre-training. https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf, 2018.

Rahaman, N., Baratin, A., Arpit, D., Draxler, F., Lin, M., Hamprecht, F., Bengio, Y., and Courville, A. On the spectral bias of neural networks. In International Conference on Machine Learning, 2019.

Saxe, A. M., McClelland, J. L., and Ganguli, S. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In International Conference on Learning Representations, 2014.

Somepalli, G., Goldblum, M., Schwarzschild, A., Bruss, C. B., and Goldstein, T. SAINT: Improved neural networks for tabular data via row attention and contrastive pre-training. arXiv preprint arXiv:2106.01342, 2021.

Srivastava, R. K., Greff, K., and Schmidhuber, J. Highway networks. In International Conference on Machine Learning Workshop, 2015.

Tajbakhsh, N., Jeyaseelan, L., Li, Q., Chiang, J. N., Wu, Z., and Ding, X. Embracing imperfect datasets: A review of deep learning solutions for medical image segmentation. Medical Image Analysis, 2020.

Tancik, M., Srinivasan, P., Mildenhall, B., Fridovich-Keil, S., Raghavan, N., Singhal, U., Ramamoorthi, R., Barron, J., and Ng, R. Fourier features let networks learn high frequency functions in low dimensional domains. Advances in Neural Information Processing Systems, 2020.

Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., and Jégou, H. Training data-efficient image Transformers & distillation through attention. In International Conference on Machine Learning, 2021a.

Touvron, H., Cord, M., Sablayrolles, A., Synnaeve, G., and Jégou, H. Going deeper with image Transformers. In IEEE/CVF International Conference on Computer Vision, 2021b.

Uddin, A. S., Monira, M. S., Shin, W., Chung, T., and Bae, S.-H. SaliencyMix: A saliency guided data augmentation strategy for better regularization. In International Conference on Learning Representations, 2020.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. Advances in Neural Information Processing Systems, 2017.

Verma, V., Lamb, A., Beckham, C., Najafi, A., Mitliagkas, I., Lopez-Paz, D., and Bengio, Y. Manifold Mixup: Better representations by interpolating hidden states. In International Conference on Machine Learning, 2019.

Walawalkar, D., Shen, Z., Liu, Z., and Savvides, M. Attentive CutMix: An enhanced data augmentation approach for deep learning based image classification. In International Conference on Acoustics, Speech and Signal Processing, 2020.

Wang, Z. and Sun, J. TransTab: Learning transferable tabular Transformers across tables. In Advances in Neural Information Processing Systems, 2022.

Wang, Z., Zhang, W., Ning, L., and Wang, J. Transparent classification with multilayer logical perceptrons and random binarization. In The AAAI Conference on Artificial Intelligence, 2020.

Wang, Z., Zhang, W., Liu, N., and Wang, J. Scalable rule-based representation learning for interpretable classification. Advances in Neural Information Processing Systems, 2021.
Yan, J., Chen, J., Wu, Y., Chen, D. Z., and Wu, J. T2G-Former: Organizing tabular features into relation graphs promotes heterogeneous feature interaction. The AAAI Conference on Artificial Intelligence, 2023.

Yoon, J., Zhang, Y., Jordon, J., and van der Schaar, M. VIME: Extending the success of self- and semi-supervised learning to tabular domain. Advances in Neural Information Processing Systems, 2020.

Yun, S., Han, D., Oh, S. J., Chun, S., Choe, J., and Yoo, Y. CutMix: Regularization strategy to train strong classifiers with localizable features. In International Conference on Computer Vision, 2019.

Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. Communications of the ACM, 2021.

Zhang, H., Cisse, M., Dauphin, Y. N., and Lopez-Paz, D. Mixup: Beyond empirical risk minimization. In International Conference on Learning Representations, 2018.
Table 2. The details of the datasets used. "# Num" and "# Cat" denote the numbers of numerical and categorical features, respectively. "# Sample" denotes the number of samples in a dataset.
Dataset Abbr. Task Type Metric # Features # Num # Cat # Sample Link
analcatdata_supreme AN regression nRMSE 7 2 5 4,052 https://www.openml.org/d/44055
isolet IS multiclass ACC 613 613 0 7,797 https://www.openml.org/d/44135
cpu_act CP regression nRMSE 21 21 0 8,192 https://www.openml.org/d/44132
visualizing_soil VI regression nRMSE 4 3 1 8,641 https://www.openml.org/d/44056
yprop_4_1 YP regression nRMSE 62 42 20 8,885 https://www.openml.org/d/44054
gesture GE multiclass ACC 32 32 0 9,873 https://www.openml.org/d/4538
churn CH binclass AUC 11 10 1 10,000 https://www.kaggle.com/shrutimechlearn/churn-modelling
sulfur SU regression nRMSE 6 6 0 10,081 https://www.openml.org/d/44145
bank-marketing BA binclass AUC 7 7 0 10,578 https://www.openml.org/d/44126
Brazilian_houses BR regression nRMSE 8 8 0 10,692 https://www.openml.org/d/44141
eye EY multiclass ACC 26 26 0 10,936 http://www.cis.hut.fi/eyechallenge2005
MagicTelescope MA binclass AUC 10 10 0 13,376 https://www.openml.org/d/44125
Ailerons AI regression nRMSE 33 33 0 13,750 https://www.openml.org/d/44137
pol PO regression nRMSE 26 26 0 15,000 https://www.openml.org/d/722
binarized-pol BP binclass AUC 48 48 0 15,000 https://www.openml.org/d/722
credit CR binclass AUC 10 10 0 16,714 https://www.openml.org/d/44089
california CA regression nRMSE 8 8 0 20,640 https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html
house_sales HS regression nRMSE 15 15 0 21,613 https://www.openml.org/d/44144
house HO regression nRMSE 16 16 0 22,784 https://www.openml.org/d/574
diamonds DI regression nRMSE 6 6 0 53,940 https://www.openml.org/d/44140
helena HE multiclass ACC 27 27 0 65,196 https://www.openml.org/d/41169
jannis JA multiclass ACC 54 54 0 83,733 https://www.openml.org/d/41168
higgs-small HI binclass AUC 28 28 0 98,049 https://www.openml.org/d/23512
road-safety RO binclass AUC 32 29 3 111,762 https://www.openml.org/d/44161
medicalcharges ME regression nRMSE 3 3 0 163,065 https://www.openml.org/d/44146
SGEMM GPU kernel performance SG regression nRMSE 9 3 6 241,600 https://www.openml.org/d/44069
covtype CO multiclass ACC 54 54 0 581,012 https://www.openml.org/d/1596
nyc-taxi-green-dec-2016 NY regression nRMSE 9 9 0 581,835 https://www.openml.org/d/44143
B. Hyper-Parameter Tuning

For XGboost and Catboost, we follow the implementations in (Gorishniy et al., 2021), while increasing the number of estimators/iterations (i.e., decision trees) and the number of tuning iterations, so as to attain best-performing models. For our ExcelFormer, we apply Optuna-based tuning (Akiba et al., 2019). The hyper-parameter search spaces of ExcelFormer, XGboost, and Catboost are reported in Tables 3, 4, and 5, respectively. For ExcelFormer, we tune for just 50 iterations over the configurations with our proposed Mixup schemes (Mix tuning); for full tuning, we tune for a further 50 iterations using the hyper-parameters acquired from Mix tuning as the initialization.
Table 3. The hyper-parameter tuning space for ExcelFormer. The items marked with "*" are used in the Mix tuning, while all the items are used in the full tuning.
Hyper-parameter Distribution
# Layers L UniformInt[2, 5]
Representation size d {64, 128, 256}
# Heads {4, 8, 16, 32}
Residual dropout rate Uniform[0, 0.5]
Learning rate LogUniform[3 × 10−5 , 10−3 ]
Weight decay {0.0, LogUniform[10−6 , 10−3 ]}
(*) Mixup type {Feat-Mix, Hidden-Mix}
(*) α of Beta distribution Uniform[0.1, 3.0]
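For concreteness, the search space in Table 3 maps onto an Optuna objective roughly as sketched below. `build_and_train_excelformer` is a hypothetical placeholder that trains a model with the sampled configuration and returns its validation score; in Mix tuning only the two items marked with "*" would be searched, while the remaining items stay at their defaults.

```python
import optuna

def build_and_train_excelformer(cfg) -> float:
    # Hypothetical placeholder: train ExcelFormer with `cfg` on the training split
    # and return the validation score (accuracy/AUC, or negative RMSE for regression).
    return 0.0

def objective(trial: optuna.Trial) -> float:
    cfg = {
        # (*) items in Table 3: the only ones searched during Mix tuning.
        "mixup_type": trial.suggest_categorical("mixup_type", ["feat_mix", "hidden_mix"]),
        "mixup_alpha": trial.suggest_float("mixup_alpha", 0.1, 3.0),
        # Remaining items are additionally searched during full tuning.
        "n_layers": trial.suggest_int("n_layers", 2, 5),
        "d": trial.suggest_categorical("d", [64, 128, 256]),
        "n_heads": trial.suggest_categorical("n_heads", [4, 8, 16, 32]),
        "residual_dropout": trial.suggest_float("residual_dropout", 0.0, 0.5),
        "lr": trial.suggest_float("lr", 3e-5, 1e-3, log=True),
    }
    # Weight decay is either exactly 0.0 or log-uniform in [1e-6, 1e-3].
    if trial.suggest_categorical("use_weight_decay", [False, True]):
        cfg["weight_decay"] = trial.suggest_float("weight_decay", 1e-6, 1e-3, log=True)
    else:
        cfg["weight_decay"] = 0.0
    return build_and_train_excelformer(cfg)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)  # 50 trials per tuning stage
```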