0% found this document useful (0 votes)

55 views12 pages

Revisiting Deep Learning Models For Tabular Data

This document reviews deep learning models for tabular data, highlighting the lack of proper comparisons and effective baselines in existing literature. It introduces two architectures, a ResNet-like model and a simple adaptation of the Transformer (FT-Transformer), which demonstrate strong performance across various tasks. The study concludes that while these models can outperform others, there is still no universally superior solution compared to Gradient Boosted Decision Trees.

Uploaded by

Tăng Hà

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

55 views12 pages

Revisiting Deep Learning Models For Tabular Data

Uploaded by

Tăng Hà

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 12

Revisiting Deep Learning Models for Tabular Data

Yury Gorishniy∗†‡ Ivan Rubachev†♣ Valentin Khrulkov† Artem Babenko†♣

† Yandex, Russia
‡ Moscow Institute of Physics and Technology, Russia
♣ National Research University Higher School of Economics, Russia

Abstract
The existing literature on deep learning for tabular data proposes a wide range of
novel architectures and reports competitive results on various datasets. However,
the proposed models are usually not properly compared to each other and existing
works often use different benchmarks and experiment protocols. As a result,
it is unclear for both researchers and practitioners what models perform best.
Additionally, the field still lacks effective baselines, that is, the easy-to-use models
that provide competitive performance across different problems.
In this work, we perform an overview of the main families of DL architectures for
tabular data and raise the bar of baselines in tabular DL by identifying two simple
and powerful deep architectures. The first one is a ResNet-like architecture which
turns out to be a strong baseline that is often missing in prior works. The second
model is our simple adaptation of the Transformer architecture for tabular data,
which outperforms other solutions on most tasks. Both models are compared to
many existing architectures on a diverse set of tasks under the same training and
tuning protocols. We also compare the best DL models with Gradient Boosted
Decision Trees and conclude that there is still no universally superior solution. The
source code is available at https://github.com/yandex-research/rtdl.

1 Introduction
Due to the tremendous success of deep learning on such data domains as images, audio and texts
(Goodfellow et al., 2016), there has been a lot of research interest to extend this success to problems
with data stored in tabular format. In these problems, data points are represented as vectors of
heterogeneous features, which is typical for industrial applications and ML competitions, where
neural networks have a strong non-deep competitor in the form of GBDT (Chen and Guestrin, 2016;
Ke et al., 2017; Prokhorenkova et al., 2018). Along with potentially higher performance, using
deep learning for tabular data is appealing as it would allow constructing multi-modal pipelines for
problems, where only one part of the input is tabular, and other parts include images, audio and
other DL-friendly data. Such pipelines can then be trained end-to-end by gradient optimization for
all modalities. For these reasons, a large number of DL solutions were recently proposed, and new
models continue to emerge (Arik and Pfister, 2020; Badirli et al., 2020; Hazimeh et al., 2020; Huang
et al., 2020; Klambauer et al., 2017; Popov et al., 2020; Song et al., 2019; Wang et al., 2017, 2020).
Unfortunately, due to the lack of established benchmarks (such as ImageNet (Deng et al., 2009) for
computer vision or GLUE (Wang et al., 2019a) for NLP), existing papers use different datasets for
evaluation and proposed DL models are often not adequately compared to each other. Therefore, from
the current literature, it is unclear what DL model generally performs better than others and whether
GBDT is surpassed by DL models. Additionally, despite the large number of novel architectures,
the field still lacks simple and reliable solutions that allow achieving competitive performance
with moderate effort and provide stable performance across many tasks. In that regard, Multilayer
∗
Correspondence to: [email protected]

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

Perceptron (MLP) remains the main simple baseline for the field, however, it does not always
represent a significant challenge for other competitors.
The described problems impede the research process and make the observations from the papers not
conclusive enough. Therefore, we believe it is timely to review the recent developments from the
field and raise the bar of baselines in tabular DL. We start with a hypothesis that well-studied DL
architecture blocks may be underexplored in the context of tabular data and may be used to design
better baselines. Thus, we take inspiration from well-known battle-tested architectures from other
fields and obtain two simple models for tabular data. The first one is a ResNet-like architecture (He
et al., 2015) and the second one is FT-Transformer — our simple adaptation of the Transformer
architecture (Vaswani et al., 2017) for tabular data. Then, we compare these models with many
existing solutions on a diverse set of tasks under the same protocols of training and hyperparameters
tuning. First, we reveal that none of the considered DL models can consistently outperform the
ResNet-like model. Given its simplicity, it can serve as a strong baseline for future work. Second,
FT-Transformer demonstrates the best performance on most tasks and becomes a new powerful
solution for the field. Interestingly, FT-Transformer turns out to be a more universal architecture for
tabular data: it performs well on a wider range of tasks than the more “conventional” ResNet and
other DL models. Finally, we compare the best DL models to GBDT and conclude that there is still
no universally superior solution.
We summarize the contributions of our paper as follows:

1. We thoroughly evaluate the main models for tabular DL on a diverse set of tasks to investigate
their relative performance.
2. We demonstrate that a simple ResNet-like architecture is an effective baseline for tabular
DL, which was overlooked by existing literature. Given its simplicity, we recommend this
baseline for comparison in future tabular DL works.
3. We introduce FT-Transformer — a simple adaptation of the Transformer architecture for
tabular data that becomes a new powerful solution for the field. We observe that it is a more
universal architecture: it performs well on a wider range of tasks than other DL models.
4. We reveal that there is still no universally superior solution among GBDT and deep models.

2 Related work
The “shallow” state-of-the-art for problems with tabular data is currently ensembles of decision
trees, such as GBDT (Gradient Boosting Decision Tree) (Friedman, 2001), which are typically
the top-choice in various ML competitions. At the moment, there are several established GBDT
libraries, such as XGBoost (Chen and Guestrin, 2016), LightGBM (Ke et al., 2017), CatBoost
(Prokhorenkova et al., 2018), which are widely used by both ML researchers and practitioners. While
these implementations vary in detail, on most of the tasks, their performances do not differ much
(Prokhorenkova et al., 2018).
During several recent years, a large number of deep learning models for tabular data have been
developed (Arik and Pfister, 2020; Badirli et al., 2020; Hazimeh et al., 2020; Huang et al., 2020;
Klambauer et al., 2017; Popov et al., 2020; Song et al., 2019; Wang et al., 2017). Most of these
models can be roughly categorized into three groups, which we briefly describe below.
Differentiable trees. The first group of models is motivated by the strong performance of decision
tree ensembles for tabular data. Since decision trees are not differentiable and do not allow gradient
optimization, they cannot be used as a component for pipelines trained in the end-to-end fashion.
To address this issue, several works (Hazimeh et al., 2020; Kontschieder et al., 2015; Popov et al.,
2020; Yang et al., 2018) propose to “smooth” decision functions in the internal tree nodes to make the
overall tree function and tree routing differentiable. While the methods of this family can outperform
GBDT on some tasks (Popov et al., 2020), in our experiments, they do not consistently outperform
ResNet.
Attention-based models. Due to the ubiquitous success of attention-based architectures for different
domains (Dosovitskiy et al., 2021; Vaswani et al., 2017), several authors propose to employ attention-
like modules for tabular DL as well (Arik and Pfister, 2020; Huang et al., 2020; Song et al., 2019). In
our experiments, we show that the properly tuned ResNet outperforms the existing attention-based

2
models. Nevertheless, we identify an effective way to apply the Transformer architecture (Vaswani
et al., 2017) to tabular data: the resulting architecture outperforms ResNet on most of the tasks.
Explicit modeling of multiplicative interactions. In the literature on recommender systems and
click-through-rate prediction, several works criticize MLP since it is unsuitable for modeling mul-
tiplicative interactions between features (Beutel et al., 2018; Qin et al., 2021; Wang et al., 2017).
Inspired by this motivation, some works (Beutel et al., 2018; Wang et al., 2017, 2020) have proposed
different ways to incorporate feature products into MLP. In our experiments, however, we do not find
such methods to be superior to properly tuned baselines.
The literature also proposes some other architectural designs (Badirli et al., 2020; Klambauer et al.,
2017) that cannot be explicitly assigned to any of the groups above. Overall, the community has
developed a variety of models that are evaluated on different benchmarks and are rarely compared
to each other. Our work aims to establish a fair comparison of them and identify the solutions that
consistently provide high performance.

3 Models for tabular data problems

In this section, we describe the main deep architectures that we highlight in our work, as well
as the existing solutions included in the comparison. Since we argue that the field needs strong
easy-to-use baselines, we try to reuse well-established DL building blocks as much as possible when
designing ResNet (section 3.2) and FT-Transformer (section 3.3). We hope this approach will result
in conceptually familiar models that require less effort to achieve good performance. Additional
discussion and technical details for all the models are provided in supplementary.
Notation. In this work, we consider supervised learning problems. D={(xi , yi )}ni=1 denotes a
(num) (cat) (num) (cat)
dataset, where xi =(xi , xi ) ∈ X represents numerical xij and categorical xij features
of an object and yi ∈ Y denotes the corresponding object label. The total number of features is
denoted as k. The dataset is split into three disjoint subsets: D = Dtrain ∪ Dval ∪ Dtest , where
Dtrain is used for training, Dval is used for early stopping and hyperparameter tuning, and Dtest
is used for the final evaluation. We consider three types of tasks: binary classification Y = {0, 1},
multiclass classification Y = {1, . . . , C} and regression Y = R.

3.1 MLP

We formalize the “MLP” architecture in Equation 1.

MLP(x) = Linear (MLPBlock (. . . (MLPBlock(x))))

(1)
MLPBlock(x) = Dropout(ReLU(Linear(x)))

3.2 ResNet

We are aware of one attempt to design a ResNet-like baseline (Klambauer et al., 2017) where the
reported results were not competitive. However, given ResNet’s success story in computer vision (He
et al., 2015) and its recent achievements on NLP tasks (Sun and Iyyer, 2021), we give it a second try
and construct a simple variation of ResNet as described in Equation 2. The main building block is
simplified compared to the original architecture, and there is an almost clear path from the input to
output which we find to be beneficial for the optimization. Overall, we expect this architecture to
outperform MLP on tasks where deeper representations can be helpful.

ResNet(x) = Prediction (ResNetBlock (. . . (ResNetBlock (Linear(x)))))

ResNetBlock(x) = x + Dropout(Linear(Dropout(ReLU(Linear(BatchNorm(x)))))) (2)
Prediction(x) = Linear (ReLU (BatchNorm (x)))

3
3.3 FT-Transformer

In this section, we introduce FT-Transformer (Feature Tokenizer + Transformer) — a simple

adaptation of the Transformer architecture (Vaswani et al., 2017) for the tabular domain. Figure 1
demonstrates the main parts of FT-Transformer. In a nutshell, our model transforms all features
(categorical and numerical) to embeddings and applies a stack of Transformer layers to the em-
beddings. Thus, every Transformer layer operates on the feature level of one object. We compare
FT-Transformer to conceptually similar AutoInt in section 5.2.

T0 TL ŷ
x
<latexit sha1_base64="NRz7UU+K2ImEJjN88KAz5rtUl6k=">AAAB+nicbVC7SgNBFL0bXzG+opY2g0GwCrtR0DJoYxkxL0iWMDuZTYbMzC4zs0JY8wm22tuJrT9j65c42WyhiQcuHM65l3M5QcyZNq775RTW1jc2t4rbpZ3dvf2D8uFRW0eJIrRFIh6pboA15UzSlmGG026sKBYBp51gcjv3O49UaRbJppnG1Bd4JFnICDZWemgO3EG54lbdDGiVeDmpQI7GoPzdH0YkEVQawrHWPc+NjZ9iZRjhdFbqJ5rGmEzwiPYslVhQ7afZqzN0ZpUhCiNlRxqUqb8vUiy0norAbgpsxnrZm4v/eoFYSjbhtZ8yGSeGSrIIDhOOTITmPaAhU5QYPrUEE8Xs74iMscLE2LZKthRvuYJV0q5VvYtq7f6yUr/J6ynCCZzCOXhwBXW4gwa0gMAInuEFXp0n5815dz4WqwUnvzmGP3A+fwATmpQl</latexit>

<latexit sha1_base64="Dh41pn4g0OI9DEsQjNd4FNJH48k=">AAAB/nicbVDLSgNBEOyNrxhfUY9eBoPgKexGQY9BLx4jmAckS5idTJIhs7PLTK+wLAF/wavevYlXf8WrX+Ik2YMmFjQUVd1UU0EshUHX/XIKa+sbm1vF7dLO7t7+QfnwqGWiRDPeZJGMdCeghkuheBMFSt6JNadhIHk7mNzO/PYj10ZE6gHTmPshHSkxFIyildq9McUsnfbLFbfqzkFWiZeTCuRo9MvfvUHEkpArZJIa0/XcGP2MahRM8mmplxgeUzahI961VNGQGz+bvzslZ1YZkGGk7Sgkc/X3RUZDY9IwsJshxbFZ9mbiv14QLiXj8NrPhIoT5IotgoeJJBiRWRdkIDRnKFNLKNPC/k7YmGrK0DZWsqV4yxWsklat6l1Ua/eXlfpNXk8RTuAUzsGDK6jDHTSgCQwm8Awv8Oo8OW/Ou/OxWC04+c0x/IHz+QMUxpZ0</latexit>

<latexit sha1_base64="USlI0QVWwW7mMQKtQqhjoQhYsLQ=">AAAB/HicbVA9SwNBFHwXv2L8ilraLAbBKtxFQcugjYVFhFwSSI6wt9lLluzuHbt7QjjiX7DV3k5s/S+2/hI3yRWaOPBgmHmPeUyYcKaN6345hbX1jc2t4nZpZ3dv/6B8eNTScaoI9UnMY9UJsaacSeobZjjtJIpiEXLaDse3M7/9SJVmsWyaSUIDgYeSRYxgYyW/2c/up/1yxa26c6BV4uWkAjka/fJ3bxCTVFBpCMdadz03MUGGlWGE02mpl2qaYDLGQ9q1VGJBdZDNn52iM6sMUBQrO9Kgufr7IsNC64kI7abAZqSXvZn4rxeKpWQTXQcZk0lqqCSL4CjlyMRo1gQaMEWJ4RNLMFHM/o7ICCtMjO2rZEvxlitYJa1a1buo1h4uK/WbvJ4inMApnIMHV1CHO2iADwQYPMMLvDpPzpvz7nwsVgtOfnMMf+B8/gAT4pVN</latexit>

<latexit sha1_base64="ehueANtMlSBGnzPvtot/dsmURoc=">AAAB+HicbVDLTgJBEOzFF+IL9ehlIjHxRHbRRI9ELx4hkUcCGzI79MKE2dnNzKwRCV/gVe/ejFf/xqtf4gB7ULCSTipV3alOBYng2rjul5NbW9/Y3MpvF3Z29/YPiodHTR2nimGDxSJW7YBqFFxiw3AjsJ0opFEgsBWMbmd+6wGV5rG8N+ME/YgOJA85o8ZK9cdeseSW3TnIKvEyUoIMtV7xu9uPWRqhNExQrTuemxh/QpXhTOC00E01JpSN6AA7lkoaofYn80en5MwqfRLGyo40ZK7+vpjQSOtxFNjNiJqhXvZm4r9eEC0lm/Dan3CZpAYlWwSHqSAmJrMWSJ8rZEaMLaFMcfs7YUOqKDO2q4ItxVuuYJU0K2XvolypX5aqN1k9eTiBUzgHD66gCndQgwYwQHiGF3h1npw35935WKzmnOzmGP7A+fwBH76Tpg==</latexit>

<latexit sha1_base64="Er266MYslE46l94jwOZg6ifeTYI=">AAAB+HicbVDLSgNBEOyNrxhfUY9eFoPgKexGQY9BLx4TyAuSJcxOepMhM7PLzKwQQ77Aq969iVf/xqtf4iTZgyYWNBRV3VRTYcKZNp735eQ2Nre2d/K7hb39g8Oj4vFJS8epotikMY9VJyQaOZPYNMxw7CQKiQg5tsPx/dxvP6LSLJYNM0kwEGQoWcQoMVaqN/rFklf2FnDXiZ+REmSo9YvfvUFMU4HSUE607vpeYoIpUYZRjrNCL9WYEDomQ+xaKolAHUwXj87cC6sM3ChWdqRxF+rviykRWk9EaDcFMSO96s3Ff71QrCSb6DaYMpmkBiVdBkcpd03szltwB0whNXxiCaGK2d9dOiKKUGO7KthS/NUK1kmrUvavypX6dal6l9WThzM4h0vw4Qaq8AA1aAIFhGd4gVfnyXlz3p2P5WrOyW5O4Q+czx/m35OC</latexit>
T [CLS]
<latexit sha1_base64="ZD0P/xFYEhljWhZw+G1A2/+q8fw=">AAACBXicbVC7SgNBFJ31GeMramkzGASrsBsFLYNpLCwimgcka5idzCZDZmaXmbtiWFL7C7ba24mt32HrlzhJttDEAxcO59zLuZwgFtyA6345S8srq2vruY385tb2zm5hb79hokRTVqeRiHQrIIYJrlgdOAjWijUjMhCsGQyrE7/5wLThkbqDUcx8SfqKh5wSsNJ9B9gjAKTt6vWtP+4Wim7JnQIvEi8jRZSh1i18d3oRTSRTQAUxpu25Mfgp0cCpYON8JzEsJnRI+qxtqSKSGT+dfj3Gx1bp4TDSdhTgqfr7IiXSmJEM7KYkMDDz3kT81wvkXDKEF37KVZwAU3QWHCYCQ4QnleAe14yCGFlCqOb2d0wHRBMKtri8LcWbr2CRNMol77RUvjkrVi6zenLoEB2hE+Shc1RBV6iG6ogijZ7RC3p1npw35935mK0uOdnNAfoD5/MHQi+ZSw==</latexit>
[CLS]
<latexit sha1_base64="ZD0P/xFYEhljWhZw+G1A2/+q8fw=">AAACBXicbVC7SgNBFJ31GeMramkzGASrsBsFLYNpLCwimgcka5idzCZDZmaXmbtiWFL7C7ba24mt32HrlzhJttDEAxcO59zLuZwgFtyA6345S8srq2vruY385tb2zm5hb79hokRTVqeRiHQrIIYJrlgdOAjWijUjMhCsGQyrE7/5wLThkbqDUcx8SfqKh5wSsNJ9B9gjAKTt6vWtP+4Wim7JnQIvEi8jRZSh1i18d3oRTSRTQAUxpu25Mfgp0cCpYON8JzEsJnRI+qxtqSKSGT+dfj3Gx1bp4TDSdhTgqfr7IiXSmJEM7KYkMDDz3kT81wvkXDKEF37KVZwAU3QWHCYCQ4QnleAe14yCGFlCqOb2d0wHRBMKtri8LcWbr2CRNMol77RUvjkrVi6zenLoEB2hE+Shc1RBV6iG6ogijZ7RC3p1npw35935mK0uOdnNAfoD5/MHQi+ZSw==</latexit>

Predict
<latexit sha1_base64="afg15pp3NZ1NWyWmdfHkNCjRHEE=">AAACCXicbVDLSgNBEJyNrxhfqx69DAbBU9iNgh6DXjxGMA9IljA720mGzD6Y6Q2GJV/gL3jVuzfx6ld49UucJHvQxIKGoqqbaspPpNDoOF9WYW19Y3OruF3a2d3bP7APj5o6ThWHBo9lrNo+0yBFBA0UKKGdKGChL6Hlj25nfmsMSos4esBJAl7IBpHoC87QSD3b7iI8ImJWVxAIjtOeXXYqzhx0lbg5KZMc9Z793Q1inoYQIZdM647rJOhlTKHgEqalbqohYXzEBtAxNGIhaC+bfz6lZ0YJaD9WZiKkc/X3RcZCrSehbzZDhkO97M3Efz0/XErG/rWXiShJESK+CO6nkmJMZ7XQQCjgKCeGMK6E+Z3yIVOMoymvZEpxlytYJc1qxb2oVO8vy7WbvJ4iOSGn5Jy45IrUyB2pkwbhZEyeyQt5tZ6sN+vd+lisFqz85pj8gfX5A/SrmsE=</latexit>

Feature <latexit sha1_base64="rwpUdRkfJbsopsGX8Flo22FRYiQ=">AAACCXicbVDLSgNBEJz1GeNr1aOXwSB4CrtR0GNQEI8RzAOSJcxOepMhsw9meoNhyRf4C1717k28+hVe/RInyR40saChqOqmmvITKTQ6zpe1srq2vrFZ2Cpu7+zu7dsHhw0dp4pDnccyVi2faZAigjoKlNBKFLDQl9D0hzdTvzkCpUUcPeA4AS9k/UgEgjM0Ute2OwiPiJjdAsNUwaRrl5yyMwNdJm5OSiRHrWt/d3oxT0OIkEumddt1EvQyplBwCZNiJ9WQMD5kfWgbGrEQtJfNPp/QU6P0aBArMxHSmfr7ImOh1uPQN5shw4Fe9Kbiv54fLiRjcOVlIkpShIjPg4NUUozptBbaEwo4yrEhjCthfqd8wBTjaMormlLcxQqWSaNSds/LlfuLUvU6r6dAjskJOSMuuSRVckdqpE44GZFn8kJerSfrzXq3PuarK1Z+c0T+wPr8AfYKmsI=</latexit>

Transformer
Tokenizer
<latexit sha1_base64="CCtNxMB+t4G4yc9qxfWQSZtMPMY=">AAACC3icbVDLSsNAFJ34rPUV69JNsAiuSlIFXRbduKzQF7ShTKa37dDJJMzcSGvoJ/gLbnXvTtz6EW79EqdtFtp64MLhnHs5lxPEgmt03S9rbX1jc2s7t5Pf3ds/OLSPCg0dJYpBnUUiUq2AahBcQh05CmjFCmgYCGgGo9uZ33wApXkkaziJwQ/pQPI+ZxSN1LULHYQxIqa1aASSP4Kadu2iW3LncFaJl5EiyVDt2t+dXsSSECQyQbVue26MfkoVciZgmu8kGmLKRnQAbUMlDUH76fz3qXNmlJ7Tj5QZic5c/X2R0lDrSRiYzZDiUC97M/FfLwiXkrF/7adcxgmCZIvgfiIcjJxZMU6PK2AoJoZQprj53WFDqihDU1/elOItV7BKGuWSd1Eq318WKzdZPTlyQk7JOfHIFamQO1IldcLImDyTF/JqPVlv1rv1sVhds7KbY/IH1ucPyEabxQ==</latexit>
<latexit sha1_base64="AWQBENphD40p0dPT6tWGUljC5co=">AAACDXicbVDLSgNBEJyNrxhf6+PmZTEInsJuFPQY9OIxQl6QhDA76SRDZmaXmV4xLvkGf8Gr3r2JV7/Bq1/i5HHQxIKGoqqbaiqMBTfo+19OZmV1bX0ju5nb2t7Z3XP3D2omSjSDKotEpBshNSC4gipyFNCINVAZCqiHw5uJX78HbXikKjiKoS1pX/EeZxSt1HGPWggPiJhWNFWmF2kJetxx837Bn8JbJsGc5Mkc5Y773epGLJGgkAlqTDPwY2ynVCNnAsa5VmIgpmxI+9C0VFEJpp1Ovx97p1bpejbajkJvqv6+SKk0ZiRDuykpDsyiNxH/9UK5kIy9q3bKVZwgKDYL7iXCw8ibVON1uQaGYmQJZZrb3z02oJoytAXmbCnBYgXLpFYsBOeF4t1FvnQ9rydLjskJOSMBuSQlckvKpEoYeSTP5IW8Ok/Om/PufMxWM8785pD8gfP5A3d1nLE=</latexit>

Figure 1: The FT-Transformer architecture. Firstly, Feature Tokenizer transforms features to embed-
dings. The embeddings are then processed by the Transformer module and the final representation of
the [CLS] token is used for prediction.

x(num) W (num) b(num) Ti

[ [ {
<latexit sha1_base64="mttgI9sS3iXJbfSzMYYNA/9R/oI=">AAACAHicbVC7SgNBFL0bXzG+opY2g0GITdiNgpYBG8sI5iHJGmYns8mQmdllZlYMSxp/wVZ7O7H1T2z9EifJFpp44MLhnHs5lxPEnGnjul9ObmV1bX0jv1nY2t7Z3SvuHzR1lChCGyTikWoHWFPOJG0YZjhtx4piEXDaCkZXU7/1QJVmkbw145j6Ag8kCxnBxkp3j/dpWSbidNIrltyKOwNaJl5GSpCh3it+d/sRSQSVhnCsdcdzY+OnWBlGOJ0UuommMSYjPKAdSyUWVPvp7OEJOrFKH4WRsiMNmqm/L1IstB6LwG4KbIZ60ZuK/3qBWEg24aWfMhknhkoyDw4TjkyEpm2gPlOUGD62BBPF7O+IDLHCxNjOCrYUb7GCZdKsVryzSvXmvFRzs3rycATHUAYPLqAG11CHBhAQ8Awv8Oo8OW/Ou/MxX8052c0h/IHz+QP+WZbb</latexit> <latexit sha1_base64="Bt+fns0sB6ifYUPOPY//MPQiXiU=">AAAB+nicbVC7SgNBFL0bXzG+opY2g0GwCrtR0DJoYxkxL0iWMDuZTYbMzC4zs0JY8wm22tuJrT9j65c42WyhiQcuHM65l3M5QcyZNq775RTW1jc2t4rbpZ3dvf2D8uFRW0eJIrRFIh6pboA15UzSlmGG026sKBYBp51gcjv3O49UaRbJppnG1Bd4JFnICDZWemgO2KBccatuBrRKvJxUIEdjUP7uDyOSCCoN4VjrnufGxk+xMoxwOiv1E01jTCZ4RHuWSiyo9tPs1Rk6s8oQhZGyIw3K1N8XKRZaT0VgNwU2Y73szcV/vUAsJZvw2k+ZjBNDJVkEhwlHJkLzHtCQKUoMn1qCiWL2d0TGWGFibFslW4q3XMEqadeq3kW1dn9Zqd/k9RThBE7hHDy4gjrcQQNaQGAEz/ACr86T8+a8Ox+L1YKT3xzDHzifP22OlF4=</latexit>

<latexit sha1_base64="5UdFhh/uIOfb9VPonxmlL1BUNZI=">AAACAHicbVC7SgNBFL0bXzG+opY2g0GITdhNBC0DNpYRzEOSNcxOZpMhM7PLzKwQljT+gq32dmLrn9j6JU6SLTTxwIXDOfdyLieIOdPGdb+c3Nr6xuZWfruws7u3f1A8PGrpKFGENknEI9UJsKacSdo0zHDaiRXFIuC0HYyvZ377kSrNInlnJjH1BR5KFjKCjZXu2w9pWSbifNovltyKOwdaJV5GSpCh0S9+9wYRSQSVhnCsdddzY+OnWBlGOJ0WeommMSZjPKRdSyUWVPvp/OEpOrPKAIWRsiMNmqu/L1IstJ6IwG4KbEZ62ZuJ/3qBWEo24ZWfMhknhkqyCA4TjkyEZm2gAVOUGD6xBBPF7O+IjLDCxNjOCrYUb7mCVdKqVrxapXp7Uaq7WT15OIFTKIMHl1CHG2hAEwgIeIYXeHWenDfn3flYrOac7OYY/sD5/AHJPZa6</latexit>

<latexit sha1_base64="supoUfk0uRGDmkDkd5C82Fny8ls=">AAACAHicbVC7SgNBFL0bXzG+opY2g0GITdhNBC0DNpYRzEOSNcxOZpMhM7PLzKwQljT+gq32dmLrn9j6JU6SLTTxwIXDOfdyLieIOdPGdb+c3Nr6xuZWfruws7u3f1A8PGrpKFGENknEI9UJsKacSdo0zHDaiRXFIuC0HYyvZ377kSrNInlnJjH1BR5KFjKCjZXug4e0LBNxPu0XS27FnQOtEi8jJcjQ6Be/e4OIJIJKQzjWuuu5sfFTrAwjnE4LvUTTGJMxHtKupRILqv10/vAUnVllgMJI2ZEGzdXfFykWWk9EYDcFNiO97M3Ef71ALCWb8MpPmYwTQyVZBIcJRyZCszbQgClKDJ9Ygoli9ndERlhhYmxnBVuKt1zBKmlVK16tUr29KNXdrJ48nMAplMGDS6jDDTSgCQQEPMMLvDpPzpvz7nwsVnNOdnMMf+B8/gDa8ZbF</latexit>

0.8
<latexit sha1_base64="lMD0gR5e8K8efa7H8kks9I4WqjQ=">AAAB+nicbVC7SgNBFL0bXzG+opY2g0GwWnajYMqgjWVE84BkCbOT2WTIzOwyMyuENZ9gq72d2Poztn6Jk2QLTTxw4XDOvZzLCRPOtPG8L6ewtr6xuVXcLu3s7u0flA+PWjpOFaFNEvNYdUKsKWeSNg0znHYSRbEIOW2H45uZ336kSrNYPphJQgOBh5JFjGBjpXvPrfXLFc/15kCrxM9JBXI0+uXv3iAmqaDSEI617vpeYoIMK8MIp9NSL9U0wWSMh7RrqcSC6iCbvzpFZ1YZoChWdqRBc/X3RYaF1hMR2k2BzUgvezPxXy8US8kmqgUZk0lqqCSL4CjlyMRo1gMaMEWJ4RNLMFHM/o7ICCtMjG2rZEvxlytYJa2q61+41bvLSv06r6cIJ3AK5+DDFdThFhrQBAJDeIYXeHWenDfn3flYrBac/OYY/sD5/AGZjpPY</latexit>

⇥ +
⇥
<latexit sha1_base64="2KnxdAUbbP9fFOkK5C3uPUW4Iho=">AAAB+HicbVDLSgNBEOyNrxhfUY9eFoMgCGE3CnoMevGYgHlAsoTZSW8yZGZ2mZkVYsgXeNW7N/Hq33j1S5wke9DEgoaiqptqKkw408bzvpzc2vrG5lZ+u7Czu7d/UDw8auo4VRQbNOaxaodEI2cSG4YZju1EIREhx1Y4upv5rUdUmsXywYwTDAQZSBYxSoyV6he9Yskre3O4q8TPSAky1HrF724/pqlAaSgnWnd8LzHBhCjDKMdpoZtqTAgdkQF2LJVEoA4m80en7plV+m4UKzvSuHP198WECK3HIrSbgpihXvZm4r9eKJaSTXQTTJhMUoOSLoKjlLsmdmctuH2mkBo+toRQxezvLh0SRaixXRVsKf5yBaukWSn7l+VK/apUvc3qycMJnMI5+HANVbiHGjSAAsIzvMCr8+S8Oe/Ox2I152Q3x/AHzucPpiuTWQ==</latexit>

0.1
<latexit sha1_base64="UHP2rx9JrjCBYyeSBwtGEQUyORk=">AAAB+nicbVC7SgNBFL0bXzG+opY2g0GwWnZjQMugjWVE84BkCbOT2WTIzOwyMyuENZ9gq72d2Poztn6Jk2QLTTxw4XDOvZzLCRPOtPG8L6ewtr6xuVXcLu3s7u0flA+PWjpOFaFNEvNYdUKsKWeSNg0znHYSRbEIOW2H45uZ336kSrNYPphJQgOBh5JFjGBjpXvP9fvliud6c6BV4uekAjka/fJ3bxCTVFBpCMdad30vMUGGlWGE02mpl2qaYDLGQ9q1VGJBdZDNX52iM6sMUBQrO9Kgufr7IsNC64kI7abAZqSXvZn4rxeKpWQTXQUZk0lqqCSL4CjlyMRo1gMaMEWJ4RNLMFHM/o7ICCtMjG2rZEvxlytYJa2q61+41btapX6d11OEEziFc/DhEupwCw1oAoEhPMMLvDpPzpvz7nwsVgtOfnMMf+B8/gCOgpPR</latexit>

<latexit sha1_base64="hCVpvvS/tPbELM3KKoiYDAExazg=">AAAB/XicbVDLSgNBEOz1GeMr6tHLYBA8hd0o6DHoxWME84BkCbOT2WTMzO4y0yuEJfgLXvXuTbz6LV79EifJHjSxoKGo6qaaChIpDLrul7Oyura+sVnYKm7v7O7tlw4OmyZONeMNFstYtwNquBQRb6BAyduJ5lQFkreC0c3Ubz1ybUQc3eM44b6ig0iEglG0UrOLQnHTK5XdijsDWSZeTsqQo94rfXf7MUsVj5BJakzHcxP0M6pRMMknxW5qeELZiA54x9KI2hA/m307IadW6ZMw1nYiJDP190VGlTFjFdhNRXFoFr2p+K8XqIVkDK/8TERJijxi8+AwlQRjMq2C9IXmDOXYEsq0sL8TNqSaMrSFFW0p3mIFy6RZrXjnlerdRbl2nddTgGM4gTPw4BJqcAt1aACDB3iGF3h1npw35935mK+uOPnNEfyB8/kDEGCV3g==</latexit>
+
<latexit sha1_base64="2KnxdAUbbP9fFOkK5C3uPUW4Iho=">AAAB+HicbVDLSgNBEOyNrxhfUY9eFoMgCGE3CnoMevGYgHlAsoTZSW8yZGZ2mZkVYsgXeNW7N/Hq33j1S5wke9DEgoaiqptqKkw408bzvpzc2vrG5lZ+u7Czu7d/UDw8auo4VRQbNOaxaodEI2cSG4YZju1EIREhx1Y4upv5rUdUmsXywYwTDAQZSBYxSoyV6he9Yskre3O4q8TPSAky1HrF724/pqlAaSgnWnd8LzHBhCjDKMdpoZtqTAgdkQF2LJVEoA4m80en7plV+m4UKzvSuHP198WECK3HIrSbgpihXvZm4r9eKJaSTXQTTJhMUoOSLoKjlLsmdmctuH2mkBo+toRQxezvLh0SRaixXRVsKf5yBaukWSn7l+VK/apUvc3qycMJnMI5+HANVbiHGjSAAsIzvMCr8+S8Oe/Ox2I152Q3x/AHzucPpiuTWQ==</latexit>
Add
<latexit sha1_base64="8AGrbPTAelS0oWlrprd/K5pMJOo=">AAACA3icbVC7SgNBFJ2NrxhfUUubwSBYhd0oaBm1sYxgHpBdwuzsJBkyM7vM3BXDktJfsNXeTmz9EFu/xEmyhSYeuHA4517O5YSJ4AZc98sprKyurW8UN0tb2zu7e+X9g5aJU01Zk8Yi1p2QGCa4Yk3gIFgn0YzIULB2OLqZ+u0Hpg2P1T2MExZIMlC8zykBK/k+sEcAyK6iaNIrV9yqOwNeJl5OKihHo1f+9qOYppIpoIIY0/XcBIKMaOBUsEnJTw1LCB2RAetaqohkJshmP0/wiVUi3I+1HQV4pv6+yIg0ZixDuykJDM2iNxX/9UK5kAz9yyDjKkmBKToP7qcCQ4ynheCIa0ZBjC0hVHP7O6ZDogkFW1vJluItVrBMWrWqd1at3Z1X6td5PUV0hI7RKfLQBaqjW9RATURRgp7RC3p1npw35935mK8WnPzmEP2B8/kDCWOYpg==</latexit>

1.2 <latexit sha1_base64="FMnrTS4Kc6nFN8MOmMgnHody/QM=">AAAB+nicbVC7SgNBFL0bXzG+opY2g0GwWnZjQMugjWVE84BkCbOT2WTIzOwyMyuENZ9gq72d2Poztn6Jk2QLTTxw4XDOvZzLCRPOtPG8L6ewtr6xuVXcLu3s7u0flA+PWjpOFaFNEvNYdUKsKWeSNg0znHYSRbEIOW2H45uZ336kSrNYPphJQgOBh5JFjGBjpXvfrfbLFc/15kCrxM9JBXI0+uXv3iAmqaDSEI617vpeYoIMK8MIp9NSL9U0wWSMh7RrqcSC6iCbvzpFZ1YZoChWdqRBc/X3RYaF1hMR2k2BzUgvezPxXy8US8kmugoyJpPUUEkWwVHKkYnRrAc0YIoSwyeWYKKY/R2REVaYGNtWyZbiL1ewSlpV179wq3e1Sv06r6cIJ3AK5+DDJdThFhrQBAJDeIYXeHWenDfn3flYrBac/OYY/sD5/AGRrJPT</latexit>

⇥<latexit sha1_base64="hCVpvvS/tPbELM3KKoiYDAExazg=">AAAB/XicbVDLSgNBEOz1GeMr6tHLYBA8hd0o6DHoxWME84BkCbOT2WTMzO4y0yuEJfgLXvXuTbz6LV79EifJHjSxoKGo6qaaChIpDLrul7Oyura+sVnYKm7v7O7tlw4OmyZONeMNFstYtwNquBQRb6BAyduJ5lQFkreC0c3Ubz1ybUQc3eM44b6ig0iEglG0UrOLQnHTK5XdijsDWSZeTsqQo94rfXf7MUsVj5BJakzHcxP0M6pRMMknxW5qeELZiA54x9KI2hA/m307IadW6ZMw1nYiJDP190VGlTFjFdhNRXFoFr2p+K8XqIVkDK/8TERJijxi8+AwlQRjMq2C9IXmDOXYEsq0sL8TNqSaMrSFFW0p3mIFy6RZrXjnlerdRbl2nddTgGM4gTPw4BJqcAt1aACDB3iGF3h1npw35935mK+uOPnNEfyB8/kDEGCV3g==</latexit>
+
<latexit sha1_base64="2KnxdAUbbP9fFOkK5C3uPUW4Iho=">AAAB+HicbVDLSgNBEOyNrxhfUY9eFoMgCGE3CnoMevGYgHlAsoTZSW8yZGZ2mZkVYsgXeNW7N/Hq33j1S5wke9DEgoaiqptqKkw408bzvpzc2vrG5lZ+u7Czu7d/UDw8auo4VRQbNOaxaodEI2cSG4YZju1EIREhx1Y4upv5rUdUmsXywYwTDAQZSBYxSoyV6he9Yskre3O4q8TPSAky1HrF724/pqlAaSgnWnd8LzHBhCjDKMdpoZtqTAgdkQF2LJVEoA4m80en7plV+m4UKzvSuHP198WECK3HIrSbgpihXvZm4r9eKJaSTXQTTJhMUoOSLoKjlLsmdmctuH2mkBo+toRQxezvLh0SRaixXRVsKf5yBaukWSn7l+VK/apUvc3qycMJnMI5+HANVbiHGjSAAsIzvMCr8+S8Oe/Ox2I152Q3x/AHzucPpiuTWQ==</latexit>

<latexit sha1_base64="Er266MYslE46l94jwOZg6ifeTYI=">AAAB+HicbVDLSgNBEOyNrxhfUY9eFoPgKexGQY9BLx4TyAuSJcxOepMhM7PLzKwQQ77Aq969iVf/xqtf4iTZgyYWNBRV3VRTYcKZNp735eQ2Nre2d/K7hb39g8Oj4vFJS8epotikMY9VJyQaOZPYNMxw7CQKiQg5tsPx/dxvP6LSLJYNM0kwEGQoWcQoMVaqN/rFklf2FnDXiZ+REmSo9YvfvUFMU4HSUE607vpeYoIpUYZRjrNCL9WYEDomQ+xaKolAHUwXj87cC6sM3ChWdqRxF+rviykRWk9EaDcFMSO96s3Ff71QrCSb6DaYMpmkBiVdBkcpd03szltwB0whNXxiCaGK2d9dOiKKUGO7KthS/NUK1kmrUvavypX6dal6l9WThzM4h0vw4Qaq8AA1aAIFhGd4gVfnyXlz3p2P5WrOyW5O4Q+czx/m35OC</latexit>
T <latexit sha1_base64="T5I//+tkz1f90mnelN6ef3xPXgc=">AAAB+XicbVDLSgMxFL2pr1pfVZdugkVwVWaqYJcFNy6r2Ae0Q8mkmTY0yQxJRihD/8Ct7t2JW7/GrV9i2s5CWw9cOJxzL+dywkRwYz3vCxU2Nre2d4q7pb39g8Oj8vFJ28SppqxFYxHrbkgME1yxluVWsG6iGZGhYJ1wcjv3O09MGx6rRztNWCDJSPGIU2Kd9NDPBuWKV/UWwOvEz0kFcjQH5e/+MKapZMpSQYzp+V5ig4xoy6lgs1I/NSwhdEJGrOeoIpKZIFt8OsMXThniKNZulMUL9fdFRqQxUxm6TUns2Kx6c/FfL5QryTaqBxlXSWqZosvgKBXYxnheAx5yzagVU0cI1dz9jumYaEKtK6vkSvFXK1gn7VrVv6rW7q8rjXpeTxHO4BwuwYcbaMAdNKEFFCJ4hhd4RRl6Q+/oY7laQPnNKfwB+vwB1/iUBQ==</latexit>
Feed
Forward
<latexit sha1_base64="uu4pAS0iKHPD05rqbBaCrazt+qQ=">AAACBHicbVDLSgNBEJz1GeMr6tHLYBA8hd0o6DEoiMcI5gHJEmZne5Mhsw9nesWw5OoveNW7N/Hqf3j1S5wke9DEgoaiqptqykuk0GjbX9bS8srq2npho7i5tb2zW9rbb+o4VRwaPJaxantMgxQRNFCghHaigIWehJY3vJr4rQdQWsTRHY4ScEPWj0QgOEMjuV2ER0TMrgH8ca9Utiv2FHSRODkpkxz1Xum768c8DSFCLpnWHcdO0M2YQsEljIvdVEPC+JD1oWNoxELQbjZ9ekyPjeLTIFZmIqRT9fdFxkKtR6FnNkOGAz3vTcR/PS+cS8bgws1ElKQIEZ8FB6mkGNNJI9QXCjjKkSGMK2F+p3zAFONoeiuaUpz5ChZJs1pxTivV27Ny7TKvp0AOyRE5IQ45JzVyQ+qkQTi5J8/khbxaT9ab9W59zFaXrPzmgPyB9fkD3FOZGw==</latexit>

{
<latexit sha1_base64="QHisn6MigSAPGmh+4xLpH/q3tjc=">AAACCXicbVDLSsNAFJ3UV62vqEs3wSK4KkkVdFkUxGUF+4A2lMlk0g6dyYSZm2oJ/QJ/wa3u3Ylbv8KtX+K0zUJbD1w4nHMv53KChDMNrvtlFVZW19Y3ipulre2d3T17/6CpZaoIbRDJpWoHWFPOYtoABpy2E0WxCDhtBcPrqd8aUaWZjO9hnFBf4H7MIkYwGKln212gjwCQ3Uj1gFU46dllt+LO4CwTLydllKPes7+7oSSpoDEQjrXueG4CfoYVMMLppNRNNU0wGeI+7RgaY0G1n80+nzgnRgmdSCozMTgz9fdFhoXWYxGYTYFhoBe9qfivF4iFZIgu/YzFSQo0JvPgKOUOSGdaixMyRQnwsSGYKGZ+d8gAK0zAlFcypXiLFSyTZrXinVWqd+fl2lVeTxEdoWN0ijx0gWroFtVRAxE0Qs/oBb1aT9ab9W59zFcLVn5ziP7A+vwBBK2ayw==</latexit>

<latexit sha1_base64="b3zYGO+ofUb1C5CJVf7A6auKO9o=">AAAB+HicbVDLSgNBEOz1GeMr6tHLYhA8hd0o6DHoxWMC5gHJEmYnvcmQmdllZlaIS77Aq969iVf/xqtf4iTZgyYWNBRV3VRTYcKZNp735aytb2xubRd2irt7+weHpaPjlo5TRbFJYx6rTkg0ciaxaZjh2EkUEhFybIfju5nffkSlWSwfzCTBQJChZBGjxFip0e2Xyl7Fm8NdJX5OypCj3i999wYxTQVKQznRuut7iQkyogyjHKfFXqoxIXRMhti1VBKBOsjmj07dc6sM3ChWdqRx5+rvi4wIrScitJuCmJFe9mbiv14olpJNdBNkTCapQUkXwVHKXRO7sxbcAVNIDZ9YQqhi9neXjogi1NiuirYUf7mCVdKqVvzLSrVxVa7d5vUU4BTO4AJ8uIYa3EMdmkAB4Rle4NV5ct6cd+djsbrm5Dcn8AfO5w/x65OJ</latexit> >tixetal/<JO56x/w5OfA8ncD5mrbsjd+dc6tc5VN4elR4BAkmdME3aYIu8JA4OTB4UUv5d7aVxVrSLzvVqKdVCm7fUYriuiN1igojXen9ihqQY9ZDINVAcbxs7ORXKHVwXkUQpaCTkNBdNJplo41vibm9eFJmCuJticSrIw4ivr+5xRqdWhC3Ms6cd70jmjsOBKBV1ithMRXIxoqXFfKHjygoykQi7tuuRnzQKVQTxYw999i3jCpyO5XJdN8mF7lyX2e0piFxjGBZhCJQBTCzfwSWlSkffn5ujfIbyFhEUkE2hjZaxaic0gkTr6xYJFbRT5oljPapHew+7tri2dRbux2btya537pNZKcYTRV3VRBNWYygZTi4ftqx/fVi969qA77SIalZlldmQmcvnYmEJHg5CMWxoHD6o0dh8AhYLHt6rMeG1zOEBNgSLDVbciH+BAAA>"=o9OKua6A7fVJC5C1bUfo+OGYz3b"=46esab_1ahs tixetal<

(cat) Norm
W1
<latexit sha1_base64="/hGSn+eNOnEKOJo4YFijInUK3ng=">AAACBXicbVC7SgNBFJ31GeMramkzGASrsBsFLYM2VhLBPCBZw+xkNhkyj2XmrhiW1P6CrfZ2Yut32Pol7iZbaOKBC4dz7uVcThAJbsF1v5yl5ZXVtfXCRnFza3tnt7S337Q6NpQ1qBbatANimeCKNYCDYO3IMCIDwVrB6CrzWw/MWK7VHYwj5ksyUDzklEAq3XeBPQJAcqONnBR7pbJbcafAi8TLSRnlqPdK392+prFkCqgg1nY8NwI/IQY4FWxS7MaWRYSOyIB1UqqIZNZPpl9P8HGq9HGoTToK8FT9fZEQae1YBummJDC0814m/usFci4Zwgs/4SqKgSk6Cw5jgUHjrBLc54ZREOOUEGp4+jumQ2IIhbS4rBRvvoJF0qxWvNNK9fasXLvM6ymgQ3SETpCHzlENXaM6aiCKDHpGL+jVeXLenHfnY7a65OQ3B+gPnM8fVlSZVw==</latexit>

(cat) (cat)
x1 <latexit sha1_base64="DmJz/8TmsLFraTAUOsFhdM3L8Kg=">AAACBHicbZA9SwNBEIbn/IzxK2ppsxiE2IS7KGgZsLGMYD4gOcPeZi9Zsrt37u4J4bjWv2CrvZ3Y+j9s/SVukis08YWBh3dmmOENYs60cd0vZ2V1bX1js7BV3N7Z3dsvHRy2dJQoQpsk4pHqBFhTziRtGmY47cSKYhFw2g7G19N++5EqzSJ5ZyYx9QUeShYygo21/PZ9WrF4lvVTL+uXym7VnQktg5dDGXI1+qXv3iAiiaDSEI617npubPwUK8MIp1mxl2gaYzLGQ9q1KLGg2k9nT2fo1DoDFEbKljRo5v7eSLHQeiICOymwGenF3tT8txeIhcsmvPJTJuPEUEnmh8OEIxOhaSJowBQlhk8sYKKY/R2REVaYGJtb0YbiLUawDK1a1Tuv1m4vynU3j6cAx3ACFfDgEupwAw1oAoEHeIYXeHWenDfn3fmYj644+c4R/JHz+QOxCJhS</latexit>

b1
k Fi

[ [
A
<latexit sha1_base64="QgyrtO/u2p9ElRRvbchzHKmq1Es=">AAACAnicbZA9SwNBEIbn4leMX1FLm8UgxCbcRUHLoI1lBPMByRn2NnvJkt29Y3dPDEc6/4Kt9nZi6x+x9Ze4Sa7QxBcGHt6ZYYY3iDnTxnW/nNzK6tr6Rn6zsLW9s7tX3D9o6ihRhDZIxCPVDrCmnEnaMMxw2o4VxSLgtBWMrqf91gNVmkXyzoxj6gs8kCxkBBtrdR7v07LF00nP6xVLbsWdCS2Dl0EJMtV7xe9uPyKJoNIQjrXueG5s/BQrwwink0I30TTGZIQHtGNRYkG1n85enqAT6/RRGClb0qCZ+3sjxULrsQjspMBmqBd7U/PfXiAWLpvw0k+ZjBNDJZkfDhOOTISmeaA+U5QYPraAiWL2d0SGWGFibGoFG4q3GMEyNKsV76xSvT0v1a6yePJwBMdQBg8uoAY3UIcGEIjgGV7g1Xly3px352M+mnOynUP4I+fzBxH+l3k=</latexit> <latexit sha1_base64="pc8bsy797mWrCpIxMzJgbKVefU4=">AAACAnicbVC7SgNBFL3rM8ZX1NJmMAixCbtR0DJoYxnBPCBZw+xkkgyZxzIzK4Qlnb9gq72d2Pojtn6Jk2QLTTxw4dxz7uVeThRzZqzvf3krq2vrG5u5rfz2zu7efuHgsGFUogmtE8WVbkXYUM4krVtmOW3FmmIRcdqMRjdTv/lItWFK3ttxTEOBB5L1GcHWSe2oGzykJdecTbqFol/2Z0DLJMhIETLUuoXvTk+RRFBpCcfGtAM/tmGKtWWE00m+kxgaYzLCA9p2VGJBTZjOXp6gU6f0UF9pV9Kimfp7I8XCmLGI3KTAdmgWvan4rxeJhcu2fxWmTMaJpZLMD/cTjqxC0zxQj2lKLB87golm7ndEhlhjYl1qeRdKsBjBMmlUysF5uXJ3UaxeZ/Hk4BhOoAQBXEIVbqEGdSCg4Ble4NV78t68d+9jPrriZTtH8Afe5w/tHZdj</latexit>

B +
<latexit sha1_base64="R209j2X5lW3M7o+Picn9JeY7Cvg=">AAAB+nicbVBNSwMxFHxbv2r9qnr0EiyCp7JbBT0WBfFY0W0L7VKyabYNTbJLkhXK2p/gVe/exKt/xqu/xLTdg7YOPBhm3mMeEyacaeO6X05hZXVtfaO4Wdra3tndK+8fNHWcKkJ9EvNYtUOsKWeS+oYZTtuJoliEnLbC0fXUbz1SpVksH8w4oYHAA8kiRrCx0v1Nj/XKFbfqzoCWiZeTCuRo9Mrf3X5MUkGlIRxr3fHcxAQZVoYRTielbqppgskID2jHUokF1UE2e3WCTqzSR1Gs7EiDZurviwwLrccitJsCm6Fe9Kbiv14oFpJNdBlkTCapoZLMg6OUIxOjaQ+ozxQlho8twUQx+zsiQ6wwMbatki3FW6xgmTRrVe+sWrs7r9Sv8nqKcATHcAoeXEAdbqEBPhAYwDO8wKvz5Lw5787HfLXg5DeH8AfO5w9XWpRQ</latexit>

Add
<latexit sha1_base64="JNZElOSOYEo2LotYBfGqEcU7W+M=">AAAB+HicbVDLSgNBEOz1GeMr6tHLYhA8hd0o6DHoxWMC5gHJEmYnvcmQmdllZlaIS77Aq969iVf/xqtf4iTZgyYWNBRV3VRTYcKZNp735aytb2xubRd2irt7+weHpaPjlo5TRbFJYx6rTkg0ciaxaZjh2EkUEhFybIfju5nffkSlWSwfzCTBQJChZBGjxFipMe6Xyl7Fm8NdJX5OypCj3i999wYxTQVKQznRuut7iQkyogyjHKfFXqoxIXRMhti1VBKBOsjmj07dc6sM3ChWdqRx5+rvi4wIrScitJuCmJFe9mbiv14olpJNdBNkTCapQUkXwVHKXRO7sxbcAVNIDZ9YQqhi9neXjogi1NiuirYUf7mCVdKqVvzLSrVxVa7d5vUU4BTO4AJ8uIYa3EMdmkAB4Rle4NV5ct6cd+djsbrm5Dcn8AfO5w8LOpOZ</latexit>

<latexit sha1_base64="T5I//+tkz1f90mnelN6ef3xPXgc=">AAAB+XicbVDLSgMxFL2pr1pfVZdugkVwVWaqYJcFNy6r2Ae0Q8mkmTY0yQxJRihD/8Ct7t2JW7/GrV9i2s5CWw9cOJxzL+dywkRwYz3vCxU2Nre2d4q7pb39g8Oj8vFJ28SppqxFYxHrbkgME1yxluVWsG6iGZGhYJ1wcjv3O09MGx6rRztNWCDJSPGIU2Kd9NDPBuWKV/UWwOvEz0kFcjQH5e/+MKapZMpSQYzp+V5ig4xoy6lgs1I/NSwhdEJGrOeoIpKZIFt8OsMXThniKNZulMUL9fdFRqQxUxm6TUns2Kx6c/FfL5QryTaqBxlXSWqZosvgKBXYxnheAx5yzagVU0cI1dz9jumYaEKtK6vkSvFXK1gn7VrVv6rW7q8rjXpeTxHO4BwuwYcbaMAdNKEFFCJ4hhd4RRl6Q+/oY7laQPnNKfwB+vwB1/iUBQ==</latexit>

{
<latexit sha1_base64="nv53o5L/K+7SaQVN82Zv0vxp7aA=">AAAB+HicbVDLTgJBEOzFF+IL9ehlIjHxRHbRRI+oF4+QyCOBDZkdemHCzO5mZtYECV/gVe/ejFf/xqtf4gB7ULCSTipV3alOBYng2rjul5NbW9/Y3MpvF3Z29/YPiodHTR2nimGDxSJW7YBqFDzChuFGYDtRSGUgsBWM7mZ+6xGV5nH0YMYJ+pIOIh5yRo2V6je9Ysktu3OQVeJlpAQZar3id7cfs1RiZJigWnc8NzH+hCrDmcBpoZtqTCgb0QF2LI2oRO1P5o9OyZlV+iSMlZ3IkLn6+2JCpdZjGdhNSc1QL3sz8V8vkEvJJrz2JzxKUoMRWwSHqSAmJrMWSJ8rZEaMLaFMcfs7YUOqKDO2q4ItxVuuYJU0K2XvolypX5aqt1k9eTiBUzgHD66gCvdQgwYwQHiGF3h1npw35935WKzmnOzmGP7A+fwByOOTbw==</latexit>

<latexit sha1_base64="8AGrbPTAelS0oWlrprd/K5pMJOo=">AAACA3icbVC7SgNBFJ2NrxhfUUubwSBYhd0oaBm1sYxgHpBdwuzsJBkyM7vM3BXDktJfsNXeTmz9EFu/xEmyhSYeuHA4517O5YSJ4AZc98sprKyurW8UN0tb2zu7e+X9g5aJU01Zk8Yi1p2QGCa4Yk3gIFgn0YzIULB2OLqZ+u0Hpg2P1T2MExZIMlC8zykBK/k+sEcAyK6iaNIrV9yqOwNeJl5OKihHo1f+9qOYppIpoIIY0/XcBIKMaOBUsEnJTw1LCB2RAetaqohkJshmP0/wiVUi3I+1HQV4pv6+yIg0ZixDuykJDM2iNxX/9UK5kAz9yyDjKkmBKToP7qcCQ4ynheCIa0ZBjC0hVHP7O6ZDogkFW1vJluItVrBMWrWqd1at3Z1X6td5PUV0hI7RKfLQBaqjW9RATURRgp7RC3p1npw35935mK8WnPzmEP2B8/kDCWOYpg==</latexit>

B
<latexit sha1_base64="QwGPqiurbeq7I17Iqz6AGyXq/G0=">AAAB+HicbVDLTgJBEOzFF+IL9ehlIzHxRHbRRI8ELx4hkUcCGzI79MKEmdnNzKwJEr7Aq969Ga/+jVe/xAH2oGAlnVSqulOdChPOtPG8Lye3sbm1vZPfLeztHxweFY9PWjpOFcUmjXmsOiHRyJnEpmGGYydRSETIsR2O7+Z++xGVZrF8MJMEA0GGkkWMEmOlRq1fLHllbwF3nfgZKUGGer/43RvENBUoDeVE667vJSaYEmUY5Tgr9FKNCaFjMsSupZII1MF08ejMvbDKwI1iZUcad6H+vpgSofVEhHZTEDPSq95c/NcLxUqyiW6DKZNJalDSZXCUctfE7rwFd8AUUsMnlhCqmP3dpSOiCDW2q4ItxV+tYJ20KmX/qlxpXJeqtayePJzBOVyCDzdQhXuoQxMoIDzDC7w6T86b8+58LFdzTnZzCn/gfP4AyneTcA==</latexit>

<latexit sha1_base64="2KnxdAUbbP9fFOkK5C3uPUW4Iho=">AAAB+HicbVDLSgNBEOyNrxhfUY9eFoMgCGE3CnoMevGYgHlAsoTZSW8yZGZ2mZkVYsgXeNW7N/Hq33j1S5wke9DEgoaiqptqKkw408bzvpzc2vrG5lZ+u7Czu7d/UDw8auo4VRQbNOaxaodEI2cSG4YZju1EIREhx1Y4upv5rUdUmsXywYwTDAQZSBYxSoyV6he9Yskre3O4q8TPSAky1HrF724/pqlAaSgnWnd8LzHBhCjDKMdpoZtqTAgdkQF2LJVEoA4m80en7plV+m4UKzvSuHP198WECK3HIrSbgpihXvZm4r9eKJaSTXQTTJhMUoOSLoKjlLsmdmctuH2mkBo+toRQxezvLh0SRaixXRVsKf5yBaukWSn7l+VK/apUvc3qycMJnMI5+HANVbiHGjSAAsIzvMCr8+S8Oe/Ox2I152Q3x/AHzucPpiuTWQ==</latexit>

<latexit sha1_base64="QwGPqiurbeq7I17Iqz6AGyXq/G0=">AAAB+HicbVDLTgJBEOzFF+IL9ehlIzHxRHbRRI8ELx4hkUcCGzI79MKEmdnNzKwJEr7Aq969Ga/+jVe/xAH2oGAlnVSqulOdChPOtPG8Lye3sbm1vZPfLeztHxweFY9PWjpOFcUmjXmsOiHRyJnEpmGGYydRSETIsR2O7+Z++xGVZrF8MJMEA0GGkkWMEmOlRq1fLHllbwF3nfgZKUGGer/43RvENBUoDeVE667vJSaYEmUY5Tgr9FKNCaFjMsSupZII1MF08ejMvbDKwI1iZUcad6H+vpgSofVEhHZTEDPSq95c/NcLxUqyiW6DKZNJalDSZXCUctfE7rwFd8AUUsMnlhCqmP3dpSOiCDW2q4ItxV+tYJ20KmX/qlxpXJeqtayePJzBOVyCDzdQhXuoQxMoIDzDC7w6T86b8+58LFdzTnZzCn/gfP4AyneTcA==</latexit>

Multi-Head <latexit sha1_base64="c1OfdORp1ZajtiDYjPAU7BhzjEA=">AAACDXicbVDLSsNAFJ3UV62v+Ni5CRbBjSWpgi6LbroRKtgHtKFMJjft0MmDmRuxhn6Dv+BW9+7Erd/g1i8xabPQ1gMXDufcy7kcJxJcoWl+aYWl5ZXVteJ6aWNza3tH391rqTCWDJosFKHsOFSB4AE0kaOATiSB+o6AtjO6zvz2PUjFw+AOxxHYPh0E3OOMYir19YMewgMiJjexQH5aB+pOSn29bFbMKYxFYuWkTHI0+vp3zw1Z7EOATFClupYZoZ1QiZwJmJR6sYKIshEdQDelAfVB2cn0+4lxnCqu4YUynQCNqfr7IqG+UmPfSTd9ikM172Xiv57jzyWjd2knPIhihIDNgr1YGBgaWTWGyyUwFOOUUCZ5+rvBhlRShmmBWSnWfAWLpFWtWGeV6u15uXaV11Mkh+SInBCLXJAaqZMGaRJGHskzeSGv2pP2pr1rH7PVgpbf7JM/0D5/ABUwm9I=</latexit>

Self-Attention
<latexit sha1_base64="rGI9i03ASytyFYmCpyjOC7/5Hig=">AAACEXicbVC7TsMwFHXKq5RXgAWJxaJCYqFKChKMBRbGIuhDaqPKcZ3Wqp1E9g2iispP8AussLMhVr6AlS/BaTtAy5EsHZ1zH77HjwXX4DhfVm5hcWl5Jb9aWFvf2Nyyt3fqOkoUZTUaiUg1faKZ4CGrAQfBmrFiRPqCNfzBVeY37pnSPArvYBgzT5JeyANOCRipY++1gT0AQHrLRHB8AcDCzBgVOnbRKTlj4HniTkkRTVHt2N/tbkQTaQZQQbRuuU4MXkoUcCrYqNBONIsJHZAeaxkaEsm0l44vGOFDo3RxECnzQsBj9XdHSqTWQ+mbSkmgr2e9TPzX8+XMZgjOvZSHcWIupZPFQSIwRDiLB3e5YhTE0BBCFTd/x7RPFKFgQsxCcWcjmCf1csk9KZVvTouVy2k8ebSPDtARctEZqqBrVEU1RNEjekYv6NV6st6sd+tjUpqzpj276A+szx+UNJ29</latexit>

(cat)
<latexit sha1_base64="T5I//+tkz1f90mnelN6ef3xPXgc=">AAAB+XicbVDLSgMxFL2pr1pfVZdugkVwVWaqYJcFNy6r2Ae0Q8mkmTY0yQxJRihD/8Ct7t2JW7/GrV9i2s5CWw9cOJxzL+dywkRwYz3vCxU2Nre2d4q7pb39g8Oj8vFJ28SppqxFYxHrbkgME1yxluVWsG6iGZGhYJ1wcjv3O09MGx6rRztNWCDJSPGIU2Kd9NDPBuWKV/UWwOvEz0kFcjQH5e/+MKapZMpSQYzp+V5ig4xoy6lgs1I/NSwhdEJGrOeoIpKZIFt8OsMXThniKNZulMUL9fdFRqQxUxm6TUns2Kx6c/FfL5QryTaqBxlXSWqZosvgKBXYxnheAx5yzagVU0cI1dz9jumYaEKtK6vkSvFXK1gn7VrVv6rW7q8rjXpeTxHO4BwuwYcbaMAdNKEFFCJ4hhd4RRl6Q+/oY7laQPnNKfwB+vwB1/iUBQ==</latexit>
{

W2
Norm
<latexit sha1_base64="T5I//+tkz1f90mnelN6ef3xPXgc=">AAAB+XicbVDLSgMxFL2pr1pfVZdugkVwVWaqYJcFNy6r2Ae0Q8mkmTY0yQxJRihD/8Ct7t2JW7/GrV9i2s5CWw9cOJxzL+dywkRwYz3vCxU2Nre2d4q7pb39g8Oj8vFJ28SppqxFYxHrbkgME1yxluVWsG6iGZGhYJ1wcjv3O09MGx6rRztNWCDJSPGIU2Kd9NDPBuWKV/UWwOvEz0kFcjQH5e/+MKapZMpSQYzp+V5ig4xoy6lgs1I/NSwhdEJGrOeoIpKZIFt8OsMXThniKNZulMUL9fdFRqQxUxm6TUns2Kx6c/FfL5QryTaqBxlXSWqZosvgKBXYxnheAx5yzagVU0cI1dz9jumYaEKtK6vkSvFXK1gn7VrVv6rW7q8rjXpeTxHO4BwuwYcbaMAdNKEFFCJ4hhd4RRl6Q+/oY7laQPnNKfwB+vwB1/iUBQ==</latexit>

(cat) (cat) d
<latexit sha1_base64="r06vOZ/g3fOQ2cojjOr/nRnSHEQ=">AAACBHicbZA9SwNBEIbn/IzxK2ppsxiE2IS7KGgZsLGMYD4gOcPeZi9Zsrt37u4J4bjWv2CrvZ3Y+j9s/SVukis08YWBh3dmmOENYs60cd0vZ2V1bX1js7BV3N7Z3dsvHRy2dJQoQpsk4pHqBFhTziRtGmY47cSKYhFw2g7G19N++5EqzSJ5ZyYx9QUeShYygo21/PZ9WrF4lvXTWtYvld2qOxNaBi+HMuRq9EvfvUFEEkGlIRxr3fXc2PgpVoYRTrNiL9E0xmSMh7RrUWJBtZ/Ons7QqXUGKIyULWnQzP29kWKh9UQEdlJgM9KLvan5by8QC5dNeOWnTMaJoZLMD4cJRyZC00TQgClKDJ9YwEQx+zsiI6wwMTa3og3FW4xgGVq1qnderd1elOtuHk8BjuEEKuDBJdThBhrQBAIP8Awv8Oo8OW/Ou/MxH11x8p0j+CPn8weynZhT</latexit>

x2 b2
<latexit sha1_base64="/hGSn+eNOnEKOJo4YFijInUK3ng=">AAACBXicbVC7SgNBFJ31GeMramkzGASrsBsFLYM2VhLBPCBZw+xkNhkyj2XmrhiW1P6CrfZ2Yut32Pol7iZbaOKBC4dz7uVcThAJbsF1v5yl5ZXVtfXCRnFza3tnt7S337Q6NpQ1qBbatANimeCKNYCDYO3IMCIDwVrB6CrzWw/MWK7VHYwj5ksyUDzklEAq3XeBPQJAcqONnBR7pbJbcafAi8TLSRnlqPdK392+prFkCqgg1nY8NwI/IQY4FWxS7MaWRYSOyIB1UqqIZNZPpl9P8HGq9HGoTToK8FT9fZEQae1YBummJDC0814m/usFci4Zwgs/4SqKgSk6Cw5jgUHjrBLc54ZREOOUEGp4+jumQ2IIhbS4rBRvvoJF0qxWvNNK9fasXLvM6ymgQ3SETpCHzlENXaM6aiCKDHpGL+jVeXLenHfnY7a65OQ3B+gPnM8fVlSZVw==</latexit>

<latexit sha1_base64="zS1hBakp+h1/7KshCTKeV9nKij0=">AAAB+HicbVDLSgNBEOz1GeMr6tHLYhA8hd0o6DHoxWMC5gHJEmZne5MhM7PLzKwQQ77Aq969iVf/xqtf4iTZgyYWNBRV3VRTYcqZNp735aytb2xubRd2irt7+weHpaPjlk4yRbFJE56oTkg0ciaxaZjh2EkVEhFybIeju5nffkSlWSIfzDjFQJCBZDGjxFipEfVLZa/izeGuEj8nZchR75e+e1FCM4HSUE607vpeaoIJUYZRjtNiL9OYEjoiA+xaKolAHUzmj07dc6tEbpwoO9K4c/X3xYQIrccitJuCmKFe9mbiv14olpJNfBNMmEwzg5IuguOMuyZxZy24EVNIDR9bQqhi9neXDoki1NiuirYUf7mCVdKqVvzLSrVxVa7d5vUU4BTO4AJ8uIYa3EMdmkAB4Rle4NV5ct6cd+djsbrm5Dcn8AfO5w8ALpOS</latexit>

<latexit sha1_base64="FTiak1c7vEOBR1Xm49FlOYbli34=">AAACAnicbZA9SwNBEIbn4leMX1FLm8UgxCbcRUHLoI1lBPMByRn2NnvJkt3bY3dPDEc6/4Kt9nZi6x+x9Ze4Sa7QxBcGHt6ZYYY3iDnTxnW/nNzK6tr6Rn6zsLW9s7tX3D9oapkoQhtEcqnaAdaUs4g2DDOctmNFsQg4bQWj62m/9UCVZjK6M+OY+gIPIhYygo21Oo/3adni6aRX7RVLbsWdCS2Dl0EJMtV7xe9uX5JE0MgQjrXueG5s/BQrwwink0I30TTGZIQHtGMxwoJqP529PEEn1umjUCpbkUEz9/dGioXWYxHYSYHNUC/2pua/vUAsXDbhpZ+yKE4Mjcj8cJhwZCSa5oH6TFFi+NgCJorZ3xEZYoWJsakVbCjeYgTL0KxWvLNK9fa8VLvK4snDERxDGTy4gBrcQB0aQEDCM7zAq/PkvDnvzsd8NOdkO4fwR87nDxOSl3o=</latexit>

<latexit sha1_base64="XRdiGQpfjg256FRSJ9xKeMuxlqA=">AAACAnicbVC7SgNBFL3rM8ZX1NJmMAixCbtR0DJoYxnBPCCJYXYymwyZxzIzK4Qlnb9gq72d2Pojtn6Jk2QLTTxw4dxz7uVeThhzZqzvf3krq2vrG5u5rfz2zu7efuHgsGFUogmtE8WVboXYUM4krVtmOW3FmmIRctoMRzdTv/lItWFK3ttxTLsCDySLGMHWSe2wV3lIS645m/QKRb/sz4CWSZCRImSo9Qrfnb4iiaDSEo6NaQd+bLsp1pYRTif5TmJojMkID2jbUYkFNd109vIEnTqljyKlXUmLZurvjRQLY8YidJMC26FZ9Kbiv14oFi7b6KqbMhknlkoyPxwlHFmFpnmgPtOUWD52BBPN3O+IDLHGxLrU8i6UYDGCZdKolIPzcuXuoli9zuLJwTGcQAkCuIQq3EIN6kBAwTO8wKv35L15797HfHTFy3aO4A+8zx/uuZdk</latexit>

+
Ti
<latexit sha1_base64="2KnxdAUbbP9fFOkK5C3uPUW4Iho=">AAAB+HicbVDLSgNBEOyNrxhfUY9eFoMgCGE3CnoMevGYgHlAsoTZSW8yZGZ2mZkVYsgXeNW7N/Hq33j1S5wke9DEgoaiqptqKkw408bzvpzc2vrG5lZ+u7Czu7d/UDw8auo4VRQbNOaxaodEI2cSG4YZju1EIREhx1Y4upv5rUdUmsXywYwTDAQZSBYxSoyV6he9Yskre3O4q8TPSAky1HrF724/pqlAaSgnWnd8LzHBhCjDKMdpoZtqTAgdkQF2LJVEoA4m80en7plV+m4UKzvSuHP198WECK3HIrSbgpihXvZm4r9eKJaSTXQTTJhMUoOSLoKjlLsmdmctuH2mkBo+toRQxezvLh0SRaixXRVsKf5yBaukWSn7l+VK/apUvc3qycMJnMI5+HANVbiHGjSAAsIzvMCr8+S8Oe/Ox2I152Q3x/AHzucPpiuTWQ==</latexit>

<latexit sha1_base64="b3zYGO+ofUb1C5CJVf7A6auKO9o=">AAAB+HicbVDLSgNBEOz1GeMr6tHLYhA8hd0o6DHoxWMC5gHJEmYnvcmQmdllZlaIS77Aq969iVf/xqtf4iTZgyYWNBRV3VRTYcKZNp735aytb2xubRd2irt7+weHpaPjlo5TRbFJYx6rTkg0ciaxaZjh2EkUEhFybIfju5nffkSlWSwfzCTBQJChZBGjxFip0e2Xyl7Fm8NdJX5OypCj3i999wYxTQVKQznRuut7iQkyogyjHKfFXqoxIXRMhti1VBKBOsjmj07dc6sM3ChWdqRx5+rvi4wIrScitJuCmJFe9mbiv14olpJNdBNkTCapQUkXwVHKXRO7sxbcAVNIDZ9YQqhi9neXjogi1NiuirYUf7mCVdKqVvzLSrVxVa7d5vUU4BTO4AJ8uIYa3EMdmkAB4Rle4NV5ct6cd+djsbrm5Dcn8AfO5w/x65OJ</latexit> >tixetal/<JO56x/w5OfA8ncD5mrbsjd+dc6tc5VN4elR4BAkmdME3aYIu8JA4OTB4UUv5d7aVxVrSLzvVqKdVCm7fUYriuiN1igojXen9ihqQY9ZDINVAcbxs7ORXKHVwXkUQpaCTkNBdNJplo41vibm9eFJmCuJticSrIw4ivr+5xRqdWhC3Ms6cd70jmjsOBKBV1ithMRXIxoqXFfKHjygoykQi7tuuRnzQKVQTxYw999i3jCpyO5XJdN8mF7lyX2e0piFxjGBZhCJQBTCzfwSWlSkffn5ujfIbyFhEUkE2hjZaxaic0gkTr6xYJFbRT5oljPapHew+7tri2dRbux2btya537pNZKcYTRV3VRBNWYygZTi4ftqx/fVi969qA77SIalZlldmQmcvnYmEJHg5CMWxoHD6o0dh8AhYLHt6rMeG1zOEBNgSLDVbciH+BAAA>"=o9OKua6A7fVJC5C1bUfo+OGYz3b"=46esab_1ahs tixetal<

<latexit sha1_base64="Z1JpHjN3dJSV8kDso3x5UDepr2Q=">AAAB/nicbVC7SgNBFL0bXzG+opY2g0GwMexGQcugjWWEvCBZwuxkNhkyM7vMzAphCfgLttrbia2/YuuXONlsoYkHLhzOuZdzOUHMmTau++UU1tY3NreK26Wd3b39g/LhUVtHiSK0RSIeqW6ANeVM0pZhhtNurCgWAaedYHI39zuPVGkWyaaZxtQXeCRZyAg2Vuo0Bym78GaDcsWtuhnQKvFyUoEcjUH5uz+MSCKoNIRjrXueGxs/xcowwums1E80jTGZ4BHtWSqxoNpPs3dn6MwqQxRGyo40KFN/X6RYaD0Vgd0U2Iz1sjcX//UCsZRswhs/ZTJODJVkERwmHJkIzbtAQ6YoMXxqCSaK2d8RGWOFibGNlWwp3nIFq6Rdq3qX1drDVaV+m9dThBM4hXPw4BrqcA8NaAGBCTzDC7w6T86b8+58LFYLTn5zDH/gfP4AI92V3A==</latexit>
1
(a)
<latexit sha1_base64="/rd31GxZKKRhDczZebPH7MoTAy8=">AAAB+nicbVC7SgNBFL0bXzG+opY2g0GITdhNBFMGbCwjmgckS5idzCZDZmaXmVkhrPkEW+3txNafsfVLnCRbaOKBC4dz7uVcThBzpo3rfjm5jc2t7Z38bmFv/+DwqHh80tZRoghtkYhHqhtgTTmTtGWY4bQbK4pFwGknmNzM/c4jVZpF8sFMY+oLPJIsZAQbK92X8eWgWHIr7gJonXgZKUGG5qD43R9GJBFUGsKx1j3PjY2fYmUY4XRW6CeaxphM8Ij2LJVYUO2ni1dn6MIqQxRGyo40aKH+vkix0HoqArspsBnrVW8u/usFYiXZhHU/ZTJODJVkGRwmHJkIzXtAQ6YoMXxqCSaK2d8RGWOFibFtFWwp3moF66RdrXi1SvXuqtSoZ/Xk4QzOoQweXEMDbqEJLSAwgmd4gVfnyXlz3p2P5WrOyW5O4Q+czx/C35Pq</latexit>
(b)
<latexit sha1_base64="J3PGJklANf2C0M8cIciYYCc1luY=">AAAB+nicbVC7SgNBFL0bXzG+opY2g0GITdhNBFMGbCwjmgckS5idzCZDZmaXmVkhrPkEW+3txNafsfVLnCRbaOKBC4dz7uVcThBzpo3rfjm5jc2t7Z38bmFv/+DwqHh80tZRoghtkYhHqhtgTTmTtGWY4bQbK4pFwGknmNzM/c4jVZpF8sFMY+oLPJIsZAQbK92Xg8tBseRW3AXQOvEyUoIMzUHxuz+MSCKoNIRjrXueGxs/xcowwums0E80jTGZ4BHtWSqxoNpPF6/O0IVVhiiMlB1p0EL9fZFiofVUBHZTYDPWq95c/NcLxEqyCet+ymScGCrJMjhMODIRmveAhkxRYvjUEkwUs78jMsYKE2PbKthSvNUK1km7WvFqlerdValRz+rJwxmcQxk8uIYG3EITWkBgBM/wAq/Ok/PmvDsfy9Wck92cwh84nz/EdJPr</latexit>

Figure 2: (a) Feature Tokenizer; in the example, there are three numerical and two categorical features;
(b) One Transformer layer.

Feature Tokenizer. The Feature Tokenizer module (see Figure 2) transforms the input features x to
embeddings T ∈ Rk×d . The embedding for a given feature xj is computed as follows:
Tj = bj + fj (xj ) ∈ Rd fj : Xj → Rd .
(num)
where bj is the j-th feature bias, fj is implemented as the element-wise multiplication with the
(num) (cat) (cat)
vector ∈ R and Wj d
fj is implemented as the lookup table Wj ∈ RSj ×d for categorical
features. Overall:
(num) (num) (num) (num)
Tj = bj + xj · Wj ∈ Rd ,
(cat) (cat) (cat)
Tj = bj + eTj Wj ∈ Rd ,
h i
(num) (num) (cat) (cat)
T = stack T1 , . . . , Tk(num) , T1 , . . . , Tk(cat) ∈ Rk×d .
where eTj is a one-hot vector for the corresponding categorical feature.
Transformer. At this stage, the embedding of the [CLS] token (or “classification token”, or “output
token”, see Devlin et al. (2019)) is appended to T and L Transformer layers F1 , . . . , FL are applied:
T0 = stack [[CLS], T ] Ti = Fi (Ti−1 ).

4
We use the PreNorm variant for easier optimization (Wang et al., 2019b), see Figure 2. In the PreNorm
setting, we also found it to be necessary to remove the first normalization from the first Transformer
layer to achieve good performance. See the original paper (Vaswani et al., 2017) for the background
on Multi-Head Self-Attention (MHSA) and the Feed Forward module. See supplementary for details
such as activations, placement of normalizations and dropout modules (Srivastava et al., 2014).
Prediction. The final representation of the [CLS] token is used for prediction:
ŷ = Linear(ReLU(LayerNorm(TL[CLS] ))).

Limitations. FT-Transformer requires more resources (both hardware and time) for training than
simple models such as ResNet and may not be easily scaled to datasets when the number of features
is “too large” (it is determined by the available hardware and time budget). Consequently, widespread
usage of FT-Transformer for solving tabular data problems can lead to greater CO2 emissions
produced by ML pipelines, since tabular data problems are ubiquitous. The main cause of the
described problem lies in the quadratic complexity of the vanilla MHSA with respect to the number
of features. However, the issue can be alleviated by using efficient approximations of MHSA (Tay
et al., 2020). Additionally, it is still possible to distill FT-Transformer into simpler architectures for
better inference performance. We report training times and the used hardware in supplementary.

3.4 Other models

In this section, we list the existing models designed specifically for tabular data that we include in the
comparison.
• SNN (Klambauer et al., 2017). An MLP-like architecture with the SELU activation that
enables training deeper models.
• NODE (Popov et al., 2020). A differentiable ensemble of oblivious decision trees.
• TabNet (Arik and Pfister, 2020). A recurrent architecture that alternates dynamical reweigh-
ing of features and conventional feed-forward modules.
• GrowNet (Badirli et al., 2020). Gradient boosted weak MLPs. The official implementation
supports only classification and regression problems.
• DCN V2 (Wang et al., 2020). Consists of an MLP-like module and the feature crossing
module (a combination of linear layers and multiplications).
• AutoInt (Song et al., 2019). Transforms features to embeddings and applies a series of
attention-based transformations to the embeddings.
• XGBoost (Chen and Guestrin, 2016). One of the most popular GBDT implementations.
• CatBoost (Prokhorenkova et al., 2018). GBDT implementation that uses oblivious decision
trees (Lou and Obukhov, 2017) as weak learners.

4 Experiments
In this section, we compare DL models to each other as well as to GBDT. Note that in the main text,
we report only the key results. In supplementary, we provide: (1) the results for all models on all
datasets; (2) information on hardware; (3) training times for ResNet and FT-Transformer.

4.1 Scope of the comparison

In our work, we focus on the relative performance of different architectures and do not employ various
model-agnostic DL practices, such as pretraining, additional loss functions, data augmentation,
distillation, learning rate warmup, learning rate decay and many others. While these practices can
potentially improve the performance, our goal is to evaluate the impact of inductive biases imposed
by the different model architectures.

4.2 Datasets

We use a diverse set of eleven public datasets (see supplementary for the detailed description). For
each dataset, there is exactly one train-validation-test split, so all algorithms use the same splits. The
datasets include: California Housing (CA, real estate data, Kelley Pace and Barry (1997)), Adult
(AD, income estimation, Kohavi (1996)), Helena (HE, anonymized dataset, Guyon et al. (2019)),

5
Jannis (JA, anonymized dataset, Guyon et al. (2019)), Higgs (HI, simulated physical particles, Baldi
et al. (2014); we use the version with 98K samples available at the OpenML repository (Vanschoren
et al., 2014)), ALOI (AL, images, Geusebroek et al. (2005)), Epsilon (EP, simulated physics experi-
ments), Year (YE, audio features, Bertin-Mahieux et al. (2011)), Covertype (CO, forest characteristics,
Blackard and Dean. (2000)), Yahoo (YA, search queries, Chapelle and Chang (2011)), Microsoft (MI,
search queries, Qin and Liu (2013)). We follow the pointwise approach to learning-to-rank and treat
ranking problems (Microsoft, Yahoo) as regression problems. The dataset properties are summarized
in Table 1.

Table 1: Dataset properties. Notation: “RMSE” ~ root-mean-square error, “Acc.” ~ accuracy.

CA AD HE JA HI AL EP YE CO YA MI
#objects 20640 48842 65196 83733 98050 108000 500000 515345 581012 709877 1200192
#num. features 8 6 27 54 28 128 2000 90 54 699 136
#cat. features 0 8 0 0 0 0 0 0 0 0 0
metric RMSE Acc. Acc. Acc. Acc. Acc. Acc. RMSE Acc. RMSE RMSE
#classes – 2 100 4 2 1000 2 – 7 – –

4.3 Implementation details

Data preprocessing. Data preprocessing is known to be vital for DL models. For each dataset, the
same preprocessing was used for all deep models for a fair comparison. By default, we used the
quantile transformation from the Scikit-learn library (Pedregosa et al., 2011). We apply standard-
ization (mean subtraction and scaling) to Helena and ALOI. The latter one represents image data,
and standardization is a common practice in computer vision. On the Epsilon dataset, we observed
preprocessing to be detrimental to deep models’ performance, so we use the raw features on this
dataset. We apply standardization to regression targets for all algorithms.
Tuning. For every dataset, we carefully tune each model’s hyperparameters. The best hyperparam-
eters are the ones that perform best on the validation set, so the test set is never used for tuning.
For most algorithms, we use the Optuna library (Akiba et al., 2019) to run Bayesian optimization
(the Tree-Structured Parzen Estimator algorithm), which is reported to be superior to random search
(Turner et al., 2021). For the rest, we iterate over predefined sets of configurations recommended by
corresponding papers. We provide parameter spaces and grids in supplementary. We set the budget
for Optuna-based tuning in terms of iterations and provide additional analysis on setting the budget
in terms of time in supplementary.
Evaluation. For each tuned configuration, we run 15 experiments with different random seeds and
report the performance on the test set. For some algorithms, we also report the performance of default
configurations without hyperparameter tuning.
Ensembles. For each model, on each dataset, we obtain three ensembles by splitting the 15 single
models into three disjoint groups of equal size and averaging predictions of single models within
each group.
Neural networks. We minimize cross-entropy for classification problems and mean squared error
for regression problems. For TabNet and GrowNet, we follow the original implementations and use
the Adam optimizer (Kingma and Ba, 2017). For all other algorithms, we use the AdamW optimizer
(Loshchilov and Hutter, 2019). We do not apply learning rate schedules. For each dataset, we use
a predefined batch size for all algorithms unless special instructions on batch sizes are given in
the corresponding papers (see supplementary). We continue training until there are patience + 1
consecutive epochs without improvements on the validation set; we set patience = 16 for all
algorithms.
Categorical features. For XGBoost, we use one-hot encoding. For CatBoost, we employ the built-in
support for categorical features. For Neural Networks, we use embeddings of the same dimensionality
for all categorical features.

6
Table 2: Results for DL models. The metric values averaged over 15 random seeds are reported. See
supplementary for standard deviations. For each dataset, top results are in bold. “Top” means “the
gap between this result and the result with the best score is not statistically significant”. For each
dataset, ranks are calculated by sorting the reported scores; the “rank” column reports the average
rank across all datasets. Notation: FT-T ~ FT-Transformer, ↓ ~ RMSE, ↑ ~ accuracy

CA ↓ AD ↑ HE ↑ JA ↑ HI ↑ AL ↑ EP ↑ YE ↓ CO ↑ YA ↓ MI ↓ rank (std)
TabNet 0.510 0.850 0.378 0.723 0.719 0.954 0.8896 8.909 0.957 0.823 0.751 7.5 (2.0)
SNN 0.493 0.854 0.373 0.719 0.722 0.954 0.8975 8.895 0.961 0.761 0.751 6.4 (1.4)
AutoInt 0.474 0.859 0.372 0.721 0.725 0.945 0.8949 8.882 0.934 0.768 0.750 5.7 (2.3)
GrowNet 0.487 0.857 – – 0.722 – 0.8970 8.827 – 0.765 0.751 5.7 (2.2)
MLP 0.499 0.852 0.383 0.719 0.723 0.954 0.8977 8.853 0.962 0.757 0.747 4.8 (1.9)
DCN2 0.484 0.853 0.385 0.716 0.723 0.955 0.8977 8.890 0.965 0.757 0.749 4.7 (2.0)
NODE 0.464 0.858 0.359 0.727 0.726 0.918 0.8958 8.784 0.958 0.753 0.745 3.9 (2.8)
ResNet 0.486 0.854 0.396 0.728 0.727 0.963 0.8969 8.846 0.964 0.757 0.748 3.3 (1.8)
FT-T 0.459 0.859 0.391 0.732 0.729 0.960 0.8982 8.855 0.970 0.756 0.746 1.8 (1.2)

4.4 Comparing DL models

Table 2 reports the results for deep architectures.

The main takeaways:
• MLP is still a good sanity check
• ResNet turns out to be an effective baseline that none of the competitors can consistently
outperform.
• FT-Transformer performs best on most tasks and becomes a new powerful solution for the
field.
• Tuning makes simple models such as MLP and ResNet competitive, so we recommend
tuning baselines when possible. Luckily, today, it is more approachable with libraries such
as Optuna (Akiba et al., 2019).
Among other models, NODE (Popov et al., 2020) is the only one that demonstrates high performance
on several tasks. However, it is still inferior to ResNet on six datasets (Helena, Jannis, Higgs, ALOI,
Epsilon, Covertype), while being a more complex solution. Moreover, it is not a truly “single” model;
in fact, it often contains significantly more parameters than ResNet and FT-Transformer and has an
ensemble-like structure. We illustrate that by comparing ensembles in Table 3. The results indicate
that FT-Transformer and ResNet benefit more from ensembling; in this regime, FT-Transformer
outperforms NODE and the gap between ResNet and NODE is significantly reduced. Nevertheless,
NODE remains a prominent solution among tree-based approaches.

Table 3: Results for ensembles of DL models with the highest ranks (see Table 2). For each
model-dataset pair, the metric value averaged over three ensembles is reported. See supplementary
for standard deviations. Depending on the dataset, the highest accuracy or the lowest RMSE is
in bold. Due to the limited precision, some different values are represented with the same figures.
Notation: ↓ ~ RMSE, ↑ ~ accuracy.

CA ↓ AD ↑ HE ↑ JA ↑ HI ↑ AL ↑ EP ↑ YE ↓ CO ↑ YA ↓ MI ↓
NODE 0.461 0.860 0.361 0.730 0.727 0.921 0.8970 8.716 0.965 0.750 0.744
ResNet 0.478 0.857 0.398 0.734 0.731 0.966 0.8976 8.770 0.967 0.751 0.745
FT-Transformer 0.448 0.860 0.398 0.739 0.731 0.967 0.8984 8.751 0.973 0.747 0.743

4.5 Comparing DL models and GBDT

In this section, our goal is to check whether DL models are conceptually ready to outperform GBDT.
To this end, we compare the best possible metric values that one can achieve using GBDT or DL
models, without taking speed and hardware requirements into account (undoubtedly, GBDT is a more
lightweight solution). We accomplish that by comparing ensembles instead of single models since

7
GBDT is essentially an ensembling technique and we expect that deep architectures will benefit more
from ensembling (Fort et al., 2020). We report the results in Table 4.

Table 4: Results for ensembles of GBDT and the main DL models. For each model-dataset pair, the
metric value averaged over three ensembles is reported. See supplementary for standard deviations.
Notation follows Table 3.

CA ↓ AD ↑ HE ↑ JA ↑ HI ↑ AL ↑ EP ↑ YE ↓ CO ↑ YA ↓ MI ↓
Default hyperparameters
XGBoost 0.462 0.874 0.348 0.711 0.717 0.924 0.8799 9.192 0.964 0.761 0.751
CatBoost 0.428 0.873 0.386 0.724 0.728 0.948 0.8893 8.885 0.910 0.749 0.744
FT-Transformer 0.454 0.860 0.395 0.734 0.731 0.966 0.8969 8.727 0.973 0.747 0.742
Tuned hyperparameters
XGBoost 0.431 0.872 0.377 0.724 0.728 – 0.8861 8.819 0.969 0.732 0.742
CatBoost 0.423 0.874 0.388 0.727 0.729 – 0.8898 8.837 0.968 0.740 0.741
ResNet 0.478 0.857 0.398 0.734 0.731 0.966 0.8976 8.770 0.967 0.751 0.745
FT-Transformer 0.448 0.860 0.398 0.739 0.731 0.967 0.8984 8.751 0.973 0.747 0.743

Default hyperparameters. We start with the default configurations to check the “out-of-the-box”
performance, which is an important practical scenario. The default FT-Transformer implies a configu-
ration with all hyperparameters set to some specific values that we provide in supplementary. Table 4
demonstrates that the ensemble of FT-Transformers mostly outperforms the ensembles of GBDT,
which is not the case for only two datasets (California Housing, Adult). Interestingly, the ensemble
of default FT-Transformers performs quite on par with the ensembles of tuned FT-Transformers.
The main takeaway: FT-Transformer allows building powerful ensembles out of the box.
Tuned hyperparameters. Once hyperparameters are properly tuned, GBDTs start dominating on
some datasets (California Housing, Adult, Yahoo; see Table 4). In those cases, the gaps are significant
enough to conclude that DL models do not universally outperform GBDT. Importantly, the fact that
DL models outperform GBDT on most of the tasks does not mean that DL solutions are “better” in any
sense. In fact, it only means that the constructed benchmark is slightly biased towards “DL-friendly”
problems. Admittedly, GBDT remains an unsuitable solution to multiclass problems with a large
number of classes. Depending on the number of classes, GBDT can demonstrate unsatisfactory
performance (Helena) or even be untunable due to extremely slow training (ALOI).
The main takeaways:
• there is still no universal solution among DL models and GBDT
• DL research efforts aimed at surpassing GBDT should focus on datasets where GBDT
outperforms state-of-the-art DL solutions. Note that including “DL-friendly” problems is
still important to avoid degradation on such problems.

4.6 An intriguing property of FT-Transformer

Table 4 tells one more important story. Namely, FT-Transformer delivers most of its advantage over
the “conventional” DL model in the form of ResNet exactly on those problems where GBDT is
superior to ResNet (California Housing, Adult, Covertype, Yahoo, Microsoft) while performing on
par with ResNet on the remaining problems. In other words, FT-Transformer provides competitive
performance on all tasks, while GBDT and ResNet perform well only on some subsets of the tasks.
This observation may be the evidence that FT-Transformer is a more “universal” model for tabular
data problems. We develop this intuition further in section 5.1. Note that the described phenomenon
is not related to ensembling and is observed for single models too (see supplementary).

5 Analysis
5.1 When FT-Transformer is better than ResNet?

In this section, we make the first step towards understanding the difference in behavior between
FT-Transformer and ResNet, which was first observed in section 4.6. To achieve that, we design a

8
sequence of synthetic tasks where the difference in performance of the two models gradually changes
from negligible to dramatic. Namely, we generate and fix objects {xi }ni=1 , perform the train-val-test
split once and interpolate between two regression targets: fGBDT , which is supposed to be easier for
GBDT and fDL , which is expected to be easier for ResNet. Formally, for one object:

x ∼ N (0, Ik ), y = α · fGBDT (x) + (1 − α) · fDL (x).

where fGBDT (x) is an average prediction of 0.8

30 randomly constructed decision trees, and 0.7
fDL (x) is an MLP with three randomly initial-
ized hidden layers. Both fGBDT and fDL are 0.6

generated once, i.e. the same functions are ap- 0.5

RMSE
plied to all objects (see supplementary for de- 0.4
tails). The resulting targets are standardized
0.3
before training. The results are visualized in
Figure 3. ResNet and FT-Transformer perform 0.2 ResNet
similarly well on the ResNet-friendly tasks and FT-Transformer
0.1
CatBoost
outperform CatBoost on those tasks. However, 0.0
the ResNet’s relative performance drops signif- 0.00 0.25 0.50 0.75 1.00

icantly when the target becomes more GBDT α

friendly. By contrast, FT-Transformer yields
competitive performance across the whole range Figure 3: Test RMSE averaged over five seeds
of tasks. (shadows represent std. dev.). One α corresponds
to one task; each task has the same set of train,
The conducted experiment reveals a type of validation and test features, but different targets.
functions that are better approximated by
FT-Transformer than by ResNet. Additionally,
the fact that these functions are based on decision trees correlates with the observations in section 4.6
and the results in Table 4, where FT-Transformer shows the most convincing improvements over
ResNet exactly on those datasets where GBDT outperforms ResNet.

5.2 Ablation study

In this section, we test some design choices of FT-Transformer.

First, we compare FT-Transformer with AutoInt (Song et al., 2019), since it is the closest competitor
in its spirit. AutoInt also converts all features to embeddings and applies self-attention on top of them.
However, in its details, AutoInt significantly differs from FT-Transformer: its embedding layer does
not include feature biases, its backbone significantly differs from the vanilla Transformer (Vaswani
et al., 2017), and the inference mechanism does not use the [CLS] token.
Second, we check whether feature biases in Feature Tokenizer are essential for good performance.
We tune and evaluate FT-Transformer without feature biases following the same protocol as in
section 4.3 and reuse the remaining numbers from Table 2. The results averaged over 15 runs are
reported in Table 5 and demonstrate both the superiority of the Transformer’s backbone to that of
AutoInt and the necessity of feature biases.

Table 5: The results of the comparison between FT-Transformer and two attention-based alternatives:
AutoInt and FT-Transformer without feature biases. Notation follows Table 2.

CA ↓ HE ↑ JA ↑ HI ↑ AL ↑ YE ↓ CO ↑ MI ↓
AutoInt 0.474 0.372 0.721 0.725 0.945 8.882 0.934 0.750
FT-Transformer (w/o feature biases) 0.470 0.381 0.724 0.727 0.958 8.843 0.964 0.751
FT-Transformer 0.459 0.391 0.732 0.729 0.960 8.855 0.970 0.746

9
5.3 Obtaining feature importances from attention maps

In this section, we evaluate attention maps as a source of information on feature importances for
FT-Transformer for a given set of samples. For the i-th sample, we calculate the average attention map
pi for the [CLS] token from Transformer’s forward pass. Then, the obtained individual distributions
are averaged into one distribution p that represents the feature importances:
1 X 1 X
p= pi pi = pihl .
nsamples i nheads × L
h,l

where pihl is the h-th head’s attention map for the [CLS] token from the forward pass of the l-th
layer on the i-th sample. The main advantage of the described heuristic technique is its efficiency: it
requires a single forward for one sample.
In order to evaluate our approach, we compare it with Integrated Gradients (IG, Sundararajan et al.
(2017)), a general technique applicable to any differentiable model. We use permutation test (PT,
Breiman (2001)) as a reasonable interpretable method that allows us to establish a constructive metric,
namely, rank correlation. We run all the methods on the train set and summarize results in Table 6.
Interestingly, the proposed method yields reasonable feature importances and performs similarly to
IG (note that this does not imply similarity to IG’s feature importances). Given that IG can be orders
of magnitude slower and the “baseline” in the form of PT requires (nf eatures + 1) forward passes
(versus one for the proposed method), we conclude that the simple averaging of attention maps can
be a good choice in terms of cost-effectiveness.

Table 6: Rank correlation (takes values in [−1, 1]) between permutation test’s feature importances
ranking and two alternative rankings: Attention Maps (AM) and Integrated Gradients (IG). Means
and standard deviations over five runs are reported.

CA HE JA HI AL YE CO MI
AM 0.81 (0.05) 0.77 (0.03) 0.78 (0.05) 0.91 (0.03) 0.84 (0.01) 0.92 (0.01) 0.84 (0.04) 0.86 (0.02)
IG 0.84 (0.08) 0.74 (0.03) 0.75 (0.04) 0.72 (0.03) 0.89 (0.01) 0.50 (0.03) 0.90 (0.02) 0.56 (0.02)

6 Conclusion
In this work, we have investigated the status quo in the field of deep learning for tabular data and
improved the state of baselines in tabular DL. First, we have demonstrated that a simple ResNet-like
architecture can serve as an effective baseline. Second, we have proposed FT-Transformer — a
simple adaptation of the Transformer architecture that outperforms other DL solutions on most of
the tasks. We have also compared the new baselines with GBDT and demonstrated that GBDT still
dominates on some tasks. The code and all the details of the study are open-sourced 1 , and we hope
that our evaluation and two simple models (ResNet and FT-Transformer) will serve as a basis for
further developments on tabular DL.

References
T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama. Optuna: A next-generation hyperparameter
optimization framework. In KDD, 2019.
S. O. Arik and T. Pfister. Tabnet: Attentive interpretable tabular learning. arXiv, 1908.07442v5, 2020.
S. Badirli, X. Liu, Z. Xing, A. Bhowmik, K. Doan, and S. S. Keerthi. Gradient boosting neural
networks: Grownet. arXiv, 2002.07971v2, 2020.
P. Baldi, P. Sadowski, and D. Whiteson. Searching for exotic particles in high-energy physics with
deep learning. Nature Communications, 5, 2014.
1
https://github.com/yandex-research/rtdl

10
T. Bertin-Mahieux, D. P. Ellis, B. Whitman, and P. Lamere. The million song dataset. In Proceedings
of the 12th International Conference on Music Information Retrieval (ISMIR 2011), 2011.
A. Beutel, P. Covington, S. Jain, C. Xu, J. Li, V. Gatto, and E. H. Chi. Latent cross: Making use of
context in recurrent recommender systems. In WSDM 2018: The Eleventh ACM International
Conference on Web Search and Data Mining, 2018.
J. A. Blackard and D. J. Dean. Comparative accuracies of artificial neural networks and discriminant
analysis in predicting forest cover types from cartographic variables. Computers and Electronics
in Agriculture, 24(3):131–151, 2000.
L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
O. Chapelle and Y. Chang. Yahoo! learning to rank challenge overview. In Proceedings of the
Learning to Rank Challenge, volume 14, 2011.
T. Chen and C. Guestrin. Xgboost: A scalable tree boosting system. In SIGKDD, 2016.
J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical
image database. In CVPR, 2009.
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional
transformers for language understanding. arXiv, 1810.04805v2, 2019.
A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani,
M. Minderer, G. Heigold, S. Gelly, et al. An image is worth 16x16 words: Transformers for image
recognition at scale. In ICLR, 2021.
S. Fort, H. Hu, and B. Lakshminarayanan. Deep ensembles: A loss landscape perspective. arXiv,
1912.02757v2, 2020.
J. H. Friedman. Greedy function approximation: A gradient boosting machine. The Annals of
Statistics, 29(5):1189–1232, 2001.
J. M. Geusebroek, G. J. Burghouts, , and A. W. M. Smeulders. The amsterdam library of object
images. Int. J. Comput. Vision, 61(1):103–112, 2005.
I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016. http://www.
deeplearningbook.org.
I. Guyon, L. Sun-Hosoya, M. Boullé, H. J. Escalante, S. Escalera, Z. Liu, D. Jajetic, B. Ray, M. Saeed,
M. Sebag, A. Statnikov, W. Tu, and E. Viegas. Analysis of the automl challenge series 2015-2018.
In AutoML, Springer series on Challenges in Machine Learning, 2019.
H. Hazimeh, N. Ponomareva, P. Mol, Z. Tan, and R. Mazumder. The tree ensemble layer: Differen-
tiability meets conditional computation. In ICML, 2020.
K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv,
1512.03385v1, 2015.
X. Huang, A. Khetan, M. Cvitkovic, and Z. Karnin. Tabtransformer: Tabular data modeling using
contextual embeddings. arXiv, 2012.06678v1, 2020.
G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T.-Y. Liu. Lightgbm: A highly
efficient gradient boosting decision tree. Advances in neural information processing systems, 30:
3146–3154, 2017.
R. Kelley Pace and R. Barry. Sparse spatial autoregressions. Statistics & Probability Letters, 33(3):
291–297, 1997.
D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv, 1412.6980v9, 2017.
G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter. Self-normalizing neural networks. In
NIPS, 2017.

11
R. Kohavi. Scaling up the accuracy of naive-bayes classifiers: a decision-tree hybrid. In KDD, 1996.
P. Kontschieder, M. Fiterau, A. Criminisi, and S. Rota Bulo. Deep neural decision forests. In
Proceedings of the IEEE international conference on computer vision, 2015.
I. Loshchilov and F. Hutter. Decoupled weight decay regularization. In ICLR, 2019.
Y. Lou and M. Obukhov. Bdt: Gradient boosted decision tables for high accuracy and scoring
efficiency. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, 2017.
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Pretten-
hofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and
E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research,
12:2825–2830, 2011.
S. Popov, S. Morozov, and A. Babenko. Neural oblivious decision ensembles for deep learning on
tabular data. In ICLR, 2020.
L. Prokhorenkova, G. Gusev, A. Vorobev, A. V. Dorogush, and A. Gulin. Catboost: unbiased boosting
with categorical features. In NeurIPS, 2018.
T. Qin and T. Liu. Introducing LETOR 4.0 datasets. arXiv, 1306.2597v1, 2013.
Z. Qin, L. Yan, H. Zhuang, Y. Tay, R. K. Pasumarthi, X. Wang, M. Bendersky, and M. Najork. Are
neural rankers still outperformed by gradient boosted decision trees? In ICLR, 2021.
W. Song, C. Shi, Z. Xiao, Z. Duan, Y. Xu, M. Zhang, and J. Tang. Autoint: Automatic feature
interaction learning via self-attentive neural networks. In CIKM, 2019.
N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple
way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):
1929–1958, 2014.
S. Sun and M. Iyyer. Revisiting simple neural probabilistic language models. In NAACL, 2021.
M. Sundararajan, A. Taly, and Q. Yan. Axiomatic attribution for deep networks. In ICML, 2017.
Y. Tay, M. Dehghani, D. Bahri, and D. Metzler. Efficient transformers: A survey. arXiv, 2009.06732v1,
2020.
R. Turner, D. Eriksson, M. McCourt, J. Kiili, E. Laaksonen, Z. Xu, and I. Guyon. Bayesian
optimization is superior to random search for machine learning hyperparameter tuning: Analysis
of the black-box optimization challenge 2020. arXiv, https://arxiv.org/abs/2104.10201v1, 2021.
J. Vanschoren, J. N. van Rijn, B. Bischl, and L. Torgo. Openml: networked science in machine
learning. arXiv, 1407.7722v1, 2014.
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin.
Attention is all you need. In NIPS, 2017.
A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman. GLUE: A multi-task benchmark
and analysis platform for natural language understanding. In ICLR, 2019a.
Q. Wang, B. Li, T. Xiao, J. Zhu, C. Li, D. F. Wong, and L. S. Chao. Learning deep transformer
models for machine translation. In ACL, 2019b.
R. Wang, B. Fu, G. Fu, and M. Wang. Deep & cross network for ad click predictions. In ADKDD,
2017.
R. Wang, R. Shivanna, D. Z. Cheng, S. Jain, D. Lin, L. Hong, and E. H. Chi. Dcn v2: Improved deep
& cross network and practical lessons for web-scale learning to rank systems. arXiv, 2008.13535v2,
2020.
Y. Yang, I. G. Morillo, and T. M. Hospedales. Deep neural decision trees. arXiv, 1806.06988v1,
2018.