Few-Shot Classification of Tabular Data With Large Language Models

The document introduces TabLLM, a framework that leverages large language models (LLMs) for zero-shot and few-shot classification of tabular data by serializing the data into natural language strings. It evaluates various serialization methods and demonstrates that TabLLM outperforms traditional deep learning methods and gradient-boosted trees in several benchmark datasets, particularly in very-few-shot settings. The study highlights the potential of LLMs to exploit prior knowledge and achieve competitive performance with minimal labeled training data.

TabLLM: Few-shot Classification of Tabular Data with Large Language Models

Stefan Hegselmann1,2 Alejandro Buendia1 Hunter Lang1 Monica Agrawal1 Xiaoyi Jiang2 David Sontag1
1 MIT CSAIL 2 University of Münster

Proceedings of the 26th International Conference on Artificial Intelligence and Statistics (AISTATS) 2023, Valencia, Spain. PMLR: Volume 206. Copyright 2023 by the author(s).

Abstract

We study the application of large language models to zero-shot and few-shot classification of tabular data. We prompt the large language model with a serialization of the tabular data to a natural-language string, together with a short description of the classification problem. In the few-shot setting, we fine-tune the large language model using some labeled examples. We evaluate several serialization methods including templates, table-to-text models, and large language models. Despite its simplicity, we find that this technique outperforms prior deep-learning-based tabular classification methods on several benchmark datasets. In most cases, even zero-shot classification obtains non-trivial performance, illustrating the method’s ability to exploit prior knowledge encoded in large language models. Unlike many deep learning methods for tabular datasets, this approach is also competitive with strong traditional baselines like gradient-boosted trees, especially in the very-few-shot setting.

1 INTRODUCTION

Many real world applications generate tabular data as a natural byproduct of relational databases (Shwartz-Ziv and Armon, 2022). It is ubiquitous in domains ranging from healthcare to climate and finance (Sahakyan et al., 2021). Obtaining enough labeled data to train supervised learning algorithms for classification can be difficult. For example, in healthcare, there are 10,000 rare diseases (Haendel et al., 2020) affecting very few patients, which hampers the development of risk stratification models. Thus, we seek to develop methods that can exploit prior knowledge (e.g., from medical articles) to improve predictive performance in settings with a small number of training examples, i.e. the few-shot setting.

While deep learning has led to breakthroughs in computer vision and natural language processing, this success has not yet been extended to the tabular domain. For example, self-supervised deep learning methods have been introduced for tabular data (Yin et al., 2020; Arik and Pfister, 2021), but Grinsztajn et al. (2022) showed that these deep techniques still underperform ensembles of gradient boosted trees in the fully supervised setting. This disparity in performance can be attributed to the differences between tabular data and text or images; tabular data lacks locality, contains mixed data types, and the number of columns is usually fairly small compared to the number of features in text or image data (Borisov et al., 2022a).

Recently, large language models (LLMs) such as GPT-3, which are pre-trained on enormous corpora of text, have shown incredible performance on few-shot text classification and generation tasks (Brown et al., 2020; Sanh et al., 2022; Ouyang et al., 2022). These LLMs perform well on a variety of tasks and domains, including fact retrieval (Liu et al., 2021), mathematical reasoning (Wei et al., 2022), medical information extraction (Agrawal et al., 2022), and tabular data cleaning tasks (Narayan et al., 2022). Most importantly, because of all the knowledge encoded in their parameters, LLMs require little or no labeled training data to obtain this good performance.

In this work we introduce TabLLM, which is a general framework to leverage LLMs for few-shot classification of tabular data. We prompt the LLM with a serialization of a row to a natural-language representation and a short description of the classification problem. For risk stratification, for instance, this serialization could list relevant patient attributes and combine it with, “Will this patient be hospitalized?”. We experiment with nine different serializations and the T0 language model of different sizes (Sanh et al., 2022). We use the parameter-efficient fine-tuning method T-Few (Liu et al., 2022) to update the LLM’s parameters using some labeled examples. We also evaluate GPT-3 in the zero-shot setting (Brown et al., 2020). To the best of our knowledge, this is one of the widest evaluations of LLMs for zero- and few-shot tabular classification.

[Figure 1 (diagram). Panels: 1. Tabular data with k labeled rows (example columns: age, education, gain, income). 2. Serialize feature names and values into a natural-language string with different methods, e.g., Manual Template: “The age is 42. The education is Master. The gain is 594.”; Table-To-Text: “The person is 42 years old. She has a Master. The gain is 594 dollars.”; LLM: “The person is 42 years old and has a Master’s degree. She gained $594.” 3. Add task-specific prompt: “Does this person earn more than 50000 dollars? Yes or no? Answer:”. 4a. Fine-tune LLM using labeled examples. 4b. Use LLM for prediction on unlabeled examples.]

Figure 1: Overview of TabLLM. We first serialize the feature names and values into a natural language string. We
evaluate different strategies. This string is then combined with a task-specific prompt. To get predictions, we obtain
output probabilities from the LLM for each of a pre-specified set of verbalizer tokens (e.g., “Yes”, “No”), which map to
class labels (e.g., 1, −1). If 𝑘 > 0, we use the 𝑘 labeled examples to fine-tune the large language model using T-Few (Liu
et al., 2022). Finally, we use the (possibly tuned) large language model to obtain predictions on unlabeled examples.
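
As a minimal sketch of the first steps in Figure 1, the snippet below builds a Text Template style serialization ("The column name is value.") and appends the task-specific prompt. The function names, the newline used to join serialization and prompt, and the example row are assumptions made for illustration; this is not the authors' released code.

```python
def serialize_text_template(feature_names, values):
    """Text Template: one sentence "The <column name> is <value>." per feature."""
    return " ".join(
        f"The {name} is {value}." for name, value in zip(feature_names, values)
    )


def build_llm_input(feature_names, values, prompt):
    """Serialized row followed by the task-specific prompt.

    Joining the two parts with a newline is an assumption of this sketch.
    """
    return serialize_text_template(feature_names, values) + "\n" + prompt


# Verbalizer: answer tokens mapped to class labels, as in the Figure 1 caption.
VERBALIZER = {"Yes": 1, "No": -1}

# Hypothetical example row, modeled on the row shown in Figure 1.
feature_names = ["age", "education", "gain"]
row = [42, "Master", 594]
prompt = "Does this person earn more than 50000 dollars? Yes or no? Answer:"

llm_input = build_llm_input(feature_names, row, prompt)
print(llm_input)
```

Keeping the serialization and the prompt as separate functions mirrors the split the paper makes between the two, so a different serialization can be swapped in without touching the prompt.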

Despite its simplicity, we find that TabLLM outperforms prior deep-learning-based tabular classification methods on several benchmark datasets. By using information from the natural-language column names and feature values, it often enables effective zero-shot classification of tabular data. Unlike many deep learning methods on tabular data, this approach is also competitive with gradient-boosted tree baselines and outperforms them or is on par until 256 shots. In the very-few-shot setting it outperforms them by a considerable margin. The main contributions of this work are:

• We introduce TabLLM, a novel framework leveraging LLMs for data-efficient tabular classification
• We study nine serialization techniques and explore their performance across ten different datasets
• We show that TabLLM instantiated with a simple text serialization and the T0 LLM can outperform state-of-the-art neural models and tree ensembles in the zero- and few-shot setting
• We investigate the application of TabLLM to a large real-world healthcare claims dataset and introduce serialization methods that deal with many input features

2 RELATED WORK

2.1 Machine Learning on Tabular Data

Due to the success of deep learning in other domains, there have been many recent attempts at representation learning for tabular data. Self-supervised objectives have largely revolved around the prediction of masked cells, the identification or correction of corrupted cells, and contrastive losses over augmentations (Bahri et al., 2022; Somepalli et al., 2021; Yoon et al., 2020; Arik and Pfister, 2021; Huang et al., 2020). Additional efforts have included differentiable trees, which combine advantages of tree ensembles with gradient based optimization of neural networks (Kontschieder et al., 2015; Popov et al., 2020). However, several recent comprehensive reviews (Shwartz-Ziv and Armon, 2022; Borisov et al., 2022a; Grinsztajn et al., 2022) found that gradient-boosted tree ensembles like XGBoost (Chen and Guestrin, 2016) and LightGBM (Ke et al., 2017) systematically outperform these novel deep learning architectures, even with proper fine-tuning and regularization (Kadra et al., 2021). Levin et al. (2022) found utility in transfer learning in the semi-supervised setting, but required a set of additional supervised tasks on the same table, which can be a nontrivial limitation. They investigate few-shot classification for medical diagnosis using 4 to 200 labeled examples, but do not exploit the power of large pre-trained models, as we do in this work. Hollmann et al. (2022) recently introduced TabPFN, a Bayesian neural network pre-trained on synthetic tabular data, outperforming gradient boosted trees in a comprehensive evaluation.

2.2 Large Language Models for Tabular Data

Another approach has been to leverage the natural language capabilities of language models. Yin et al. (2020) use a language model for semantic parsing of natural language queries over tabular data. Li et al. (2020) investigate the ability of language models to perform entity matching on tabular data, i.e. determining if two rows refer to the same object. Harari and Katz (2022) study data enrichment by linking each table row with additional unstructured text (e.g., from Wikipedia) from which they generated additional features using a language model. However, this setup requires named entities (e.g., celebrities, universities, etc.), which is quite limiting. Bertsimas et al. (2022) studied two healthcare datasets and used a language model to generate feature embeddings, which they fed into classifiers like gradient boosted trees. All these studies use a BERT-style language model (Devlin et al., 2019). Narayan et al. (2022) recently assessed in-context learning with the autoregressive language model GPT-3 for tabular data cleaning tasks. They found that it often outperforms state-of-the-art approaches with ten labeled examples. Borisov et al. (2022b) introduced an LLM-agnostic method to generate realistic tabular data and found that it achieved better results than existing approaches. In contrast, here we study classification tasks of tabular data and investigate parameter-efficient fine-tuning of LLMs.

To use an LLM for tabular data, the table must be serialized into a natural text representation. All aforementioned works relied on simple list or sentence serializations; Yin et al. (2020) also included the column data type in the serialized string. Only Bertsimas et al. (2022) studied different serialization variants, but this was in a different context of deriving feature embeddings from BERT-style language models. The LIFT method introduced by Dinh et al. (2022) comes closest to our work. The authors evaluated the capabilities of fine-tuned GPT-3 and GPT-J models for regression and classification on synthetic, tabular, and vision data. They also studied the sample efficiency and considered different static serialization templates assessing the effect of including column names in the input. In this work, we focus on the publicly available T0 model and perform a broader analysis of nine serialization techniques including automatic approaches and ablations evaluating the importance of feature values. Particularly, we are interested in leveraging prior knowledge encoded in LLMs and we do a more fine-grained analysis of the sample efficiency including zero-shot experiments on ten different datasets.

3 METHODS

3.1 TabLLM for Tabular Data Classification

Problem Formalization. Suppose we have a tabular dataset with $n$ rows and $d$ columns or features. We can formalize this as $D = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, where each $\mathbf{x}_i$ is a $d$-dimensional feature vector. Since we consider classification, $y_i \in C$ for a set of classes $C$. We define the column names or feature names as $F = \{f_1, \ldots, f_d\}$. We assume the $f_i$’s are natural-language strings such as “age” or “education” (see Figure 1). For our $k$-shot classification experiments, we only use a subset $D_k$ of size $k$ (sampled from $D$ with replacement) for fine-tuning or training.

Serialization of Tabular Data. To use an LLM for tabular data, the table must be transformed into a natural text representation. Typically, when prompting an LLM, there is a template used to both serialize the inputs into one natural-language string, and to provide the prompt itself (e.g., the string “Does this person make more than 50,000 dollars? Yes or no?”), which is usually located after the serialized input. In this work, we break these pieces up into a serialization and a prompt. We define a function $\mathrm{serialize}(F, \mathbf{x})$ that takes the column names $F$ and feature values $\mathbf{x}$ for a row as inputs and creates a textual representation of the input. Combining this serialization with a task-specific prompt $p$ will then form the LLM input $(\mathrm{serialize}(F, \mathbf{x}), p)$. This is illustrated in Figure 1. We primarily study the serialization, since that is the biggest difference compared to existing applications of prompting. Previous work has usually considered a simple concatenation of feature names and values as a serialization of tabular data (Li et al., 2020; Narayan et al., 2022). In our work, this function can be arbitrarily complex. For instance, we explore serializations that include (i) incorporating another LLM and (ii) employing feature selection as a substep.

Large Language Models For Classification. TabLLM can be used with different LLMs that generate text based on a natural-language input. Let $\mathrm{LLM}$ be an LLM with vocabulary $V$. Then, $\mathrm{LLM}((\mathrm{serialize}(F, \mathbf{x}), p)) \in V^*$ is the prompted output of the LLM. In our few-shot setting, $\{(\mathrm{serialize}(F, \mathbf{x}), p) \mid (\mathbf{x}, y) \in D_k\}$ can be used as training examples for fine-tuning the LLM. The LLM generates text in the vocabulary space $V^*$ that has to be mapped to a valid class in $C$. Several approaches already exist for this problem. For example, the verbalizer (Schick and Schütze, 2021) defines a mapping between LLM output tokens and the discrete label space. Verbalizers can be manually specified or automatically learned; see Cui et al. (2022) for an overview of different verbalizer-learning approaches. In this work, we assume for simplicity that the verbalizer mapping is manually specified (see answer choices in the templates in Sec. 8 in the Supplement).

3.2 Our Instantiation of TabLLM

Serialization Approaches for TabLLM. The performance of LLMs is very sensitive to the precise details of the natural-language input (Zhao et al., 2021; Webson and Pavlick, 2022). In this work, we focus on the serialization of the tabular data. For the prompt, we use a simple description of the classification task and perform no further prompt engineering. We study nine different serialization formats varying in complexity. All serialization methods require minimal human effort to apply to new classification tasks. We evaluate several methods that generate natural text to create inputs that are closer to the training distribution of the LLM, thereby improving zero and very-few-shot performance. Additional details and examples for the serializations are given in Sec. 1.2.1 and 9 in the Supplement.

• List Template: A list of column names and feature values. We fixed an arbitrary ordering of the columns.
• Text Template: A textual enumeration of all features as “The column name is value.” (see Figure 1).
• Table-To-Text: We use an LLM fine-tuned on a table-to-text generation task from HuggingFace (Narrativaai/bloom-560m-finetuned-totto-table-to-text). To ensure that the serialization includes all data we hand each column-value tuple to the model separately and concatenate the outputs.
• Text T0: We use the LLM T0 with 11B parameters (bigscience/T0pp) (Sanh et al., 2022). We split up a row into pairs of two column-value tuples. We send them to the LLM separately with the prompt “Write this information as a sentence:” and combine the outputs.
• Text GPT-3: We use GPT-3 (engine text-davinci-002) accessible through an API (Ouyang et al., 2022). GPT-3 was able to serialize all features at once, so we use a list of all features with the prompt “Rewrite all list items in the input as a natural text.” as input. We guide the output with “The {person, car, patient} is”.

We consider the following serializations as ablations:

• List Only Values: List Template for feature values only. We want to evaluate whether column names aid the classification performance.
• List Permuted Names: List Template with permuted column names. Hence, the wrong column name is associated with each feature value. The permutation is the same across all examples. We perform this ablation to study the relevance of the correct association between column names and feature values.
• List Permuted Values: List Template with consistently permuted values across all examples. We generate one permutation for each column and apply this mapping to all column values. For continuous values, we use ten uniform bins. This tests whether the LLM uses the fine-grained information encoded by the feature values for zero-shot and few-shot classification.
• List Short: List Template with at most ten features. We only consider this for the healthcare dataset where the number of features exceeds the input limit of the LLM. We want to study the effect of less information.

Large Language Models for TabLLM. Another crucial component of TabLLM is the LLM. TabLLM is both agnostic to the LLM and the specific fine-tuning method that is used. We only consider a single LLM for most of our experiments. We employ the T0 encoder-decoder model with 11 billion parameters as the LLM for TabLLM (Sanh et al., 2022). It was trained on a large variety of task-specific prompts, making it a suitable candidate for our experiments (Sanh et al., 2022). This model has a token limit of 1024, which roughly corresponds to 400 words. We also evaluate the effect of a smaller version of the T0 model (T0 3B). We fine-tuned on the few-shot data $D_k$ using the recent T-Few recipe, which outperforms other parameter-efficient tuning methods such as soft prompt tuning (Liu et al., 2022). In addition, we perform zero-shot experiments with the LLM GPT-3 (engine text-davinci-002) (Ouyang et al., 2022).

4 EXPERIMENTAL SETUP

4.1 Datasets

We studied TabLLM in two experimental settings. First, we considered nine medium-sized tabular datasets for binary and multi-class classification. We systematically identified datasets from Kadra et al. (2021), Grinsztajn et al. (2022), and Borisov et al. (2022a). We included datasets with at most 50,000 rows to keep the fine-tuning costs manageable and at most 30 columns to stay within T0’s token limit. We also required textual feature names to make the serializations more meaningful and we excluded datasets with derived feature values (e.g., mean pixel values). This led to inclusion of Bank (45,211 rows, 16 feats), Blood (748, 4), California (20,640, 8), Car (1,728, 8), Credit-g (1,000, 20), Income (48,842, 14), and Jungle (44,819, 6). We added two additional datasets from Kaggle that fulfilled our inclusion criteria: Diabetes (768, 8) and Heart (918, 11). Second, we evaluated TabLLM for risk stratification on three binary classification tasks, following prior work by Kodialam et al. (2021) and similarly using a de-identified health claims dataset from a U.S. health insurer. We predicted the end-of-life (EoL) of all patients older than 70 years, which can be used to inform care in a palliative setting (Avati et al., 2018). We also considered the need for any surgical procedure (Surgery) and the likelihood of hospitalization (LoH), which can help with determining health care needs and estimating future costs. Additional details on all datasets can be found in Sec. 1 in the Supplement. We release the code for our experiments on GitHub.¹

¹ https://github.com/clinicalml/TabLLM

4.2 LLM and Fine-tuning

We used the HuggingFace implementation of the T0 model (bigscience/{T0pp,T0 3B}). Prompts for the LLM were designed following Sanh et al. (2022) using the PromptSource framework (Bach et al., 2022). Each class in our classification tasks was manually encoded in a textual response, e.g., “Yes” and “No” for true and false (Sanh et al., 2022). The prediction probability for each class corresponds to the probability of the LLM generating its token sequence normalized across all classes. All templates used in this work are given in Sec. 8 in the Supplement.

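
The ablation serializations from Sec. 3.2 (List Only Values, List Permuted Names, List Permuted Values) can be illustrated with the rough sketch below. The helper names, the fixed seeds, the "- value" list format, and the example data are all assumptions made for illustration. In particular, the text does not specify how a permuted bin is written back into the serialized string, so the sketch stops at computing the bin index.

```python
import random


def permute_names(feature_names, seed=0):
    """One fixed permutation of the column names, reused for every example."""
    rng = random.Random(seed)
    shuffled = list(feature_names)
    rng.shuffle(shuffled)
    return shuffled


def build_value_permutation(column_values, seed=0):
    """One fixed mapping value -> permuted value for a single column."""
    rng = random.Random(seed)
    distinct = sorted(set(column_values))
    shuffled = list(distinct)
    rng.shuffle(shuffled)
    return dict(zip(distinct, shuffled))


def uniform_bin(value, lo, hi, n_bins=10):
    """Index (0..n_bins-1) of the equal-width bin over [lo, hi] containing value."""
    if hi == lo:
        return 0
    index = int((value - lo) / (hi - lo) * n_bins)
    return min(max(index, 0), n_bins - 1)


def list_only_values(values):
    """List serialization without column names (the line format is assumed)."""
    return "\n".join(f"- {value}" for value in values)


# Hypothetical example data.
names = ["age", "education", "gain"]
education_column = ["HS-grad", "Bachelor", "Master", "Doctorate", "Master"]

permuted_names = permute_names(names, seed=1)
education_mapping = build_value_permutation(education_column, seed=1)
print(permuted_names)
print(education_mapping)
print(list_only_values([42, "Master", 594]))
```

Computing each permutation once and reusing it is what keeps the mapping consistent across all examples, so a model could in principle learn the new associations from enough training data.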
For fine-tuning, we adopted the default hyperparameters of the T-Few method without any additional parameter tuning (Liu et al., 2022). The authors used a setup of 𝑘 = 32 shots and 1,000 training steps for most of their experiments, which corresponds to 31.25 epochs. Hence, we fixed 30 training epochs for all few-shot experiments on the public tabular datasets. We used 20% of the data as a test set. For the large healthcare claims dataset, we used 10 epochs for up to 256 shots and 3 epochs for 1,024, 4,096 and 16,384 to reduce the runtime and prevent overfitting for many training examples. We used a test set of 10,000 examples for the three healthcare tasks. All experiments were evaluated with the area under the receiver operating characteristic curve (AUC). We used macro-AUC one-versus-rest for the multiclass setting. Estimates for the runtime are given in Sec. 2 in the Supplement.

4.3 Baseline Models

We compared TabLLM to several baselines. For the simplest baseline, we used a logistic regression (LR) model. Since previous work showed the superiority of gradient boosted tree ensembles (Borisov et al., 2022a), we included the most common models XGBoost (Chen and Guestrin, 2016) and LightGBM (Ke et al., 2017). We also evaluated several state-of-the-art deep learning baselines. TabNet is a widely used neural model for tabular data that uses attention over columns (Arik and Pfister, 2021). SAINT is a more recent approach that uses attention over rows and columns (Somepalli et al., 2021). SAINT performed best in a comprehensive review on tabular data (Borisov et al., 2022a). NODE is a differentiable tree ensemble method that performed best in the evaluation of Shwartz-Ziv and Armon (2022). Lastly, we include TabPFN, a Bayesian neural network that was pre-trained on synthetic tabular data (Hollmann et al., 2022). In contrast to TabLLM, we performed hyperparameter tuning for all baselines except TabPFN (see Sec. 3 in the Supplement), which requires no tuning by design. We adopted the parameter ranges from previous reviews (Borisov et al., 2022a; Grinsztajn et al., 2022). Since no validation set exists in the few-shot setting, we used 4-fold cross validation on the 𝑘-shots. In particular, we did not use a large validation set for hyperparameter tuning, unlike some few-shot learning works as highlighted by Perez et al. (2021). We encoded categorical values as one-hot vectors. We also tested ordinal encoding for LR, XGBoost, LightGBM, and TabPFN, but it showed worse results (see Table 12, 13, and 14 in the Supplement). In addition, we give results for GPT-3 (text-davinci-002) without fine-tuning, i.e. in the zero-shot setting using the Text Template serialization.

For the three health claims tasks, we used the same experimental setup for the baselines. However, we only included LR and LightGBM due to runtime limitations. Following Kodialam et al. (2021), each patient’s input was a one-hot encoded vector. For each medical concept, there were three indicator variables of whether that concept occurred within 30 days, 1 year, and anytime before prediction time.

4.4 Serializations

For the public datasets, some column names and feature values were manually mapped to human-readable forms, based on the provided documentation. For instance, for the Income dataset, the feature name “hours per week” was mapped to “work hours per week” and the feature value “private” for working class was mapped to “private sector employee”. Numerical values were not changed.

Serialization was more complex for the healthcare claims data. Each patient record is a time series of visits, with each visit consisting of a list of medical conditions and procedures. We only considered the manual serializations List Template and Text Template. We tried to mimic the style of a medical professional to tap potential prior knowledge of the LLM. To this end, the serialization starts with an intro sentence containing the patient’s gender, age, and race. It then describes each visit, stating its date, the type of doctor the patient saw (e.g., dermatology) if an outpatient visit or length of hospitalization if an inpatient visit, the primary complaint of the associated visit, and procedures performed. Since there are no feature values in this dataset, we omit List Only Values and List Permuted Values. We also performed experiments for concept selection and different names for the medical concepts. Details for these additional experiments and examples of the serializations are given in Sec. 1.2.2, 1.2.3, and 9 in the Supplement.

5 RESULTS

5.1 Effects of serialization

Figure 2 shows the performance of different serialization methods for TabLLM averaged over the nine public datasets. The Text Template serialization performed very well across all experiments. In the zero-shot setting, the Text Template showed improvements over List Template, indicating the benefit of a serialization that is closer to the training distribution of T0. However, these differences already vanished for 8 training examples. Hence, very few training examples might already suffice to adjust for different templates. This suggests that sophisticated serializations might be unnecessary when some training data exists.

Using LLMs for serialization showed mixed results. The ordering is according to the complexity of the LLM used for serialization. GPT-3 has 175B, T0 11B, and the BLOOM table-to-text model 0.56B parameters. Different reasons might be responsible for the worse performance overall. The models tended to hallucinate information for some examples, leading to biased predictions of TabLLM.

[Figures 2 and 3 (line plots). x-axis: Number of labeled training examples (shots): 0, 4, 8, 16, 32, 64, 128, 256, 512. y-axis: Average AUC (SD) across tabular datasets, from 0.50 to 0.85. Figure 2 series: List Template, Text Template, Table-To-Text, Text T0, Text GPT-3, List Only Values, List Perm. Names, List Perm. Values. Figure 3 series: Log. Reg., LightGBM, XGBoost, SAINT, TabNet, NODE, TabPFN, GPT-3, TabLLM.]

Figure 2: Average AUC and SD of different serializations across nine public datasets. Text Template performs best for zero and few training examples. For many examples, the performance of different serializations converges.

Figure 3: Average AUC and SD of TabLLM versus all baseline models across nine public datasets. TabLLM outperforms all baselines for zero and very few training examples. TabPFN is the strongest baseline.

For instance, GPT-3 added “this car is a good choice” or added entirely new data to some examples (see Sec. 9 in the Supplement). Also, the LLMs are not completely faithful at including all features, even though we tried to enforce it in our experiments. This could explain that none of the LLM serializations reaches the same performance as the template serializations, even for many training examples.

Using only feature values had a poor performance for zero and very few shots, but the performance equalized with more training examples. The same applies to the list serialization with permuted feature names. This indicates that if enough training examples are available, the serialization approach does not matter, but that TabLLM relies on information from the feature names in the zero-shot and few-shot regime, and also relies on the association of the names with the correct values. The discrepancy for zero and very few shots was even stronger for List Permuted Values, which suggests that TabLLM relies more on the correct values than feature names. Again, the performance equalized for more examples showing the ability of TabLLM to learn new associations if enough training data is available. Using the smaller T0 3B model showed a slightly decreased performance (see Table 12, 13, and 14 in the Supplement).

For the healthcare claims dataset, we found that the List Template slightly outperformed the Text Template serialization (see Table 15 in the Supplement). This was consistent across tasks. The List Short serialization only performed slightly worse. The evaluation of different concept selection strategies showed that choosing the most frequent conditions per patient performed best. We found no considerable performance difference for different concept names.

From here onwards, we show results for TabLLM using the Text Template serialization for the public datasets. For the healthcare claims dataset, we use the List Template serialization and select the most frequent conditions. Results for all (dataset, serialization) combinations (Table 12, 13, and 14) and the additional experiments on the healthcare dataset (Table 5 and 7) can be found in the Supplement.

5.2 Public Tabular Datasets

Figure 3 shows the averaged results for TabLLM using the best serialization (Text Template) versus all baseline models. Table 1 contains the detailed results for TabLLM, TabPFN, and XGBoost. TabLLM showed a similar behavior across datasets. It achieved nontrivial zero-shot performance for all tasks except on Credit-g and Heart. For Heart this might be due to the dataset’s inclusion criteria requiring eligibility for a heart procedure biasing the prediction. In all cases, TabLLM’s performance improved with a higher number of shots. In the zero-shot setting, TabLLM was on par with GPT-3 even though GPT-3 is a much larger model than T0 (175B vs. 11B parameters). TabPFN consistently outperformed the other baseline models across all numbers of training examples. TabPFN reached TabLLM’s performance with 4 to 256 (Income) training examples. LR was the second-best baseline often beating the tree models, which might be due to our extensive parameter tuning (see Sec. 4 in the Supplement). TabLLM outperformed or was on par with the tree ensemble baselines until 256 training examples for all datasets except Calhousing and Jungle. For fewer shots, it often outperformed them by a large margin. XGBoost performed relatively poorly for few shots, which was probably due to overfitting on the small training and validation sets (as described in the previous section, we do not use large validation sets for hyperparameter tuning to ensure the results are truly few-shot). TabLLM outperformed the neural baselines SAINT, NODE, and TabNet in many settings. It also

Table 1: Test AUC performance of TabLLM, the best tree ensemble model (XGBoost), and the best baseline (TabPFN) on the public tabular datasets. Each column reports the performance for 𝑘 training examples. TabLLM (T0 + Text Template) outperforms XGBoost and TabPFN in the very-few-shot regime. Standard deviations are given across five random seeds (shown in parentheses).

                     Number of Shots
Dataset     Method   0           4           8           16          32          64          128         256         512         all
Bank        XGBoost  -           0.50 (.00)  0.56 (.09)  0.68 (.04)  0.76 (.03)  0.83 (.02)  0.85 (.03)  0.88 (.01)  0.90 (.01)  0.94 (.00)
            TabPFN   -           0.59 (.14)  0.66 (.08)  0.69 (.02)  0.76 (.03)  0.82 (.03)  0.86 (.02)  0.89 (.00)  0.90 (.00)  0.91 (.00)
            TabLLM   0.63 (.01)  0.59 (.10)  0.64 (.05)  0.65 (.05)  0.64 (.06)  0.69 (.03)  0.82 (.05)  0.87 (.01)  0.88 (.01)  0.92 †
Blood       XGBoost  -           0.50 (.00)  0.58 (.07)  0.66 (.04)  0.67 (.06)  0.68 (.05)  0.71 (.06)  0.70 (.07)  0.67 (.06)  0.71 (.04)
            TabPFN   -           0.52 (.08)  0.64 (.04)  0.67 (.01)  0.70 (.04)  0.73 (.04)  0.75 (.04)  0.76 (.04)  0.76 (.03)  0.74 (.03)
            TabLLM   0.61 (.04)  0.58 (.09)  0.66 (.03)  0.66 (.07)  0.68 (.04)  0.68 (.04)  0.68 (.06)  0.70 (.08)  0.68 (.04)  0.70 (.04)
Calhousing  XGBoost  -           0.50 (.00)  0.62 (.10)  0.74 (.03)  0.79 (.04)  0.82 (.04)  0.87 (.01)  0.90 (.01)  0.92 (.01)  0.97 (.00)
            TabPFN   -           0.63 (.13)  0.63 (.11)  0.80 (.03)  0.85 (.03)  0.89 (.01)  0.91 (.01)  0.92 (.00)  0.93 (.00)  0.94 (.00)
            TabLLM   0.61 (.01)  0.63 (.05)  0.60 (.07)  0.70 (.08)  0.77 (.08)  0.77 (.04)  0.81 (.02)  0.83 (.01)  0.86 (.02)  0.95 (.00)
Car         XGBoost  -           0.50 (.00)  0.59 (.04)  0.70 (.08)  0.82 (.03)  0.91 (.02)  0.95 (.01)  0.98 (.01)  0.99 (.01)  1.00 (.00)
            TabPFN   -           0.64 (.06)  0.75 (.05)  0.87 (.04)  0.92 (.02)  0.97 (.00)  0.99 (.01)  1.00 (.00)  1.00 (.00)  1.00 (.00)
            TabLLM   0.82 (.02)  0.83 (.03)  0.85 (.03)  0.86 (.03)  0.91 (.02)  0.96 (.02)  0.98 (.01)  0.99 (.00)  1.00 (.00)  1.00 (.00)
Credit-g    XGBoost  -           0.50 (.00)  0.51 (.07)  0.59 (.05)  0.66 (.03)  0.67 (.06)  0.68 (.02)  0.73 (.02)  0.75 (.03)  0.78 (.04)
            TabPFN   -           0.58 (.08)  0.59 (.03)  0.64 (.06)  0.69 (.07)  0.70 (.07)  0.72 (.06)  0.75 (.04)  0.75 (.02)  0.75 (.03)
            TabLLM   0.53 (.05)  0.69 (.04)  0.66 (.04)  0.66 (.05)  0.72 (.06)  0.70 (.07)  0.71 (.07)  0.72 (.03)  0.72 (.02)  0.70 (.02)
Diabetes    XGBoost  -           0.50 (.00)  0.59 (.16)  0.72 (.07)  0.69 (.08)  0.73 (.05)  0.78 (.05)  0.80 (.03)  0.80 (.01)  0.84 (.03)
            TabPFN   -           0.61 (.13)  0.67 (.11)  0.71 (.07)  0.77 (.03)  0.82 (.03)  0.83 (.03)  0.83 (.03)  0.81 (.02)  0.81 (.03)
            TabLLM   0.68 (.06)  0.61 (.09)  0.63 (.08)  0.69 (.07)  0.68 (.04)  0.73 (.03)  0.79 (.04)  0.78 (.02)  0.78 (.04)  0.80 (.04)
Heart       XGBoost  -           0.50 (.00)  0.55 (.14)  0.84 (.07)  0.88 (.04)  0.91 (.01)  0.91 (.01)  0.90 (.01)  0.92 (.01)  0.94 (.01)
            TabPFN   -           0.84 (.06)  0.88 (.05)  0.87 (.06)  0.91 (.02)  0.92 (.02)  0.92 (.02)  0.92 (.01)  0.92 (.02)  0.92 (.02)
            TabLLM   0.54 (.04)  0.76 (.14)  0.83 (.05)  0.87 (.04)  0.87 (.06)  0.91 (.01)  0.90 (.01)  0.92 (.01)  0.92 (.01)  0.94 (.01)
Income      XGBoost  -           0.50 (.00)  0.59 (.06)  0.77 (.02)  0.79 (.03)  0.82 (.02)  0.84 (.01)  0.87 (.01)  0.88 (.00)  0.93 (.00)
            TabPFN   -           0.73 (.08)  0.71 (.09)  0.76 (.09)  0.80 (.04)  0.82 (.04)  0.84 (.01)  0.86 (.01)  0.87 (.01)  0.89 (.00)
            TabLLM   0.84 (.00)  0.84 (.01)  0.84 (.02)  0.84 (.04)  0.84 (.01)  0.84 (.02)  0.86 (.01)  0.87 (.00)  0.89 (.01)  0.92 (.00)
Jungle      XGBoost  -           0.50 (.00)  0.58 (.07)  0.72 (.05)  0.78 (.03)  0.81 (.02)  0.84 (.02)  0.87 (.01)  0.91 (.01)  0.98 (.00)
            TabPFN   -           0.65 (.08)  0.72 (.04)  0.71 (.07)  0.78 (.02)  0.81 (.01)  0.84 (.01)  0.88 (.01)  0.91 (.00)  0.93 (.00)
            TabLLM   0.60 (.00)  0.64 (.01)  0.64 (.02)  0.65 (.03)  0.71 (.02)  0.78 (.02)  0.81 (.02)  0.84 (.01)  0.89 (.01)  1.00 †
† These experiments were only performed for a single run due to runtime limitations of TabLLM on the full dataset.
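
The cells of Table 1 summarize per-seed test AUCs as a mean and a standard deviation. The following is a minimal sketch of that aggregation, not the evaluation code used for the paper. It assumes binary labels and one score per example; the helper names `roc_auc` and `format_cell` are hypothetical, and the use of the sample standard deviation is an assumption because the text does not state which estimator was used.

```python
from statistics import mean, stdev


def roc_auc(labels, scores):
    """AUC as the probability that a random positive is scored above a
    random negative; ties count as one half (pairwise rank statistic)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def format_cell(per_seed_aucs):
    """Mean and sample standard deviation over seeds, as 'mean ± sd'.
    Assumption: sample SD (n - 1 denominator)."""
    return f"{mean(per_seed_aucs):.2f} ± {stdev(per_seed_aucs):.2f}"


if __name__ == "__main__":
    # Toy labels and scores, not data from the paper.
    print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

A model that assigns the same score to every example obtains an AUC of 0.5 under this definition, since every positive-negative pair is a tie.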

was on par or very close to the best baseline models on the full datasets, indicating that there is little performance lost due to the serialization and the choice of model family.

Table 2: Five highest and lowest weighted features for zero-shot TabLLM and logistic regression (LR) trained on all data for Income. Both models show very similar trends for important features.

                              TabLLM           LR
Feature                       rank   weight    rank   weight
capital gain                     1    5.310       2    2.393
education Masters                2    4.623       6    1.455
education Doctorate              3    3.410       4    2.066
education Bachelors              4    2.995       7    1.135
education Prof-school            5    2.949       5    1.900
occupation Priv-house-serv     102   -2.840     105   -1.909
education 12th                 103   -3.178      79   -0.480
education Preschool            104   -3.520     106   -2.385
occupation Farming-fishing     105   -3.853      98   -0.982
workclass Without-pay          106   -4.423      69   -0.174

Introspecting TabLLM: What Prior Knowledge Does it Use? Given the strong zero-shot performance of TabLLM on the Income dataset, we next sought to understand which features it based its predictions on in order to shed light on the prior knowledge used by the LLM. To determine the feature importance for TabLLM, we fit a LR model to the zero-shot prediction using the original features as covariates as described in Sec. 6 in the Supplement. Highly weighted features (see Table 2) for zero-shot TabLLM include the individual’s occupation (with e.g., ‘Farming-fishing’ having a large negative weight), highest education level (‘Masters’ and ‘Doctorate’ have positive weights; ‘Preschool’ grade has a negative weight), and workclass (‘Without-pay’ has a negative weight). TabLLM also seems to be able to correctly interpret the numerically encoded capital gain value. For comparison, we also show the feature weights for a LR model trained on all data. We see a strong concordance between both models; TabLLM’s top five features are all among the top seven of the LR model. However, TabLLM scores the highest education

Table 3: Test AUC on the healthcare claims dataset. TabLLM outperforms logistic regression (LR) for up to 64 and LightGBM for up to 256 training examples on End of Life (EoL). Standard deviations are given across five random seeds (shown in parentheses).

                      Number of Shots
Dataset   Method      0      16          64          256         1,024       4,096       16,384      all
EoL       LR          -      0.65 (.07)  0.77 (.02)  0.80 (.02)  0.83 (.01)  0.83 (.01)  0.84 (.01)  0.84 (.01)
          LightGBM    -      0.50 (.00)  0.71 (.01)  0.76 (.02)  0.80 (.01)  0.82 (.01)  0.83 (.01)  0.82 †
          TabLLM      0.70   0.74        0.78        0.78        0.79        0.81        0.81        -
Surgery   LR          -      0.72 (.04)  0.75 (.05)  0.77 (.01)  0.79 (.01)  0.80 (.01)  0.80 (.00)  0.81 (.00)
          LightGBM    -      0.50 (.00)  0.73 (.02)  0.77 (.01)  0.79 (.01)  0.80 (.00)  0.81 (.01)  0.82 †
          TabLLM      0.67   0.73        0.72        0.73        0.75        0.78        0.79        -
LoH       LR          -      0.72 (.04)  0.76 (.03)  0.80 (.01)  0.82 (.01)  0.83 (.01)  0.83 (.01)  0.84 (.01)
          LightGBM    -      0.50 (.00)  0.72 (.02)  0.76 (.03)  0.81 (.01)  0.83 (.00)  0.83 (.01)  0.85 †
          TabLLM      0.71   0.73        0.73        0.76        0.78        0.81        0.82        -
† These experiments were only performed for a single run due to runtime limitations on the full dataset.

Table 4: Five highest and lowest weighted features for zero-shot TabLLM for EoL and their relative risk (RR) with confidence intervals (CI). The top five features show a significant increase of the relative risk.

Feature                                TabLLM   RR (95% CI)
atrial fibrillation                     0.633   2.72 (2.51-2.95)
atherosclerosis of coronary art...      0.530   2.10 (1.94-2.27)
atherosclerosis of aorta                0.473   1.99 (1.81-2.19)
exudative age-related macular d...      0.452   2.38 (2.06-2.75)
sex male                                0.442   1.23 (1.14-1.33)
open angle with borderline intr...     -0.338   1.20 (1.03-1.40)
primary localized osteoarthrosi...     -0.366   1.08 (0.82-1.43)
localized, primary osteoarthritis      -0.393   1.23 (1.07-1.40)
sex female                             -0.441   0.81 (0.75-0.88)
open-angle glaucoma - borderline       -0.495   0.97 (0.85-1.10)

Introspecting TabLLM: What Prior Knowledge Does it Use? We also performed a feature analysis to study the strong zero-shot performance on EoL. However, we did not compare to a LR model trained on all data due to the vast number of features and potential collinearities in the data. Instead, we compared to the relative risk (RR) with a 95% confidence interval (CI). Table 4 shows the five highest and lowest weighted features of zero-shot TabLLM and their relative risk for EoL. All top five features have a significantly increased relative risk, demonstrating the capabilities of TabLLM to identify relevant features even without any training examples. For the five lowest weighted features, only ‘sex female’ has a significantly decreased risk. A list of 100 features is given in Table 17 in the Supplement.
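
For readers unfamiliar with the quantity reported in Table 4, the relative risk of a binary feature can be estimated from a 2x2 table of outcome counts. The sketch below is a generic illustration and makes no claim about how the intervals in Table 4 were computed: it assumes the common log-scale (Katz) approximation for the 95% CI, and the function names and example counts are hypothetical.

```python
import math


def relative_risk(exposed_events, exposed_total,
                  unexposed_events, unexposed_total, z=1.96):
    """Relative risk with an approximate CI on the log scale.
    Assumption: Katz log method; z = 1.96 gives a 95% interval."""
    risk_exposed = exposed_events / exposed_total
    risk_unexposed = unexposed_events / unexposed_total
    rr = risk_exposed / risk_unexposed
    # Standard error of log(RR) for two independent binomial samples.
    se_log = math.sqrt(1 / exposed_events - 1 / exposed_total
                       + 1 / unexposed_events - 1 / unexposed_total)
    lower = math.exp(math.log(rr) - z * se_log)
    upper = math.exp(math.log(rr) + z * se_log)
    return rr, lower, upper


def is_significant(lower, upper):
    """A risk change is significant at this level if the CI excludes 1."""
    return lower > 1.0 or upper < 1.0


if __name__ == "__main__":
    # Made-up counts: 30/100 events with the feature, 10/100 without.
    print(relative_risk(30, 100, 10, 100))
```

Applied to the intervals printed in Table 4, `is_significant(0.75, 0.88)` is true for ‘sex female’ (a decreased risk), whereas `is_significant(0.85, 1.10)` is false because that interval contains 1.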

degrees in the opposite order. Table 16 in the Supplement shows the importance of all 106 features.

5.3 Large Healthcare Claims Dataset

Table 3 shows the results for TabLLM with the List Template serialization on EoL, Surgery, and LoH, the three prediction tasks for the healthcare claims dataset. TabLLM showed considerable zero-shot performance, ranging from 0.67 AUC for Surgery to 0.71 for LoH. The performance improves with a higher number of training examples. However, the performance jumps happen at different steps and to a different extent. TabLLM outperformed LR for up to 16 (Surgery and LoH) to 64 (EoL) training examples and LightGBM for up to 64 (LoH) and 256 (EoL) examples. For more examples, LR and LightGBM performed slightly better. This could suggest that the information lost from our concept selection procedure, needed because of the token limits of the LLM, eventually starts costing TabLLM performance. We also evaluated TabLLM and LR in an unbalanced setting (see Table 15 in the Supplement). In this case, TabLLM outperforms LR up to 64 training examples on all datasets, emphasizing its utility in a real-world setting with limited access to labeled data.

6 DISCUSSION

For all datasets except Credit-g and Heart, the List Template and Text Template serializations showed nontrivial zero-shot performance, indicating that TabLLM is able to effectively utilize prior knowledge in the LLM for classification. Serializations with LLMs proved suboptimal due to their noisy outputs, suggesting that simple templates are preferable for TabLLM. The performance drops observed when we removed or permuted the column names indicate that the LLM actually makes use of feature names and their relationships to the correct values, especially in the few-shot setting. These findings are partly consistent with Dinh et al. (2022), who used GPT-3 and tested serializations with removed or permuted column names. When using all training examples, they showed that using the correct column names led to the best performance on four classification tasks. In contrast to our results, however, they could not confirm these findings when using only a fraction (0.2, 0.4, 0.6, 0.8) of the training data. A reason for this could be that we tested a much smaller number of training examples. In addition to that, we found a very strong drop in performance for permuted values, showing that the LLM relies more on the correct values than feature names. Surprisingly, however, all serializations with less information came close to the best serialization for 256 (tabular datasets) to 1024 training examples (insurance dataset). Hence, when hundreds of training examples are available, the input format proved less relevant, and the LLM was able to adapt (Jin et al., 2022). Like our results, Bertsimas et al. (2022) found that natural language representation of healthcare data gave little-to-no improvement (in their different setup) compared to a more straightforward serialization in the medium-shot setting. Our findings also support prior work showing that irrelevant and even misleading inputs can lead to similar few-shot performance (Min et al., 2022; Webson and Pavlick, 2022; Reynolds and McDonell, 2021). For instance, permuting the column names only showed a difference for up to 16 training examples (see Figure 2).

We found clear performance improvements for TabLLM when using additional training examples. It often outperformed strong baseline models in the very-few-shot setting. This emphasizes the value of leveraging LLMs when only little labeled data is available. Surprisingly, Dinh et al. (2022) could not confirm these findings for GPT-3. On two binary classification tasks a fine-tuned GPT-3 model performed worse than LR for up to 250 training examples. Our results indicate that the sample efficiency of TabLLM is highly task-dependent. The performance on Blood, Credit-g, Diabetes, and Heart is worse than the performance on Income and Car. Most features of the latter datasets have semantically meaningful textual values, likely boosting TabLLM’s performance. However, TabLLM also achieved reasonable results on numerical datasets (Blood, California, Diabetes, and Jungle). In addition, Diabetes and Heart have somewhat specialized feature names and values, such as “ventricular hypertrophy” and “Plasma glucose concentration,” whereas Income and Car are more general-domain knowledge. This indicates that T0, the language model we used in TabLLM, seems to have less prior knowledge about medicine than about general-domain concepts. Indeed, the training tasks for T0 do not contain any tasks with medical data (Sanh et al., 2022).

Our findings on the three insurance claims datasets partly reinforce this hypothesis. Zero-shot performance depends on the concept selection strategy and the LLM seems to have little knowledge about medical procedures. Prior work has shown that medical-domain-specific language models, such as PubMedBERT, and general-domain models with medical data in their training sets, such as GPT-3, perform well at downstream prediction tasks on medical data even with fairly few samples (Gu et al., 2021; Agrawal et al., 2022). Substituting T0 with one of these models in TabLLM to study medical prediction tasks is an interesting direction for future work.

Our results on the public Blood, Diabetes, and Heart datasets are very similar to our results for EoL, Surgery, and LoH, which are practically relevant but rely on private data. Except for the zero-shot and very few-shot regime, other baselines tend to outperform TabLLM on these datasets. This suggests that Blood, Diabetes, and Heart datasets could be good proxies for the community to further study medical-domain tabular classification with LLMs without needing access to large private datasets.

7 LIMITATIONS AND CONCLUSION

TabLLM has a much larger computational footprint compared to traditional algorithms. It still requires fairly large GPUs to fine-tune the LLM, and inference with T0 requires far more FLOPs than inference with XGBoost or LR. Our results indicate that TabLLM trades off this computational efficiency for improved sample efficiency. Further, as we saw with the three healthcare claims tasks, performance may suffer if the dense feature set for a given row cannot fit within the token limit for a given LLM. Since the gains from TabLLM stem from its ability to use existing domain knowledge, the semantics of the column names and feature values need to have been observed during the LLM’s original pre-training. For example, if the columns represent genes, we may not expect a vanilla LLM to have strong representations for gene names. Finally, due to dataset shift, the pre-training data for a given LLM may not necessarily reflect the settings under which a given table was aggregated, e.g., due to inflation and a changing value of money (see Sec. 5 in the Supplement).

Despite these limitations, our empirical results show that TabLLM enjoys strong performance at tabular classification, outperforming state-of-the-art baseline algorithms like XGBoost and SAINT by over 5 AUC points in the very-few-shot regime, all while staying competitive with these methods when a large number of samples is available.

Currently, TabLLM does not use any unlabeled data; a fruitful direction could involve leveraging unlabeled data, e.g., using the techniques from Lang et al. (2022) to combine the few-shot performance of TabLLM with the ultimate performance of tree-based baselines by co-training the models together. Other improvements could include more faithful LLM serializations as well as numeric-specific encoding methods (Gorishniy et al., 2022).

8 SOCIETAL IMPACT

Similar to other ML systems that were trained on historic data, LLMs are prone to replicate existing biases and stereotypes. Hence, when applying TabLLM for sensitive tasks such as income or a health trajectory, predictions should be considered with great care and further analyses (e.g., for subgroups) are mandatory. In addition, LLMs require a lot of computing resources. This bears the risk of creating an exclusive research environment. Also, the environmental impact of LLMs can be significant.

9 ACKNOWLEDGEMENTS

SH was supported by the German Academic Exchange Service, HL by NSF AiTF award CCF-1723344, MA by a Takeda Fellowship, and DS, HL, AB, and SH in part by Independence Blue Cross. Thanks to Dr. Steven Horng for generously donating GPU-time on the BIDMC computing cluster (Horng, 2022) and to NVIDIA Corporation for their donation of two NVIDIA A100 GPUs used in this work.

References
Agrawal, M., Hegselmann, S., Lang, H., Kim, Y., and Sontag, D. (2022). Large Language Models are Zero-Shot Clinical Information Extractors. Technical Report arXiv:2205.12689, arXiv.
Arik, S. Ö. and Pfister, T. (2021). Tabnet: Attentive interpretable tabular learning. Proceedings of the AAAI Conference on Artificial Intelligence, 35(8):6679–6687.
Avati, A., Jung, K., Harman, S., Downing, L., Ng, A., and Shah, N. H. (2018). Improving palliative care with deep learning. BMC medical informatics and decision making, 18(4):55–64.
Bach, S., Sanh, V., Yong, Z. X., Webson, A., Raffel, C., Nayak, N. V., Sharma, A., Kim, T., Bari, M. S., Fevry, T., Alyafeai, Z., Dey, M., Santilli, A., Sun, Z., Ben-david, S., Xu, C., Chhablani, G., Wang, H., Fries, J., Al-shaibani, M., Sharma, S., Thakker, U., Almubarak, K., Tang, X., Radev, D., Jiang, M. T.-j., and Rush, A. (2022). PromptSource: An integrated development environment and repository for natural language prompts. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 93–104, Dublin, Ireland. Association for Computational Linguistics.
Bahri, D., Jiang, H., Tay, Y., and Metzler, D. (2022). Scarf: Self-supervised contrastive learning using random feature corruption. In International Conference on Learning Representations.
Bertsimas, D., Carballo, K. V., Ma, Y., Na, L., Boussioux, L., Zeng, C., Soenksen, L. R., and Fuentes, I. (2022). Tabtext: a systematic approach to aggregate knowledge across tabular data structures. arXiv preprint arXiv:2206.10381.
Borisov, V., Leemann, T., Seßler, K., Haug, J., Pawelczyk, M., and Kasneci, G. (2022a). Deep Neural Networks and Tabular Data: A Survey. Technical Report arXiv:2110.01889, arXiv.
Borisov, V., Seßler, K., Leemann, T., Pawelczyk, M., and Kasneci, G. (2022b). Language models are realistic tabular data generators. arXiv preprint arXiv:2210.06280.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. (2020). Language models are few-shot learners. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H., editors, Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.
Chen, T. and Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pages 785–794, New York, NY, USA. Association for Computing Machinery.
Cui, G., Hu, S., Ding, N., Huang, L., and Liu, Z. (2022). Prototypical verbalizer for prompt-based few-shot tuning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7014–7024, Dublin, Ireland. Association for Computational Linguistics.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
Dinh, T., Zeng, Y., Zhang, R., Lin, Z., Gira, M., Rajput, S., yong Sohn, J., Papailiopoulos, D., and Lee, K. (2022). LIFT: Language-interfaced fine-tuning for non-language machine learning tasks. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K., editors, Advances in Neural Information Processing Systems.
Gorishniy, Y., Rubachev, I., and Babenko, A. (2022). On embeddings for numerical features in tabular deep learning. arXiv preprint arXiv:2203.05556.
Grinsztajn, L., Oyallon, E., and Varoquaux, G. (2022). Why do tree-based models still outperform deep learning on typical tabular data? In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
Gu, Y., Tinn, R., Cheng, H., Lucas, M., Usuyama, N., Liu, X., Naumann, T., Gao, J., and Poon, H. (2021). Domain-specific language model pretraining for biomedical natural language processing. ACM Transactions on Computing for Healthcare (HEALTH), 3(1):1–23.
Haendel, M., Vasilevsky, N., Unni, D., Bologa, C., Harris, N., Rehm, H., Hamosh, A., Baynam, G., Groza, T., McMurry, J., et al. (2020). How many rare diseases are there? Nature Reviews Drug Discovery, 19(2):77–78.

Harari, A. and Katz, G. (2022). Few-shot tabular data enrichment using fine-tuned transformer architectures. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1577–1591.
Hollmann, N., Müller, S., Eggensperger, K., and Hutter, F. (2022). Tabpfn: A transformer that solves small tabular classification problems in a second. arXiv preprint arXiv:2207.01848.
Horng, S. (2022). Machine Learning Core.
Huang, X., Khetan, A., Cvitkovic, M., and Karnin, Z. (2020). TabTransformer: Tabular Data Modeling Using Contextual Embeddings. Technical Report arXiv:2012.06678, arXiv.
Jin, W., Cheng, Y., Shen, Y., Chen, W., and Ren, X. (2022). A good prompt is worth millions of parameters? low-resource prompt-based learning for vision-language models. In ACL 2022.
Kadra, A., Lindauer, M., Hutter, F., and Grabocka, J. (2021). Well-tuned simple nets excel on tabular datasets. Advances in neural information processing systems, 34:23928–23941.
Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., and Liu, T.-Y. (2017). LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
Kodialam, R., Boiarsky, R., Lim, J., Sai, A., Dixit, N., and Sontag, D. (2021). Deep contextual clinical prediction with reverse distillation. Proceedings of the AAAI Conference on Artificial Intelligence, 35(1):249–258.
Kontschieder, P., Fiterau, M., Criminisi, A., and Bulo, S. R. (2015). Deep Neural Decision Forests. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 1467–1475, Santiago, Chile. IEEE.
Lang, H., Agrawal, M. N., Kim, Y., and Sontag, D. (2022). Co-training improves prompt-based learning for large language models. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., and Sabato, S., editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 11985–12003. PMLR.
Levin, R., Cherepanova, V., Schwarzschild, A., Bansal, A., Bruss, C. B., Goldstein, T., Wilson, A. G., and Goldblum, M. (2022). Transfer Learning with Deep Tabular Models. Technical Report arXiv:2206.15306, arXiv.
Li, Y., Li, J., Suhara, Y., Doan, A., and Tan, W.-C. (2020). Deep entity matching with pre-trained language models. Proc. VLDB Endow., 14(1):50–60.
Liu, H., Tam, D., Muqeeth, M., Mohta, J., Huang, T., Bansal, M., and Raffel, C. (2022). Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning. arXiv:2205.05638 [cs]. arXiv: 2205.05638.
Liu, X., Zheng, Y., Du, Z., Ding, M., Qian, Y., Yang, Z., and Tang, J. (2021). GPT Understands, Too. Technical Report arXiv:2103.10385, arXiv.
Min, S., Lyu, X., Holtzman, A., Artetxe, M., Lewis, M., Hajishirzi, H., and Zettlemoyer, L. (2022). Rethinking the role of demonstrations: What makes in-context learning work? arXiv preprint arXiv:2202.12837.
Narayan, A., Chami, I., Orr, L., and Ré, C. (2022). Can Foundation Models Wrangle Your Data? Technical Report arXiv:2205.09911, arXiv.
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P., Leike, J., and Lowe, R. (2022). Training language models to follow instructions with human feedback. arXiv:2203.02155 [cs]. arXiv: 2203.02155.
Perez, E., Kiela, D., and Cho, K. (2021). True few-shot learning with language models. Advances in Neural Information Processing Systems, 34:11054–11070.
Popov, S., Morozov, S., and Babenko, A. (2020). Neural oblivious decision ensembles for deep learning on tabular data. In International Conference on Learning Representations.
Reynolds, L. and McDonell, K. (2021). Prompt programming for large language models: Beyond the few-shot paradigm. In Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems, pages 1–7.
Sahakyan, M., Aung, Z., and Rahwan, T. (2021). Explainable artificial intelligence for tabular data: A survey. IEEE Access, 9:135392–135422.
Sanh, V., Webson, A., Raffel, C., Bach, S., Sutawika, L., Alyafeai, Z., Chaffin, A., Stiegler, A., Raja, A., Dey, M., Bari, M. S., Xu, C., Thakker, U., Sharma, S. S., Szczechla, E., Kim, T., Chhablani, G., Nayak, N., Datta, D., Chang, J., Jiang, M. T.-J., Wang, H., Manica, M., Shen, S., Yong, Z. X., Pandey, H., Bawden, R., Wang, T., Neeraj, T., Rozen, J., Sharma, A., Santilli, A., Fevry, T., Fries, J. A., Teehan, R., Scao, T. L., Biderman, S., Gao, L., Wolf, T., and Rush, A. M. (2022). Multitask prompted training enables zero-shot task generalization. In International Conference on Learning Representations.
Schick, T. and Schütze, H. (2021). Exploiting Cloze-Questions for Few-Shot Text Classification and Natural Language Inference. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 255–269, Online. Association for Computational Linguistics.

Shwartz-Ziv, R. and Armon, A. (2022). Tabular data: Deep learning is not all you need. Information Fusion, 81.
Somepalli, G., Goldblum, M., Schwarzschild, A., Bruss, C. B., and Goldstein, T. (2021). SAINT: Improved Neural Networks for Tabular Data via Row Attention and Contrastive Pre-Training. Technical Report arXiv:2106.01342, arXiv.
Webson, A. and Pavlick, E. (2022). Do Prompt-Based Models Really Understand the Meaning of Their Prompts? In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2300–2344, Seattle, United States. Association for Computational Linguistics.
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., and Zhou, D. (2022). Chain of Thought Prompting Elicits Reasoning in Large Language Models. arXiv:2201.11903 [cs]. arXiv: 2201.11903.
Yin, P., Neubig, G., Yih, W.-t., and Riedel, S. (2020). TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8413–8426, Online. Association for Computational Linguistics.
Yoon, J., Zhang, Y., Jordon, J., and van der Schaar, M. (2020). VIME: Extending the Success of Self- and Semi-supervised Learning to Tabular Domain. In Advances in Neural Information Processing Systems, volume 33, pages 11033–11043. Curran Associates, Inc.
Zhao, Z., Wallace, E., Feng, S., Klein, D., and Singh, S. (2021). Calibrate Before Use: Improving Few-shot Performance of Language Models. In Proceedings of the 38th International Conference on Machine Learning, pages 12697–12706. PMLR. ISSN: 2640-3498.

Supplementary Materials:
TabLLM: Few-shot Classification of Tabular Data with Large Language
Models

1 ADDITIONAL DATASET DETAILS

1.1 Public Tabular Datasets

We systematically identified datasets for classification from Kadra et al. (2021), Grinsztajn et al. (2022), Borisov et al.
(2022a), and from Kaggle. Each dataset was separated into 80/20 train-test splits. The 𝑘 labeled examples D_𝑘 were
sampled in a class-balanced manner from the training set. We performed experiments for different numbers of training
examples (shots) ranging from 0 to 512 and the entire dataset (all). To characterize the sensitivity of models to the choice of
𝑘 labeled examples, we repeated the dataset splitting and sampling procedures for five different seeds and report the mean
AUC and standard deviation (SD) across seeds. No hyperparameter tuning was conducted for TabLLM; for baselines,
internal cross validation was conducted to choose optimal hyperparameters, and the model was then retrained on all data.
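
The class-balanced sampling of the 𝑘 labeled examples can be sketched as follows. This is an illustration under stated assumptions rather than the sampling code used for the experiments: the function name `sample_k_shot` is hypothetical, and both the handling of a 𝑘 that is not divisible by the number of classes and the handling of classes with too few rows are our own choices.

```python
import random
from collections import defaultdict


def sample_k_shot(labels, k, seed):
    """Return indices of k training examples, split as evenly as possible
    across classes. The seed makes the draw reproducible."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    classes = sorted(by_class)
    per_class, remainder = divmod(k, len(classes))
    chosen = []
    for i, c in enumerate(classes):
        # Assumption: the first `remainder` classes receive one extra shot.
        want = per_class + (1 if i < remainder else 0)
        pool = by_class[c]
        # Assumption: a class with fewer rows than requested contributes all rows.
        chosen.extend(rng.sample(pool, min(want, len(pool))))
    rng.shuffle(chosen)
    return chosen


if __name__ == "__main__":
    toy_labels = [0] * 90 + [1] * 10  # imbalanced toy training set
    for seed in range(5):             # five seeds, as in the protocol above
        print(seed, sorted(sample_k_shot(toy_labels, 8, seed)))
```

Repeating the draw for several seeds, as in the loop above, yields the per-seed few-shot sets whose AUCs are then averaged.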
We analyzed the following datasets:

• Bank (Kadra et al., 2021) contains information about a direct marketing campaign from a Portuguese banking institution
(Moro et al., 2014). The goal is to predict whether a customer subscribed to a term deposit or not. It consists of 45,211
rows and 16 features; 5,289 labels are positive.
• Blood (Kadra et al., 2021) consists of data of a blood transfusion service from Taiwan (Yeh et al., 2009). It contains
4 attributes of 748 donors, and the label represents whether they returned for another donation (178 positive).
• California (Grinsztajn et al., 2022) contains eight attributes of 20,640 districts in California and the goal is to predict
the median house value in each district (Pace and Barry, 1997). Analogously to Grinsztajn et al. (2022), we created a
balanced classification task by predicting whether the house value is below or above the median (10,317 positive).
• Car (Kadra et al., 2021) has entries for different cars that are characterized by six attributes; the task is a multiclass
classification problem evaluating the state of each car. The dataset contains 1,728 rows, and the four classes have a
distribution of 1210, 384, 65, and 69 examples.
• Credit-g (Kadra et al., 2021) describes 1,000 people from Germany that want to receive a credit using 20 attributes.
The task is to predict whether they are a good or bad credit risk; 700 are classified as good.
• Diabetes (from Kaggle2 ) was collected by the National Institute of Diabetes and Digestive and Kidney Diseases
(Smith et al., 1988) and contains 768 rows, each corresponding to a woman of Pima Indian heritage with eight clinical
variables. The task is binary classification of whether a person has diabetes; 268 cases are positive.
• Heart (from Kaggle3 ) contains data of four different hospitals (Detrano et al., 1989). Each row contains 11 clinical
variables of a patient. The task is binary classification of coronary artery disease. Of the 918 patients, 508 are positive.
• Income (Kadra et al., 2021; Borisov et al., 2022a), also called Adult, contains rows for 48,842 individuals with twelve
attributes collected in the 1994 U.S. Census (Kohavi et al., 1996; Dua and Graff, 2017). The task is to predict whether
each person has an annual income over $50,000. The dataset has 11,687 positive labels.
• Jungle (Kadra et al., 2021) is a collection of 44,819 end game positions of Jungle Chess (van Rijn and Vis, 2014).
Each game is described with 6 attributes and the goal is to predict whether the white player will win (23,062 positive).
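
The median split used for the California task can be illustrated with a short sketch. It assumes that values strictly above the median form the positive class, which the text does not state explicitly; the house values below are toy numbers, not data from the dataset.

```python
from statistics import median

def binarize_at_median(values):
    """Label each value 1 if it lies strictly above the median, else 0."""
    threshold = median(values)
    labels = [1 if v > threshold else 0 for v in values]
    return labels, threshold

# Toy median house values (hypothetical, for illustration only).
house_values = [1.2, 3.4, 0.8, 2.5, 4.1, 1.9]
labels, threshold = binarize_at_median(house_values)
print(labels)
```

Ties at the median fall into the negative class under this assumption, which is one way a split can end up slightly off an exact 50/50 balance.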

2 https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database (06/28/2022)
3 https://www.kaggle.com/fedesoriano/heart-failure-prediction (06/28/2022)

Table 5: Evaluation of different concept selection methods for the healthcare claims dataset in the zero-shot setting. The
last two rows show the performance when concepts were selected based on the lasso path of logistic regression weights,
which violates the zero-shot assumption (*).
Method EoL Surgery LoH
Age, sex, and race 0.59 0.57 0.65
Least frequent conditions 0.57 0.64 0.67
Least frequent procedures 0.59 0.59 0.65
Least frequent concepts (cond. + proc.) 0.55 0.55 0.66
Most frequent conditions 0.67 0.66 0.69
Most frequent procedures 0.59 0.58 0.65
Most frequent concepts (cond. + proc.) 0.62 0.61 0.65
Oldest conditions 0.65 0.66 0.69
Oldest procedures 0.59 0.58 0.65
Oldest concepts (cond. + proc.) 0.60 0.60 0.67
Most recent conditions 0.65 0.66 0.69
Most recent procedures 0.55 0.59 0.65
Most recent concepts (cond. + proc.) 0.59 0.60 0.66
Most relevant concepts based on 256 shots* 0.60 0.58 0.69
Most relevant concepts based on 4096 shots* 0.65 0.57 0.68

1.2 Large Healthcare Claims Dataset

The de-identified health claims data set was provided by a large U.S. health insurer. The data is stored in the Observational
Medical Outcomes Partnership (OMOP) Common Data Model version 6.0 (Hripcsak et al., 2015). It contains an entry for
every encounter a patient has with the health system. Each entry is associated with a date, a visit type (5 total), a medical
specialty (216 total), present conditions (14,095 total), and performed procedures (21,184 total). We additionally used the
static concepts age, sex, and race at time of prediction.
We studied three different tasks on this dataset with distinct cohorts. For all tasks, we used a six-month outcome period
and a gap of three months between the time of prediction and the outcome window to prevent data leakage. We required
patients to have at least one medical visit and to have been actively enrolled in an insurance plan for at least 95% of the
last year and of the six-month outcome window. We used 10% of the data as a holdout set and sampled the 𝑘 balanced shots
with replacement from the remaining data. We chose larger shot sizes, as the tasks are more complex. We only ran the
experiments for a single seed due to runtime limitations. We considered the following tasks:

• End of Life (EoL): We predicted the mortality of all patients older than 70 years. This is often used as a surrogate
task. For instance, it can improve initiation of palliative care (Avati et al., 2018) and can help to inform close relatives
to reduce family distress (Curtis et al., 2016). The final cohort contained 94,972 individuals; 2,424 were positive.
• Surgical Procedure (Surgery): We predicted the need for any surgical procedure. The task is important in determin-
ing health care needs and estimating costs. The cohort included 620,382 people of which 243,349 were positive.
• Likelihood of Hospitalization (LoH): We also predicted the likelihood of being hospitalized. Again, this information
can help identify needs and estimate costs. The cohort included 612,656 individuals; 22,427 were positive.
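
The sampling procedure described above (10% holdout, 𝑘 balanced shots drawn with replacement from the rest) could be sketched as follows. This is a minimal illustration under the assumption that "balanced" means 𝑘/2 examples per class; the function names and the toy labels are hypothetical.

```python
import random

def split_holdout(indices, holdout_fraction=0.1, seed=0):
    """Shuffle the indices and set aside a fraction as the holdout set."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]

def sample_balanced_shots(train_indices, labels, k, seed=0):
    """Draw k training examples with replacement, k/2 from each binary class."""
    rng = random.Random(seed)
    by_class = {0: [], 1: []}
    for i in train_indices:
        by_class[labels[i]].append(i)
    shots = []
    for cls in (0, 1):
        shots.extend(rng.choices(by_class[cls], k=k // 2))
    rng.shuffle(shots)
    return shots

# Toy cohort with a rare positive class (5% positive).
toy_labels = [1 if i % 20 == 0 else 0 for i in range(1000)]
train, holdout = split_holdout(range(1000))
shots = sample_balanced_shots(train, toy_labels, k=16)
```

Sampling with replacement means a rare positive example may appear more than once in the shots, which matters for cohorts such as EoL where positives are scarce.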

1.2.1 More Details on the Serialization

Each serialization begins with the patient’s age, sex, and race. For each concept entry that we included, we also added
information about the associated visit. This included its date, the type of doctor the patient saw (e.g., dermatology),
whether it was an outpatient visit or, for an inpatient visit, the length of hospitalization, and the primary complaint of
the associated visit. If a visit was already part of the serialization, we just added the concept to the existing visit
entry. For the List Template and Text Template serializations, approximately 40 medical concepts could be added until the
token limit of T0 was reached. To explore the effect of less information in the input, we also tested the List Short
serialization, where we added only 10 medical concepts to the serialization. Hence, this serialization did not use the
entire token limit of the LLM. Examples of the List Template, Text Template and List Permuted Names serializations
illustrating this structure are given in Sec. 9.1 at the end of the Supplement.
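
The grouping of concepts by visit described above could look roughly like the sketch below. The output format, field names, and example records are hypothetical and chosen only for illustration; the actual templates are the ones shown in Sec. 9.1.

```python
def serialize_patient(patient, concepts, max_concepts=40):
    """Build a list-style string in which concepts that share a visit are grouped."""
    lines = [
        f"- age: {patient['age']}",
        f"- sex: {patient['sex']}",
        f"- race: {patient['race']}",
    ]
    visits = {}  # visit id -> (visit header, concept names); insertion order is kept
    for concept in concepts[:max_concepts]:
        visit = concept["visit"]
        if visit["id"] not in visits:
            if visit["inpatient"]:
                stay = f"inpatient, {visit['days']} days"
            else:
                stay = "outpatient"
            header = (f"{visit['date']}, {visit['specialty']}, {stay}, "
                      f"primary complaint: {visit['complaint']}")
            visits[visit["id"]] = (header, [])
        # A visit that is already present only receives the additional concept.
        visits[visit["id"]][1].append(concept["name"])
    for header, names in visits.values():
        lines.append(f"- visit on {header}: " + "; ".join(names))
    return "\n".join(lines)

# Hypothetical example records.
visit_a = {"id": 1, "date": "2019-03-02", "specialty": "ophthalmology",
           "inpatient": False, "days": 0, "complaint": "blurred vision"}
visit_b = {"id": 2, "date": "2019-05-10", "specialty": "neurology",
           "inpatient": True, "days": 3, "complaint": "speech difficulty"}
example_patient = {"age": 74, "sex": "female", "race": "white"}
example_concepts = [
    {"name": "Hypertensive retinopathy", "visit": visit_a},
    {"name": "Seasonal allergic rhinitis", "visit": visit_a},
    {"name": "Disturbance in speech", "visit": visit_b},
]
example_text = serialize_patient(example_patient, example_concepts)
```

Passing `max_concepts=10` would correspond to the shorter variant that deliberately leaves part of the token limit unused.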

Table 6: Five examples of different concept names for conditions. The first column shows the original name in the
healthcare claims dataset using SNOMED codes. A dash illustrates that no mapping was available.

Original name | ICD | MEDCIN | CHV | Simplify (GPT-3) | Jargon (GPT-3)
Seasonal allergic rhinitis | Allergic rhinitis due to pollen | hay fever | hay fever | Allergies | Seasonal allergic rhinitis
Disturbance in speech | Unspecified speech disturbances | speech difficulties | speech impairment | Speech problems | Dysarthria
Congenital duplication of cervix | - | - | double cervix | Double cervix | Congenital duplication of the cervix
Hypertensive retinopathy | Hypertensive retinopathy | hypertensive retinopathy | hypertensive retinopathy | High blood pressure affecting the retina | Retinopathy h-tensa
Malignant neoplasm of liver | Malignant neoplasm of liver, unspecified | malignant neoplasm of liver | liver cancer | Liver cancer | Hepato-ca

Table 7: Evaluation of alternative condition concept names. International Classification of Diseases (ICD), MEDCIN and
the Consumer Health Vocabulary (CHV) are alternative medical terminologies. We also tested shortening, simplifying,
and rewriting concepts as medical jargon via GPT-3. None of the alternative concept names showed consistent
performance improvement.
Method EoL Surgery LoH
Original concept names (SNOMED) 0.67 0.66 0.69
Map to ICD concept names 0.67 0.67 0.68
Map to MEDCIN concept names 0.67 0.66 0.69
Map to CHV concept names 0.66 0.66 0.69
Shorten long concepts with GPT-3 0.67 0.66 0.69
Simplify concepts with GPT-3 0.67 0.66 0.70
Medical jargon with GPT-3 0.68 0.67 0.70

1.2.2 Concept Selection

For the healthcare claims dataset, the number of recorded medical concepts per patient usually exceeded T0’s token limit.
Hence, we had to determine which concepts of a patient should be included during the serialization. We evaluated four
different concept selection strategies in the zero-shot setting for the List Template serialization: choosing the least
frequent, most frequent, oldest, or most recent concepts per patient. We tested these for all concepts (conditions and
procedures), only conditions, or only procedures. For each patient, we ranked all concepts according to one of the above
methods and added concepts until the token limit of the LLM was reached. For least frequent and most frequent, we used the
earliest visits associated with the selected medical concepts. As a baseline for our experiments, we used a simple
serialization that only contained the patient’s age, sex, and race. We also tested concept selection based on the lasso
path of a logistic regression model determined on 256 and 4,096 shots. This violates the few-shot assumption, but we
considered it an interesting comparison with the other strategies that select concepts per patient.
The results are given in Table 5. Using the most frequent conditions per patient performed at least as well as every other
selection strategy on all three tasks. Frequent conditions might be useful since they reveal the most relevant conditions
of a patient. Also, they are usually more common, so the LLM likely has more prior knowledge about them. Across all
strategies, conditions were usually more useful than procedures. This suggests that the LLM has more prior knowledge about
conditions. Interestingly, selecting the most frequent conditions was at least as good as using the concept weights of an
LR model trained on 256 or 4,096 shots.
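
The per-patient ranking and greedy filling described in this section can be sketched as follows. This is an illustrative approximation: token cost is estimated by a whitespace word count rather than T0's tokenizer, ties are broken alphabetically, and the function and variable names are assumptions.

```python
from collections import Counter

def select_concepts(occurrences, strategy="most_frequent", token_limit=50):
    """Rank a patient's concepts and add them until the token budget is used up.

    occurrences: list of (concept name, ISO date) pairs, one per recorded entry.
    """
    counts = Counter(name for name, _ in occurrences)
    first_seen, last_seen = {}, {}
    for name, date in occurrences:
        first_seen[name] = min(date, first_seen.get(name, date))
        last_seen[name] = max(date, last_seen.get(name, date))

    names = sorted(counts)  # alphabetical base order keeps ties deterministic
    if strategy == "most_frequent":
        ranked = sorted(names, key=lambda n: -counts[n])
    elif strategy == "least_frequent":
        ranked = sorted(names, key=lambda n: counts[n])
    elif strategy == "oldest":
        ranked = sorted(names, key=lambda n: first_seen[n])
    elif strategy == "most_recent":
        ranked = sorted(names, key=lambda n: last_seen[n], reverse=True)
    else:
        raise ValueError(f"unknown strategy: {strategy}")

    selected, used = [], 0
    for name in ranked:
        cost = len(name.split())  # crude stand-in for a real tokenizer
        if used + cost > token_limit:
            break
        selected.append(name)
        used += cost
    return selected

# Hypothetical patient history.
history = [
    ("hypertension", "2019-01-05"), ("hay fever", "2018-03-01"),
    ("hypertension", "2020-02-01"), ("liver cancer", "2021-06-10"),
    ("hypertension", "2021-01-01"), ("hay fever", "2021-07-01"),
]
print(select_concepts(history, "most_frequent", token_limit=3))
```

Restricting the input to conditions only or procedures only amounts to filtering `occurrences` before calling the function.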

1.2.3 Alternative Concept Names

The healthcare claims dataset used SNOMED concept names for conditions and SNOMED, Healthcare Common Procedure Coding
System (HCPCS), International Classification of Diseases (ICD), and Current Procedural Terminology (CPT) concept names
for procedures. We tested different concept names to assess their effect on the performance. We used a zero-shot setting
with the List Template serialization and the most frequent conditions per patient, the best selection strategy determined
as described above. Since the selection method only considered conditions, we only varied the condition names. We
considered three alternative vocabularies in the Unified Medical Language System (UMLS) that covered at least 20% of the
condition concepts and offered different names. ICD is a very common medical terminology offering alternative names for
conditions. MEDCIN and the Consumer Health Vocabulary (CHV) offer concept names specifically targeted at clinicians and
consumers, respectively. We mapped the concepts via their UMLS identifiers. For ICD we were able to map 7,372, for MEDCIN
9,370 and for CHV 3,700 of the 14,095 condition concepts. Alternatively, we explored concept names generated by GPT-3
(Brown et al., 2020). To do so, we used the publicly accessible GPT-3 API (engine text-davinci-002) (Ouyang et al., 2022).
We considered shortened names for concepts with more than sixty characters (“Rewrite this medical condition with at most
six words.”), simplified concept names (“Write this medical condition in a short form in lay language.”) and medical
jargon (“Write this medical condition in medical jargon.”). For the simplified names and the medical jargon, we provided
GPT-3 with a single example for in-context learning. Examples for all alternative concept names except the shortening are
given in Table 6.
The results of this experiment are given in Table 7. All runs used the most frequent conditions as the concept selection
method. We found no consistent performance difference even though there were considerable differences in the concept
names (see Table 6). Surprisingly, TabLLM performs slightly better for EoL and Surgery using medical jargon to encode
concepts.

Table 8: Hyperparameters for LR model.

Parameter Values
penalty ‘l1’, ‘l2’
C 100, 10, 1, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5

Table 9: Hyperparameters for LightGBM model.

Parameter Values
num_leaves 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096
lambda_l1 1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1., 10.
lambda_l2 1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1., 10.
learning_rate 0.01, 0.03, 0.1, 0.3

Table 10: Hyperparameters for XGBoost model.

Parameter Values
max_depth 2, 4, 6, 8, 10, 12
lambda_l1 1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.
lambda_l2 1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.
eta 0.01, 0.03, 0.1, 0.3
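
The vocabulary coverage implied by the mapping counts reported in Section 1.2.3 can be checked with a few lines of arithmetic. The counts are taken from the text; the variable names are illustrative.

```python
TOTAL_CONDITIONS = 14_095
mapped = {"ICD": 7_372, "MEDCIN": 9_370, "CHV": 3_700}

# Share of the condition concepts that each vocabulary could be mapped to.
coverage = {vocab: n / TOTAL_CONDITIONS for vocab, n in mapped.items()}
for vocab, share in coverage.items():
    print(f"{vocab}: {share:.1%}")
```

All three vocabularies clear the 20% coverage threshold, with CHV the closest to it at roughly 26%.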

2 RUNTIME ESTIMATES FOR TABLLM

The TabLLM training time on the Income dataset for 64 training examples and 30 epochs with a batch size of 8 was less
than 3 minutes. The average inference time for the test set of 10,000 examples with a batch size of 16 was 2 minutes,
around 12 ms per example. The training and inference times for the other public datasets were comparable. Due to the
larger size of the healthcare claims dataset, it took nearly 4 minutes to train for 64 examples and 10 epochs for EoL and
was similar for the other two tasks. Inference took approximately 14 minutes for 10,000 examples with a batch size of 16,
i.e. around 84 ms per example. The training times scaled linearly in the shot size.
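
The per-example figures follow directly from the totals. The sketch below reproduces that arithmetic and adds a rough linear extrapolation of training time; the extrapolation assumes the same number of epochs and uses the 3-minute upper bound for 64 shots on Income, so it is an estimate rather than a reported measurement.

```python
def ms_per_example(total_minutes, n_examples):
    """Convert a total inference time in minutes to milliseconds per example."""
    return total_minutes * 60 * 1000 / n_examples

def estimate_training_minutes(k_shots, minutes_at_64=3.0):
    """Linear extrapolation in the shot size (assumes an unchanged epoch count)."""
    return minutes_at_64 * k_shots / 64

print(ms_per_example(2, 10_000))       # public datasets
print(ms_per_example(14, 10_000))      # healthcare claims dataset
print(estimate_training_minutes(512))  # rough upper bound for 512 shots
```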

3 PARAMETER TUNING FOR BASELINES

We used the scikit-learn framework to perform cross-validation and parameter tuning for the LR and the tree-based models
(Pedregosa et al., 2011). For LR we tried common parameters for the penalty term and regularization strength (see Table
8). We used the same LR parameters for the public tabular datasets and the healthcare claims dataset. For the tree-based
models we adopted the hyperparameter ranges from Borisov et al. (2022a) and Grinsztajn et al. (2022). We discretized the
parameter ranges and performed a complete grid search (see Tables 9 and 10).
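
To give a sense of the size of a complete grid search over the discretized ranges, the following sketch enumerates every combination in Tables 8 to 10 using only the standard library. The parameter names are written as they appear in the tables; the helper `expand_grid` is illustrative and not part of the original setup.

```python
from itertools import product

lr_grid = {
    "penalty": ["l1", "l2"],
    "C": [100, 10, 1, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5],
}
lightgbm_grid = {
    "num_leaves": [2 ** i for i in range(1, 13)],    # 2, 4, ..., 4096
    "lambda_l1": [10.0 ** e for e in range(-8, 2)],  # 1e-8, ..., 10
    "lambda_l2": [10.0 ** e for e in range(-8, 2)],
    "learning_rate": [0.01, 0.03, 0.1, 0.3],
}
xgboost_grid = {
    "max_depth": [2, 4, 6, 8, 10, 12],
    "lambda_l1": [10.0 ** e for e in range(-8, 1)],  # 1e-8, ..., 1
    "lambda_l2": [10.0 ** e for e in range(-8, 1)],
    "eta": [0.01, 0.03, 0.1, 0.3],
}

def expand_grid(grid):
    """Return one dict per combination of the listed parameter values."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]

for label, grid in [("LR", lr_grid), ("LightGBM", lightgbm_grid), ("XGBoost", xgboost_grid)]:
    print(label, len(expand_grid(grid)))
```

Each resulting dictionary could be passed to the corresponding model constructor inside a cross-validation loop.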
For the neural baselines SAINT, TabNet, and NODE, we used the setup and suggested hyperparameter ranges in Borisov
et al. (2022a). We modified the open-source implementation of these methods (footnote 4) to support ingestion of the nine
public tabular datasets. We used the hyperparameter-tuning framework Optuna (footnote 5) and selected parameters that
maximize AUC-ROC across folds. Note that for the 4-shot setting of the Car dataset, AUC may not be defined if the selected
validation set includes only one label; in this case we used accuracy as our validation metric but report AUC-ROC on the
holdout test set. Each neural baseline model was run for 20 trials with Optuna and trained for 100 epochs per
hyperparameter setting.
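
The fallback from AUC to accuracy when a validation fold contains a single label can be sketched as below. This is a simplified binary illustration in pure Python (the Car task itself is multiclass), and the function names and the 0.5 decision threshold are assumptions rather than details of the original setup.

```python
def auc_binary(y_true, scores):
    """Rank-based AUC: share of (positive, negative) pairs that are ordered correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def validation_score(y_true, scores, threshold=0.5):
    """Return (metric name, value), falling back to accuracy if AUC is undefined."""
    if len(set(y_true)) < 2:
        # Only one label is present, so no positive/negative pairs exist.
        preds = [int(s >= threshold) for s in scores]
        accuracy = sum(p == y for p, y in zip(preds, y_true)) / len(y_true)
        return "accuracy", accuracy
    return "auc", auc_binary(y_true, scores)

print(validation_score([0, 1, 1, 0], [0.1, 0.8, 0.4, 0.3]))
print(validation_score([1, 1], [0.9, 0.2]))
```

Returning the metric name alongside the value makes it explicit which folds were scored with the fallback.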

4 COMPARING BASELINE RESULTS TO THE LITERATURE

To assess whether our baseline results match results reported in the literature, we report studies that used the same models.

Bank Dataset. Kadra et al. (2021) trained XGBoost, TabNet, and NODE baselines on this dataset and achieved balanced
accuracies of 72.7, 70.6, and 74.6, respectively. Our experiments for a set of 512 balanced training examples (512 shots)
show a better performance for XGBoost than NODE.

Blood Dataset. The XGBoost, TabNet, and NODE baselines trained in Kadra et al. (2021) achieved balanced accuracies
of 62.3, 64.3, and 50, respectively. Our results for a set of 512 balanced training examples (512 shots) also show a
better performance for TabNet than XGBoost. However, in our experiments NODE performs better than XGBoost and not worse.

California Dataset. Borisov et al. (2022a) trained a Linear Model, XGBoost, LightGBM, TabNet, NODE, and SAINT
baseline on a regression version of the dataset. They achieved a mean squared error of 0.53, 0.21, 0.20, 0.35, 0.28, and
0.23. Our experiments for a set of 512 balanced training examples (512 shots) show a better performance for XGBoost
than LightGBM and the same performance for TabNet and NODE. Also, our linear model performs much better which is
probably due to more extensive hyperparameter tuning.

Car Dataset. The XGBoost, TabNet, and NODE models in Kadra et al. (2021) showed a balanced accuracy of 92.4,
98.7, and 46.1. In our experiments, XGBoost and TabNet performed very similarly for many training examples and NODE
was only slightly inferior.

Credit-g Dataset. The XGBoost, TabNet, and NODE baselines trained in Kadra et al. (2021) achieved a balanced accu-
racy of 68.9, 61.2, and 73.1. Our AUC results cannot easily be compared but our experiments for 512 balanced training
examples (512 shots) follow the same trend.

Diabetes Dataset. Hasan et al. (2020) reported an AUC of 0.828 (0.030) for XGBoost on the diabetes dataset, which
matches our findings. With additional feature selection and preprocessing methods they reached an AUC of 0.946 (0.020)
with XGBoost, but this was out of the scope of our work. XGBoost was the most performant model that they included in
their experiments.

Heart Dataset. Muhammad et al. (2020) used only the 303 instances from the Cleveland cohort, while we combined all
four sub-cohorts. They achieved an AUC of 0.923 with LR, which is close to our results on all sub-cohorts. They also
tested several models that outperformed LR.

Income Dataset. Many studies used the Income or Adult dataset. The review Borisov et al. (2022a) included several of
our baselines. They reported an AUC of 0.854 (0.002) for a linear model, 0.928 (0.001) for XGBoost, 0.928 (0.001) for
LightGBM, 0.916 (0.002) for SAINT, 0.911 (0.001) for TabNet, and 0.911 (0.002) for NODE. These are in accordance
with our results. We reckon the better performance of our LR model is due to more extensive parameter tuning.

Jungle Dataset. The XGBoost and TabNet baselines trained in Kadra et al. (2021) achieved a balanced accuracy of 87.3
and 73.4. They did not train a NODE model for this dataset. The results follow the same trend as our experiments for a set
of 512 balanced training examples (512 shots).
4 https://github.com/kathrinse/TabSurvey
5 https://github.com/optuna/optuna

Table 11: The mean performance for one prompt (ours, SD over five seeds omitted) and the mean performance and SD (in
parentheses) across five different prompts (each again over five seeds).
Dataset | Bank | Blood | California | Car | Credit-g | Diabetes | Heart | Income | Jungle
TabLLM 0-shot: 1 prompt (ours) | 0.63 | 0.61 | 0.61 | 0.81 | 0.53 | 0.68 | 0.54 | 0.84 | 0.60
TabLLM 0-shot: avg. 5 prompts | 0.64 (0.01) | 0.60 (0.02) | 0.59 (0.01) | 0.80 (0.01) | 0.52 (0.01) | 0.67 (0.01) | 0.55 (0.04) | 0.84 (0.01) | 0.60 (0.00)

5 ADJUSTING INCOME DATASET FOR INFLATION

We wanted to investigate how a distribution shift caused by inflation affects the zero-shot performance of TabLLM. The
Income dataset was collected in 1994, and the label and two features (capital gain/loss in last year) contain dollar values.
T0 was trained in 2021 (Sanh et al., 2022), and we assumed that its training data is much more recent than the Income
dataset. The cumulative inflation factor from 1994 to 2021 is 1.79 (footnote 6). Without inflation correction, the
zero-shot results were 0.80 (0.01). Correcting the two features, correcting only the prompt, and correcting both all
yielded the same performance as the uncorrected one. The accuracy values also remained the same with the inflation
correction.
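
The two corrections described above (scaling the dollar-valued features and scaling the threshold in the prompt) can be sketched as follows. The column names, the rounding to whole dollars, and the example row are assumptions made for illustration; only the factor 1.79 and the 50,000 dollar threshold come from the text.

```python
INFLATION_FACTOR = 1.79  # 1994 to 2021, as stated in the text

def adjust_row(row, factor=INFLATION_FACTOR):
    """Scale the two dollar-valued features to 2021 dollars."""
    adjusted = dict(row)
    for column in ("capital_gain", "capital_loss"):  # hypothetical column names
        adjusted[column] = round(row[column] * factor)
    return adjusted

def adjust_prompt_threshold(threshold=50_000, factor=INFLATION_FACTOR):
    """Scale the income threshold mentioned in the prompt."""
    return round(threshold * factor)

example_row = {"age": 39, "capital_gain": 2174, "capital_loss": 0}
print(adjust_row(example_row))
print(adjust_prompt_threshold())
```

Applying only `adjust_row`, only `adjust_prompt_threshold`, or both corresponds to the three correction variants that were compared.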

6 FEATURE IMPORTANCE ANALYSIS OF TABLLM

We wanted to understand which features were most important for the zero-shot performance of TabLLM on Income and
EoL. To this end, we used zero-shot TabLLM with the List Template serialization to predict the label probability of all
examples in the dataset. We then used 4-fold cross validation to fit an L2-regularized LR model to the predicted label using
the features in the serialization as covariates. For EoL, we used age, sex, race, and the conditions as inputs, which summed
up to 14,105 features.
For Income we compared these approximated importance scores to the feature coefficients of a LR model trained on all
data for a single seed (Table 16). We used the same setup for the LR model as for our main experiments. We did 4-fold
cross validation on an 80% training split to choose hyperparameters, and then refit the model using all training data. The
best parameters of the LR model for Income were an ‘l1’ penalty and a regularization constant of 1. For EoL, we decided
that the LR model coefficients did not provide a good estimate of the ground truth due to the vast amount of features and
possible collinearities in the data. Instead, we provide the relative risk (RR) with 95% confidence intervals (CI) treating
the occurrence of a feature as an intervention. We report the 50 most and least important features of TabLLM in Table 17.
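
A relative risk with a 95% confidence interval can be computed from simple counts, treating the presence of a feature as the exposure. The sketch below uses the common log-scale (Katz) interval; the text does not state which interval method was used, so this choice and the example counts are assumptions.

```python
import math

def relative_risk(exposed_pos, exposed_total, unexposed_pos, unexposed_total, z=1.96):
    """Relative risk of a positive label given a feature, with a log-scale CI."""
    risk_exposed = exposed_pos / exposed_total
    risk_unexposed = unexposed_pos / unexposed_total
    rr = risk_exposed / risk_unexposed
    # Standard error of log(RR) for two independent binomial samples.
    se_log = math.sqrt(1 / exposed_pos - 1 / exposed_total
                       + 1 / unexposed_pos - 1 / unexposed_total)
    lower = math.exp(math.log(rr) - z * se_log)
    upper = math.exp(math.log(rr) + z * se_log)
    return rr, lower, upper

# Hypothetical counts: 30 of 1,000 patients with the feature are positive,
# compared with 10 of 1,000 patients without it.
rr, lower, upper = relative_risk(30, 1000, 10, 1000)
print(f"RR = {rr:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
```

An interval that excludes 1 indicates that the feature is associated with a different risk of the outcome in either direction.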

7 EFFECT OF USING DIFFERENT PROMPTS

To evaluate the effect of using a different prompt, we considered the zero-shot setting, since even a few training examples
mostly cancel the effect. For all datasets, we constructed five different prompts that asked the same question in different
words, e.g., “Does this person earn a lot of money?” instead of “Does this person earn more than 50000 dollars per year?”
for the Income dataset. The results are summarized in Table 11. The effects were relatively small, ranging from a standard
deviation of 0.00 for Jungle to 0.04 for Heart across the five prompts. This suggests that TabLLM is not very sensitive to
using different prompts.

6 U.S. Bureau of Labor Statistics, CPI Inflation Calculator: https://www.bls.gov/data/inflation_calculator.htm


Table 12: Test AUC performance of competing methods on public tabular datasets. Each column reports the 𝑘-shot
performance for different values of 𝑘. Standard deviations across five random seeds are shown as subscripts.
Number of Shots
Method 0 4 8 16 32 64 128 256 512 all
Bank Dataset
Logistic regression — 0.55.09 0.66.09 0.75.06 0.81.02 0.84.02 0.86.02 0.88.01 0.89.00 0.91.00
Logistic regression (ordinal) — 0.51.02 0.60.12 0.68.09 0.78.04 0.82.01 0.84.03 0.86.01 0.87.00 0.88.00
LightGBM — 0.50.00 0.50.00 0.50.00 0.50.00 0.77.03 0.84.03 0.88.01 0.89.00 0.94.00
LightGBM (ordinal) — 0.50.00 0.50.00 0.50.00 0.50.00 0.78.03 0.84.02 0.87.01 0.89.00 0.94.00
XGBoost — 0.50.00 0.56.09 0.68.04 0.76.03 0.83.02 0.85.03 0.88.01 0.90.01 0.94.00
XGBoost (ordinal) — 0.50.00 0.56.09 0.69.05 0.75.04 0.82.02 0.84.03 0.87.01 0.89.00 0.93.00
SAINT — 0.51.10 0.61.11 0.70.04 0.77.03 0.81.03 0.85.02 0.88.01 0.88.01 0.93.00
TabNet — 0.51.06 0.58.05 0.64.10 0.62.04 0.71.06 0.73.03 0.80.04 0.83.03 0.93.00
NODE — 0.52.02 0.55.06 0.64.06 0.73.06 0.78.02 0.83.03 0.85.01 0.86.01 0.76.02
TabPFN — 0.59.14 0.66.08 0.69.02 0.76.03 0.82.03 0.86.02 0.89.00 0.90.00 0.91.00
TabPFN (ordinal) — 0.57.10 0.67.05 0.71.05 0.78.04 0.83.01 0.86.02 0.87.00 0.88.00 0.89.00
TabLLM (T0 + Text GPT-3) 0.63.01 0.61.04 0.62.02 0.63.03 0.64.02 0.66.04 0.76.04 0.81.02 0.82.01 *
TabLLM (T0 + Text T0) 0.54.01 0.56.08 0.60.06 0.59.06 0.60.04 0.62.04 0.67.04 0.79.03 0.85.01 *
TabLLM (T0 + Table-To-Text) 0.42.01 0.48.07 0.50.05 0.56.03 0.57.04 0.59.05 0.63.03 0.68.02 0.74.01 *
TabLLM (T0 + Text Template) 0.63.01 0.59.10 0.64.05 0.65.05 0.64.06 0.69.03 0.82.05 0.87.01 0.88.01 0.92 †
TabLLM (T0 + List Template) 0.60.01 0.59.10 0.66.02 0.65.04 0.66.05 0.74.07 0.85.02 0.87.01 0.87.01 *
TabLLM (T0 + List Only Values) 0.56.01 0.58.09 0.60.04 0.63.03 0.67.03 0.71.05 0.79.03 0.84.01 0.86.01 *
TabLLM (T0 + List Perm. Names) 0.64.00 0.55.10 0.62.07 0.63.04 0.63.05 0.68.04 0.82.02 0.86.01 0.88.00 *
TabLLM (T0 + List Perm. Values) 0.38.01 0.47.11 0.53.06 0.55.07 0.57.05 0.65.04 0.75.07 0.84.01 0.85.01 *
TabLLM (T0 3B + Text Template) 0.61.01 0.60.10 0.65.05 0.64.07 0.65.05 0.70.02 0.77.05 0.88.01 0.89.01 *
Blood Dataset
Logistic regression — 0.54.09 0.59.08 0.72.03 0.70.06 0.74.02 0.76.02 0.76.02 0.76.03 0.76.03
Logistic regression (ordinal) — 0.54.09 0.59.08 0.72.03 0.70.06 0.74.02 0.76.02 0.76.02 0.76.03 0.76.03
LightGBM — 0.50.00 0.50.00 0.50.00 0.50.00 0.69.04 0.71.05 0.71.07 0.67.05 0.74.04
LightGBM (ordinal) — 0.50.00 0.50.00 0.50.00 0.50.00 0.69.04 0.71.05 0.71.07 0.67.05 0.74.04
XGBoost — 0.50.00 0.58.07 0.66.04 0.67.06 0.68.05 0.71.06 0.70.07 0.67.06 0.71.04
XGBoost (ordinal) — 0.50.00 0.58.07 0.66.04 0.67.06 0.68.05 0.71.06 0.70.07 0.67.06 0.71.04
SAINT — 0.47.12 0.66.08 0.66.03 0.67.06 0.67.05 0.71.03 0.76.05 0.73.02 0.74.03
TabNet — 0.47.09 0.61.06 0.60.09 0.66.06 0.63.06 0.66.04 0.72.06 0.72.02 0.71.03
NODE — 0.49.04 0.60.07 0.62.04 0.67.03 0.71.05 0.76.03 0.74.03 0.76.03 0.74.03
TabPFN — 0.52.08 0.64.04 0.67.01 0.70.04 0.73.04 0.75.04 0.76.04 0.76.03 0.74.03
TabPFN (ordinal) — 0.52.08 0.64.04 0.67.01 0.70.04 0.73.04 0.75.04 0.76.04 0.76.03 0.74.03
TabLLM (T0 + Text GPT-3) 0.63.04 0.61.07 0.65.04 0.63.02 0.64.03 0.62.05 0.67.06 0.68.05 0.66.05 *
TabLLM (T0 + Text T0) 0.49.04 0.51.03 0.59.08 0.59.06 0.64.04 0.65.06 0.66.05 0.68.06 0.66.03 *
TabLLM (T0 + Table-To-Text) 0.61.04 0.59.04 0.59.03 0.57.03 0.62.07 0.56.07 0.57.07 0.64.07 0.61.05 *
TabLLM (T0 + Text Template) 0.61.04 0.58.09 0.66.03 0.66.07 0.68.04 0.68.04 0.68.06 0.70.08 0.68.04 0.70.04
TabLLM (T0 + List Template) 0.56.05 0.54.08 0.64.02 0.64.08 0.67.05 0.66.06 0.67.05 0.70.06 0.67.06 *
TabLLM (T0 + List Only Values) 0.45.05 0.49.07 0.57.03 0.57.06 0.62.06 0.61.04 0.64.04 0.68.07 0.67.05 *
TabLLM (T0 + List Perm. Names) 0.52.04 0.49.07 0.62.03 0.62.06 0.65.05 0.65.04 0.68.06 0.72.06 0.68.04 *
TabLLM (T0 + List Perm. Values) 0.51.03 0.51.06 0.54.04 0.52.07 0.55.03 0.59.06 0.59.02 0.62.06 0.62.05 *
TabLLM (T0 3B + Text Template) 0.42.05 0.47.04 0.62.04 0.62.09 0.65.07 0.67.04 0.69.04 0.71.06 0.67.04 *
California Dataset
Logistic regression — 0.58.11 0.69.13 0.80.06 0.84.03 0.88.01 0.90.00 0.91.00 0.91.00 0.92.00
Logistic regression (ordinal) — 0.58.11 0.69.13 0.80.06 0.84.03 0.88.01 0.90.00 0.91.00 0.91.00 0.92.00
LightGBM — 0.50.00 0.50.00 0.50.00 0.50.00 0.81.02 0.87.01 0.90.01 0.92.00 0.97.00
LightGBM (ordinal) — 0.50.00 0.50.00 0.50.00 0.50.00 0.81.02 0.87.01 0.90.01 0.92.00 0.97.00
XGBoost — 0.50.00 0.62.10 0.74.03 0.79.04 0.82.04 0.87.01 0.90.01 0.92.01 0.97.00
XGBoost (ordinal) — 0.50.00 0.62.10 0.74.03 0.79.04 0.82.04 0.87.01 0.90.01 0.92.01 0.97.00
SAINT — 0.59.09 0.64.12 0.73.06 0.76.06 0.81.02 0.84.01 0.88.02 0.91.02 0.95.00
TabNet — 0.50.08 0.57.06 0.67.02 0.69.05 0.72.03 0.79.02 0.84.02 0.87.01 0.96.00
NODE — 0.58.06 0.57.07 0.70.05 0.77.03 0.80.01 0.86.02 0.86.02 0.87.01 0.87.01
TabPFN — 0.63.13 0.63.11 0.80.03 0.85.03 0.89.01 0.91.01 0.92.00 0.93.00 0.94.00
TabPFN (ordinal) — 0.63.13 0.63.11 0.80.03 0.85.03 0.89.01 0.91.01 0.92.00 0.93.00 0.94.00
TabLLM (T0 + Text GPT-3) 0.56.00 0.55.03 0.57.05 0.61.06 0.73.05 0.73.04 0.82.01 0.84.01 0.85.01 *
TabLLM (T0 + Text T0) 0.49.01 0.52.02 0.51.02 0.52.02 0.54.04 0.56.04 0.69.02 0.73.03 0.80.02 *
TabLLM (T0 + Table-To-Text) 0.49.01 0.50.01 0.51.01 0.52.02 0.57.04 0.58.04 0.74.03 0.79.02 0.82.01 *
TabLLM (T0 + Text Template) 0.61.01 0.63.05 0.60.07 0.70.08 0.77.08 0.77.04 0.81.02 0.83.01 0.86.02 0.95.00
TabLLM (T0 + List Template) 0.61.01 0.64.05 0.62.06 0.68.07 0.77.07 0.79.02 0.82.02 0.84.01 0.87.01 *
TabLLM (T0 + List Only Values) 0.58.01 0.57.08 0.55.03 0.65.09 0.74.08 0.77.03 0.83.01 0.84.02 0.86.02 *
TabLLM (T0 + List Perm. Names) 0.54.01 0.52.03 0.52.04 0.52.03 0.66.06 0.74.01 0.81.02 0.84.02 0.86.02 *
TabLLM (T0 + List Perm. Values) 0.47.01 0.48.02 0.50.01 0.52.02 0.57.03 0.64.04 0.71.04 0.76.01 0.78.02 *
TabLLM (T0 3B + Text Template) 0.57.01 0.59.03 0.57.04 0.66.07 0.77.06 0.79.02 0.81.01 0.83.01 0.85.01 *
* Result omitted due to runtime limitations of TabLLM on the full dataset.
† Only a single run performed due to runtime limitations of TabLLM on the full dataset.

Table 13: Test AUC performance of competing methods on public tabular datasets. Each column reports the 𝑘-shot
performance for different values of 𝑘. Standard deviations across five random seeds are shown as subscripts.
Number of Shots
Method 0 4 8 16 32 64 128 256 512 all
Car Dataset
Logistic regression — 0.61.02 0.65.10 0.74.07 0.83.02 0.93.02 0.96.01 0.97.01 0.98.00 0.98.00
Logistic regression (ordinal) — 0.62.06 0.63.05 0.64.07 0.75.04 0.73.03 0.73.03 0.74.03 0.76.02 0.78.03
LightGBM — 0.50.00 0.50.00 0.50.00 0.50.00 0.85.06 0.93.01 0.98.01 0.99.01 1.00.00
LightGBM (ordinal) — 0.50.00 0.50.00 0.50.00 0.50.00 0.75.04 0.91.05 0.98.01 0.99.00 1.00.00
XGBoost — 0.50.00 0.59.04 0.70.08 0.82.03 0.91.02 0.95.01 0.98.01 0.99.01 1.00.00
XGBoost (ordinal) — 0.50.00 0.55.03 0.70.04 0.78.03 0.90.03 0.94.01 0.98.01 0.99.01 1.00.00
SAINT — 0.56.08 0.64.08 0.76.03 0.85.03 0.92.02 0.96.01 0.98.01 0.99.00 1.00.00
TabNet — † 0.54.05 0.64.05 0.66.05 0.73.07 0.81.04 0.93.02 0.98.01 1.00.00
NODE — 0.51.10 0.57.06 0.69.02 0.74.03 0.80.02 0.82.01 0.91.01 0.96.01 0.93.01
TabPFN — 0.64.06 0.75.05 0.87.04 0.92.02 0.97.00 0.99.01 1.00.00 1.00.00 1.00.00
TabPFN (ordinal) — 0.59.06 0.65.08 0.75.04 0.82.06 0.89.01 0.93.01 0.98.01 0.99.01 1.00.00
TabLLM (T0 + Text GPT-3) 0.72.02 0.75.03 0.75.02 0.78.01 0.83.01 0.87.02 0.90.01 0.93.02 0.93.02 0.96.01
TabLLM (T0 + Text T0) 0.85.01 0.85.02 0.84.03 0.86.02 0.89.02 0.92.02 0.94.01 0.98.01 0.99.00 1.00.00
TabLLM (T0 + Table-To-Text) 0.61.01 0.69.04 0.74.04 0.79.02 0.88.01 0.91.02 0.94.01 0.96.01 0.95.01 0.96.00
TabLLM (T0 + Text Template) 0.82.02 0.83.03 0.85.03 0.86.03 0.91.02 0.96.02 0.98.01 0.99.00 1.00.00 1.00.00
TabLLM (T0 + List Template) 0.79.02 0.84.03 0.85.02 0.86.03 0.91.02 0.95.01 0.98.01 0.99.00 1.00.00 1.00.00
TabLLM (T0 + List Only Values) 0.48.03 0.62.04 0.67.03 0.70.03 0.75.02 0.87.02 0.94.01 0.98.01 0.99.01 1.00.00
TabLLM (T0 + List Perm. Names) 0.39.02 0.54.10 0.58.06 0.70.03 0.86.02 0.94.01 0.97.02 0.99.01 0.99.00 1.00.00
TabLLM (T0 + List Perm. Values) 0.38.02 0.48.08 0.55.05 0.63.04 0.69.03 0.78.02 0.90.03 0.98.01 1.00.00 1.00.00
TabLLM (T0 3B + Text Template) 0.78.02 0.80.03 0.84.03 0.84.04 0.89.03 0.91.01 0.96.01 0.98.01 0.99.00 1.00.00
Credit-g Dataset
Logistic regression — 0.50.08 0.56.06 0.58.08 0.68.08 0.66.07 0.71.06 0.75.04 0.76.02 0.79.03
Logistic regression (ordinal) — 0.56.05 0.54.06 0.55.05 0.61.05 0.68.05 0.66.03 0.68.04 0.71.02 0.72.02
LightGBM — 0.50.00 0.50.00 0.50.00 0.50.00 0.61.09 0.68.03 0.72.02 0.75.02 0.78.02
LightGBM (ordinal) — 0.50.00 0.50.00 0.50.00 0.50.00 0.68.07 0.66.04 0.72.02 0.75.03 0.76.04
XGBoost — 0.50.00 0.51.07 0.59.05 0.66.03 0.67.06 0.68.02 0.73.02 0.75.03 0.78.04
XGBoost (ordinal) — 0.50.00 0.54.11 0.57.08 0.64.05 0.66.06 0.68.04 0.74.02 0.76.03 0.76.04
SAINT — 0.56.08 0.53.05 0.60.05 0.66.06 0.66.06 0.68.05 0.72.04 0.73.03 0.77.04
TabNet — 0.48.05 0.52.07 0.49.03 0.52.03 0.56.05 0.60.05 0.61.02 0.66.04 0.64.03
NODE — 0.54.09 0.54.10 0.54.09 0.59.07 0.63.04 0.68.02 0.68.05 0.70.02 0.65.03
TabPFN — 0.58.08 0.59.03 0.64.06 0.69.07 0.70.07 0.72.06 0.75.04 0.75.02 0.75.03
TabPFN (ordinal) — 0.55.08 0.51.07 0.57.06 0.62.03 0.66.05 0.70.02 0.73.01 0.73.03 0.75.04
TabLLM (T0 + Text GPT-3) 0.52.04 0.53.04 0.56.03 0.56.05 0.55.05 0.57.08 0.60.06 0.61.04 0.63.05 *
TabLLM (T0 + Text T0) 0.49.02 0.50.06 0.54.06 0.55.04 0.60.06 0.61.02 0.61.02 0.63.03 0.65.02 *
TabLLM (T0 + Table-To-Text) 0.50.06 0.65.04 0.60.05 0.60.07 0.65.05 0.67.05 0.65.05 0.68.04 0.64.05 *
TabLLM (T0 + Text Template) 0.53.05 0.69.04 0.66.04 0.66.05 0.72.06 0.70.07 0.71.07 0.72.03 0.72.02 0.70.02
TabLLM (T0 + List Template) 0.53.05 0.64.04 0.60.06 0.64.05 0.70.05 0.66.08 0.67.03 0.70.03 0.70.04 *
TabLLM (T0 + List Only Values) 0.66.06 0.71.03 0.67.06 0.69.06 0.72.06 0.69.05 0.69.07 0.70.06 0.68.04 *
TabLLM (T0 + List Perm. Names) 0.44.01 0.58.09 0.59.08 0.60.07 0.70.06 0.69.06 0.67.05 0.70.05 0.70.03 *
TabLLM (T0 + List Perm. Values) 0.50.05 0.55.06 0.56.07 0.58.04 0.64.03 0.66.08 0.67.09 0.68.03 0.69.03 *
TabLLM (T0 3B + Text Template) 0.54.03 0.65.05 0.63.05 0.63.03 0.73.04 0.69.05 0.68.06 0.73.05 0.73.03 *
Diabetes Dataset
Logistic regression — 0.60.15 0.68.11 0.73.05 0.76.05 0.80.02 0.81.02 0.83.02 0.83.02 0.83.02
Logistic regression (ordinal) — 0.60.15 0.68.11 0.73.05 0.76.05 0.80.02 0.81.02 0.83.02 0.83.02 0.83.02
LightGBM — 0.50.00 0.50.00 0.50.00 0.50.00 0.79.02 0.79.04 0.79.02 0.79.03 0.83.03
LightGBM (ordinal) — 0.50.00 0.50.00 0.50.00 0.50.00 0.79.02 0.79.04 0.79.02 0.79.03 0.83.03
XGBoost — 0.50.00 0.59.16 0.72.07 0.69.08 0.73.05 0.78.05 0.80.03 0.80.01 0.84.03
XGBoost (ordinal) — 0.50.00 0.59.16 0.72.07 0.69.08 0.73.05 0.78.05 0.80.03 0.80.01 0.84.03
SAINT — 0.46.12 0.65.11 0.73.06 0.73.06 0.79.03 0.81.03 0.81.04 0.77.03 0.83.03
TabNet — 0.56.04 0.56.06 0.64.09 0.66.06 0.71.04 0.73.04 0.74.05 0.74.07 0.81.03
NODE — 0.49.13 0.67.09 0.69.08 0.73.05 0.77.04 0.80.04 0.81.03 0.83.02 0.83.03
TabPFN — 0.61.13 0.67.11 0.71.07 0.77.03 0.82.03 0.83.03 0.83.03 0.81.02 0.81.03
TabPFN (ordinal) — 0.61.13 0.67.11 0.71.07 0.77.03 0.82.03 0.83.03 0.83.03 0.81.02 0.81.03
TabLLM (T0 + Text GPT-3) 0.61.06 0.61.07 0.56.12 0.67.08 0.74.04 0.77.02 0.79.03 0.76.03 0.78.04 0.81.04
TabLLM (T0 + Text T0) 0.58.04 0.53.05 0.53.06 0.54.09 0.59.05 0.68.02 0.73.04 0.72.05 0.72.03 0.76.01
TabLLM (T0 + Table-To-Text) 0.58.04 0.51.10 0.53.07 0.56.05 0.57.04 0.59.04 0.72.05 0.74.04 0.75.06 0.77.04
TabLLM (T0 + Text Template) 0.68.06 0.61.09 0.63.08 0.69.07 0.68.04 0.73.03 0.79.04 0.78.02 0.78.04 0.80.04
TabLLM (T0 + List Template) 0.64.06 0.64.09 0.64.10 0.67.07 0.70.05 0.76.04 0.78.03 0.78.03 0.78.04 0.81.05
TabLLM (T0 + List Only Values) 0.55.05 0.54.07 0.52.05 0.59.08 0.63.04 0.67.07 0.73.03 0.75.06 0.77.04 0.79.03
TabLLM (T0 + List Perm. Names) 0.56.07 0.60.09 0.68.12 0.74.05 0.74.03 0.72.04 0.76.04 0.77.04 0.77.04 0.81.04
TabLLM (T0 + List Perm. Values) 0.44.03 0.47.09 0.43.06 0.55.07 0.61.05 0.65.05 0.73.03 0.76.03 0.78.02 0.80.03
TabLLM (T0 3B + Text Template) 0.62.05 0.57.07 0.60.08 0.67.05 0.67.06 0.76.03 0.77.04 0.81.05 0.80.04 0.82.04
* Result omitted due to runtime limitations of TabLLM on the full dataset.
† Result omitted due to TabNet package not supporting unseen labels in validation set during cross validation.

Table 14: Test AUC performance of competing methods on public tabular datasets. Each column reports the 𝑘-shot
performance for different values of 𝑘. Standard deviations across five random seeds directly follow each mean (e.g., 0.69.17 denotes 0.69 ± 0.17).
Number of Shots
Method 0 4 8 16 32 64 128 256 512 all
Heart Dataset
Logistic regression — 0.69.17 0.75.13 0.82.06 0.87.05 0.91.01 0.90.02 0.92.01 0.93.01 0.93.01
Logistic regression (ordinal) — 0.70.17 0.73.14 0.84.04 0.88.03 0.89.01 0.88.02 0.90.02 0.92.02 0.92.02
LightGBM — 0.50.00 0.50.00 0.50.00 0.50.00 0.91.01 0.91.01 0.91.01 0.93.00 0.94.01
LightGBM (ordinal) — 0.50.00 0.50.00 0.50.00 0.50.00 0.91.01 0.91.02 0.91.01 0.92.01 0.94.01
XGBoost — 0.50.00 0.55.14 0.84.07 0.88.04 0.91.01 0.91.01 0.90.01 0.92.01 0.94.01
XGBoost (ordinal) — 0.50.00 0.56.15 0.84.07 0.90.03 0.91.01 0.90.01 0.90.01 0.92.01 0.94.01
SAINT — 0.80.12 0.83.10 0.88.07 0.90.01 0.90.04 0.90.02 0.90.01 0.92.01 0.93.01
TabNet — 0.56.12 0.70.05 0.73.14 0.80.04 0.83.05 0.84.03 0.88.02 0.88.03 0.89.03
NODE — 0.52.10 0.78.08 0.83.03 0.86.02 0.88.02 0.88.01 0.91.02 0.92.03 0.92.03
TabPFN — 0.84.06 0.88.05 0.87.06 0.91.02 0.92.02 0.92.02 0.92.01 0.92.02 0.92.02
TabPFN (ordinal) — 0.79.08 0.85.07 0.88.05 0.90.02 0.92.01 0.92.01 0.92.00 0.92.02 0.92.02
TabLLM (T0 + Text GPT-3) 0.51.04 0.72.05 0.82.03 0.85.05 0.88.03 0.91.02 0.89.02 0.91.01 0.91.01 0.93.01
TabLLM (T0 + Text T0) 0.44.03 0.74.07 0.82.10 0.87.02 0.88.02 0.89.04 0.90.01 0.89.02 0.89.03 0.93.02
TabLLM (T0 + Table-To-Text) 0.56.05 0.73.09 0.78.08 0.86.06 0.88.03 0.91.02 0.91.02 0.90.02 0.91.01 0.92.01
TabLLM (T0 + Text Template) 0.54.04 0.76.14 0.83.05 0.87.04 0.87.06 0.91.01 0.90.01 0.92.01 0.92.01 0.94.01
TabLLM (T0 + List Template) 0.52.03 0.73.12 0.83.05 0.87.04 0.88.04 0.91.02 0.91.01 0.92.01 0.92.01 0.94.01
TabLLM (T0 + List Only Values) 0.40.04 0.67.16 0.83.06 0.84.05 0.88.03 0.89.03 0.92.02 0.90.00 0.90.01 0.92.01
TabLLM (T0 + List Perm. Names) 0.57.02 0.78.07 0.85.02 0.82.06 0.87.05 0.90.02 0.92.02 0.91.01 0.91.01 0.93.02
TabLLM (T0 + List Perm. Values) 0.23.02 0.63.20 0.79.12 0.83.07 0.88.04 0.89.04 0.90.02 0.91.01 0.91.01 0.93.00
TabLLM (T0 3B + Text Template) 0.56.03 0.68.13 0.82.04 0.85.02 0.86.03 0.90.01 0.91.01 0.93.01 0.93.01 0.94.01
Income Dataset
Logistic regression — 0.68.15 0.72.13 0.80.03 0.82.01 0.83.03 0.85.01 0.87.01 0.88.00 0.90.00
Logistic regression (ordinal) — 0.55.04 0.56.06 0.58.07 0.70.06 0.76.03 0.79.01 0.80.01 0.80.00 0.81.00
LightGBM — 0.50.00 0.50.00 0.50.00 0.50.00 0.78.03 0.81.03 0.87.01 0.88.00 0.93.00
LightGBM (ordinal) — 0.50.00 0.50.00 0.50.00 0.50.00 0.78.01 0.81.01 0.86.01 0.89.00 0.93.00
XGBoost — 0.50.00 0.59.06 0.77.02 0.79.03 0.82.02 0.84.01 0.87.01 0.88.00 0.93.00
XGBoost (ordinal) — 0.50.00 0.63.04 0.74.04 0.76.04 0.79.03 0.84.02 0.86.01 0.88.00 0.93.00
SAINT — 0.74.03 0.65.15 0.79.03 0.81.03 0.84.02 0.84.02 0.87.01 0.88.00 0.91.00
TabNet — 0.56.04 0.59.07 0.62.11 0.64.06 0.71.04 0.73.05 0.80.02 0.83.02 0.92.00
NODE — 0.54.02 0.54.04 0.65.04 0.67.03 0.75.02 0.78.01 0.78.01 0.83.01 0.82.00
TabPFN — 0.73.08 0.71.09 0.76.09 0.80.04 0.82.04 0.84.01 0.86.01 0.87.01 0.89.00
TabPFN (ordinal) — 0.64.11 0.64.06 0.72.04 0.77.02 0.80.02 0.81.01 0.83.01 0.85.01 0.87.00
TabLLM (T0 + Text GPT-3) 0.75.01 0.79.03 0.80.03 0.82.02 0.82.01 0.84.02 0.84.02 0.85.01 0.86.00 *
TabLLM (T0 + Text T0) 0.65.01 0.67.03 0.66.07 0.72.02 0.75.03 0.79.04 0.82.02 0.83.02 0.86.01 *
TabLLM (T0 + Table-To-Text) 0.50.00 0.64.07 0.64.11 0.72.05 0.74.03 0.79.03 0.81.01 0.84.01 0.84.01 *
TabLLM (T0 + Text Template) 0.84.00 0.84.01 0.84.02 0.84.04 0.84.01 0.84.02 0.86.01 0.87.00 0.89.01 0.92.00
TabLLM (T0 + List Template) 0.79.01 0.83.01 0.83.03 0.83.02 0.84.01 0.85.01 0.86.01 0.87.01 0.88.01 *
TabLLM (T0 + List Only Values) 0.73.01 0.74.04 0.75.04 0.80.03 0.82.01 0.84.01 0.84.01 0.86.01 0.87.01 *
TabLLM (T0 + List Perm. Names) 0.65.00 0.75.03 0.74.05 0.82.02 0.83.02 0.84.02 0.86.01 0.86.01 0.88.01 *
TabLLM (T0 + List Perm. Values) 0.26.00 0.40.04 0.48.10 0.65.06 0.72.03 0.79.03 0.81.02 0.83.01 0.84.01 *
TabLLM (T0 3B + Text Template) 0.76.00 0.77.06 0.80.04 0.83.02 0.83.03 0.85.01 0.86.00 0.86.01 0.88.01 *
Jungle Dataset
Logistic regression — 0.62.09 0.69.09 0.68.04 0.76.03 0.79.01 0.79.00 0.80.01 0.80.00 0.81.00
Logistic regression (ordinal) — 0.62.09 0.69.09 0.68.04 0.76.03 0.79.01 0.79.00 0.80.01 0.80.00 0.81.00
LightGBM — 0.50.00 0.50.00 0.50.00 0.50.00 0.79.02 0.84.02 0.88.01 0.91.00 0.98.00
LightGBM (ordinal) — 0.50.00 0.50.00 0.50.00 0.50.00 0.79.02 0.84.02 0.88.01 0.91.00 0.98.00
XGBoost — 0.50.00 0.58.07 0.72.05 0.78.03 0.81.02 0.84.02 0.87.01 0.91.01 0.98.00
XGBoost (ordinal) — 0.50.00 0.58.07 0.72.05 0.78.03 0.81.02 0.84.02 0.87.01 0.91.01 0.98.00
SAINT — 0.64.05 0.69.06 0.72.05 0.79.02 0.81.01 0.83.01 0.88.01 0.90.00 1.00.00
TabNet — 0.53.09 0.60.05 0.62.03 0.69.04 0.73.04 0.75.02 0.79.02 0.84.01 0.99.00
NODE — 0.60.01 0.71.03 0.68.04 0.74.02 0.75.04 0.78.01 0.79.01 0.80.00 0.81.00
TabPFN — 0.65.08 0.72.04 0.71.07 0.78.02 0.81.01 0.84.01 0.88.01 0.91.00 0.93.00
TabPFN (ordinal) — 0.65.08 0.72.04 0.71.07 0.78.02 0.81.01 0.84.01 0.88.01 0.91.00 0.93.00
TabLLM (T0 + Text GPT-3) 0.56.01 0.58.02 0.55.02 0.60.06 0.68.03 0.74.03 0.77.01 0.81.01 0.85.01 *
TabLLM (T0 + Text T0) 0.63.00 0.63.04 0.64.05 0.62.06 0.70.01 0.71.03 0.74.02 0.78.02 0.82.01 *
TabLLM (T0 + Table-To-Text) 0.51.01 0.60.02 0.60.04 0.63.05 0.69.03 0.75.01 0.78.03 0.82.01 0.85.01 *
TabLLM (T0 + Text Template) 0.60.00 0.64.01 0.64.02 0.65.03 0.71.02 0.78.02 0.81.02 0.84.01 0.89.01 1.00 †
TabLLM (T0 + List Template) 0.63.00 0.65.01 0.66.03 0.66.04 0.71.03 0.78.02 0.81.03 0.84.01 0.88.01 *
TabLLM (T0 + List Only Values) 0.58.00 0.60.03 0.62.03 0.63.02 0.65.04 0.73.01 0.76.02 0.82.02 0.88.01 *
TabLLM (T0 + List Perm. Names) 0.40.00 0.53.06 0.55.05 0.63.10 0.72.03 0.79.02 0.80.03 0.84.02 0.89.01 *
TabLLM (T0 + List Perm. Values) 0.48.00 0.50.02 0.52.03 0.53.03 0.55.01 0.59.02 0.63.01 0.72.02 0.75.01 *
TabLLM (T0 3B + Text Template) 0.54.00 0.63.02 0.64.04 0.67.03 0.72.03 0.77.02 0.80.02 0.83.01 0.87.01 *
* Result omitted due to runtime limitations of TabLLM on the full dataset.
† These experiments were only performed for a single run due to runtime limitations of TabLLM on the full dataset.

Table 15: Full results on healthcare claims dataset. The best concept selection method (most frequent concepts) and
concept names (original concept names) were used as determined in prior zero-shot experiments. A fixed number of 10
epochs was used for up to 256 shots and 3 epochs for more shots to decrease the runtime and prevent overfitting.
Number of Shots
Method 0 16 64 256 1,024 4,096 16,384 all
End of Life (EoL)
TabLLM (T0 + List Template) 0.70 0.74 0.78 0.78 0.79 0.81 0.81 —
TabLLM (T0 + Text Template) 0.63 0.71 0.74 0.76 0.78 0.79 0.80 —
TabLLM (T0 + List Short) 0.68 0.71 0.76 0.79 0.80 0.81 0.82 —
TabLLM (T0 + List Perm. Names) 0.62 0.66 0.70 0.74 0.75 0.77 0.79 —
Logistic Regression — 0.65.07 0.77.02 0.80.02 0.83.01 0.83.01 0.84.01 0.84.01
LightGBM — 0.50.00 0.71.01 0.76.02 0.80.01 0.82.01 0.83.01 0.82 *
TabLLM (T0 + List Template) unbalanced 0.70 0.64 0.69 0.74 0.74 0.77 0.79 —
Logistic Regression unbalanced — 0.44.04 0.53.12 0.75.03 0.77.03 0.80.02 0.82.02 0.84.01
Surgical Procedure (Surgery)
TabLLM (T0 + List Template) 0.67 0.73 0.72 0.73 0.75 0.78 0.79 —
TabLLM (T0 + Text Template) 0.62 0.71 0.69 0.72 0.74 0.77 0.78 —
TabLLM (T0 + List Short) 0.66 0.70 0.69 0.72 0.73 0.76 0.78 —
TabLLM (T0 + List Perm. Names) 0.60 0.68 0.70 0.72 0.74 0.77 —
Logistic Regression — 0.72.04 0.75.05 0.77.01 0.79.01 0.80.01 0.80.00 0.81.00
LightGBM — 0.50.00 0.73.02 0.77.01 0.79.01 0.80.00 0.81.01 0.82 *
TabLLM (T0 + List Template) unbalanced 0.67 0.68 0.73 0.74 0.75 0.77 0.79 —
Logistic Regression unbalanced — 0.61.15 0.77.01 0.77.02 0.78.01 0.80.01 0.80.00 0.81.00
Likelihood of Hospitalization (LoH)
TabLLM (T0 + List Template) 0.71 0.73 0.73 0.76 0.78 0.81 0.82 —
TabLLM (T0 + Text Template) 0.65 0.74 0.72 0.74 0.78 0.80 0.81 —
TabLLM (T0 + List Short) 0.70 0.73 0.75 0.78 0.79 0.80 0.82 —
TabLLM (T0 + List Perm. Names) 0.62 0.71 0.72 0.75 0.75 0.78 0.80 —
Logistic Regression — 0.72.04 0.76.03 0.80.01 0.82.01 0.83.01 0.83.01 0.84.01
LightGBM — 0.50.00 0.72.02 0.76.03 0.81.01 0.83.00 0.83.01 0.85 *
TabLLM (T0 + List Template) unbalanced 0.71 0.66 0.72 0.75 0.75 0.78 0.80 —
Logistic Regression unbalanced — 0.53.06 0.54.09 0.73.06 0.79.01 0.81.01 0.82.01 0.84.01
* These experiments were only performed for a single run due to runtime limitations on the full dataset.

Table 16: Feature importance of zero-shot TabLLM and LR on all data for the Income dataset. To determine the feature
importance of TabLLM, we fit a separate LR model to the predictions using the original feature values as covariates. For
LR we simply use the feature coefficients. The features are ranked by their TabLLM importance score.
Feature TabLLM LR Feature TabLLM LR
rank weight rank weight rank weight rank weight
capital gain 1 5.310 2 2.393 relationship Other-relative 54 -0.010 88 -0.759
education Masters 2 4.623 6 1.455 native country Trinadad&Tob. 55 -0.028 66 -0.097
education Doctorate 3 3.410 4 2.066 race Black 56 -0.044 74 -0.291
education Bachelors 4 2.995 7 1.135 native country England 57 -0.088 16 0.551
education Prof-school 5 2.949 5 1.900 native country Honduras 58 -0.105 58 0.000
occupation Machine-op-insp. 6 2.589 75 -0.325 relationship Not-in-family 59 -0.153 29 0.257
workclass Private 7 2.275 37 0.102 native country Holand-Neth. 60 -0.154 57 0.000
relationship Wife 8 2.109 8 0.955 occupation Craft-repair 61 -0.161 36 0.108
native country China 9 2.086 94 -0.839 capital loss 62 -0.182 31 0.255
native country United-States 10 2.045 38 0.087 race Other 63 -0.202 65 -0.085
native country Taiwan 11 1.965 54 0.000 native country Yugoslavia 64 -0.204 27 0.357
workclass Federal-gov 12 1.784 14 0.574 workclass Local-gov 65 -0.230 47 0.000
race White 13 1.685 61 0.000 occupation nan 66 -0.248 82 -0.653
education Assoc-acdm 14 1.621 13 0.574 marital status Never-married 67 -0.292 77 -0.443
native country nan 15 1.565 63 -0.056 native country Iran 68 -0.330 41 0.000
marital status Married-civ-sp. 16 1.487 3 2.214 native country Dominican-Rep. 69 -0.332 85 -0.731
occupation Protective-serv 17 1.434 17 0.535 marital status Married-sp.-abs. 70 -0.379 51 0.000
sex Male 18 1.335 42 0.000 native country Jamaica 71 -0.416 25 0.392
occupation Armed-Forces 19 1.290 60 0.000 native country Nicaragua 72 -0.425 45 0.000
occupation Adm-clerical 20 1.245 52 0.000 native country Thailand 73 -0.451 100 -1.116
hours per week 21 1.240 20 0.424 native country Peru 74 -0.522 93 -0.837
native country Hong 22 1.227 86 -0.749 native country Japan 75 -0.617 56 0.000
occupation Tech-support 23 1.164 18 0.526 relationship Unmarried 76 -0.620 48 0.000
relationship Husband 24 1.087 72 -0.212 native country France 77 -0.754 21 0.416
occupation Sales 25 0.857 28 0.298 occupation Other-service 78 -0.754 96 -0.903
native country Vietnam 26 0.803 95 -0.898 workclass Never-worked 79 -0.763 50 0.000
marital status Married-AF-sp. 27 0.792 1 2.571 education 1st-4th 80 -0.763 101 -1.172
native country Philippines 28 0.711 40 0.011 native country Columbia 81 -0.836 104 -1.855
age 29 0.710 22 0.411 education 5th-6th 82 -0.843 97 -0.961
native country Poland 30 0.698 53 0.000 marital status Divorced 83 -0.870 46 0.000
occupation Prof-specialty 31 0.684 12 0.620 education 9th 84 -0.904 102 -1.222
race Asian-Pac-Islander 32 0.651 32 0.254 native country Ecuador 85 -0.952 49 0.000
native country Outlying-US 33 0.591 92 -0.836 education 11th 86 -0.993 91 -0.825
workclass Self-emp-not-inc 34 0.582 76 -0.344 native country Haiti 87 -1.062 35 0.137
native country Italy 35 0.534 24 0.400 education Assoc-voc 88 -1.074 19 0.514
marital status Separated 36 0.523 70 -0.181 native country India 89 -1.074 71 -0.183
workclass nan 37 0.515 59 0.000 education 7th-8th 90 -1.151 103 -1.303
occupation Exec-managerial 38 0.503 10 0.773 marital status Widowed 91 -1.253 64 -0.071
native country Scotland 39 0.491 81 -0.626 education 10th 92 -1.306 89 -0.797
native country Laos 40 0.475 44 0.000 native country Greece 93 -1.319 68 -0.140
native country Cambodia 41 0.328 11 0.642 sex Female 94 -1.327 84 -0.710
native country Guatemala 42 0.276 55 0.000 native country South 95 -1.466 99 -1.101
workclass State-gov 43 0.267 73 -0.223 native country Cuba 96 -1.575 33 0.230
native country Germany 44 0.262 39 0.043 education Some-college 97 -1.950 26 0.363
native country Puerto-Rico 45 0.241 67 -0.128 occupation Handlers-cleaners 98 -1.992 83 -0.681
native country Hungary 46 0.177 34 0.191 native country Portugal 99 -2.049 15 0.572
native country Mexico 47 0.123 80 -0.579 race Amer-Indian-Eskimo 100 -2.081 78 -0.465
native country Ireland 48 0.116 9 0.954 relationship Own-child 101 -2.404 87 -0.755
education HS-grad 49 0.092 43 0.000 occupation Priv-house-serv 102 -2.840 105 -1.909
occupation Transport-moving 50 0.090 62 -0.048 education 12th 103 -3.178 79 -0.480
native country El-Salvador 51 0.027 90 -0.803 education Preschool 104 -3.520 106 -2.385
native country Canada 52 0.027 23 0.407 occupation Farming-fishing 105 -3.853 98 -0.982
workclass Self-emp-inc 53 0.001 30 0.255 workclass Without-pay 106 -4.423 69 -0.174
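
The surrogate-model procedure described in the caption of Table 16 can be sketched as follows. This is only an illustration under stated assumptions, not the authors' code: the paper does not say which LR implementation, regularization, or prediction encoding was used, so the gradient-descent fit, the function names `fit_surrogate_lr` and `rank_features`, the binarized predictions, and the toy data are all hypothetical.

```python
import math

def fit_surrogate_lr(X, y, lr=0.5, epochs=5000):
    """Fit a logistic regression to binary model predictions by batch gradient descent.

    X: list of feature rows (original feature values, here one-hot indicators).
    y: list of 0/1 predictions produced by the model being explained.
    Returns (weights, bias).
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for row, target in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, row))
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - target
            for j in range(d):
                grad_w[j] += err * row[j]
            grad_b += err
        # Average the gradient over the examples before taking a step.
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def rank_features(names, weights):
    """Sort features from most positive to most negative surrogate weight."""
    return sorted(zip(names, weights), key=lambda pair: pair[1], reverse=True)

# Invented toy data: three indicator features and the (hypothetical) binarized
# zero-shot predictions of the model for eight rows.
names = ["education Masters", "workclass Without-pay", "native country Canada"]
X = [
    [1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 1, 1],
    [0, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 1],
]
preds = [1, 0, 1, 0, 0, 0, 0, 0]

weights, bias = fit_surrogate_lr(X, preds)
ranking = rank_features(names, weights)
for rank, (name, weight) in enumerate(ranking, start=1):
    print(rank, name, round(weight, 3))
```

In this toy setup the surrogate assigns a positive weight to the feature that the predictions follow and a negative weight to the feature that suppresses them, which mirrors how the rank and weight columns of the table are read.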

Table 17: Feature importance of zero-shot TabLLM and relative risk (RR) with 95% confidence interval (CI) for EoL task
on the healthcare claims dataset. For TabLLM we fit a separate LR model to the predictions using the original feature
values as covariates. We determine the relative risk by treating the respective feature as an intervention, i.e. the proportion of
positive labels in the group that has a concept divided by the proportion in the group without it. We selected the 50 features with
the highest and the 50 features with the lowest importance.
Feature TabLLM RR (95% CI) Feature TabLLM RR (95% CI)
rank weight rank weight
atrial fibrillation 1 0.633 2.72 (2.51-2.95) open wound of forehead without ... 14056 -0.152 1.80 (1.18-2.74)
atherosclerosis of coronary art... 2 0.530 2.10 (1.94-2.27) prediabetes 14057 -0.157 0.81 (0.68-0.96)
atherosclerosis of aorta 3 0.473 1.99 (1.81-2.19) primary iridocyclitis 14058 -0.157 1.63 (1.03-2.56)
exudative age-related macular d... 4 0.452 2.38 (2.06-2.75) discoloration of skin 14059 -0.157 0.87 (0.73-1.04)
sex male 5 0.442 1.23 (1.14-1.33) basal cell carcinoma of truncal... 14060 -0.158 1.14 (0.94-1.40)
non-hodgkin’s lymphoma (clinical) 6 0.440 1.36 (0.94-1.96) lumbar sprain 14061 -0.158 1.14 (0.91-1.42)
chronic atrial fibrillation 7 0.436 3.36 (3.05-3.70) spasm 14062 -0.160 0.98 (0.82-1.16)
chronic kidney disease stage 3 8 0.430 2.75 (2.53-2.98) chronic rhinitis 14063 -0.161 1.22 (1.06-1.42)
atherosclerosis of arteries of ... 9 0.404 2.76 (2.42-3.15) primary cardiomyopathy 14064 -0.161 2.50 (2.11-2.97)
barrett’s esophagus 10 0.402 1.07 (0.84-1.37) benign neoplastic disease 14065 -0.162 1.04 (0.63-1.72)
chronic obstructive lung disease 11 0.401 2.39 (2.19-2.60) palpitations 14066 -0.166 1.12 (1.01-1.25)
paroxysmal atrial fibrillation 12 0.395 2.58 (2.37-2.81) localized, primary osteoarthrit... 14067 -0.167 1.50 (1.33-1.70)
systemic lupus erythematosus 13 0.395 1.51 (0.99-2.29) benign neoplasm of skin of lowe... 14068 -0.167 0.68 (0.53-0.89)
atherosclerosis of artery of lo... 14 0.394 2.45 (2.20-2.72) cyst of ovary 14069 -0.171 0.90 (0.64-1.26)
coronary atherosclerosis 15 0.381 2.15 (1.95-2.36) microscopic hematuria 14070 -0.171 1.18 (1.01-1.37)
nonexudative age-related macula... 16 0.377 2.15 (1.95-2.37) problem related to lifestyle 14071 -0.172 0.96 (0.48-1.91)
age related macular degeneration 17 0.371 2.18 (1.76-2.71) acquired hypothyroidism 14072 -0.172 1.47 (1.34-1.62)
pseudoexfoliation glaucoma 18 0.360 1.13 (0.72-1.76) abnormal findings on diagnostic... 14073 -0.176 0.63 (0.54-0.73)
degenerative joint disease invo... 19 0.359 1.77 (1.52-2.06) increased frequency of urination 14074 -0.177 1.41 (1.22-1.64)
coronary arteriosclerosis 20 0.357 2.00 (1.82-2.20) disorder of skin 14075 -0.178 1.18 (0.95-1.48)
coronary artery graft present 21 0.346 1.64 (1.41-1.91) thyroiditis 14076 -0.180 0.87 (0.49-1.57)
aortocoronary bypass graft present 22 0.335 2.24 (1.98-2.54) race hispanic or latino 14077 -0.186 0.96 (0.60-1.51)
dehydration 23 0.332 2.94 (2.68-3.22) herpes zoster without complication 14078 -0.187 1.14 (0.96-1.35)
primary malignant neoplasm of f... 24 0.327 1.19 (1.01-1.40) altered sensation of skin 14079 -0.191 1.00 (0.82-1.22)
malignant lymphoma 25 0.322 1.54 (0.96-2.46) generalized hyperhidrosis 14080 -0.194 1.37 (1.07-1.76)
cerebral infarction due to thro... 26 0.316 2.86 (2.46-3.32) primary open angle glaucoma 14081 -0.194 1.35 (1.20-1.52)
congestive heart failure 27 0.313 3.67 (3.38-3.99) stool finding 14082 -0.195 1.48 (1.26-1.73)
old myocardial infarction 28 0.299 2.04 (1.81-2.30) primary gout 14083 -0.196 1.80 (1.51-2.15)
sleep apnea 29 0.294 1.16 (0.98-1.37) localized, primary osteoarthrit... 14084 -0.199 1.10 (0.92-1.30)
acute hypoxemic respiratory fai... 30 0.292 4.02 (3.62-4.46) diarrhea 14085 -0.200 1.73 (1.57-1.90)
obstructive sleep apnea syndrome 31 0.287 1.09 (0.96-1.24) benign neoplasm of skin of uppe... 14086 -0.204 0.78 (0.58-1.03)
primary malignant neoplasm of e... 32 0.284 0.92 (0.56-1.53) prostatitis 14087 -0.204 1.20 (0.89-1.62)
sensorineural hearing loss 33 0.281 1.26 (1.09-1.47) eruption 14088 -0.205 1.25 (1.11-1.41)
retention of urine 34 0.280 2.19 (1.97-2.44) scar conditions and fibrosis of... 14089 -0.206 1.00 (0.86-1.15)
atrial flutter 35 0.280 2.14 (1.85-2.47) hashimoto thyroiditis 14090 -0.215 0.91 (0.49-1.68)
abdominal aortic aneurysm witho... 36 0.275 1.85 (1.58-2.18) acquired deformity of toe 14091 -0.227 1.25 (0.94-1.65)
chronic kidney disease due to h... 37 0.274 2.65 (2.42-2.90) race asian 14092 -0.228 0.70 (0.50-0.99)
non-rheumatic aortic sclerosis 38 0.271 2.64 (2.38-2.93) localized swelling, mass and lu... 14093 -0.242 1.48 (1.15-1.91)
type 2 diabetes mellitus 39 0.267 2.14 (1.96-2.33) benign neoplasm of skin of trunk 14094 -0.245 0.91 (0.79-1.05)
intraductal carcinoma in situ o... 40 0.265 0.62 (0.30-1.29) benign essential hypertension 14095 -0.245 1.86 (1.72-2.01)
chronic kidney disease stage 2 41 0.264 1.77 (1.55-2.03) finding of frequency of urination 14096 -0.255 1.48 (1.34-1.64)
degenerative disorder of macula 42 0.263 2.23 (1.88-2.65) benign essential microscopic he... 14097 -0.258 1.10 (0.76-1.59)
sensorineural hearing loss, bil... 43 0.262 1.30 (1.17-1.43) localized swelling, mass and lu... 14098 -0.262 1.93 (1.67-2.23)
race white 44 0.262 1.25 (1.14-1.37) digestive symptom 14099 -0.267 0.91 (0.68-1.21)
metabolic encephalopathy 45 0.259 4.42 (3.86-5.07) type 1 diabetes mellitus withou... 14100 -0.298 2.34 (2.03-2.70)
alzheimer’s disease 46 0.256 5.03 (4.45-5.69) open angle with borderline intr... 14101 -0.338 1.20 (1.03-1.40)
sick sinus syndrome 47 0.256 2.37 (2.08-2.71) primary localized osteoarthrosi... 14102 -0.366 1.08 (0.82-1.43)
ventricular tachycardia 48 0.255 2.33 (2.00-2.70) localized, primary osteoarthritis 14103 -0.393 1.23 (1.07-1.40)
acute posthemorrhagic anemia 49 0.255 2.15 (1.92-2.41) sex female 14104 -0.441 0.81 (0.75-0.88)
impaired fasting glycemia 50 0.254 0.97 (0.85-1.09) open-angle glaucoma - borderline 14105 -0.495 0.97 (0.85-1.10)
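
The relative risk reported in Table 17 can be illustrated with a short sketch. The ratio itself follows the caption; the confidence interval below uses the common log-normal (Wald) approximation, which is an assumption, since the paper does not state how its intervals were computed. The function name `relative_risk` and the counts in the example are invented.

```python
import math

def relative_risk(pos_with, n_with, pos_without, n_without, z=1.96):
    """Relative risk and an approximate 95% CI (log-normal approximation).

    pos_with / n_with:       positive labels and group size among patients with the concept.
    pos_without / n_without: the same counts among patients without the concept.
    """
    risk_with = pos_with / n_with
    risk_without = pos_without / n_without
    rr = risk_with / risk_without
    # Standard error of log(RR) under the usual large-sample approximation.
    se_log_rr = math.sqrt(
        1.0 / pos_with - 1.0 / n_with + 1.0 / pos_without - 1.0 / n_without
    )
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper

# Invented counts: 30 of 100 patients with the concept have a positive label,
# compared with 15 of 100 patients without it.
rr, lower, upper = relative_risk(30, 100, 15, 100)
print(f"{rr:.2f} ({lower:.2f}-{upper:.2f})")
```

An interval that excludes 1.00 indicates that the concept is associated with a higher or lower label rate under this approximation; it says nothing about causality.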

8 TASK TEMPLATES

Bank Dataset:
answer_choices: 'No ||| Yes'
jinja: '{{serialization}}

Does this client subscribe to a term deposit? Yes or no?
Answer:
|||
{{ answer_choices[label] }}'

Blood Dataset:
answer_choices: 'No ||| Yes'
jinja: '{{serialization}}

Did the person donate blood? Yes or no?
Answer:
|||
{{ answer_choices[label] }}'

California Dataset:
answer_choices: 'No ||| Yes'
jinja: '{{serialization}}

Is this house block valuable? Yes or no?
Answer:
|||
{{ answer_choices[label] }}'

Car Dataset:
answer_choices: 'Unacceptable ||| Acceptable ||| Good ||| Very good'
jinja: '{{serialization}}

How would you rate the decision to buy this car? Unacceptable, acceptable, good or very good?
Answer:
|||
{{ answer_choices[label] }}'

Credit-g Dataset:
answer_choices: 'No ||| Yes'
jinja: '{{serialization}}

Does this person receive a credit? Yes or no?
Answer:
|||
{{ answer_choices[label] }}'

Diabetes Dataset:
answer_choices: 'No ||| Yes'
jinja: '{{serialization}}

Does this patient have diabetes? Yes or no?
Answer:
|||
{{ answer_choices[label] }}'

Heart Dataset:
answer_choices: 'No ||| Yes'
jinja: '{{serialization}}

Does the coronary angiography of this patient show a heart disease? Yes or no?
Answer:
|||
{{ answer_choices[label] }}'

Income Dataset:
answer_choices: 'No ||| Yes'
jinja: '{{serialization}}

Does this person earn more than 50000 dollars per year? Yes or no?
Answer:
|||
{{ answer_choices[label] }}'

Jungle Dataset:
answer_choices: 'No ||| Yes'
jinja: '{{serialization}}

Does the white player win this two pieces endgame of Jungle Chess? Yes or no?
Answer:
|||
{{ answer_choices[label] }}'

End Of Life Task:
answer_choices: 'No ||| Yes'
jinja: '{{serialization}}

Does this patient die in the next nine months? Yes or no?
Answer:
|||
{{ answer_choices[label] }}'

Surgical Procedure Task:
answer_choices: 'No ||| Yes'
jinja: '{{serialization}}

Does this patient need a surgery in the next nine months? Yes or no?
Answer:
|||
{{ answer_choices[label] }}'

Likelihood of Hospitalization Task:
answer_choices: 'No ||| Yes'
jinja: '{{serialization}}

Is this patient admitted to the hospital in the next nine months? Yes or no?
Answer:
|||
{{ answer_choices[label] }}'
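
To make the structure of these templates concrete, the following sketch assembles a prompt and its target answer in plain Python. It only mimics the layout of the templates (serialization, blank line, question, "Answer:", and the answer choice selected by the label) and does not use a Jinja engine; the function name `render_task` and the shortened serialization are assumptions made for illustration. The text before `|||` is treated as the model input and the text after it as the target.

```python
def render_task(serialization, question, answer_choices, label):
    """Build (prompt, target) following the layout of the task templates.

    answer_choices: string such as 'No ||| Yes'; label indexes into the choices.
    """
    choices = [choice.strip() for choice in answer_choices.split("|||")]
    prompt = f"{serialization}\n\n{question}\nAnswer:"
    target = choices[label]
    return prompt, target

# Shortened, illustrative serialization of a Bank dataset row.
prompt, target = render_task(
    serialization="- age: 69\n- type of job: retired",
    question="Does this client subscribe to a term deposit? Yes or no?",
    answer_choices="No ||| Yes",
    label=1,
)
print(prompt)
print(target)
```

The same function covers the four-way Car task by passing 'Unacceptable ||| Acceptable ||| Good ||| Very good' and a label between 0 and 3.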

9 EXAMPLE SERIALIZATIONS

Bank Dataset (List Template):

- age: 69
- type of job: retired
- marital status: single
- education: tertiary
- has credit in default?: no
- average yearly balance, in euros: 2144
- has housing loan?: no
- has personal loan?: no
- contact communication type: cellular
- last contact day of the month: 29
- last contact month of year: jul
- last contact duration, in seconds: 417
- number of contacts performed during this campaign and for this client:
- number of days that passed by after the client was last contacted from a previous campaign: 184
- number of contacts performed before this campaign and for this client: 4
- outcome of the previous marketing campaign: success

Bank Dataset (Text Template):

The age is 69. The type of job is retired. The marital status is single. The education is tertiary. The has credit in default? is no. The average yearly balance, in euros is 2144. The has housing loan? is no. The has personal loan? is no. The contact communication type is cellular. The last contact day of the month is 29. The last contact month of year is jul. The last contact duration, in seconds is 417. The number of contacts performed during this campaign and for this client is. The number of days that passed by after the client was last contacted from a previous campaign is 184. The number of contacts performed before this campaign and for this client is 4. The outcome of the previous marketing campaign is success.

Bank Dataset (Table-To-Text):

the age of 69 was 69 years. the retired retired. the marital status is single with the single name. the school has a school of four students. the has a credit of $500,000. The average yearly balance in euros is 2144. the has a total of 2,000+ housing units. the has an official loan of $500 million. the standard definition has been updated to the standard definition. the current record of the month is 29. the first contact month was on December 20, 2005, and then on March 22, 2006, the next month was on March 22, 2006. the first contact duration was 417 seconds. the DVB has a selection of DVB. The year, in which the client was first contacted by a former airline operator, was by a former airline operator, and by a former airline operator, he was the first to enter the post of the office. the 4 is a 4-purpose cycle. the first of the first 20 MB of the history history to use the 20 MB.

Bank Dataset (Text T0):

a retired soldier shows off his tattoos. a city is a city with a population of singles and tertiary education. no, the average yearly balance is 2144 euros. no he has no personal loan or housing loan a man is contacting a woman on her cell phone on the 29th day of the month. last contact month of year was july, last contact duration was 417 seconds. 184 days after the client was last contacted from a previous campaign. a previous marketing campaign for this client resulted in success with 4 contacts

Bank Dataset (Text GPT-3):

The person is 69 years old, retired, single, and has a tertiary education. They have no credit in default, and their average yearly balance is 2144 euros. They have no housing loan or personal loan. The contact communication type is cellular, and the last contact was on the 29th day of the month and lasted 417 seconds. They have been contacted 4 times before this campaign, and the outcome of the previous marketing campaign was success.

Blood Dataset (List Template):

- Recency - months since last donation: 23
- Frequency - total number of donation: 1
- Monetary - total blood donated in c.c.: 250
- Time - months since first donation: 23
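
The two manual serializations are simple enough to sketch in a few lines. The code below is an illustration based only on the printed examples ("- name: value" per line for the List Template and "The name is value." per feature for the Text Template); the function names and the dictionary representation of a row are assumptions, and the handling of missing values is not covered.

```python
def list_template(row):
    """List Template: one '- <feature name>: <value>' line per feature."""
    return "\n".join(f"- {name}: {value}" for name, value in row.items())

def text_template(row):
    """Text Template: one 'The <feature name> is <value>.' sentence per feature."""
    return " ".join(f"The {name} is {value}." for name, value in row.items())

# Row of the Blood dataset as it appears in the example serialization.
row = {
    "Recency - months since last donation": 23,
    "Frequency - total number of donation": 1,
    "Monetary - total blood donated in c.c.": 250,
    "Time - months since first donation": 23,
}

print(list_template(row))
print(text_template(row))
```

Both functions keep the column order of the input, so permuting names or values before serializing reproduces ablations such as List Perm. Names and List Perm. Values in spirit.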

Blood Dataset (Text Template):

The Recency - months since last donation is 23. The Frequency - total number of donation is 1. The Monetary - total blood donated in c.c. is 250. The Time - months since first donation is 23.

Blood Dataset (Table-To-Text):

the number of the public can be from the number of the public. The 1.2 has a maximum speed of 1.2. The first set of the first set was in 1742 and was in 1742.

Blood Dataset (Text T0):

The donor has made 1 donation in the last 23 months. monetary - total blood donated in c.c. : 250, time - months since first donation : 23

Blood Dataset (Text GPT-3):

The blood donor is a 23-year-old male who has donated blood once, 250 c.c. of blood, 23 months ago.

California Dataset (List Template):

- median income: 3.2377
- median age: 32
- total rooms: 6597
- total bedrooms: 1579
- population: 3689
- households: 1459
- latitude: 34.15
- longitude: -118.01

California Dataset (Text Template):

The median income is 3.2377. The median age is 32. The total rooms is 6597. The total bedrooms is 1579. The population is 3689. The households is 1459. The latitude is 34.15. The longitude is -118.01.

California Dataset (Table-To-Text):

there were 3.2377 people residing in the city. the total rooms have 6597 rooms. the total has a total of 1579. The population was 3689 at the time of the census. The households 1459 is a standard households. The value 34.15 is a value that has a value of 34.15. The longitude has a distance of 1.5 km and is approximately 1.5 km.

California Dataset (Text T0):

median age of 32 years old the hotel has a total of 6597 rooms and 1579 bedrooms. a city has a population of 3689 and households of 1459. a city is located in the southwestern part of the country at latitude 34.15 and longitude -118.01.

California Dataset (Text GPT-3):

The house block is located in the city of Los Angeles, in the state of California. The median income in the area is $3,237, the median age is 32 years old, the total number of rooms is 6,597, the total number of bedrooms is 1,579, the population is 3,689, and the number of households is 1,459. The latitude is 34.15, and the longitude is -118.01.

Car Dataset (List Template):

- Buying price: low
- Doors: three
- Maintenance costs: low
- Persons: more than four
- Safety score: medium
- Trunk size: medium

Car Dataset (Text Template):

The Buying price is low. The Doors is three. The Maintenance costs is low. The Persons is more than four. The Safety score is medium. The Trunk size is medium.

Car Dataset (Table-To-Text):

The price of the price is C1,000. the three Doors were three. The total number of people in the city is more than four. the Safety score was 17.5. the Trunk size is 20.5-inch.

Car Dataset (Text T0):

The refrigerator has three doors and is very cheap. The maintenance costs are low for a family of more than four. The car has a medium safety score and a medium trunk size.

Car Dataset (Text GPT-3):

This car a good choice for those who are looking for a low-priced vehicle with low maintenance costs. It is also a good choice for families or groups of friends who need a car with a bit more space than a smaller car. The safety score is medium, so it is not the best choice for those who are looking for a car with the highest safety rating.

Credit-g Dataset (List Template):

- Status of existing checking account: 0 <= ... < 200 DM
- Duration in month: 11
- Credit history : existing credits paid back duly till now
- Purpose: furniture/equipment
- Credit amount: 1577
- Savings account/bonds: ... >= 1000 DM
- Present employment since: <1
- Installment rate in percentage of disposable income: 4
- Personal status and sex: female : divorced/separated/married
- Other debtors / guarantors: none
- Present residence since: 1
- Property: real estate
- Age in years: 20
- Other installment plans: none
- Housing: own
- Number of existing credits at this bank: 1
- Job: skilled employee / official
- Number of people being liable to provide maintenance for: 1.0
- Telephone: none
- foreign worker: yes

Credit-g Dataset (Text Template):

The Status of existing checking account is 0 <= ... < 200 DM. The Duration in month is 11. The Credit history is existing credits paid back duly till now. The Purpose is furniture/equipment. The Credit amount is 1577. The Savings account/bonds is ... >= 1000 DM. The Present employment since is <1. The Installment rate in percentage of disposable income is 4. The Personal status and sex is female : divorced/separated/married. The Other debtors / guarantors is none. The Present residence since is 1. The Property is real estate. The Age in years is 20. The Other installment plans is none. The Housing is own. The Number of existing credits at this bank is 1. The Job is skilled employee / official. The Number of people being liable to provide maintenance for is 1.0. The Telephone is none. The foreign worker is yes.

Credit-g Dataset (Table-To-Text):

the 0.2 (0.2) is a type of 00.2. The average annual precipitation is 11.5 millimetres (4.5 in). the Credit history has been paid back to a few years. the standard cell is a standard cell. the amount was 1577. the Savings account/bonds were from the Savings account/bonds to the Savings account/bonds. there were 1,000 employees. there were 4,000 people in the city. The male has a male score of the female. the debt was $12.5 million ($9.5 million in 2013). the current residence has a 1,000 feet (460 m) long. the standard estate is a standard estate. It has a age of 20 years. the first installment was the first installment in the year 2005. The Housing is a public transport system that is a network of the public. the company has a number of existing and existing works, and has a number of existing and existing works. the company’s job is job with the job name as "Success". the network has a network of over 800 MT/s. the foreign worker has no foreign worker.

Credit-g Dataset (Text T0):

The checking account has a balance of 0 DM. A man is paying for furniture and equipment with a credit card. The credit amount is 1577, the savings account/bonds are >= 1000 DM. The present employee has been in this job for a year, and the installment rate is 4. % of disposable income. A female who is divorced/separated/married is requesting a loan. The property is located in a gated community and has been on the market since. The man is 20 years old and has no other installment plans. The number of existing credits at this bank is 1. A skilled employee is liable to provide maintenance for 1.0. A foreign worker is without a telephone.

Credit-g Dataset (Text GPT-3):

The person is a 20-year-old female with a checking account status of 0-200 DM. She has been employed for less than a year and her installment rate is 4% of her disposable income. She is divorced/separated/married and has no other debtors or guarantors. She has been living in her current residence for 1 year and owns real estate. She has 1 credit at this bank and is a skilled employee/official. She is liable for maintenance for 1 person. She has no telephone. She is a foreign worker.

Diabetes Dataset (List Template):
- Age: 30 years
- Number of times pregnant: 1
- Diastolic blood pressure: 64 mmHg
- Triceps skin fold thickness: 32 mm
- Plasma glucose concentration at 2 hours in an oral glucose tolerance test (GTT): 122 mg/dl
- 2-hour serum insulin: 156 µU/ml
- Body mass index: 35.1
- Diabetes pedigree function: 0.692

Diabetes Dataset (Text Template):
The Age is 30. The Number of times pregnant is 1. The Diastolic blood pressure is 64. The Triceps skin fold thickness is 32. The Plasma glucose concentration at 2 hours in an oral glucose tolerance test (GTT) is 122. The 2-hour serum insulin is 156. The Body mass index is 35.1. The Diabetes pedigree function is 0.692.

Diabetes Dataset (Table-To-Text):
The age was 30 years, and was the youngest ever to enter the age. the number of children is 1. The Diastolic blood pressure is 64. the Triceps can run up to 32. the 2 hours of the glucose is 122. the 2-hour cycle peaked to 156. The mass index was 35.1. The 0.692 is a fast and pathos.

Diabetes Dataset (Text T0):
The woman is 30 years old and has been pregnant once. The doctor checks the blood pressure and triceps skin fold thickness of the patient. The glucose concentration at 2 hours in an oral glucose tolerance test (GTT) was 122 and the 2-hour serum insulin was 156. The pedigree function of this family is 0.692.

Diabetes Dataset (Text GPT-3):
This patient 30 years old, has been pregnant once, has a diastolic blood pressure of 64 mmHg, and has a triceps skin fold thickness of 32 mm. The patient’s plasma glucose concentration at 2 hours in an oral glucose tolerance test (GTT) is 122 mg/dl, and the patient’s 2-hour serum insulin is 156 µU/ml. The patient’s body mass index is 35.1, and the patient’s diabetes pedigree function is 0.692.

Heart Dataset (List Template):
- Age of the patient: 43 years
- Sex of the patient: male
- Chest pain type: asymptomatic
- Resting blood pressure: 132
- Serum cholesterol: 247
- Fasting blood sugar > 120 mg/dl: yes
- Resting electrocardiogram results: probable or definite left ventricular hypertrophy
- Maximum heart rate achieved: 143
- Exercise-induced angina: yes
- ST depression induced by exercise relative to rest: 0.1
- Slope of the peak exercise ST segment: flat

Heart Dataset (Text Template):
The Age of the patient is 43. The Sex of the patient is male. The Chest pain type is asymptomatic. The Resting blood pressure is 132. The Serum cholesterol is 247. The Fasting blood sugar > 120 mg/dl is yes. The Resting electrocardiogram results is probable or definite left ventricular hypertrophy. The Maximum heart rate achieved is 143. The Exercise-induced angina is yes. The ST depression induced by exercise relative to rest is 0.1. The Slope of the peak exercise ST segment is flat.

Heart Dataset (Table-To-Text):
The male patient was the 43rd of the Age of the patient. The male is a male of the same class. The blood pressure was 132. The Serum cave has a cave of 247. the sugar has a low of 120 mg/dl. the type of the group is the type of the group that has a group of the group. The highest heart rate achieved is 143. the Exercise angina has a yes value. The ST depression has ranged from 0.1 to 0.1. the first segment was a flat of the ST.

Heart Dataset (Text T0):
The patient is a 43-year-old male. The chest pain is asymptomatic and resting blood pressure is 132. The doctor checks the fasting blood sugar and finds it is above 120 mg/dl. The resting ECG results showed probable or definite left ventricular hypertrophy, with maximum heart rate of 143 beats per minute. The patient had exercise-induced angina, with ST depression induced by exercise relative to rest of 0.1. The slope of the peak exercise segment is flat.

Heart Dataset (Text GPT-3):
This patient a 43-year-old male with asymptomatic chest pain. His resting blood pressure is 132 mmHg and his serum cholesterol is 247 mm/dl. He has fasting blood sugar > 120 mg/dl and his resting electrocardiogram results are probable or definite left ventricular hypertrophy. His maximum heart rate achieved is 143 and he has exercise-induced angina. His ST depression induced by exercise relative to rest is 0.1 and his slope of the peak exercise ST segment is flat.

Income Dataset (List Template):
- Age: 30
- Race: Asian-Pac-Islander
- Sex: Female
- Marital status: never married
- Rel. to head of the household: Own
- Native country: Taiwan
- Occupation: execution and management
- Work class: private sector employee
- Capital gain last year: 0
- Capital loss last year: 0
- Education: bachelor’s degree
- Work hours per week: 52

Income Dataset (Text Template):
The Age is 30. The Race is Asian-Pac-Islander. The Sex is Female. The Marital status is never married. The Relation to head of the household is Own-child. The Native country is Taiwan. The Occupation is execution and management. The Work class is private sector employee. The Capital gain last year is 0. The Capital loss last year is 0. The Education is bachelor’s degree. The Work hours per week is 52.

Income Dataset (Table-To-Text):
The age was 30 years, and was the youngest ever to enter the age. The race was held in the Asian-Pac-Islander, and was won by the race. The sex of the village was Female. The first female to be married is Marital status never reported. the family has the head of the household. The Chinese: native region of Taiwan. He was the executioners of the execution and management of the city of New York City. the private sector employee is a private sector employee. The capital was Capital of the State of India. The capital loss of the state was 0.5%. The bachelor’s degree in Education was bachelor’s degree. the week 52 was the 52-hour week.

Income Dataset (Text T0):
Kim is a 30-year-old Asian-Pacific Islander. She is never married and has never had children. The man is the owner of the house and he is the only child. A woman is executing a contract as a private sector employee. The company had a capital loss of $ 0 last year. The man has a bachelor’s degree and works 52 hours a week.

Income Dataset (Text GPT-3):
The person is 30 years old, Asian-Pac-Islander, female, never married, and an own-child relation to the head of the household. The person is from Taiwan and is an execution and management occupation in the private sector employee work class. The person has 0 dollars in capital gain and 0 dollars in capital loss from the previous year. The person has a bachelor’s degree and works 52 hours per week.

Jungle Dataset (List Template):
- white piece strength: 6
- white piece file: 4
- white piece rank: 7
- black piece strength: 0
- black piece file: 5
- black piece rank: 2

Jungle Dataset (Text Template):
The white piece strength is 6. The white piece file is 4. The white piece rank is 7. The black piece strength is 0. The black piece file is 5. The black piece rank is 2.

Jungle Dataset (Table-To-Text):
the piece has a value of 6. the 4 file file has a 4-polytopic file. the piece has a cross point of the right side. the black piece strength is 0. The black piece file has a 5.0.

Jungle Dataset (Text T0):
The white piece has a strength of 6 and a file of 4. The white piece is ranked 7, the black piece is ranked 0. The black piece is ranked number two.

Jungle Dataset (Text GPT-3):
The white piece is stronger than the black piece. The white piece is on file 4 and rank 7. The black piece is on file 5 and rank 2.
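
The List Template and Text Template examples above follow a fixed pattern: one "- name: value" line per feature, or one "The name is value." sentence per feature. A minimal Python sketch of that pattern is given below. The function names `list_template` and `text_template` are hypothetical, and the sketch ignores details visible in some examples (such as units appearing only in the list variant), so it should be read as an illustration rather than the exact serialization code.

```python
def list_template(row: dict) -> str:
    """Serialize one row as '- column: value' lines (List Template style)."""
    return "\n".join(f"- {name}: {value}" for name, value in row.items())


def text_template(row: dict) -> str:
    """Serialize one row as 'The column is value.' sentences (Text Template style)."""
    return " ".join(f"The {name} is {value}." for name, value in row.items())


# Feature names and values copied from the Jungle example above.
jungle_row = {
    "white piece strength": 6,
    "white piece file": 4,
    "white piece rank": 7,
    "black piece strength": 0,
    "black piece file": 5,
    "black piece rank": 2,
}

print(list_template(jungle_row))
print(text_template(jungle_row))
```

Because both functions only iterate over the column names, the same sketch applies unchanged to any of the tabular datasets once a row is available as a name-to-value mapping.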

9.1 Large Healthcare Claims Dataset

End Of Life Task anonymized (List Template):

Summary: The patient is a 73 year old
hispanic or latino man.

May 30, 2014: saw a doctor for dermatology
Conditions:
- chronic cholecystitis
- aplastic anemia due to drugs

April 21, 2017: visited the hospital for 12 days
Conditions:
- chronic cholecystitis [...]

End Of Life Task anonymized (Text Template):

Summary: The patient is a 73 year old
hispanic or latino man.

On May 30, 2014 the patient saw a doctor for dermatology with a primary complaint of chronic cholecystitis. He was also treated for aplastic anemia due to drugs.

On April 21, 2017 the patient visited the hospital for 12 days with a primary complaint of chronic cholecystitis. [...]

End Of Life Task anonymized (List Permuted Names):

Summary: The patient is a 73 year old
hispanic or latino man.

May 30, 2014: saw a doctor for dermatology
Conditions:
- onychomycosis due to dermatophyte
- chronic kidney disease

April 21, 2017: visited the hospital for 12 days
Conditions:
- onychomycosis due to dermatophyte
[...]
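
The visit-level List Template and its permuted-names variant above can be sketched in a few lines of Python. Everything about the data layout here is an assumption made for illustration: the record fields (`kind`, `date`, `specialty`, `days`, `conditions`), the function names, and the four-entry vocabulary are hypothetical. The permutation is implemented as a single fixed mapping over concept names, which is only inferred from the example, where the same replacement name appears in both visits.

```python
import random


def serialize_visit_list(visit: dict) -> str:
    """Render one visit in the List Template style shown above."""
    if visit["kind"] == "outpatient":
        header = f"{visit['date']}: saw a doctor for {visit['specialty']}"
    else:  # treated here as a hospital stay
        header = f"{visit['date']}: visited the hospital for {visit['days']} days"
    return "\n".join([header, "Conditions:"] + [f"- {c}" for c in visit["conditions"]])


def permute_names(vocabulary: list, seed: int = 0) -> dict:
    """Map every concept name to some name from the same vocabulary (a bijection)."""
    shuffled = list(vocabulary)
    random.Random(seed).shuffle(shuffled)
    return dict(zip(vocabulary, shuffled))


# Hypothetical records mirroring the anonymized example above.
visits = [
    {"kind": "outpatient", "date": "May 30, 2014", "specialty": "dermatology",
     "conditions": ["chronic cholecystitis", "aplastic anemia due to drugs"]},
    {"kind": "inpatient", "date": "April 21, 2017", "days": 12,
     "conditions": ["chronic cholecystitis"]},
]
vocabulary = [
    "chronic cholecystitis",
    "aplastic anemia due to drugs",
    "onychomycosis due to dermatophyte",
    "chronic kidney disease",
]

mapping = permute_names(vocabulary)
permuted = [dict(v, conditions=[mapping[c] for c in v["conditions"]]) for v in visits]

print("\n\n".join(serialize_visit_list(v) for v in visits))
print()
print("\n\n".join(serialize_visit_list(v) for v in permuted))
```

Using one mapping for all visits keeps the permuted record internally consistent: a condition that recurs across visits still recurs, only under a different name.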

Supplementary Materials References

Avati, A., Jung, K., Harman, S., Downing, L., Ng, A., and Shah, N. H. (2018). Improving palliative care with deep learning.
BMC medical informatics and decision making, 18(4):55–64.
Borisov, V., Leemann, T., Seßler, K., Haug, J., Pawelczyk, M., and Kasneci, G. (2022). Deep Neural Networks and Tabular
Data: A Survey. Technical Report arXiv:2110.01889, arXiv.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell,
A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C.,
Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A.,
Sutskever, I., and Amodei, D. (2020). Language models are few-shot learners. In Larochelle, H., Ranzato, M., Hadsell,
R., Balcan, M., and Lin, H., editors, Advances in Neural Information Processing Systems, volume 33, pages 1877–1901.
Curran Associates, Inc.
Curtis, J. R., Treece, P. D., Nielsen, E. L., Gold, J., Ciechanowski, P. S., Shannon, S. E., Khandelwal, N., Young, J. P.,
and Engelberg, R. A. (2016). Randomized trial of communication facilitators to reduce family distress and intensity of
end-of-life care. American journal of respiratory and critical care medicine, 193(2):154–162.
Detrano, R., Janosi, A., Steinbrunn, W., Pfisterer, M., Schmid, J.-J., Sandhu, S., Guppy, K. H., Lee, S., and Froelicher,
V. (1989). International application of a new probability algorithm for the diagnosis of coronary artery disease. The
American journal of cardiology, 64(5):304–310.
Dua, D. and Graff, C. (2017). UCI machine learning repository.
Grinsztajn, L., Oyallon, E., and Varoquaux, G. (2022). Why do tree-based models still outperform deep learning on typical
tabular data? In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
Hasan, M. K., Alam, M. A., Das, D., Hossain, E., and Hasan, M. (2020). Diabetes Prediction Using Ensembling of
Different Machine Learning Classifiers. IEEE Access, 8:76516–76531.
Hripcsak, G., Duke, J. D., Shah, N. H., Reich, C. G., Huser, V., Schuemie, M. J., Suchard, M. A., Park, R. W., Wong, I.
C. K., Rijnbeek, P. R., van der Lei, J., Pratt, N., Norén, G. N., Li, Y.-C., Stang, P. E., Madigan, D., and Ryan, P. B.
(2015). Observational Health Data Sciences and Informatics (OHDSI): Opportunities for Observational Researchers.
MEDINFO 2015: eHealth-enabled Health, pages 574–578. IOS Press.
Kadra, A., Lindauer, M., Hutter, F., and Grabocka, J. (2021). Well-tuned simple nets excel on tabular datasets. Advances
in neural information processing systems, 34:23928–23941.
Kohavi, R. et al. (1996). Scaling up the accuracy of naive-bayes classifiers: A decision-tree hybrid. In Kdd, volume 96,
pages 202–207.
Moro, S., Cortez, P., and Rita, P. (2014). A data-driven approach to predict the success of bank telemarketing. Decision
Support Systems, 62:22–31.
Muhammad, Y., Tahir, M., Hayat, M., and Chong, K. T. (2020). Early and accurate detection and diagnosis of heart disease
using intelligent computational model. Scientific Reports, 10(1):19747.
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray,
A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P., Leike, J., and
Lowe, R. (2022). Training language models to follow instructions with human feedback. arXiv:2203.02155 [cs].
Pace, R. K. and Barry, R. (1997). Sparse spatial autoregressions. Statistics & Probability Letters, 33(3):291–297.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R.,
Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. (2011). Scikit-
learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.
Sanh, V., Webson, A., Raffel, C., Bach, S., Sutawika, L., Alyafeai, Z., Chaffin, A., Stiegler, A., Raja, A., Dey, M., Bari,
M. S., Xu, C., Thakker, U., Sharma, S. S., Szczechla, E., Kim, T., Chhablani, G., Nayak, N., Datta, D., Chang, J.,
Jiang, M. T.-J., Wang, H., Manica, M., Shen, S., Yong, Z. X., Pandey, H., Bawden, R., Wang, T., Neeraj, T., Rozen, J.,
Sharma, A., Santilli, A., Fevry, T., Fries, J. A., Teehan, R., Scao, T. L., Biderman, S., Gao, L., Wolf, T., and Rush, A. M.
(2022). Multitask prompted training enables zero-shot task generalization. In International Conference on Learning
Representations.

Smith, J. W., Everhart, J., Dickson, W., Knowler, W., and Johannes, R. (1988). Using the ADAP Learning Algorithm to
Forecast the Onset of Diabetes Mellitus. Proceedings of the Annual Symposium on Computer Application in Medical
Care, pages 261–265.
van Rijn, J. N. and Vis, J. K. (2014). Endgame analysis of dou shou qi. ICGA Journal, 37(2):120–124.
Yeh, I.-C., Yang, K.-J., and Ting, T.-M. (2009). Knowledge discovery on rfm model using bernoulli sequence. Expert
Systems with Applications, 36(3):5866–5871.

