TabLLM: Few-shot Classification of Tabular Data with Large Language Models
Stefan Hegselmann (1, 2), Alejandro Buendia (1), Hunter Lang (1), Monica Agrawal (1), Xiaoyi Jiang (2), David Sontag (1)
(1) MIT CSAIL, (2) University of Münster
[Figure 1 (diagram). Panels: 1. Tabular data with k labeled rows. 2. Serialize feature names and values into
natural-language string with different methods, e.g., "The age is 42. The education is Master. The gain is 594." and
"The age is 29. The education is Doctorate. The gain is 1086.", each followed by the prompt "Does this person earn more
than 50000 dollars? Yes or no? Answer:". 4a. Fine-tune LLM using labeled examples (labels such as ">50K", answer "Yes",
backprop). 4b. Use LLM for prediction on unlabeled examples (predictions such as "No").]
Figure 1: Overview of TabLLM. We first serialize the feature names and values into a natural language string. We
evaluate different strategies. This string is then combined with a task-specific prompt. To get predictions, we obtain
output probabilities from the LLM for each of a pre-specified set of verbalizer tokens (e.g., “Yes”, “No”), which map to
class labels (e.g., 1, −1). If 𝑘 > 0, we use the 𝑘 labeled examples to fine-tune the large language model using T-Few (Liu
et al., 2022). Finally, we use the (possibly tuned) large language model to obtain predictions on unlabeled examples.
Despite its simplicity, we find that TabLLM outperforms prior deep-learning-based tabular classification methods on
several benchmark datasets. By using information from the natural-language column names and feature values, it
often enables effective zero-shot classification of tabular data. Unlike many deep learning methods on tabular data,
this approach is also competitive with gradient-boosted tree baselines and outperforms them or is on par until 256 shots.
In the very-few-shot setting it outperforms them by a considerable margin. The main contributions of this work are:

• We introduce TabLLM, a novel framework leveraging LLMs for data-efficient tabular classification
• We study nine serialization techniques and explore their performance across ten different datasets
• We show that TabLLM instantiated with a simple text serialization and the T0 LLM can outperform state-of-the-art
  neural models and tree ensembles in the zero- and few-shot setting
• We investigate the application of TabLLM to a large real-world healthcare claims dataset and introduce
  serialization methods that deal with many input features

2 RELATED WORK

2.1 Machine Learning on Tabular Data

Due to the success of deep learning in other domains, there have been many recent attempts at representation learning
for tabular data. Self-supervised objectives have largely revolved around the prediction of masked cells, the
identification or correction of corrupted cells, and contrastive losses over augmentations (Bahri et al., 2022; Somepalli
et al., 2021; Yoon et al., 2020; Arik and Pfister, 2021; Huang et al., 2020). Additional efforts have included
differentiable trees, which combine advantages of tree ensembles with gradient-based optimization of neural networks
(Kontschieder et al., 2015; Popov et al., 2020). However, several recent comprehensive reviews (Shwartz-Ziv
and Armon, 2022; Borisov et al., 2022a; Grinsztajn et al., 2022) found that gradient-boosted tree ensembles like
XGBoost (Chen and Guestrin, 2016) and LightGBM (Ke et al., 2017) systematically outperform these novel deep learning
architectures, even with proper fine-tuning and regularization (Kadra et al., 2021). Levin et al. (2022) found
utility in transfer learning in the semi-supervised setting, but required a set of additional supervised tasks on the same
table, which can be a nontrivial limitation. They investigate few-shot classification for medical diagnosis using 4 to
200 labeled examples, but do not exploit the power of large pre-trained models, as we do in this work. Hollmann et al.
(2022) recently introduced TabPFN, a Bayesian neural network pre-trained on synthetic tabular data, outperforming
gradient boosted trees in a comprehensive evaluation.

2.2 Large Language Models for Tabular Data

Another approach has been to leverage the natural language capabilities of language models. Yin et al. (2020) use a
language model for semantic parsing of natural language queries over tabular data. Li et al. (2020) investigate the
ability of language models to perform entity matching on tabular data, i.e., determining if two rows refer to the same
object. Harari and Katz (2022) study data enrichment by linking each table row with additional unstructured text
(e.g., from Wikipedia) from which they generated additional features using a language model. However, this setup
requires named entities (e.g., celebrities, universities, etc.), which is quite limiting. Bertsimas et al. (2022) studied two
healthcare datasets and used a language model to generate feature embeddings, which they fed into classifiers like
gradient boosted trees. All these studies use a BERT-style language model (Devlin et al., 2019). Narayan et al. (2022)
recently assessed in-context learning with the autoregressive language model GPT-3 for tabular data cleaning tasks.
They found that it often outperforms state-of-the-art approaches with ten labeled examples. Borisov et al. (2022b)
introduced an LLM-agnostic method to generate realistic tabular data and found that it achieved better results than
existing approaches. In contrast, here we study classification tasks of tabular data and investigate parameter-efficient
fine-tuning of LLMs.

To use an LLM for tabular data, the table must be serialized into a natural text representation. All aforementioned
works relied on simple list or sentence serializations; Yin et al. (2020) also included the column data type in the
serialized string. Only Bertsimas et al. (2022) studied different serialization variants, but this was in a different context
of deriving feature embeddings from BERT-style language models. The LIFT method introduced by Dinh et al. (2022)
comes closest to our work. The authors evaluated the capabilities of fine-tuned GPT-3 and GPT-J models for
regression and classification on synthetic, tabular, and vision data. They also studied the sample efficiency and
considered different static serialization templates assessing the effect of including column names in the input. In this work,
we focus on the publicly available T0 model and perform a broader analysis of nine serialization techniques including
automatic approaches and ablations evaluating the importance of feature values. Particularly, we are interested in
leveraging prior knowledge encoded in LLMs and we do a more fine-grained analysis of the sample efficiency
including zero-shot experiments on ten different datasets.

3 METHODS

3.1 TabLLM for Tabular Data Classification

Problem Formalization. Suppose we have a tabular dataset with $n$ rows and $d$ columns or features. We can
formalize this as $D = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, where each $\mathbf{x}_i$ is a $d$-dimensional feature
vector. Since we consider classification, $y_i \in C$ for a set of classes $C$. We define the column
names or feature names as $F = \{f_1, \ldots, f_d\}$. We assume the $f_i$'s are natural-language strings such as “age” or
“education” (see Figure 1). For our $k$-shot classification experiments, we only use a subset $D_k$ of size $k$
(sampled from $D$ with replacement) for fine-tuning or training.

Serialization of Tabular Data. To use an LLM for tabular data, the table must be transformed into a natural text
representation. Typically, when prompting an LLM, there is a template used to both serialize the inputs into one
natural-language string, and to provide the prompt itself (e.g., the string “Does this person make more than 50,000
dollars? Yes or no?”), which is usually located after the serialized input. In this work, we break these pieces up
into a serialization and a prompt. We define a function $\texttt{serialize}(F, \mathbf{x})$ that takes the column names
$F$ and feature values $\mathbf{x}$ for a row as inputs and creates a textual representation of the input. Combining
this serialization with a task-specific prompt $p$ will then form the LLM input
$(\texttt{serialize}(F, \mathbf{x}), p)$. This is illustrated in Figure 1. We primarily study the serialization, since
that is the biggest difference compared to existing applications of prompting.
Previous work has usually considered a simple concatenation of feature names and values as a serialization of
tabular data (Li et al., 2020; Narayan et al., 2022). In our work, this function can be arbitrarily complex. For instance, we
explore serializations that include (i) incorporating another LLM and (ii) employing feature selection as a substep.

Large Language Models for Classification. TabLLM can be used with different LLMs that generate text based
on a natural-language input. Let $\mathrm{LLM}$ be an LLM with vocabulary $V$. Then,
$\mathrm{LLM}((\texttt{serialize}(F, \mathbf{x}), p)) \in V^*$ is the prompted output of the LLM. In our few-shot
setting, $\{(\texttt{serialize}(F, \mathbf{x}), p) \mid (\mathbf{x}, y) \in D_k\}$ can be used
as training examples for fine-tuning the LLM. The LLM generates text in the vocabulary space $V^*$ that has to be
mapped to a valid class in $C$. Several approaches already exist for this problem. For example, the verbalizer (Schick
and Schütze, 2021) defines a mapping between LLM output tokens and the discrete label space. Verbalizers can
be manually specified or automatically learned; see Cui et al. (2022) for an overview of different verbalizer-learning
approaches. In this work, we assume for simplicity that the verbalizer mapping is manually specified (see answer
choices in the templates in Sec. 8 in the Supplement).

3.2 Our Instantiation of TabLLM

Serialization Approaches for TabLLM. The performance of LLMs is very sensitive to the precise details of
the natural-language input (Zhao et al., 2021; Webson and Pavlick, 2022). In this work, we focus on the serialization
of the tabular data. For the prompt, we use a simple description of the classification task and perform no further
prompt engineering. We study nine different serialization formats varying in complexity. All serialization methods
require minimal human effort to apply to new classification tasks. We evaluate several methods that generate natural
text to create inputs that are closer to the training distribution of the LLM, thereby improving zero and very-few-shot
performance. Additional details and examples for the serializations are given in Sec. 1.2.1 and 9 in the Supplement.
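
As a rough illustration of the serialize(F, x) function and the prompt p from Sec. 3.1, the Python sketch below builds an
LLM input in the sentence style shown in Figure 1. The function names, the "- name: value" list layout, and the
placement of the trailing "Answer:" are assumptions made for illustration; they are not taken from the released code.

```python
from typing import Sequence


def serialize_text(names: Sequence[str], values: Sequence[object]) -> str:
    """Sentence-style serialization: 'The <name> is <value>.' for every feature."""
    return " ".join(f"The {n} is {v}." for n, v in zip(names, values))


def serialize_list(names: Sequence[str], values: Sequence[object]) -> str:
    """List-style serialization; the '- name: value' layout is an assumption."""
    return "\n".join(f"- {n}: {v}" for n, v in zip(names, values))


def build_llm_input(serialized_row: str, prompt: str) -> str:
    """Form (serialize(F, x), p): the task prompt follows the serialized row."""
    return f"{serialized_row}\n\n{prompt}\nAnswer:"


# Example row taken from Figure 1.
F = ["age", "education", "gain"]
x = [42, "Master", 594]
p = "Does this person earn more than 50000 dollars? Yes or no?"

llm_input = build_llm_input(serialize_text(F, x), p)
print(llm_input)
```

Keeping the serialization and the prompt as separate functions mirrors the split made in Sec. 3.1, so a different
serialization can be swapped in without touching the prompt.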
• List Template: A list of column names and feature values. We fixed an arbitrary ordering of the columns.
• Text Template: A textual enumeration of all features as “The column name is value.” (see Figure 1).
• Table-To-Text: We use an LLM fine-tuned on a table-to-text generation task from HuggingFace
  (Narrativaai/bloom-560m-finetuned-totto-table-to-text). To ensure that the serialization includes all data,
  we hand each column-value tuple to the model separately and concatenate the outputs.
• Text T0: We use the LLM T0 with 11B parameters (bigscience/T0pp) (Sanh et al., 2022). We split up
  a row into pairs of two column-value tuples. We send them to the LLM separately with the prompt “Write this
  information as a sentence:” and combine the outputs.
• Text GPT-3: We use GPT-3 (engine text-davinci-002) accessible through an API (Ouyang et al., 2022).
  GPT-3 was able to serialize all features at once, so we use a list of all features with the prompt “Rewrite all
  list items in the input as a natural text.” as input. We guide the output with “The {person, car, patient} is”.

We consider the following serializations as ablations:

• List Only Values: List Template for feature values only. We want to evaluate whether column names aid
  the classification performance.
• List Permuted Names: List Template with permuted column names. Hence, the wrong column name is
  associated with each feature value. The permutation is the same across all examples. We perform this
  ablation to study the relevance of the correct association between column names and feature values.
• List Permuted Values: List Template with consistently permuted values across all examples. We
  generate one permutation for each column and apply this mapping to all column values. For continuous values,
  we use ten uniform bins. This tests whether the LLM uses the fine-grained information encoded by the
  feature values for zero-shot and few-shot classification.
• List Short: List Template with at most ten features. We only consider this for the healthcare dataset where
  the number of features exceeds the input limit of the LLM. We want to study the effect of less information.

Large Language Models for TabLLM. Another crucial component of TabLLM is the LLM. TabLLM is agnostic
to both the LLM and the specific fine-tuning method that is used. We only consider a single LLM for most of our
experiments. We employ the T0 encoder-decoder model with 11 billion parameters as the LLM for TabLLM (Sanh et al.,
2022). It was trained on a large variety of task-specific prompts, making it a suitable candidate for our experiments
(Sanh et al., 2022). This model has a token limit of 1024, which roughly corresponds to 400 words. We also evaluate
the effect of a smaller version of the T0 model (T0 3B). We fine-tuned on the few-shot data $D_k$ using the recent T-Few
recipe, which outperforms other parameter-efficient tuning methods such as soft prompt tuning (Liu et al., 2022). In
addition, we perform zero-shot experiments with the LLM GPT-3 (engine text-davinci-002) (Ouyang et al., 2022).

4 EXPERIMENTAL SETUP

4.1 Datasets

We studied TabLLM in two experimental settings. First, we considered nine medium-sized tabular datasets for
binary and multi-class classification. We systematically identified datasets from Kadra et al. (2021), Grinsztajn et al.
(2022), and Borisov et al. (2022a). We included datasets with at most 50,000 rows to keep the fine-tuning costs
manageable and at most 30 columns to stay within T0’s token limit. We also required textual feature names to make the
serializations more meaningful and we excluded datasets with derived feature values (e.g., mean pixel values). This
led to the inclusion of Bank (45,211 rows, 16 features), Blood (748, 4), California (20,640, 8), Car (1,728, 8),
Credit-g (1,000, 20), Income (48,842, 14), and Jungle (44,819, 6). We added two additional datasets from Kaggle that
fulfilled our inclusion criteria: Diabetes (768, 8) and Heart (918, 11). Second, we evaluated TabLLM for risk
stratification on three binary classification tasks, following prior work by Kodialam et al. (2021) and similarly using a
de-identified health claims dataset from a U.S. health insurer. We predicted the end-of-life (EoL) of all patients older than
70 years, which can be used to inform care in a palliative setting (Avati et al., 2018). We also considered the need for
any surgical procedure (Surgery) and the likelihood of hospitalization (LoH), which can help with determining health
care needs and estimating future costs. Additional details on all datasets can be found in Sec. 1 in the Supplement.
We release the code for our experiments on GitHub (https://github.com/clinicalml/TabLLM).

4.2 LLM and Fine-tuning

We used the HuggingFace implementation of the T0 model (bigscience/{T0pp,T0_3B}). Prompts for the LLM
were designed following Sanh et al. (2022) using the PromptSource framework (Bach et al., 2022). Each class
in our classification tasks was manually encoded in a textual response, e.g., “Yes” and “No” for true and false (Sanh
et al., 2022). The prediction probability for each class corresponds to the probability of the LLM generating its token
sequence normalized across all classes. All templates used in this work are given in Sec. 8 in the Supplement.
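
The normalization across classes described in Sec. 4.2 can be sketched in a few lines of Python. The sketch assumes
that the per-token log-probabilities of each class's answer string (e.g., "Yes", "No") have already been obtained from
the model. Summing them without length normalization, the function name class_probabilities, and the example numbers
are assumptions for illustration, not a description of the released implementation.

```python
import math
from typing import Dict, List


def class_probabilities(token_logprobs: Dict[str, List[float]]) -> Dict[str, float]:
    """Turn per-token log-probabilities of each verbalizer string into class probabilities.

    token_logprobs maps a class's answer string to the log-probabilities the LLM
    assigned to each of its tokens. The sequence log-probability is their sum,
    and a softmax over classes normalizes the result.
    """
    sequence_logprob = {c: sum(lps) for c, lps in token_logprobs.items()}
    # Subtract the maximum before exponentiating for numerical stability.
    shift = max(sequence_logprob.values())
    unnormalized = {c: math.exp(lp - shift) for c, lp in sequence_logprob.items()}
    total = sum(unnormalized.values())
    return {c: u / total for c, u in unnormalized.items()}


# Hypothetical log-probabilities for a single-token "Yes" and a single-token "No".
probs = class_probabilities({"Yes": [math.log(0.6)], "No": [math.log(0.2)]})
print(probs)
```

Because only the probabilities of the pre-specified answer strings are compared, any other text the LLM might generate
has no influence on the prediction.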
For fine-tuning, we adopted the default hyperparameters of the T-Few method without any additional parameter
tuning (Liu et al., 2022). The authors used a setup of $k = 32$ shots and 1,000 training steps for most of their experiments,
which corresponds to 31.25 epochs. Hence, we fixed 30 training epochs for all few-shot experiments on the public
tabular datasets. We used 20% of the data as a test set. For the large healthcare claims dataset, we used 10 epochs for
up to 256 shots and 3 epochs for 1,024, 4,096 and 16,384 to reduce the runtime and prevent overfitting for many
training examples. We used a test set of 10,000 examples for the three healthcare tasks. All experiments were evaluated with
the area under the receiver operating characteristic curve (AUC). We used macro-AUC one-versus-rest for the
multiclass setting. Estimates for the runtime are given in Sec. 2 in the Supplement.

4.3 Baseline Models

We compared TabLLM to several baselines. For the simplest baseline, we used a logistic regression (LR) model.
Since previous work showed the superiority of gradient boosted tree ensembles (Borisov et al., 2022a), we included
the most common models XGBoost (Chen and Guestrin, 2016) and LightGBM (Ke et al., 2017). We also evaluated
several state-of-the-art deep learning baselines. TabNet is a widely used neural model for tabular data that uses
attention over columns (Arik and Pfister, 2021). SAINT is a more recent approach that uses attention over rows and
columns (Somepalli et al., 2021). SAINT performed best in a comprehensive review on tabular data (Borisov et al.,
2022a). NODE is a differentiable tree ensemble method that performed best in the evaluation of Shwartz-Ziv and
Armon (2022). Lastly, we include TabPFN, a Bayesian neural network that was pre-trained on synthetic tabular
data (Hollmann et al., 2022). In contrast to TabLLM, we performed hyperparameter tuning for all baselines except
TabPFN (see Sec. 3 in the Supplement), which requires no tuning by design. We adopted the parameter ranges from
previous reviews (Borisov et al., 2022a; Grinsztajn et al., 2022). Since no validation set exists in the few-shot setting,
we used 4-fold cross validation on the $k$ shots. In particular, we did not use a large validation set for hyperparameter
tuning, unlike some few-shot learning works as highlighted by Perez et al. (2021). We encoded categorical values as
one-hot vectors. We also tested ordinal encoding for LR, XGBoost, LightGBM, and TabPFN, but it showed worse
results (see Table 12, 13, and 14 in the Supplement). In addition, we give results for GPT-3 (text-davinci-002)
without fine-tuning, i.e., in the zero-shot setting using the Text Template serialization.

For the three health claims tasks, we used the same experimental setup for the baselines. However, we only included
LR and LightGBM due to runtime limitations. Following Kodialam et al. (2021), each patient’s input was a one-hot
encoded vector. For each medical concept, there were three indicator variables of whether that concept occurred within
30 days, 1 year, and anytime before prediction time.

4.4 Serializations

For the public datasets, some column names and feature values were manually mapped to human-readable forms,
based on the provided documentation. For instance, for the Income dataset, the feature name hours per week was
mapped to work hours per week and the feature value private for working class was mapped to private sector
employee. Numerical values were not changed.

Serialization was more complex for the healthcare claims data. Each patient record is a time series of visits, with
each visit consisting of a list of medical conditions and procedures. We only considered the manual serializations
List Template and Text Template. We tried to mimic the style of a medical professional to tap potential prior
knowledge of the LLM. To this end, the serialization starts with an intro sentence containing the patient’s gender, age, and
race. It then describes each visit, stating its date, the type of doctor the patient saw (e.g., dermatology) if an
outpatient visit or length of hospitalization if an inpatient visit, the primary complaint of the associated visit, and
procedures performed. Since there are no feature values in this dataset, we omit List Only Values and List Permuted Values.
We also performed experiments for concept selection and different names for the medical concepts. Details for these
additional experiments and examples of the serializations are given in Sec. 1.2.2, 1.2.3, and 9 in the Supplement.

5 RESULTS

5.1 Effects of serialization

Figure 2 shows the performance of different serialization methods for TabLLM averaged over the nine public
datasets. The Text Template serialization performed very well across all experiments. In the zero-shot setting, the
Text Template showed improvements over List Template, indicating the benefit of a serialization that is closer to the
training distribution of T0. However, these differences already vanished for 8 training examples. Hence, very few
training examples might already suffice to adjust for different templates. This suggests that sophisticated
serializations might be unnecessary when some training data exists.

Using LLMs for serialization showed mixed results. The ordering is according to the complexity of the LLM used
for serialization. GPT-3 has 175B, T0 11B, and the BLOOM table-to-text model 0.56B parameters. Different
reasons might be responsible for the worse performance overall. The models tended to hallucinate information for
some examples, leading to biased predictions of TabLLM.
For instance, GPT-3 added “this car is a good choice” or added entirely new data to some examples (see Sec. 9 in
the Supplement). Also, the LLMs are not completely faithful at including all features, even though we tried to enforce
it in our experiments. This could explain why none of the LLM serializations reaches the same performance as the
template serializations, even for many training examples.

Using only feature values had a poor performance for zero and very few shots, but the performance equalized with
more training examples. The same applies to the list serialization with permuted feature names. This indicates
that if enough training examples are available, the serialization approach does not matter, but that TabLLM relies
on information from the feature names in the zero-shot and few-shot regime, and also relies on the association of the
names with the correct values. The discrepancy for zero and very few shots was even stronger for List Permuted
Values, which suggests that TabLLM relies more on the correct values than feature names. Again, the performance
equalized for more examples showing the ability of TabLLM to learn new associations if enough training data is available.
Using the smaller T0 3B model showed a slightly decreased performance (see Table 12, 13, and 14 in the Supplement).

For the healthcare claims dataset, we found that the List Template slightly outperformed the Text Template
serialization (see Table 15 in the Supplement). This was consistent across tasks. The List Short serialization only
performed slightly worse. The evaluation of different concept selection strategies showed that choosing the most frequent
conditions per patient performed best. We found no considerable performance difference for different concept names.

From here onwards, we show results for TabLLM using the Text Template serialization for the public datasets. For the
healthcare claims dataset, we use the List Template serialization and select the most frequent conditions. Results
for all (dataset, serialization) combinations (Table 12, 13, and 14) and the additional experiments on the healthcare
dataset (Table 5 and 7) can be found in the Supplement.

[Figure 2 (plot). Y-axis: Average AUC (SD) across tabular datasets. Curves: List Template, Text Template,
Table-To-Text, Text T0, Text GPT-3, List Only Values, List Perm. Names, List Perm. Values.]
Figure 2: Average AUC and SD of different serializations across nine public datasets. Text Template performs best
for zero and few training examples. For many examples, the performance of different serializations converges.

[Figure 3 (plot). Y-axis: Average AUC (SD) across tabular datasets. Curves: Log. Reg., LightGBM, XGBoost, SAINT,
TabNet, NODE, TabPFN, GPT-3, TabLLM.]
Figure 3: Average AUC and SD of TabLLM versus all baseline models across nine public datasets. TabLLM
outperforms all baselines for zero and very few training examples. TabPFN is the strongest baseline.

5.2 Public Tabular Datasets

Figure 3 shows the averaged results for TabLLM using the best serialization (Text Template) versus all baseline
models. Table 1 contains the detailed results for TabLLM, TabPFN, and XGBoost. TabLLM showed a similar
behavior across datasets. It achieved nontrivial zero-shot performance for all tasks except on Credit-g and Heart. For
Heart this might be due to the dataset’s inclusion criteria requiring eligibility for a heart procedure biasing the
prediction. In all cases, TabLLM’s performance improved with a higher number of shots. In the zero-shot setting,
TabLLM was on par with GPT-3 even though GPT-3 is a much larger model than T0 (175B vs. 11B
parameters). TabPFN consistently outperformed the other baseline models across all numbers of training examples. TabPFN
reached TabLLM’s performance with 4 to 256 (Income) training examples. LR was the second-best baseline,
often beating the tree models, which might be due to our extensive parameter tuning (see Sec. 4 in the Supplement).
TabLLM outperformed or was on par with the tree ensemble baselines until 256 training examples for all datasets
except Calhousing and Jungle. For fewer shots, it often outperformed them by a large margin. XGBoost performed
relatively poorly for few shots, which was probably due to overfitting on the small training and validation sets (as
described in the previous section, we do not use large validation sets for hyperparameter tuning to ensure the results are
truly few-shot). TabLLM outperformed the neural baselines SAINT, NODE, and TabNet in many settings. It also
was on par or very close to the best baseline models on the full datasets, indicating that there is little performance lost
due to the serialization and the choice of model family.

Table 1: Test AUC performance of TabLLM, the best tree ensemble model (XGBoost), and the best baseline (TabPFN) on
the public tabular datasets. Each column reports the performance for $k$ training examples. TabLLM (T0 + Text Template)
outperforms XGBoost and TabPFN in the very-few-shot regime. Standard deviations (in parentheses) are given across five
random seeds.

                      Number of Shots
Dataset     Method    0           4           8           16          32          64          128         256         512         all
Bank        XGBoost   -           0.50 (.00)  0.56 (.09)  0.68 (.04)  0.76 (.03)  0.83 (.02)  0.85 (.03)  0.88 (.01)  0.90 (.01)  0.94 (.00)
Bank        TabPFN    -           0.59 (.14)  0.66 (.08)  0.69 (.02)  0.76 (.03)  0.82 (.03)  0.86 (.02)  0.89 (.00)  0.90 (.00)  0.91 (.00)
Bank        TabLLM    0.63 (.01)  0.59 (.10)  0.64 (.05)  0.65 (.05)  0.64 (.06)  0.69 (.03)  0.82 (.05)  0.87 (.01)  0.88 (.01)  0.92 †
Blood       XGBoost   -           0.50 (.00)  0.58 (.07)  0.66 (.04)  0.67 (.06)  0.68 (.05)  0.71 (.06)  0.70 (.07)  0.67 (.06)  0.71 (.04)
Blood       TabPFN    -           0.52 (.08)  0.64 (.04)  0.67 (.01)  0.70 (.04)  0.73 (.04)  0.75 (.04)  0.76 (.04)  0.76 (.03)  0.74 (.03)
Blood       TabLLM    0.61 (.04)  0.58 (.09)  0.66 (.03)  0.66 (.07)  0.68 (.04)  0.68 (.04)  0.68 (.06)  0.70 (.08)  0.68 (.04)  0.70 (.04)
Calhousing  XGBoost   -           0.50 (.00)  0.62 (.10)  0.74 (.03)  0.79 (.04)  0.82 (.04)  0.87 (.01)  0.90 (.01)  0.92 (.01)  0.97 (.00)
Calhousing  TabPFN    -           0.63 (.13)  0.63 (.11)  0.80 (.03)  0.85 (.03)  0.89 (.01)  0.91 (.01)  0.92 (.00)  0.93 (.00)  0.94 (.00)
Calhousing  TabLLM    0.61 (.01)  0.63 (.05)  0.60 (.07)  0.70 (.08)  0.77 (.08)  0.77 (.04)  0.81 (.02)  0.83 (.01)  0.86 (.02)  0.95 (.00)
Car         XGBoost   -           0.50 (.00)  0.59 (.04)  0.70 (.08)  0.82 (.03)  0.91 (.02)  0.95 (.01)  0.98 (.01)  0.99 (.01)  1.00 (.00)
Car         TabPFN    -           0.64 (.06)  0.75 (.05)  0.87 (.04)  0.92 (.02)  0.97 (.00)  0.99 (.01)  1.00 (.00)  1.00 (.00)  1.00 (.00)
Car         TabLLM    0.82 (.02)  0.83 (.03)  0.85 (.03)  0.86 (.03)  0.91 (.02)  0.96 (.02)  0.98 (.01)  0.99 (.00)  1.00 (.00)  1.00 (.00)
Credit-g    XGBoost   -           0.50 (.00)  0.51 (.07)  0.59 (.05)  0.66 (.03)  0.67 (.06)  0.68 (.02)  0.73 (.02)  0.75 (.03)  0.78 (.04)
Credit-g    TabPFN    -           0.58 (.08)  0.59 (.03)  0.64 (.06)  0.69 (.07)  0.70 (.07)  0.72 (.06)  0.75 (.04)  0.75 (.02)  0.75 (.03)
Credit-g    TabLLM    0.53 (.05)  0.69 (.04)  0.66 (.04)  0.66 (.05)  0.72 (.06)  0.70 (.07)  0.71 (.07)  0.72 (.03)  0.72 (.02)  0.70 (.02)
Diabetes    XGBoost   -           0.50 (.00)  0.59 (.16)  0.72 (.07)  0.69 (.08)  0.73 (.05)  0.78 (.05)  0.80 (.03)  0.80 (.01)  0.84 (.03)
Diabetes    TabPFN    -           0.61 (.13)  0.67 (.11)  0.71 (.07)  0.77 (.03)  0.82 (.03)  0.83 (.03)  0.83 (.03)  0.81 (.02)  0.81 (.03)
Diabetes    TabLLM    0.68 (.06)  0.61 (.09)  0.63 (.08)  0.69 (.07)  0.68 (.04)  0.73 (.03)  0.79 (.04)  0.78 (.02)  0.78 (.04)  0.80 (.04)
Heart       XGBoost   -           0.50 (.00)  0.55 (.14)  0.84 (.07)  0.88 (.04)  0.91 (.01)  0.91 (.01)  0.90 (.01)  0.92 (.01)  0.94 (.01)
Heart       TabPFN    -           0.84 (.06)  0.88 (.05)  0.87 (.06)  0.91 (.02)  0.92 (.02)  0.92 (.02)  0.92 (.01)  0.92 (.02)  0.92 (.02)
Heart       TabLLM    0.54 (.04)  0.76 (.14)  0.83 (.05)  0.87 (.04)  0.87 (.06)  0.91 (.01)  0.90 (.01)  0.92 (.01)  0.92 (.01)  0.94 (.01)
Income      XGBoost   -           0.50 (.00)  0.59 (.06)  0.77 (.02)  0.79 (.03)  0.82 (.02)  0.84 (.01)  0.87 (.01)  0.88 (.00)  0.93 (.00)
Income      TabPFN    -           0.73 (.08)  0.71 (.09)  0.76 (.09)  0.80 (.04)  0.82 (.04)  0.84 (.01)  0.86 (.01)  0.87 (.01)  0.89 (.00)
Income      TabLLM    0.84 (.00)  0.84 (.01)  0.84 (.02)  0.84 (.04)  0.84 (.01)  0.84 (.02)  0.86 (.01)  0.87 (.00)  0.89 (.01)  0.92 (.00)
Jungle      XGBoost   -           0.50 (.00)  0.58 (.07)  0.72 (.05)  0.78 (.03)  0.81 (.02)  0.84 (.02)  0.87 (.01)  0.91 (.01)  0.98 (.00)
Jungle      TabPFN    -           0.65 (.08)  0.72 (.04)  0.71 (.07)  0.78 (.02)  0.81 (.01)  0.84 (.01)  0.88 (.01)  0.91 (.00)  0.93 (.00)
Jungle      TabLLM    0.60 (.00)  0.64 (.01)  0.64 (.02)  0.65 (.03)  0.71 (.02)  0.78 (.02)  0.81 (.02)  0.84 (.01)  0.89 (.01)  1.00 †
† These experiments were only performed for a single run due to runtime limitations of TabLLM on the full dataset.

Table 2: Five highest and lowest weighted features for zero-shot TabLLM and logistic regression (LR) trained on
all data for Income. Both models show very similar trends for important features.

Feature                      TabLLM rank  TabLLM weight  LR rank  LR weight
capital gain                 1            5.310          2        2.393
education Masters            2            4.623          6        1.455
education Doctorate          3            3.410          4        2.066
education Bachelors          4            2.995          7        1.135
education Prof-school        5            2.949          5        1.900
occupation Priv-house-serv   102          -2.840         105      -1.909
education 12th               103          -3.178         79       -0.480
education Preschool          104          -3.520         106      -2.385
occupation Farming-fishing   105          -3.853         98       -0.982
workclass Without-pay        106          -4.423         69       -0.174

Introspecting TabLLM: What Prior Knowledge Does it Use? Given the strong zero-shot performance of
TabLLM on the Income dataset, we next sought to understand which features it based its predictions on in order to
shed light on the prior knowledge used by the LLM. To determine the feature importance for TabLLM, we fit a LR
model to the zero-shot prediction using the original features as covariates as described in Sec. 6 in the
Supplement. Highly weighted features (see Table 2) for zero-shot TabLLM include the individual’s occupation (with e.g.,
‘Farming-fishing’ having a large negative weight), highest education level (‘Masters’ and ‘Doctorate’ have
positive weights; ‘Preschool’ grade has a negative weight), and workclass (‘Without-pay’ has a negative weight). TabLLM
also seems to be able to correctly interpret the numerically encoded capital gain value. For comparison, we also show
the feature weights for a LR model trained on all data. We see a strong concordance between both models; TabLLM’s
top five features are all among the top seven of the LR model. However, TabLLM scores the highest education
Table 3: Test AUC on the healthcare claims dataset. TabLLM outperforms logistic regression (LR) for up to 64 and
LightGBM for up to 256 training examples on End of Life (EoL). Standard deviations (in parentheses) are given across
five random seeds.

                      Number of Shots
Dataset   Method      0       16          64          256         1,024       4,096       16,384      all
EoL       LR          -       0.65 (.07)  0.77 (.02)  0.80 (.02)  0.83 (.01)  0.83 (.01)  0.84 (.01)  0.84 (.01)
EoL       LightGBM    -       0.50 (.00)  0.71 (.01)  0.76 (.02)  0.80 (.01)  0.82 (.01)  0.83 (.01)  0.82 †
EoL       TabLLM      0.70    0.74        0.78        0.78        0.79        0.81        0.81        -
Surgery   LR          -       0.72 (.04)  0.75 (.05)  0.77 (.01)  0.79 (.01)  0.80 (.01)  0.80 (.00)  0.81 (.00)
Surgery   LightGBM    -       0.50 (.00)  0.73 (.02)  0.77 (.01)  0.79 (.01)  0.80 (.00)  0.81 (.01)  0.82 †
Surgery   TabLLM      0.67    0.73        0.72        0.73        0.75        0.78        0.79        -
LoH       LR          -       0.72 (.04)  0.76 (.03)  0.80 (.01)  0.82 (.01)  0.83 (.01)  0.83 (.01)  0.84 (.01)
LoH       LightGBM    -       0.50 (.00)  0.72 (.02)  0.76 (.03)  0.81 (.01)  0.83 (.00)  0.83 (.01)  0.85 †
LoH       TabLLM      0.71    0.73        0.73        0.76        0.78        0.81        0.82        -
† These experiments were only performed for a single run due to runtime limitations on the full dataset.
Introspecting TabLLM: What Prior Knowledge Does it Use? We also performed a feature analysis to study the strong zero-shot performance on EoL. However, we did not compare to a LR model trained on all data due to the vast amount of features and potential collinearities in the data. Instead, we compared to the relative risk (RR) with a 95% confidence interval (CI). Table 4 shows the five highest and lowest weighted features of zero-shot TabLLM and their relative risk for EoL. All top five features have a significantly increased relative risk, demonstrating the capabilities of TabLLM to identify relevant features even without any training examples. For the five lowest weighted features, only ‘sex female’ has a significantly decreased risk. A list of 100 features is given in Table 17 in the Supplement.
Table 4: Five highest and lowest weighted features for zero-shot TabLLM for EoL and their relative risk (RR) with confidence intervals (CI). The top five features show a significant increase of the relative risk.
Feature                                TabLLM    RR (95% CI)
atrial fibrillation                    0.633     2.72 (2.51-2.95)
atherosclerosis of coronary art...     0.530     2.10 (1.94-2.27)
atherosclerosis of aorta               0.473     1.99 (1.81-2.19)
exudative age-related macular d...     0.452     2.38 (2.06-2.75)
sex male                               0.442     1.23 (1.14-1.33)
open angle with borderline intr...     -0.338    1.20 (1.03-1.40)
primary localized osteoarthrosi...     -0.366    1.08 (0.82-1.43)
localized, primary osteoarthritis      -0.393    1.23 (1.07-1.40)
sex female                             -0.441    0.81 (0.75-0.88)
open-angle glaucoma - borderline       -0.495    0.97 (0.85-1.10)
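To make the relative risk column easier to interpret, the following is a minimal sketch of how an RR with a 95% CI can be computed from a 2x2 table of counts. The text does not state which interval method was used; the log-normal (Katz) approximation below is an assumption, and the counts are hypothetical rather than taken from the EoL cohort.

```python
import math

def relative_risk(exposed_events, exposed_total,
                  unexposed_events, unexposed_total, z=1.96):
    """Return (RR, ci_low, ci_high) using the log-normal (Katz) approximation.

    Assumption: this is one common way to obtain a 95% CI; the paper does
    not say which method it used.
    """
    risk_exposed = exposed_events / exposed_total
    risk_unexposed = unexposed_events / unexposed_total
    rr = risk_exposed / risk_unexposed
    # Standard error of log(RR) for two independent binomial samples.
    se_log_rr = math.sqrt(
        1 / exposed_events - 1 / exposed_total
        + 1 / unexposed_events - 1 / unexposed_total
    )
    ci_low = math.exp(math.log(rr) - z * se_log_rr)
    ci_high = math.exp(math.log(rr) + z * se_log_rr)
    return rr, ci_low, ci_high

# Hypothetical counts: 60 of 1,000 patients with the feature had the outcome,
# versus 30 of 1,000 patients without it.
rr, ci_low, ci_high = relative_risk(60, 1000, 30, 1000)
# The risk change is called significant when the interval excludes 1.
significant = ci_low > 1.0 or ci_high < 1.0
```

Under this reading, a row such as ‘sex female’ with an interval entirely below 1 indicates a significantly decreased risk, while an interval that spans 1 does not.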
ever, all serializations with less information came close to the best serialization for 256 (tabular datasets) to 1024 training examples (insurance dataset). Hence, when hundreds of training examples are available, the input format proved less relevant, and the LLM was able to adapt (Jin et al., 2022). Like our results, Bertsimas et al. (2022) found that natural language representation of healthcare data gave little-to-no improvement (in their different setup) compared to a more straightforward serialization in the medium-shot setting. Our findings also support prior work showing that irrelevant and even misleading inputs can lead to similar few-shot performance (Min et al., 2022; Webson and Pavlick, 2022; Reynolds and McDonell, 2021). For instance, permuting the column names only showed a difference for up to 16 training examples (see Figure 2).
We found clear performance improvements for TabLLM when using additional training examples. It often outperformed strong baseline models in the very-few-shot setting. This emphasizes the value of leveraging LLMs when only little labeled data is available. Surprisingly, Dinh et al. (2022) could not confirm these findings for GPT-3. On two binary classification tasks a fine-tuned GPT-3 model performed worse than LR for up to 250 training examples.
Our results indicate that the sample efficiency of TabLLM is highly task-dependent. The performance on Blood, Credit-g, Diabetes, and Heart is worse than the performance on Income and Car. Most features of the latter datasets have semantically meaningful textual values likely boosting TabLLM’s performance. However, TabLLM also achieved reasonable results on numerical datasets (Blood, California, Diabetes, and Jungle). In addition, Diabetes and Heart have somewhat specialized feature names and values, such as “ventricular hypertrophy” and “Plasma glucose concentration,” whereas Income and Car are more general-domain knowledge. This indicates that T0, the language model we used in TabLLM, seems to have less prior knowledge about medicine than about general-domain concepts. Indeed, the training tasks for T0 do not contain any tasks with medical data (Sanh et al., 2022).
Our findings on the three insurance claims datasets partly reinforce this hypothesis. Zero-shot performance depends on the concept selection strategy and the LLM seems to have little knowledge about medical procedures. Prior work has shown that medical-domain-specific language models, such as PubMedBERT, and general-domain models with medical data in their training sets, such as GPT-3, perform well at downstream prediction tasks on medical data even with fairly few samples (Gu et al., 2021; Agrawal et al., 2022). Substituting T0 with one of these models in TabLLM to study medical prediction tasks is an interesting direction for future work.
Our results on the public Blood, Diabetes, and Heart datasets are very similar to our results for EoL, Surgery, and LoH, which are practically relevant but rely on private data. Except for the zero-shot and very few-shot regime, other baselines tend to outperform TabLLM on these datasets. This suggests that Blood, Diabetes, and Heart datasets could be good proxies for the community to further study medical-domain tabular classification with LLMs without needing access to large private datasets.
7 LIMITATIONS AND CONCLUSION
TabLLM has a much larger computational footprint compared to traditional algorithms. It still requires fairly large GPUs to fine-tune the LLM, and inference with T0 requires far more FLOPs than inference with XGBoost or LR. Our results indicate that TabLLM trades off this computational efficiency for improved sample efficiency. Further, as we saw with the three healthcare claims tasks, performance may suffer if the dense feature set for a given row cannot fit within the token limit for a given LLM. Since the gains from TabLLM stem from its ability to use existing domain knowledge, the semantics of the column names and feature values need to have been observed during the LLM’s original pre-training. For example, if the columns represent genes, we may not expect a vanilla LLM to have strong representations for gene names. Finally, due to dataset shift, the pre-training data for a given LLM may not necessarily reflect the settings under which a given table was aggregated, e.g., due to inflation and a changing value of money (see Sec. 5 in the Supplement).
Despite these limitations, our empirical results show that TabLLM enjoys strong performance at tabular classification, outperforming state-of-the-art baseline algorithms like XGBoost and SAINT by over 5 AUC points in the very-few-shot regime, all while staying competitive with these methods when a large number of samples is available.
Currently, TabLLM does not use any unlabeled data; a fruitful direction could involve leveraging unlabeled data, e.g., using the techniques from Lang et al. (2022) to combine the few-shot performance of TabLLM with the ultimate performance of tree-based baselines by co-training the models together. Other improvements could include more faithful LLM serializations as well as numeric-specific encoding methods (Gorishniy et al., 2022).
8 SOCIETAL IMPACT
Similar to other ML systems that were trained on historic data, LLMs are prone to replicate existing biases and stereotypes. Hence, when applying TabLLM for sensitive tasks such as income or a health trajectory, predictions should be considered with great care and further analyses (e.g., for subgroups) are mandatory. In addition, LLMs require a lot of computing resources. This bears the risk of creating an exclusive research environment. Also, the environmental impact of LLMs can be significant.
Harari, A. and Katz, G. (2022). Few-shot tabular data enrichment using fine-tuned transformer architectures. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1577–1591.
Hollmann, N., Müller, S., Eggensperger, K., and Hutter, F. (2022). TabPFN: A transformer that solves small tabular classification problems in a second. arXiv preprint arXiv:2207.01848.
Horng, S. (2022). Machine Learning Core.
Huang, X., Khetan, A., Cvitkovic, M., and Karnin, Z. (2020). TabTransformer: Tabular Data Modeling Using Contextual Embeddings. Technical Report arXiv:2012.06678, arXiv.
Jin, W., Cheng, Y., Shen, Y., Chen, W., and Ren, X. (2022). A good prompt is worth millions of parameters? low-resource prompt-based learning for vision-language models. In ACL 2022.
Kadra, A., Lindauer, M., Hutter, F., and Grabocka, J. (2021). Well-tuned simple nets excel on tabular datasets. Advances in Neural Information Processing Systems, 34:23928–23941.
Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., and Liu, T.-Y. (2017). LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
Kodialam, R., Boiarsky, R., Lim, J., Sai, A., Dixit, N., and Sontag, D. (2021). Deep contextual clinical prediction with reverse distillation. Proceedings of the AAAI Conference on Artificial Intelligence, 35(1):249–258.
Kontschieder, P., Fiterau, M., Criminisi, A., and Bulo, S. R. (2015). Deep Neural Decision Forests. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 1467–1475, Santiago, Chile. IEEE.
Lang, H., Agrawal, M. N., Kim, Y., and Sontag, D. (2022). Co-training improves prompt-based learning for large language models. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., and Sabato, S., editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 11985–12003. PMLR.
Levin, R., Cherepanova, V., Schwarzschild, A., Bansal, A., Bruss, C. B., Goldstein, T., Wilson, A. G., and Goldblum, M. (2022). Transfer Learning with Deep Tabular Models. Technical Report arXiv:2206.15306, arXiv.
Li, Y., Li, J., Suhara, Y., Doan, A., and Tan, W.-C. (2020). Deep entity matching with pre-trained language models. Proc. VLDB Endow., 14(1):50–60.
Liu, H., Tam, D., Muqeeth, M., Mohta, J., Huang, T., Bansal, M., and Raffel, C. (2022). Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning. arXiv:2205.05638 [cs].
Liu, X., Zheng, Y., Du, Z., Ding, M., Qian, Y., Yang, Z., and Tang, J. (2021). GPT Understands, Too. Technical Report arXiv:2103.10385, arXiv.
Min, S., Lyu, X., Holtzman, A., Artetxe, M., Lewis, M., Hajishirzi, H., and Zettlemoyer, L. (2022). Rethinking the role of demonstrations: What makes in-context learning work? arXiv preprint arXiv:2202.12837.
Narayan, A., Chami, I., Orr, L., and Ré, C. (2022). Can Foundation Models Wrangle Your Data? Technical Report arXiv:2205.09911, arXiv.
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P., Leike, J., and Lowe, R. (2022). Training language models to follow instructions with human feedback. arXiv:2203.02155 [cs].
Perez, E., Kiela, D., and Cho, K. (2021). True few-shot learning with language models. Advances in Neural Information Processing Systems, 34:11054–11070.
Popov, S., Morozov, S., and Babenko, A. (2020). Neural oblivious decision ensembles for deep learning on tabular data. In International Conference on Learning Representations.
Reynolds, L. and McDonell, K. (2021). Prompt programming for large language models: Beyond the few-shot paradigm. In Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems, pages 1–7.
Sahakyan, M., Aung, Z., and Rahwan, T. (2021). Explainable artificial intelligence for tabular data: A survey. IEEE Access, 9:135392–135422.
Sanh, V., Webson, A., Raffel, C., Bach, S., Sutawika, L., Alyafeai, Z., Chaffin, A., Stiegler, A., Raja, A., Dey, M., Bari, M. S., Xu, C., Thakker, U., Sharma, S. S., Szczechla, E., Kim, T., Chhablani, G., Nayak, N., Datta, D., Chang, J., Jiang, M. T.-J., Wang, H., Manica, M., Shen, S., Yong, Z. X., Pandey, H., Bawden, R., Wang, T., Neeraj, T., Rozen, J., Sharma, A., Santilli, A., Fevry, T., Fries, J. A., Teehan, R., Scao, T. L., Biderman, S., Gao, L., Wolf, T., and Rush, A. M. (2022). Multitask prompted training enables zero-shot task generalization. In International Conference on Learning Representations.
Schick, T. and Schütze, H. (2021). Exploiting Cloze-Questions for Few-Shot Text Classification and Natural Language Inference. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 255–269, Online. Association for Computational Linguistics.
Supplementary Materials:
TabLLM: Few-shot Classification of Tabular Data with Large Language
Models
We systematically identified datasets for classification from Kadra et al. (2021), Grinsztajn et al. (2022), Borisov et al.
(2022a), and from Kaggle. Each dataset was separated into 80/20 train-test splits. The 𝑘 labeled examples D 𝑘 were
sampled in a class-balanced manner from the training set. We performed experiments for different numbers of training
examples (shots) ranging from 0 to 512 and the entire dataset (all). To characterize the sensitivity of models to the choice of
𝑘 labeled examples, we repeated the dataset splitting and sampling procedures for five different seeds and report the mean
AUC and standard deviation (SD) across seeds. No hyperparameter tuning was conducted for TabLLM; for baselines,
internal cross validation was conducted to choose optimal hyperparameters, and the model was then retrained on all data.
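The sampling protocol described above can be sketched as follows. This is an illustrative sketch, not the released code: the function names are hypothetical, sampling is done without replacement, and the sample standard deviation is used, none of which the text specifies.

```python
import random
import statistics
from collections import defaultdict

def sample_balanced_shots(labels, k, seed):
    """Pick k training indices with (near-)equal counts per class.

    Assumption: sampling without replacement; any remainder of k that does
    not divide evenly is given to the first classes in sorted order.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    classes = sorted(by_class)
    per_class, remainder = divmod(k, len(classes))
    chosen = []
    for i, cls in enumerate(classes):
        n = per_class + (1 if i < remainder else 0)
        chosen.extend(rng.sample(by_class[cls], n))
    rng.shuffle(chosen)
    return chosen

def mean_and_sd(scores):
    """Aggregate per-seed AUC values into a mean and a sample SD."""
    return statistics.mean(scores), statistics.stdev(scores)

# Toy training labels with an 80/20 class imbalance.
labels = [0] * 80 + [1] * 20
shots = sample_balanced_shots(labels, k=8, seed=0)
# Hypothetical AUC values from repeated runs with different seeds.
mean_auc, sd_auc = mean_and_sd([0.80, 0.82, 0.84])
```

Repeating `sample_balanced_shots` with a different `seed` per run and aggregating the resulting test AUCs with `mean_and_sd` mirrors the reporting of mean and SD across seeds.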
We analyzed the following datasets:
• Bank (Kadra et al., 2021) contains information on a direct marketing campaign from a Portuguese banking institution
(Moro et al., 2014). The goal is to predict whether a customer subscribed to a term deposit or not. It consists of 45,211
rows and 16 features; 5,289 labels are positive.
• Blood (Kadra et al., 2021) consists of data of a blood transfusion service from Taiwan (Yeh et al., 2009). It contains
4 attributes of 748 donors, and the label indicates whether they returned for another donation (178 positive).
• California (Grinsztajn et al., 2022) contains eight attributes of 20,640 districts in California and the goal is to predict
the median house value in each district (Pace and Barry, 1997). Analogously to Grinsztajn et al. (2022), we created a
balanced classification task by predicting whether the house value is below or above the median (10,317 positive).
• Car (Kadra et al., 2021) has entries for different cars that are characterized by six attributes; the task is a multiclass
classification problem evaluating the state of each car. The dataset contains 1,728 rows, and the four classes have a
distribution of 1210, 384, 65, and 69 examples.
• Credit-g (Kadra et al., 2021) describes 1,000 people from Germany that want to receive a credit using 20 attributes.
The label is to predict whether they have good or bad risk; 700 are classified as good.
• Diabetes (from Kaggle; see footnote 2) was collected by the National Institute of Diabetes and Digestive and Kidney Diseases
(Smith et al., 1988) and contains 768 rows, each corresponding to a woman of Pima Indian heritage, with eight clinical
variables. The task is binary classification of whether a person has diabetes; 268 cases are positive.
• Heart (from Kaggle; see footnote 3) contains data from four different hospitals (Detrano et al., 1989). Each row contains 11 clinical
variables of a patient. The task is binary classification of coronary artery disease. Of the 918 patients, 508 are positive.
• Income (Kadra et al., 2021; Borisov et al., 2022a), also called Adult, contains rows for 48,842 individuals with twelve
attributes collected in the 1994 U.S. Census (Kohavi et al., 1996; Dua and Graff, 2017). The task is to predict whether
each person has an annual income over $50,000. The dataset has 11,687 positive labels.
• Jungle (Kadra et al., 2021) is a collection of 44,819 end game positions of Jungle Chess (van Rijn and Vis, 2014).
Each game is described with 6 attributes and the goal is to predict whether the white player will win (23,062 positive).
2 https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database (06/28/2022)
3 https://www.kaggle.com/fedesoriano/heart-failure-prediction (06/28/2022)
Table 5: Evaluation of different concept selection methods for the healthcare claims dataset in the zero-shot setting. The
last two rows show the performance when concepts were selected based on the lasso path of logistic regression weights,
which violates the zero-shot assumption (*).
Method                                         EoL    Surgery  LoH
Age, sex, and race                             0.59   0.57     0.65
Least frequent conditions                      0.57   0.64     0.67
Least frequent procedures                      0.59   0.59     0.65
Least frequent concepts (cond. + proc.)        0.55   0.55     0.66
Most frequent conditions                       0.67   0.66     0.69
Most frequent procedures                       0.59   0.58     0.65
Most frequent concepts (cond. + proc.)         0.62   0.61     0.65
Oldest conditions                              0.65   0.66     0.69
Oldest procedures                              0.59   0.58     0.65
Oldest concepts (cond. + proc.)                0.60   0.60     0.67
Most recent conditions                         0.65   0.66     0.69
Most recent procedures                         0.55   0.59     0.65
Most recent concepts (cond. + proc.)           0.59   0.60     0.66
Most relevant concepts based on 256 shots*     0.60   0.58     0.69
Most relevant concepts based on 4096 shots*    0.65   0.57     0.68
The de-identified health claims data set was provided by a large U.S. health insurer. The data is stored in the Observational
Medical Outcomes Partnership (OMOP) Common Data Model version 6.0 (Hripcsak et al., 2015). It contains an entry for
every encounter a patient has with the health system. Each entry is associated with a date, a visit type (5 total), a medical
specialty (216 total), present conditions (14,095 total), and performed procedures (21,184 total). We additionally used the
static concepts age, sex, and race at time of prediction.
We studied three different tasks on this dataset with distinct cohorts. For all tasks, we used a six-month outcome period
and a gap of three months between time of prediction and the outcome window to prevent data leakage. We required
patients to have at least one medical visit and to have been actively enrolled in an insurance plan for at least 95% of the
last year and the six-month outcome window. We used 10% of the data as a holdout set and sampled the 𝑘 balanced shots
with replacement from the remaining data. We chose larger shot sizes, as the tasks are more complex. We only ran the
experiments for a single seed due to runtime limitations. We considered the following tasks:
• End of Life (EoL): We predicted the mortality of all patients older than 70 years. This is often used as a surrogate
task. For instance, it can improve initiation of palliative care (Avati et al., 2018) and can help to inform close relatives
to reduce family distress (Curtis et al., 2016). The final cohort contained 94,972 individuals; 2,424 were positive.
• Surgical Procedure (Surgery): We predicted the need for any surgical procedure. The task is important in determin-
ing health care needs and estimating costs. The cohort included 620,382 people of which 243,349 were positive.
• Likelihood of Hospitalization (LoH): We also predicted the likelihood of being hospitalized. Again, this information
can help identify needs and estimate costs. The cohort included 612,656 individuals; 22,427 were positive.
Each serialization begins with the patient’s age, sex, and race. For each concept entry that we included, we also added
information of the associated visit. This included its date, the type of doctor the patient saw (e.g., dermatology), if an
outpatient visit or length of hospitalization if an inpatient visit, and the primary complaint of the associated visit. If a visit
was already added to the serialization, we just added the concept to the existing visit entry. For the List Template and
Text Template serializations approximately 40 medical concepts could be added until the token limit of T0 was reached.
To explore the effect of less information in the input, we also tested the List Short serialization, where we added only
10 medical concepts to the serialization. Hence, the token limit of the LLM was not fully used. Examples of the List
Template, Text Template and List Permuted Names serializations illustrating this structure are given in Sec. 9.1 at the end
of the Supplement.
Table 6: Five examples of different concept names for conditions. The first column shows the original name in the
healthcare claims dataset using SNOMED codes. A dash illustrates that no mapping was available.
Table 7: Evaluation of alternative condition concept names. International Classification of Diseases (ICD), MEDCIN and
the Consumer Health Vocabulary (CHV) are alternative medical terminologies. We also tested shortening, simplifying,
and rewriting concepts as medical jargon via GPT-3. None of the alternative concept names showed consistent
performance improvement.
Method                               EoL    Surgery  LoH
Original concept names (SNOMED)      0.67   0.66     0.69
Map to ICD concept names             0.67   0.67     0.68
Map to MEDCIN concept names          0.67   0.66     0.69
Map to CHV concept names             0.66   0.66     0.69
Shorten long concepts with GPT-3     0.67   0.66     0.69
Simplify concepts with GPT-3         0.67   0.66     0.70
Medical jargon with GPT-3            0.68   0.67     0.70
For the healthcare claims dataset, the number of recorded medical concepts per patient usually exceeded T0’s token limit.
Hence, we had to determine which concepts of a patient should be included during the serialization. We evaluated four
different concept selection strategies in the zero-shot setting for the List Template serialization: choosing the least frequent,
most frequent, oldest, or most recent concepts per patient. We tested these for all concepts (conditions and procedures),
only conditions, or only procedures. For each patient, we ranked all concepts according to one of the above methods and
added concepts until the token limit of the LLM was reached. For least frequent and most frequent, we used the earliest
visits associated with the selected medical concepts. We used a simple serialization that only contained the patient’s age,
sex, and race as a baseline for our experiments. We also tested concept selection based on the lasso path of a logistic
regression model determined on 256 and 4,096 shots. This violates the few-shot assumption, but we considered it an
interesting comparison with the other strategies that select concepts per patient.
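The per-patient selection procedure can be sketched as a ranking followed by a greedy fill of the token budget. This is a simplified sketch under stated assumptions: the function name is hypothetical, tokens are approximated by whitespace-separated words rather than the T0 tokenizer, and the visit information that the actual serialization attaches to each concept is ignored.

```python
from collections import Counter

def select_most_frequent(concepts, token_budget,
                         count_tokens=lambda s: len(s.split())):
    """Rank a patient's concepts by frequency and add them until the budget is hit.

    Assumption: `count_tokens` is a crude word count standing in for the
    real tokenizer of the LLM.
    """
    counts = Counter(concepts)
    # Earliest position of each concept, used to break frequency ties
    # deterministically (iterating in reverse leaves the smallest index).
    first_seen = {c: i for i, c in reversed(list(enumerate(concepts)))}
    ranked = sorted(counts, key=lambda c: (-counts[c], first_seen[c]))
    selected, used = [], 0
    for concept in ranked:
        cost = count_tokens(concept)
        if used + cost > token_budget:
            break  # stop as soon as the next ranked concept does not fit
        selected.append(concept)
        used += cost
    return selected

# Hypothetical condition history of one patient, one entry per recorded occurrence.
history = [
    "atrial fibrillation", "hypertension", "atrial fibrillation",
    "type 2 diabetes", "hypertension", "atrial fibrillation",
]
kept = select_most_frequent(history, token_budget=3)
```

Swapping the sort key (for example, ascending counts or visit dates) would give the least frequent, oldest, or most recent strategies in the same framework.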
The results are given in Table 5. Using the most frequent conditions per patient matched or outperformed all other
selection strategies on all three tasks. Frequent conditions might be useful since they reveal the most relevant condition of a patient. Also,
they are usually more common, so the LLM is likely to have more prior knowledge about them. Across all strategies, conditions were usually
more useful than procedures. This suggests that the LLM has more prior knowledge of conditions. Interestingly, selecting the most frequent
conditions is even better than using the concept weights of a LR model trained on 256 or 4,096 shots.
The healthcare claims dataset used SNOMED concept names for conditions and SNOMED, Healthcare Common Proce-
dure Coding System (HCPCS), International Classification of Diseases (ICD), and Current Procedural Terminology (CPT)
concept names for procedures. We tested different concept names to assess their effect on the performance. We used a zero-
shot setting with the List Template serialization and the most frequent conditions per patient as the best selection strategy
determined as described above. Since the selection method only considered conditions, we only used different condition
names. We considered three alternative vocabularies in the Unified Medical Language System (UMLS) that covered at
least 20% of the condition concepts and offered different names. ICD is a very common medical terminology offering
alternative names for conditions. MEDCIN and the Consumer Health Vocabulary (CHV) offer concept names specifically
targeted at clinicians or consumers. We mapped the concepts via their UMLS identifiers. For ICD, we were able to map
7,372, for MEDCIN 9,370 and for CHV 3,700 of the 14,095 condition concepts. Alternatively, we explored concept names
generated by GPT-3 (Brown et al., 2020). To do so, we used the publicly accessible GPT-3 API (engine text-davinci-002)
(Ouyang et al., 2022). We considered shortened names for concepts with more than sixty characters (“Rewrite this
medical condition with at most six words.”), simplified concept names (“Write this medical condition in a short form in
lay language.”) and medical jargon (“Write this medical condition in medical jargon.”). For the simplified names and the
medical jargon, we provided GPT-3 with a single example for in-context learning. Examples for all alternative concept
names except the shortening are given in Table 6.
The results of this experiment are given in Table 7. We used the most frequent conditions per patient as the concept selection method,
since this was the best strategy in the previous experiment, and varied only the concept names. We found no
consistent performance difference even though there were considerable differences in the concept names (see Table 6).
Surprisingly, TabLLM performs slightly better (by 0.01 AUC) for EoL and Surgery using medical jargon to encode concepts.
The TabLLM training time on the Income dataset for 64 training examples and 30 epochs with a batch size of 8 was less
than 3 minutes. The average inference time for the test set of 10,000 examples with a batch size of 16 was 2 minutes,
around 12 ms per example. The training and inference times for the other public datasets were comparable. Due to the
larger size of the healthcare claims dataset, it took nearly 4 minutes to train for 64 examples and 10 epochs for EoL and
was similar for the other two tasks. Inference took approximately 14 minutes for 10,000 examples with a batch size of 16,
i.e. around 84 ms per example. The training times scaled linearly in the shot size.
We used the scikit-learn framework to perform cross-validation and parameter tuning for the LR and the tree-based models
(Pedregosa et al., 2011). For LR we tried common parameters for the penalty term and regularization strength (see Table
8). We used the same LR parameters for the public tabular datasets and the healthcare claims dataset. For the tree-based
models we adopted the hyperparameter ranges from Borisov et al. (2022a) and Grinsztajn et al. (2022). We discretized the
parameter ranges and performed a complete grid search (see Tables 9 and 10).
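As an illustration of what a complete grid search over discretized ranges involves, the sketch below enumerates every combination of a small grid and keeps the best-scoring configuration. The grid values and the scoring function are hypothetical placeholders; the grids actually used are those referenced in Tables 8 to 10, and in practice scikit-learn's `GridSearchCV` performs this enumeration together with cross-validation.

```python
import itertools

# Hypothetical discretized ranges, for illustration only.
param_grid = {
    "penalty": ["l1", "l2"],
    "C": [0.01, 0.1, 1.0, 10.0],
}

def iter_grid(grid):
    """Yield every combination of the discretized parameter values."""
    keys = sorted(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

def grid_search(grid, score_fn):
    """Return (best_config, best_score) under score_fn, higher is better."""
    scored = ((cfg, score_fn(cfg)) for cfg in iter_grid(grid))
    return max(scored, key=lambda pair: pair[1])

def toy_score(cfg):
    # Stand-in for a mean cross-validated AUC; a real run would train and
    # evaluate a model for each configuration instead.
    return -abs(cfg["C"] - 1.0) + (0.05 if cfg["penalty"] == "l1" else 0.0)

best_cfg, best_score = grid_search(param_grid, toy_score)
```

The cost grows with the product of the range sizes, which is why the ranges are discretized before the search.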
For the neural baselines SAINT, TabNet, and NODE, we used the setup and suggested hyperparameter ranges in Borisov
et al. (2022a). We modified the open-source implementation of these methods (see footnote 4) to support ingestion of the nine public
tabular datasets. We used the hyperparameter-tuning framework Optuna (see footnote 5) and selected parameters that maximize
AUC-ROC across folds. Note that for the 4-shot setting of the Car dataset, AUC may not be defined if the selected validation
set includes only one label; in this case we used accuracy as our validation metric but report AUC-ROC on the holdout test
set. Each neural baseline model was run for 20 trials with Optuna and trained for 100 epochs per hyperparameter setting.
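The fallback from AUC to accuracy when a validation fold contains a single label can be sketched as follows. This is a simplified binary illustration with hypothetical function names; the Car task itself is multiclass, and the 0.5 decision threshold is an assumption.

```python
def roc_auc(y_true, y_score):
    """Binary AUC as the probability that a positive outranks a negative.

    Ties between a positive and a negative score count as 0.5.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC is undefined when only one class is present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def validation_metric(y_true, y_score, threshold=0.5):
    """Use AUC when it is defined, otherwise fall back to accuracy."""
    if len(set(y_true)) < 2:
        preds = [int(s >= threshold) for s in y_score]
        accuracy = sum(p == y for p, y in zip(preds, y_true)) / len(y_true)
        return "accuracy", accuracy
    return "auc", roc_auc(y_true, y_score)

# A fold with both labels uses AUC; a fold with one label uses accuracy.
mixed_fold = validation_metric([0, 1, 0, 1], [0.1, 0.9, 0.4, 0.35])
single_label_fold = validation_metric([1, 1, 1], [0.9, 0.2, 0.7])
```

Because the two metrics are on different scales, such a fallback is only meaningful for model selection within a fold, which matches reporting AUC-ROC on the holdout test set.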
To assess whether our baseline results match results reported in the literature, we report studies that used the same models.
Bank Dataset. Kadra et al. (2021) trained an XGBoost, TabNet, and NODE baseline on this dataset and achieved a
balanced accuracy of 72.7, 70.6, and 74.6. Our experiments for a set of 512 balanced training examples (512 shots) show
a better performance for XGBoost than NODE.
Blood Dataset. The XGBoost, TabNet, and NODE baselines trained in Kadra et al. (2021) achieved a balanced accuracy
of 62.3, 64.3, and 50. Our results for a set of 512 balanced training examples (512 shots) also show a better performance for
TabNet than XGBoost. However, in our experiments NODE performs better than XGBoost and not worse.
California Dataset. Borisov et al. (2022a) trained a Linear Model, XGBoost, LightGBM, TabNet, NODE, and SAINT
baseline on a regression version of the dataset. They achieved a mean squared error of 0.53, 0.21, 0.20, 0.35, 0.28, and
0.23. Our experiments for a set of 512 balanced training examples (512 shots) show a better performance for XGBoost
than LightGBM and the same performance for TabNet and NODE. Also, our linear model performs much better which is
probably due to more extensive hyperparameter tuning.
Car Dataset. The XGBoost, TabNet, and NODE models in Kadra et al. (2021) showed a balanced accuracy of 92.4,
98.7, and 46.1. In our experiments, XGBoost and TabNet performed very similarly for many training examples, and NODE
was only slightly inferior.
Credit-g Dataset. The XGBoost, TabNet, and NODE baselines trained in Kadra et al. (2021) achieved a balanced accu-
racy of 68.9, 61.2, and 73.1. Our AUC results cannot easily be compared but our experiments for 512 balanced training
examples (512 shots) follow the same trend.
Diabetes Dataset. Hasan et al. (2020) reported an AUC of 0.828 (0.030) for XGBoost on the diabetes dataset, which
matches our findings. With additional feature selection and preprocessing methods they reached an AUC of 0.946 (0.020)
with XGBoost, but this was out of the scope of our work. XGBoost was the most performant model that they included in
their experiments.
Heart Dataset. Muhammad et al. (2020) used only the 303 instances from the Cleveland cohort, while we combined all
four sub-cohorts. They achieved an AUC of 0.923 with LR, which is close to our results on all sub-cohorts. They also
tested several models that outperformed LR.
Income Dataset. Many studies used the Income or Adult dataset. The review by Borisov et al. (2022a) included several of
our baselines. They reported an AUC of 0.854 (0.002) for a linear model, 0.928 (0.001) for XGBoost, 0.928 (0.001) for
LightGBM, 0.916 (0.002) for SAINT, 0.911 (0.001) for TabNet, and 0.911 (0.002) for NODE. These are in accordance
with our results. We attribute the better performance of our LR model to more extensive parameter tuning.
Jungle Dataset. The XGBoost and TabNet baselines trained in Kadra et al. (2021) achieved a balanced accuracy of 87.3
and 73.4. They did not train a NODE model for this dataset. The results follow the same trend as our experiments for a set
of 512 balanced training examples (512 shots).
4 https://github.com/kathrinse/TabSurvey
5 https://github.com/optuna/optuna
Table 11: The mean performance for one prompt (ours, SD over five seeds omitted) and the mean performance and SD
across five different prompts (each again over five seeds).
Dataset                          Bank         Blood        California   Car          Credit-g     Diabetes     Heart        Income       Jungle
TabLLM 0-shot: 1 prompt (ours)   0.63         0.61         0.61         0.81         0.53         0.68         0.54         0.84         0.60
TabLLM 0-shot: avg. 5 prompts    0.64 (0.01)  0.60 (0.02)  0.59 (0.01)  0.80 (0.01)  0.52 (0.01)  0.67 (0.01)  0.55 (0.04)  0.84 (0.01)  0.60 (0.00)
We wanted to investigate how a distribution shift caused by inflation affects the zero-shot performance of TabLLM. The
Income dataset was collected in 1994, and the label and two features (capital gain/loss in last year) contain dollar values.
T0 was trained in 2021 (Sanh et al., 2022), and we assumed that the training data is much more recent than the Income
dataset. The inflation rate from 1994 to 2021 is 1.79 (see footnote 6). Without inflation correction, the zero-shot results were 0.80 (0.01).
Correcting the two features, correcting only the prompt, and correcting both all yielded the same performance as the
uncorrected one. The accuracy values also remained the same with the inflation correction.
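The three correction variants can be sketched as follows. This is a minimal illustration, assuming the correction simply multiplies dollar amounts by 1.796 and rounds to whole dollars. The column names, the helper functions, and the serialization template are hypothetical and are not taken from the authors' code.

```python
INFLATION_FACTOR = 1.796  # 1994 -> 2021 value quoted in the text

# Assumed column names for the two dollar-valued features.
DOLLAR_FEATURES = ("capital gain", "capital loss")


def correct_features(row, factor=INFLATION_FACTOR):
    """Scale only the dollar-valued features; copy everything else unchanged."""
    return {
        name: round(value * factor) if name in DOLLAR_FEATURES else value
        for name, value in row.items()
    }


def correct_threshold(threshold=50000, factor=INFLATION_FACTOR):
    """Scale the dollar threshold that appears in the prompt question."""
    return round(threshold * factor)


def build_prompt(row, threshold=50000):
    """Hypothetical text-template serialization followed by the task question."""
    serialized = " ".join(f"The {name} is {value}." for name, value in row.items())
    return (
        f"{serialized} Does this person earn more than {threshold} dollars? "
        "Yes or no? Answer:"
    )


row = {"age": 42, "capital gain": 594, "capital loss": 0}
prompt_features_only = build_prompt(correct_features(row))
prompt_question_only = build_prompt(row, correct_threshold())
prompt_both = build_prompt(correct_features(row), correct_threshold())
```

Each of the three prompts would then be scored by the zero-shot model in the same way as the uncorrected prompt.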
We wanted to understand which features were most important for the zero-shot performance of TabLLM on Income and
EoL. To this end, we used zero-shot TabLLM with the List Template serialization to predict the label probability of all
examples in the dataset. We then used 4-fold cross validation to fit an L2-regularized LR model to the predicted label using
the features in the serialization as covariates. For EoL, we used age, sex, race, and the conditions as inputs, which summed
up to 14,105 features.
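The surrogate-model idea can be illustrated with a small, self-contained sketch. It assumes binary one-hot covariates and binarized TabLLM predictions, uses a fixed regularization strength instead of the 4-fold cross validation described above, and fits the model with plain gradient descent. The function `fit_l2_logreg`, the toy data, and the predictions are hypothetical.

```python
import math


def fit_l2_logreg(X, y, l2=0.1, lr=0.5, steps=2000):
    """Gradient descent on mean log loss + (l2 / 2) * ||w||^2 (bias unpenalized)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi  # predicted prob. minus target
            grad_b += err / n
            for j in range(d):
                grad_w[j] += err * xi[j] / n
        w = [wj - lr * (gj + l2 * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


# Toy stand-in for the serialized features (one-hot encoded).
feature_names = ["education Masters", "workclass Without-pay", "race White"]
X = [
    [1, 0, 1], [1, 0, 0], [1, 0, 1], [0, 1, 1],
    [0, 1, 0], [0, 0, 1], [0, 0, 0], [0, 1, 1],
]
# Hypothetical zero-shot predictions of the language model, not true labels.
llm_pred = [1, 1, 1, 0, 0, 1, 0, 0]

weights, _ = fit_l2_logreg(X, llm_pred)
# Features sorted from most positive to most negative surrogate weight.
ranking = sorted(zip(feature_names, weights), key=lambda item: -item[1])
```

The surrogate weights describe which features push the model's prediction up or down; they say nothing about whether those predictions are correct.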
For Income we compared these approximated importance scores to the feature coefficients of an LR model trained on all
data for a single seed (Table 16). We used the same setup for the LR model as for our main experiments. We did 4-fold
cross validation on an 80% training split to choose hyperparameters, and then refit the model using all training data. The
best parameters of the LR model for Income were an ‘l1’ penalty and a regularization constant of 1. For EoL, we decided
that the LR model coefficients did not provide a good estimate of the ground truth due to the large number of features and
possible collinearities in the data. Instead, we provide the relative risk (RR) with 95% confidence intervals (CI) treating
the occurrence of a feature as an intervention. We report the 50 most and least important features of TabLLM in Table 17.
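For reference, a relative risk and its confidence interval can be computed from four counts as sketched below. The text does not state how the intervals were derived, so the log-scale normal approximation used here is an assumption, and the function name and example counts are hypothetical.

```python
import math


def relative_risk(exposed_pos, exposed_total, unexposed_pos, unexposed_total, z=1.96):
    """Relative risk with an approximate 95% CI (log-scale normal approximation).

    "Exposed" means the feature (concept) is present for a patient;
    "pos" counts the patients in that group with a positive label.
    """
    risk_exposed = exposed_pos / exposed_total
    risk_unexposed = unexposed_pos / unexposed_total
    rr = risk_exposed / risk_unexposed
    # Standard error of log(RR) under the usual large-sample approximation.
    se_log = math.sqrt(
        1 / exposed_pos - 1 / exposed_total
        + 1 / unexposed_pos - 1 / unexposed_total
    )
    lower = math.exp(math.log(rr) - z * se_log)
    upper = math.exp(math.log(rr) + z * se_log)
    return rr, lower, upper


# Hypothetical counts: 30 of 100 patients with the concept have the label,
# compared with 10 of 100 patients without it.
rr, ci_low, ci_high = relative_risk(30, 100, 10, 100)
```

An interval that excludes 1 indicates that the label rate differs between the two groups; as with any observational ratio, it does not by itself establish a causal effect.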
To evaluate the effect of using a different prompt, we considered the zero-shot setting, since even a few training examples
mostly cancel the effect. For all datasets we constructed five different prompts that asked the same question in different
words, e.g., “Does this person earn a lot of money?” instead of “Does this person earn more than 50000 dollars per year?”
for the Income dataset. The results are summarized in Table 11. The effects were relatively small, with the standard
deviation across the five prompts ranging from 0.00 for Jungle to 0.04 for Heart. This suggests that TabLLM is not very
sensitive to using different prompts.
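The per-dataset summary in Table 11 amounts to a mean and a standard deviation over five per-prompt scores. A minimal sketch follows. The per-prompt AUC values are invented for illustration, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used.

```python
import statistics


def summarize_prompts(auc_per_prompt):
    """Return (mean, sample SD) of per-prompt AUCs, rounded to two decimals."""
    mean_auc = statistics.mean(auc_per_prompt)
    sd_auc = statistics.stdev(auc_per_prompt)  # sample SD (n - 1 denominator)
    return round(mean_auc, 2), round(sd_auc, 2)


# Hypothetical AUCs for five prompt variants on one dataset; in the setup
# described above each value would itself be a mean over five seeds.
hypothetical_aucs = [0.55, 0.50, 0.60, 0.52, 0.58]
mean_auc, sd_auc = summarize_prompts(hypothetical_aucs)
```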
Table 12: Test AUC performance of competing methods on public tabular datasets. Each column reports the 𝑘-shot
performance for different values of 𝑘. Standard deviations across five random seeds are shown as subscripts appended
to each mean (e.g., 0.55.09 denotes a mean of 0.55 with a standard deviation of 0.09).
Number of Shots
Method 0 4 8 16 32 64 128 256 512 all
Bank Dataset
Logistic regression — 0.55.09 0.66.09 0.75.06 0.81.02 0.84.02 0.86.02 0.88.01 0.89.00 0.91.00
Logistic regression (ordinal) — 0.51.02 0.60.12 0.68.09 0.78.04 0.82.01 0.84.03 0.86.01 0.87.00 0.88.00
LightGBM — 0.50.00 0.50.00 0.50.00 0.50.00 0.77.03 0.84.03 0.88.01 0.89.00 0.94.00
LightGBM (ordinal) — 0.50.00 0.50.00 0.50.00 0.50.00 0.78.03 0.84.02 0.87.01 0.89.00 0.94.00
XGBoost — 0.50.00 0.56.09 0.68.04 0.76.03 0.83.02 0.85.03 0.88.01 0.90.01 0.94.00
XGBoost (ordinal) — 0.50.00 0.56.09 0.69.05 0.75.04 0.82.02 0.84.03 0.87.01 0.89.00 0.93.00
SAINT — 0.51.10 0.61.11 0.70.04 0.77.03 0.81.03 0.85.02 0.88.01 0.88.01 0.93.00
TabNet — 0.51.06 0.58.05 0.64.10 0.62.04 0.71.06 0.73.03 0.80.04 0.83.03 0.93.00
NODE — 0.52.02 0.55.06 0.64.06 0.73.06 0.78.02 0.83.03 0.85.01 0.86.01 0.76.02
TabPFN — 0.59.14 0.66.08 0.69.02 0.76.03 0.82.03 0.86.02 0.89.00 0.90.00 0.91.00
TabPFN (ordinal) — 0.57.10 0.67.05 0.71.05 0.78.04 0.83.01 0.86.02 0.87.00 0.88.00 0.89.00
TabLLM (T0 + Text GPT-3) 0.63.01 0.61.04 0.62.02 0.63.03 0.64.02 0.66.04 0.76.04 0.81.02 0.82.01 *
TabLLM (T0 + Text T0) 0.54.01 0.56.08 0.60.06 0.59.06 0.60.04 0.62.04 0.67.04 0.79.03 0.85.01 *
TabLLM (T0 + Table-To-Text) 0.42.01 0.48.07 0.50.05 0.56.03 0.57.04 0.59.05 0.63.03 0.68.02 0.74.01 *
TabLLM (T0 + Text Template) 0.63.01 0.59.10 0.64.05 0.65.05 0.64.06 0.69.03 0.82.05 0.87.01 0.88.01 0.92 †
TabLLM (T0 + List Template) 0.60.01 0.59.10 0.66.02 0.65.04 0.66.05 0.74.07 0.85.02 0.87.01 0.87.01 *
TabLLM (T0 + List Only Values) 0.56.01 0.58.09 0.60.04 0.63.03 0.67.03 0.71.05 0.79.03 0.84.01 0.86.01 *
TabLLM (T0 + List Perm. Names) 0.64.00 0.55.10 0.62.07 0.63.04 0.63.05 0.68.04 0.82.02 0.86.01 0.88.00 *
TabLLM (T0 + List Perm. Values) 0.38.01 0.47.11 0.53.06 0.55.07 0.57.05 0.65.04 0.75.07 0.84.01 0.85.01 *
TabLLM (T0 3B + Text Template) 0.61.01 0.60.10 0.65.05 0.64.07 0.65.05 0.70.02 0.77.05 0.88.01 0.89.01 *
Blood Dataset
Logistic regression — 0.54.09 0.59.08 0.72.03 0.70.06 0.74.02 0.76.02 0.76.02 0.76.03 0.76.03
Logistic regression (ordinal) — 0.54.09 0.59.08 0.72.03 0.70.06 0.74.02 0.76.02 0.76.02 0.76.03 0.76.03
LightGBM — 0.50.00 0.50.00 0.50.00 0.50.00 0.69.04 0.71.05 0.71.07 0.67.05 0.74.04
LightGBM (ordinal) — 0.50.00 0.50.00 0.50.00 0.50.00 0.69.04 0.71.05 0.71.07 0.67.05 0.74.04
XGBoost — 0.50.00 0.58.07 0.66.04 0.67.06 0.68.05 0.71.06 0.70.07 0.67.06 0.71.04
XGBoost (ordinal) — 0.50.00 0.58.07 0.66.04 0.67.06 0.68.05 0.71.06 0.70.07 0.67.06 0.71.04
SAINT — 0.47.12 0.66.08 0.66.03 0.67.06 0.67.05 0.71.03 0.76.05 0.73.02 0.74.03
TabNet — 0.47.09 0.61.06 0.60.09 0.66.06 0.63.06 0.66.04 0.72.06 0.72.02 0.71.03
NODE — 0.49.04 0.60.07 0.62.04 0.67.03 0.71.05 0.76.03 0.74.03 0.76.03 0.74.03
TabPFN — 0.52.08 0.64.04 0.67.01 0.70.04 0.73.04 0.75.04 0.76.04 0.76.03 0.74.03
TabPFN (ordinal) — 0.52.08 0.64.04 0.67.01 0.70.04 0.73.04 0.75.04 0.76.04 0.76.03 0.74.03
TabLLM (T0 + Text GPT-3) 0.63.04 0.61.07 0.65.04 0.63.02 0.64.03 0.62.05 0.67.06 0.68.05 0.66.05 *
TabLLM (T0 + Text T0) 0.49.04 0.51.03 0.59.08 0.59.06 0.64.04 0.65.06 0.66.05 0.68.06 0.66.03 *
TabLLM (T0 + Table-To-Text) 0.61.04 0.59.04 0.59.03 0.57.03 0.62.07 0.56.07 0.57.07 0.64.07 0.61.05 *
TabLLM (T0 + Text Template) 0.61.04 0.58.09 0.66.03 0.66.07 0.68.04 0.68.04 0.68.06 0.70.08 0.68.04 0.70.04
TabLLM (T0 + List Template) 0.56.05 0.54.08 0.64.02 0.64.08 0.67.05 0.66.06 0.67.05 0.70.06 0.67.06 *
TabLLM (T0 + List Only Values) 0.45.05 0.49.07 0.57.03 0.57.06 0.62.06 0.61.04 0.64.04 0.68.07 0.67.05 *
TabLLM (T0 + List Perm. Names) 0.52.04 0.49.07 0.62.03 0.62.06 0.65.05 0.65.04 0.68.06 0.72.06 0.68.04 *
TabLLM (T0 + List Perm. Values) 0.51.03 0.51.06 0.54.04 0.52.07 0.55.03 0.59.06 0.59.02 0.62.06 0.62.05 *
TabLLM (T0 3B + Text Template) 0.42.05 0.47.04 0.62.04 0.62.09 0.65.07 0.67.04 0.69.04 0.71.06 0.67.04 *
California Dataset
Logistic regression — 0.58.11 0.69.13 0.80.06 0.84.03 0.88.01 0.90.00 0.91.00 0.91.00 0.92.00
Logistic regression (ordinal) — 0.58.11 0.69.13 0.80.06 0.84.03 0.88.01 0.90.00 0.91.00 0.91.00 0.92.00
LightGBM — 0.50.00 0.50.00 0.50.00 0.50.00 0.81.02 0.87.01 0.90.01 0.92.00 0.97.00
LightGBM (ordinal) — 0.50.00 0.50.00 0.50.00 0.50.00 0.81.02 0.87.01 0.90.01 0.92.00 0.97.00
XGBoost — 0.50.00 0.62.10 0.74.03 0.79.04 0.82.04 0.87.01 0.90.01 0.92.01 0.97.00
XGBoost (ordinal) — 0.50.00 0.62.10 0.74.03 0.79.04 0.82.04 0.87.01 0.90.01 0.92.01 0.97.00
SAINT — 0.59.09 0.64.12 0.73.06 0.76.06 0.81.02 0.84.01 0.88.02 0.91.02 0.95.00
TabNet — 0.50.08 0.57.06 0.67.02 0.69.05 0.72.03 0.79.02 0.84.02 0.87.01 0.96.00
NODE — 0.58.06 0.57.07 0.70.05 0.77.03 0.80.01 0.86.02 0.86.02 0.87.01 0.87.01
TabPFN — 0.63.13 0.63.11 0.80.03 0.85.03 0.89.01 0.91.01 0.92.00 0.93.00 0.94.00
TabPFN (ordinal) — 0.63.13 0.63.11 0.80.03 0.85.03 0.89.01 0.91.01 0.92.00 0.93.00 0.94.00
TabLLM (T0 + Text GPT-3) 0.56.00 0.55.03 0.57.05 0.61.06 0.73.05 0.73.04 0.82.01 0.84.01 0.85.01 *
TabLLM (T0 + Text T0) 0.49.01 0.52.02 0.51.02 0.52.02 0.54.04 0.56.04 0.69.02 0.73.03 0.80.02 *
TabLLM (T0 + Table-To-Text) 0.49.01 0.50.01 0.51.01 0.52.02 0.57.04 0.58.04 0.74.03 0.79.02 0.82.01 *
TabLLM (T0 + Text Template) 0.61.01 0.63.05 0.60.07 0.70.08 0.77.08 0.77.04 0.81.02 0.83.01 0.86.02 0.95.00
TabLLM (T0 + List Template) 0.61.01 0.64.05 0.62.06 0.68.07 0.77.07 0.79.02 0.82.02 0.84.01 0.87.01 *
TabLLM (T0 + List Only Values) 0.58.01 0.57.08 0.55.03 0.65.09 0.74.08 0.77.03 0.83.01 0.84.02 0.86.02 *
TabLLM (T0 + List Perm. Names) 0.54.01 0.52.03 0.52.04 0.52.03 0.66.06 0.74.01 0.81.02 0.84.02 0.86.02 *
TabLLM (T0 + List Perm. Values) 0.47.01 0.48.02 0.50.01 0.52.02 0.57.03 0.64.04 0.71.04 0.76.01 0.78.02 *
TabLLM (T0 3B + Text Template) 0.57.01 0.59.03 0.57.04 0.66.07 0.77.06 0.79.02 0.81.01 0.83.01 0.85.01 *
* Result omitted due to runtime limitations of TabLLM on the full dataset.
† Only a single run performed due to runtime limitations of TabLLM on the full dataset.
Table 13: Test AUC performance of competing methods on public tabular datasets. Each column reports the 𝑘-shot
performance for different values of 𝑘. Standard deviations across five random seeds are shown as subscripts appended
to each mean (e.g., 0.61.02 denotes a mean of 0.61 with a standard deviation of 0.02).
Number of Shots
Method 0 4 8 16 32 64 128 256 512 all
Car Dataset
Logistic regression — 0.61.02 0.65.10 0.74.07 0.83.02 0.93.02 0.96.01 0.97.01 0.98.00 0.98.00
Logistic regression (ordinal) — 0.62.06 0.63.05 0.64.07 0.75.04 0.73.03 0.73.03 0.74.03 0.76.02 0.78.03
LightGBM — 0.50.00 0.50.00 0.50.00 0.50.00 0.85.06 0.93.01 0.98.01 0.99.01 1.00.00
LightGBM (ordinal) — 0.50.00 0.50.00 0.50.00 0.50.00 0.75.04 0.91.05 0.98.01 0.99.00 1.00.00
XGBoost — 0.50.00 0.59.04 0.70.08 0.82.03 0.91.02 0.95.01 0.98.01 0.99.01 1.00.00
XGBoost (ordinal) — 0.50.00 0.55.03 0.70.04 0.78.03 0.90.03 0.94.01 0.98.01 0.99.01 1.00.00
SAINT — 0.56.08 0.64.08 0.76.03 0.85.03 0.92.02 0.96.01 0.98.01 0.99.00 1.00.00
TabNet — † 0.54.05 0.64.05 0.66.05 0.73.07 0.81.04 0.93.02 0.98.01 1.00.00
NODE — 0.51.10 0.57.06 0.69.02 0.74.03 0.80.02 0.82.01 0.91.01 0.96.01 0.93.01
TabPFN — 0.64.06 0.75.05 0.87.04 0.92.02 0.97.00 0.99.01 1.00.00 1.00.00 1.00.00
TabPFN (ordinal) — 0.59.06 0.65.08 0.75.04 0.82.06 0.89.01 0.93.01 0.98.01 0.99.01 1.00.00
TabLLM (T0 + Text GPT-3) 0.72.02 0.75.03 0.75.02 0.78.01 0.83.01 0.87.02 0.90.01 0.93.02 0.93.02 0.96.01
TabLLM (T0 + Text T0) 0.85.01 0.85.02 0.84.03 0.86.02 0.89.02 0.92.02 0.94.01 0.98.01 0.99.00 1.00.00
TabLLM (T0 + Table-To-Text) 0.61.01 0.69.04 0.74.04 0.79.02 0.88.01 0.91.02 0.94.01 0.96.01 0.95.01 0.96.00
TabLLM (T0 + Text Template) 0.82.02 0.83.03 0.85.03 0.86.03 0.91.02 0.96.02 0.98.01 0.99.00 1.00.00 1.00.00
TabLLM (T0 + List Template) 0.79.02 0.84.03 0.85.02 0.86.03 0.91.02 0.95.01 0.98.01 0.99.00 1.00.00 1.00.00
TabLLM (T0 + List Only Values) 0.48.03 0.62.04 0.67.03 0.70.03 0.75.02 0.87.02 0.94.01 0.98.01 0.99.01 1.00.00
TabLLM (T0 + List Perm. Names) 0.39.02 0.54.10 0.58.06 0.70.03 0.86.02 0.94.01 0.97.02 0.99.01 0.99.00 1.00.00
TabLLM (T0 + List Perm. Values) 0.38.02 0.48.08 0.55.05 0.63.04 0.69.03 0.78.02 0.90.03 0.98.01 1.00.00 1.00.00
TabLLM (T0 3B + Text Template) 0.78.02 0.80.03 0.84.03 0.84.04 0.89.03 0.91.01 0.96.01 0.98.01 0.99.00 1.00.00
Credit-g Dataset
Logistic regression — 0.50.08 0.56.06 0.58.08 0.68.08 0.66.07 0.71.06 0.75.04 0.76.02 0.79.03
Logistic regression (ordinal) — 0.56.05 0.54.06 0.55.05 0.61.05 0.68.05 0.66.03 0.68.04 0.71.02 0.72.02
LightGBM — 0.50.00 0.50.00 0.50.00 0.50.00 0.61.09 0.68.03 0.72.02 0.75.02 0.78.02
LightGBM (ordinal) — 0.50.00 0.50.00 0.50.00 0.50.00 0.68.07 0.66.04 0.72.02 0.75.03 0.76.04
XGBoost — 0.50.00 0.51.07 0.59.05 0.66.03 0.67.06 0.68.02 0.73.02 0.75.03 0.78.04
XGBoost (ordinal) — 0.50.00 0.54.11 0.57.08 0.64.05 0.66.06 0.68.04 0.74.02 0.76.03 0.76.04
SAINT — 0.56.08 0.53.05 0.60.05 0.66.06 0.66.06 0.68.05 0.72.04 0.73.03 0.77.04
TabNet — 0.48.05 0.52.07 0.49.03 0.52.03 0.56.05 0.60.05 0.61.02 0.66.04 0.64.03
NODE — 0.54.09 0.54.10 0.54.09 0.59.07 0.63.04 0.68.02 0.68.05 0.70.02 0.65.03
TabPFN — 0.58.08 0.59.03 0.64.06 0.69.07 0.70.07 0.72.06 0.75.04 0.75.02 0.75.03
TabPFN (ordinal) — 0.55.08 0.51.07 0.57.06 0.62.03 0.66.05 0.70.02 0.73.01 0.73.03 0.75.04
TabLLM (T0 + Text GPT-3) 0.52.04 0.53.04 0.56.03 0.56.05 0.55.05 0.57.08 0.60.06 0.61.04 0.63.05 *
TabLLM (T0 + Text T0) 0.49.02 0.50.06 0.54.06 0.55.04 0.60.06 0.61.02 0.61.02 0.63.03 0.65.02 *
TabLLM (T0 + Table-To-Text) 0.50.06 0.65.04 0.60.05 0.60.07 0.65.05 0.67.05 0.65.05 0.68.04 0.64.05 *
TabLLM (T0 + Text Template) 0.53.05 0.69.04 0.66.04 0.66.05 0.72.06 0.70.07 0.71.07 0.72.03 0.72.02 0.70.02
TabLLM (T0 + List Template) 0.53.05 0.64.04 0.60.06 0.64.05 0.70.05 0.66.08 0.67.03 0.70.03 0.70.04 *
TabLLM (T0 + List Only Values) 0.66.06 0.71.03 0.67.06 0.69.06 0.72.06 0.69.05 0.69.07 0.70.06 0.68.04 *
TabLLM (T0 + List Perm. Names) 0.44.01 0.58.09 0.59.08 0.60.07 0.70.06 0.69.06 0.67.05 0.70.05 0.70.03 *
TabLLM (T0 + List Perm. Values) 0.50.05 0.55.06 0.56.07 0.58.04 0.64.03 0.66.08 0.67.09 0.68.03 0.69.03 *
TabLLM (T0 3B + Text Template) 0.54.03 0.65.05 0.63.05 0.63.03 0.73.04 0.69.05 0.68.06 0.73.05 0.73.03 *
Diabetes Dataset
Logistic regression — 0.60.15 0.68.11 0.73.05 0.76.05 0.80.02 0.81.02 0.83.02 0.83.02 0.83.02
Logistic regression (ordinal) — 0.60.15 0.68.11 0.73.05 0.76.05 0.80.02 0.81.02 0.83.02 0.83.02 0.83.02
LightGBM — 0.50.00 0.50.00 0.50.00 0.50.00 0.79.02 0.79.04 0.79.02 0.79.03 0.83.03
LightGBM (ordinal) — 0.50.00 0.50.00 0.50.00 0.50.00 0.79.02 0.79.04 0.79.02 0.79.03 0.83.03
XGBoost — 0.50.00 0.59.16 0.72.07 0.69.08 0.73.05 0.78.05 0.80.03 0.80.01 0.84.03
XGBoost (ordinal) — 0.50.00 0.59.16 0.72.07 0.69.08 0.73.05 0.78.05 0.80.03 0.80.01 0.84.03
SAINT — 0.46.12 0.65.11 0.73.06 0.73.06 0.79.03 0.81.03 0.81.04 0.77.03 0.83.03
TabNet — 0.56.04 0.56.06 0.64.09 0.66.06 0.71.04 0.73.04 0.74.05 0.74.07 0.81.03
NODE — 0.49.13 0.67.09 0.69.08 0.73.05 0.77.04 0.80.04 0.81.03 0.83.02 0.83.03
TabPFN — 0.61.13 0.67.11 0.71.07 0.77.03 0.82.03 0.83.03 0.83.03 0.81.02 0.81.03
TabPFN (ordinal) — 0.61.13 0.67.11 0.71.07 0.77.03 0.82.03 0.83.03 0.83.03 0.81.02 0.81.03
TabLLM (T0 + Text GPT-3) 0.61.06 0.61.07 0.56.12 0.67.08 0.74.04 0.77.02 0.79.03 0.76.03 0.78.04 0.81.04
TabLLM (T0 + Text T0) 0.58.04 0.53.05 0.53.06 0.54.09 0.59.05 0.68.02 0.73.04 0.72.05 0.72.03 0.76.01
TabLLM (T0 + Table-To-Text) 0.58.04 0.51.10 0.53.07 0.56.05 0.57.04 0.59.04 0.72.05 0.74.04 0.75.06 0.77.04
TabLLM (T0 + Text Template) 0.68.06 0.61.09 0.63.08 0.69.07 0.68.04 0.73.03 0.79.04 0.78.02 0.78.04 0.80.04
TabLLM (T0 + List Template) 0.64.06 0.64.09 0.64.10 0.67.07 0.70.05 0.76.04 0.78.03 0.78.03 0.78.04 0.81.05
TabLLM (T0 + List Only Values) 0.55.05 0.54.07 0.52.05 0.59.08 0.63.04 0.67.07 0.73.03 0.75.06 0.77.04 0.79.03
TabLLM (T0 + List Perm. Names) 0.56.07 0.60.09 0.68.12 0.74.05 0.74.03 0.72.04 0.76.04 0.77.04 0.77.04 0.81.04
TabLLM (T0 + List Perm. Values) 0.44.03 0.47.09 0.43.06 0.55.07 0.61.05 0.65.05 0.73.03 0.76.03 0.78.02 0.80.03
TabLLM (T0 3B + Text Template) 0.62.05 0.57.07 0.60.08 0.67.05 0.67.06 0.76.03 0.77.04 0.81.05 0.80.04 0.82.04
* Result omitted due to runtime limitations of TabLLM on the full dataset.
† Result omitted due to TabNet package not supporting unseen labels in validation set during cross validation.
Table 14: Test AUC performance of competing methods on public tabular datasets. Each column reports the 𝑘-shot
performance for different values of 𝑘. Standard deviations across five random seeds are shown as subscripts appended
to each mean (e.g., 0.69.17 denotes a mean of 0.69 with a standard deviation of 0.17).
Number of Shots
Method 0 4 8 16 32 64 128 256 512 all
Heart Dataset
Logistic regression — 0.69.17 0.75.13 0.82.06 0.87.05 0.91.01 0.90.02 0.92.01 0.93.01 0.93.01
Logistic regression (ordinal) — 0.70.17 0.73.14 0.84.04 0.88.03 0.89.01 0.88.02 0.90.02 0.92.02 0.92.02
LightGBM — 0.50.00 0.50.00 0.50.00 0.50.00 0.91.01 0.91.01 0.91.01 0.93.00 0.94.01
LightGBM (ordinal) — 0.50.00 0.50.00 0.50.00 0.50.00 0.91.01 0.91.02 0.91.01 0.92.01 0.94.01
XGBoost — 0.50.00 0.55.14 0.84.07 0.88.04 0.91.01 0.91.01 0.90.01 0.92.01 0.94.01
XGBoost (ordinal) — 0.50.00 0.56.15 0.84.07 0.90.03 0.91.01 0.90.01 0.90.01 0.92.01 0.94.01
SAINT — 0.80.12 0.83.10 0.88.07 0.90.01 0.90.04 0.90.02 0.90.01 0.92.01 0.93.01
TabNet — 0.56.12 0.70.05 0.73.14 0.80.04 0.83.05 0.84.03 0.88.02 0.88.03 0.89.03
NODE — 0.52.10 0.78.08 0.83.03 0.86.02 0.88.02 0.88.01 0.91.02 0.92.03 0.92.03
TabPFN — 0.84.06 0.88.05 0.87.06 0.91.02 0.92.02 0.92.02 0.92.01 0.92.02 0.92.02
TabPFN (ordinal) — 0.79.08 0.85.07 0.88.05 0.90.02 0.92.01 0.92.01 0.92.00 0.92.02 0.92.02
TabLLM (T0 + Text GPT-3) 0.51.04 0.72.05 0.82.03 0.85.05 0.88.03 0.91.02 0.89.02 0.91.01 0.91.01 0.93.01
TabLLM (T0 + Text T0) 0.44.03 0.74.07 0.82.10 0.87.02 0.88.02 0.89.04 0.90.01 0.89.02 0.89.03 0.93.02
TabLLM (T0 + Table-To-Text) 0.56.05 0.73.09 0.78.08 0.86.06 0.88.03 0.91.02 0.91.02 0.90.02 0.91.01 0.92.01
TabLLM (T0 + Text Template) 0.54.04 0.76.14 0.83.05 0.87.04 0.87.06 0.91.01 0.90.01 0.92.01 0.92.01 0.94.01
TabLLM (T0 + List Template) 0.52.03 0.73.12 0.83.05 0.87.04 0.88.04 0.91.02 0.91.01 0.92.01 0.92.01 0.94.01
TabLLM (T0 + List Only Values) 0.40.04 0.67.16 0.83.06 0.84.05 0.88.03 0.89.03 0.92.02 0.90.00 0.90.01 0.92.01
TabLLM (T0 + List Perm. Names) 0.57.02 0.78.07 0.85.02 0.82.06 0.87.05 0.90.02 0.92.02 0.91.01 0.91.01 0.93.02
TabLLM (T0 + List Perm. Values) 0.23.02 0.63.20 0.79.12 0.83.07 0.88.04 0.89.04 0.90.02 0.91.01 0.91.01 0.93.00
TabLLM (T0 3B + Text Template) 0.56.03 0.68.13 0.82.04 0.85.02 0.86.03 0.90.01 0.91.01 0.93.01 0.93.01 0.94.01
Income Dataset
Logistic regression — 0.68.15 0.72.13 0.80.03 0.82.01 0.83.03 0.85.01 0.87.01 0.88.00 0.90.00
Logistic regression (ordinal) — 0.55.04 0.56.06 0.58.07 0.70.06 0.76.03 0.79.01 0.80.01 0.80.00 0.81.00
LightGBM — 0.50.00 0.50.00 0.50.00 0.50.00 0.78.03 0.81.03 0.87.01 0.88.00 0.93.00
LightGBM (ordinal) — 0.50.00 0.50.00 0.50.00 0.50.00 0.78.01 0.81.01 0.86.01 0.89.00 0.93.00
XGBoost — 0.50.00 0.59.06 0.77.02 0.79.03 0.82.02 0.84.01 0.87.01 0.88.00 0.93.00
XGBoost (ordinal) — 0.50.00 0.63.04 0.74.04 0.76.04 0.79.03 0.84.02 0.86.01 0.88.00 0.93.00
SAINT — 0.74.03 0.65.15 0.79.03 0.81.03 0.84.02 0.84.02 0.87.01 0.88.00 0.91.00
TabNet — 0.56.04 0.59.07 0.62.11 0.64.06 0.71.04 0.73.05 0.80.02 0.83.02 0.92.00
NODE — 0.54.02 0.54.04 0.65.04 0.67.03 0.75.02 0.78.01 0.78.01 0.83.01 0.82.00
TabPFN — 0.73.08 0.71.09 0.76.09 0.80.04 0.82.04 0.84.01 0.86.01 0.87.01 0.89.00
TabPFN (ordinal) — 0.64.11 0.64.06 0.72.04 0.77.02 0.80.02 0.81.01 0.83.01 0.85.01 0.87.00
TabLLM (T0 + Text GPT-3) 0.75.01 0.79.03 0.80.03 0.82.02 0.82.01 0.84.02 0.84.02 0.85.01 0.86.00 *
TabLLM (T0 + Text T0) 0.65.01 0.67.03 0.66.07 0.72.02 0.75.03 0.79.04 0.82.02 0.83.02 0.86.01 *
TabLLM (T0 + Table-To-Text) 0.50.00 0.64.07 0.64.11 0.72.05 0.74.03 0.79.03 0.81.01 0.84.01 0.84.01 *
TabLLM (T0 + Text Template) 0.84.00 0.84.01 0.84.02 0.84.04 0.84.01 0.84.02 0.86.01 0.87.00 0.89.01 0.92.00
TabLLM (T0 + List Template) 0.79.01 0.83.01 0.83.03 0.83.02 0.84.01 0.85.01 0.86.01 0.87.01 0.88.01 *
TabLLM (T0 + List Only Values) 0.73.01 0.74.04 0.75.04 0.80.03 0.82.01 0.84.01 0.84.01 0.86.01 0.87.01 *
TabLLM (T0 + List Perm. Names) 0.65.00 0.75.03 0.74.05 0.82.02 0.83.02 0.84.02 0.86.01 0.86.01 0.88.01 *
TabLLM (T0 + List Perm. Values) 0.26.00 0.40.04 0.48.10 0.65.06 0.72.03 0.79.03 0.81.02 0.83.01 0.84.01 *
TabLLM (T0 3B + Text Template) 0.76.00 0.77.06 0.80.04 0.83.02 0.83.03 0.85.01 0.86.00 0.86.01 0.88.01 *
Jungle Dataset
Logistic regression — 0.62.09 0.69.09 0.68.04 0.76.03 0.79.01 0.79.00 0.80.01 0.80.00 0.81.00
Logistic regression (ordinal) — 0.62.09 0.69.09 0.68.04 0.76.03 0.79.01 0.79.00 0.80.01 0.80.00 0.81.00
LightGBM — 0.50.00 0.50.00 0.50.00 0.50.00 0.79.02 0.84.02 0.88.01 0.91.00 0.98.00
LightGBM (ordinal) — 0.50.00 0.50.00 0.50.00 0.50.00 0.79.02 0.84.02 0.88.01 0.91.00 0.98.00
XGBoost — 0.50.00 0.58.07 0.72.05 0.78.03 0.81.02 0.84.02 0.87.01 0.91.01 0.98.00
XGBoost (ordinal) — 0.50.00 0.58.07 0.72.05 0.78.03 0.81.02 0.84.02 0.87.01 0.91.01 0.98.00
SAINT — 0.64.05 0.69.06 0.72.05 0.79.02 0.81.01 0.83.01 0.88.01 0.90.00 1.00.00
TabNet — 0.53.09 0.60.05 0.62.03 0.69.04 0.73.04 0.75.02 0.79.02 0.84.01 0.99.00
NODE — 0.60.01 0.71.03 0.68.04 0.74.02 0.75.04 0.78.01 0.79.01 0.80.00 0.81.00
TabPFN — 0.65.08 0.72.04 0.71.07 0.78.02 0.81.01 0.84.01 0.88.01 0.91.00 0.93.00
TabPFN (ordinal) — 0.65.08 0.72.04 0.71.07 0.78.02 0.81.01 0.84.01 0.88.01 0.91.00 0.93.00
TabLLM (T0 + Text GPT-3) 0.56.01 0.58.02 0.55.02 0.60.06 0.68.03 0.74.03 0.77.01 0.81.01 0.85.01 *
TabLLM (T0 + Text T0) 0.63.00 0.63.04 0.64.05 0.62.06 0.70.01 0.71.03 0.74.02 0.78.02 0.82.01 *
TabLLM (T0 + Table-To-Text) 0.51.01 0.60.02 0.60.04 0.63.05 0.69.03 0.75.01 0.78.03 0.82.01 0.85.01 *
TabLLM (T0 + Text Template) 0.60.00 0.64.01 0.64.02 0.65.03 0.71.02 0.78.02 0.81.02 0.84.01 0.89.01 1.00 †
TabLLM (T0 + List Template) 0.63.00 0.65.01 0.66.03 0.66.04 0.71.03 0.78.02 0.81.03 0.84.01 0.88.01 *
TabLLM (T0 + List Only Values) 0.58.00 0.60.03 0.62.03 0.63.02 0.65.04 0.73.01 0.76.02 0.82.02 0.88.01 *
TabLLM (T0 + List Perm. Names) 0.40.00 0.53.06 0.55.05 0.63.10 0.72.03 0.79.02 0.80.03 0.84.02 0.89.01 *
TabLLM (T0 + List Perm. Values) 0.48.00 0.50.02 0.52.03 0.53.03 0.55.01 0.59.02 0.63.01 0.72.02 0.75.01 *
TabLLM (T0 3B + Text Template) 0.54.00 0.63.02 0.64.04 0.67.03 0.72.03 0.77.02 0.80.02 0.83.01 0.87.01 *
* Result omitted due to runtime limitations of TabLLM on the full dataset.
† These experiments were only performed for a single run due to runtime limitations of TabLLM on the full dataset.
Table 15: Full results on healthcare claims dataset. The best concept selection method (most frequent concepts) and
concept names (original concept names) were used as determined in prior zero-shot experiments. A fixed number of 10
epochs was used for up to 256 shots and 3 epochs for more shots to decrease the runtime and prevent overfitting.
Number of Shots
Method 0 16 64 256 1,024 4,096 16,384 all
End of Life (EoL)
TabLLM (T0 + List Template) 0.70 0.74 0.78 0.78 0.79 0.81 0.81 —
TabLLM (T0 + Text Template) 0.63 0.71 0.74 0.76 0.78 0.79 0.80 —
TabLLM (T0 + List Short) 0.68 0.71 0.76 0.79 0.80 0.81 0.82 —
TabLLM (T0 + List Perm. Names) 0.62 0.66 0.70 0.74 0.75 0.77 0.79 —
Logistic Regression — 0.65.07 0.77.02 0.80.02 0.83.01 0.83.01 0.84.01 0.84.01
LightGBM — 0.50.00 0.71.01 0.76.02 0.80.01 0.82.01 0.83.01 0.82 *
TabLLM (T0 + List Template) unbalanced 0.70 0.64 0.69 0.74 0.74 0.77 0.79 —
Logistic Regression unbalanced — 0.44.04 0.53.12 0.75.03 0.77.03 0.80.02 0.82.02 0.84.01
Surgical Procedure (Surgery)
TabLLM (T0 + List Template) 0.67 0.73 0.72 0.73 0.75 0.78 0.79 —
TabLLM (T0 + Text Template) 0.62 0.71 0.69 0.72 0.74 0.77 0.78 —
TabLLM (T0 + List Short) 0.66 0.70 0.69 0.72 0.73 0.76 0.78 —
TabLLM (T0 + List Perm. Names) 0.60 0.68 0.70 0.72 0.74 0.77 —
Logistic Regression — 0.72.04 0.75.05 0.77.01 0.79.01 0.80.01 0.80.00 0.81.00
LightGBM — 0.50.00 0.73.02 0.77.01 0.79.01 0.80.00 0.81.01 0.82 *
TabLLM (T0 + List Template) unbalanced 0.67 0.68 0.73 0.74 0.75 0.77 0.79 —
Logistic Regression unbalanced — 0.61.15 0.77.01 0.77.02 0.78.01 0.80.01 0.80.00 0.81.00
Likelihood of Hospitalization (LoH)
TabLLM (T0 + List Template) 0.71 0.73 0.73 0.76 0.78 0.81 0.82 —
TabLLM (T0 + Text Template) 0.65 0.74 0.72 0.74 0.78 0.80 0.81 —
TabLLM (T0 + List Short) 0.70 0.73 0.75 0.78 0.79 0.80 0.82 —
TabLLM (T0 + List Perm. Names) 0.62 0.71 0.72 0.75 0.75 0.78 0.80 —
Logistic Regression — 0.72.04 0.76.03 0.80.01 0.82.01 0.83.01 0.83.01 0.84.01
LightGBM — 0.50.00 0.72.02 0.76.03 0.81.01 0.83.00 0.83.01 0.85 *
TabLLM (T0 + List Template) unbalanced 0.71 0.66 0.72 0.75 0.75 0.78 0.80 —
Logistic Regression unbalanced — 0.53.06 0.54.09 0.73.06 0.79.01 0.81.01 0.82.01 0.84.01
* These experiments were only performed for a single run due to runtime limitations on the full dataset.
Table 16: Feature importance of zero-shot TabLLM and LR on all data for the Income dataset. To determine the feature
importance of TabLLM, we fit a separate LR model to the predictions using the original feature values as covariates. For
LR we simply use the feature coefficients. The features are ranked by their TabLLM importance score.
Feature TabLLM LR Feature TabLLM LR
rank weight rank weight rank weight rank weight
capital gain 1 5.310 2 2.393 relationship Other-relative 54 -0.010 88 -0.759
education Masters 2 4.623 6 1.455 native country Trinadad&Tob. 55 -0.028 66 -0.097
education Doctorate 3 3.410 4 2.066 race Black 56 -0.044 74 -0.291
education Bachelors 4 2.995 7 1.135 native country England 57 -0.088 16 0.551
education Prof-school 5 2.949 5 1.900 native country Honduras 58 -0.105 58 0.000
occupation Machine-op-insp. 6 2.589 75 -0.325 relationship Not-in-family 59 -0.153 29 0.257
workclass Private 7 2.275 37 0.102 native country Holand-Neth. 60 -0.154 57 0.000
relationship Wife 8 2.109 8 0.955 occupation Craft-repair 61 -0.161 36 0.108
native country China 9 2.086 94 -0.839 capital loss 62 -0.182 31 0.255
native country United-States 10 2.045 38 0.087 race Other 63 -0.202 65 -0.085
native country Taiwan 11 1.965 54 0.000 native country Yugoslavia 64 -0.204 27 0.357
workclass Federal-gov 12 1.784 14 0.574 workclass Local-gov 65 -0.230 47 0.000
race White 13 1.685 61 0.000 occupation nan 66 -0.248 82 -0.653
education Assoc-acdm 14 1.621 13 0.574 marital status Never-married 67 -0.292 77 -0.443
native country nan 15 1.565 63 -0.056 native country Iran 68 -0.330 41 0.000
marital status Married-civ-sp. 16 1.487 3 2.214 native country Dominican-Rep. 69 -0.332 85 -0.731
occupation Protective-serv 17 1.434 17 0.535 marital status Married-sp.-abs. 70 -0.379 51 0.000
sex Male 18 1.335 42 0.000 native country Jamaica 71 -0.416 25 0.392
occupation Armed-Forces 19 1.290 60 0.000 native country Nicaragua 72 -0.425 45 0.000
occupation Adm-clerical 20 1.245 52 0.000 native country Thailand 73 -0.451 100 -1.116
hours per week 21 1.240 20 0.424 native country Peru 74 -0.522 93 -0.837
native country Hong 22 1.227 86 -0.749 native country Japan 75 -0.617 56 0.000
occupation Tech-support 23 1.164 18 0.526 relationship Unmarried 76 -0.620 48 0.000
relationship Husband 24 1.087 72 -0.212 native country France 77 -0.754 21 0.416
occupation Sales 25 0.857 28 0.298 occupation Other-service 78 -0.754 96 -0.903
native country Vietnam 26 0.803 95 -0.898 workclass Never-worked 79 -0.763 50 0.000
marital status Married-AF-sp. 27 0.792 1 2.571 education 1st-4th 80 -0.763 101 -1.172
native country Philippines 28 0.711 40 0.011 native country Columbia 81 -0.836 104 -1.855
age 29 0.710 22 0.411 education 5th-6th 82 -0.843 97 -0.961
native country Poland 30 0.698 53 0.000 marital status Divorced 83 -0.870 46 0.000
occupation Prof-specialty 31 0.684 12 0.620 education 9th 84 -0.904 102 -1.222
race Asian-Pac-Islander 32 0.651 32 0.254 native country Ecuador 85 -0.952 49 0.000
native country Outlying-US 33 0.591 92 -0.836 education 11th 86 -0.993 91 -0.825
workclass Self-emp-not-inc 34 0.582 76 -0.344 native country Haiti 87 -1.062 35 0.137
native country Italy 35 0.534 24 0.400 education Assoc-voc 88 -1.074 19 0.514
marital status Separated 36 0.523 70 -0.181 native country India 89 -1.074 71 -0.183
workclass nan 37 0.515 59 0.000 education 7th-8th 90 -1.151 103 -1.303
occupation Exec-managerial 38 0.503 10 0.773 marital status Widowed 91 -1.253 64 -0.071
native country Scotland 39 0.491 81 -0.626 education 10th 92 -1.306 89 -0.797
native country Laos 40 0.475 44 0.000 native country Greece 93 -1.319 68 -0.140
native country Cambodia 41 0.328 11 0.642 sex Female 94 -1.327 84 -0.710
native country Guatemala 42 0.276 55 0.000 native country South 95 -1.466 99 -1.101
workclass State-gov 43 0.267 73 -0.223 native country Cuba 96 -1.575 33 0.230
native country Germany 44 0.262 39 0.043 education Some-college 97 -1.950 26 0.363
native country Puerto-Rico 45 0.241 67 -0.128 occupation Handlers-cleaners 98 -1.992 83 -0.681
native country Hungary 46 0.177 34 0.191 native country Portugal 99 -2.049 15 0.572
native country Mexico 47 0.123 80 -0.579 race Amer-Indian-Eskimo 100 -2.081 78 -0.465
native country Ireland 48 0.116 9 0.954 relationship Own-child 101 -2.404 87 -0.755
education HS-grad 49 0.092 43 0.000 occupation Priv-house-serv 102 -2.840 105 -1.909
occupation Transport-moving 50 0.090 62 -0.048 education 12th 103 -3.178 79 -0.480
native country El-Salvador 51 0.027 90 -0.803 education Preschool 104 -3.520 106 -2.385
native country Canada 52 0.027 23 0.407 occupation Farming-fishing 105 -3.853 98 -0.982
workclass Self-emp-inc 53 0.001 30 0.255 workclass Without-pay 106 -4.423 69 -0.174
Table 17: Feature importance of zero-shot TabLLM and relative risk (RR) with 95% confidence interval (CI) for the EoL task
on the healthcare claims dataset. For TabLLM we fit a separate LR model to the predictions using the original feature
values as covariates. We determine the relative risk treating the respective feature as an intervention, i.e., the proportion
of positive labels in the group that has a concept divided by the proportion in the group without it. We selected the 50
features with the highest and the 50 features with the lowest importance.
Feature TabLLM RR (95% CI) Feature TabLLM RR (95% CI)
rank weight rank weight
atrial fibrillation 1 0.633 2.72 (2.51-2.95) open wound of forehead without ... 14056 -0.152 1.80 (1.18-2.74)
atherosclerosis of coronary art... 2 0.530 2.10 (1.94-2.27) prediabetes 14057 -0.157 0.81 (0.68-0.96)
atherosclerosis of aorta 3 0.473 1.99 (1.81-2.19) primary iridocyclitis 14058 -0.157 1.63 (1.03-2.56)
exudative age-related macular d... 4 0.452 2.38 (2.06-2.75) discoloration of skin 14059 -0.157 0.87 (0.73-1.04)
sex male 5 0.442 1.23 (1.14-1.33) basal cell carcinoma of truncal... 14060 -0.158 1.14 (0.94-1.40)
non-hodgkin’s lymphoma (clinical) 6 0.440 1.36 (0.94-1.96) lumbar sprain 14061 -0.158 1.14 (0.91-1.42)
chronic atrial fibrillation 7 0.436 3.36 (3.05-3.70) spasm 14062 -0.160 0.98 (0.82-1.16)
chronic kidney disease stage 3 8 0.430 2.75 (2.53-2.98) chronic rhinitis 14063 -0.161 1.22 (1.06-1.42)
atherosclerosis of arteries of ... 9 0.404 2.76 (2.42-3.15) primary cardiomyopathy 14064 -0.161 2.50 (2.11-2.97)
barrett’s esophagus 10 0.402 1.07 (0.84-1.37) benign neoplastic disease 14065 -0.162 1.04 (0.63-1.72)
chronic obstructive lung disease 11 0.401 2.39 (2.19-2.60) palpitations 14066 -0.166 1.12 (1.01-1.25)
paroxysmal atrial fibrillation 12 0.395 2.58 (2.37-2.81) localized, primary osteoarthrit... 14067 -0.167 1.50 (1.33-1.70)
systemic lupus erythematosus 13 0.395 1.51 (0.99-2.29) benign neoplasm of skin of lowe... 14068 -0.167 0.68 (0.53-0.89)
atherosclerosis of artery of lo... 14 0.394 2.45 (2.20-2.72) cyst of ovary 14069 -0.171 0.90 (0.64-1.26)
coronary atherosclerosis 15 0.381 2.15 (1.95-2.36) microscopic hematuria 14070 -0.171 1.18 (1.01-1.37)
nonexudative age-related macula... 16 0.377 2.15 (1.95-2.37) problem related to lifestyle 14071 -0.172 0.96 (0.48-1.91)
age related macular degeneration 17 0.371 2.18 (1.76-2.71) acquired hypothyroidism 14072 -0.172 1.47 (1.34-1.62)
pseudoexfoliation glaucoma 18 0.360 1.13 (0.72-1.76) abnormal findings on diagnostic... 14073 -0.176 0.63 (0.54-0.73)
degenerative joint disease invo... 19 0.359 1.77 (1.52-2.06) increased frequency of urination 14074 -0.177 1.41 (1.22-1.64)
coronary arteriosclerosis 20 0.357 2.00 (1.82-2.20) disorder of skin 14075 -0.178 1.18 (0.95-1.48)
coronary artery graft present 21 0.346 1.64 (1.41-1.91) thyroiditis 14076 -0.180 0.87 (0.49-1.57)
aortocoronary bypass graft present 22 0.335 2.24 (1.98-2.54) race hispanic or latino 14077 -0.186 0.96 (0.60-1.51)
dehydration 23 0.332 2.94 (2.68-3.22) herpes zoster without complication 14078 -0.187 1.14 (0.96-1.35)
primary malignant neoplasm of f... 24 0.327 1.19 (1.01-1.40) altered sensation of skin 14079 -0.191 1.00 (0.82-1.22)
malignant lymphoma 25 0.322 1.54 (0.96-2.46) generalized hyperhidrosis 14080 -0.194 1.37 (1.07-1.76)
cerebral infarction due to thro... 26 0.316 2.86 (2.46-3.32) primary open angle glaucoma 14081 -0.194 1.35 (1.20-1.52)
congestive heart failure 27 0.313 3.67 (3.38-3.99) stool finding 14082 -0.195 1.48 (1.26-1.73)
old myocardial infarction 28 0.299 2.04 (1.81-2.30) primary gout 14083 -0.196 1.80 (1.51-2.15)
sleep apnea 29 0.294 1.16 (0.98-1.37) localized, primary osteoarthrit... 14084 -0.199 1.10 (0.92-1.30)
acute hypoxemic respiratory fai... 30 0.292 4.02 (3.62-4.46) diarrhea 14085 -0.200 1.73 (1.57-1.90)
obstructive sleep apnea syndrome 31 0.287 1.09 (0.96-1.24) benign neoplasm of skin of uppe... 14086 -0.204 0.78 (0.58-1.03)
primary malignant neoplasm of e... 32 0.284 0.92 (0.56-1.53) prostatitis 14087 -0.204 1.20 (0.89-1.62)
sensorineural hearing loss 33 0.281 1.26 (1.09-1.47) eruption 14088 -0.205 1.25 (1.11-1.41)
retention of urine 34 0.280 2.19 (1.97-2.44) scar conditions and fibrosis of... 14089 -0.206 1.00 (0.86-1.15)
atrial flutter 35 0.280 2.14 (1.85-2.47) hashimoto thyroiditis 14090 -0.215 0.91 (0.49-1.68)
abdominal aortic aneurysm witho... 36 0.275 1.85 (1.58-2.18) acquired deformity of toe 14091 -0.227 1.25 (0.94-1.65)
chronic kidney disease due to h... 37 0.274 2.65 (2.42-2.90) race asian 14092 -0.228 0.70 (0.50-0.99)
non-rheumatic aortic sclerosis 38 0.271 2.64 (2.38-2.93) localized swelling, mass and lu... 14093 -0.242 1.48 (1.15-1.91)
type 2 diabetes mellitus 39 0.267 2.14 (1.96-2.33) benign neoplasm of skin of trunk 14094 -0.245 0.91 (0.79-1.05)
intraductal carcinoma in situ o... 40 0.265 0.62 (0.30-1.29) benign essential hypertension 14095 -0.245 1.86 (1.72-2.01)
chronic kidney disease stage 2 41 0.264 1.77 (1.55-2.03) finding of frequency of urination 14096 -0.255 1.48 (1.34-1.64)
degenerative disorder of macula 42 0.263 2.23 (1.88-2.65) benign essential microscopic he... 14097 -0.258 1.10 (0.76-1.59)
sensorineural hearing loss, bil... 43 0.262 1.30 (1.17-1.43) localized swelling, mass and lu... 14098 -0.262 1.93 (1.67-2.23)
race white 44 0.262 1.25 (1.14-1.37) digestive symptom 14099 -0.267 0.91 (0.68-1.21)
metabolic encephalopathy 45 0.259 4.42 (3.86-5.07) type 1 diabetes mellitus withou... 14100 -0.298 2.34 (2.03-2.70)
alzheimer’s disease 46 0.256 5.03 (4.45-5.69) open angle with borderline intr... 14101 -0.338 1.20 (1.03-1.40)
sick sinus syndrome 47 0.256 2.37 (2.08-2.71) primary localized osteoarthrosi... 14102 -0.366 1.08 (0.82-1.43)
ventricular tachycardia 48 0.255 2.33 (2.00-2.70) localized, primary osteoarthritis 14103 -0.393 1.23 (1.07-1.40)
acute posthemorrhagic anemia 49 0.255 2.15 (1.92-2.41) sex female 14104 -0.441 0.81 (0.75-0.88)
impaired fasting glycemia 50 0.254 0.97 (0.85-1.09) open-angle glaucoma - borderline 14105 -0.495 0.97 (0.85-1.10)
Is this house block valuable? Yes or no?
Answer:
|||
{{ answer_choices[label] }}’

Does the white player win this two pieces endgame of Jungle Chess? Yes or no?
Answer:
|||
{{ answer_choices[label] }}’

Car Dataset:
answer_choices: ’Unacceptable ||| Acceptable ||| Good ||| Very good’
jinja: ’{{serialization}}
How would you rate the decision to buy this car? Unacceptable, acceptable, good or very good?
Answer:
|||
{{ answer_choices[label] }}’

Credit-g Dataset:
answer_choices: ’No ||| Yes’
jinja: ’{{serialization}}
Does this person receive a credit? Yes or no?
Answer:
|||
{{ answer_choices[label] }}’

Diabetes Dataset:
answer_choices: ’No ||| Yes’
jinja: ’{{serialization}}
Does this patient have diabetes? Yes or no?
Answer:
|||
{{ answer_choices[label] }}’

End Of Life Task:
answer_choices: ’No ||| Yes’
jinja: ’{{serialization}}
Does this patient die in the next nine months? Yes or no?
Answer:
|||
{{ answer_choices[label] }}’

Surgical Procedure Task:
answer_choices: ’No ||| Yes’
jinja: ’{{serialization}}
Does this patient need a surgery in the next nine months? Yes or no?
Answer:
|||
{{ answer_choices[label] }}’

Likelihood of Hospitalization Task:
answer_choices: ’No ||| Yes’
jinja: ’{{serialization}}
Is this patient admitted to the hospital in the next nine months? Yes or no?
Answer:
|||
{{ answer_choices[label] }}’
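To make the template structure concrete, the following minimal sketch shows one way such a template could be turned into a model input and a target string. It is an illustration under stated assumptions, not the implementation used in the paper: it replaces the two placeholders by plain string substitution instead of running a Jinja engine, it writes the field name as answer_choices (assuming an underscore that the listing does not show), and the function name render_template and the serialized example row are hypothetical.

```python
from typing import Dict, List, Tuple

# Template copied from the Diabetes listing; "|||" separates input from target.
DIABETES_TEMPLATE: Dict[str, str] = {
    "answer_choices": "No ||| Yes",
    "jinja": (
        "{{serialization}}\n"
        "Does this patient have diabetes? Yes or no?\n"
        "Answer:\n"
        "|||\n"
        "{{ answer_choices[label] }}"
    ),
}


def render_template(
    template: Dict[str, str], serialization: str, label: int
) -> Tuple[str, str, List[str]]:
    """Return (input_text, target_text, choices) for one serialized row.

    Simplification (assumption): placeholders are filled by string
    replacement rather than by a real Jinja renderer.
    """
    # The answer choices are a "|||"-separated list of verbalizer strings.
    choices = [c.strip() for c in template["answer_choices"].split("|||")]

    # The template body contains exactly one "|||" between prompt and target.
    prompt_part, target_part = template["jinja"].split("|||")

    input_text = prompt_part.replace("{{serialization}}", serialization).strip()
    target_text = target_part.replace(
        "{{ answer_choices[label] }}", choices[label]
    ).strip()
    return input_text, target_text, choices


# Hypothetical serialized row, invented only for this illustration.
example_row = "The Age is 50. The Glucose is 148."
input_text, target_text, choices = render_template(
    DIABETES_TEMPLATE, example_row, label=1
)
print(input_text)
print("Target:", target_text)
```

With label=1 the target is the second answer choice. For zero-shot prediction only the input text would be passed to the model, and the returned choices would serve as the verbalizer strings whose probabilities are compared.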
Smith, J. W., Everhart, J., Dickson, W., Knowler, W., and Johannes, R. (1988). Using the ADAP Learning Algorithm to
Forecast the Onset of Diabetes Mellitus. Proceedings of the Annual Symposium on Computer Application in Medical
Care, pages 261–265.
van Rijn, J. N. and Vis, J. K. (2014). Endgame analysis of Dou Shou Qi. ICGA Journal, 37(2):120–124.
Yeh, I.-C., Yang, K.-J., and Ting, T.-M. (2009). Knowledge discovery on RFM model using Bernoulli sequence. Expert
Systems with Applications, 36(3):5866–5871.