
Understanding LLM Embeddings for Regression
Eric Tang∗1 , Bangding Yang2 and Xingyou Song3
1 Stanford University, 2 Google, 3 Google DeepMind

∗ Work performed during the Google DeepMind Academy Program.

With the rise of large language models (LLMs) for flexibly processing information as strings, a natural
application is regression, specifically by preprocessing string representations into LLM embeddings as
downstream features for metric prediction. In this paper, we provide one of the first comprehensive
investigations into embedding-based regression and demonstrate that LLM embeddings as features
can be better for high-dimensional regression tasks than using traditional feature engineering. This
regression performance can be explained in part due to LLM embeddings over numeric data inherently
preserving Lipschitz continuity over the feature space. Furthermore, we quantify the contribution of
different model effects, most notably model size and language understanding, which we find surprisingly
do not always improve regression performance.

1. Introduction and Related Work


Regression is a fundamental statistical tool used to model the relationship between a metric and a
selected set of features, playing a crucial role in various fields, enabling predictions, forecasting, and
the understanding of underlying relationships within data. Traditional regression techniques often
rely on handcrafted features or domain-specific knowledge to represent input data. However, the
advent of Large Language Models (LLMs) and their ability to instead process semantic representations
of text has raised the question of whether regression can instead be performed over free-form text.
Previous works have predominantly examined the topic of LLM-based regression through decoding, i.e. generating floating point predictions using token-based sampling. For example, (Song et al., 2024) examines the case when the model is fully accessible and fine-tunable against data, while (Vacareanu et al., 2024) study the ability of service-based closed-source LLMs such as GPT-4 using in-context learning. One understudied case however is the use of service-based LLM embeddings - fixed vector representations derived from pre-trained (but frozen) language models, which are ubiquitously offered among most LLM services (Anthropic, 2024; Google, 2024; OpenAI, 2023). Although they are used frequently in recent applications such as retrieval (Karpukhin et al., 2020), semantic similarity (Li et al., 2020), and a variety of other downstream language tasks (Liu et al., 2020), there has been very little fundamental research around their use in regression, outside of specific applications such as Bayesian Optimization (Kristiadi et al., 2024; Nguyen et al., 2024).

Figure 1 | Rugged surface of a 5D Sphere function when inputs are represented as Gemini embeddings of dimension 6K+, post-processed by t-SNE into 2D space.


In contrast to decoding-based regression techniques, embedding-based regression allows the possibility of cheap data-driven training using inexpensive and customizable post-embedding layers such as
multi-layer perceptrons (MLPs). However, as shown in Figure 1, when the domain of a simple function
is expressed using high-dimensional embeddings, unexpected characteristics and irregularities can
arise, prompting the need for a thorough analysis. Furthermore, LLMs by default are not explicitly
trained for embedding-based regression, rather purely for token generation, and thus it is worth
analyzing the emergent behaviors of LLM embeddings when applied to regression.
This paper investigates the behavior of these LLM embeddings when used as features for standard
tabular regression tasks. Most notably, our findings are:
• LLM embeddings are dimensionally robust, i.e. regression performance can remain strong even
over high-dimensional data, whereas traditional representations significantly suffer.
• Over numeric formats, LLM embeddings preserve Lipschitz-continuity and smoothness over
feature space, which naturally enables regression when using a downstream MLP head.
• Factors which directly impact language understanding (e.g. size, pre-training, and input
formatting) have more nuanced effects for regression and do not always provide significantly
better outcomes.

2. Problem and Methodology


A regression task T = ( 𝑓 , X , D) consists of an underlying scalar-valued function 𝑓 : X → ℝ over
an input space X. Provided are offline training data D𝑡𝑟𝑎𝑖𝑛 = {( 𝑥1 , 𝑦1 ) , ..., ( 𝑥𝑇 , 𝑦𝑇 )} collected from
querying 𝑓 and an analogous test set D𝑡𝑒𝑠𝑡 for evaluation. Given access to training data D𝑡𝑟𝑎𝑖𝑛 , the
goal is to obtain accurate predictions over test points ( 𝑥, 𝑦 ) ∈ D𝑡𝑒𝑠𝑡 , usually measured by an aggregate
performance measure, e.g. mean squared error or Kendall-Tau ranking scores.
Required by nearly all learnable regression methods are features, which we assume come from an
embedder 𝜙 : X → ℝ𝑑 which takes an input 𝑥 and returns a fixed-dimensional feature representation,
of dimension 𝑑 . Here, we use the terms "features" and "embedding" interchangeably, since traditional
methods typically use a canonical, manually defined feature engineering method for tabular data, in
which continuous values are normalized and categorical selections are one-hot encoded. This feature
vector 𝜙 ( 𝑥 ) is then sent to a downstream predictor, e.g. MLP or random forest, which is trained using
a loss function such as mean squared error.
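
To make the traditional baseline concrete, below is a minimal Python sketch of such a feature-engineering embedder 𝜙trad, assuming z-score normalization for continuous parameters and one-hot encoding for categorical ones; the parameter names and vocabulary are hypothetical and purely illustrative.

import numpy as np

def phi_trad(x, cont_keys, cat_keys, cat_vocab, cont_mean, cont_std):
    """Traditional embedder sketch: z-score continuous parameters and
    one-hot encode categorical ones (keys and vocabularies are illustrative)."""
    cont = np.array([(x[k] - cont_mean[k]) / cont_std[k] for k in cont_keys])
    onehots = []
    for k in cat_keys:
        vec = np.zeros(len(cat_vocab[k]))
        vec[cat_vocab[k].index(x[k])] = 1.0
        onehots.append(vec)
    return np.concatenate([cont] + onehots)

# Hypothetical 2-parameter input: d_trad = 1 (continuous) + 3 (one-hot) = 4.
x = {"batch_size": 128.0, "activation_fn": "selu"}
features = phi_trad(
    x,
    cont_keys=["batch_size"],
    cat_keys=["activation_fn"],
    cat_vocab={"activation_fn": ["relu", "selu", "tanh"]},
    cont_mean={"batch_size": 256.0},
    cont_std={"batch_size": 128.0},
)
print(features)  # [-1.  0.  1.  0.]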
Language models also provide a canonical definition of embedding, which typically consists of, in
order:
1. Tokenizing a string representation 𝑥 into 𝐿 tokens.
2. Obtaining a "soft prompt" ℝ𝐿× 𝑣 via vocabulary look-up.
3. Applying a forward pass of a Transformer to obtain an output ℝ𝐿× 𝑓 .
4. Pooling down to a fixed dimension vector in ℝ𝑑 .
Afterwards, one may also attach an MLP predictor head and apply an analogous training procedure
as in the traditional case. Thus we can see that the only difference becomes the input representation
𝜙, i.e. whether we used a traditional 𝜙trad or LLM-based 𝜙LLM .
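
As an illustration of steps (1)-(4), the following sketch builds 𝜙LLM with the open-source T5-Small encoder from the Hugging Face transformers library; this is a stand-in, not the paper's actual embedding infrastructure, and average pooling over non-padding tokens yields 𝑑llm equal to the encoder's output width 𝑓.

import torch
from transformers import T5Tokenizer, T5EncoderModel

# (1) tokenize, (2) look up token embeddings internally, (3) apply a forward
# pass, (4) average-pool the (L x f) outputs into a fixed d_llm-dim vector.
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5EncoderModel.from_pretrained("t5-small").eval()

def phi_llm(x_str: str) -> torch.Tensor:
    inputs = tokenizer(x_str, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs).last_hidden_state         # shape (1, L, f)
    mask = inputs["attention_mask"].unsqueeze(-1)       # shape (1, L, 1)
    pooled = (out * mask).sum(dim=1) / mask.sum(dim=1)  # average pooling
    return pooled.squeeze(0)                            # shape (d_llm,), d_llm = f

embedding = phi_llm("{param1:0.32,param2:-4.21}")
print(embedding.shape)  # torch.Size([512]) for T5-Small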

While it is straightforward to assume that the whole process outlined for LLMs should constitute the
definition of a language model embedding 𝜙LLM , it is not obvious how much each of these steps may
contribute to the final regression result. For instance, one could simply skip applying a forward pass
in step (3) and pool the soft prompt directly, or use a randomly initialized model as opposed to a
pretrained one. We extensively study this case in Section 3.3.


2.1. Modeling

To minimize confounding factors and maintain fairness during comparisons, we use the exact
same MLP prediction head (2 hidden layers, ReLU activation), loss (mean squared error), and
𝑦 -normalization scheme (shifting by the empirical mean and dividing by the empirical standard deviation), regardless of whether 𝜙LLM or 𝜙trad is used. Note, however, that the embedding dimensions of the two representations
may be different, and so we distinguish them using notation 𝑑llm and 𝑑trad respectively, where typically
𝑑llm ≫ 𝑑trad . Further details can be found in Appendix B.1.

To demonstrate consistent results over different families of language models, we benchmark over both
the T5 (Raffel et al., 2020) and Gemini 1.0 (Google, 2024) families, which use different architectures
(encoder-decoder and decoder-only), different vocabulary sizes (32K and 256K), and embedding
dimensions (see Appendix B.2) respectively. However, to remain consistent with the definition of
embedding, we follow previous literature (Li et al., 2020; Reimers and Gurevych, 2019) and use
average-pooling as the canonical method of aggregating Transformer outputs, and thus the embedding
dimension 𝑑llm is equivalent to the output feature dimension 𝑓 following a forward pass.
Similar to previous work (Nguyen et al., 2024; Song et al., 2024), for string representations of 𝑥 from
any regression task, by default we use a key-value JSON format with consistent ordering of keys, i.e.
{param1:value1,param2:value2,...}, with specific examples shown in Appendix C.
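
A minimal serializer of this form is sketched below; the quoting and key-ordering conventions here are assumptions made for illustration, with actual examples shown in Appendix C.

def to_string(params: dict) -> str:
    """Serialize an input x into a key-value string with consistent
    (here: sorted) key ordering; quoting of strings is an illustrative choice."""
    items = []
    for key in sorted(params):
        value = params[key]
        rendered = f"'{value}'" if isinstance(value, str) else f"{value}"
        items.append(f"{key}:{rendered}")
    return "{" + ",".join(items) + "}"

print(to_string({"batch_size": 128, "activation_fn": "selu", "dropout": 0.071}))
# {activation_fn:'selu',batch_size:128,dropout:0.071}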

2.2. Regression Tasks

For regression tasks, we first use synthetic, closed-form objective functions in order to produce
controlled studies in which we may query any 𝑥 from the input space. Our synthetic functions are
defined from the standard Black-Box Optimization Benchmarking (BBOB) suite (Elhara et al., 2019).
To avoid confounding terminology between embedding "dimension" 𝑑 and the intrinsic "dimension"
of an objective 𝑓 , we denote the latter as "degree-of-freedom" (DOF), and thus 𝑓 (·) is dependent on
input coordinates 𝑥 (1) , . . . , 𝑥 (DOF) , each of which lies in [−5, 5]. This provides a comprehensive
variety of both convex and non-convex objective landscapes to regress upon.
We further use real-world regression tasks representative of those encountered in the wild and in
industry settings by benchmarking over offline objective evaluations found in Google Vizier (Golovin
et al., 2017), which optimizes Google’s largest production and research systems. These consist of four
families, with each family containing at least 50 individual yet similar regression tasks. The families
are:
• AutoML (Google Cloud, 2023): Automated Machine Learning platform for automating TFX
(Google, 2023) pipelines (e.g. batch size, activation, layer counts) over tabular or text data.
• Init2Winit (Dahl et al., 2023): Learning rate scheduling parameters influencing common image
classification tasks (e.g. ResNets on CIFAR-10 and ImageNet).
• XLA (Phothilimthana et al., 2021): Tuning for the Accelerated Linear Algebra (XLA) compiler
which affects LLM serving latencies.
• L2DA (Yazdanbakhsh et al., 2021): "Learning to Design Accelerators", for improving accelerators
such as TPUs and corresponding computer architectures to improve hardware performance.
In the real world regression tasks, each parameter may be continuous or categorical, and we define
the DOF of such a task by its number of parameters. Note that for synthetic objectives, where all inputs
are continuous, 𝑑trad = DOF. However, for real-world tasks with categorical parameters, 𝑑trad > DOF
due to additional one-hot encodings.
For obtaining data, we may either sample ( 𝑥, 𝑦 ) pairs (in the case of synthetic objectives where 𝑥
are sampled from X), or use the given offline data (in the case of real-world tasks, where they were actual evaluations from an optimization trajectory), using a standard 8-1-1 train-validation-test split.
Due to the inherent differing of metric scales across tasks, it would be inappropriate to aggregate
results based on scale-dependent metrics such as mean squared error (MSE). Furthermore, we found
that the selection of the regression metric (e.g. Kendall-Tau, Pearson, mean squared error, mean
absolute error) did not matter for comparisons, as they all strongly correlated with each other. Thus,
by default we report the Kendall-Tau ranking correlation, which always lies within [−1, 1] and can be
aggregated across different tasks.
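
As a concrete illustration of the evaluation, the sketch below computes Kendall-Tau with scipy; because the metric depends only on the ranking of predictions, it is insensitive to each task's metric scale.

import numpy as np
from scipy.stats import kendalltau

def evaluate(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Ranking-based metric: invariant to monotone rescaling of predictions."""
    tau, _ = kendalltau(y_true, y_pred)
    return tau

# Toy example: predictions that preserve the ranking achieve tau = 1.0
# even though their scale is completely different from the targets.
y_true = np.array([3.0, 1.0, 2.0, 5.0])
y_pred = np.array([300.0, 100.0, 200.0, 500.0])
print(evaluate(y_true, y_pred))  # 1.0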

3. Experimental Results

3.1. High Dimensional Regression

We begin by demonstrating cases in which LLM embeddings better represent inputs over high DOF
spaces than traditional representations. In Figure 2, we show that for a subset of functions, LLM
embeddings possess surprising robustness, retaining the same performance for varying DOFs whereas
traditional baselines such as XGBoost and MLPs significantly falter over higher DOFs.

[Figure 2: Kendall-Tau correlation vs. degrees of freedom for the Sphere, RosenbrockRotated, Lunacek, NegativeSphere, SharpRidge, Gallagher101Me, Gallagher21Me, GriewankRosenbrock, SchaffersF7, SchaffersF7IllConditioned, Rastrigin, and StepEllipsoidal functions, comparing XGBoost, MLP, Gemini Pro, and T5-XXL.]

Figure 2 | Higher (↑) is better. Degrees of freedom (DOF) vs Kendall-Tau correlation for various BBOB functions. Results are averaged over 12 runs for each regression method. Each task's data consists of 500 ( 𝑥, 𝑦 ) evaluations sampled uniformly across the input space, using an 8-1-1 split for train-validation-test.

This result is not universal, however: as we show in Appendix A.1, the pattern does not hold for some functions, but it does occur for the majority of BBOB functions. We further
corroborate this observation over real-world tasks in Table 1. We see that in general, regressions on
LLM embeddings outperform traditional methods more often for tasks with higher DOFs (AutoML
and XLA).


Task Name Avg. DOF T5-Small % T5-XXL % Gemini Nano % Gemini Pro %
Init2Winit 4 6.7 8.0 11.3 19.0
L2DA 10 2.7 12.0 9.3 10.7
AutoML 29 30.7 41.3 29.3 36.0
XLA 35 17.2 29.3 18.9 24.1

Table 1 | Percentage of tasks in which 𝜙LLM outperforms 𝜙trad across various real world regression tasks. Results
reported for 75 tasks per family, except for XLA, which only contains 58 tasks. Full results in Appendix A.2.

3.2. LLM Embedding Smoothness

Particularly due to the discrete nature of tokenization, it is non-obvious whether LLM embeddings
possess a notion of continuity in embedding space. For example, assuming character-wise tokenization,
1.234 is not so numerically distant from 1.567, but is token-wise distant, as the majority of the
tokens (234 and 567) are not shared.
The notion of continuity and smoothness is crucial for neural network generalization (Kalimeris et al.,
2019; Neyshabur et al., 2018), robustness (Weng et al., 2018), vulnerability to adversarial examples
(Goodfellow et al., 2015), and more. We can characterize smoothness in the regression case by the
Lipschitz-continuity induced by a representation 𝜙 in its latent space ℝ𝑑 .

[Figure 3: Histograms (count vs. normalized Lipschitz factor) for Sphere, RosenbrockRotated, Lunacek (top) and Schwefel, LinearSlope, Discus (bottom), each at DOF=100, comparing LLM and Traditional representations.]

Figure 3 | Left-skewness (←) is better. NLFDs induced by 𝜙LLM (T5-XXL) and 𝜙trad . Top: Cases where 𝜙LLM outperforms 𝜙trad for regression. Bottom: Vice-versa, where 𝜙trad outperforms 𝜙LLM .

Intuitively, similar inputs should lead to similar objective values, which can be quantified inversely by
the Lipschitz factor 𝐿 ( 𝑥, 𝑥 ′ ) = ∥ 𝑓 ( 𝑥 ) − 𝑓 ( 𝑥 ′ )∥ /∥ 𝜙 ( 𝑥 ) − 𝜙 ( 𝑥 ′ )∥ with respect to a representation 𝜙 and
∥·∥ norm. We emphasize to the reader that the input space X does not actually have an explicit notion
of distance on its own. Instead, traditionally it has always been assumed that the distance was defined
canonically by Euclidean distance over the traditional embedding method, i.e. ∥ 𝜙trad ( 𝑥 ) − 𝜙trad ( 𝑥 ′ ) ∥ 2
as demonstrated by common use of Euclidean-based radial basis and Matern kernels (Genton, 2002)
during regression modeling. However, as seen from the results previously, it may be the case that
𝜙trad is suboptimal for some regression tasks.


In order to analyze the continuity of an embedding 𝜙 with respect to offline data D, we define a
Normalized Lipschitz Factor Distribution (NLFD) as follows:
1. Full-batch normalize, i.e. apply shifting and scaling to each 𝜙 ( 𝑥 ) so that in aggregate, D has
zero mean and unit variance per coordinate.
2. For each 𝑥 ∈ D, choose 𝑥 ′ ∈ D such that 𝜙 ( 𝑥 ′ ) is the nearest ℓ2 neighbor of 𝜙 ( 𝑥 ), and compute
the Lipschitz factor 𝐿 ( 𝑥, 𝑥 ′ ).
3. To assume an average embedding norm of 1 for different embedding dimensions 𝑑 , we downscale all Lipschitz factors by √𝑑 (a sketch of this computation follows the list).
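
The following NumPy sketch implements steps 1-3, pairing each point with its nearest ℓ2 neighbor in the normalized embedding space; the brute-force pairwise distance computation is an illustrative choice that assumes the dataset fits in memory.

import numpy as np

def nlfd(phi_x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Normalized Lipschitz Factor Distribution for embeddings phi_x of shape
    (n, d) and objective values y of shape (n,)."""
    n, d = phi_x.shape
    # Step 1: full-batch normalize to zero mean and unit variance per coordinate.
    z = (phi_x - phi_x.mean(axis=0)) / (phi_x.std(axis=0) + 1e-12)
    # Step 2: nearest l2 neighbor in embedding space (excluding the point itself).
    dists = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    nn = dists.argmin(axis=1)
    lipschitz = np.abs(y - y[nn]) / dists[np.arange(n), nn]
    # Step 3: downscale by sqrt(d) so different embedding dims are comparable.
    return lipschitz / np.sqrt(d)

# Example: NLFD of raw coordinates on a Sphere objective with DOF = 10.
rng = np.random.default_rng(0)
X = rng.uniform(-5, 5, size=(200, 10))
y = (X ** 2).sum(axis=1)
print(nlfd(X, y).mean())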
We observe a strong relationship between how much the NLFD skews toward zero and regression
performance. Specifically, in Figure 3, when 𝜙LLM outperforms 𝜙trad for regression, 𝜙LLM ’s distribution
of Lipschitz factors also tends to skew relatively more to zero than 𝜙trad , and vice-versa.
[Figure 4: NLFD gap (Z-score) vs. regression performance gap (Kendall). Top row (DOF=100): T5-Small (K 0.64, S 0.83, P 0.77), T5-Large (K 0.71, S 0.88, P 0.79), T5-XL (K 0.79, S 0.92, P 0.86), T5-XXL (K 0.75, S 0.91, P 0.83). Bottom row (Gemini Pro): DOF=10 (K 0.58, S 0.79, P 0.77), DOF=25 (K 0.54, S 0.67, P 0.73), DOF=50 (K 0.65, S 0.78, P 0.82), DOF=100 (K 0.75, S 0.93, P 0.88).]

Figure 4 | Relationship between gaps in NLFD (via Z-score) and regression performance for all 23 BBOB functions. The relationship is quantified using (K, S, P), which are respectively the Kendall-Tau, Spearman, and Pearson correlations. Top: We vary model size within the T5 model family. Bottom: We vary the objective's DOF for Gemini Pro.

To formally quantify comparisons between NLFDs from 𝜙LLM and 𝜙trad , for a fixed regression task,
we may thus compute the Z-score using the difference of the two distributions:
𝑍 = (𝜇𝜙trad − 𝜇𝜙LLM) / √(𝜎𝜙trad² + 𝜎𝜙LLM²)    (1)

where 𝜇𝜙 and 𝜎𝜙 are respectively the mean and standard deviation of the NLFD of a representation 𝜙.
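
Given two NLFDs as arrays (e.g. produced by the sketch above), Eq. (1) reduces to a one-line computation; under this sign convention, a positive 𝑍 indicates that 𝜙LLM induces the smoother (more zero-skewed) distribution.

import numpy as np

def nlfd_gap_zscore(nlfd_trad: np.ndarray, nlfd_llm: np.ndarray) -> float:
    """Z-score of Eq. (1) comparing the NLFDs of two representations."""
    return float(
        (nlfd_trad.mean() - nlfd_llm.mean())
        / np.sqrt(nlfd_trad.var() + nlfd_llm.var())
    )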
We may then observe the relationship between gaps in representation smoothness vs. regression
performance. In Figure 4 with extended results in Appendix A.3, we see that for a given BBOB
regression task, the Z-score (i.e. gap in embedding smoothness) is highly correlated with the gap in
regression performance, regardless of the model used (T5 or Gemini) or the DOF of the underlying
objective 𝑓 .
We further visualize whether 𝜙LLM is distance aware, i.e. whether 𝜙LLM ( 𝑥 ) and 𝜙LLM ( 𝑥 ′ ) are close in
embedding space if 𝜙trad ( 𝑥 ) and 𝜙trad ( 𝑥 ′ ) are close. As mentioned before however, there is no ground
truth notion of "closeness" - nonetheless, we use 𝜙trad as a point of comparison. Since it is inappropriate to simply sample 𝑥 's uniformly in a high DOF space, as then average distances concentrate around √DOF, we instead take a reference point and sample points from ℓ2 -balls of increasing distance from the reference.
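
One way to realize this sampling scheme is sketched below: drawing random unit directions and scaling them places points at a controlled ℓ2 distance from the reference (i.e. on the spherical shells of the ℓ2-balls); the radii used here are illustrative.

import numpy as np

def sample_at_distance(reference: np.ndarray, radius: float, n: int,
                       rng: np.random.Generator) -> np.ndarray:
    """Sample n points at l2 distance `radius` from the reference point by
    drawing random directions on the unit sphere and scaling them."""
    directions = rng.normal(size=(n, reference.shape[0]))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    return reference + radius * directions

rng = np.random.default_rng(0)
reference = rng.uniform(-5, 5, size=100)   # DOF = 100 reference point
shells = {r: sample_at_distance(reference, r, n=20, rng=rng) for r in (5, 15, 30, 45)}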
In Figure 5, we see that distances over the LLM embedding space are correlated with the traditional
measure of distance, but may be non-linearly warped, which benefits LLM-based regression in certain
cases as seen in Section 3.1.
[Figure 5: t-SNE Dimension 1 vs. t-SNE Dimension 2 for Gemini Nano (left) and Gemini Pro (right), colored by distance from the reference point.]

Figure 5 | t-SNE for Gemini (Nano and Pro) embeddings of points sampled around a DOF=100 reference point. Traditional ℓ2 distance is overlaid in color.

3.3. Model Effects

In this subsection, we comprehensively investigate the impact of many common LLM factors on
regression performance.
Are Larger Models Always Better? Within the research community, the prevailing assumption is
that there exists a direct correlation between language model size and performance improvement.
However, with the rise of leaderboards such as LMSYS (LMS, 2023), smaller models have been shown
to outperform larger competitors, due to differences in their "recipe", such as training data quality,
pre-training and post-training techniques, and architecture.

[Figure 6: Kendall-Tau correlation vs. model parameters (T5 Scaling, left) and model tier Nano/Pro/Ultra (Gemini Scaling, right) for the AutoML, Init2Winit, XLA, and L2DA families.]

Figure 6 | Higher (↑) is better. Model size vs regression performance on hyperparameter tuning tasks across the T5 and Gemini model families. Median performance is plotted, along with the 40-60 percentiles as error bars.

In Figure 6, we see that over various real world regression tasks, T5 models exhibit a clear trend of
improved performance when increasing model size, when training methodology is fixed. In contrast,
model tiers within the Gemini family exhibit substantial variance, and larger model sizes do not
consistently translate to superior results. We hypothesize this is due to differences in Gemini "recipes",
as e.g. different model tiers may have used different pre-training datasets, architecture tweaks, and
post-training configurations, whereas all T5 model sizes have only been pre-trained on the C4 web
crawl corpus.
Does Language Understanding Actually Help? Recent works (Devlin et al., 2019; Li et al., 2020)
have claimed that logit-based embeddings mostly measure the semantic similarity between string
inputs, and thus it is unconfirmed whether they may be beneficial for numeric regression tasks. To
resolve this, using the T5 family, we compare against using (1) a randomly initialized model for
the forward pass, and (2) representing our features via vocabulary embeddings without applying a
forward pass.
[Figure 7: Kendall-Tau correlation per task (AutoML, Init2Winit, XLA, L2DA) for T5-Small/Large/XL/XXL. Left: Pre-trained vs Random Init (T5-* Random). Right: Full Model vs Vocab Table (T5-* Vocab).]

Figure 7 | Kendall-Tau regression comparisons when comparing to random initialization (left) and vocabulary embeddings (right). Each bar is averaged across 75 tasks per family.

In Figure 7, we see that the default mode of applying a forward pass of a pre-trained model performs
the best, as expected. However, it is worth noting that in some tasks such as AutoML and L2DA, the
improvement is surprisingly quite minimal, suggesting that applying forward passes by pretrained
models does not always help for regression.
We further ablate differences in string representation, i.e. whether by default to show feature names as {param1:value1,param2:value2,...} or omit them, only showing [value1,value2,...]. In Figure 8, for the majority of tasks, omitting feature names does not significantly affect performance, although specific tasks such as XLA do benefit from feature names. This is surprising, as presumably feature names in XLA tasks such as auto_cross_replica_sharding are not as common as names such as batch_size or learning_rate found in both AutoML and Init2Winit.

[Figure 8: Kendall-Tau correlation difference per task (AutoML, Init2Winit, XLA, L2DA) for T5-Small, T5-Large, T5-XL, and T5-XXL.]

Figure 8 | Difference in Kendall correlation when using the full dictionary containing feature names, or only values.
The results of Figures 7 and 8 combined lead to additional surprising conclusions, such as language-
to-numeric transfer. For instance, inputs 𝑥 from Init2Winit tasks only possess numeric values, and
as expected, removing feature names does not significantly change regression results. Yet applying forward passes by pre-trained T5 models still benefits regression, despite the fact that T5's pre-training
data contains mostly web-corpus data which is unlikely to contain significant amounts of scientific or
numeric information (Dodge et al., 2021).
More Training Data Reduces Baseline Gaps: Intuitively, as more samples are available in a task, the
difference in inductive biases between regression methods should matter less, since predictions will
be more influenced by training data. We verify this in Figure 9, where we see that for tasks with low
numbers of ( 𝑥, 𝑦 ) points, there is more variability in performance between using 𝜙LLM and 𝜙trad , but
additional training points decrease these differences.

[Figure 9: (T5-XXL − MLP) Kendall-Tau correlation vs. number of training points for AutoML (left) and XLA (right), showing mean and standard-deviation bands.]

Figure 9 | Performance gap between an MLP baseline and regression over T5-XXL embeddings for individual trials within the AutoML and XLA task settings. Higher (↑) is better for LLM embeddings. Error bars are plotted for {0.5, 1.0, 2.0} of the standard deviation.

4. Conclusion and Future Work


We thoroughly investigated multiple important aspects around the use of LLM embeddings for
traditional regression. We found that LLM embeddings can be quite performant for input spaces with
high degrees of freedom, and proposed the Lipschitz factor distribution to understand the embedding-
to-objective landscape and its relationship to regression performance. We further investigated the
nuanced conditions for which better language understanding does improve LLM-based regression.
Since strings, and more generally tokens, can represent many varieties of data, it is worth further un-
derstanding the effects of LLM embeddings over non-tabular forms of inputs, including combinatorial
objects such as graphs, and even other modalities such as images and videos.

Acknowledgments

We thank Yutian Chen, Daniel Golovin, Chansoo Lee, Tung Nguyen, and Sagi Perel for relevant
discussions during experimentation and the writing of this paper. We further thank the organizers of
the Google DeepMind Academy Program for providing the opportunity to do this research.


References
LMSYS: Large model systems organization, 2023. URL https://lmsys.org/.

Anthropic. The claude 3 model family: Opus, sonnet, haiku. 2024.

G. E. Dahl, F. Schneider, Z. Nado, N. Agarwal, C. S. Sastry, P. Hennig, S. Medapati, R. Eschenhagen, P. Kasimbeg, D. Suo, J. Bae, J. Gilmer, A. L. Peirson, B. Khan, R. Anil, M. Rabbat, S. Krishnan, D. Snider, E. Amid, K. Chen, C. J. Maddison, R. Vasudev, M. Badura, A. Garg, and P. Mattson. Benchmarking neural network training algorithms. CoRR, abs/2306.07179, 2023.

J. Devlin, M. Chang, K. Lee, and K. Toutanova. BERT: pre-training of deep bidirectional transformers
for language understanding. In J. Burstein, C. Doran, and T. Solorio, editors, Proceedings of the
2019 Conference of the North American Chapter of the Association for Computational Linguistics:
Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1
(Long and Short Papers), pages 4171–4186. Association for Computational Linguistics, 2019. doi:
10.18653/V1/N19-1423.

J. Dodge, M. Sap, A. Marasovic, W. Agnew, G. Ilharco, D. Groeneveld, M. Mitchell, and M. Gardner. Documenting large webtext corpora: A case study on the colossal clean crawled corpus. In M. Moens, X. Huang, L. Specia, and S. W. Yih, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pages 1286–1305. Association for Computational Linguistics, 2021.

O. Elhara, K. Varelas, D. Nguyen, T. Tusar, D. Brockhoff, N. Hansen, and A. Auger. Coco: the large scale
black-box optimization benchmarking (bbob-largescale) test suite. arXiv preprint arXiv:1903.06396,
2019.

M. G. Genton. Classes of kernels for machine learning: a statistics perspective. J. Mach. Learn. Res.,
2:299–312, Mar. 2002. ISSN 1532-4435.

D. Golovin, B. Solnik, S. Moitra, G. Kochanski, J. Karro, and D. Sculley. Google vizier: A service
for black-box optimization. In Proceedings of the 23rd ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, Halifax, NS, Canada, August 13 - 17, 2017, pages 1487–1495.
ACM, 2017.

I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In Y. Bengio and Y. LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.

Google. Tfx: A tensorflow-based production-scale machine learning platform. https://www.tensorflow.org/tfx, 2023. Accessed: November 1, 2024.
Google. Gemini: A family of highly capable multimodal models, 2024.

Google Cloud. Vertex ai automl. https://cloud.google.com/vertex-ai/docs/start/automl-intro, 2023. Accessed: November 1, 2024.
D. Kalimeris, G. Kaplun, P. Nakkiran, B. L. Edelman, T. Yang, B. Barak, and H. Zhang. SGD on neural
networks learns functions of increasing complexity. In H. M. Wallach, H. Larochelle, A. Beygelzimer,
F. d’Alché-Buc, E. B. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems
32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December
8-14, 2019, Vancouver, BC, Canada, pages 3491–3501, 2019.


V. Karpukhin, B. Oguz, S. Min, P. S. H. Lewis, L. Wu, S. Edunov, D. Chen, and W. Yih. Dense passage
retrieval for open-domain question answering. In B. Webber, T. Cohn, Y. He, and Y. Liu, editors,
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP
2020, Online, November 16-20, 2020, pages 6769–6781. Association for Computational Linguistics,
2020.

A. Kristiadi, F. Strieth-Kalthoff, M. Skreta, P. Poupart, A. Aspuru-Guzik, and G. Pleiss. A sober look at llms for material discovery: Are they actually good for bayesian optimization over molecules? In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024.

B. Li, H. Zhou, J. He, M. Wang, Y. Yang, and L. Li. On the sentence embeddings from pre-trained
language models. In B. Webber, T. Cohn, Y. He, and Y. Liu, editors, Proceedings of the 2020
Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November
16-20, 2020, pages 9119–9130. Association for Computational Linguistics, 2020.

Q. Liu, M. J. Kusner, and P. Blunsom. A survey on contextual embeddings. CoRR, abs/2003.07278, 2020.

B. Neyshabur, S. Bhojanapalli, and N. Srebro. A pac-bayesian approach to spectrally-normalized margin bounds for neural networks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net, 2018.

T. Nguyen, Q. Zhang, B. Yang, C. Lee, J. Bornschein, Y. Miao, S. Perel, Y. Chen, and X. Song. Predicting
from strings: Language model embeddings for bayesian optimization, 2024.

OpenAI. GPT-4 technical report. CoRR, abs/2303.08774, 2023.

P. M. Phothilimthana, A. Sabne, N. Sarda, K. S. Murthy, Y. Zhou, C. Angermueller, M. Burrows, S. Roy, K. Mandke, R. Farahani, Y. E. Wang, B. Ilbeyi, B. A. Hechtman, B. Roune, S. Wang, Y. Xu, and S. J. Kaufman. A flexible approach to autotuning multi-pass machine learning compilers. In J. Lee and A. Cohen, editors, 30th International Conference on Parallel Architectures and Compilation Techniques, PACT 2021, Atlanta, GA, USA, September 26-29, 2021, pages 1–16. IEEE, 2021.

C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. Exploring
the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:
140:1–140:67, 2020.

N. Reimers and I. Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. In K. Inui, J. Jiang, V. Ng, and X. Wan, editors, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 3980–3990. Association for Computational Linguistics, 2019.

X. Song, O. Li, C. Lee, B. Yang, D. Peng, S. Perel, and Y. Chen. Omnipred: Language models as
universal regressors. CoRR, abs/2402.14547, 2024.

R. Vacareanu, V. Negru, V. Suciu, and M. Surdeanu. From words to numbers: Your large language
model is secretly A capable regressor when given in-context examples. CoRR, abs/2404.07544,
2024.


T. Weng, H. Zhang, P. Chen, J. Yi, D. Su, Y. Gao, C. Hsieh, and L. Daniel. Evaluating the robustness of
neural networks: An extreme value theory approach. In 6th International Conference on Learning
Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track
Proceedings. OpenReview.net, 2018.

A. Yazdanbakhsh, C. Angermüller, B. Akin, Y. Zhou, A. Jones, M. Hashemi, K. Swersky, S. Chatterjee, R. Narayanaswami, and J. Laudon. Apollo: Transferable architecture exploration. CoRR, abs/2102.01723, 2021.


Appendix
A. Extended Experiments

A.1. High Dimensional Regression

For full transparency, in Figure 10 we display BBOB functions where LLM-based regression was not
consistently dimensionally robust against MLP and XGBoost baselines. Note that even in these cases,
we still see certain cases where a language model outperforms at least one of the baselines, e.g. in
the Discus and DifferentPowers functions, Gemini and T5 outperform MLP but not XGBoost.

[Figure 10: Kendall-Tau correlation vs. degrees of freedom for DifferentPowers, Ellipsoidal, LinearSlope, Discus, BentCigar, BuecheRastrigin, AttractiveSector, Schwefel, Weierstrass, Katsuura, and NegativeMinDifference, comparing XGBoost, MLP, Gemini Pro, and T5-XXL.]

Figure 10 | Following Figure 2 in the main body, we present BBOB functions in which LLM embeddings did not completely outperform traditional baselines.


A.2. Real World Results

Despite Table 1 of the main body showing that there were numerous cases where LLM embeddings
outperform traditional ones, we remind the reader in Figure 11 that on average, LLM embeddings still
slightly underperform.

[Figure 11: Full real-world regression results. Kendall-Tau correlation per task family (AutoTFX, Init2Winit, XLA, L2DA) for XGBoost, MLP, Gemini Nano/Pro/Ultra, and T5-Small/Large/XL/XXL.]

Figure 11 | Full results over real-world tasks. Displayed is the mean Kendall-Tau correlation over all tasks within each family.

A.3. Performance Correlations

Following Figure 4, in Table 2, we see that the relationship between the smoothness induced by the
embedding and the performance in regression is consistent throughout.

Model DOF=5 DOF=10 DOF=25 DOF=50 DOF=100


Gemini Nano 0.81 0.81 0.70 0.75 0.86
Gemini Pro 0.78 0.77 0.72 0.82 0.88
T5-Small 0.75 0.76 0.79 0.79 0.76
T5-Large 0.78 0.73 0.79 0.85 0.79
T5-XL 0.82 0.60 0.80 0.86 0.85
T5-XXL 0.72 0.76 0.82 0.83 0.83

Table 2 | Full set of data for Pearson correlation 𝜌 between Kendall’s regression performance and
gap in NLFD between input and embedding space for regression on all 23 BBOB functions, over
DOF=[5, 10, 25, 50, 100].


B. Exact Modeling Details

B.1. Hyperparameters Used

The full list of hyperparameters and training details for MLP-based regression (using either traditional or language model features) is as follows; a minimal training sketch appears after the list:
• Regression Head: MLP with 2 ReLU hidden layers of dimension 256.
• 𝑦 -Normalization: We compute the empirical mean 𝜇 and standard deviation 𝜎 over all 𝑦 -values
in the task’s training data, and apply 𝑦 ← ( 𝑦 − 𝜇 )/𝜎 as a preprocessing step.
• Optimizer: AdamW with learning rates swept across {1e-4, 5e-4, 1e-3, 5e-3, 1e-2} and
weight decay across {0, 1e-1, 1}.
• Loss: Mean Squared Error.
• Maximum Epochs: 300, with early stopping enabled.
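
The following PyTorch sketch mirrors the configuration above (2 ReLU hidden layers of width 256, AdamW, mean squared error, and 𝑦-normalization); it is a simplified stand-in for the actual training code and omits the learning-rate/weight-decay sweep, validation split, and early stopping.

import torch
import torch.nn as nn

def make_head(d_in: int, hidden: int = 256) -> nn.Module:
    # Regression head: 2 ReLU hidden layers of dimension 256, scalar output.
    return nn.Sequential(
        nn.Linear(d_in, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, 1),
    )

def train(head: nn.Module, X: torch.Tensor, y: torch.Tensor,
          lr: float = 1e-3, weight_decay: float = 0.0, epochs: int = 300):
    y = (y - y.mean()) / y.std()  # y-normalization
    opt = torch.optim.AdamW(head.parameters(), lr=lr, weight_decay=weight_decay)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):  # early stopping on a validation split omitted
        opt.zero_grad()
        loss = loss_fn(head(X).squeeze(-1), y)
        loss.backward()
        opt.step()
    return head

X = torch.randn(500, 4096)   # e.g. d_llm = 4096 for T5-XXL embeddings
y = torch.randn(500)
head = train(make_head(X.shape[1]), X, y)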
For XGBoost, we additionally grid-searched over the following parameters for each task (a sketch of this search follows the list):
• "min_child_weight": [1, 5, 10]
• "learning_rate": [0.001, 0.01, 0.1]
• "gamma": [0.0, 0.3, 0.5]
• "subsample": [0.6, 0.8, 1.0]
• "colsample_bytree": [0.6, 0.8, 1.0]
• "max_depth": [3, 5, 7]
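
This grid search can be reproduced with xgboost's scikit-learn wrapper and GridSearchCV, as sketched below; the cross-validation setup and scoring choice are assumptions, since the paper selects hyperparameters per task using its own validation split.

from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

param_grid = {
    "min_child_weight": [1, 5, 10],
    "learning_rate": [0.001, 0.01, 0.1],
    "gamma": [0.0, 0.3, 0.5],
    "subsample": [0.6, 0.8, 1.0],
    "colsample_bytree": [0.6, 0.8, 1.0],
    "max_depth": [3, 5, 7],
}

search = GridSearchCV(
    XGBRegressor(objective="reg:squarederror"),
    param_grid,
    scoring="neg_mean_squared_error",  # assumption; any regression score works
    cv=3,
)
# search.fit(X_train, y_train)   # X_train: traditional features, y_train: targets
# best_model = search.best_estimator_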

B.2. Embedding Sizes

Table 3 displays the embedding 𝑑llm for each model used in our experiments. As mentioned in the
main text, note that 𝑑llm is significantly larger than 𝑑trad .

T5 Model    𝑑llm        Gemini Model    𝑑llm
Small       512         Nano            1536
Large       1024        Pro             6144
XL          2048        Ultra           14336
XXL         4096

Table 3 | Embedding dimensions 𝑑llm for the T5 and Gemini model families.


C. Example String Representations


Table 4 contains example string representations of 𝑥 for different regression task families.

Task Family Example Representations


BBOB x0:0.32, x1:-4.21, x2:3.12, x3:1.56
AutoML batch_size:128, ml_feature_selection_threshold:0.05,
model_type:’DNN_ESTIMATOR’, activation_fn:’selu’, batch_norm:’False’,
bucketization_strategy:’mdl’,dropout:0.071, hidden_units:359
Init2Winit lr_hparams.base_lr:0.0696, opt_hparams.0.hps.one_minus_b1:0.2823,
opt_hparams.0.hps.one_minus_b2:0.0432,
opt_hparams.1.hps.weight_decay:0.0023
XLA auto_cross_replica_sharding:’False’,
rematerialization_percent_shared_memory_limit:97,
spmd_threshold_for_windowed_einsum_mib:100000, ...
L2DA input_activation_memory_depth:11.0, instruction_memory_depth:15.0,
io_bandwidth_gbps:4.321, narrow_memory_capacity_bytes:21.0, ...

Table 4 | Example 𝑥 representations from each of the regression task families. ‘. . . ’ denotes that there
are actually more parameters, but we omit them due to length.
