
Understanding LLM Embeddings for Regression
Eric Tang∗1 , Bangding Yang2 and Xingyou Song3
1 Stanford University, 2 Google, 3 Google DeepMind

∗ Work performed during the Google DeepMind Academy Program.

With the rise of large language models (LLMs) for flexibly processing information as strings, a natural
application is regression, specifically by preprocessing string representations into LLM embeddings as
downstream features for metric prediction. In this paper, we provide one of the first comprehensive
investigations into embedding-based regression and demonstrate that LLM embeddings as features
can be better for high-dimensional regression tasks than using traditional feature engineering. This
regression performance can be explained in part due to LLM embeddings over numeric data inherently
preserving Lipschitz continuity over the feature space. Furthermore, we quantify the contribution of
different model effects, most notably model size and language understanding, which we find surprisingly
do not always improve regression performance.

1. Introduction and Related Work


Regression is a fundamental statistical tool used to model the relationship between a metric and a
selected set of features, playing a crucial role in various fields, enabling predictions, forecasting, and
the understanding of underlying relationships within data. Traditional regression techniques often
rely on handcrafted features or domain-specific knowledge to represent input data. However, the
advent of Large Language Models (LLMs) and their ability to instead process semantic representations
of text has raised the question of whether regression can instead be performed over free-form text.
Previous works have predominantly examined the topic of LLM-based regression through decoding, i.e. generating floating point predictions using token-based sampling. For example, (Song et al., 2024) examines the case when the model is fully accessible and fine-tunable against data, while (Vacareanu et al., 2024) study the ability of service-based closed-source LLMs such as GPT-4 using in-context learning. One understudied case however is the use of service-based LLM embeddings - fixed vector representations derived from pre-trained (but frozen) language models, which are ubiquitously offered among most LLM services (Anthropic, 2024; Google, 2024; OpenAI, 2023). Although they are used frequently in recent applications such as retrieval (Karpukhin et al., 2020), semantic similarity (Li et al., 2020), and a variety of other downstream language tasks (Liu et al., 2020), there has been very little fundamental research around their use in regression, outside of specific applications such as Bayesian Optimization (Kristiadi et al., 2024; Nguyen et al., 2024).

Figure 1 | Rugged surface of a 5D Sphere function when inputs are represented as Gemini embeddings of dimension 6K+, post-processed by t-SNE into 2D space.


In contrast to decoding-based regression techniques, embedding-based regression allows the possibility of cheap data-driven training using inexpensive and customizable post-embedding layers such as
multi-layer perceptrons (MLPs). However, as shown in Figure 1, when the domain of a simple function
is expressed using high-dimensional embeddings, unexpected characteristics and irregularities can
arise, prompting the need for a thorough analysis. Furthermore, LLMs by default are not explicitly
trained for embedding-based regression, rather purely for token generation, and thus it is worth
analyzing the emergent behaviors of LLM embeddings when applied to regression.
This paper investigates the behavior of these LLM embeddings when used as features for standard
tabular regression tasks. Most notably, our findings are:
• LLM embeddings are dimensionally robust, i.e. regression performance can remain strong even
over high-dimensional data, whereas traditional representations significantly suffer.
• Over numeric formats, LLM embeddings preserve Lipschitz-continuity and smoothness over
feature space, which naturally enables regression when using a downstream MLP head.
• Factors which directly impact language understanding (e.g. size, pre-training, and input
formatting) have more nuanced effects for regression and do not always provide significantly
better outcomes.

2. Problem and Methodology


A regression task T = ( 𝑓 , X , D) consists of an underlying scalar-valued function 𝑓 : X → ℝ over
an input space X. Provided are offline training data D𝑡𝑟𝑎𝑖𝑛 = {( 𝑥1 , 𝑦1 ) , ..., ( 𝑥𝑇 , 𝑦𝑇 )} collected from
querying 𝑓 and an analogous test set D𝑡𝑒𝑠𝑡 for evaluation. Given access to training data D𝑡𝑟𝑎𝑖𝑛 , the
goal is to obtain accurate predictions over test points ( 𝑥, 𝑦 ) ∈ D𝑡𝑒𝑠𝑡 , usually measured by an aggregate
performance measure, e.g. mean squared error or Kendall-Tau ranking scores.
Required by nearly all learnable regression methods are features, which we assume come from an
embedder 𝜙 : X → ℝ𝑑 which takes an input 𝑥 and returns a fixed-dimensional feature representation,
of dimension 𝑑 . Here, we use the terms "features" and "embedding" interchangeably, since traditional
methods typically use a canonical, manually defined feature engineering method for tabular data, in
which continuous values are normalized and categorical selections are one-hot encoded. This feature
vector 𝜙 ( 𝑥 ) is then sent to a downstream predictor, e.g. MLP or random forest, which is trained using
a loss function such as mean squared error.
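
To make the traditional baseline concrete, below is a minimal Python sketch of such a feature-engineering embedder 𝜙trad, assuming z-score normalization for continuous parameters and one-hot encoding for categorical ones; the parameter names and vocabulary are hypothetical and purely illustrative.

import numpy as np

def phi_trad(x, cont_keys, cat_keys, cat_vocab, cont_mean, cont_std):
    """Traditional embedder sketch: z-score continuous parameters and
    one-hot encode categorical ones (keys and vocabularies are illustrative)."""
    cont = np.array([(x[k] - cont_mean[k]) / cont_std[k] for k in cont_keys])
    onehots = []
    for k in cat_keys:
        vec = np.zeros(len(cat_vocab[k]))
        vec[cat_vocab[k].index(x[k])] = 1.0
        onehots.append(vec)
    return np.concatenate([cont] + onehots)

# Hypothetical 2-parameter input: d_trad = 1 (continuous) + 3 (one-hot) = 4.
x = {"batch_size": 128.0, "activation_fn": "selu"}
features = phi_trad(
    x,
    cont_keys=["batch_size"],
    cat_keys=["activation_fn"],
    cat_vocab={"activation_fn": ["relu", "selu", "tanh"]},
    cont_mean={"batch_size": 256.0},
    cont_std={"batch_size": 128.0},
)
print(features)  # [-1.  0.  1.  0.]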
Language models also provide a canonical definition of embedding, which typically consists of, in
order:
1. Tokenizing a string representation 𝑥 into 𝐿 tokens.
2. Obtaining a "soft prompt" ℝ𝐿× 𝑣 via vocabulary look-up.
3. Applying a forward pass of a Transformer to obtain an output ℝ𝐿× 𝑓 .
4. Pooling down to a fixed dimension vector in ℝ𝑑 .
Afterwards, one may also attach an MLP predictor head and apply an analogous training procedure
as in the traditional case. Thus we can see that the only difference becomes the input representation
𝜙, i.e. whether we used a traditional 𝜙trad or LLM-based 𝜙LLM .
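
As an illustration of steps (1)-(4), the following sketch builds 𝜙LLM with the open-source T5-Small encoder from the Hugging Face transformers library; this is a stand-in, not the paper's actual embedding infrastructure, and average pooling over non-padding tokens yields 𝑑llm equal to the encoder's output width 𝑓.

import torch
from transformers import T5Tokenizer, T5EncoderModel

# (1) tokenize, (2) look up token embeddings internally, (3) apply a forward
# pass, (4) average-pool the (L x f) outputs into a fixed d_llm-dim vector.
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5EncoderModel.from_pretrained("t5-small").eval()

def phi_llm(x_str: str) -> torch.Tensor:
    inputs = tokenizer(x_str, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs).last_hidden_state         # shape (1, L, f)
    mask = inputs["attention_mask"].unsqueeze(-1)       # shape (1, L, 1)
    pooled = (out * mask).sum(dim=1) / mask.sum(dim=1)  # average pooling
    return pooled.squeeze(0)                            # shape (d_llm,), d_llm = f

embedding = phi_llm("{param1:0.32,param2:-4.21}")
print(embedding.shape)  # torch.Size([512]) for T5-Small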

While it is straightforward to assume that the whole process outlined for LLMs should constitute the
definition of a language model embedding 𝜙LLM , it is not obvious how much each of these steps may
contribute to the final regression result. For instance, one could simply skip applying a forward pass
in step (3) and pool the soft prompt directly, or use a randomly initialized model as opposed to a
pretrained one. We extensively study this case in Section 3.3.


2.1. Modeling

To minimize confounding factors and maintain fairness during comparisons, we use the exact
same MLP prediction head (2 hidden layers, ReLU activation), loss (mean squared error), and
𝑦 -normalization scheme (shifting by the empirical mean and dividing by the empirical standard deviation), regardless of whether 𝜙LLM or 𝜙trad is used. Note, however, that the embedding dimensions of the two representations
may be different, and so we distinguish them using notation 𝑑llm and 𝑑trad respectively, where typically
𝑑llm ≫ 𝑑trad . Further details can be found in Appendix B.1.

To demonstrate consistent results over different families of language models, we benchmark over both
the T5 (Raffel et al., 2020) and Gemini 1.0 (Google, 2024) families, which use different architectures
(encoder-decoder and decoder-only), different vocabulary sizes (32K and 256K), and embedding
dimensions (see Appendix B.2) respectively. However, to remain consistent with the definition of
embedding, we follow previous literature (Li et al., 2020; Reimers and Gurevych, 2019) and use
average-pooling as the canonical method of aggregating Transformer outputs, and thus the embedding
dimension 𝑑llm is equivalent to the output feature dimension 𝑓 following a forward pass.
Similar to previous work (Nguyen et al., 2024; Song et al., 2024), for string representations of 𝑥 from
any regression task, by default we use a key-value JSON format with consistent ordering of keys, i.e.
{param1:value1,param2:value2,...}, with specific examples shown in Appendix C.
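
A minimal serializer of this form is sketched below; the quoting and key-ordering conventions here are assumptions made for illustration, with actual examples shown in Appendix C.

def to_string(params: dict) -> str:
    """Serialize an input x into a key-value string with consistent
    (here: sorted) key ordering; quoting of strings is an illustrative choice."""
    items = []
    for key in sorted(params):
        value = params[key]
        rendered = f"'{value}'" if isinstance(value, str) else f"{value}"
        items.append(f"{key}:{rendered}")
    return "{" + ",".join(items) + "}"

print(to_string({"batch_size": 128, "activation_fn": "selu", "dropout": 0.071}))
# {activation_fn:'selu',batch_size:128,dropout:0.071}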

2.2. Regression Tasks

For regression tasks, we first use synthetic, closed-form objective functions in order to produce
controlled studies in which we may query any 𝑥 from the input space. Our synthetic functions are
defined from the standard Black-Box Optimization Benchmarking (BBOB) suite (Elhara et al., 2019).
To avoid confounding terminology between embedding "dimension" 𝑑 and the intrinsic "dimension"
of an objective 𝑓 , we denote the latter as "degree-of-freedom" (DOF), and thus 𝑓 (·) is dependent on
input coordinates 𝑥 (1) , . . . , 𝑥 (DOF) , each of which lies in [−5, 5]. This provides a comprehensive
variety of both convex and non-convex objective landscapes to regress upon.
We further use real-world regression tasks representative of those encountered in the wild and in
industry settings by benchmarking over offline objective evaluations found in Google Vizier (Golovin
et al., 2017), which optimizes Google’s largest production and research systems. These consist of four
families, with each family containing at least 50 individual yet similar regression tasks. The families
are:
• AutoML (Google Cloud, 2023): Automated Machine Learning platform for automating TFX
(Google, 2023) pipelines (e.g. batch size, activation, layer counts) over tabular or text data.
• Init2Winit (Dahl et al., 2023): Learning rate scheduling parameters influencing common image
classification tasks (e.g. ResNets on CIFAR-10 and ImageNet).
• XLA (Phothilimthana et al., 2021): Tuning for the Accelerated Linear Algebra (XLA) compiler
which affects LLM serving latencies.
• L2DA (Yazdanbakhsh et al., 2021): "Learning to Design Accelerators", for improving accelerators
such as TPUs and corresponding computer architectures to improve hardware performance.
In the real world regression tasks, each parameter may be continuous or categorical, and we define
the DOF of such a task by its number of parameters. Note that for synthetic objectives, where all inputs
are continuous, 𝑑trad = DOF. However, for real-world tasks with categorical parameters, 𝑑trad > DOF
due to additional one-hot encodings.
For obtaining data, we may either sample ( 𝑥, 𝑦 ) pairs (in the case of synthetic objectives where 𝑥
are sampled from X), or use the given offline data (in the case of real-world tasks, where they were actual evaluations from an optimization trajectory), using a standard 8-1-1 train-validation-test split.
Due to the inherent differing of metric scales across tasks, it would be inappropriate to aggregate
results based on scale-dependent metrics such as mean squared error (MSE). Furthermore, we found
that the selection of the regression metric (e.g. Kendall-Tau, Pearson, mean squared error, mean
absolute error) did not matter for comparisons, as they all strongly correlated with each other. Thus,
by default we report the Kendall-Tau ranking correlation, which always lies within [−1, 1] and can be
aggregated across different tasks.
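
As a concrete illustration of the evaluation, the sketch below computes Kendall-Tau with scipy; because the metric depends only on the ranking of predictions, it is insensitive to each task's metric scale.

import numpy as np
from scipy.stats import kendalltau

def evaluate(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Ranking-based metric: invariant to monotone rescaling of predictions."""
    tau, _ = kendalltau(y_true, y_pred)
    return tau

# Toy example: predictions that preserve the ranking achieve tau = 1.0
# even though their scale is completely different from the targets.
y_true = np.array([3.0, 1.0, 2.0, 5.0])
y_pred = np.array([300.0, 100.0, 200.0, 500.0])
print(evaluate(y_true, y_pred))  # 1.0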

3. Experimental Results

3.1. High Dimensional Regression

We begin by demonstrating cases in which LLM embeddings better represent inputs over high DOF
spaces than traditional representations. In Figure 2, we show that for a subset of functions, LLM
embeddings possess surprising robustness, retaining the same performance for varying DOFs whereas
traditional baselines such as XGBoost and MLPs significantly falter over higher DOFs.

[Figure 2: Kendall-Tau correlation vs. degrees of freedom for the Sphere, RosenbrockRotated, Lunacek, NegativeSphere, SharpRidge, Gallagher101Me, Gallagher21Me, GriewankRosenbrock, SchaffersF7, SchaffersF7IllConditioned, Rastrigin, and StepEllipsoidal functions, comparing XGBoost, MLP, Gemini Pro, and T5-XXL.]

Figure 2 | Higher (↑) is better. Degrees of freedom (DOF) vs Kendall-Tau correlation for various BBOB functions. Results are averaged over 12 runs for each regression method. Each task's data consists of 500 ( 𝑥, 𝑦 ) evaluations sampled uniformly across the input space, using an 8-1-1 split for train-validation-test.

This result is not universal, however: as we show in Appendix A.1, the pattern does not hold for some functions, but it does occur for the majority of BBOB functions. We further
corroborate this observation over real-world tasks in Table 1. We see that in general, regressions on
LLM embeddings outperform traditional methods more often for tasks with higher DOFs (AutoML
and XLA).


Task Name Avg. DOF T5-Small % T5-XXL % Gemini Nano % Gemini Pro %
Init2Winit 4 6.7 8.0 11.3 19.0
L2DA 10 2.7 12.0 9.3 10.7
AutoML 29 30.7 41.3 29.3 36.0
XLA 35 17.2 29.3 18.9 24.1

Table 1 | Percentage of tasks in which 𝜙LLM outperforms 𝜙trad across various real world regression tasks. Results
reported for 75 tasks per family, except for XLA, which only contains 58 tasks. Full results in Appendix A.2.

3.2. LLM Embedding Smoothness

Particularly due to the discrete nature of tokenization, it is non-obvious whether LLM embeddings
possess a notion of continuity in embedding space. For example, assuming character-wise tokenization,
1.234 is not so numerically distant from 1.567, but is token-wise distant, as the majority of the
tokens (234 and 567) are not shared.
The notion of continuity and smoothness is crucial for neural network generalization (Kalimeris et al.,
2019; Neyshabur et al., 2018), robustness (Weng et al., 2018), vulnerability to adversarial examples
(Goodfellow et al., 2015), and more. We can characterize smoothness in the regression case by the
Lipschitz-continuity induced by a representation 𝜙 in its latent space ℝ𝑑 .

[Figure 3: Histograms (count vs. normalized Lipschitz factor) for Sphere, RosenbrockRotated, Lunacek (top) and Schwefel, LinearSlope, Discus (bottom), each at DOF=100, comparing LLM and Traditional representations.]

Figure 3 | Left-skewness (←) is better. NLFDs induced by 𝜙LLM (T5-XXL) and 𝜙trad . Top: Cases where 𝜙LLM outperforms 𝜙trad for regression. Bottom: Vice-versa, where 𝜙trad outperforms 𝜙LLM .

Intuitively, similar inputs should lead to similar objective values, which can be quantified inversely by
the Lipschitz factor 𝐿 ( 𝑥, 𝑥 ′ ) = ∥ 𝑓 ( 𝑥 ) − 𝑓 ( 𝑥 ′ )∥ /∥ 𝜙 ( 𝑥 ) − 𝜙 ( 𝑥 ′ )∥ with respect to a representation 𝜙 and
∥·∥ norm. We emphasize to the reader that the input space X does not actually have an explicit notion
of distance on its own. Instead, traditionally it has always been assumed that the distance was defined
canonically by Euclidean distance over the traditional embedding method, i.e. ∥ 𝜙trad ( 𝑥 ) − 𝜙trad ( 𝑥 ′ ) ∥ 2
as demonstrated by common use of Euclidean-based radial basis and Matern kernels (Genton, 2002)
during regression modeling. However, as seen from the results previously, it may be the case that
𝜙trad is suboptimal for some regression tasks.


In order to analyze the continuity of an embedding 𝜙 with respect to offline data D, we define a
Normalized Lipschitz Factor Distribution (NLFD) as follows:
1. Full-batch normalize, i.e. apply shifting and scaling to each 𝜙 ( 𝑥 ) so that in aggregate, D has
zero mean and unit variance per coordinate.
2. For each 𝑥 ∈ D, choose 𝑥 ′ ∈ D such that 𝜙 ( 𝑥 ′ ) is the nearest ℓ2 neighbor of 𝜙 ( 𝑥 ), and compute
the Lipschitz factor 𝐿 ( 𝑥, 𝑥 ′ ).
3. To assume an average embedding norm of 1 for different embedding dimensions 𝑑 , we downscale all Lipschitz factors by √𝑑 (a sketch of this computation follows the list).
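
The following NumPy sketch implements steps 1-3, pairing each point with its nearest ℓ2 neighbor in the normalized embedding space; the brute-force pairwise distance computation is an illustrative choice that assumes the dataset fits in memory.

import numpy as np

def nlfd(phi_x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Normalized Lipschitz Factor Distribution for embeddings phi_x of shape
    (n, d) and objective values y of shape (n,)."""
    n, d = phi_x.shape
    # Step 1: full-batch normalize to zero mean and unit variance per coordinate.
    z = (phi_x - phi_x.mean(axis=0)) / (phi_x.std(axis=0) + 1e-12)
    # Step 2: nearest l2 neighbor in embedding space (excluding the point itself).
    dists = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    nn = dists.argmin(axis=1)
    lipschitz = np.abs(y - y[nn]) / dists[np.arange(n), nn]
    # Step 3: downscale by sqrt(d) so different embedding dims are comparable.
    return lipschitz / np.sqrt(d)

# Example: NLFD of raw coordinates on a Sphere objective with DOF = 10.
rng = np.random.default_rng(0)
X = rng.uniform(-5, 5, size=(200, 10))
y = (X ** 2).sum(axis=1)
print(nlfd(X, y).mean())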
We observe a strong relationship between how much the NLFD skews toward zero and regression
performance. Specifically, in Figure 3, when 𝜙LLM outperforms 𝜙trad for regression, 𝜙LLM ’s distribution
of Lipschitz factors also tends to skew relatively more to zero than 𝜙trad , and vice-versa.
[Figure 4: NLFD gap (Z-score) vs. regression performance gap (Kendall). Top row (DOF=100): T5-Small (K 0.64, S 0.83, P 0.77), T5-Large (K 0.71, S 0.88, P 0.79), T5-XL (K 0.79, S 0.92, P 0.86), T5-XXL (K 0.75, S 0.91, P 0.83). Bottom row (Gemini Pro): DOF=10 (K 0.58, S 0.79, P 0.77), DOF=25 (K 0.54, S 0.67, P 0.73), DOF=50 (K 0.65, S 0.78, P 0.82), DOF=100 (K 0.75, S 0.93, P 0.88).]

Figure 4 | Relationship between gaps in NLFD (via Z-score) and regression performance for all 23 BBOB functions. The relationship is quantified using (K, S, P), which are respectively the Kendall-Tau, Spearman, and Pearson correlations. Top: We vary model size within the T5 model family. Bottom: We vary the objective's DOF for Gemini Pro.

To formally quantify comparisons between NLFDs from 𝜙LLM and 𝜙trad , for a fixed regression task,
we may thus compute the Z-score using the difference of the two distributions:
𝑍 = (𝜇𝜙trad − 𝜇𝜙LLM) / √(𝜎𝜙trad² + 𝜎𝜙LLM²)    (1)

where 𝜇𝜙 and 𝜎𝜙 are respectively the mean and standard deviation of the NLFD of a representation 𝜙.
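
Given two NLFDs as arrays (e.g. produced by the sketch above), Eq. (1) reduces to a one-line computation; under this sign convention, a positive 𝑍 indicates that 𝜙LLM induces the smoother (more zero-skewed) distribution.

import numpy as np

def nlfd_gap_zscore(nlfd_trad: np.ndarray, nlfd_llm: np.ndarray) -> float:
    """Z-score of Eq. (1) comparing the NLFDs of two representations."""
    return float(
        (nlfd_trad.mean() - nlfd_llm.mean())
        / np.sqrt(nlfd_trad.var() + nlfd_llm.var())
    )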
We may then observe the relationship between gaps in representation smoothness vs. regression
performance. In Figure 4 with extended results in Appendix A.3, we see that for a given BBOB
regression task, the Z-score (i.e. gap in embedding smoothness) is highly correlated with the gap in
regression performance, regardless of the model used (T5 or Gemini) or the DOF of the underlying
objective 𝑓 .
We further visualize whether 𝜙LLM is distance aware, i.e. whether 𝜙LLM ( 𝑥 ) and 𝜙LLM ( 𝑥 ′ ) are close in
embedding space if 𝜙trad ( 𝑥 ) and 𝜙trad ( 𝑥 ′ ) are close. As mentioned before however, there is no ground
truth notion of "closeness" - nonetheless, we use 𝜙trad as a point of comparison. Since it is inappropriate to simply sample 𝑥 's uniformly in a high DOF space, as then average distances concentrate around √DOF, we instead take a reference point and sample points from ℓ2 -balls of increasing distance from the reference.
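
One way to realize this sampling scheme is sketched below: drawing random unit directions and scaling them places points at a controlled ℓ2 distance from the reference (i.e. on the spherical shells of the ℓ2-balls); the radii used here are illustrative.

import numpy as np

def sample_at_distance(reference: np.ndarray, radius: float, n: int,
                       rng: np.random.Generator) -> np.ndarray:
    """Sample n points at l2 distance `radius` from the reference point by
    drawing random directions on the unit sphere and scaling them."""
    directions = rng.normal(size=(n, reference.shape[0]))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    return reference + radius * directions

rng = np.random.default_rng(0)
reference = rng.uniform(-5, 5, size=100)   # DOF = 100 reference point
shells = {r: sample_at_distance(reference, r, n=20, rng=rng) for r in (5, 15, 30, 45)}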
In Figure 5, we see that distances over the LLM embedding space are correlated with the traditional
measure of distance, but may be non-linearly warped, which benefits LLM-based regression in certain
cases as seen in Section 3.1.
[Figure 5: t-SNE Dimension 1 vs. t-SNE Dimension 2 for Gemini Nano (left) and Gemini Pro (right), colored by distance from the reference point.]

Figure 5 | t-SNE for Gemini (Nano and Pro) embeddings of points sampled around a DOF=100 reference point. Traditional ℓ2 distance is overlaid in color.

3.3. Model Effects

In this subsection, we comprehensively investigate the impact of many common LLM factors on
regression performance.
Are Larger Models Always Better? Within the research community, the prevailing assumption is
that there exists a direct correlation between language model size and performance improvement.
However, with the rise of leaderboards such as LMSYS (LMS, 2023), smaller models have been shown
to outperform larger competitors, due to differences in their "recipe", such as training data quality,
pre-training and post-training techniques, and architecture.

[Figure 6: Kendall-Tau correlation vs. model parameters (T5 Scaling, left) and model tier Nano/Pro/Ultra (Gemini Scaling, right) for the AutoML, Init2Winit, XLA, and L2DA families.]

Figure 6 | Higher (↑) is better. Model size vs regression performance on hyperparameter tuning tasks across the T5 and Gemini model families. Median performance is plotted, along with the 40-60 percentiles as error bars.

In Figure 6, we see that over various real world regression tasks, T5 models exhibit a clear trend of
improved performance when increasing model size, when training methodology is fixed. In contrast,
model tiers within the Gemini family exhibit substantial variance, and larger model sizes do not
consistently translate to superior results. We hypothesize this is due to differences in Gemini "recipes",
as e.g. different model tiers may have used different pre-training datasets, architecture tweaks, and
post-training configurations, whereas all T5 model sizes have only been pre-trained on the C4 web
crawl corpus.
Does Language Understanding Actually Help? Recent works (Devlin et al., 2019; Li et al., 2020)
have claimed that logit-based embeddings mostly measure the semantic similarity between string
inputs, and thus it is unconfirmed whether they may be beneficial for numeric regression tasks. To
resolve this, using the T5 family, we compare against using (1) a randomly initialized model for
the forward pass, and (2) representing our features via vocabulary embeddings without applying a
forward pass.
[Figure 7: Kendall-Tau correlation per task (AutoML, Init2Winit, XLA, L2DA) for T5-Small/Large/XL/XXL. Left: Pre-trained vs Random Init (T5-* Random). Right: Full Model vs Vocab Table (T5-* Vocab).]

Figure 7 | Kendall-Tau regression comparisons when comparing to random initialization (left) and vocabulary embeddings (right). Each bar is averaged across 75 tasks per family.

In Figure 7, we see that the default mode of applying a forward pass of a pre-trained model performs
the best, as expected. However, it is worth noting that in some tasks such as AutoML and L2DA, the
improvement is surprisingly quite minimal, suggesting that applying forward passes by pretrained
models does not always help for regression.
We further ablate differences in string representation, i.e. whether by default to show feature names as {param1:value1,param2:value2,...} or omit them, only showing [value1,value2,...]. In Figure 8, for the majority of tasks, omitting feature names does not significantly affect performance, although specific tasks such as XLA do benefit from feature names. This is surprising, as presumably feature names in XLA tasks such as auto_cross_replica_sharding are not as common as names such as batch_size or learning_rate found in both AutoML and Init2Winit.

[Figure 8: Kendall-Tau correlation difference per task (AutoML, Init2Winit, XLA, L2DA) for T5-Small, T5-Large, T5-XL, and T5-XXL.]

Figure 8 | Difference in Kendall correlation when using the full dictionary containing feature names, or only values.
The results of Figures 7 and 8 combined lead to additional surprising conclusions, such as language-
to-numeric transfer. For instance, inputs 𝑥 from Init2Winit tasks only possess numeric values, and
as expected, removing feature names does not significantly change regression results. Yet applying forward passes by pre-trained T5 models still benefits regression, despite the fact that T5's pre-training
data contains mostly web-corpus data which is unlikely to contain significant amounts of scientific or
numeric information (Dodge et al., 2021).
More Training Data Reduces Baseline Gaps: Intuitively, as more samples are available in a task, the
difference in inductive biases between regression methods should matter less, since predictions will
be more influenced by training data. We verify this in Figure 9, where we see that for tasks with low
numbers of ( 𝑥, 𝑦 ) points, there is more variability in performance between using 𝜙LLM and 𝜙trad , but
additional training points decrease these differences.

[Figure 9: (T5-XXL − MLP) Kendall-Tau correlation vs. number of training points for AutoML (left) and XLA (right), showing mean and standard-deviation bands.]

Figure 9 | Performance gap between an MLP baseline and regression over T5-XXL embeddings for individual trials within the AutoML and XLA task settings. Higher (↑) is better for LLM embeddings. Error bars are plotted for {0.5, 1.0, 2.0} of the standard deviation.

4. Conclusion and Future Work


We thoroughly investigated multiple important aspects around the use of LLM embeddings for
traditional regression. We found that LLM embeddings can be quite performant for input spaces with
high degrees of freedom, and proposed the Lipschitz factor distribution to understand the embedding-
to-objective landscape and its relationship to regression performance. We further investigated the
nuanced conditions for which better language understanding does improve LLM-based regression.
Since strings, and more generally tokens, can represent many varieties of data, it is worth further un-
derstanding the effects of LLM embeddings over non-tabular forms of inputs, including combinatorial
objects such as graphs, and even other modalities such as images and videos.

Acknowledgments

We thank Yutian Chen, Daniel Golovin, Chansoo Lee, Tung Nguyen, and Sagi Perel for relevant
discussions during experimentation and the writing of this paper. We further thank the organizers of
the Google DeepMind Academy Program for providing the opportunity to do this research.


References
LMSYS: Large model systems organization, 2023. URL https://lmsys.org/.

Anthropic. The claude 3 model family: Opus, sonnet, haiku. 2024.

G. E. Dahl, F. Schneider, Z. Nado, N. Agarwal, C. S. Sastry, P. Hennig, S. Medapati, R. Eschenhagen, P. Kasimbeg, D. Suo, J. Bae, J. Gilmer, A. L. Peirson, B. Khan, R. Anil, M. Rabbat, S. Krishnan, D. Snider, E. Amid, K. Chen, C. J. Maddison, R. Vasudev, M. Badura, A. Garg, and P. Mattson. Benchmarking neural network training algorithms. CoRR, abs/2306.07179, 2023.

J. Devlin, M. Chang, K. Lee, and K. Toutanova. BERT: pre-training of deep bidirectional transformers
for language understanding. In J. Burstein, C. Doran, and T. Solorio, editors, Proceedings of the
2019 Conference of the North American Chapter of the Association for Computational Linguistics:
Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1
(Long and Short Papers), pages 4171–4186. Association for Computational Linguistics, 2019. doi:
10.18653/V1/N19-1423.

J. Dodge, M. Sap, A. Marasovic, W. Agnew, G. Ilharco, D. Groeneveld, M. Mitchell, and M. Gardner. Documenting large webtext corpora: A case study on the colossal clean crawled corpus. In M. Moens, X. Huang, L. Specia, and S. W. Yih, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pages 1286–1305. Association for Computational Linguistics, 2021.

O. Elhara, K. Varelas, D. Nguyen, T. Tusar, D. Brockhoff, N. Hansen, and A. Auger. Coco: the large scale
black-box optimization benchmarking (bbob-largescale) test suite. arXiv preprint arXiv:1903.06396,
2019.

M. G. Genton. Classes of kernels for machine learning: a statistics perspective. J. Mach. Learn. Res.,
2:299–312, Mar. 2002. ISSN 1532-4435.

D. Golovin, B. Solnik, S. Moitra, G. Kochanski, J. Karro, and D. Sculley. Google vizier: A service
for black-box optimization. In Proceedings of the 23rd ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, Halifax, NS, Canada, August 13 - 17, 2017, pages 1487–1495.
ACM, 2017.

I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In Y. Bengio and Y. LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.

Google. Tfx: A tensorflow-based production-scale machine learning platform. https://www.tensorflow.org/tfx, 2023. Accessed: November 1, 2024.
Google. Gemini: A family of highly capable multimodal models, 2024.

Google Cloud. Vertex ai automl. https://cloud.google.com/vertex-ai/docs/start/automl-intro, 2023. Accessed: November 1, 2024.
D. Kalimeris, G. Kaplun, P. Nakkiran, B. L. Edelman, T. Yang, B. Barak, and H. Zhang. SGD on neural
networks learns functions of increasing complexity. In H. M. Wallach, H. Larochelle, A. Beygelzimer,
F. d’Alché-Buc, E. B. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems
32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December
8-14, 2019, Vancouver, BC, Canada, pages 3491–3501, 2019.


V. Karpukhin, B. Oguz, S. Min, P. S. H. Lewis, L. Wu, S. Edunov, D. Chen, and W. Yih. Dense passage
retrieval for open-domain question answering. In B. Webber, T. Cohn, Y. He, and Y. Liu, editors,
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP
2020, Online, November 16-20, 2020, pages 6769–6781. Association for Computational Linguistics,
2020.

A. Kristiadi, F. Strieth-Kalthoff, M. Skreta, P. Poupart, A. Aspuru-Guzik, and G. Pleiss. A sober look at llms for material discovery: Are they actually good for bayesian optimization over molecules? In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024.

B. Li, H. Zhou, J. He, M. Wang, Y. Yang, and L. Li. On the sentence embeddings from pre-trained
language models. In B. Webber, T. Cohn, Y. He, and Y. Liu, editors, Proceedings of the 2020
Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November
16-20, 2020, pages 9119–9130. Association for Computational Linguistics, 2020.

Q. Liu, M. J. Kusner, and P. Blunsom. A survey on contextual embeddings. CoRR, abs/2003.07278, 2020.

B. Neyshabur, S. Bhojanapalli, and N. Srebro. A pac-bayesian approach to spectrally-normalized margin bounds for neural networks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net, 2018.

T. Nguyen, Q. Zhang, B. Yang, C. Lee, J. Bornschein, Y. Miao, S. Perel, Y. Chen, and X. Song. Predicting
from strings: Language model embeddings for bayesian optimization, 2024.

OpenAI. GPT-4 technical report. CoRR, abs/2303.08774, 2023.

P. M. Phothilimthana, A. Sabne, N. Sarda, K. S. Murthy, Y. Zhou, C. Angermueller, M. Burrows, S. Roy, K. Mandke, R. Farahani, Y. E. Wang, B. Ilbeyi, B. A. Hechtman, B. Roune, S. Wang, Y. Xu, and S. J. Kaufman. A flexible approach to autotuning multi-pass machine learning compilers. In J. Lee and A. Cohen, editors, 30th International Conference on Parallel Architectures and Compilation Techniques, PACT 2021, Atlanta, GA, USA, September 26-29, 2021, pages 1–16. IEEE, 2021.

C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. Exploring
the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:
140:1–140:67, 2020.

N. Reimers and I. Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. In K. Inui, J. Jiang, V. Ng, and X. Wan, editors, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 3980–3990. Association for Computational Linguistics, 2019.

X. Song, O. Li, C. Lee, B. Yang, D. Peng, S. Perel, and Y. Chen. Omnipred: Language models as
universal regressors. CoRR, abs/2402.14547, 2024.

R. Vacareanu, V. Negru, V. Suciu, and M. Surdeanu. From words to numbers: Your large language
model is secretly A capable regressor when given in-context examples. CoRR, abs/2404.07544,
2024.


T. Weng, H. Zhang, P. Chen, J. Yi, D. Su, Y. Gao, C. Hsieh, and L. Daniel. Evaluating the robustness of
neural networks: An extreme value theory approach. In 6th International Conference on Learning
Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track
Proceedings. OpenReview.net, 2018.

A. Yazdanbakhsh, C. Angermüller, B. Akin, Y. Zhou, A. Jones, M. Hashemi, K. Swersky, S. Chatterjee, R. Narayanaswami, and J. Laudon. Apollo: Transferable architecture exploration. CoRR, abs/2102.01723, 2021.


Appendix
A. Extended Experiments

A.1. High Dimensional Regression

For full transparency, in Figure 10 we display BBOB functions where LLM-based regression was not
consistently dimensionally robust against MLP and XGBoost baselines. Note that even in these cases,
we still see certain cases where a language model outperforms at least one of the baselines, e.g. in
the Discus and DifferentPowers functions, Gemini and T5 outperform MLP but not XGBoost.

[Figure 10: Kendall-Tau correlation vs. degrees of freedom for DifferentPowers, Ellipsoidal, LinearSlope, Discus, BentCigar, BuecheRastrigin, AttractiveSector, Schwefel, Weierstrass, Katsuura, and NegativeMinDifference, comparing XGBoost, MLP, Gemini Pro, and T5-XXL.]

Figure 10 | Following Figure 2 in the main body, we present BBOB functions in which LLM embeddings did not completely outperform traditional baselines.


A.2. Real World Results

Despite Table 1 of the main body showing that there were numerous cases where LLM embeddings
outperform traditional ones, we remind the reader in Figure 11 that on average, LLM embeddings still
slightly underperform.

[Figure 11: Full real-world regression results. Kendall-Tau correlation per task family (AutoTFX, Init2Winit, XLA, L2DA) for XGBoost, MLP, Gemini Nano/Pro/Ultra, and T5-Small/Large/XL/XXL.]

Figure 11 | Full results over real-world tasks. Displayed is the mean Kendall-Tau correlation over all tasks within each family.

A.3. Performance Correlations

Following Figure 4, in Table 2, we see that the relationship between the smoothness induced by the
embedding and the performance in regression is consistent throughout.

Model DOF=5 DOF=10 DOF=25 DOF=50 DOF=100


Gemini Nano 0.81 0.81 0.70 0.75 0.86
Gemini Pro 0.78 0.77 0.72 0.82 0.88
T5-Small 0.75 0.76 0.79 0.79 0.76
T5-Large 0.78 0.73 0.79 0.85 0.79
T5-XL 0.82 0.60 0.80 0.86 0.85
T5-XXL 0.72 0.76 0.82 0.83 0.83

Table 2 | Full set of data for Pearson correlation 𝜌 between Kendall’s regression performance and
gap in NLFD between input and embedding space for regression on all 23 BBOB functions, over
DOF=[5, 10, 25, 50, 100].


B. Exact Modeling Details

B.1. Hyperparameters Used

The full list of hyperparameters and training details for MLP-based regression (using either traditional or language model features) is as follows; a minimal training sketch appears after the list:
• Regression Head: MLP with 2 ReLU hidden layers of dimension 256.
• 𝑦 -Normalization: We compute the empirical mean 𝜇 and standard deviation 𝜎 over all 𝑦 -values
in the task’s training data, and apply 𝑦 ← ( 𝑦 − 𝜇 )/𝜎 as a preprocessing step.
• Optimizer: AdamW with learning rates swept across {1e-4, 5e-4, 1e-3, 5e-3, 1e-2} and
weight decay across {0, 1e-1, 1}.
• Loss: Mean Squared Error.
• Maximum Epochs: 300, with early stopping enabled.
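
The following PyTorch sketch mirrors the configuration above (2 ReLU hidden layers of width 256, AdamW, mean squared error, and 𝑦-normalization); it is a simplified stand-in for the actual training code and omits the learning-rate/weight-decay sweep, validation split, and early stopping.

import torch
import torch.nn as nn

def make_head(d_in: int, hidden: int = 256) -> nn.Module:
    # Regression head: 2 ReLU hidden layers of dimension 256, scalar output.
    return nn.Sequential(
        nn.Linear(d_in, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, 1),
    )

def train(head: nn.Module, X: torch.Tensor, y: torch.Tensor,
          lr: float = 1e-3, weight_decay: float = 0.0, epochs: int = 300):
    y = (y - y.mean()) / y.std()  # y-normalization
    opt = torch.optim.AdamW(head.parameters(), lr=lr, weight_decay=weight_decay)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):  # early stopping on a validation split omitted
        opt.zero_grad()
        loss = loss_fn(head(X).squeeze(-1), y)
        loss.backward()
        opt.step()
    return head

X = torch.randn(500, 4096)   # e.g. d_llm = 4096 for T5-XXL embeddings
y = torch.randn(500)
head = train(make_head(X.shape[1]), X, y)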
For XGBoost, we additionally grid-searched over the following parameters for each task (a sketch of this search follows the list):
• "min_child_weight": [1, 5, 10]
• "learning_rate": [0.001, 0.01, 0.1]
• "gamma": [0.0, 0.3, 0.5]
• "subsample": [0.6, 0.8, 1.0]
• "colsample_bytree": [0.6, 0.8, 1.0]
• "max_depth": [3, 5, 7]
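
This grid search can be reproduced with xgboost's scikit-learn wrapper and GridSearchCV, as sketched below; the cross-validation setup and scoring choice are assumptions, since the paper selects hyperparameters per task using its own validation split.

from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

param_grid = {
    "min_child_weight": [1, 5, 10],
    "learning_rate": [0.001, 0.01, 0.1],
    "gamma": [0.0, 0.3, 0.5],
    "subsample": [0.6, 0.8, 1.0],
    "colsample_bytree": [0.6, 0.8, 1.0],
    "max_depth": [3, 5, 7],
}

search = GridSearchCV(
    XGBRegressor(objective="reg:squarederror"),
    param_grid,
    scoring="neg_mean_squared_error",  # assumption; any regression score works
    cv=3,
)
# search.fit(X_train, y_train)   # X_train: traditional features, y_train: targets
# best_model = search.best_estimator_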

B.2. Embedding Sizes

Table 3 displays the embedding 𝑑llm for each model used in our experiments. As mentioned in the
main text, note that 𝑑llm is significantly larger than 𝑑trad .

T5 Model    𝑑llm        Gemini Model    𝑑llm
Small       512         Nano            1536
Large       1024        Pro             6144
XL          2048        Ultra           14336
XXL         4096

Table 3 | Embedding dimensions 𝑑llm for the T5 and Gemini model families.


C. Example String Representations


Table 4 contains example string representations of 𝑥 for different regression task families.

Task Family Example Representations


BBOB x0:0.32, x1:-4.21, x2:3.12, x3:1.56
AutoML batch_size:128, ml_feature_selection_threshold:0.05,
model_type:’DNN_ESTIMATOR’, activation_fn:’selu’, batch_norm:’False’,
bucketization_strategy:’mdl’,dropout:0.071, hidden_units:359
Init2Winit lr_hparams.base_lr:0.0696, opt_hparams.0.hps.one_minus_b1:0.2823,
opt_hparams.0.hps.one_minus_b2:0.0432,
opt_hparams.1.hps.weight_decay:0.0023
XLA auto_cross_replica_sharding:’False’,
rematerialization_percent_shared_memory_limit:97,
spmd_threshold_for_windowed_einsum_mib:100000, ...
L2DA input_activation_memory_depth:11.0, instruction_memory_depth:15.0,
io_bandwidth_gbps:4.321, narrow_memory_capacity_bytes:21.0, ...

Table 4 | Example 𝑥 representations from each of the regression task families. ‘. . . ’ denotes that there
are actually more parameters, but we omit them due to length.
