Understanding LLM Embeddings for Regression
Eric Tang∗1, Bangding Yang2 and Xingyou Song3
1 Stanford University, 2 Google, 3 Google DeepMind
With the rise of large language models (LLMs) for flexibly processing information as strings, a natural
application is regression, specifically by preprocessing string representations into LLM embeddings as
downstream features for metric prediction. In this paper, we provide one of the first comprehensive
investigations into embedding-based regression and demonstrate that LLM embeddings as features
can be better for high-dimensional regression tasks than using traditional feature engineering. This
regression performance can be explained in part due to LLM embeddings over numeric data inherently
preserving Lipschitz continuity over the feature space. Furthermore, we quantify the contribution of
different model effects, most notably model size and language understanding, which we find surprisingly
do not always improve regression performance.
While it is straightforward to assume that the whole process outlined for LLMs should constitute the
definition of a language model embedding 𝜙LLM , it is not obvious how much each of these steps may
contribute to the final regression result. For instance, one could simply skip applying a forward pass
in step (3) and pool the soft prompt directly, or use a randomly initialized model as opposed to a
pretrained one. We extensively study this case in Section 3.3.
2.1. Modeling
To minimize confounding factors and maintain fairness during comparisons, we use the exact
same MLP prediction head (2 hidden layers, ReLU activation), loss (mean squared error), and
𝑦 -normalization scheme (shifting by the empirical mean and dividing by the empirical standard deviation), regardless
of whether we use 𝜙LLM or 𝜙trad . Note, however, that the embedding dimensions of the two representations
may be different, and so we distinguish them using notation 𝑑llm and 𝑑trad respectively, where typically
𝑑llm ≫ 𝑑trad . Further details can be found in Appendix B.1.
To demonstrate consistent results over different families of language models, we benchmark over both
the T5 (Raffel et al., 2020) and Gemini 1.0 (Google, 2024) families, which use different architectures
(encoder-decoder and decoder-only), different vocabulary sizes (32K and 256K), and different embedding
dimensions (see Appendix B.2), respectively. However, to remain consistent with the definition of an
embedding, we follow previous literature (Li et al., 2020; Reimers and Gurevych, 2019) and use
average-pooling as the canonical method of aggregating Transformer outputs, and thus the embedding
dimension 𝑑llm is equal to the output feature dimension 𝑓 following a forward pass.
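As a concrete illustration, below is a minimal sketch of this average-pooling step, assuming the HuggingFace transformers library and the public t5-small checkpoint as stand-ins for our actual setup.

```python
# Minimal sketch: mean-pool encoder outputs into a single embedding phi_LLM(x).
# Assumes HuggingFace `transformers`; model/tokenizer names are illustrative.
import torch
from transformers import AutoTokenizer, T5EncoderModel

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5EncoderModel.from_pretrained("t5-small")

def embed(text: str) -> torch.Tensor:
    """Return the mask-aware average of encoder outputs; d_llm equals the hidden size."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state       # (1, seq_len, d_llm)
    mask = inputs["attention_mask"].unsqueeze(-1)        # (1, seq_len, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (1, d_llm)
```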
Similar to previous work (Nguyen et al., 2024; Song et al., 2024), for string representations of 𝑥 from
any regression task, by default we use a key-value JSON format with consistent ordering of keys, i.e.
{param1:value1,param2:value2,...}, with specific examples shown in Appendix C.
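A minimal sketch of this serialization in Python; the parameter names are hypothetical and the exact formatting (quoting, separators) may differ from ours.

```python
# Minimal sketch: serialize an input x into a key-value JSON string with a
# consistent key ordering. Parameter names here are hypothetical.
import json

x = {"learning_rate": 0.01, "batch_size": 256, "activation": "relu"}
x_str = json.dumps(x, sort_keys=True, separators=(",", ":"))
# -> '{"activation":"relu","batch_size":256,"learning_rate":0.01}'
```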
For regression tasks, we first use synthetic, closed-form objective functions in order to produce
controlled studies in which we may query any 𝑥 from the input space. Our synthetic functions are
defined from the standard Black-Box Optimization Benchmarking (BBOB) suite (Elhara et al., 2019).
To avoid confounding terminology between embedding "dimension" 𝑑 and the intrinsic "dimension"
of an objective 𝑓 , we denote the latter as "degree-of-freedom" (DOF), and thus 𝑓 (·) is dependent on
input coordinates 𝑥 (1) , . . . , 𝑥 (DOF) , each of which lies in [−5, 5]. This provides a comprehensive
variety of both convex and non-convex objective landscapes to regress upon.
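A minimal sketch of sampling (𝑥, 𝑦) pairs from such a closed-form objective; the sphere function below is only a stand-in for the BBOB suite.

```python
# Minimal sketch: sample (x, y) pairs from a closed-form objective over
# [-5, 5]^DOF. The sphere function stands in for the BBOB suite.
import numpy as np

def sphere(x: np.ndarray) -> float:
    return float(np.sum(x ** 2))

rng = np.random.default_rng(0)
DOF = 25
X = rng.uniform(-5.0, 5.0, size=(500, DOF))
y = np.array([sphere(x) for x in X])
```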
We further use real-world regression tasks representative of those encountered in the wild and in
industry settings by benchmarking over offline objective evaluations found in Google Vizier (Golovin
et al., 2017), which optimizes Google’s largest production and research systems. These consist of four
families, with each family containing at least 50 individual yet similar regression tasks. The families
are:
• AutoML (Google Cloud, 2023): Automated Machine Learning platform for automating TFX
(Google, 2023) pipelines (e.g. batch size, activation, layer counts) over tabular or text data.
• Init2Winit (Dahl et al., 2023): Learning rate scheduling parameters influencing common image
classification tasks (e.g. ResNets on CIFAR-10 and ImageNet).
• XLA (Phothilimthana et al., 2021): Tuning for the Accelerated Linear Algebra (XLA) compiler
which affects LLM serving latencies.
• L2DA (Yazdanbakhsh et al., 2021): "Learning to Design Accelerators", for improving accelerators
such as TPUs and corresponding computer architectures to improve hardware performance.
In the real-world regression tasks, each parameter may be continuous or categorical, and we define
the DOF of such a task by its number of parameters. Note that for synthetic objectives, where all inputs
are continuous, 𝑑trad = DOF. However, for real-world tasks with categorical parameters, 𝑑trad > DOF
due to additional one-hot encodings.
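To illustrate why 𝑑trad > DOF in the presence of categorical parameters, the sketch below builds 𝜙trad by passing continuous values through and one-hot encoding categorical ones; parameter names and categories are hypothetical.

```python
# Minimal sketch: phi_trad with one-hot encodings for categorical parameters,
# so d_trad exceeds the DOF (number of parameters). Names are hypothetical.
import numpy as np

CATEGORIES = {"activation": ["relu", "tanh", "swish"]}

def phi_trad(x: dict) -> np.ndarray:
    feats = []
    for name, value in sorted(x.items()):
        if name in CATEGORIES:                       # categorical -> one-hot
            one_hot = np.zeros(len(CATEGORIES[name]))
            one_hot[CATEGORIES[name].index(value)] = 1.0
            feats.append(one_hot)
        else:                                        # continuous -> pass through
            feats.append(np.array([float(value)]))
    return np.concatenate(feats)

x = {"learning_rate": 0.01, "batch_size": 256, "activation": "relu"}
print(phi_trad(x).shape)  # DOF = 3 parameters, but d_trad = 5 features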
For obtaining data, we may either sample ( 𝑥, 𝑦 ) pairs (in the case of synthetic objectives where 𝑥
are sampled from X), or use the given offline data (in the case of real-world tasks, where they were
actual evaluations from an optimization trajectory), using a standard 8-1-1 train-validation-test split.
Due to inherent differences in metric scales across tasks, it would be inappropriate to aggregate
results based on scale-dependent metrics such as mean squared error (MSE). Furthermore, we found
that the selection of the regression metric (e.g. Kendall-Tau, Pearson, mean squared error, mean
absolute error) did not matter for comparisons, as they all strongly correlated with each other. Thus,
by default we report the Kendall-Tau rank correlation, which is scale-free (always within [−1, 1]) and can thus be
aggregated across different tasks.
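For reference, a minimal sketch of this per-task evaluation and aggregation using scipy, with toy data standing in for the benchmark tasks.

```python
# Minimal sketch: compute Kendall-Tau per task and average across tasks.
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(0)
per_task_scores = []
for _ in range(3):                                    # toy stand-in tasks
    y_true = rng.normal(size=50)
    y_pred = y_true + rng.normal(scale=0.5, size=50)  # imperfect predictions
    tau, _ = kendalltau(y_true, y_pred)
    per_task_scores.append(tau)
print(np.mean(per_task_scores))                       # scale-free aggregate
```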
3. Experimental Results
We begin by demonstrating cases in which LLM embeddings better represent inputs over high DOF
spaces than traditional representations. In Figure 2, we show that for a subset of functions, LLM
embeddings possess surprising robustness, retaining the same performance for varying DOFs whereas
traditional baselines such as XGBoost and MLPs significantly falter over higher DOFs.
Figure 2 | Higher (↑) is better. Degrees of freedom (DOF) vs Kendall-Tau correlation for various BBOB functions.
Results are averaged over 12 runs for each regression method. Each task’s data consists of 500 ( 𝑥, 𝑦 ) evaluations
sampled uniformly across the input space, using an 8-1-1 split for train-validation-test.
This result is not universal, however: as we show in Appendix A.1, this pattern does not hold for
some functions, but it does occur for the majority of the BBOB functions. We further
corroborate this observation over real-world tasks in Table 1. We see that in general, regressions on
LLM embeddings outperform traditional methods more often for tasks with higher DOFs (AutoML
and XLA).
Task Name | Avg. DOF | T5-Small % | T5-XXL % | Gemini Nano % | Gemini Pro %
Init2Winit | 4 | 6.7 | 8.0 | 11.3 | 19.0
L2DA | 10 | 2.7 | 12.0 | 9.3 | 10.7
AutoML | 29 | 30.7 | 41.3 | 29.3 | 36.0
XLA | 35 | 17.2 | 29.3 | 18.9 | 24.1
Table 1 | Percentage of tasks in which 𝜙LLM outperforms 𝜙trad across various real world regression tasks. Results
reported for 75 tasks per family, except for XLA, which only contains 58 tasks. Full results in Appendix A.2.
Particularly due to the discrete nature of tokenization, it is non-obvious whether LLM embeddings
possess a notion of continuity in embedding space. For example, assuming character-wise tokenization,
1.234 is not so numerically distant from 1.567, but is token-wise distant, as the majority of the
tokens (234 and 567) are not shared.
The notion of continuity and smoothness is crucial for neural network generalization (Kalimeris et al.,
2019; Neyshabur et al., 2018), robustness (Weng et al., 2018), vulnerability to adversarial examples
(Goodfellow et al., 2015), and more. We can characterize smoothness in the regression case by the
Lipschitz-continuity induced by a representation 𝜙 in its latent space ℝ𝑑 .
[Figure 3 panels: Schwefel, LinearSlope, Discus (DOF=100); x-axis: Normalized Lipschitz Factor; legend: LLM vs. Traditional.]
Figure 3 | Left-skewness (←) is better. NLFDs induced by 𝜙LLM (T5-XXL) and 𝜙trad . Top: Cases where 𝜙LLM
outperforms 𝜙trad for regression. Bottom: Vice-versa where 𝜙trad outperforms 𝜙LLM .
Intuitively, similar inputs should lead to similar objective values, which can be quantified inversely by
the Lipschitz factor 𝐿 ( 𝑥, 𝑥 ′ ) = ∥ 𝑓 ( 𝑥 ) − 𝑓 ( 𝑥 ′ )∥ /∥ 𝜙 ( 𝑥 ) − 𝜙 ( 𝑥 ′ )∥ with respect to a representation 𝜙 and
∥·∥ norm. We emphasize to the reader that the input space X does not actually have an explicit notion
of distance on its own. Instead, distance has traditionally been defined canonically as the Euclidean distance
over the traditional representation, i.e. ∥ 𝜙trad ( 𝑥 ) − 𝜙trad ( 𝑥 ′ ) ∥ 2 , as demonstrated by the common use of
Euclidean-based radial basis and Matérn kernels (Genton, 2002)
during regression modeling. However, as seen from the results previously, it may be the case that
𝜙trad is suboptimal for some regression tasks.
In order to analyze the continuity of an embedding 𝜙 with respect to offline data D, we define a
Normalized Lipschitz Factor Distribution (NLFD) as follows:
1. Full-batch normalize, i.e. apply shifting and scaling to each 𝜙 ( 𝑥 ) so that in aggregate, D has
zero mean and unit variance per coordinate.
2. For each 𝑥 ∈ D, choose 𝑥 ′ ∈ D such that 𝜙 ( 𝑥 ′ ) is the nearest ℓ2 neighbor of 𝜙 ( 𝑥 ), and compute
the Lipschitz factor 𝐿 ( 𝑥, 𝑥 ′ ).
3. To assume an average embedding norm of 1 for different embedding dimensions 𝑑 , we downscale
all Lipschitz factors by √𝑑 .
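A minimal sketch of this procedure, assuming an embedding matrix Phi of shape (n, d) and targets y of shape (n,); this is an illustrative implementation rather than our exact code.

```python
# Minimal sketch: Normalized Lipschitz Factor Distribution (NLFD) for one
# representation phi, given embeddings Phi (n, d) and objective values y (n,).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def nlfd(Phi: np.ndarray, y: np.ndarray) -> np.ndarray:
    n, d = Phi.shape
    # 1. Full-batch normalize to zero mean and unit variance per coordinate.
    Phi = (Phi - Phi.mean(axis=0)) / (Phi.std(axis=0) + 1e-12)
    # 2. Nearest l2 neighbor (column 0 is the point itself), then Lipschitz factor.
    dist, idx = NearestNeighbors(n_neighbors=2).fit(Phi).kneighbors(Phi)
    L = np.abs(y - y[idx[:, 1]]) / (dist[:, 1] + 1e-12)
    # 3. Downscale by sqrt(d) to compare across embedding dimensions.
    return L / np.sqrt(d)
```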
We see a strong inverse relationship between the skewness of the NLFD and regression
performance. Specifically, in Figure 3, when 𝜙LLM outperforms 𝜙trad for regression, 𝜙LLM ’s distribution
of Lipschitz factors also tends to skew more toward zero than that of 𝜙trad , and vice-versa.
[Figure 4: Z-score (gap in embedding smoothness) vs. Regression Performance Gap (Kendall) for T5-Small (K: 0.64, S: 0.83, P: 0.77), T5-Large (K: 0.71, S: 0.88, P: 0.79), T5-XL (K: 0.79, S: 0.92, P: 0.86), and T5-XXL (K: 0.75, S: 0.91, P: 0.83), all at DOF=100.]
To formally quantify comparisons between NLFDs from 𝜙LLM and 𝜙trad , for a fixed regression task,
we may thus compute the Z-score using the difference of the two distributions:
$Z = \dfrac{\mu_{\phi_{\mathrm{trad}}} - \mu_{\phi_{\mathrm{LLM}}}}{\sqrt{\sigma^{2}_{\phi_{\mathrm{trad}}} + \sigma^{2}_{\phi_{\mathrm{LLM}}}}}$   (1)
where 𝜇 𝜙 and 𝜎𝜙 are respectively mean and standard deviations of the NLFD of a representation 𝜙.
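A minimal sketch of Eq. (1), reusing the nlfd sketch above.

```python
# Minimal sketch: Z-score between the NLFDs of the traditional and LLM
# representations, as in Eq. (1).
import numpy as np

def z_score(L_trad: np.ndarray, L_llm: np.ndarray) -> float:
    return float((L_trad.mean() - L_llm.mean())
                 / np.sqrt(L_trad.var() + L_llm.var()))
```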
We may then observe the relationship between gaps in representation smoothness vs. regression
performance. In Figure 4 with extended results in Appendix A.3, we see that for a given BBOB
regression task, the Z-score (i.e. gap in embedding smoothness) is highly correlated with the gap in
regression performance, regardless of the model used (T5 or Gemini) or the DOF of the underlying
objective 𝑓 .
We further visualize whether 𝜙LLM is distance aware, i.e. whether 𝜙LLM ( 𝑥 ) and 𝜙LLM ( 𝑥 ′ ) are close in
embedding space whenever 𝜙trad ( 𝑥 ) and 𝜙trad ( 𝑥 ′ ) are close. As mentioned before, however, there is no ground
truth notion of "closeness"; nonetheless, we use 𝜙trad as a point of comparison. Since it is inappropriate
to simply sample 𝑥 ’s uniformly in a high-DOF space, as then average distances concentrate around
√DOF, we instead take a reference point and sample points from ℓ2 -balls of increasing distance from
the reference.
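A minimal sketch of this sampling scheme; the clipping back to the [−5, 5] box is an illustrative choice, not necessarily our exact procedure.

```python
# Minimal sketch: sample points at increasing l2 distances from a reference
# point, instead of uniformly over the high-DOF cube.
import numpy as np

rng = np.random.default_rng(0)
DOF = 100
reference = rng.uniform(-5.0, 5.0, size=DOF)

def sample_at_radius(radius: float, n: int) -> np.ndarray:
    directions = rng.normal(size=(n, DOF))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    return np.clip(reference + radius * directions, -5.0, 5.0)

samples = np.concatenate([sample_at_radius(r, 50) for r in (5, 15, 30, 45)])
```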
In Figure 5, we see that distances over the LLM embedding space are correlated with the traditional
measure of distance, but may be non-linearly warped, which benefits LLM-based regression in certain
cases as seen in Section 3.1.
[Figure 5 panels: Gemini Nano and Gemini Pro; axes: t-SNE Dimension 1 vs. t-SNE Dimension 2; the reference point is marked in each panel.]
Figure 5 | t-SNE for Gemini (Nano and Pro) embeddings of points sampled around a DOF=100
reference point. Traditional ℓ2 distance is overlaid in color.
In this subsection, we comprehensively investigate the impact of many common LLM factors on
regression performance.
Are Larger Models Always Better? Within the research community, the prevailing assumption is
that there exists a direct correlation between language model size and performance improvement.
However, with the rise of leaderboards such as LMSYS (LMS, 2023), smaller models have been shown
to outperform larger competitors, due to differences in their "recipe", such as training data quality,
pre-training and post-training techniques, and architecture.
[Figure 6 panels: T5 family (x-axis: Model Parameters, 10^8 to 10^10) and Gemini family (x-axis: Model Tier — Nano, Pro, Ultra); legend: AutoML, Init2Winit, XLA, L2DA.]
Figure 6 | Higher (↑) is better. Model size vs regression performance on hyperparameter tuning tasks
across T5 and Gemini model families. Median performance is plotted, along with 40-60 percentiles
as error bars.
In Figure 6, we see that over various real-world regression tasks, T5 models exhibit a clear trend of
improved performance as model size increases when training methodology is fixed. In contrast,
model tiers within the Gemini family exhibit substantial variance, and larger model sizes do not
consistently translate to superior results. We hypothesize this is due to differences in Gemini "recipes",
as e.g. different model tiers may have used different pre-training datasets, architecture tweaks, and
post-training configurations, whereas all T5 model sizes have only been pre-trained on the C4 web
crawl corpus.
Does Language Understanding Actually Help? Recent works (Devlin et al., 2019; Li et al., 2020)
have claimed that logit-based embeddings mostly measure the semantic similarity between string
inputs, and thus it is unclear whether they are beneficial for numeric regression tasks. To
resolve this, using the T5 family, we compare against using (1) a randomly initialized model for
the forward pass, and (2) representing our features via vocabulary embeddings without applying a
forward pass.
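A minimal sketch of the three embedding variants compared here, assuming the HuggingFace transformers library and the public t5-small checkpoint as stand-ins for our actual models.

```python
# Minimal sketch: (1) pre-trained forward pass, (2) randomly initialized
# forward pass, (3) vocabulary-table embeddings with no forward pass.
import torch
from transformers import AutoTokenizer, T5Config, T5EncoderModel

tokenizer = AutoTokenizer.from_pretrained("t5-small")
pretrained = T5EncoderModel.from_pretrained("t5-small")
random_init = T5EncoderModel(T5Config.from_pretrained("t5-small"))  # random weights

def mean_pool(hidden: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    mask = mask.unsqueeze(-1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

inputs = tokenizer('{"learning_rate":0.01}', return_tensors="pt")
with torch.no_grad():
    emb_pretrained = mean_pool(pretrained(**inputs).last_hidden_state,
                               inputs["attention_mask"])
    emb_random = mean_pool(random_init(**inputs).last_hidden_state,
                           inputs["attention_mask"])
    # Vocabulary-table features: pooled token embeddings, no forward pass.
    emb_vocab = mean_pool(pretrained.get_input_embeddings()(inputs["input_ids"]),
                          inputs["attention_mask"])
```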
[Figure 7 panels: Real-World Tasks (Pre-trained vs Random Init) and Real-World Tasks (Full Model vs Vocab Table); x-axis: Task (AutoML, Init2Winit, XLA, L2DA); y-axis: Kendall-Tau Correlation; bars: T5-Small, T5-Large, T5-XL, T5-XXL, and T5-* Random / T5-* Vocab.]
Figure 7 | Kendall-Tau regression comparisons when comparing to random initialization (left) and vocabulary
embeddings (right). Each bar is averaged across 75 tasks per family.
In Figure 7, we see that the default mode of applying a forward pass of a pre-trained model performs
the best, as expected. However, it is worth noting that in some tasks such as AutoML and L2DA, the
improvement is surprisingly quite minimal, suggesting that applying forward passes by pretrained
models does not always help for regression.
We further ablate differences in string representation, i.e. whether by default the feature names are
included in the JSON dictionary or only the values. In Figure 8, we see that mainly the XLA tasks appear to
benefit from feature names. This is surprising, as presumably feature names in XLA tasks such as
auto_cross_replica_sharding are not as common as names such as batch_size or learning_rate
found in both AutoML and Init2Winit.
[Figure 8 panel: Real-World Tasks (Feature Names vs Omitted); x-axis: Task (AutoML, Init2Winit, XLA, L2DA); y-axis: K-T Corr. Difference; bars: T5-Small, T5-Large, T5-XL.]
Figure 8 | Difference in Kendall correlation when using the full dictionary containing feature names, or
only values.
The results of Figures 7 and 8 combined lead to additional surprising conclusions, such as language-
to-numeric transfer. For instance, inputs 𝑥 from Init2Winit tasks only possess numeric values, and
as expected, removing feature names does not significantly change regression results. Yet applying
forward passes by pre-trained T5 models still benefits regression, despite the fact that T5’s pre-training
data contains mostly web-corpus data which is unlikely to contain significant amounts of scientific or
numeric information (Dodge et al., 2021).
More Training Data Reduces Baseline Gaps: Intuitively, as more samples are available in a task, the
difference in inductive biases between regression methods should matter less, since predictions will
be more influenced by training data. We verify this in Figure 9, where we see that for tasks with low
numbers of ( 𝑥, 𝑦 ) points, there is more variability in performance between using 𝜙LLM and 𝜙trad , but
additional training points decrease these differences.
[Figure 9 panels: AutoML and XLA; x-axis: Number of Training Points; y-axis: (T5-XXL − MLP) Kendall-Tau Correlation; legend: Mean and StdDev bands.]
Figure 9 | Performance gap between an MLP baseline and regression over T5-XXL embeddings for individual
trials within the AutoML and XLA task settings. Higher (↑) is better for LLM embeddings. Error bars are plotted
for {0.5, 1.0, 2.0} of the standard deviation.
Acknowledgments
We thank Yutian Chen, Daniel Golovin, Chansoo Lee, Tung Nguyen, and Sagi Perel for relevant
discussions during experimentation and the writing of this paper. We further thank the organizers of
the Google DeepMind Academy Program for providing the opportunity to do this research.
References
LMSYS: Large model systems organization, 2023. URL https://lmsys.org/.
J. Devlin, M. Chang, K. Lee, and K. Toutanova. BERT: pre-training of deep bidirectional transformers
for language understanding. In J. Burstein, C. Doran, and T. Solorio, editors, Proceedings of the
2019 Conference of the North American Chapter of the Association for Computational Linguistics:
Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1
(Long and Short Papers), pages 4171–4186. Association for Computational Linguistics, 2019. doi:
10.18653/V1/N19-1423.
O. Elhara, K. Varelas, D. Nguyen, T. Tusar, D. Brockhoff, N. Hansen, and A. Auger. Coco: the large scale
black-box optimization benchmarking (bbob-largescale) test suite. arXiv preprint arXiv:1903.06396,
2019.
M. G. Genton. Classes of kernels for machine learning: a statistics perspective. J. Mach. Learn. Res.,
2:299–312, Mar. 2002. ISSN 1532-4435.
D. Golovin, B. Solnik, S. Moitra, G. Kochanski, J. Karro, and D. Sculley. Google vizier: A service
for black-box optimization. In Proceedings of the 23rd ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, Halifax, NS, Canada, August 13 - 17, 2017, pages 1487–1495.
ACM, 2017.
V. Karpukhin, B. Oguz, S. Min, P. S. H. Lewis, L. Wu, S. Edunov, D. Chen, and W. Yih. Dense passage
retrieval for open-domain question answering. In B. Webber, T. Cohn, Y. He, and Y. Liu, editors,
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP
2020, Online, November 16-20, 2020, pages 6769–6781. Association for Computational Linguistics,
2020.
B. Li, H. Zhou, J. He, M. Wang, Y. Yang, and L. Li. On the sentence embeddings from pre-trained
language models. In B. Webber, T. Cohn, Y. He, and Y. Liu, editors, Proceedings of the 2020
Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November
16-20, 2020, pages 9119–9130. Association for Computational Linguistics, 2020.
T. Nguyen, Q. Zhang, B. Yang, C. Lee, J. Bornschein, Y. Miao, S. Perel, Y. Chen, and X. Song. Predicting
from strings: Language model embeddings for bayesian optimization, 2024.
C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. Exploring
the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:
140:1–140:67, 2020.
X. Song, O. Li, C. Lee, B. Yang, D. Peng, S. Perel, and Y. Chen. Omnipred: Language models as
universal regressors. CoRR, abs/2402.14547, 2024.
R. Vacareanu, V. Negru, V. Suciu, and M. Surdeanu. From words to numbers: Your large language
model is secretly a capable regressor when given in-context examples. CoRR, abs/2404.07544,
2024.
T. Weng, H. Zhang, P. Chen, J. Yi, D. Su, Y. Gao, C. Hsieh, and L. Daniel. Evaluating the robustness of
neural networks: An extreme value theory approach. In 6th International Conference on Learning
Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track
Proceedings. OpenReview.net, 2018.
Appendix
A. Extended Experiments
For full transparency, in Figure 10 we display BBOB functions where LLM-based regression was not
consistently dimensionally robust against the MLP and XGBoost baselines. Note that even here, there
are still cases where a language model outperforms at least one of the baselines, e.g. on
the Discus and DifferentPowers functions, Gemini and T5 outperform MLP but not XGBoost.
[Figure 10 panels: BentCigar, BuecheRastrigin, AttractiveSector, Schwefel, Weierstrass, Katsuura, NegativeMinDifference, and others (including Discus and DifferentPowers); x-axis: Degrees of Freedom (DOF); y-axis: Kendall-Tau Correlation; legend: XGBoost, MLP, Gemini Pro, T5-XXL.]
Figure 10 | Following Figure 2 in the main body, we present BBOB functions in which LLM embeddings
did not completely outperform traditional baselines.
Despite Table 1 of the main body showing that there were numerous cases where LLM embeddings
outperform traditional ones, we remind the reader in Figure 11 that on average, LLM embeddings still
slightly underperform.
[Figure 11: Full Real World Regression Results; x-axis: Task (AutoTFX, Init2Winit, XLA, L2DA); y-axis: Kendall-Tau Correlation; bars: XGBoost, MLP, Gemini Nano, Gemini Pro, Gemini Ultra, T5-Small, T5-Large, T5-XL, T5-XXL.]
Figure 11 | Full Results over real world tasks. Displayed is the mean Kendall-Tau Correlation over all
tasks within each family.
Following Figure 4, in Table 2, we see that the relationship between the smoothness induced by the
embedding and the performance in regression is consistent throughout.
Table 2 | Full set of Pearson correlations 𝜌 between Kendall-Tau regression performance and the gap
in NLFD between input and embedding space, for regression on all 23 BBOB functions over
DOF = [5, 10, 25, 50, 100].
The full list of hyperparameters and training details for MLP-based regression (using both traditional and
language model features) is as follows, with a minimal training sketch after the list:
• Regression Head: MLP with 2 ReLU hidden layers of dimension 256.
• 𝑦 -Normalization: We compute the empirical mean 𝜇 and standard deviation 𝜎 over all 𝑦 -values
in the task’s training data, and apply 𝑦 ← ( 𝑦 − 𝜇 )/𝜎 as a preprocessing step.
• Optimizer: AdamW with learning rates swept across {1e-4, 5e-4, 1e-3, 5e-3, 1e-2} and
weight decay across {0, 1e-1, 1}.
• Loss: Mean Squared Error.
• Maximum Epochs: 300, with early stopping enabled.
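A minimal PyTorch sketch consistent with the configuration above; data loading, the learning-rate/weight-decay sweep, validation, and early stopping are omitted.

```python
# Minimal sketch: 2-hidden-layer ReLU MLP head (width 256), AdamW, MSE loss,
# y-normalized targets. Full-batch training for brevity.
import torch
import torch.nn as nn

def make_head(input_dim: int) -> nn.Module:
    return nn.Sequential(
        nn.Linear(input_dim, 256), nn.ReLU(),
        nn.Linear(256, 256), nn.ReLU(),
        nn.Linear(256, 1),
    )

def train(head: nn.Module, X: torch.Tensor, y: torch.Tensor,
          lr: float = 1e-3, weight_decay: float = 0.0, epochs: int = 300) -> nn.Module:
    y = (y - y.mean()) / y.std()                      # y-normalization
    opt = torch.optim.AdamW(head.parameters(), lr=lr, weight_decay=weight_decay)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(head(X).squeeze(-1), y)
        loss.backward()
        opt.step()
    return head
```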
For XGBoost, we additionally grid-searched over the following parameters for each task (a minimal sketch follows the list):
• “min_child_weight": [1, 5, 10]
• “learning_rate": [0.001, 0.01, 0.1]
• “gamma": [0.0, 0.3, 0.5]
• “subsample": [0.6, 0.8, 1.0]
• “colsample_bytree": [0.6, 0.8, 1.0]
• “max_depth": [3, 5, 7]
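A minimal sketch of this per-task grid search, assuming the xgboost and scikit-learn packages.

```python
# Minimal sketch: grid search over the XGBoost parameters listed above.
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

param_grid = {
    "min_child_weight": [1, 5, 10],
    "learning_rate": [0.001, 0.01, 0.1],
    "gamma": [0.0, 0.3, 0.5],
    "subsample": [0.6, 0.8, 1.0],
    "colsample_bytree": [0.6, 0.8, 1.0],
    "max_depth": [3, 5, 7],
}
search = GridSearchCV(XGBRegressor(), param_grid,
                      scoring="neg_mean_squared_error")
# search.fit(X_train, y_train)  # fit per task; X_train, y_train assumed given
```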
Table 3 displays the embedding dimension 𝑑llm for each model used in our experiments. As mentioned in the
main text, note that 𝑑llm is significantly larger than 𝑑trad .
Table 3 | Embedding dimension 𝑑llm for each model.

T5 Model | 𝑑llm
Small | 512
Large | 1024
XL | 2048
XXL | 4096

Gemini Model | 𝑑llm
Nano | 1536
Pro | 6144
Ultra | 14336
Table 4 | Example 𝑥 representations from each of the regression task families. ‘. . . ’ denotes that there
are actually more parameters, but we omit them due to length.