
When large language models meet evolutionary algorithms
Chao Wang1, Jiaxuan Zhao1, Licheng Jiao1*, Lingling Li1, Fang Liu1, Shuyuan Yang1

1 School of Artificial Intelligence, Xidian University, No. 2 South Taibai Road, Xi'an, 710071, Shaanxi, China.

arXiv:2401.10510v2 [cs.NE] 29 Jun 2024

*Corresponding author(s). E-mail(s): [email protected];
Contributing authors: [email protected]; [email protected]; [email protected]; [email protected]; [email protected];

Abstract
Pre-trained large language models (LLMs) have powerful capabilities for gener-
ating creative natural text. Evolutionary algorithms (EAs) can discover diverse
solutions to complex real-world problems. Motivated by the collective nature and
directionality shared by text generation and evolution, this paper illustrates the
parallels between LLMs and EAs, covering multiple one-to-one key characteristics:
token representation and individual representation, position encoding and fitness
shaping, position embedding and selection, Transformer block and reproduction,
and model training and parameter adaptation. By examining these
parallels, we analyze existing interdisciplinary research, with a specific focus on
evolutionary fine-tuning and LLM-enhanced EAs. Drawing from these insights,
valuable future directions are presented for advancing the integration of LLMs
and EAs, while highlighting key challenges along the way. These parallels not only
reveal the evolution mechanism behind LLMs but also facilitate the development
of evolved artificial agents that approach or surpass biological organisms.

Keywords: Large Language Models, Evolutionary Algorithms

Large language models (LLMs) learn statistical patterns with temporal relations from
text sequences in an unsupervised manner to establish probability distributions of
texts, such as generative pre-trained Transformer (GPT) [1] and bidirectional encoder

representations from Transformers (BERT) [2].

[Figure 1: tokens ["I", "love", "you", ",", "deeply"] at positions 1–5 parallel individuals [x1, x2, x3, x4, x5] with fitness ranks 5, 4, 2, 3, 1.]

Fig. 1 Both tokens in a text and individuals in a population can be regarded as sequences. In a text,
each token corresponds to a specific position, while in a population, each individual is associated with
a particular fitness rank. Text sequences have a natural directionality derived from human-defined
grammatical rules. Population evolution is also directional, which is primarily driven by individual
fitness.

Typical LLMs can analyze input tokens
and generate the most likely subsequent tokens, a process that iterates to produce
creative texts. This sequence-to-sequence model with powerful understanding and gen-
eration capabilities has been employed to assist users on a variety of creative tasks,
including writing, mathematical discovery, and chemical research [3–8]. Meanwhile,
training LLMs on vast text demands significant computing resources, as exemplified
by ChatGPT’s pre-training consumption of several thousand petaflop/s-days [3]. Fine-
tuning techniques have been proposed to alleviate the computational challenges of
training from scratch [9]. In a model-as-a-service scenario [10], LLMs are only accessed
as inference APIs. Thanks to their gradient-free nature, evolutionary algorithms (EAs)
are employed to fine-tune LLMs in this black-box scenario, aiming to boost perfor-
mance on downstream tasks [10]. Such evolutionary fine-tuning is valued for its low
cost, since it requires no access to internal model gradients.
Drawing from biological evolution, EAs continuously maintain evolving systems
(population or probability distribution) through reproduction and selection to explore
fitness landscapes [11, 12]. Typical methods include the genetic algorithm (GA), evo-
lutionary strategy (ES), and genetic programming (GP) [13, 14]. In principle, only
individuals and their fitness are needed to drive the evolutionary process in these
approaches. Due to advancements in computational resources, EAs have provided
diverse solutions to complex black-box optimization issues, such as neuroevolution
[15], industrial design [16–19], and natural sciences [20]. Nonetheless, most EAs are
task-specific, and their capabilities do not automatically increase with experience [21–
24]. Recently, Transformer-enhanced EAs have utilized basic Transformer models [25]
to learn from optimization experiences, while LLM-enhanced EAs employ well-trained
LLMs [26, 27] to produce optimization experiences. Both approaches aim to improve the
performance and generalization of EAs.
Fig. 1 demonstrates that both text and populations can be regarded as sequence data
exhibiting directionality. LLMs and EAs are designed to learn or simulate such
sequential data.

[Figure 2: side-by-side pipelines of the generative pre-training Transformer (token representation → position encoding and embedding → L× Transformer blocks of multi-head self-attention and feed-forward layers → model training, iterating context windows X(t) → X(t+1) → X(T)) and the genetic algorithm (individual representation → fitness shaping → selection → crossover and mutation → evaluation and parameter adaptation, iterating populations P(t) → P(t+1) → P(T)).]

Fig. 2 Overview of the generative pre-training Transformer (GPT) and genetic algorithm (GA).
Modules of the same color indicate parallels, as exemplified by the analogy between crossover in
GA and attention in GPT. GPT continuously generates subsequent tokens by iterating over a con-
text window. The tokens in the input window are transformed by a large-scale Transformer block
(multi-head self-attention and feed-forward neural network). These tokens function collectively, pro-
viding contextual information for accurate generation. GA generates individuals with high fitness by
iterating a population. Individuals in the parent population are transformed by reproduction oper-
ators (crossover and mutation). These individuals exhibit collective intelligence, helping to explore
the search space and seek the optimal solutions.

Fig. 2 illustrates how LLMs and EAs generate sequences, taking GPT [1] and GA [28]
as respective examples. During the generation process, the context window and the
population are continuously updated to generate creative texts and diverse solutions.
Notably, the context window and the population share a collective nature, with their
constituent elements functioning cohesively in their respective domains. Inspired by
this shared directionality and collectivity, we raise the following questions: Are
there parallels between LLMs and EAs? Moreover, current interdisciplinary research,
whether focusing on evolutionary fine-tuning or LLM-enhanced EAs, remains in its
infancy. Can these parallels guide future interdisciplinary integration? To address these
issues, this paper draws analogies between the primary characteristics of LLMs and
EAs, emphasizing their common mechanisms. At the micro level, we analyze key inter-
disciplinary research related to each parallel, which not only supports our analogies
but also offers insights for potential improvements. At the macro level, we systemat-
ically summarize evolutionary fine-tuning and LLM-enhanced EAs to reveal critical
challenges.

1 Parallels
Transformer-based LLMs have developed rapidly since the introduction of the Trans-
former architecture [38]. Taking GPT as an example, LLMs mainly contain several

Table 1 A comparison of large language models and evolutionary algorithms in terms of key characteristics, where "N/A" indicates a lack of corresponding interdisciplinary research in that area.

| LLM characteristic | Classic methods | Traits | Key interdisciplinary research | EA characteristic | Classic methods | Traits | Key interdisciplinary research |
|---|---|---|---|---|---|---|---|
| Token representation | One-hot encoding + Linear transformation | Collective, Uniqueness, Finite | [10, 29] | Individual representation | Real encoding + Random embedding [30] | Collective, Uniqueness, Infinite and changing | [27, 31] |
| Position encoding | Sine and cosine functions | Uniqueness, Relativity | N/A | Fitness shaping | Rank transformation, Utility function [32] | Uniqueness, Relativity, Directionality | [25] |
| Position embedding | Absolute, Relative, Rotary [33] | Relativity | N/A | Selection | Tournament selection, Rank selection | Relativity, Directionality | [25] |
| Transformer block | Multi-head self-attention + Feed-forward neural network | Position-insensitive, Parallelism, Sparsity, Token and position information, Singleness, Synergy | [34] | Reproduction | Arithmetic crossover + Uniform mutation | Fitness-insensitive, Parallelism, Sparsity, Individual and fitness information, Singleness, Synergy | [25, 35] |
| Model training | Pre-training, Fine-tuning, Reinforcement learning | Learning, Exploration, Parameter space, Language space | [10, 29] | Parameter adaptation | Hyper-heuristics, Pre-training, Fine-tuning, Meta-learning | Learning, Exploration, Parameter space, Search space | [25, 27, 31, 36, 37] |
characteristics: token representation, position encoding, position embedding1 , Trans-
former block, and model training. In 1950, Turing proposed a ‘learning machine’
similar to the principles of evolution [39]. Since then, evolution-inspired computa-
tional theories have been explored and refined. Taking GA as an example, EAs include
several typical characteristics: individual representation, fitness shaping, selection,
reproduction, and parameter adaptation. Table 1 lists these characteristics, classic
methods, traits, and key interdisciplinary research integrating LLMs and EAs. Next,
each subsection focuses on elucidating the corresponding characteristics of LLMs and
EAs.

1.1 Token representation and individual representation


In LLMs, the input is represented as a token sequence X = {x_1, ..., x_N} ∈ R^{N×|V|},
where N and |V| are the sizes of the context window and the vocabulary V, respectively.
Each token is encoded as a high-dimensional sparse one-hot vector. Subsequently, token
embeddings map token encoding sequences into a low-dimensional dense vector space
[1]. For example, a linear transformation applies a word embedding matrix W_e ∈ R^{|V|×d_t}
to the token encoding sequence X to generate a new representation X = XW_e.
In EAs, the population is represented as an individual sequence P = {p_1, ..., p_N} ∈
R^{N×d}, where N and d are the population size and the coding dimension, respectively.
Each individual (or solution) is encoded into a data structure manipulable by genetic
operators. In numerical optimization, real encoding maps each individual into a real-valued
vector. The high dimensionality of the encoding increases optimization difficulty, and
many strategies have been proposed to deal with this curse of dimensionality [30, 40, 41].
For example, random embedding applies a random projection matrix W_r ∈ R^{d×d_r} to the
population P to generate a low-dimensional representation P = PW_r.
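As a minimal NumPy sketch of this parallel (toy sizes; in practice the word embedding matrix is learned and the random projection is fixed in advance), both representations reduce to plain matrix products:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5                                   # window size / population size

# Token representation: one-hot encoding + linear embedding.
V, d_t = 50, 8                          # vocabulary size, embedding dim
X = np.eye(V)[rng.integers(0, V, N)]    # N one-hot token vectors, (N, |V|)
W_e = rng.normal(size=(V, d_t))         # word embedding matrix
X_emb = X @ W_e                         # dense token embeddings, (N, d_t)

# Individual representation: real encoding + random embedding.
d, d_r = 1000, 8                        # original and reduced coding dims
P = rng.uniform(-1, 1, size=(N, d))     # real-coded population, (N, d)
W_r = rng.normal(size=(d, d_r))         # random projection matrix
P_emb = P @ W_r                         # low-dimensional representation, (N, d_r)
```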
Token representation can be regarded as a form of individual representation, satisfying
collectivity and uniqueness. The tokens in the context window correspond to the
individuals in the population, and both token encoding and individual encoding guarantee
a one-to-one mapping. This analogy conceptually provides bidirectional support for
interdisciplinary research. EAs using token representations can operate directly within
embedded or original token spaces to find high-quality input prompts [10, 29]. Natural
language over a finite vocabulary has demonstrated powerful representation capabilities,
which may bring new opportunities for individual representation. In evolution, the
decision space may be infinite, changing, and difficult to describe mathematically.
Fortunately, such complex search spaces can be represented directly with the help of
natural language. This flexibility enables LLM-enhanced EAs to tackle tasks that are
not easily reducible to simple mathematical formulas or notations, such as paths and
code [27, 31].

1.2 Position encoding and fitness shaping


Position encoding models the dependence of tokens at different positions in the
sequence. In GPT [1], sine and cosine functions of different frequencies are adopted to
encode these dependencies. Each token has a unique position encoding composed of
sinusoidal waves of different frequencies, whose combination contains relative distance
information between tokens. However, due to the symmetry of distances, this position
encoding cannot distinguish sequence direction.

1 'Position encoding' and 'position embedding' are closely related in some literature. For the sake of
analogy, we have chosen to treat them separately.
Fitness shaping transforms the fitness of individuals in a population to cope with
selection pressure. For example, rank-based fitness shaping is commonly used in ES
[32, 42]. Individuals are sorted in descending order of fitness. The corresponding fitness
is transformed into a set of utility values u_1 ≥ u_2 ≥ ... ≥ u_N by a utility function.
This utility function ensures invariance under fitness order-preserving transformations,
which preserves the relativity and directionality of fitness.
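Both mechanisms are short to state in code. The NumPy sketch below computes sinusoidal position encodings and rank-based utilities; the specific utility formula follows the common NES-style shaping and is our assumption, since [32] admits several variants:

```python
import numpy as np

def sinusoidal_encoding(N, d):
    """Sinusoidal position encoding: a unique code per position whose
    inner products carry relative-distance (but not direction) information."""
    pos = np.arange(N)[:, None]
    freqs = 10000.0 ** (-np.arange(0, d, 2) / d)
    pe = np.zeros((N, d))
    pe[:, 0::2] = np.sin(pos * freqs)
    pe[:, 1::2] = np.cos(pos * freqs)
    return pe

def rank_utilities(fitness):
    """Rank-based fitness shaping: utilities depend only on fitness ranks,
    so any order-preserving transformation of fitness leaves them unchanged."""
    n = len(fitness)
    ranks = np.argsort(np.argsort(-fitness))        # 0 = best individual
    u = np.maximum(0.0, np.log(n / 2 + 1) - np.log(ranks + 1))
    return u / u.sum() - 1.0 / n                    # zero-mean utilities

pe = sinusoidal_encoding(N=8, d=16)
u = rank_utilities(np.array([3.2, 1.5, 3.9, 2.1, 0.8]))
```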
Both position encoding and fitness shaping share the characteristic of coding
uniqueness. Classic position encoding effectively models the relative positions between
tokens, although it does not capture the directionality of the sequence. Fitness shaping
describes the relative and directional ordering of individual fitness. Inspired by fitness
shaping, the integration of sequence directionality into position encoding emerges as
a noteworthy research direction. Conversely, in existing Transformer-enhanced EAs
[25], fitness values are directly utilized for position encoding within the Transformer
model. According to our analogy, the introduction of fitness shaping has the potential
to aid Transformer-enhanced EAs in managing selective pressures.

1.3 Transformer block and reproduction


A vanilla Transformer block is composed of a multi-head self-attention mechanism, a
feed-forward neural network (FFN), residual connections, and layer normalization.
The self-attention mechanism performs a feature transformation on the token
embeddings. The token embedding X ∈ R^{N×d_t} is transformed into query Q = XW^Q, key
K = XW^K, and value V = XW^V through linear transformations W^Q ∈ R^{d_t×d_q},
W^K ∈ R^{d_t×d_k}, and W^V ∈ R^{d_t×d_v}. Then, query Q and key K are used to calculate the
attention matrix A describing token relationships. Applying the Softmax function to
A and multiplying by value V yields the transformed output X':

\[ A = QK^T, \quad A' = \mathrm{Softmax}\!\left(\frac{A}{\sqrt{d}}\right), \quad X' = A'V. \tag{1} \]

The multi-head self-attention mechanism operates by combining multiple self-


attention mechanisms to focus on information from different subspaces. Many studies
show that the learned attention matrix is sparse. To reduce computational complex-
ity, improvement mechanisms have been proposed, such as sparse attention and linear
attention [43]. In Transformer blocks, the FFN enhances the expressive ability of each
token x'_i by applying a nonlinear function f(x'_i). Residual connections help alleviate
the vanishing gradient problem, enabling deeper feature learning. Layer normalization
stabilizes the training process and improves convergence speed.
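A minimal single-head rendering of Eq. (1) in NumPy (toy sizes; multi-head combination, residual connections, and layer normalization omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_t, d_k = 5, 8, 8                        # tokens, embedding dim, head dim
X = rng.normal(size=(N, d_t))                # token embeddings
W_Q, W_K, W_V = (rng.normal(size=(d_t, d_k)) for _ in range(3))

def softmax(a):
    a = a - a.max(axis=-1, keepdims=True)    # numerical stability
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

# Eq. (1): attention as a row-stochastic mixing matrix applied to V.
Q, K, V = X @ W_Q, X @ W_K, X @ W_V
A = softmax(Q @ K.T / np.sqrt(d_k))          # (N, N) token-relation matrix
X_out = A @ V                                # each output row mixes all of V
```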
Typical reproduction involves crossover and mutation. Crossover acts on the parent
population to generate new individuals. Classic real crossover operators include
arithmetic crossover, simulated binary crossover [44], and more. We illustrate crossover's
workflow using arithmetic crossover [45] as an example. Any two individuals p_i and
p_j are randomly selected from the parent population P ∈ R^{N×d}. A new individual p'_i is
a linear combination of parental genes:

\[ p'_i = a_i p_i + a_j p_j. \tag{2} \]

We reformulate this process as:

\[ p'_i = 0 p_1 + \dots + a_i p_i + \dots + a_j p_j + \dots + 0 p_N = [0, \dots, a_i, \dots, a_j, \dots] P = A_i P. \tag{3} \]

Without loss of generality, N individuals can be produced in a batch:

\[ P' = [A_1; A_2; \dots; A_N] P = A P, \tag{4} \]

where A is a sparse matrix determined by the selection, termed the selection matrix
in this paper. In reproduction, mutation applies a nonlinear perturbation P_M(p'_i)
to each individual p'_i to promote individual diversity.
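A sketch of Eqs. (2)–(4) in NumPy makes the crossover-as-matrix-product view explicit; the Gaussian perturbation at the end is a stand-in for the unspecified mutation P_M:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 6, 10
P = rng.uniform(0, 1, size=(N, d))           # parent population

# Eq. (4): batch arithmetic crossover P' = A P, where each row of the
# sparse selection matrix A mixes exactly two randomly chosen parents.
A = np.zeros((N, N))
for row in range(N):
    i, j = rng.choice(N, size=2, replace=False)
    a = rng.uniform()                        # mixing coefficient
    A[row, i], A[row, j] = a, 1.0 - a
P_new = A @ P                                # same algebraic form as A' V

# Mutation: an independent perturbation of each offspring (a Gaussian
# stand-in for the nonlinear perturbation P_M).
P_new += rng.normal(scale=0.05, size=P_new.shape)
```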
Comparing (1) with (4), attention and crossover share a similar mathematical rep-
resentation, exhibiting parallelism and sparsity. Attention does not explicitly model
token positions. Similarly, crossover inherently does not consider individual fitness.
The attention and selection matrices play analogous roles: one determines token fea-
ture combinations, while the other governs parent genetic combinations. The attention
matrix is parameterized based on token embeddings, while the selection matrix is
heuristically built on individual relationships. Additionally, (1) demonstrates that the
input and output of attention occupy distinct latent spaces. However, (4) reveals that
the population retains the same representation space across crossover.
Both the FFN and mutation operate independently on each singleton (token or
individual) and can be processed in parallel. Existing works [34, 46] show that the
removal of FFNs degrades Transformer performance, emphasizing the crucial role of
both attention and FFNs. Analogously, the synergistic effect of crossover and mutation
underlies the remarkable efficiency of EAs [47, 48]. The randomness of mutation
contributes to the generation of diverse individuals within a population. FFNs with
dropout exhibit randomness during training but are deterministic at test time.
Introducing randomness into
FFNs has the potential to enhance the diversity of LLM outputs.
Zhang et al. [34] first analogized Transformer blocks to reproduction, utilizing
dynamic local populations in EAs for a stronger Vision Transformer. Crossover and
mutation were modeled with attention and FFN for Transformer-enhanced EAs [35].
These efforts confirm our analogy and demonstrate the potential for progress through
idea sharing between advanced Transformers and reproduction.

1.4 Position embedding and selection


Position embeddings integrate positional information into the attention mechanism,
using absolute, relative, and rotary techniques, to capture sequential dependencies
and contextual relationships within token sequences [33]. A typical absolute position
embedding is the sinusoidal position embedding, adding position information encoded
by sine and cosine functions to token embeddings. For any two tokens x_t and x_s, with
position information p_t and p_s, the attention matrix is expressed as:

\[ A_{t,s} = Q_t^T K_s = (x_t + p_t)^T W_Q^T W_K (x_s + p_s) \tag{5} \]
\[ = x_t^T W_Q^T W_K x_s + x_t^T W_Q^T W_K p_s + p_t^T W_Q^T W_K x_s + p_t^T W_Q^T W_K p_s. \tag{6} \]

Due to W_Q^T W_K, the relative position information between tokens is destroyed [49].
Several models, such as T5 [50], Transformer-XL [51], TENER [49], and DeBERTa [52],
have integrated relative positional information into the attention matrix to address this
limitation. For example, T5 directly adds token offsets to the attention weights:

\[ A_{t,s} = Q_t^T K_s + r_{b(t-s)} = x_t^T W_Q^T W_K x_s + r_{b(t-s)}. \tag{7} \]

Rotary position embedding incorporates relative position information through token
embedding rotation [33]:

\[ A_{t,s} = x_t^T W_Q^T R_{t-s} W_K x_s = (x_t^T W_Q^T R_t)(R_s^T W_K x_s), \tag{8} \]

which achieves a unification of absolute and relative position embeddings.
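A minimal NumPy sketch of the rotary mechanism behind Eq. (8), verifying that the attention score depends only on the relative offset t − s (the dimension pairing here is chosen for brevity; practical implementations rotate adjacent dimension pairs):

```python
import numpy as np

def rotate(x, pos, base=10000.0):
    """Apply a rotary position embedding: rotate dimension pairs of x by
    position-dependent angles (i.e., multiply by the rotation matrix R_pos)."""
    half = x.shape[-1] // 2
    ang = pos * base ** (-np.arange(half) / half)
    x1, x2 = x[:half], x[half:]
    return np.concatenate([x1 * np.cos(ang) - x2 * np.sin(ang),
                           x1 * np.sin(ang) + x2 * np.cos(ang)])

rng = np.random.default_rng(2)
q, k = rng.normal(size=8), rng.normal(size=8)

# Eq. (8): the attention score depends only on the relative offset t - s.
s1 = rotate(q, 5) @ rotate(k, 3)     # positions (5, 3), offset 2
s2 = rotate(q, 9) @ rotate(k, 7)     # positions (9, 7), offset 2
assert np.isclose(s1, s2)            # identical scores for identical offsets
```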


Based on fitness information, selection operators such as tournament and rank
selection are designed to identify excellent parents for crossover [45]. These selec-
tion methods can be viewed as heuristics for building the selection matrix. In binary
tournament selection, for instance, the selection matrix A is randomly created based
on fitness comparisons. In basic differential evolution [53], multiple individuals are
randomly chosen for differential operations, which influences the composition of the
selection matrix A. In OpenAI-ES [54], each individual is sampled from a multivariate
Gaussian distribution with mean (1/N) Σ_{i=1}^N u_i p_i and covariance σ²I:

\[ p'_i = \frac{1}{N}\sum_{i=1}^{N} u_i p_i + \mathcal{N}(0, \sigma^2 I), \tag{9} \]

where u_i is the utility value of the i-th parent individual p_i. The first term is crossover
and the second is mutation. The crossover can be rewritten as:

\[ \frac{1}{N}\sum_{i=1}^{N} u_i p_i = \left[\tfrac{1}{N}u_1, \dots, \tfrac{1}{N}u_N\right] P = A_i P, \tag{10} \]

where each row of the selection matrix A_i is determined by the parents' utility values. In
OpenAI-ES, the selection matrix has identical rows because every offspring is assigned
the same genetic material from the parents. Furthermore, in NSGA-II [55], the selection
matrix A is constructed using non-dominated sorting and crowding distance,
considering individual relationships in the objective space.
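Binary tournament selection, for instance, can be written directly as a selection-matrix construction (NumPy sketch, maximization assumed):

```python
import numpy as np

rng = np.random.default_rng(3)
N = 6
P = rng.uniform(size=(N, 4))                 # parent population
fitness = rng.uniform(size=N)                # higher is better here

# Binary tournament selection as a selection matrix A: each row
# one-hot-picks the fitter of two randomly drawn parents, so the
# mating pool is again a matrix product A @ P (compare A' V in Eq. (1)).
A = np.zeros((N, N))
for row in range(N):
    i, j = rng.choice(N, size=2, replace=False)
    A[row, i if fitness[i] > fitness[j] else j] = 1.0
mating_pool = A @ P
```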
By operating on matrix A, position embedding and selection add position and
fitness information to the attention mechanism and crossover, respectively. The selec-
tion notably steers the population towards enhanced fitness. Individuals with higher
fitness are preferentially selected for crossover, which considers the directionality of
individual fitness, fostering a more adaptive population. Standard position embed-
ding effectively captures the relative positions between tokens but does not explicitly
encode the sequential order of tokens. Current efforts have introduced task-specific
supervision during training to assist LLMs in comprehending the sequential relationships
among tokens. For example, GPT [1] employs a masked multi-head attention mech-
anism, ensuring that the output at each position is solely determined by preceding
tokens. This approach guarantees a unidirectional information flow, forcing the self-
attention mechanism to consider only the past context. Masked LLMs like BERT [56]
learn contextual representations by predicting masked tokens, necessitating a focus on
the entire textual context in both directions rather than just a unilateral one. Inspired
by the directionality of fitness considered in selection, introducing token order directly
into position embedding may enhance the generative capabilities of LLMs.
The attention matrix is influenced by both token and position relations. The
selection matrix is usually designed based on fitness relations. In complex fitness
landscapes, additional considerations such as genetic similarity among individuals are
factored into the selection matrix. For instance, in multi-modal optimization with mul-
tiple global optima [57, 58], individual distances within the search space are employed
as a criterion to preserve population diversity during selection. Therefore, the selection
matrix consists of individual relations and fitness relations. Essentially, the striking
similarity between attention and selection matrices in handling relational information
stems from the analogy drawn to tokens and individuals, as well as to positions and
fitness. Existing Transformer-enhanced EA merges individual embeddings with fitness
embeddings, echoing how token embeddings are combined with positional embeddings
[25]. This practice serves as a support for our analogy. Advanced attention mech-
anisms, such as those incorporating rotational positional embedding [33], have the
potential to enhance the performance of this operation in Transformer-enhanced EAs.

1.5 Model training and parameter adaptation


Model training typically begins with unsupervised pre-training, modeling natural lan-
guage on a vast amount of text. This is followed by fine-tuning, which adjusts the model
to downstream tasks. Given a token sequence X = {x_1, ..., x_N}, a unidirectional LLM
estimates a conditional probability distribution P(x_i | x_1, ..., x_{i-1}; θ) to generate
subsequent tokens. For example, in GPT's pre-training [1], the goal of language modeling
is to maximize the log-likelihood:

\[ L_{PT} = \sum_{i=1}^{N} \log P(x_i \mid x_1, \dots, x_{i-1}; \theta), \tag{11} \]

where N is the context window size and θ denotes the model parameters. During
fine-tuning, the optimization objective is a weighted sum of the pre-training loss L_{PT}
and the fine-tuning loss L_{FT} [1]:

\[ L = L_{FT} + \mu L_{PT}. \tag{12} \]
Hyperparameter µ ∈ [0, 1] determines the trade-off between the losses. Furthermore,
reinforcement learning [59] fine-tunes LLMs by optimizing the overall performance
(rewards) of their outputs to continuously generate high-quality responses. Evolutionary fine-tuning
is proposed for black-box cases with inaccessible gradients and limited resources.
These methods typically guide LLMs to generate the desired output by automatically
constructing prompts directly within the input sequence [10, 29].
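In code, Eqs. (11) and (12) amount to the following (NumPy sketch with toy logits standing in for a real model; the fine-tuning loss value is a placeholder):

```python
import numpy as np

def log_likelihood(logits, tokens):
    """Eq. (11): sum of log P(x_i | x_1, ..., x_{i-1}) under the model's
    next-token distributions, where logits[i] scores the i-th target token."""
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return float(log_probs[np.arange(len(tokens)), tokens].sum())

rng = np.random.default_rng(5)
V, N = 50, 10
logits = rng.normal(size=(N, V))        # toy next-token logits
tokens = rng.integers(0, V, size=N)     # toy target tokens

L_PT = -log_likelihood(logits, tokens)  # pre-training loss (negated Eq. (11))
L_FT = 1.7                              # placeholder fine-tuning loss value
mu = 0.5                                # trade-off hyperparameter in Eq. (12)
L = L_FT + mu * L_PT                    # combined fine-tuning objective
```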
In GA, given a parent population P = {p_1, ..., p_N}, offspring are sampled from
an implicit conditional probability distribution P(p_i | p_1, ..., p_N), which is induced by
genetic operators including selection, crossover, and mutation [31]. Genetic operator
parameters are often determined through repeated experiments or adaptively
updated using hyper-heuristic strategies [60]. In ES, the probability distribution
P(p | p_1, ..., p_N; θ) over the parent population is employed to produce offspring. For
example, in CMAES [42], the parameters (mean and covariance matrix) of a multivariate
Gaussian distribution are adapted by maximizing the log-likelihood:

\[ \sum_{i=1}^{N} u_i \log P(p_{i:N}; m); \qquad \sum_{i=1}^{N} u_i \log P\!\left(\frac{p_{i:N} - m}{\sigma}; C\right), \tag{13} \]

where p_{i:N} refers to the i-th ranked individual, based on fitness, in a population of N
individuals. The first term gives the mean update, while the second gives the rank-N
update of the covariance matrix. Recently, Transformer-enhanced EAs have adaptively
updated parameters from optimization experiences on a set of optimization tasks,
improving the generalization ability of EAs on new tasks. Common methods include
pre-training [25, 27] and meta-learning [36, 37]. In pre-training, optimization experi-
ences for multi-objective optimization [25] consist of the population and their fitness
generated by multi-objective EAs on numerous benchmarks. Optimization experiences
for GP [27] include incremental changes in files submitted by humans to version con-
trol systems like GitHub. Meta-learning [36, 37] adaptively updates parameters by
optimizing the average performance of EAs across a set of tasks. Regrettably, no
comprehensive study has compared these two methods within a single framework.
Furthermore, well-trained LLMs are directly utilized as reproduction operators with
human-like experience. Prompts based on historical populations are constructed to
guide LLMs in generating the desired output population [27, 31].
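The mean-update term of Eq. (13) reduces to a utility-weighted recombination of the ranked population; a sketch (NumPy, maximization assumed, covariance adaptation omitted):

```python
import numpy as np

rng = np.random.default_rng(6)
N, d = 10, 5
P = rng.normal(size=(N, d))                  # current population
fitness = rng.uniform(size=N)                # higher is better here

# Rank the population and build normalized utilities (cf. Sec. 1.2).
order = np.argsort(-fitness)                 # best individual first
u = np.maximum(0.0, np.log(N / 2 + 1) - np.log(np.arange(1, N + 1)))
u = u / u.sum()

# Mean-update term of Eq. (13): the distribution mean moves to the
# utility-weighted recombination of the ranked parents.
m = u @ P[order]
```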
Despite differing implementation strategies, LLMs and EAs converge on a shared
fundamental objective: revealing the underlying probability distributions within data,
thereby facilitating the learning and exploration of knowledge. In pre-training and
supervised fine-tuning, LLMs construct conditional probability distributions through
the accurate prediction of tokens, learning vast pre-existing knowledge. Reinforcement
learning adjusts the LLM parameters based on the rewards. Evolutionary fine-tuning
automatically searches for high-quality prompts to improve output quality. These two
paradigms explore new knowledge specific to the target task. Learning and exploration
jointly ensure the generative and generalization abilities of LLMs. EAs shape prob-
ability distributions based on fitness evaluated via real-time individual-environment
interaction. Traditional EAs continuously explore knowledge specific to a single target
task. In Transformer/LLM-enhanced EAs, models learn from existing optimization
experiences or human-like experiences, endowing EAs with powerful learning capa-
bilities. Additionally, LLM training works in the parameter space, while evolutionary
fine-tuning extends to the language space. EAs operate both in the search space (e.g.,
GA) and the parameter space (e.g., CMAES). The aforementioned analogy provides
a reasonable motivation for interdisciplinary research: merging the exceptional learn-
ing capabilities of LLMs with the remarkable exploration abilities of EAs can foster
advancements in their respective fields.
In (12), the hyperparameter µ is carefully tuned manually, as it affects the general-
ization ability of LLMs. The loss trade-off is modeled as a multi-objective optimization
problem [61, 62]. Applying advanced multi-objective EAs aids in creating stronger
supervised fine-tuning paradigms. Recent studies [54] show that ES has advantages
over gradient-based reinforcement learning in long episodes with many time steps.
ES is thus a promising alternative to reinforcement learning for training LLMs in
multi-turn dialogue systems. Evolutionary fine-tuning typically focuses on addressing
a single-task scenario. Incorporating optimization experiences from various fine-tuning
contexts into EAs can boost their adaptability.
The generalization of the Transformer-enhanced EAs is influenced by optimiza-
tion experience, which involves a set of historical optimization tasks. The similarity
between historical and new tasks determines the effectiveness of optimization expe-
rience utilization, echoing the motivation behind evolutionary transfer optimization
(ETO) [21]. Benchmarks for ETO [22] can potentially serve as optimization experi-
ences for training in diverse transfer scenarios. Additionally, using algorithms, human
expertise, or LLMs to generate optimization experiences across various benchmarks is
critical for expanding the application scope.

1.6 Discussion
Although LLMs and EAs developed independently, their parallels are truly remark-
able. From a macro perspective, analogies hold promise for the development of artificial
agents capable of learning from established knowledge while continuously exploring
new knowledge. These parallels have been implicitly mentioned or experimentally
demonstrated in interdisciplinary research [10, 25, 27, 29, 31, 34–37]. However, a
unified paradigm with one-to-one key feature correspondence has not emerged. Our
analogy aims to provide a roadmap for its realization. In existing efforts to integrate
EAs and LLMs, evolutionary fine-tuning in black-box scenarios and LLM-enhanced
EAs are receiving increasing attention. Next, this paper provides a comprehensive
review of them to identify key challenges.

2 Evolutionary fine-tuning in black-box scenarios


Fine-tuning reduces the risk of data leakage and avoids the huge computational
cost of training a model from scratch [9]. EAs are widely used to fine-tune LLMs in
complex scenarios due to their flexibility. Evolutionary model tuning adjusts the model's
weights or architecture [63–65], requiring a deep understanding of LLM internals.
However, real-world constraints like limited computing resources and access restrictions
can hinder this process. In contrast, evolutionary prompt tuning [10] and evolutionary
self-tuning [66–70] primarily focus on modifying the model's input to enhance
performance on specific tasks, requiring no access to internal information. These
evolutionary fine-tuning techniques in black-box scenarios are gaining attention for
their low cost, as detailed in Tables 2 and 3.

[Figure 3: an evolutionary algorithm searches either a continuous space z, mapped to the embedding space via Az + p0, or a discrete prompt space P; candidate prompts are concatenated with the input (e.g., "Best film ever. It was <MASK>.") and scored through the large language model as f(z)/f(P).]
Fig. 3 Basic workflow of evolutionary prompt tuning. Evolutionary algorithms are utilized to effi-
ciently search for optimal discrete prompts or continuous prompt embeddings, thereby boosting the
performance of large language models on downstream tasks.

As shown in Fig. 3, evolutionary prompt tuning enhances model generation qual-
ity in few-shot or zero-shot settings by searching input prompts. EAs are employed
to find prompts to maximize task performance [10], relying solely on LLM inference
results. Current approaches are categorized as continuous and discrete prompt tuning.
Continuous prompt tuning uses continuous EAs like CMAES to refine prompt embed-
dings. To enrich the information within the embedding space, various decomposition
strategies, such as divide-and-conquer and subspace learning, are incorporated
[71, 72, 74, 76]. Meanwhile, techniques like knowledge distillation, variational inference,
and federated learning are used to boost search efficiency, improve generalization, and
enhance security [73, 75, 77, 78]. Continuous prompts require access to the embedding
space, making them unsuitable for strict black-box settings.
prompt space using discrete EAs, in which custom genetic operators heuristically mod-
ify prompts [29, 79]. Zhou et al. [80] clustered and pruned the discrete search space to
target promising prompt regions, addressing combinatorial explosion. In addition, evo-
lutionary prompt tuning is also used in adversarial attacks and multi-modal learning
[81, 82], generating effective attacks and diverse prompts. Recently, LLMs, with strong
generative capacity, act as genetic operators in EAs, creating high-quality prompts
[66–69], termed self-tuning in this paper. In addition to prompt generation, LLMs can
serve as versatile prompt selectors for out-of-domain tasks [70]. Self-tuning works in a
flexible language space, independent of parameter updates.
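The continuous branch can be sketched end to end. The loop below follows the BBT-style setup of Fig. 3 (random embedding Az + p0), but substitutes a simple (1+λ) ES for CMAES and a hypothetical llm_loss stand-in for real LLM inference calls:

```python
import numpy as np

rng = np.random.default_rng(4)
D, d = 1024, 16     # LLM prompt-embedding dim, low intrinsic search dim

# Random embedding as in Fig. 3: search a small z, map it into the
# prompt-embedding space via a fixed random matrix, i.e. A @ z + p0.
A = rng.normal(scale=1.0 / d, size=(D, d))
p0 = np.zeros(D)

def llm_loss(prompt_embedding):
    """Hypothetical black-box objective: in a real system this would
    prepend the embedding to the input and query the LLM inference API."""
    return float(np.sum(prompt_embedding ** 2))   # stand-in objective

z, sigma, lam = np.zeros(d), 0.5, 8               # (1+lambda) ES state
best = llm_loss(A @ z + p0)
for _ in range(100):
    cands = z + sigma * rng.normal(size=(lam, d))
    losses = [llm_loss(A @ c + p0) for c in cands]
    if min(losses) < best:                        # keep the best prompt seen
        best, z = min(losses), cands[int(np.argmin(losses))]
```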
Evolutionary model tuning targets parameter space, while evolutionary prompt
tuning and self-tuning focus on language (search) space. Compared to model training
involving many gradient-descent steps, evolutionary black-box tuning is highly cost-
effective. Current research focuses on model evolution within the language space.
In open environments, complex tasks may require self-coevolution in both language and parameter spaces.

Table 2 A comprehensive summary of evolutionary prompt tuning, highlighting its key characteristics: decision variable and its traits, objective and its traits, adopted methods, newly introduced models, retraining, and internal access.

| Literature | Decision variable | Variable traits | Objective | Objective traits | Adopted methods | New model | Retraining | Internal access |
|---|---|---|---|---|---|---|---|---|
| BBT [10] | Prompt embedding | Continuous | Loss | Single-objective | Random embedding, CMAES | No | No | No |
| BBTv2 [71] | Prompt embedding | Continuous | Loss | Single-objective | Divide-and-conquer, Random embedding, CMAES | No | No | Yes |
| Textual inversion [72] | Prompt embedding | Continuous | Loss | Single-objective | Subspace decomposition, Random embedding, CMAES | No | No | No |
| SNPE/ABC-SMC [73] | Prompt embedding | Continuous | Loss | Single-objective | Variational inference, Random embedding, CMAES | No | No | No |
| PCT [74] | Prompt embedding | Continuous | Loss | Single-objective | Prompt-Calibrated Tuning, Whole-word mask, CMAES | No | No | Yes |
| BBT-RGB [75] | Prompt embedding | Continuous | Loss | Single-objective | Divide-and-conquer, Random embedding, COBYLA, CMAES | No | No | Yes |
| BSL [76] | Prompt embedding | Continuous | Loss | Single-objective | Subspace learning, Random embedding, CMAES | No | No | No |
| GDFO [77] | Prompt embedding | Continuous | Loss | Single-objective | Knowledge distillation, Random embedding, CMAES | Student model | Yes | No |
| FedBPT [78] | Prompt embedding | Continuous | Multi-client loss | Single-objective | Federated CMAES | No | No | No |
| GAP3 [29] | Prompt | Discrete | Performance score, Predicted probability | Multi-objective | Multi-level evaluation, Genetic algorithm | No | No | No |
| GrIPS [79] | Prompt | Discrete | Accuracy, Entropy | Multi-objective | Weighted sum, Genetic algorithm | No | No | No |
| ClaPS [80] | Prompt | Discrete | Loss | Single-objective | Clustering and pruning, Evolutionary algorithm | No | No | No |
| Attacks [81] | Prompt | Discrete | Cosine similarity | Single-objective | Fitness approximation, Genetic algorithm | No | No | No |
| BPT-VLM [82] | Text-image prompt embedding | Continuous | Loss | Single-objective | Random embedding, MMES, MAES, CMAES | No | No | No |
Table 3 A comprehensive summary of evolutionary self-tuning, highlighting its key characteristics: decision variable and its traits, objective and its traits, adopted methods, newly introduced models, retraining, and internal access.

| Literature | Decision variable | Variable traits | Objective | Objective traits | Adopted methods | New model | Retraining | Internal access |
|---|---|---|---|---|---|---|---|---|
| iPrompt [66] | Prompt | Discrete | Render function | Single-objective | LLM-based genetic operators, Rank-based selection, Exploration | No | No | No |
| Promptbreeder [67] | Task-mutation prompt | Discrete | Performance score | Single-objective | LLM-based genetic operators, Genetic algorithm | No | No | No |
| Auto-Instruct [70] | Instruction | Discrete | Predicted score | Single-objective | LLM-based genetic operators, Rank-based selection model | Selection model | Yes | No |
| SPELL [68] | Prompt | Discrete | Classification accuracy | Single-objective | LLM-based genetic operators, Genetic algorithm | No | No | No |
| EVOPROMPT [69] | Prompt | Discrete | Performance score | Single-objective | LLM-based genetic operators, Genetic algorithm, Differential evolution | No | No | No |
[Figure 4: columns of language-represented individuals, one per type: paths (e.g., 1 -> 2 -> 3 -> 4 -> 5), numbers (e.g., 3.2, 1.5, 3.9, 2.1, 0.8), mathematical expressions (e.g., 2x^2+3sin(x)-4exp(x)), code (e.g., print("Hi!")), sentences (e.g., Hello, World!), and prompts (e.g., Dream Bigger).]

Fig. 4 Various complex individual representations can be represented directly using natural language
descriptions, such as paths, numbers, mathematical expressions, code, sentences, and prompts.

Such coevolution poses challenges like resource management, catastrophic forgetting,
fitness assessment, and security issues. Efficient resource management strategies
help save costs in continuous evolution. Finding a balance between new and old knowl-
edge mitigates catastrophic forgetting. Designing collaborative evaluation strategies
for language and parameter spaces tailored to specific tasks is essential. Self-evolving
systems urgently require a robust security assessment mechanism to avert ethical chal-
lenges. In addition, integrating text, images, audio, and video is increasingly crucial
[83]. Evolutionary multimodal fusion techniques [65, 82] offer a promising path to
unifying diverse information, thereby expanding the applicability and versatility of
evolutionary fine-tuning.

3 Large language model-enhanced evolutionary algorithms
Fig. 4 illustrates how complex individual representations can be expressed via flexible
natural language. Such language-represented populations can be directly processed
by LLMs with strong text comprehension and generation skills. Table 4 summarizes
the LLM-enhanced EAs, where LLMs are employed as the reproduction and muta-
tion operator. These methods maintain the population via LLM-based evolutionary
operators to find diverse solutions to complex real-world challenges.
LLM-based reproduction enables LLMs to derive offspring from parents based
on prompts. Prompts usually consist of a problem description (optional), parent
population, and task instructions. LLMs apply task instructions to the parent popula-
tion, generating offspring [31]. Romera-Paredes et al. [4] introduced a program search
method for mathematical reasoning, where LLMs create multiple programs from par-
ents. Fitness, expressed via numerical values, training logs, and human feedback, is also
integrated into prompts to guide the reproduction process [26, 27, 84, 85, 91–100]. For
example, in automated learning, training logs act as fitness for finding efficient archi-
tectures and hyperparameters [86–90]. Reproduction using LLMs operates directly in
language space, without needing extensive parameter access, resulting in cost savings.
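A sketch of such an LLM-based reproduction operator, where llm is a hypothetical text-completion helper (e.g., a thin wrapper around any chat-completion API) and the prompt wording is illustrative rather than taken from the cited works:

```python
# `llm` is a hypothetical text-completion helper; the prompt follows the
# pattern described above: problem description, parent population with
# fitness, and task instructions.
def llm_crossover(parents, scores, task, llm, n_offspring=4):
    ranked = sorted(zip(scores, parents), reverse=True)   # best first
    prompt = (
        f"Task: {task}\n"
        "Existing solutions with their scores, best first:\n"
        + "\n".join(f"score={s:.3f}: {p}" for s, p in ranked)
        + f"\nPropose {n_offspring} new, diverse solutions likely to score "
          "higher. Output one solution per line."
    )
    return llm(prompt).strip().splitlines()[:n_offspring]
```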
LLMs can also be viewed as mutation operators affecting a single individual. Muta-
tion prompts typically include individual and task instructions. For instance, in the
data management strategy SEED [100], LLMs branch an initial code fragment into
multiple new code fragments based on task instructions. In addition, Lehman et al.
[27] introduced a mutation operator diff that acts on the parameter space. The diff
model performs parameter updates in an autoregressive manner, learning incremental
changes to files. Given a parent code, the diff model can simulate the modification
behavior of human programmers to generate new code.

Table 4 Large language model-enhanced evolutionary algorithms. Large language models are employed as the reproduction and mutation operators in evolutionary algorithms.

| | Reproduction | Mutation |
|---|---|---|
| Prompt construction | Problem description (optional), Population, Task instructions | Individual, Task instructions |
| Operation space | Language space | Language space, Parameter space |
| Evolutionary algorithms | Hill climbing, Genetic algorithm, Genetic programming, Quality diversity, MOEA/D, Local search | Genetic algorithm, Genetic programming, Quality diversity |
| Applications | Function search [4], Combinatorial optimization [84], Reward function optimization [85], Automatic machine learning [86–90], Multi-objective optimization [91], Prompt tuning [92], Text generation [93, 94], Game design [95, 96], Materials science [97], Image generation [31], Algorithm design [98] | Code generation [26, 27], Data management [99], Fuzzing [100] |
Current methods primarily operate in the language space, offering high flex-
ibility and low cost. However, when model parameters are accessible, designing
efficient genetic operators in both parameter and language spaces deserves deeper
investigation for potential improvements. During evolution, LLMs must address the
exploration-exploitation challenge. Exploration encourages the generation of novel and
diverse outputs, while exploitation prioritizes outputs that are highly relevant to the
given context, potentially sacrificing creativity. Striking a balance between these two
strategies determines the ability of LLMs to autonomously acquire new knowledge.
Evolutionary multi-objective optimization [101] promises to provide a set of solutions
with different trade-offs.

4 Conclusion
LLMs and EAs have spurred innovation across interdisciplinary domains. The syn-
ergistic integration of EAs with LLMs is poised to realize the evolutionary learning
machines proposed by Turing [39]. This paper demonstrates the parallels between
LLMs and EAs from five aspects, initially indicating that analogies can potentially
spark a new artificial intelligence paradigm integrating LLMs' learning abilities with
EAs' exploratory capabilities. Recently, LLMs have shown promise in utilizing prin-
ciples of evolution [4, 27, 31, 66–70, 85]. The exponential growth in computing power
enables large models combined with evolutionary mechanisms to perform reasoning in
complex environments. Analogies offer an early explanation for interconnected studies.

References
[1] Radford, A., Narasimhan, K., Salimans, T., Sutskever, I., et al.: Improving
language understanding by generative pre-training. Preprint at https://openai.
com/research/language-unsupervised (2018)

[2] Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: Pre-training of deep
bidirectional transformers for language understanding. In: Burstein, J., Doran,
C., Solorio, T. (eds.) Proceedings of the 2019 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language
Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Association
for Computational Linguistics, Minneapolis, Minnesota (2019)

[3] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Nee-
lakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A.,
Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter,
C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J.,
Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D.: Language
models are few-shot learners. Advances in neural information processing systems
33, 1877–1901 (2020)

[4] Romera-Paredes, B., Barekatain, M., Novikov, A., Balog, M., Kumar, M.P.,
Dupont, E., Ruiz, F.J., Ellenberg, J.S., Wang, P., Fawzi, O., et al.: Mathematical
discoveries from program search with large language models. Nature, 1–3 (2023)

[5] Shanahan, M., McDonell, K., Reynolds, L.: Role play with large language
models. Nature, 1–6 (2023)

[6] Schramowski, P., Turan, C., Andersen, N., Rothkopf, C.A., Kersting, K.: Large
pre-trained language models contain human-like biases of what is right and
wrong to do. Nature Machine Intelligence 4(3), 258–268 (2022)

[7] Liu, S., Nie, W., Wang, C., Lu, J., Qiao, Z., Liu, L., Tang, J., Xiao, C., Anand-
kumar, A.: Multi-modal molecule structure–text model for text-based retrieval
and editing. Nature Machine Intelligence 5(12), 1447–1457 (2023)

[8] Boiko, D.A., MacKnight, R., Kline, B., Gomes, G.: Autonomous chemical
research with large language models. Nature 624(7992), 570–578 (2023)

[9] Zheng, H., Shen, L., Tang, A., Luo, Y., Hu, H., Du, B., Tao, D.: Learn from
model beyond fine-tuning: A survey. arXiv preprint arXiv:2310.08184 (2023)

[10] Sun, T., Shao, Y., Qian, H., Huang, X., Qiu, X.: Black-box tuning for language-
model-as-a-service. In: International Conference on Machine Learning, pp.
20841–20855 (2022). PMLR

[11] Eiben, A.E., Smith, J.: From evolutionary computation to the evolution of
things. Nature 521(7553), 476–482 (2015)

[12] Kudela, J.: A critical problem in benchmarking and analysis of evolutionary


computation methods. Nature Machine Intelligence 4(12), 1238–1245 (2022)

[13] Jin, Y., Wang, H., Sun, C.: Data-driven Evolutionary Optimization. Springer,
Cham, Switzerland (2021)

[14] Miikkulainen, R., Forrest, S.: A biological perspective on evolutionary compu-


tation. Nature Machine Intelligence 3(1), 9–15 (2021)

[15] Stanley, K.O., Clune, J., Lehman, J., Miikkulainen, R.: Designing neural
networks through neuroevolution. Nature Machine Intelligence 1(1), 24–35
(2019)

[16] Matthews, D., Spielberg, A., Rus, D., Kriegman, S., Bongard, J.: Efficient
automatic design of robots. Proceedings of the National Academy of Sciences
120(41), 2305180120 (2023)

[17] Shu, J.I., Wang, Y., Brown, A., Kaminsky, A.: Genetic-algorithm-guided devel-
opment of parametric aeroelastic reduced-order models with state-consistence
enforcement. AIAA Journal, 1–19 (2023)

[18] Jin, Y., Wang, H., Chugh, T., Guo, D., Miettinen, K.: Data-driven evolutionary
optimization: An overview and case studies. IEEE Transactions on Evolutionary
Computation 23(3), 442–458 (2018)

[19] Li, B., Wei, Z., Wu, J., Yu, S., Zhang, T., Zhu, C., Zheng, D., Guo, W.,
Zhao, C., Zhang, J.: Machine learning-enabled globally guaranteed evolutionary
computation. Nature Machine Intelligence, 1–11 (2023)

[20] Lin, T., Chen, S., Basu, R., Pei, D., Cheng, X., Kara, L.B.: Target specific pep-
tide design using latent space approximate trajectory collector. arXiv preprint
arXiv:2302.01435 (2023)

[21] Gupta, A., Ong, Y.-S., Feng, L.: Insights on transfer optimization: Because
experience is the best teacher. IEEE Transactions on Emerging Topics in
Computational Intelligence 2(1), 51–64 (2018)

[22] Xue, X., Yang, C., Feng, L., Zhang, K., Song, L., Tan, K.C.: A scalable
test problem generator for sequential transfer optimization. arXiv preprint
arXiv:2304.08503 (2023)

[23] Gupta, A., Ong, Y.-S., Feng, L.: Multifactorial evolution: Toward evolutionary
multitasking. IEEE Transactions on Evolutionary Computation 20(3), 343–357
(2016)

[24] Wang, C., Liu, J., Wu, K., Wu, Z.: Solving multitask optimization problems
with adaptive knowledge transfer via anomaly detection. IEEE Transactions on
Evolutionary Computation 26(2), 304–318 (2022)

[25] Hong, H., Jiang, M.: Pre-evolved model for complex multi-objective optimization
problems. arXiv preprint arXiv:2312.06125 (2023)

[26] Brownlee, A.E., Callan, J., Even-Mendoza, K., Geiger, A., Hanna, C., Petke,
J., Sarro, F., Sobania, D.: Enhancing genetic improvement mutations using
large language models. In: International Symposium on Search Based Software
Engineering, pp. 153–159 (2023). Springer

[27] Lehman, J., Gordon, J., Jain, S., Ndousse, K., Yeh, C., Stanley, K.O.: In:
Banzhaf, W., Machado, P., Zhang, M. (eds.) Evolution Through Large Models,
pp. 331–366. Springer, Singapore (2024)

[28] Goldberg, D.E.: Genetic Algorithms. Pearson Education India, New York (1989)

[29] Zhao, J., Wang, Z., Yang, F.: Genetic prompt search via exploiting language
model probabilities. In: Proceedings of the Thirty-Second International Joint
Conference on Artificial Intelligence, pp. 5296–5305 (2023)

[30] Liu, J., Sarker, R., Elsayed, S., Essam, D., Siswanto, N.: Large-scale evolution-
ary optimization: A review and comparative study. Swarm and Evolutionary
Computation, 101466 (2024)

[31] Meyerson, E., Nelson, M.J., Bradley, H., Moradi, A., Hoover, A.K., Lehman,
J.: Language model crossover: Variation through few-shot prompting. arXiv
preprint arXiv:2302.12170 (2023)

[32] Wierstra, D., Schaul, T., Glasmachers, T., Sun, Y., Peters, J., Schmidhuber, J.:
Natural evolution strategies. The Journal of Machine Learning Research 15(1),
949–980 (2014)

[33] Su, J., Ahmed, M., Lu, Y., Pan, S., Bo, W., Liu, Y.: Roformer: Enhanced
transformer with rotary position embedding. Neurocomputing 568, 127063
(2024)

[34] Zhang, J., Xu, C., Li, J., Chen, W., Wang, Y., Tai, Y., Chen, S., Wang, C.,
Huang, F., Liu, Y.: Analogous to evolutionary algorithm: Designing a unified
sequence model. Advances in Neural Information Processing Systems 34, 26674–
26688 (2021)

[35] Li, X., Wu, K., Zhang, X., Wang, H., Liu, J.: B2opt: Learning to optimize black-
box optimization with little budget. arXiv preprint arXiv:2304.11787 (2023)

[36] Lange, R., Schaul, T., Chen, Y., Lu, C., Zahavy, T., Dalibard, V., Flennerhag, S.:
Discovering attention-based genetic algorithms via meta-black-box optimization.
In: Proceedings of the Genetic and Evolutionary Computation Conference, pp.
929–937 (2023)

[37] Lange, R.T., Schaul, T., Chen, Y., Zahavy, T., Dalibard, V., Lu, C., Singh, S.,
Flennerhag, S.: Discovering evolution strategies via meta-black-box optimiza-
tion. In: The Eleventh International Conference on Learning Representations
(2023)

[38] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N.,
Kaiser, L., Polosukhin, I.: Attention is all you need. Advances in neural
information processing systems 30 (2017)

[39] Turing, A.M.: Computing Machinery and Intelligence. Springer, Dordrecht


(2009)

[40] Qian, H., Yu, Y.: Solving high-dimensional multi-objective optimization prob-
lems with low effective dimensions. In: Proceedings of the AAAI Conference on
Artificial Intelligence, vol. 31 (2017)

[41] Tian, Y., Si, L., Zhang, X., Cheng, R., He, C., Tan, K.C., Jin, Y.: Evolutionary
large-scale multi-objective optimization: A survey. ACM Computing Surveys
(CSUR) 54(8), 1–34 (2021)

[42] Hansen, N.: The CMA evolution strategy: A tutorial. arXiv preprint
arXiv:1604.00772 (2016)

[43] Niu, Z., Zhong, G., Yu, H.: A review on the attention mechanism of deep
learning. Neurocomputing 452, 48–62 (2021)

[44] Deb, K., Agrawal, R.B., et al.: Simulated binary crossover for continuous search
space. Complex systems 9(2), 115–148 (1995)

[45] Simon, D.: Evolutionary Optimization Algorithms. John Wiley & Sons, Hobo-
ken, New Jersey (2013)

[46] Dong, Y., Cordonnier, J.-B., Loukas, A.: Attention is not all you need:
Pure attention loses rank doubly exponentially with depth. In: International
Conference on Machine Learning, pp. 2793–2803 (2021). PMLR

[47] Zhou, Z.-H., Yu, Y., Qian, C.: Evolutionary Learning: Advances in Theories and
Algorithms. Springer, Singapore (2019)

[48] Hassanat, A., Almohammadi, K., Alkafaween, E., Abunawas, E., Hammouri, A.,
Prasath, V.S.: Choosing mutation and crossover ratios for genetic algorithms—a
review with a new dynamic approach. Information 10(12), 390 (2019)

[49] Yan, H., Deng, B., Li, X., Qiu, X.: Tener: adapting transformer encoder for
named entity recognition. arXiv preprint arXiv:1911.04474 (2019)

[50] Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y.,
Li, W., Liu, P.J.: Exploring the limits of transfer learning with a unified text-to-
text transformer. The Journal of Machine Learning Research 21(1), 5485–5551
(2020)

[51] Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q., Salakhutdinov, R.:
Transformer-XL: Attentive language models beyond a fixed-length context. In:
Korhonen, A., Traum, D., Màrquez, L. (eds.) Proceedings of the 57th Annual
Meeting of the Association for Computational Linguistics, pp. 2978–2988.
Association for Computational Linguistics, Florence, Italy (2019)

[52] He, P., Liu, X., Gao, J., Chen, W.: Deberta: Decoding-enhanced bert with dis-
entangled attention. In: International Conference on Learning Representations
(2020)

[53] Das, S., Suganthan, P.N.: Differential evolution: A survey of the state-of-the-art.
IEEE Transactions on Evolutionary Computation 15(1), 4–31 (2010)

[54] Salimans, T., Ho, J., Chen, X., Sidor, S., Sutskever, I.: Evolution strategies as a
scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864
(2017)

[55] Deb, K., Pratap, A., Agarwal, S., Meyarivan, T.: A fast and elitist multiobjective
genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation 6(2),
182–197 (2002)

[56] Wagner, A., Mitra, T., Iyer, M., Da Costa, G., Tremblay, M.: Position masking
for language models. arXiv preprint arXiv:2006.05676 (2020)

[57] Singh, G., Deb, K.: Comparison of multi-modal optimization algorithms based
on evolutionary algorithms. In: Proceedings of the 8th Annual Conference on
Genetic and Evolutionary Computation. GECCO ’06, pp. 1305–1312. Associa-
tion for Computing Machinery, New York, NY, USA (2006)

[58] Das, S., Maity, S., Qu, B.-Y., Suganthan, P.N.: Real-parameter evolutionary mul-
timodal optimization—a survey of the state-of-the-art. Swarm and Evolutionary
Computation 1(2), 71–88 (2011)

[59] Du, Y., Watkins, O., Wang, Z., Colas, C., Darrell, T., Abbeel, P., Gupta,
A., Andreas, J.: Guiding pretraining in reinforcement learning with large lan-
guage models. In: International Conference on Machine Learning, pp. 8657–8677
(2023). PMLR

[60] Burke, E.K., Gendreau, M., Hyde, M., Kendall, G., Ochoa, G., Özcan, E., Qu,
R.: Hyper-heuristics: A survey of the state of the art. Journal of the Operational
Research Society 64, 1695–1724 (2013)

[61] Sener, O., Koltun, V.: Multi-task learning as multi-objective optimization.
Advances in neural information processing systems 31 (2018)

[62] Lin, X., Zhen, H.-L., Li, Z., Zhang, Q.-F., Kwong, S.: Pareto multi-task learning.
Advances in neural information processing systems 32 (2019)

[63] Choong, H.X., Ong, Y.-S., Gupta, A., Chen, C., Lim, R.: Jack and masters of
all trades: One-pass learning sets of model sets from large pre-trained models.
IEEE Computational Intelligence Magazine 18(3), 29–40 (2023)

[64] Klein, A., Golebiowski, J., Ma, X., Perrone, V., Archambeau, C.: Structural
pruning of large language models via neural architecture search. In: AutoML
Conference 2023 (Workshop) (2023)

[65] Du, G., Li, J., Liu, H., Jiang, R., Yu, S., Guo, Y., Goh, S.K., Tang, H.-
K.: Knowledge fusion by evolving weights of language models. arXiv preprint
arXiv:2406.12208 (2024)

[66] Singh, C., Morris, J.X., Aneja, J., Rush, A., Gao, J.: Explaining data patterns
in natural language with language models. In: Belinkov, Y., Hao, S., Jumelet, J.,
Kim, N., McCarthy, A., Mohebbi, H. (eds.) Proceedings of the 6th BlackboxNLP
Workshop: Analyzing and Interpreting Neural Networks for NLP, pp. 31–55.
Association for Computational Linguistics, Singapore (2023)
[67] Fernando, C., Banarse, D., Michalewski, H., Osindero, S., Rocktäschel, T.:
Promptbreeder: Self-referential self-improvement via prompt evolution. arXiv
preprint arXiv:2309.16797 (2023)
[68] Li, Y.B., Wu, K.: SPELL: Semantic prompt evolution based on a LLM. arXiv
preprint arXiv:2310.01260 (2023)
[69] Chen, A., Dohan, D., So, D.: EvoPrompting: Language models for code-level
neural architecture search. In: Thirty-seventh Conference on Neural Information
Processing Systems (2023)
[70] Zhang, Z., Wang, S., Yu, W., Xu, Y., Iter, D., Zeng, Q., Liu, Y., Zhu, C., Jiang,
M.: Auto-instruct: Automatic instruction generation and ranking for black-box
language models. In: Bouamor, H., Pino, J., Bali, K. (eds.) Findings of the Asso-
ciation for Computational Linguistics: EMNLP 2023, pp. 9850–9867. Association
for Computational Linguistics, Singapore (2023)
[71] Sun, T., He, Z., Qian, H., Zhou, Y., Huang, X.-J., Qiu, X.: BBTv2: Towards a
gradient-free future with large language models. In: Proceedings of the 2022
Conference on Empirical Methods in Natural Language Processing, pp. 3916–3930
(2022)
[72] Fei, Z., Fan, M., Huang, J.: Gradient-free textual inversion. In: Proceedings of
the 31st ACM International Conference on Multimedia. MM ’23, pp. 1364–1373.
Association for Computing Machinery, New York, NY, USA (2023)
[73] Shen, M., Ghosh, S., Sattigeri, P., Das, S., Bu, Y., Wornell, G.: Reliable
gradient-free and likelihood-free prompt tuning. In: Vlachos, A., Augenstein, I.
(eds.) Findings of the Association for Computational Linguistics: EACL 2023,
pp. 2416–2429. Association for Computational Linguistics, Dubrovnik, Croatia
(2023)
[74] Qi, S., Zhang, Y.: Prompt-calibrated tuning: Improving black-box optimization
for few-shot scenarios. In: 2023 4th International Seminar on Artificial Intelli-
gence, Networking and Information Technology (AINIT), pp. 402–407 (2023).
IEEE
[75] Sun, Q., Han, C., Chen, N., Zhu, R., Gong, J., Li, X., Gao, M.: Make prompt-
based black-box tuning colorful: Boosting model generalization from three
orthogonal perspectives. In: Calzolari, N., Kan, M.-Y., Hoste, V., Lenci, A.,
Sakti, S., Xue, N. (eds.) Proceedings of the 2024 Joint International Con-
ference on Computational Linguistics, Language Resources and Evaluation
(LREC-COLING 2024), pp. 10958–10969. ELRA and ICCL, Torino, Italia
(2024)
[76] Zheng, Y., Tan, Z., Li, P., Liu, Y.: Black-box prompt tuning with subspace
learning. IEEE/ACM Transactions on Audio, Speech, and Language Processing
32, 3002–3013 (2024)
[77] Han, C., Cui, L., Zhu, R., Wang, J., Chen, N., Sun, Q., Li, X., Gao, M.: When
gradient descent meets derivative-free optimization: A match made in black-box
scenario. In: Rogers, A., Boyd-Graber, J., Okazaki, N. (eds.) Findings of the
Association for Computational Linguistics: ACL 2023, pp. 868–880. Association
for Computational Linguistics, Toronto, Canada (2023)
[78] Sun, J., Xu, Z., Yin, H., Yang, D., Xu, D., Chen, Y., Roth, H.R.: FedBPT:
Efficient federated black-box prompt tuning for large language models. arXiv
preprint arXiv:2310.01467 (2023)
[79] Prasad, A., Hase, P., Zhou, X., Bansal, M.: GrIPS: Gradient-free, edit-based
instruction search for prompting large language models. In: Vlachos, A., Augen-
stein, I. (eds.) Proceedings of the 17th Conference of the European Chapter of
the Association for Computational Linguistics, pp. 3845–3864. Association for
Computational Linguistics, Dubrovnik, Croatia (2023)
[80] Zhou, H., Wan, X., Vulić, I., Korhonen, A.: Survival of the most influen-
tial prompts: Efficient black-box prompt search via clustering and pruning.
In: Bouamor, H., Pino, J., Bali, K. (eds.) Findings of the Association for
Computational Linguistics: EMNLP 2023, pp. 13064–13077. Association for
Computational Linguistics, Singapore (2023)
[81] Lapid, R., Langberg, R., Sipper, M.: Open sesame! Universal black box
jailbreaking of large language models. arXiv preprint arXiv:2309.01446 (2023)
[82] Yu, L., Chen, Q., Lin, J., He, L.: Black-box prompt tuning for vision-language
model as a service. In: Proceedings of the Thirty-Second International Joint
Conference on Artificial Intelligence, pp. 1686–1694 (2023)
[83] Fei, N., Lu, Z., Gao, Y., Yang, G., Huo, Y., Wen, J., Lu, H., Song, R., Gao,
X., Xiang, T., et al.: Towards artificial general intelligence via a multimodal
foundation model. Nature Communications 13(1), 3094 (2022)
[84] Liu, S., Chen, C., Qu, X., Tang, K., Ong, Y.-S.: Large language models as
evolutionary optimizers. arXiv preprint arXiv:2310.19046 (2023)
[85] Ma, Y.J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu,
Y., Fan, L., Anandkumar, A.: Eureka: Human-level reward design via coding
large language models. In: The Twelfth International Conference on Learning
Representations (2024)
[86] Nasir, M.U., Earle, S., Togelius, J., James, S., Cleghorn, C.: LLMatic: Neural
architecture search via large language models and quality-diversity optimization.
arXiv preprint arXiv:2306.01102 (2023)
[87] Zheng, M., Su, X., You, S., Wang, F., Qian, C., Xu, C., Albanie, S.: Can GPT-4
perform neural architecture search? arXiv preprint arXiv:2304.10970 (2023)
[88] Wang, H., Gao, Y., Zheng, X., Zhang, P., Chen, H., Bu, J.: Graph neural
architecture search with GPT-4. arXiv preprint arXiv:2310.01436 (2023)
[89] Zhang, M., Desai, N., Bae, J., Lorraine, J., Ba, J.: Using large language mod-
els for hyperparameter optimization. In: NeurIPS 2023 Foundation Models for
Decision Making Workshop (2023)
[90] Zhang, S., Gong, C., Wu, L., Liu, X., Zhou, M.: AutoML-GPT: Automatic machine
learning with GPT. arXiv preprint arXiv:2305.02499 (2023)
[91] Liu, F., Lin, X., Wang, Z., Yao, S., Tong, X., Yuan, M., Zhang, Q.: Large
language model for multi-objective evolutionary optimization. arXiv preprint
arXiv:2310.12541 (2023)
[92] Yang, C., Wang, X., Lu, Y., Liu, H., Le, Q.V., Zhou, D., Chen, X.: Large
language models as optimizers. In: The Twelfth International Conference on
Learning Representations (2024)
[93] Xiao, L., Chen, X.: Enhancing LLM with evolutionary fine tuning for news
summary generation. arXiv preprint arXiv:2307.02839 (2023)
[94] Bradley, H., Dai, A., Teufel, H., Zhang, J., Oostermeijer, K., Bellagente, M.,
Clune, J., Stanley, K., Schott, G., Lehman, J.: Quality-diversity through AI
feedback. arXiv preprint arXiv:2310.13032 (2023)
[95] Lanzi, P.L., Loiacono, D.: ChatGPT and other large language models as
evolutionary engines for online interactive collaborative game design. In:
Proceedings of the Genetic and Evolutionary Computation Conference. GECCO ’23,
pp. 1383–1390. Association for Computing Machinery, New York, NY, USA (2023)
[96] Sudhakaran, S., González-Duque, M., Freiberger, M., Glanois, C., Najarro,
E., Risi, S.: MarioGPT: Open-ended text2level generation through large lan-
guage models. In: Thirty-seventh Conference on Neural Information Processing
Systems (2023)
[97] Jablonka, K.M., Ai, Q., Al-Feghali, A., Badhwar, S., Bocarsly, J.D., Bran, A.M.,
Bringuier, S., Brinson, L.C., Choudhary, K., Circi, D., et al.: 14 examples of
how LLMs can transform materials science and chemistry: a reflection on a large
language model hackathon. Digital Discovery 2(5), 1233–1250 (2023)
[98] Liu, F., Tong, X., Yuan, M., Lin, X., Luo, F., Wang, Z., Lu, Z., Zhang, Q.: An
example of evolutionary computation + large language model beating human:
Design of efficient guided local search. arXiv preprint arXiv:2401.02051 (2024)
[99] Chen, Z., Cao, L., Madden, S., Fan, J., Tang, N., Gu, Z., Shang, Z., Liu, C.,
Cafarella, M., Kraska, T.: SEED: Simple, efficient, and effective data management
via large language models. arXiv preprint arXiv:2310.00749 (2023)
[100] Xia, C.S., Paltenghi, M., Tian, J.L., Pradel, M., Zhang, L.: Fuzz4All: Universal
fuzzing with large language models. In: 2024 IEEE/ACM 46th International
Conference on Software Engineering (ICSE) (2024)
[101] De Ath, G., Everson, R.M., Rahat, A.A.M., Fieldsend, J.E.: Greed is good:
Exploration and exploitation trade-offs in Bayesian optimisation. ACM Transactions
on Evolutionary Learning and Optimization 1(1) (2021)