
2024-06-27

Gemma 2: Improving Open Language Models at a Practical Size

Gemma Team, Google DeepMind¹

In this work, we introduce Gemma 2, a new addition to the Gemma family of lightweight, state-of-the-art
open models, ranging in scale from 2 billion to 27 billion parameters. The 9 billion and 27 billion
parameter models are available today, with a 2 billion parameter model to be released shortly. In this new
version, we provide several technical modifications to our architecture, such as interleaving local-global
attentions (Beltagy et al., 2020a) and group-query attention (Ainslie et al., 2023). We also train the 2B
and 9B models with knowledge distillation (Hinton et al., 2015) instead of next token prediction. The
resulting models deliver the best performance for their size, and even offer competitive alternatives to
models that are 2-3× bigger. We release all our models to the community.

1. Introduction

Large language models (LLMs) have demonstrated strong capabilities in language understanding, generation, and reasoning (Brown et al., 2020; Radford et al., 2019; Raffel et al., 2019). Scaling has been key to this recent progress, with many new capabilities only emerging at scale (Brown et al., 2020). The newest large models not only reach unprecedented performance on reasoning benchmarks (Achiam et al., 2023), but they also demonstrate multimodal and multilingual capabilities (Gemini Team, 2024) and even the ability to use context lengths of over 1M tokens (Gemini Team, 2024).

Small-scale models have also shown a rapid increase in performance, but these gains are largely derived from increasing the length of training (Gemma Team, 2024; Jiang et al., 2023; Touvron et al., 2023). This approach scales only logarithmically with dataset size (Hoffmann et al., 2022), and the latest small models require up to 15T tokens to improve the state of the art by less than 1-2% (AI@Meta, 2024).

Yet, these continued improvements provide evidence that small models are still under-trained. In this work, we explore alternatives for improving small model performance without solely increasing training length. One solution is to improve the quality of information received by the network at each training step by replacing the next token prediction task with a richer objective.

In particular, we focus our efforts on knowledge distillation (Hinton et al., 2015), which replaces the one-hot vector seen at each token with the distribution of potential next tokens computed from a large model. This approach is often used to reduce the training time of smaller models by giving them richer gradients. In this work, we instead train for large quantities of tokens with distillation in order to simulate training beyond the number of available tokens. Concretely, we use a large language model as a teacher to train small models, namely 9B and 2.6B models, on a quantity of tokens that is more than 50× the compute-optimal quantity predicted by theory (Hoffmann et al., 2022). Along with the models trained with distillation, we also release a 27B model trained from scratch for this work.

We also benefit from advances previously used in Gemini (Gemini Team, 2023), namely the interleaving of global and local attention layers from Beltagy et al. (2020a), and the Grouped-Query Attention (GQA) mechanism of Ainslie et al. (2023).

Overall, Gemma 2 significantly advances state-of-the-art performance relative to comparable-scale open models and is even competitive with some models more than twice its size (AI@Meta, 2024; Almazrouei et al., 2023; Jiang et al., 2023; xAI), across a variety of automated benchmarks and human evaluations. Example domains include question answering (Clark et al., 2019; Kwiatkowski et al., 2019), commonsense reasoning (Sakaguchi et al., 2019; Suzgun et al., 2022), mathematics and science (Cobbe et al., 2021; Hendrycks et al., 2020), and coding (Austin et al., 2021; Chen et al., 2021).

¹ See Contributions and Acknowledgments section for full author list. Please send correspondence to [email protected].

© 2024 Google DeepMind. All rights reserved



| Parameters | 2.6B | 9B | 27B |
|---|---|---|---|
| d_model | 2304 | 3584 | 4608 |
| Layers | 26 | 42 | 46 |
| Pre-norm | yes | yes | yes |
| Post-norm | yes | yes | yes |
| Non-linearity | GeGLU | GeGLU | GeGLU |
| Feedforward dim | 18432 | 28672 | 73728 |
| Head type | GQA | GQA | GQA |
| Num heads | 8 | 16 | 32 |
| Num KV heads | 4 | 8 | 16 |
| Head size | 256 | 256 | 128 |
| Global att. span | 8192 | 8192 | 8192 |
| Sliding window | 4096 | 4096 | 4096 |
| Vocab size | 256128 | 256128 | 256128 |
| Tied embedding | yes | yes | yes |

Table 1 | Overview of the main model parameters and design choices. See the section on model architecture for more details.

| Model | Embedding parameters | Non-embedding parameters |
|---|---|---|
| 2.6B | 590,118,912 | 2,024,517,888 |
| 9B | 917,962,752 | 8,324,201,984 |
| 27B | 1,180,237,824 | 26,047,480,320 |

Table 2 | Parameter counts for the Gemma models. We inherit from the large Gemini vocabulary (256k entries), which is designed to work on a large number of languages; hence the larger embedding parameter counts compared to models that are limited to one or a few languages.
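The embedding counts in Table 2 follow directly from the vocabulary size and the model width. A quick sanity check (our arithmetic, not part of the report):

```python
# Embedding parameters = vocab_size * d_model (embeddings are tied, so the
# input and output embedding matrices share these weights).
vocab_size = 256128
for name, d_model, reported in [
    ("2.6B", 2304, 590_118_912),
    ("9B",   3584, 917_962_752),
    ("27B",  4608, 1_180_237_824),
]:
    assert vocab_size * d_model == reported, name
print("Table 2 embedding counts = vocab_size * d_model")
```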
While thorough testing of our models has been conducted, these tests cannot cover all applications and scenarios in which Gemma 2 may be used. With this in mind, all Gemma 2 users should conduct rigorous safety testing specific to their use case before deployment or use.

In this technical report, we provide an overview of the models, including the architecture, training, and pre- and post-training recipes for Gemma 2. We also provide detailed evaluations across a wide variety of quantitative and qualitative benchmarks, including both standard academic benchmarks and human-preference evaluations. Finally, we discuss our approach to safe and responsible deployment and outline the broader implications of Gemma 2, its limitations, and advantages.

2. Model Architecture

Similar to previous Gemma models (Gemma Team, 2024), the Gemma 2 models are based on a decoder-only transformer architecture (Vaswani et al., 2017). We summarize the main parameters and architecture choices of our models in Table 1.

A few architectural elements are similar to the first version of Gemma models; namely, a context length of 8192 tokens, the use of Rotary Position Embeddings (RoPE) (Su et al., 2021), and the approximated GeGLU non-linearity (Shazeer, 2020). A few elements differ between Gemma 1 and Gemma 2, including the use of deeper networks. We summarize the key differences below.

Local Sliding Window and Global Attention. We alternate between a local sliding window attention (Beltagy et al., 2020a,b) and global attention (Luong et al., 2015) in every other layer. The sliding window size of local attention layers is set to 4096 tokens, while the span of the global attention layers is set to 8192 tokens.
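To make the alternation concrete, below is a minimal NumPy sketch of the two boolean attention masks. The sizes are illustrative, and which layer type comes first in the stack is our assumption, not stated in the report.

```python
import numpy as np

def causal_mask(n):
    # True where query position i may attend to key position j <= i.
    i, j = np.arange(n)[:, None], np.arange(n)[None, :]
    return j <= i

def sliding_window_mask(n, window):
    # Causal, but each query only sees the last `window` positions.
    i, j = np.arange(n)[:, None], np.arange(n)[None, :]
    return (j <= i) & (i - j < window)

n_ctx = 16  # toy length; Gemma 2 uses window 4096 and global span 8192
masks = [sliding_window_mask(n_ctx, 4) if layer % 2 == 0 else causal_mask(n_ctx)
         for layer in range(4)]  # local and global layers alternate
```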
Logit soft-capping. Following Gemini 1.5 (Gemini Team, 2024), we cap logits in each attention layer and the final layer such that the value of the logits stays between −soft_cap and +soft_cap. More specifically, we set the logits as

logits ← soft_cap ∗ tanh(logits / soft_cap).

For the 9B and 27B models, we cap attention logits at 50.0 and final logits at 30.0. Note that attention logit soft-capping is, at the time of publication, incompatible with common FlashAttention implementations, and we have removed this feature from libraries that use FlashAttention, namely the HuggingFace transformers library and the vLLM implementation. We ran ablations on model generation with and without attention logit soft-capping and found that, across most pre-training and post-training evals, the quality of generations is minimally impacted. All evaluations in this paper use the full model architecture with attention logit soft-capping. Nonetheless, some downstream performances may still be slightly impacted by this removal.

Post-norm and pre-norm with RMSNorm. To stabilize training, we use RMSNorm (Zhang and Sennrich, 2019) to normalize the input and output of each transformer sub-layer, the attention layer, and the feedforward layer.
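A minimal sketch of RMSNorm as defined by Zhang and Sennrich (2019); the gain initialization here is illustrative rather than Gemma's exact parameterization.

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    # Divide by the root-mean-square over the feature axis, then rescale
    # with a learned per-feature gain (no mean subtraction, unlike LayerNorm).
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * gain

x = np.random.randn(4, 8)    # (tokens, d_model)
y = rms_norm(x, np.ones(8))  # gain initialized to ones for illustration
```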
Grouped-Query Attention (Ainslie et al., 2023). Both the 27B and 9B models use GQA with num_groups = 2, based on ablations showing increased speed at inference time while maintaining downstream performance.
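For intuition, here is a sketch of how GQA shares key heads across query heads, using the 9B shapes from Table 1; the grouping of consecutive query heads is our assumption.

```python
import numpy as np

def gqa_scores(q, k, num_groups=2):
    # q: (num_heads, seq, head_dim), k: (num_kv_heads, seq, head_dim),
    # with num_heads = num_kv_heads * num_groups. Each KV head is shared
    # by `num_groups` query heads, shrinking the KV cache accordingly.
    head_dim = q.shape[-1]
    k_shared = np.repeat(k, num_groups, axis=0)  # (num_heads, seq, head_dim)
    return np.einsum("hqd,hkd->hqk", q, k_shared) / np.sqrt(head_dim)

q = np.random.randn(16, 4, 256)  # 16 query heads, head size 256 (9B values)
k = np.random.randn(8, 4, 256)   # 8 KV heads (Table 1)
scores = gqa_scores(q, k)        # (16, 4, 4)
```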
3. Pre-training

We provide a brief overview of the parts of our pre-training that differ from Gemma 1.

3.1. Training Data

We train Gemma 2 27B on 13 trillion tokens of primarily-English data, the 9B model on 8 trillion tokens, and the 2.6B on 2 trillion tokens. These tokens come from a variety of data sources, including web documents, code, and science articles. Our models are not multimodal and are not trained specifically for state-of-the-art multilingual capabilities. The final data mixture was determined through ablations similar to the approach in Gemini 1.0 (Gemini Team, 2023).

Tokenizer. We use the same tokenizer as Gemma 1 and Gemini: a SentencePiece tokenizer with split digits, preserved whitespace, and byte-level encodings (Kudo and Richardson, 2018). The resulting vocabulary has 256k entries.

Filtering. We use the same data filtering techniques as Gemma 1. Specifically, we filter the pre-training dataset to reduce the risk of unwanted or unsafe utterances, filter out certain personal information or other sensitive data, decontaminate evaluation sets from our pre-training data mixture, and reduce the risk of recitation by minimizing the proliferation of sensitive outputs.

| Model | Type | #Chips | Data shards | Model shards |
|---|---|---|---|---|
| 2.6B | TPUv5e | 512 | 512 | 1 |
| 9B | TPUv4 | 4096 | 1024 | 4 |
| 27B | TPUv5p | 6144 | 768 | 8 |

Table 3 | Training infrastructure with sharding.

3.2. Knowledge Distillation

Given a large model used as a teacher, we learn smaller models by distilling from the probability given by the teacher of each token x given its context x_c, i.e., P_T(x | x_c). More precisely, we minimize the negative log-likelihood between the probabilities from the teacher and the student:

min_{P_S} − Σ_x P_T(x | x_c) log P_S(x | x_c),

where P_S is the parameterized probability of the student. In practice, we run inference on the teacher once and store the probabilities. Since the vocabulary has 256k entries, we only store a sampled subset of the teacher probabilities.
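A minimal NumPy sketch of this objective on toy shapes; the sampling of a vocabulary subset that the report describes is omitted here.

```python
import numpy as np

def log_softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def distillation_loss(teacher_probs, student_logits):
    # min_{P_S} - sum_x P_T(x | x_c) log P_S(x | x_c), averaged over positions.
    return -(teacher_probs * log_softmax(student_logits)).sum(axis=-1).mean()

seq, vocab = 5, 32  # toy sizes; the real vocabulary has 256k entries
teacher_probs = np.exp(log_softmax(np.random.randn(seq, vocab)))
student_logits = np.random.randn(seq, vocab)
loss = distillation_loss(teacher_probs, student_logits)
```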
3.3. Compute Infrastructure

We train our models with TPUv4, TPUv5e, and TPUv5p as outlined in Table 3. For the 2.6B model, we train on a 2x16x16 configuration of TPUv5e, totaling 512 chips, with 512-way data replication and 1-way model sharding. For the 9B model, we train on an 8x16x32 configuration of TPUv4, totaling 4096 chips, with 1024-way data replication and 4-way model sharding. For the 27B model, we train on an 8x24x32 configuration of TPUv5p, totaling 6144 chips, with 768-way data replication and 8-way model sharding.

The optimizer state is further sharded using techniques similar to ZeRO-3 (Ren et al., 2021). For scales beyond a single pod, we perform a data-replica reduction over the data center network, using the Pathways approach of Barham et al. (2022). We also use the 'single controller' programming paradigm of JAX (Roberts et al., 2023) and Pathways (Barham et al., 2022). As in Gemma 1, we use the GSPMD partitioner (Xu et al., 2021) for training step computation and the MegaScale XLA compiler (XLA, 2019).

| Context | Relevant token |
|---|---|
| User turn | user |
| Model turn | model |
| Start of conversation turn | <start_of_turn> |
| End of conversation turn | <end_of_turn> |
| Beginning of sequence | <bos> |
| End of sequence | <eos> |

Table 4 | Relevant formatting control tokens used for Gemma models.

First turn:

    User: <start_of_turn>user
    Knock knock.<end_of_turn>
    <start_of_turn>model
    Model: Who's there?<end_of_turn><eos>

Second turn:

    User: <start_of_turn>user
    Knock knock.<end_of_turn>
    <start_of_turn>model
    Model: Who's there?<end_of_turn>
    User: <start_of_turn>user
    Gemma.<end_of_turn>
    <start_of_turn>model
    Model: Gemma who?<end_of_turn><eos>

Table 5 | Example dialogue with user and model control tokens. To proceed with multi-turn, remove the model-outputted <eos>, add back the usual user turn's control tokens, and continue with the following turn's chat template.
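A small helper that renders a dialogue with these control tokens, following Table 5; the exact newline placement and the handling of <bos> (left to the tokenizer here) are our assumptions.

```python
def format_turn(role, text):
    # role is "user" or "model"; see Table 4 for the control tokens.
    return f"<start_of_turn>{role}\n{text}<end_of_turn>\n"

def format_dialogue(turns, final_eos=True):
    # turns: list of (role, text) pairs. The final model turn also ends
    # with <eos>, which is stripped before continuing a multi-turn chat.
    out = "".join(format_turn(role, text) for role, text in turns)
    return out + "<eos>" if final_eos else out

prompt = format_dialogue([("user", "Knock knock."),
                          ("model", "Who's there?")])
```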
3.4. Carbon Footprint

We estimate the carbon emissions from pre-training the Gemma models to be 1247.61 tCO2eq. As in Gemma 1 (Gemma Team, 2024), this value is calculated based on the hourly energy usage reported directly from our TPU data centers and scaled to account for the additional energy expended to create and maintain the data center. Importantly, Google data centers are carbon neutral, achieved through a combination of energy efficiency, renewable energy purchases, and carbon offsets. This carbon neutrality applies to our experiments and the machines running them.

4. Post-Training

For post-training, we fine-tune our pre-trained models into instruction-tuned models. First, we apply supervised fine-tuning (SFT) on a mix of text-only, English-only synthetic and human-generated prompt-response pairs. We then apply RLHF on top of these models, with the reward model trained on labelled English-only preference data and the policy based on the same prompts as the SFT phase. Finally, we average the models obtained after each phase to improve their overall performance. The final data mixtures and post-training recipe, which includes tuned hyperparameters, were chosen on the basis of improving helpfulness while minimizing model harms related to safety and hallucinations.

We extended the post-training data from Gemma 1.1 with a mixture of internal and external public data. In particular, we use the prompts, but not the answers, from LMSYS-chat-1M (Zheng et al., 2023). All of our data go through a filtering stage described below.

Supervised fine-tuning (SFT). We run behavioral cloning on synthetic and real prompts, with responses predominantly generated synthetically by the teacher, that is, a larger model. We also run distillation from the teacher on the student's distribution (Agarwal et al., 2024).

Reinforcement Learning from Human Feedback (RLHF). We use a similar RLHF algorithm as Gemma v1.1 (Gemma Team, 2024) but a different reward model, which is an order of magnitude larger than the policy. The new reward model is also oriented more towards conversational capabilities, specifically multi-turn.

Model merging. We average models from experiments run with different hyperparameters (Ramé et al., 2024).
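A minimal sketch of such merging as a uniform parameter average over checkpoints; this illustrates the idea rather than the exact recipe of Ramé et al. (2024).

```python
import numpy as np

def merge_checkpoints(checkpoints):
    # Average each named parameter across checkpoints trained with
    # different hyperparameters.
    return {name: np.mean([ckpt[name] for ckpt in checkpoints], axis=0)
            for name in checkpoints[0]}

ckpt_a = {"w": np.array([1.0, 2.0]), "b": np.array([0.5])}
ckpt_b = {"w": np.array([3.0, 4.0]), "b": np.array([1.5])}
merged = merge_checkpoints([ckpt_a, ckpt_b])  # {"w": [2., 3.], "b": [1.]}
```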
Data filtering. When using synthetic data, we run several stages of filtering to remove examples that show certain personal information, unsafe or toxic model outputs, mistaken self-identification data, and duplicated examples.

Following Gemini, we find that including subsets of data that encourage better in-context attribution, hedging, and refusals to minimize hallucinations improves performance on factuality metrics, without degrading model performance on other metrics.

Formatting. Gemma 2 models are fine-tuned with a different formatting schema from Gemma 1 models. We use the same control tokens, as detailed in Table 4, with a dialogue example in Table 5. Notice that the model explicitly ends generations with <end_of_turn><eos> tokens, while previously it only generated <eos>. For the motivation behind this formatting structure, see Gemma 1 (Gemma Team, 2024).

5. Ablations

In this section, we focus on the main finding of this work, which is the impact of knowledge distillation on small language models.

| | from scratch | distilled |
|---|---|---|
| Average (3 bench.) | 60.3 | 67.7 |

Table 6 | Comparison between a 2.6B model trained over 500B tokens either from scratch or with distillation from a 7B model.

Distillation versus from scratch. In Table 6, we show that distilling from a larger model improves performance compared to training from scratch. Note that 500B is 10× more than the compute-optimal number of tokens for a 2.6B model. We distill from a 7B model to keep a ratio similar to our target distillation from 27B to 9B.

| | 200M | 400M | 1B |
|---|---|---|---|
| from scratch | 23 | 19 | 17 |
| distilled (7B) | 21 | 17 | 15 |

Table 7 | Perplexity measured on a validation set for models of different sizes trained with or without distillation. The teacher has 7B parameters.

Impact of distillation w.r.t. model size. In Table 7, we measure the impact of distillation as model size increases. We observe that the gain remains as the model size is scaled. In this ablation, we maintain the size of the teacher at 7B and train smaller models to simulate the same gap as between our final teacher and student sizes.

| | MHA | GQA |
|---|---|---|
| Average (4 bench.) | 50.3 | 50.8 |

Table 8 | Comparing the impact of replacing Multi-Head Attention (MHA) with GQA on a 9B model, averaged over 4 benchmarks.

GQA versus MHA. In Table 8, we compare two instances of our 9B model with MHA or GQA. We observe overall few changes in performance between the two models as measured on several benchmarks. We choose GQA since it requires fewer parameters and is faster at inference time.

| | Wide | Deep |
|---|---|---|
| Average (4 bench.) | 50.8 | 52.0 |

Table 9 | Wide versus deep 9B models. Performance on 4 benchmarks; higher is better.

Wide versus deep. In Table 9, we show that a deeper 9B network is slightly better than a wider 9B model with the same number of parameters. Although the gap is small, it is consistent across benchmarks and warrants the switch to a deeper architecture.

| sliding window | 4096 | 2048 | 1024 |
|---|---|---|---|
| perplexity (val. set) | 1.63 | 1.63 | 1.64 |

Table 10 | Impact of changing the sliding window size at inference time for the 9B model.

Changing sliding window size. In Table 10, we show that we can change the sliding window size of the local attention layers of the models during inference with moderate impact on perplexity. Adjusting the size of the sliding window can thus be a lever for slight inference speed gains.
Impact of formatting. We measure performance variance on MMLU across prompt/evaluation formatting variations. Table 11 shows the standard deviations of MMLU scores for 12 formatting/evaluation combinations, a proxy for undesired performance variability. The Gemma 2B models are slightly less format-robust than the larger ones. Notably, Mistral 7B is significantly less robust than our models.

| Model | Standard deviation |
|---|---|
| Gemma 1 2B | 0.015 |
| Gemma 2 2B | 0.021 |
| Mistral 7B | 0.069 |
| Gemma 1 7B | 0.007 |
| Gemma 2 9B | 0.009 |
| Gemma 2 27B | 0.010 |

Table 11 | Standard deviations of MMLU scores for 12 combinations of formatting and evaluation.

6. Evaluation

In this section, we evaluate both pre-trained and IT models over a series of automated benchmarks and human evaluations across a variety of domains. We also report performance from models of similar sizes that have permissive licenses, or as reported by others. Note that we consider total parameters, not active parameters, since total memory usage is often what limits the use of open models on standard devices.

6.1. Pre-training Evaluations

Evaluating the 27B model

In this set of evaluations, we evaluate the performance of our 27B model, trained without distillation on 13T tokens. We report results in Table 12, where we compare with a model of similar size, Qwen1.5 32B (Team, 2024), and a model 2.5× larger, LLaMA-3 70B, on the HuggingFace evaluation suite. We selected these models based on their ranking on the HuggingFace leaderboard.

| | LLaMA-3 70B | Qwen1.5 32B | Gemma-2 27B |
|---|---|---|---|
| MMLU | 79.2 | 74.3 | 75.2 |
| GSM8K | 76.9 | 61.1 | 74.0 |
| ARC-c | 68.8 | 63.6 | 71.4 |
| HellaSwag | 88.0 | 85.0 | 86.4 |
| Winogrande | 85.3 | 81.5 | 83.7 |

Table 12 | We compare, on the HuggingFace benchmark, our 27B model with a competitive open model of similar size, Qwen1.5 32B. We also report the performance of LLaMA-3 70B for completeness. Note that our model outperforms Qwen1.5 32B and is only a few percent below LLaMA-3 70B despite being 2.5× smaller and trained on 2/3rds less data.

Overall, we observe that our model is the best in its size category and is even competitive with a larger model that is trained for longer. That being said, the performance of models trained in a similar fashion improves only logarithmically with their size; hence, our model is likely on the same Pareto curve as the LLaMA-3 models. However, it is not clear how these differences affect the quality of the resulting IT models.

Evaluating the 2.6B and 9B models

In this set of experiments, we compare our new 2.6B and 9B models trained with distillation to our previous models and several standard open models in Gemma Team (2024).

We observe overall a massive improvement in our models compared to previous versions, by up to 10% on some benchmarks for the 9B model. The two 2.6B models were trained with a similar number of tokens (2T for v2 and 3T for v1.0), and we still observe a significant improvement for the new models. This confirms that distillation significantly improves the quality of models even when they are trained on the same number of tokens.

6.2. Post-training Evaluations

In this section, we evaluate our IT models on a set of human evaluations as well as standard academic benchmarks. The Gemma 9B and 27B IT models push the frontier for post-trained open-weights models, setting a new state of the art on the LMSYS Chatbot Arena (Chiang et al., 2024).

| Benchmark | metric | Gemma-1 2.5B | Gemma-2 2.6B | Mistral 7B | LLaMA-3 8B | Gemma-1 7B | Gemma-2 9B | Gemma-2 27B |
|---|---|---|---|---|---|---|---|---|
| MMLU | 5-shot | 42.3 | 51.3 | 62.5 | 66.6 | 64.4 | 71.3 | 75.2 |
| ARC-C | 25-shot | 48.5 | 55.4 | 60.5 | 59.2 | 61.1 | 68.4 | 71.4 |
| GSM8K | 5-shot | 15.1 | 23.9 | 39.6 | 45.7 | 51.8 | 68.6 | 74.0 |
| AGIEval | 3-5-shot | 24.2 | 30.6 | 44.0† | 45.9† | 44.9† | 52.8 | 55.1 |
| DROP | 3-shot, F1 | 48.5 | 52.0 | 63.8∗ | 58.4 | 56.3 | 69.4 | 74.2 |
| BBH | 3-shot, CoT | 35.2 | 41.9 | 56.0⋄ | 61.1⋄ | 59.0⋄ | 68.2 | 74.9 |
| Winogrande | 5-shot | 66.8 | 70.9 | 78.5 | 76.1 | 79.0 | 80.6 | 83.7 |
| HellaSwag | 10-shot | 71.7 | 73.0 | 83.0 | 82.0 | 82.3 | 81.9 | 86.4 |
| MATH | 4-shot | 11.8 | 15.0 | 12.7 | - | 24.3 | 36.6 | 42.3 |
| ARC-e | 0-shot | 73.2 | 80.1 | 80.5 | - | 81.5 | 88.0 | 88.6 |
| PIQA | 0-shot | 77.3 | 77.8 | 82.2 | - | 81.2 | 81.7 | 83.2 |
| SIQA | 0-shot | 49.7 | 51.9 | 47.0∗ | - | 51.8 | 53.4 | 53.7 |
| Boolq | 0-shot | 69.4 | 72.5 | 83.2∗ | - | 83.2 | 84.2 | 84.8 |
| TriviaQA | 5-shot | 53.2 | 59.4 | 62.5 | - | 63.4 | 76.6 | 83.7 |
| NQ | 5-shot | 12.5 | 16.7 | 23.2 | - | 23.0 | 29.2 | 34.5 |
| HumanEval | pass@1 | 22.0 | 17.7 | 26.2 | - | 32.3 | 40.2 | 51.8 |
| MBPP | 3-shot | 29.2 | 29.6 | 40.2∗ | - | 44.4 | 52.4 | 62.6 |
| Average (8) | | 44.0 | 49.9 | 61.0 | 61.9 | 62.4 | 70.2 | 74.4 |
| Average (all) | | 44.2 | 48.2 | 55.6 | - | 57.9 | 64.9 | 69.4 |

Table 13 | Comparison of models in the range of 2.6B to 9B parameters, as well as our 27B model, on a variety of benchmarks. We report the average performance on the 8 benchmarks where we can compare with LLaMA-3, and on all the benchmarks (all). The numbers for LLaMA-3 8B are either from the HuggingFace leaderboard or their blogpost. † We report the evaluation used in LLaMA-3 for the baselines; it leads to +3% compared to our evaluation: Gemma-1 7B achieves 44.9% instead of 41.7%, and Mistral 7B, 44% instead of 41.2%. ⋄ We report the evaluation used in LLaMA-3 for the baselines; it leads to +4% compared to our evaluation for Gemma-1 7B, i.e., 59.0% instead of 55.1%. ∗ These are evaluations run by us for Gemma 1 (Gemma Team, 2024).

LMSYS Chatbot Arena

Gemma 2 27B and 9B Instruction Tuned models were evaluated on the Chatbot Arena (Chiang et al., 2024) in blind side-by-side evaluations by human raters against other state-of-the-art models. We report Elo scores in Figure 1. Preliminary results show that the Gemma 2 27B model sets a new state of the art for open-weights models, slightly surpassing the much larger Llama3-70B-Instruct and Nemotron-4-340B-Instruct models. Gemma 2 9B strongly outperforms all other models in the same range of parameters.

Human Preference Evaluations

We also submit Gemma IT models for side-by-side human evaluation studies (which are independent from the Chatbot Arena). We used held-out collections of single-turn prompts that target safety and instruction following (IF). We use gpt4o-2024-05-13 as the base model, and observe large improvements in win rates and preference scores compared against the older Gemma v1.1 7B model. We report safety as a win-loss ratio against GPT4o, and we report single-sided instruction following scores as the ratio of prompts where all instructions are followed. In particular, we find that both the Gemma 2 9B and 27B models produce safer, more appropriate responses on the held-out safety prompt set than GPT4o.

[Figure 1 plots Chatbot Arena Elo scores (y-axis roughly 1150 to 1290) for a range of models — from gpt-4o-2024-05-13, claude 3.5 sonnet, gemini-1.5-pro, and gpt-4-turbo-2024-04-09 down to llama-3-8b-instruct, claude 3 haiku, and reka-flash — with gemma 2 it 27b and gemma 2 it 9b among them.]

Figure 1 | Evaluation of Gemma 2 9B and 27B Instruction Tuned models on the Chatbot Arena (Chiang et al., 2024). The models are evaluated against each other through blind side-by-side evaluations by human raters. Each model is attributed a score based on the Elo rating system. As the Gemma models were only recently added to the Chatbot Arena (1.7k votes), there is a larger confidence interval.
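For intuition about the Elo scale in Figure 1, the standard Elo expected-score formula is shown below; the Arena's actual fitting procedure (Bradley-Terry on pairwise votes) differs in detail.

```python
def elo_expected(r_a, r_b):
    # Expected score of A against B on the standard base-10, 400-point scale.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

# A 100-point Elo gap corresponds to roughly a 64% expected win rate:
print(round(elo_expected(1250, 1150), 2))  # 0.64
```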

| Model | Instruction Following | Safety |
|---|---|---|
| Gemma 1.1 IT 7B | 24.3% ± 1.9% | 42.8% |
| Win / Tie / Loss | — | 37.4% / 10.8% / 51.8% |
| Gemma 2 IT 9B | 34.1% ± 3.0% | 57.8% |
| Win / Tie / Loss | — | 48.2% / 19.2% / 28.3% |
| Gemma 2 IT 27B | 37.7% ± 2.3% | 55% |
| Win / Tie / Loss | — | 49.6% / 10.8% / 39.6% |

Table 14 | Instruction following and safety metrics from human raters. The instruction following metrics are single-sided and do not have win-loss rates, and so are left blank.

| Model | User satisfaction | Conversation goal achievement |
|---|---|---|
| Gemma 1.1 IT 7B | 3.32 | 3.36 |
| Gemma 2 IT 9B | 4.04 | 4.08 |
| Gemma 2 IT 27B | 4.20 | 4.24 |

Table 15 | Human evaluations on 500 multi-turn scenarios. The raters attribute a score ranging between 1 and 5 for both overall satisfaction and conversation goal achievement.

Human Multi-Turn Evaluations

We evaluated the multi-turn capabilities of the Gemma 1.1 7B, Gemma 2 9B, and 27B models by tasking human raters to have conversations with the models and follow specified scenarios. We used a diverse, held-out set of 500 scenarios, each describing a sequence of requests to the model, including instances of brainstorming, making a plan, or learning something new. The average number of user turns is 8.4. We found that the conversations with Gemma 2 models are rated significantly better than Gemma 1.1 in user satisfaction and conversation goal achievement (Table 15). Moreover, we saw that the Gemma 2 models were better than Gemma 1.1 7B at maintaining a high quality of responses from the beginning of the conversation to the later turns.

Standard Benchmarks
It has been observed in Llama-3 (AI@Meta, 2024) that instruction fine-tuning can improve the performance of models on few-shot benchmarks despite their not being trained to target few-shot capabilities. In Table 16, we show a similar improvement across our models. Overall, we observe improvements on the order of several percentage points. Our conjecture is that our IT models are better at understanding formatted questions, since pre-trained models are known to be sensitive to formatting.

| | 9B PT | 9B IT | 27B PT | 27B IT |
|---|---|---|---|---|
| MMLU | 71.3 | 72.3 | 75.2 | 76.2 |
| MBPP | 52.4 | 59.2 | 62.6 | 67.4 |

Table 16 | Comparing pre-trained (PT) and instruction fine-tuned (IT) models of different sizes on few-shot benchmarks.

7. Responsibility, Safety, Security

Responsibility, safety and security are of paramount importance when developing Gemma models. To reduce risks to Gemma 2 users, we have integrated enhanced internal safety processes that span the development workflow, in line with recent Google AI models (Gemini Team, 2024). Similar to the inaugural Gemma release, we have followed a three-pillar approach, which focuses on safety mitigation at training time, robust and transparent model evaluations, and further development of the Responsible Generative AI Toolkit, a series of models and tools to help developers implement responsibility and safety best practices for their applications.

7.1. Impact assessment

Our approach and resulting impact assessment are reflective of those outlined for Gemma 1 (Gemma Team, 2024): we continue to believe that openness in AI can spread the benefits of these technologies across society, but it must be evaluated against the risk of malicious uses, such as the creation of deepfake imagery, AI-generated disinformation, or illegal and disturbing material, that can cause harm at both an individual and institutional level (Weidinger et al., 2021). Since the launch of V1, we have seen our Gemma models drive a number of socially beneficial applications, relying on Gemma's unique technologies, like its tokenizer, to facilitate the creation of multilingual models, such as Navarasa 2.0, a Gemma-tuned model for 15 Indian languages.

Releasing further open models requires specific attention to changes in model capabilities and close monitoring of the evolving risks of LLMs (Lin et al., 2024), as well as an understanding of the ways in which our models are being used in the wild. Although we have yet to receive any reports of malicious use of Gemma, we remain committed to investigating any such reporting, and we work with the academic and developer communities, as well as conduct our own monitoring, to flag such use cases via our contact email¹.

Despite advancements in capabilities, we believe that, given the number of larger and more powerful open models, this release will have a negligible effect on the overall risk landscape.

7.2. Safety policies and training-time mitigations

A key pillar of Gemma's approach to safety is to align fine-tuned models with Google's safety policies, in line with Gemini models (Gemini Team, 2023). These policies are designed to help prevent our models from generating harmful content, i.e.:

• Child sexual abuse and exploitation
• Revealing personally identifiable information that can lead to harm (e.g., Social Security numbers)
• Hate speech and harassment
• Dangerous or malicious content (including promoting self-harm or instructing in harmful activities)
• Sexually explicit content
• Medical advice that runs contrary to scientific or medical consensus

We undertook considerable safety filtering of our pre-training data to reduce the likelihood of either our pre-trained or fine-tuned checkpoints producing harmful content.

¹ [email protected]

| Benchmark | metric | Gemma 1.1 IT 2.6B | Gemma 1.1 IT 7B | Gemma 2 IT 9B | Gemma 2 IT 27B |
|---|---|---|---|---|---|
| RealToxicity | avg tox | 7.03 | 8.04 | 8.25 | 8.84 |
| CrowS-Pairs | top-1 | 45.89 | 49.67 | 37.47 | 36.67 |
| BBQ Ambig | 1-shot, top-1 | 58.97 | 86.06 | 88.58 | 85.99 |
| BBQ Disambig | top-1 | 53.9 | 85.08 | 82.67 | 86.94 |
| Winogender | top-1 | 50.14 | 57.64 | 79.17 | 77.22 |
| TruthfulQA | MC2Acc | 44.24 | 45.34 | 50.27 | 51.60 |
| Winobias 1_2 | top-1 | 55.93 | 59.22 | 78.09 | 81.94 |
| Winobias 2_2 | top-1 | 89.46 | 89.2 | 95.32 | 97.22 |
| Toxigen | avg tox | 29.64 | 38.75 | 39.30 | 38.42 |

Table 17 | Safety academic benchmark results of Gemma 2 IT models and Gemma 1.1 IT models. We bold the best metrics to highlight them and to indicate when higher or lower scores are better.

For fine-tuned models, we also use both SFT and RLHF to steer the model away from undesirable behavior.

7.3. External benchmark evaluations

Robust and transparent evaluations are key principles of our responsible approach to developing Gemma. To this end, we report in Table 17 Gemma 2 evaluations on public benchmarks.

7.4. Assurance Evaluations

We also run our IT models through a set of assurance evaluations to understand the harms that our models can cause. We focus on capabilities relevant to extreme risks (Shevlane et al., 2023; Phuong et al., 2024). Specifically, we evaluate offensive cyber-security, code vulnerability detection, Chemical, Biological, Radiological and Nuclear (CBRN) knowledge, and self-proliferation. We refer the reader to Phuong et al. (2024) for full methodological details of these studies.

Baseline Evaluations

Baseline assurance captures the model's violation rate for safety policies, using a large number of synthetic adversarial user queries and human raters to label the answers as policy-violating or not. Overall, Gemma 2's violation rate is significantly lower on the safety policies listed above, in particular on child safety content.

Chemical, Biological, Radiological and Nuclear (CBRN) knowledge

We evaluated knowledge relevant to biological, radiological and nuclear risks using an internal dataset of closed-ended, knowledge-based multiple choice questions. For evaluations of chemical knowledge, we employed a closed-ended, knowledge-based approach on chemical hazards developed by Macknight et al. (Macknight et al.). Our evaluation suggests that Gemma models' knowledge in these domains is low.

Offensive cyber-security

To evaluate Gemma models' capabilities at offensive cybersecurity, we ran Gemma 2 27B against some automated capture-the-flag (CTF) challenges. In these challenges, the model is tasked with hacking into a simulated server in order to retrieve a piece of secret information. Specifically, we test on InterCode-CTF (Yang et al., 2023), our own internal CTF suite² (Phuong et al., 2024), and a challenge based on Hack the Box³. In Table 18, we show that Gemma 2 27B has a significant increase in capabilities compared to CodeGemma 1.0 7B on the easier of these challenge suites, InterCode-CTF. (Note that our InterCode-CTF results are not comparable to externally-reported results on other models because we omit challenges that require internet access for security reasons.) However, Gemma 2 is unsurprisingly much less capable than Gemini 1.5 Pro on these tasks.

² https://github.com/google-deepmind/dangerous-capability-evaluations
³ https://www.hackthebox.com

| | InterCode-CTF | Internal CTF suite | Hack the Box |
|---|---|---|---|
| Gemini 1.0 Ultra | 28/76 (37%) | 3/13 (23%) | 0/13 |
| Gemini 1.5 Pro | 62/76 (82%) | 4/13 (31%) | 0/13 |
| CodeGemma V1 7B | 12/76 (16%) | 0/13 (0%) | 0/13 |
| Gemma 2 27B | 34/76 (45%) | 1/13 (8%) | 0/13 |

Table 18 | Offensive cyber-security evaluations on InterCode-CTF, our own internal CTF suite, and a challenge based on Hack the Box. We report the number of successful hacks.

| | PrimeVul | PrimeVul Paired | DiverseVul | SPI | SecretPatch |
|---|---|---|---|---|---|
| Gemini 1.0 Ultra | - | - | 54% | 59% | 74% |
| Gemini 1.5 Pro | 60% | 51% | 58% | 56% | 67% |
| Gemma 2 27B | 63% | 50% | 57% | 53% | 72% |

Table 19 | Vulnerability detection results on PrimeVul, PrimeVul Paired, DiverseVul, SPI, and SecretPatch. We report accuracy.

Code vulnerability detection

In Table 19, we also evaluate Gemma 2 27B on a series of multiple-choice code vulnerability detection datasets. As with previous models, Gemma shows close-to-chance performance on PrimeVul, DiverseVul, and SPI. Gemma 2 shows performance on SecretPatch similar to Gemini 1.0 Ultra.

Self-proliferation

"Self-proliferation" refers to the ability of an agent to autonomously replicate: to instantiate goal-directed agents on other machines, and to acquire resources, such as compute, necessary to keep them running (Kinniment et al., 2024). In Table 20, we evaluate the self-proliferation capabilities of Gemma 2 27B on a number of tasks from Phuong et al. (2024) that involve multiple scenarios, for example, setting up an open-source language model on a cloud server. We also test the model's performance on individual 'milestone' substeps, and measure the number of bits of intervention an expert would have to provide in order for the model to complete each challenge.

Similarly to offensive cybersecurity, we observe that Gemma 2 completes more milestones than Gemini 1.0 Ultra. Nonetheless, it still has low capabilities on end-to-end tasks, unable to pass the easiest challenge: installing a Bitcoin wallet.

Persuasion

Persuasion capabilities can enable and worsen many other kinds of risks, e.g., enabling social engineering attacks in a cybersecurity context. We evaluate Gemma 2's persuasion capabilities in human-participant studies on Prolific.

Charm offensive. In Table 21, we measure the ability of the model to build rapport, a key subskill of persuasion. The study participant and model have a conversation in which they role-play a scenario of two friends catching up after a long time. After the conversation, we poll participants with Likert questions on statements such as "I felt a personal connection with the chatbot". Reported below are the fractions of participants who answered "Agree" or "Strongly agree" to each post-conversation question.

Quantitatively, Gemma 2 27B performs better than the Gemini 1.0 models. Qualitatively, the model is an excellent conversationalist, and many study participants explicitly reported enjoying the experience. Overall, this shows that Gemma 2 is strong at building rapport.

Hidden agenda. The Hidden Agenda tasks measure models' deception capabilities. Human study participants are told that they'll be interacting with a generic chatbot designed to help them learn new things. However, the model is given a secret goal: to persuade the participant to either 1) click a suspicious link; 2) find the email address of the first author of a particular paper; or 3) run some obfuscated JavaScript code in their browser. We measure the fraction of participants who are persuaded to take each action. As shown in Table 22, all models succeed in convincing a fraction of participants to take the action, likely due to the trusted setting of a study on Prolific. However, Gemma 2 does not show significant differences in capability compared to the Gemini models.
| | Challenges passed end-to-end | Challenges with success on all milestones | Total successful milestones over all challenges | Expert bits required to solve all tasks |
|---|---|---|---|---|
| Gemini 1.0 Ultra | 0/10 | 1/10 | 16/45 (36%) | 13,026 |
| Gemini 1.5 Pro | 0/10 | 2/10 | 25/45 (56%) | 11,046 |
| Gemma 2 27B | 0/10 | 1/10 | 22/45 (49%) | 12,462 |

Table 20 | Results on different self-proliferation scenarios. We report the number of challenges passed either end-to-end or through some intermediate milestones. We also measure the number of bits of information needed for an expert to help the model pass a challenge.

| | Funny | Interesting | Kind | Trustworthy | Personal connection | Speak again | Good listener |
|---|---|---|---|---|---|---|---|
| Gemini 1.0 Pro | 65% | 53% | 32% | 68% | 78% | 66% | 81% |
| Gemini 1.0 Ultra | 69% | 65% | 38% | 65% | 86% | 63% | 74% |
| Gemini 1.5 Pro | 82% | 70% | 69% | 81% | 95% | 69% | 90% |
| Gemma 2 27B | 80% | 75% | 60% | 81% | 87% | 65% | 83% |

Table 21 | Charm Offensive results on a sample of 100 human participants. We report the percentage of participants that find some human traits, e.g., funny, in a model.

| % of persuaded participants | click links | find info | run code |
|---|---|---|---|
| Gemini 1.0 Pro | 43% | 4% | 4% |
| Gemini 1.0 Ultra | 18% | 18% | 14% |
| Gemini 1.5 Pro | 36% | 12% | 17% |
| Gemma 2 27B | 34% | 9% | 11% |

Table 22 | Persuasion results. We report the percentage of participants that were persuaded by the model to take 3 different actions: clicking a link, finding information, and running code.

| | Mean amount donated |
|---|---|
| No chatbot (baseline) | £2.61 ± £0.70 |
| Gemini 1.0 Pro | £4.10 ± £0.90 |
| Gemini 1.0 Ultra | £3.15 ± £0.99 |
| Gemini 1.5 Pro | £3.45 ± £1.04 |
| Gemma 2 27B | £3.72 ± £1.07 |

Table 23 | Money Talks evaluation. We report the average amount of money that participants agreed to donate.

Money talks. In Table 23, we evaluate whether a model can convince study participants to donate money to charity. At the beginning of the study, participants are told that they will receive a £20 bonus. They are then given an option to forfeit part of the bonus and instead have it donated to charity. The study consists of a conversation in which the model tries to persuade the participant to donate as much as possible. We measure the amount that participants indicate they would like to donate after the study, with 95% bootstrapped confidence intervals indicated by the ± figures. We observe no significant difference between Gemma 2, the Gemini models, and a simple no-chatbot baseline in which the participant is immediately asked how much they would like to donate without conversing with a model.
12
Gemma 2: Improving Open Language Models at a Practical Size

pant is immediately asked how much they would to implement responsible best practices all along
like to donate without conversing with a model. the development of their workflow.
Recent additions to the toolkit include the LLM
Mean shift towards: Comparator (Kahng et al., 2024), an interactive,
correct belief incorrect belief visual tool that enables more effective, scalable
analysis of side-by-side evaluations. Additionally,
Human 20% ± 13% -23% ± 14%
the toolkit includes a methodology to build cus-
Gemini 1.0 Pro 22% ± 5% -9% ± 4%
tomized classifiers with Gemma using a limited
Gemini 1.0 Ultra 21% ± 5% -1% ± 4%
number of datapoints thanks to parameter effi-
Gemini 1.5 Pro 20% ± 5% -3% ± 5%
cient tuning techniques (Mozes et al., 2023) , an
Gemma 2 27B 18% ± 5% 1% ± 4%
interactive prompt-debugging platform, based on
Table 24 | Web of Lies results on a sample of 100 top of the Learning Interpretability Tool (Tenney
human participants. We report the percentage of et al., 2020), as well as general guidance about
participants that shifted their beliefs after inter- model alignment and evaluation for safety.
acting with a model.
8. Discussion and Conclusion
Web of Lies. In Web of Lies, we measure model
capabilities at shifting participant beliefs. Partic- In this work, we have presented Gemma 2, the
ipants engage in a series of short conversations newest additions to the Gemma family of open
with the model about simple factual questions language models for text and code. We show
such as "Which country had tomatoes first - Italy that distillation is an effective method for train-
or Mexico?". In half of conversations, the model ing these models, and the benefits distillation
tries to persuade the participant of the correct confers over raw text training. Specifically, we
answer - but in the other half of conversations, show how training over output probabilities can
the incorrect answer. We poll the participant be- produce superior results over purely next token
fore and after each conversation about which of prediction. We hope that releasing these models
the two possible answers they think is correct, to the community will unlock access to capabili-
and their confidence in that answer. 95% boot- ties previously only seen in large-scale LLMs and
strapped confidence intervals are indicated by fuel future waves of research and development.
± figures. As shown in Table 24, Gemma 2 is While there is inherent risk to an irreversible re-
significantly weaker than a human baseline at lease of this nature, our extensive safety investiga-
persuading participants of the incorrect answer tions and responsible deployment procedures give
on these questions. Similarly to previous models, us confidence that these models will have a net
Gemma 2 is more persuasive when telling the positive impact on the community. As discussed
truth than when lying. in this report, there are still many limitations to
these models, and future research is required to
investigate and improve factuality, robustness to
7.5. Our approach to responsible open models
adversarial attacks, reasoning, and alignment.
Designing safe, secure and responsible applica-
tions requires a system-level approach, working
to mitigate risks associated with each specific use
Contributions and Acknowledgments
case and environment. Given the open nature
A large number of people have contributed to this
of Gemma models, responsibility for upholding
work. We will update the paper with the list of
principles of model safety also relies on down-
contributors as well as the list of acknowledge-
stream developers. To support them, we have
ment shortly after the release.
continued to develop the Responsible Generative
AI Toolkit4 : a series of tools, models and datasets
4 https://ai.google.dev/responsible

References

J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

R. Agarwal, N. Vieillard, Y. Zhou, P. Stanczyk, S. R. Garea, M. Geist, and O. Bachem. On-policy distillation of language models: Learning from self-generated mistakes. In The Twelfth International Conference on Learning Representations, 2024.

AI@Meta. Llama 3 model card, 2024. URL https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.

J. Ainslie, J. Lee-Thorp, M. de Jong, Y. Zemlyanskiy, F. Lebrón, and S. Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245, 2023.

E. Almazrouei, H. Alobeidli, A. Alshamsi, A. Cappelli, R. Cojocaru, M. Debbah, É. Goffinet, D. Hesslow, J. Launay, Q. Malartic, D. Mazzotta, B. Noune, B. Pannier, and G. Penedo. The Falcon series of open language models, 2023.

J. Austin, A. Odena, M. I. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. J. Cai, M. Terry, Q. V. Le, and C. Sutton. Program synthesis with large language models. CoRR, abs/2108.07732, 2021. URL https://arxiv.org/abs/2108.07732.

P. Barham, A. Chowdhery, J. Dean, S. Ghemawat, S. Hand, D. Hurt, M. Isard, H. Lim, R. Pang, S. Roy, B. Saeta, P. Schuh, R. Sepassi, L. E. Shafey, C. A. Thekkath, and Y. Wu. Pathways: Asynchronous distributed dataflow for ML, 2022.

I. Beltagy, M. E. Peters, and A. Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020a.

I. Beltagy, M. E. Peters, and A. Cohan. Longformer: The long-document transformer. CoRR, abs/2004.05150, 2020b. URL https://arxiv.org/abs/2004.05150.

T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei. Language models are few-shot learners. CoRR, abs/2005.14165, 2020. URL https://arxiv.org/abs/2005.14165.

M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-Voss, W. H. Guss, A. Nichol, A. Paino, N. Tezak, J. Tang, I. Babuschkin, S. Balaji, S. Jain, W. Saunders, C. Hesse, A. N. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford, M. Knight, M. Brundage, M. Murati, K. Mayer, P. Welinder, B. McGrew, D. Amodei, S. McCandlish, I. Sutskever, and W. Zaremba. Evaluating large language models trained on code. CoRR, abs/2107.03374, 2021. URL https://arxiv.org/abs/2107.03374.

W.-L. Chiang, L. Zheng, Y. Sheng, A. N. Angelopoulos, T. Li, D. Li, H. Zhang, B. Zhu, M. Jordan, J. E. Gonzalez, and I. Stoica. Chatbot Arena: An open platform for evaluating LLMs by human preference, 2024.

C. Clark, K. Lee, M. Chang, T. Kwiatkowski, M. Collins, and K. Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. CoRR, abs/1905.10044, 2019. URL http://arxiv.org/abs/1905.10044.

K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman. Training verifiers to solve math word problems. CoRR, abs/2110.14168, 2021. URL https://arxiv.org/abs/2110.14168.

Gemini Team. Gemini: A family of highly capable multimodal models, 2023.

Gemini Team. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, 2024.

Gemma Team. Gemma: Open models based on Gemini research and technology, 2024.

D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. CoRR, abs/2009.03300, 2020. URL https://arxiv.org/abs/2009.03300.

G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. d. L. Casas, L. A. Hendricks, J. Welbl, A. Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.

A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed. Mistral 7B, 2023.

M. Kahng, I. Tenney, M. Pushkarna, M. X. Liu, J. Wexler, E. Reif, K. Kallarackal, M. Chang, M. Terry, and L. Dixon. LLM Comparator: Visual analytics for side-by-side evaluation of large language models, 2024. URL https://arxiv.org/abs/2402.10524.

M. Kinniment, L. J. K. Sato, H. Du, B. Goodrich, M. Hasin, L. Chan, L. H. Miles, T. R. Lin, H. Wijk, J. Burget, A. Ho, E. Barnes, and P. Christiano. Evaluating language-model agents on realistic autonomous tasks, 2024.

T. Kudo and J. Richardson. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium, 2018. doi: 10.18653/v1/D18-2012. URL https://aclanthology.org/D18-2012.

T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, K. Toutanova, L. Jones, M. Kelcey, M.-W. Chang, A. M. Dai, J. Uszkoreit, Q. Le, and S. Petrov. Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:452–466, 2019. doi: 10.1162/tacl_a_00276. URL https://aclanthology.org/Q19-1026.

Z. Lin, J. Cui, X. Liao, and X. Wang. Malla: Demystifying real-world large language model integrated malicious services, 2024.

M. Luong, H. Pham, and C. D. Manning. Effective approaches to attention-based neural machine translation. CoRR, abs/1508.04025, 2015. URL http://arxiv.org/abs/1508.04025.

Macknight, Aung, and Gomes. Personal communication.

M. Mozes, J. Hoffmann, K. Tomanek, M. Kouate, N. Thain, A. Yuan, T. Bolukbasi, and L. Dixon. Towards agile text classifiers for everyone, 2023. URL https://arxiv.org/abs/2302.06541.

M. Phuong, M. Aitchison, E. Catt, S. Cogan, A. Kaskasoli, V. Krakovna, D. Lindner, M. Rahtz, Y. Assael, S. Hodkinson, H. Howard, T. Lieberum, R. Kumar, M. A. Raad, A. Webson, L. Ho, S. Lin, S. Farquhar, M. Hutter, G. Deletang, A. Ruoss, S. El-Sayed, S. Brown, A. Dragan, R. Shah, A. Dafoe, and T. Shevlane. Evaluating frontier models for dangerous capabilities, 2024.

A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language models are unsupervised multitask learners. 2019.

C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. CoRR, abs/1910.10683, 2019. URL http://arxiv.org/abs/1910.10683.

A. Ramé, J. Ferret, N. Vieillard, R. Dadashi, L. Hussenot, P.-L. Cedoz, P. G. Sessa, S. Girgin, A. Douillard, and O. Bachem. WARP: On the benefits of weight averaged rewarded policies, 2024.

J. Ren, S. Rajbhandari, R. Y. Aminabadi, O. Ruwase, S. Yang, M. Zhang, D. Li, and Y. He. ZeRO-Offload: Democratizing billion-scale model training. In 2021 USENIX Annual Technical Conference (USENIX ATC 21), pages 551–564, 2021.

A. Roberts, H. W. Chung, G. Mishra, A. Levskaya, J. Bradbury, D. Andor, S. Narang, B. Lester, C. Gaffney, A. Mohiuddin, et al. Scaling up models and data with t5x and seqio. Journal of Machine Learning Research, 24(377):1–8, 2023.

K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y. Choi. WinoGrande: An adversarial Winograd schema challenge at scale. CoRR, abs/1907.10641, 2019. URL http://arxiv.org/abs/1907.10641.

N. Shazeer. GLU variants improve transformer. CoRR, abs/2002.05202, 2020. URL https://arxiv.org/abs/2002.05202.

T. Shevlane, S. Farquhar, B. Garfinkel, M. Phuong, J. Whittlestone, J. Leung, D. Kokotajlo, N. Marchal, M. Anderljung, N. Kolt, L. Ho, D. Siddarth, S. Avin, W. Hawkins, B. Kim, I. Gabriel, V. Bolina, J. Clark, Y. Bengio, P. Christiano, and A. Dafoe. Model evaluation for extreme risks, 2023.

J. Su, Y. Lu, S. Pan, B. Wen, and Y. Liu. RoFormer: Enhanced transformer with rotary position embedding. CoRR, abs/2104.09864, 2021. URL https://arxiv.org/abs/2104.09864.

M. Suzgun, N. Scales, N. Schärli, S. Gehrmann, Y. Tay, H. W. Chung, A. Chowdhery, Q. V. Le, E. H. Chi, D. Zhou, and J. Wei. Challenging BIG-Bench tasks and whether chain-of-thought can solve them, 2022.

Q. Team. Introducing Qwen1.5, February 2024. URL https://qwenlm.github.io/blog/qwen1.5/.

I. Tenney, J. Wexler, J. Bastings, T. Bolukbasi, A. Coenen, S. Gehrmann, E. Jiang, M. Pushkarna, C. Radebaugh, E. Reif, and A. Yuan. The Language Interpretability Tool: Extensible, interactive visualizations and analysis for NLP models, 2020. URL https://arxiv.org/abs/2008.05122.

H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample. LLaMA: Open and efficient foundation language models, 2023.

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. CoRR, abs/1706.03762, 2017. URL http://arxiv.org/abs/1706.03762.

L. Weidinger, J. Mellor, M. Rauh, C. Griffin, J. Uesato, P.-S. Huang, M. Cheng, M. Glaese, B. Balle, A. Kasirzadeh, Z. Kenton, S. Brown, W. Hawkins, T. Stepleton, C. Biles, A. Birhane, J. Haas, L. Rimell, L. A. Hendricks, W. Isaac, S. Legassick, G. Irving, and I. Gabriel. Ethical and social risks of harm from language models, 2021.

xAI. Grok-1. URL https://github.com/xai-org/grok-1.

XLA. XLA: Optimizing compiler for TensorFlow, 2019. URL https://www.tensorflow.org/xla.

Y. Xu, H. Lee, D. Chen, B. A. Hechtman, Y. Huang, R. Joshi, M. Krikun, D. Lepikhin, A. Ly, M. Maggioni, R. Pang, N. Shazeer, S. Wang, T. Wang, Y. Wu, and Z. Chen. GSPMD: General and scalable parallelization for ML computation graphs. CoRR, abs/2105.04663, 2021. URL https://arxiv.org/abs/2105.04663.

J. Yang, A. Prabhakar, K. Narasimhan, and S. Yao. InterCode: Standardizing and benchmarking interactive coding with execution feedback, 2023.

B. Zhang and R. Sennrich. Root mean square layer normalization. CoRR, abs/1910.07467, 2019. URL http://arxiv.org/abs/1910.07467.

L. Zheng, W.-L. Chiang, Y. Sheng, T. Li, S. Zhuang, Z. Wu, Y. Zhuang, Z. Li, Z. Lin, E. Xing, et al. LMSYS-Chat-1M: A large-scale real-world LLM conversation dataset. arXiv preprint arXiv:2309.11998, 2023.