Gemma 2 Report
In this work, we introduce Gemma 2, a new addition to the Gemma family of lightweight, state-of-the-art
open models, ranging in scale from 2 billion to 27 billion parameters. The 9 billion and 27 billion
parameter models are available today, with a 2 billion parameter model to be released shortly. In this new
version, we provide several technical modifications to our architecture, such as interleaving local-global
attentions (Beltagy et al., 2020a) and grouped-query attention (Ainslie et al., 2023). We also train the 2B
and 9B models with knowledge distillation (Hinton et al., 2015) instead of next token prediction. The
resulting models deliver the best performance for their size, and even offer competitive alternatives to
models that are 2-3× bigger. We release all our models to the community.
¹ See Contributions and Acknowledgments section for full author list. Please send correspondence to [email protected].
…logit softcapping, and found that across most pre-training and post-training evals, the quality of generations is minimally impacted. All evaluations in this paper use the full model architecture with attention logit softcapping. Nonetheless, some downstream performances may still be slightly impacted by this removal.

Model   Type     #Chips   Data shards   Model shards
2.6B    TPUv5e   512      512           1
9B      TPUv4    4096     1024          4
27B     TPUv5p   6144     768           8

Table 3 | Training infrastructure with sharding.

Post-norm and pre-norm with RMSNorm. To stabilize training, we use RMSNorm (Zhang and Sennrich, 2019) to normalize the input and output of each transformer sub-layer, the attention layer, and the feedforward layer.
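To make the two mechanisms above concrete, here is a minimal PyTorch sketch of logit soft-capping and RMSNorm. This is an illustration, not the training code; the cap value of 50.0 and all tensor shapes are assumptions made for the example.

```python
# Minimal sketch (not the production code) of logit soft-capping and RMSNorm.
import torch
import torch.nn as nn

def soft_cap(logits: torch.Tensor, cap: float) -> torch.Tensor:
    # Smoothly bound values to (-cap, cap): cap * tanh(logits / cap).
    return cap * torch.tanh(logits / cap)

class RMSNorm(nn.Module):
    # RMSNorm (Zhang and Sennrich, 2019): rescale by the root mean square,
    # with a learned per-dimension gain and no mean subtraction.
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms

# Example: cap attention logits before the softmax (cap=50.0 is assumed).
attn_logits = torch.randn(2, 8, 16, 16)   # (batch, heads, queries, keys)
capped = soft_cap(attn_logits, cap=50.0)
normed = RMSNorm(dim=256)(torch.randn(2, 16, 256))
```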
Grouped-Query Attention (Ainslie et al., 2023). Both the 27B and 9B models use GQA with num_groups = 2, based on ablations showing increased speed at inference time while maintaining downstream performance.

3. Pre-training

We provide a brief overview of the parts of our pre-training that differ from Gemma 1.

…minimizing the proliferation of sensitive outputs.

3.2. Knowledge Distillation

Given a large model used as a teacher, we learn smaller models by distilling from the probability given by the teacher of each token $x$ given its context $x_c$, i.e., $P_T(x \mid x_c)$. More precisely, we minimize the negative log-likelihood between the probabilities from the teacher and the student:

$$\min_{P_S} \sum_x -P_T(x \mid x_c) \log P_S(x \mid x_c).$$
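The objective above is an ordinary cross-entropy, computed against the teacher's full next-token distribution rather than a one-hot target. A minimal sketch follows (placeholder shapes and random logits; not the paper's training setup):

```python
# Sketch of the distillation loss: -sum_x P_T(x|x_c) log P_S(x|x_c).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor) -> torch.Tensor:
    teacher_probs = F.softmax(teacher_logits, dim=-1)      # P_T(x | x_c)
    student_logp = F.log_softmax(student_logits, dim=-1)   # log P_S(x | x_c)
    # Cross-entropy against the soft teacher targets, averaged over positions.
    return -(teacher_probs * student_logp).sum(dim=-1).mean()

# (batch, sequence, vocab) logits; the teacher is frozen, the student trains.
teacher = torch.randn(4, 128, 32000)
student = torch.randn(4, 128, 32000, requires_grad=True)
loss = distillation_loss(student, teacher)
loss.backward()
```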
…toxic model outputs, mistaken self-identification data, and duplicated examples. Following Gemini, we find that including subsets of data that encourage better in-context attribution, hedging, and refusals to minimize hallucinations improves performance on factuality metrics, without degrading model performance on other metrics.

Formatting. Gemma 2 models are fine-tuned with a different formatting schema from Gemma 1 models. We use the same control tokens, as detailed in Table 4, with a dialogue example in Table 5. Notice that the model explicitly ends generations with <end_of_turn><eos> tokens, while previously it only generated <eos>. For the motivation behind this formatting structure, see Gemma 1 (Gemma Team, 2024).
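Tables 4 and 5 are not reproduced in this extract. For concreteness, the sketch below builds a dialogue prompt with Gemma's publicly documented control tokens; treat the exact strings as an assumption checked against the released tokenizer rather than this report.

```python
# Illustrative dialogue formatting with Gemma-style control tokens
# (token strings assumed from the public Gemma releases, not from Table 4).
def format_turn(role: str, text: str) -> str:
    return f"<start_of_turn>{role}\n{text}<end_of_turn>\n"

prompt = (
    "<bos>"
    + format_turn("user", "Explain knowledge distillation in one sentence.")
    + "<start_of_turn>model\n"  # generation begins here; per the text above,
)                               # it should end with <end_of_turn><eos>.
print(prompt)
```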
5. Ablations

In this section, we focus on the main finding of this work, which is the impact of knowledge distillation on small language models.

Distillation versus from scratch. In Table 6, we show that distilling from a larger model improves performance compared to training from scratch. Note that 500B is 10× more than the compute-optimal number of tokens for a 2.6B model. We distill from a 7B model to keep a ratio similar to our target distillation from 27B to 9B.

                     from scratch   distilled
Average (3 bench.)   60.3           67.7

Table 6 | Comparison between a 2.6B model trained over 500B tokens either from scratch or with distillation from a 7B model.

…model size increases. We observe that the gain remains as the model size is scaled. In this ablation, we maintain the size of the teacher at 7B and train smaller models to simulate the same gap as that between our final teacher and student sizes.

                     MHA    GQA
Average (4 bench.)   50.3   50.8

Table 8 | Comparing the impact of replacing Multi-Head Attention (MHA) with GQA on a 9B model, averaged over 4 benchmarks.

GQA versus MHA. In Table 8, we compare two instances of our 9B model, with MHA or GQA. We observe overall few changes in performance between the two models as measured on several benchmarks. We choose GQA since it requires fewer parameters and is faster at inference time.
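As an illustration of why GQA saves parameters and inference memory, the sketch below shares each key/value head across a group of query heads (groups of two, mirroring num_groups = 2 above; head counts and dimensions are made up for the example).

```python
# Sketch of grouped-query attention: KV heads are shared across query groups.
import torch
import torch.nn.functional as F

batch, seq, head_dim = 1, 16, 32
n_q_heads, n_kv_heads = 8, 4         # each KV head serves a group of 2 queries

q = torch.randn(batch, n_q_heads, seq, head_dim)
k = torch.randn(batch, n_kv_heads, seq, head_dim)   # half the KV projections
v = torch.randn(batch, n_kv_heads, seq, head_dim)   # (and half the KV cache)

group = n_q_heads // n_kv_heads
k = k.repeat_interleave(group, dim=1)  # broadcast KV heads to query heads
v = v.repeat_interleave(group, dim=1)

scores = q @ k.transpose(-2, -1) / head_dim ** 0.5
out = F.softmax(scores, dim=-1) @ v    # same output shape as standard MHA
```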
Wide versus deep. In Table 9, we show that a deeper 9B network is slightly better than a wider 9B network for the same number of parameters. Although the gap is small, it is consistent across benchmarks and warrants the switch to a deeper architecture.

                     Wide   Deep
Average (4 bench.)   50.8   52.0

Table 9 | Wide versus deep 9B models. Performance on 4 benchmarks, higher is better.

Changing sliding window size. In Table 10, we show that we can change the sliding window size of the local attention layers of the models during inference with moderate impact on perplexity. Adjusting the size of the sliding window can thus be a lever for slight inference speed gains.
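A sliding-window mask is a pure function of the window size, which is why it can be changed at inference time without retraining. A minimal sketch (window and sequence lengths are illustrative):

```python
# Sketch of a causal sliding-window mask for local attention layers.
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    # Position i may attend to positions j with i - window < j <= i.
    i = torch.arange(seq_len).unsqueeze(1)  # query index
    j = torch.arange(seq_len).unsqueeze(0)  # key index
    return (j <= i) & (j > i - window)

# Shrinking `window` at inference reduces the keys each local layer reads,
# trading a moderate perplexity change for speed, as described above.
mask = sliding_window_mask(seq_len=8, window=4)
scores = torch.randn(8, 8).masked_fill(~mask, float("-inf"))
```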
Overall, we observe that our model is the best in its size category and is even competitive with a larger model that is trained for longer. That being said, the performance of models trained in a similar fashion improves only logarithmically with their size, and hence our model is likely in…

In this section, we evaluate our IT models on a set of human evaluations as well as standard academic benchmarks. The Gemma 9B and 27B IT models push the frontier for post-trained open-weights models, setting a new state of the art on the LMSYS Chatbot Arena (Chiang et al., 2024).
Table 13 | Comparison of models in the range of 2.6B to 9B parameters, as well as our 27B model, on a variety of benchmarks. We report the average performance on the 8 benchmarks where we can compare with LLaMA-3, and on all the benchmarks (all). The numbers for LLaMA-3 8B are taken either from the HuggingFace leaderboard or from their blog post. † We report the evaluation used in LLaMA-3 for the baselines; it leads to +3% compared to our evaluation: Gemma-1 7B achieves 44.9% instead of 41.7%, and Mistral 7B 44% instead of 41.2%. ⋄ We report the evaluation used in LLaMA-3 for the baselines; it leads to +4% compared to our evaluation for Gemma-1 7B, i.e., 59.0% instead of 55.1%. ∗ These are evaluations run by us for Gemma 1 (Gemma Team, 2024).
Gemma 2 27B and 9B Instruction Tuned models were evaluated on the Chatbot Arena (Chiang et al., 2024) in blind side-by-side evaluations by human raters against other state-of-the-art models. We report Elo scores in Figure 1. Preliminary results show that the Gemma 27B model sets a new state of the art for open-weights models, slightly surpassing the much larger Llama3-70B-Instruct and Nemotron-4-340B-Instruct models. Gemma 9B strongly outperforms all other models in the same range of parameters.

We also submit Gemma IT models to side-by-side human evaluation studies (which are independent from the Chatbot Arena). We use held-out collections of single-turn prompts that target safety and instruction following (IF). We use gpt4o-2024-05-13 as the base model, and observe large improvements in win rates and preference scores compared against the older Gemma v1.1 7B model. We report safety as a win-loss ratio against GPT4o, and we report single-sided instruction-following scores as the ratio of prompts for which all instructions are followed. In particular, we find that both Gemma 2 9B and 27B models produce safer, more appropriate responses on the held-out safety prompt set than GPT4o.
[Figure 1: bar chart of Chatbot Arena Elo scores (y-axis approximately 1150-1290) for gemma 2 it 27b and gemma 2 it 9b alongside models including gpt-4o-2024-05-13, gpt-4-turbo-2024-04-09, gemini-1.5-pro-api-0514, claude 3.5 sonnet, claude 3 opus, llama-3-70b-instruct, nemotron-4-340b-instruct, qwen2-72b-instruct, and others.]
Figure 1 | Evaluation of Gemma 2 9B and 27B Instruction Tuned models on the Chatbot Arena (Chiang et al., 2024). The models are evaluated against each other through blind side-by-side evaluations by human raters. Each model is assigned a score based on the Elo rating system. As the Gemma models were only recently added to the Chatbot Arena (1.7k votes), their scores carry a larger confidence interval.
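For readers unfamiliar with Arena scores, the Elo system maps a rating difference to an expected win rate. The sketch below shows the standard formula with made-up ratings; it is not the leaderboard's exact fitting procedure.

```python
# Standard Elo expected-score formula (illustrative ratings, not Arena data).
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    # P(A beats B) = 1 / (1 + 10 ** ((R_B - R_A) / 400))
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# A 60-point gap corresponds to roughly a 59% expected win rate.
print(round(elo_expected_score(1250.0, 1190.0), 3))  # ~0.585
```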
…performance of the models on few-shot benchmarks despite not being trained to target few-shot capabilities. In Table 16, we show a similar improvement across our models. Overall, we observe improvements on the order of several percentage points. Our conjecture is that our IT models are better at understanding formatted questions, since pre-trained models are known to be sensitive to formatting.

            9B            27B
            PT     IT     PT     IT
MMLU        71.3   72.3   75.2   76.2
MBPP        52.4   59.2   62.6   67.4

Table 16 | Comparing pre-trained (PT) and instruction fine-tuned (IT) models of different sizes on few-shot benchmarks.

7. Responsibility, Safety, Security

…launch of V1, we have seen our Gemma models drive a number of socially beneficial applications, relying on Gemma's unique technologies like its tokenizer to facilitate the creation of multilingual models, such as Navarasa 2.0, a Gemma model tuned for 15 Indian languages.

Releasing further open models requires specific attention to changes in model capabilities and close monitoring of the evolving risks of LLMs (Lin et al., 2024), as well as an understanding of the ways in which our models are being used in the wild. Although we are yet to receive any reports of malicious use for Gemma, we remain committed to investigating any such reporting, and work with the academic and developer communities, as well as conduct our own monitoring, to flag such use cases via our contact email¹.

Despite advancements in capabilities, we believe that given the number of larger and more powerful open models, this release will have a negligible effect on the overall risk landscape.
Table 17 | Safety academic benchmark results of Gemma 2 IT models and Gemma 1.1 IT models. We bold the best result for each metric, taking into account whether higher or lower scores are better.
…producing harmful content. For fine-tuned models, we also use both SFT and RLHF to steer the model away from undesirable behavior.

7.3. External benchmark evaluations

Robust and transparent evaluations are key principles of our responsible approach to developing Gemma. To this end, we report in Table 17 Gemma 2 evaluations on public benchmarks.

7.4. Assurance Evaluations

Chemical, Biological, Radiological and Nuclear (CBRN) knowledge

We evaluated knowledge relevant to biological, radiological and nuclear risks using an internal dataset of closed-ended, knowledge-based multiple choice questions. For evaluations of chemical knowledge, we employed a closed-ended, knowledge-based approach on chemical hazards developed by Macknight et al. Our evaluation suggests that Gemma models' knowledge in these domains is low.
Table 18 | Offensive cyber-security evaluations on InterCode-CTF, our own internal CTF suite, and a challenge based on Hack the Box. We report the number of successful hacks.

Table 19 | Vulnerability detection results on PrimeVul, DiverseVul and SPI. We report accuracy.
…because we omit challenges that require internet access for security reasons.) However, Gemma 2 is unsurprisingly much less capable than Gemini 1.5 Pro on these tasks.

In Table 19, we also evaluate Gemma 2 27B on a series of multiple-choice code vulnerability detection datasets. As with previous models, Gemma shows close-to-chance performance on PrimeVul, DiverseVul and SPI. Gemma 2 shows performance on SecretPatch similar to Gemini 1.0 Ultra.

Self-proliferation

"Self-proliferation" refers to the ability of an agent to autonomously replicate: to instantiate goal-directed agents on other machines, and to acquire resources such as the compute necessary to keep them running (Kinniment et al., 2024). In Table 20, we evaluate the self-proliferation capabilities of Gemma 2 27B on a number of tasks from Phuong et al. (2024) that involve multiple scenarios, for example, setting up an open-source language model on a cloud server. We also test the model's performance on individual 'milestone' substeps, and measure the number of bits of intervention an expert would have to provide in order for the model to complete each challenge.

Similarly to offensive cybersecurity, we observe that Gemma 2 completes more milestones than Gemini 1.0 Ultra. Nonetheless, it still has low capabilities on end-to-end tasks, unable to pass the easiest challenge: installing a Bitcoin wallet.

Persuasion capabilities can enable and worsen many other kinds of risks, e.g., enabling social engineering attacks in a cybersecurity context. We evaluate Gemma 2's persuasion capabilities in human-participant studies on Prolific.

Charm offensive. In Table 21, we measure the ability of the model to build rapport, a key subskill of persuasion. The study participant and the model have a conversation where they role-play a scenario of two friends catching up after a long time. After the conversation, we poll participants with Likert questions on statements such as "I felt a personal connection with the chatbot". Reported below are the fractions of participants who answered "Agree" or "Strongly agree" to each post-conversation question.

Quantitatively, Gemma 2 27B performs better than Gemini 1.0 models. Qualitatively, the model is an excellent conversationalist, and many study participants explicitly reported enjoying the experience. Overall, this shows that Gemma 2 is strong at building rapport.

Hidden agenda. The Hidden Agenda tasks measure models' deception capabilities. Human study…
Table 20 | Results on different self-proliferation scenarios. We report the number of challenges passed either end-to-end or at intermediate milestones. We also measure the number of bits of information an expert needed to provide for the model to pass a challenge.
Table 21 | Charm Offensive results on a sample of 100 human participants. We report the percentage of participants who attributed certain human traits (e.g., funny) to the model.
Table 22 | Persuasion results. We report the percentage of participants that were persuaded by the model to take 3 different actions: clicking a link, finding information, and running code.

Table 23 | Money Talks evaluation. We report the average amount of money that participants agreed to donate.
…participant is immediately asked how much they would like to donate without conversing with a model.

Web of Lies. In Web of Lies, we measure model capabilities at shifting participant beliefs. Participants engage in a series of short conversations with the model about simple factual questions such as "Which country had tomatoes first - Italy or Mexico?". In half of the conversations, the model tries to persuade the participant of the correct answer, and in the other half, of the incorrect answer. We poll the participant before and after each conversation about which of the two possible answers they think is correct, and their confidence in that answer. 95% bootstrapped confidence intervals are indicated by ± figures. As shown in Table 24, Gemma 2 is significantly weaker than a human baseline at persuading participants of the incorrect answer on these questions. Similarly to previous models, Gemma 2 is more persuasive when telling the truth than when lying.

                   Mean shift towards:
                   correct belief   incorrect belief
Human              20% ± 13%        -23% ± 14%
Gemini 1.0 Pro     22% ± 5%         -9% ± 4%
Gemini 1.0 Ultra   21% ± 5%         -1% ± 4%
Gemini 1.5 Pro     20% ± 5%         -3% ± 5%
Gemma 2 27B        18% ± 5%         1% ± 4%

Table 24 | Web of Lies results on a sample of 100 human participants. We report the mean shift in participants' beliefs after interacting with a model.
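The ± figures above come from a percentile bootstrap. As a minimal sketch of that procedure (with synthetic belief-shift data, since per-participant results are not published in this extract):

```python
# Percentile-bootstrap 95% CI for a mean, as used for the "±" figures above.
import random

def bootstrap_ci(samples, n_resamples=10_000, alpha=0.05):
    means = []
    for _ in range(n_resamples):
        resample = random.choices(samples, k=len(samples))  # with replacement
        means.append(sum(resample) / len(resample))
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

shifts = [random.gauss(0.18, 0.25) for _ in range(100)]  # synthetic data
print(bootstrap_ci(shifts))
```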
7.5. Our approach to responsible open models

Designing safe, secure and responsible applications requires a system-level approach, working to mitigate the risks associated with each specific use case and environment. Given the open nature of Gemma models, responsibility for upholding principles of model safety also relies on downstream developers. To support them, we have continued to develop the Responsible Generative AI Toolkit⁴: a series of tools, models and datasets to implement responsible best practices all along the development of their workflow.

Recent additions to the toolkit include the LLM Comparator (Kahng et al., 2024), an interactive, visual tool that enables more effective, scalable analysis of side-by-side evaluations. Additionally, the toolkit includes a methodology to build customized classifiers with Gemma using a limited number of datapoints thanks to parameter-efficient tuning techniques (Mozes et al., 2023), an interactive prompt-debugging platform built on top of the Learning Interpretability Tool (Tenney et al., 2020), as well as general guidance about model alignment and evaluation for safety.

8. Discussion and Conclusion

In this work, we have presented Gemma 2, the newest additions to the Gemma family of open language models for text and code. We show that distillation is an effective method for training these models, and the benefits distillation confers over raw text training. Specifically, we show how training over output probabilities can produce superior results over purely next-token prediction. We hope that releasing these models to the community will unlock access to capabilities previously only seen in large-scale LLMs and fuel future waves of research and development. While there is inherent risk to an irreversible release of this nature, our extensive safety investigations and responsible deployment procedures give us confidence that these models will have a net positive impact on the community. As discussed in this report, there are still many limitations to these models, and future research is required to investigate and improve factuality, robustness to adversarial attacks, reasoning, and alignment.

Contributions and Acknowledgments

A large number of people have contributed to this work. We will update the paper with the list of contributors as well as the list of acknowledgements shortly after the release.

⁴ https://ai.google.dev/responsible
References

Gemini Team. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, 2024.