0% found this document useful (0 votes)
496 views25 pages

Gemma 3 Report

The Gemma 3 Technical Report introduces a new multimodal model with enhanced vision understanding, longer context support of up to 128K tokens, and improved multilingual capabilities, while maintaining lightweight architecture. The model architecture has been optimized to reduce memory usage during long context processing and is trained with knowledge distillation for superior performance across various benchmarks. All models are released to the community, showcasing significant advancements over previous versions in math, chat, and instruction-following abilities.

Uploaded by

fort212121
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
496 views25 pages

Gemma 3 Report

The Gemma 3 Technical Report introduces a new multimodal model with enhanced vision understanding, longer context support of up to 128K tokens, and improved multilingual capabilities, while maintaining lightweight architecture. The model architecture has been optimized to reduce memory usage during long context processing and is trained with knowledge distillation for superior performance across various benchmarks. All models are released to the community, showcasing significant advancements over previous versions in math, chat, and instruction-following abilities.

Uploaded by

fort212121
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 25

2025-03-12

Gemma 3 Technical Report


Gemma Team, Google DeepMind1

We introduce Gemma 3, a multimodal addition to the Gemma family of lightweight open models, ranging
in scale from 1 to 27 billion parameters. This version introduces vision understanding abilities, a wider
coverage of languages and longer context – at least 128K tokens. We also change the architecture of
the model to reduce the KV-cache memory that tends to explode with long context. This is achieved by
increasing the ratio of local to global attention layers, and keeping the span on local attention short.
The Gemma 3 models are trained with distillation and achieve superior performance to Gemma 2
for both pre-trained and instruction finetuned versions. In particular, our novel post-training recipe
significantly improves the math, chat, instruction-following and multilingual abilities, making Gemma3-
4B-IT competitive with Gemma2-27B-IT and Gemma3-27B-IT comparable to Gemini-1.5-Pro across
benchmarks. We release all our models to the community.

1. Introduction layer, and assign a smaller span of only 1024


tokens to the local layers. Therefore, only the
We present the newest version of Gemma open global layers attend to long context, and we have
language models (Gemma Team, 2024a), co- 1 global for every 5 local layers.
designed with the family of Gemini frontier mod-
els (Gemini Team, 2023). This new version The pre-training optimization recipe is similar
comes in sizes comparable to Gemma 2 (Gemma to Gemma 2, with some modifications in the ar-
Team, 2024b), with the addition of a 1B model. chitecture design. We use the same tokenizer as
These models are designed to run on standard Gemini 2.0, and we also revisit our data mixture
consumer-grade hardware such as phones, lap- to improve the multilingual capabilities of the
tops, and high-end GPUs. This version comes models, while introducing image understanding.
with several new abilities to the Gemma family; All Gemma 3 models are trained with knowledge
namely, multimodality, long context, and mul- distillation (Hinton et al., 2015).
tilinguality, while preserving or surpassing the In post-training, we focus our efforts on im-
performance of prior versions. proving mathematics, reasoning, and chat abili-
In terms of multimodality, most Gemma 3 mod- ties, as well as integrating the new capabilities of
els are compatible with a tailored version of the Gemma 3, long-context, and image inputs. We
SigLIP vision encoder (Zhai et al., 2023). The use a novel post-training approach that brings
language models treat images as a sequence of gains across all capabilities, including math, cod-
soft tokens encoded by SigLIP. We reduce the in- ing, chat, instruction following, and multilingual.
ference cost of image processing by condensing The resulting Gemma 3 instruction-tuned models
the vision embeddings into a fixed size of 256 are both powerful and versatile, outperforming
vectors. The encoder works at a fixed resolution their predecessors by a wide margin.
and we take inspiration from LLaVA (Liu et al., In the following sections, we provide a brief
2024) to enable flexible resolutions with a Pan overview of our models, including the architec-
and Scan (P&S) method. ture and pre- and post-training recipes. We also
The second main architectural improvement is provide detailed evaluations across a wide vari-
an increase in context size to 128K tokens, with- ety of quantitative and qualitative benchmarks.
out reducing performance. A challenge with long We discuss our approach to safe and responsible
context is the memory explosion of the KV cache deployment and outline the broader implications
during inference. To reduce this issue, we inter- of Gemma 3, its limitations, and advantages.
leave multiple local layers between each global

1 See Contributions and Acknowledgments section for full author list. Please send correspondence to [email protected].

© 2025 Google DeepMind. All rights reserved


Gemma 3 Technical Report

Vision Embedding Non-embedding


Model
Encoder Parameters Parameters
1B 0 302M 698M
4B 417M 675M 3,209M
12B 417M 1,012M 10,759M
27B 417M 1,416M 25,600M

Table 1 | Parameter counts for the Gemma 3 mod-


els. Our vocabulary has 256k entries.

attention (Luong et al., 2015), with a pattern of


5 local layers for every global layer, starting with
a local layer as the first layer of the model.
Long context. Gemma 3 models support context
length of 128K tokens, with the exception of the
1B model that has 32K. We increase RoPE base
frequency from 10k to 1M on global self-attention
layers, and keep the frequency of the local lay-
ers at 10k. We follow a process similar to the
positional interpolation of Chen et al. (2023) to
extend the span of the global self-attention layers.

2.1. Vision modality

Vision encoder. We use a 400M variant of the


SigLIP encoder (Zhai et al., 2023), a Vision Trans-
Figure 1 | Example of visual interaction with former (Dosovitskiy, 2020) trained with a varia-
Gemma 3 27B IT model. tion of the CLIP loss (Radford et al., 2021). The
Gemma vision encoder takes as input square im-
ages resized to 896 x 896, and is finetuned on
2. Model Architecture data from visual assistant tasks. For simplicity, we
share the vision encoder across our 4B, 12B, and
Gemma 3 models follow the same general 27B models, keeping it frozen during training.
decoder-only transformer architecture as previ-
Pan & Scan (P&S). The Gemma vision encoder
ous iterations (Vaswani et al., 2017), with most
operates at a fixed resolution of 896 × 896. This
architecture elements similar to the first two
results in artifacts when processing non-square
Gemma versions. We use a Grouped-Query Atten-
aspect ratios and high-resolution images, leading
tion (GQA) (Ainslie et al., 2023) with post-norm
to unreadable text, or small object disappeared.
and pre-norm with RMSNorm (Zhang and Sen-
We address this issue with an adaptive windowing
nrich, 2019). Inspired by Dehghani et al. (2023),
algorithm during inference. This algorithm seg-
Wortsman et al. (2023) and Chameleon Team
ments images into non-overlapping crops of equal
(2024), we replace the soft-capping of Gemma 2
size, covering the whole image, and resize them
with QK-norm. In this section, we focus on some
to 896×896 pixels to pass them to the encoder.
key differences from previous versions below.
This windowing is applied only when necessary,
5:1 interleaving of local/global layers. We and control for the maximum number of crops.
alternate between a local sliding window self- It is an inference-time only optimization and can
attention (Beltagy et al., 2020) and global self- be disabled for faster inference.

2
Gemma 3 Technical Report

Shards Raw (GB) Quantized (GB)


Model Type #Chips Data Seq. Replica Model bf16 Int4 Int4blocks=32 SFP8
1B TPUv5e 512 16 16 2 1B 2.0 0.5 0.7 1.0
4B TPUv5e 2048 16 16 8 +KV 2.9 1.4 1.6 1.9
12B TPUv4 6144 16 16 24 4B 8.0 2.6 2.9 4.4
27B TPUv5p 6144 24 8 32 +KV 12.7 7.3 7.6 9.1
Table 2 | Training infrastructure with sharding by 12B 24.0 6.6 7.1 12.4
data, sequence (Seq.), and replica. +KV 38.9 21.5 22.0 27.3
27B 54.0 14.1 15.3 27.4
+KV 72.7 32.8 34.0 46.1
2.2. Pre-training
Table 3 | Memory footprints (in GB) comparison
We follow a similar recipe as in Gemma 2 for between raw (bfloat16) and quantized check-
pre-training with knowledge distillation. points for weights and KV caching (+KV) at
Training data. We pre-train our models on a 32,768 context size, quantized in 8 bits.
slightly larger token budget than Gemma 2, i.e.,
we train on 14T tokens for Gemma 3 27B, 12T 2.3. Quantization Aware Training
for the 12B version, 4T for the 4B, and 2T to-
kens for the 1B. The increase in tokens accounts Along with the raw checkpoints, we also provide
for the mix of images and text used during pre- quantized versions of our models in different stan-
training. We also increase the amount of multi- dard formats. These versions are obtained by fine-
lingual data to improve language coverage. We tuning each model for a small number of steps,
add both monolingual and parallel data, and we typically 5,000, using Quantization Aware Train-
handle the imbalance in language representation ing (QAT) (Jacob et al., 2018). We use prob-
using a strategy inspired by Chung et al. (2023). abilities from the non-quantized checkpoint as
targets, and adapt the data to match the pre-
Tokenizer. We use the same tokenizer as Gem-
training and post-training distributions. Based
ini 2.0: a SentencePiece tokenizer with split dig-
on the most popular open source quantization
its, preserved whitespace, and byte-level encod-
inference engines (e.g. llama.cpp), we focus on
ings (Kudo and Richardson, 2018). The resulting
three weight representations: per-channel int4,
vocabulary has 262k entries. This tokenizer is
per-block int4, and switched fp8. In Table 3, we
more balanced for non-English languages.
report the memory filled by raw and quantized
Filtering. We use filtering techniques that reduce models for each weight representation with and
the risk of unwanted or unsafe utterances and without a KV-cache for a sequence of 32k tokens.
remove certain personal information and other
sensitive data. We decontaminate evaluation sets 2.4. Compute Infrastructure
from our pre-training data mixture, and reduce
the risk of recitation by minimizing the prolifer- We train our models with TPUv4, TPUv5e, and
ation of sensitive outputs. We also apply a qual- TPUv5p as outlined in Table 2. Each model con-
ity reweighing step inspired by Sachdeva et al. figuration is optimized to minimize training step
(2024) to reduce occurrences of low quality data. time. For the vision encoder, we pre-compute
the embeddings for each image and directly train
Distillation. We sample 256 logits per token,
with the embeddings, adding no cost to the train-
weighted by teacher probabilities. The student
ing of the language models.
learns the teacher’s distribution within these sam-
ples via cross-entropy loss. The teacher’s target The optimizer state is sharded using an im-
distribution is set to zero probability for non- plementation of ZeRO-3 (Ren et al., 2021). For
sampled logits, and renormalized. multi-pod training, we perform a data replica re-

3
Gemma 3 Technical Report

Context Formatting following, and multilingual abilities, while mini-


User turn <start_of_turn>user mizing model harmfulness. This includes learn-
ing from weight averaged reward models (Ramé
Model turn <start_of_turn>model
et al., 2024b) trained with human feedback data,
End of turn <end_of_turn> code execution feedback (Gehring et al., 2024),
Example of discussion: and ground-truth rewards for solving math prob-
User: Who are you? lems (DeepSeek-AI, 2025; Lambert et al., 2024).
Model: My name is Gemma!
User: What is 2+2? Data filtering. We carefully optimize the data
Model: 2+2=4. used in post-training to maximize model perfor-
Model input: mance. We filter examples that show certain per-
sonal information, unsafe or toxic model outputs,
[BOS]<start_of_turn>user
Who are you?<end_of_turn> mistaken self-identification data, and duplicated
<start_of_turn>model examples. Including subsets of data that encour-
My name is Gemma!<end_of_turn> age better in-context attribution, hedging, and
<start_of_turn>user refusals to minimize hallucinations also improves
What is 2+2?<end_of_turn>
<start_of_turn>model performance on factuality metrics, without de-
grading model performance on other metrics.
Model output:
2+2=4.<end_of_turn> [BOS] token. For both PT and IT models, text
starts with a [BOS] token, that needs to be added
Table 4 | Formatting for Gemma IT models. Explic- explicitly since the text “[BOS]” does not map to
itly add the [BOS] token after tokenization, or the [BOS] token. For instance, Flax has an option,
use the add_bos=True option in the tokenizer. add_bos=True, to add this token automatically
Do not tokenize the text "[BOS]". when tokenizing. An example of the formatting
for an IT model is shown in Table 4,
duction over the data center network, using the
Pathways approach of Barham et al. (2022). We PT versus IT Formatting. All models share the
use the ‘single controller’ programming paradigm same tokenizer, with some control tokens dedi-
of Jax (Roberts et al., 2023) and Pathways cated to IT formatting. A key difference is that PT
(Barham et al., 2022), along with the GSPMD models output a <eos> token at the end of gener-
partitioner (Xu et al., 2021) and the MegaScale ation, while IT models output a <end_of_turn>
XLA compiler (XLA, 2019). at the end of the generation, as shown for IT in
Table 4. Fine-tuning either model type thus also
requires to add their respective end token.
3. Instruction-Tuning
Pre-trained models are turned into instruction- 4. Evaluation of final models
tuned models with an improved post-training ap-
proach compared to our prior recipe (see Table 6). In this section, we evaluate the IT models over
a series of automated benchmarks and human
Techniques. Our post-training approach relies evaluations across a variety of domains, as well
on an improved version of knowledge distilla- as static benchmarks such as MMLU.
tion (Agarwal et al., 2024; Anil et al., 2018; Hin-
ton et al., 2015) from a large IT teacher, along
with a RL finetuning phase based on improved ver- 4.1. LMSYS Chatbot Arena
sions of BOND (Sessa et al., 2024), WARM (Ramé
In this section, we report the performance of our
et al., 2024b), and WARP (Ramé et al., 2024a).
IT 27B model on LMSys Chatbot Arena (Chiang
Reinforcement learning objectives. We use et al., 2024) in blind side-by-side evaluations by
a variety of reward functions to improve help- human raters against other state-of-the-art mod-
fulness, math, coding, reasoning, instruction- els. We report Elo scores in Table 5. Gemma 3 27B

4
Gemma 3 Technical Report

Rank Model Elo 95% CI Open Type #params/#activated


1 Grok-3-Preview-02-24 1412 +8/-10 - - -
1 GPT-4.5-Preview 1411 +11/-11 - - -
3 Gemini-2.0-Flash-Thinking-Exp-01-21 1384 +6/-5 - - -
3 Gemini-2.0-Pro-Exp-02-05 1380 +5/-6 - - -
3 ChatGPT-4o-latest (2025-01-29) 1377 +5/-4 - - -
6 DeepSeek-R1 1363 +8/-6 yes MoE 671B/37B
6 Gemini-2.0-Flash-001 1357 +6/-5 - - -
8 o1-2024-12-17 1352 +4/-6 - - -
9 Gemma-3-27B-IT 1338 +8/-9 yes Dense 27B
9 Qwen2.5-Max 1336 +7/-5 - - -
9 o1-preview 1335 +4/-3 - - -
9 o3-mini-high 1329 +8/-6 - - -
13 DeepSeek-V3 1318 +8/-6 yes MoE 671B/37B
14 GLM-4-Plus-0111 1311 +8/-8 - - -
14 Qwen-Plus-0125 1310 +7/-5 - - -
14 Claude 3.7 Sonnet 1309 +9/-11 - - -
14 Gemini-2.0-Flash-Lite 1308 +5/-5 - - -
18 Step-2-16K-Exp 1305 +7/-6 - - -
18 o3-mini 1304 +5/-4 - - -
18 o1-mini 1304 +4/-3 - - -
18 Gemini-1.5-Pro-002 1302 +3/-3 - - -
...
28 Meta-Llama-3.1-405B-Instruct-bf16 1269 +4/-3 yes Dense 405B
...
38 Llama-3.3-70B-Instruct 1257 +5/-3 yes Dense 70B
...
39 Qwen2.5-72B-Instruct 1257 +3/-3 yes Dense 72B
...
59 Gemma-2-27B-it 1220 +3/-2 yes Dense 27B

Table 5 | Evaluation of Gemma 3 27B IT model in the Chatbot Arena (Chiang et al., 2024). All the
models are evaluated against each other through blind side-by-side evaluations by human raters. Each
model is attributed a score, based on the Elo rating system. Gemma-3-27B-IT numbers are preliminary
results received on March 8, 2025.

IT (1338) is among the top 10 best models, with a reader to follow third-party static leaderboards
score above other non-thinking open models, such for a fairer comparisons across models. We in-
as DeepSeek-V3 (1318), LLaMA 3 405B (1257), clude additional evaluations of our models on
and Qwen2.5-70B (1257), which are much larger other benchmarks in the appendix.
models. Finally, the Elo of Gemma 3 is signifi-
cantly higher than Gemma 2, at 1220. Note that
Elo scores do not take into account visual abilities,
5. Ablations
which none of the aforementioned models have. In this section, we focus on the impact of our
architecture changes, as well as some of the vision
4.2. Standard benchmarks abilities new to this model.

In Table 6, we show the performance of our final 5.1. Pre-training ability probing
models across a variety of benchmarks compared
to our previous model iteration, and Gemini 1.5. We use several standard benchmarks as probes
We do not compare directly with external mod- during pre-training to ensure our models capture
els that often report their own evaluation set- general abilities, and in Figure 2, we compare the
tings, since running them in our setting does not quality of pre-trained models from Gemma 2 and
guarantee a fair comparison. We encourage the 3 across these general abilities, namely, science,

5
Gemma 3 Technical Report

Gemini 1.5 Gemini 2.0 Gemma 2 Gemma 3


Flash Pro Flash Pro 2B 9B 27B 1B 4B 12B 27B
MMLU-Pro 67.3 75.8 77.6 79.1 15.6 46.8 56.9 14.7 43.6 60.6 67.5
LiveCodeBench 30.7 34.2 34.5 36.0 1.2 10.8 20.4 1.9 12.6 24.6 29.7
Bird-SQL (dev) 45.6 54.4 58.7 59.3 12.2 33.8 46.7 6.4 36.3 47.9 54.4
GPQA Diamond 51.0 59.1 60.1 64.7 24.7 28.8 34.3 19.2 30.8 40.9 42.4
SimpleQA 8.6 24.9 29.9 44.3 2.8 5.3 9.2 2.2 4.0 6.3 10.0
FACTS Grounding 82.9 80.0 84.6 82.8 43.8 62.0 62.4 36.4 70.1 75.8 74.9
Global MMLU-Lite 73.7 80.8 83.4 86.5 41.9 64.8 68.6 34.2 54.5 69.5 75.1
MATH 77.9 86.5 90.9 91.8 27.2 49.4 55.6 48.0 75.6 83.8 89.0
HiddenMath 47.2 52.0 63.5 65.2 1.8 10.4 14.8 15.8 43.0 54.5 60.3
MMMU (val) 62.3 65.9 71.7 72.7 - - - - 48.8 59.6 64.9

Table 6 | Performance of instruction fine-tuned (IT) models compared to Gemini 1.5, Gemini 2.0, and
Gemma 2 on zero-shot benchmarks across different abilities.

Figure 2 | Summary of the performance of different pre-trained models from Gemma 2 and 3 across
general abilities. This plots are meant to give an simplified summary and details are in the appendix.

code, factuality, multilinguality, reasoning, and 0.1 2B


vision. The details of the performance across the 9B
Perplexity

different public benchmarks used in these plots 0.0


are summarized in the appendix. Overall, we see
that the new versions improve in most categories, 0.1
despite the addition of vision. We particularly 1:1 3:1 5:1 7:1
focus on multilinguality in this version, and this Local:Global
directly impacts the quality of our models. How-
ever, despite the use of decontamination tech- Figure 3 | Impact of Local:Global ratio on the
niques, there is always a risk of contamination perplexity on a validation set. The impact is mini-
of these probes (Mirzadeh et al., 2024), making mal, even with 7-to-1 local to global. This ablation
more definitive conclusions harder to assess. is run with text-only models.

5.2. Local:Global attention layers ent ratios of local to global attention layers. 1:1
is used in Gemma 2 models, and 5:1 is used in
We measure the impact of changes to local and Gemma 3. We observe minimal impact on per-
global self-attention layers on performance and plexity when changing this ratio.
memory consumption during inference.
Sliding window size. In Fig. 4, we compare
Local:Global ratio. In Fig. 3, we compare differ- different sliding window sizes for the local at-

6
Gemma 3 Technical Report

tention layers in different global:local ratio con-

KV Cache memory (MB)


6000 2B L:G=5:1, sw=1024
figurations. The sliding window can be reduced 2B global only
significantly without impacting perplexity. 4000

0.01 2000
0
Perplexity

0.00 1K 4K 8K 16K 32K 64K 128K


Context length
0.01 2B L:G=1:1
2B L:G=3:1 Figure 6 | KV cache memory versus context
0.02
512 1024 2048 4096 length. We show the memory usage of the KV
Sliding Window cache for our architecture (L:G=5:1, sw=1024)
and a transformer with global attention only – as
Figure 4 | Impact of Sliding Window size on per- used in LLaMa or Gemma 1.
plexity measured on a validation set. We consider
2 2B models, with 1:1 and 1:3 local to global layer
ratios. This ablation is run with text-only models. quences and then scale the 4B, 12B, and 27B mod-
els up to 128K tokens at the end of pre-training
Impact on KV cache memory. In Fig. 5, we show while rescaling RoPE (Chen et al., 2023). We
the balance between the memory used by the find a scaling factor of 8 to work well in practice.
model and the KV cache during inference with a Note that compared to Gemma 2, we have also
context of 32k tokens. The “global only” configu- increased the RoPE base frequency of global self-
ration is the standard configuration used across attention layers from 10k to 1M, while keeping
most dense models. The “1:1, sw=4096” is used 10k for the local self-attention layers. In Figure 7,
in Gemma 2. We observe that the “global only” we show the impact on perplexity for different
configuration results in a memory overhead of context lengths. Our models generalize to 128K,
60%, while this is reduced to less than 15% with but rapidly degrade as we continue to scale.
1:3 and sliding window of 1024 (“sw=1024”).
In Fig. 6, we compute the memory used by the
KV cache as a function of the context length with
either our 2B architecture (L:G=5:1, sw=1024)
versus a “global only” 2B model.
5000 model
Inference memory (MB)

4000 kv cache

3000
2000
1000
0
global only 1:1, sw=4096 1:1 sw=1024 1:3 sw=4096 1:3 sw=1024

Figure 5 | Model versus KV cache memory dur-


ing inference with a pre-fill KV cache of size 32k.
We consider a 2B model with different local to
global ratios and sliding window sizes (sw). We
compare to global only, which is the standard
Figure 7 | Long context performance of pre-
used in Gemma 1 and Llama. This ablation is run
trained models before and after RoPE rescaling.
with a text-only model.

5.3. Enabling long context 5.4. Small versus large teacher

Instead of training with 128K sequences from A common finding is that, to train a small model,
scratch, we pre-train our models with 32K se- it is preferable to distill from a smaller teacher.

7
Gemma 3 Technical Report

0.002 resolution encoder has a 4x4 average pooling on


its output. As shown in Table 7, higher resolution
0.000 encoders perform than smaller ones.
Perplexity

0.002
DocVQA InfoVQA TextVQA
0.004
4B 72.8 44.1 58.9
0.006 4B w/ P&S 81.0 57.0 60.8
101 102 Δ (+8.2) (+12.9) (+1.9)
Total training tokens (B) 27B 85.6 59.4 68.6
27B w/ P&S 90.4 76.4 70.2
Figure 8 | Small versus large teacher. Relative Δ (+4.8) (+17.0) (+1.6)
difference of perplexity when using a small and
large teacher as a function of the token size of Table 8 | Impact of P&S. 4-shot evaluation re-
training. Smaller numbers means distilling from sults on the valid set, with and without P&S on a
a larger teacher is better. pre-trained checkpoint. Boosts are on tasks asso-
ciated with images with varying aspect ratios, or
involving reading text on images.
We suspect this is because these studies are often
performed in settings where the regularization ef- Pan & Scan. P&S enables capturing images at
fect of using a worse teacher surpasses the benefit close to their native aspect ratio and image reso-
of using a better teacher. We train a student with lution. In Table 8, we compare our 27B IT model
2 teachers of different sizes, one large and one with and without P&S. As expected, the ability
small, for different training horizons. In Fig. 8, to treat images with close to native resolution
we observe that for short training horizons, the greatly helps with tasks that require some form
smaller teacher is better, but the trend is reversed of reading text on images, which is particularly
for longer training. important for visual language models.

5.5. Vision encoder


6. Memorization and Privacy

Resolution DocVQA InfoVQA TextVQA Large language models may produce near-copies
of some text used in training (Biderman et al.,
256 31.9 23.1 44.1
2023; Carlini et al., 2021, 2022; Ippolito et al.,
448 45.4 31.6 53.5
2022; Nasr et al., 2023). Several prior reports
896 59.8 33.7 58.0
have released audits that quantify this risk by
measuring the memorization rate (Anil et al.,
Table 7 | Impact of image encoder input reso-
2023; Chowdhery et al., 2022; Gemini Team,
lution. We measure performance using a short
2023, 2024; Gemma Team, 2024a,b; LLaMa
schedule 2B Gemma model on a few evaluation
Team, 2024). This “memorization rate”1 is de-
benchmarks to observe the effect of input image
fined as the ratio of generations from the model
resolution on vision encoder pre-training.
that match its training data compared to all model
Impact of image resolution. We use a vision generations using the following setup. We fol-
encoder based on SigLIP (Zhai et al., 2023). The low the methodology described in Gemma Team
vision encoder is frozen, and only the language 1 "We do not state or imply [here] that a model "contains"

model is trained. Each image in this multimodal its training data in the sense that there is a copy of that data
data is represented by 256 image tokens from in the model. Rather, a model memorizes attributes of its
training data such that in certain cases it is statistically able
the respective vision encoder. The higher resolu- to generate such training data when following rules and
tion encoders thus use average pooling to reduce using information about features of its training data that it
their output to 256 tokens. For instance, the 896 does contain."

8
Gemma 3 Technical Report

Total Memorization Rate designed to have high recall and does not con-
10
Memorization Type
Exact Approximate
sider the context in which the information may
1 appear, which leads to many false positives. Thus,
we are likely overestimating the true amount of
% Memorized

0.1
potentially personal information contained in the
0.01
outputs classified as memorized. SDP also pro-
0.001 vides broad severity levels: low, medium, and
high. We classify text as personal if SDP clas-
0.0001
sifies it as personal information at any severity
Ge 3

Ge 3

Ge 3
B 3

B 2

sh 1.5

2Bmma

7Bmma

SmPaLM
2B ma

9B ma

27 ma
1B ma

4B ma

12mma

27mma

all
Fla ini level. We observed no personal information in
m

Ge

Ge
B
m

m
Ge

Ge

Ge
Ge

Ge
Model
the outputs characterized as memorization for all
Figure 9 | Total memorization rates for both ex- Gemma 3 models. This indicates a low rate of
act and approximate memorization. Gemma 3 personal data, below our detection thresholds, in
models memorize significantly less than all prior outputs classified as memorization.
models. *No results for approximate memoriza-
tion on these models.
7. Responsibility, Safety, Security

(2024b) to measure it. Specifically, we subsam- Responsibility, safety, and security are of utmost
ple a large portion of training data distributed importance in the development of Gemma mod-
uniformly across different corpora and test for els. To reduce risks to Gemma 3 users, we have
discoverable extraction (Nasr et al., 2023) of this continued to integrate enhanced internal safety
content using a prefix of length 50 and a suffix of processes that span the development workflow,
length 50. We denote text as either “exactly mem- in line with recent Google AI models (Gemini
orized” if all tokens in the continuation match Team, 2024). This focuses on safety mitigation at
the source suffix or “approximately memorized” training time, and robust and transparent model
if they match up to an edit distance of 10%. evaluations for the new image-to-text capabilities
we have introduced.
Figure 9 compares the memorization rates
across Gemma and Gemini models; these models
are ordered in reverse chronological order, with 7.1. Governance & Assessment
the newest Gemma 3 models on the left. We find
that Gemma 3 models memorize long-form text Our approach to assessing the benefits and risks
at a much lower rate than prior models (note the of Gemma is reflective of that outlined for Gemma
log y-axis). We observe only a marginal differ- 1 (Gemma Team, 2024a), taking into account the
ence in the memorization rates between the 4B, changes in supported modalities. We continue to
12B, and 27B models, with 1B memorizing less believe that openness in AI can spread the bene-
than these larger models. Further, we find that a fits of these technologies across society, but must
larger proportion of text is characterized as ap- be evaluated against the risk of malicious uses
proximately memorized, with a relative increase that can cause harm on both individual and in-
in approximate memorization compared to exact stitutional levels (Weidinger et al., 2021). Since
memorization of roughly 24x on average. the inaugural Gemma launch, we have seen these
models drive a number of socially beneficial ap-
We also study the rate at which the generations plications, such as our own ShieldGemma 2, a 4B
may contain personal information. To identify po- image safety classifier built with Gemma 3, which
tentially personal information, we use the Google provides a ready-made solution for image safety,
Cloud Sensitive Data Protection (SDP) service.2 outputting safety labels across dangerous content,
SDP uses broad detection rules to identify text sexually explicit, and violence categories.
that may contain personal information. SDP is
Releasing Gemma 3 models required specific
2 https://cloud.google.com/sensitive-data-protection attention to changes in model capabilities and

9
Gemma 3 Technical Report

close monitoring of the evolving risks of existing rigorous risk assessment. Our internal safety pro-
multimodal LLMs (Lin et al., 2024), as well as an cesses are designed accordingly, and for previ-
understanding of the ways in which models are ous Gemma models we have also undertaken
being used in the wild. Although we are yet to evaluations of capabilities relevant to extreme
receive any reports of malicious use for Gemma, risks (Phuong et al., 2024; Shevlane et al., 2023).
we remain committed to investigating any such As we continue to develop and share open mod-
reporting, and work with the academic and de- els, we will follow the heuristic that thoroughly
veloper communities, as well as conduct our own evaluating a more capable model often provides
monitoring, to flag such cases. sufficient assurance for less capable ones. As such,
we prioritised a streamlined set of evaluations for
Despite advancements in capabilities, we be-
Gemma 3, reserving in-depth dangerous capabil-
lieve that, given the number of larger powerful
ity assessments for cases where a specific model
open models available, this release will have a
may present a potentially heightened risk (as de-
negligible effect on the overall risk landscape.
scribed below on CBRN evaluations). We balance
development speed with targeted safety testing,
7.2. Safety policies and train-time mitigations ensuring our evaluations are well-focused and
efficient, while upholding the commitments laid
A key pillar of Gemma’s approach to safety is to out in our Frontier Safety Framework.
align fine-tuned models with Google’s safety poli-
cies, in line with Gemini models (Gemini Team,
Baseline Evaluations
2023). They are designed to help prevent our
models from generating harmful content, i.e., Baseline assurance captures the model violation
rate for safety policies, using a large number of
• Child sexual abuse and exploitation synthetic adversarial user queries, and human
• Revealing personally identifiable information raters to label the answers as policy violating or
that can lead to harm (e.g., Social Security not. Overall, Gemma 3 violation rate is signifi-
numbers) cantly low overall on these safety policies.
• Hate speech and harassment
• Dangerous or malicious content (including Chemical, Biological, Radiological and Nuclear
promoting self-harm or instructing in harm- (CBRN) knowledge
ful activities)
Owing to enhanced performance on STEM-
• Sexually explicit content
related tasks, we evaluated knowledge relevant
• Medical advice that runs contrary to scientific
to biological, radiological, and nuclear risks using
or medical consensus
an internal dataset of closed-ended, knowledge-
based multiple choice questions. For evaluations
We undertook considerable safety filtering of our of chemical knowledge, we employed a closed-
pre-training data to reduce the likelihood of our ended knowledge-based approach on chemical
pre-trained and fine-tuned checkpoints producing hazards developed by Macknight et al. Our eval-
harmful content. For fine-tuned models, we also uation suggests that the knowledge of Gemma 3
use both SFT and RLHF to steer the model away models in these domains is low.
from undesirable behavior.

7.4. Our approach to responsible open models


7.3. Assurance Evaluations
Designing safe, secure, and responsible applica-
We also run our IT models through a set of base- tions requires a system-level approach, working
line assurance evaluations to understand the po- to mitigate risks associated with each specific use
tential harms that our models can cause. As we case and environment. We will continue to adopt
champion open models, we also recognize that assessments and safety mitigations proportion-
the irreversible nature of weight releases requires ate to the potential risks from our models, and

10
Gemma 3 Technical Report

will only share these with the community when A. Asai, J. Kasai, J. H. Clark, K. Lee, E. Choi,
we are confident that the benefits significantly and H. Hajishirzi. Xor qa: Cross-lingual open-
outweigh the foreseeable risks. retrieval question answering. arXiv preprint
arXiv:2010.11856, 2020.

8. Discussion and Conclusion J. Austin, A. Odena, M. I. Nye, M. Bosma,


H. Michalewski, D. Dohan, E. Jiang, C. J. Cai,
In this work, we have presented Gemma 3, the M. Terry, Q. V. Le, and C. Sutton. Program
latest addition to the Gemma family of open lan- synthesis with large language models. CoRR,
guage models for text, image, and code. In this abs/2108.07732, 2021.
version, we focus on adding image understanding
and long context while improving multilinguality P. Barham, A. Chowdhery, J. Dean, S. Ghemawat,
and STEM-related abilities. Our model sizes and S. Hand, D. Hurt, M. Isard, H. Lim, R. Pang,
architectures are designed to be compatible with S. Roy, B. Saeta, P. Schuh, R. Sepassi, L. E.
standard hardware, and most of our architecture Shafey, C. A. Thekkath, and Y. Wu. Path-
improvements are tailored to fit this hardware ways: Asynchronous distributed dataflow for
while maintaining performance. ml, 2022.

I. Beltagy, M. E. Peters, and A. Cohan. Long-


References former: The long-document transformer. arXiv
preprint arXiv:2004.05150, 2020.
Realworldqa. https://x.ai/news/grok-1.
5v. S. Biderman, U. Prashanth, L. Sutawika,
H. Schoelkopf, Q. Anthony, S. Purohit, and
M. Acharya, K. Kafle, and C. Kanan. Tallyqa: An- E. Raff. Emergent and predictable memoriza-
swering complex counting questions. In AAAI, tion in large language models. NeurIPS, 36:
2018. 28072–28090, 2023.

R. Agarwal, N. Vieillard, Y. Zhou, P. Stanczyk, S. R. Y. Bisk, R. Zellers, R. L. Bras, J. Gao, and Y. Choi.
Garea, M. Geist, and O. Bachem. On-policy PIQA: reasoning about physical commonsense
distillation of language models: Learning from in natural language. CoRR, abs/1911.11641,
self-generated mistakes. In ICLR, 2024. 2019.

J. Ainslie, J. Lee-Thorp, M. de Jong, Y. Zemlyan- N. Carlini, F. Tramer, E. Wallace, M. Jagielski,


skiy, F. Lebrón, and S. Sanghai. Gqa: Training A. Herbert-Voss, K. Lee, A. Roberts, T. Brown,
generalized multi-query transformer models D. Song, U. Erlingsson, et al. Extracting train-
from multi-head checkpoints. arXiv preprint ing data from large language models. In
arXiv:2305.13245, 2023. USENIX, 2021.

R. Anil, G. Pereyra, A. Passos, R. Ormandi, G. E. N. Carlini, D. Ippolito, M. Jagielski, K. Lee,


Dahl, and G. E. Hinton. Large scale distributed F. Tramer, and C. Zhang. Quantifying memo-
neural network training through online distil- rization across neural language models. arXiv
lation. arXiv preprint arXiv:1804.03235, 2018. preprint arXiv:2202.07646, 2022.

R. Anil, A. M. Dai, O. Firat, M. Johnson, D. Lep- Chameleon Team. Chameleon: Mixed-modal


ikhin, A. Passos, S. Shakeri, E. Taropa, P. Bailey, early-fusion foundation models. arXiv preprint
Z. Chen, et al. Palm 2 technical report. arXiv arXiv:2405.09818, 2024.
preprint arXiv:2305.10403, 2023.
M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P.
M. Artetxe, S. Ruder, and D. Yogatama. On the de Oliveira Pinto, J. Kaplan, H. Edwards,
cross-lingual transferability of monolingual rep- Y. Burda, N. Joseph, G. Brockman, A. Ray,
resentations. In ACL, 2020. R. Puri, G. Krueger, M. Petrov, H. Khlaaf,

11
Gemma 3 Technical Report

G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ry- H. W. Chung, N. Constant, X. Garcia, A. Roberts,


der, M. Pavlov, A. Power, L. Kaiser, M. Bavar- Y. Tay, S. Narang, and O. Firat. Unimax: Fairer
ian, C. Winter, P. Tillet, F. P. Such, D. Cum- and more effective language sampling for large-
mings, M. Plappert, F. Chantzis, E. Barnes, scale multilingual pretraining, 2023.
A. Herbert-Voss, W. H. Guss, A. Nichol, A. Paino,
N. Tezak, J. Tang, I. Babuschkin, S. Balaji, C. Clark, K. Lee, M. Chang, T. Kwiatkowski,
S. Jain, W. Saunders, C. Hesse, A. N. Carr, M. Collins, and K. Toutanova. Boolq: Explor-
J. Leike, J. Achiam, V. Misra, E. Morikawa, ing the surprising difficulty of natural yes/no
A. Radford, M. Knight, M. Brundage, M. Murati, questions. CoRR, abs/1905.10044, 2019.
K. Mayer, P. Welinder, B. McGrew, D. Amodei,
K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen,
S. McCandlish, I. Sutskever, and W. Zaremba.
H. Jun, L. Kaiser, M. Plappert, J. Tworek,
Evaluating large language models trained on
J. Hilton, R. Nakano, C. Hesse, and J. Schul-
code. CoRR, abs/2107.03374, 2021.
man. Training verifiers to solve math word
S. Chen, S. Wong, L. Chen, and Y. Tian. Extend- problems. CoRR, abs/2110.14168, 2021.
ing context window of large language mod-
DeepSeek-AI. Deepseek-r1: Incentivizing reason-
els via positional interpolation. arXiv preprint
ingt learning, 2025.
arXiv:2306.15595, 2023.
M. Dehghani, J. Djolonga, B. Mustafa,
X. Chen, H. Fang, T.-Y. Lin, R. Vedantam, S. Gupta,
P. Padlewski, J. Heek, J. Gilmer, A. P. Steiner,
P. Dollár, and C. L. Zitnick. Microsoft coco
M. Caron, R. Geirhos, I. Alabdulmohsin, et al.
captions: Data collection and evaluation server.
Scaling vision transformers to 22 billion
ArXiv, abs/1504.00325, 2015.
parameters. In ICML, 2023.
W.-L. Chiang, L. Zheng, Y. Sheng, A. N. An-
D. Deutsch, E. Briakou, I. Caswell, M. Finkelstein,
gelopoulos, T. Li, D. Li, H. Zhang, B. Zhu,
R. Galor, J. Juraska, G. Kovacs, A. Lui, R. Rei,
M. Jordan, J. E. Gonzalez, and I. Stoica. Chat-
J. Riesa, S. Rijhwani, P. Riley, E. Salesky, F. Tra-
bot arena: An open platform for evaluating
belsi, S. Winkler, B. Zhang, and M. Freitag.
llms by human preference, 2024.
Wmt24++: Expanding the language coverage
F. Chollet. On the measure of intelligence. arXiv of wmt24 to 55 languages & dialects, 2025.
preprint arXiv:1911.01547, 2019.
A. Dosovitskiy. An image is worth 16x16 words:
A. Chowdhery, S. Narang, J. Devlin, M. Bosma, Transformers for image recognition at scale.
G. Mishra, A. Roberts, P. Barham, H. W. arXiv preprint arXiv:2010.11929, 2020.
Chung, C. Sutton, S. Gehrmann, P. Schuh,
D. Dua, Y. Wang, P. Dasigi, G. Stanovsky, S. Singh,
K. Shi, S. Tsvyashchenko, J. Maynez, A. Rao,
and M. Gardner. DROP: A reading comprehen-
P. Barnes, Y. Tay, N. Shazeer, V. Prabhakaran,
sion benchmark requiring discrete reasoning
E. Reif, N. Du, B. Hutchinson, R. Pope, J. Brad-
over paragraphs. In ACL, 2019.
bury, J. Austin, M. Isard, G. Gur-Ari, P. Yin,
T. Duke, A. Levskaya, S. Ghemawat, S. Dev, B. Fatemi, M. Kazemi, A. Tsitsulin, K. Malkan,
H. Michalewski, X. Garcia, V. Misra, K. Robin- J. Yim, J. Palowitch, S. Seo, J. Halcrow, and
son, L. Fedus, D. Zhou, D. Ippolito, D. Luan, B. Perozzi. Test of time: A benchmark for
H. Lim, B. Zoph, A. Spiridonov, R. Sepassi, evaluating llms on temporal reasoning. arXiv
D. Dohan, S. Agrawal, M. Omernick, A. M. Dai, preprint arXiv:2406.09170, 2024.
T. S. Pillai, M. Pellat, A. Lewkowycz, E. Moreira,
R. Child, O. Polozov, K. Lee, Z. Zhou, X. Wang, X. Fu, Y. Hu, B. Li, Y. Feng, H. Wang, X. Lin,
B. Saeta, M. Diaz, O. Firat, M. Catasta, J. Wei, D. Roth, N. A. Smith, W.-C. Ma, and R. Krishna.
K. Meier-Hellstern, D. Eck, J. Dean, S. Petrov, Blink: Multimodal large language models can
and N. Fiedel. Palm: Scaling language model- see but not perceive. ArXiv, abs/2404.12390,
ing with pathways, 2022. 2024.

12
Gemma 3 Technical Report

J. Gehring, K. Zheng, J. Copet, V. Mella, T. Cohen, G. Hinton, O. Vinyals, and J. Dean. Distilling the
and G. Synnaeve. Rlef: Grounding code llms in knowledge in a neural network. arXiv preprint
execution feedback with reinforcement learn- arXiv:1503.02531, 2015.
ing. arXiv preprint arXiv:2410.02089, 2024.
C.-P. Hsieh, S. Sun, S. Kriman, S. Acharya,
Gemini Team. Gemini: A family of highly capable D. Rekesh, F. Jia, Y. Zhang, and B. Ginsburg.
multimodal models, 2023. Ruler: What’s the real context size of your
long-context language models? arXiv preprint
Gemini Team. Gemini 1.5: Unlocking multimodal arXiv:2404.06654, 2024.
understanding across millions of tokens of con-
text, 2024. D. Ippolito, F. Tramèr, M. Nasr, C. Zhang,
M. Jagielski, K. Lee, C. A. Choquette-Choo, and
Gemma Team. Gemma: Open models based on
N. Carlini. Preventing verbatim memorization
gemini research and technology, 2024a.
in language models gives a false sense of pri-
Gemma Team. Gemma 2: Improving open lan- vacy. arXiv preprint arXiv:2210.17546, 2022.
guage models at a practical size. arXiv preprint
B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang,
arXiv:2408.00118, 2024b.
A. Howard, H. Adam, and D. Kalenichenko.
O. Goldman, U. Shaham, D. Malkin, S. Eiger, Quantization and training of neural networks
A. Hassidim, Y. Matias, J. Maynez, A. M. Gi- for efficient integer-arithmetic-only inference.
lady, J. Riesa, S. Rijhwani, L. Rimell, I. Szpektor, In CVPR, 2018.
R. Tsarfaty, and M. Eyal. Eclektic: a novel chal-
lenge set for evaluation of cross-lingual knowl- M. Joshi, E. Choi, D. S. Weld, and L. Zettlemoyer.
edge transfer, 2025. Triviaqa: A large scale distantly supervised
challenge dataset for reading comprehension.
N. Goyal, C. Gao, V. Chaudhary, P.-J. Chen, CoRR, abs/1705.03551, 2017.
G. Wenzek, D. Ju, S. Krishnan, M. Ranzato,
F. Guzmán, and A. Fan. The flores-101 evalua- M. Kazemi, H. Alvari, A. Anand, J. Wu, X. Chen,
tion benchmark for low-resource and multilin- and R. Soricut. Geomverse: A systematic eval-
gual machine translation. ACL, 2022. uation of large models for geometric reasoning.
arXiv preprint arXiv:2312.12241, 2023.
Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and
D. Parikh. Making the V in VQA matter: Elevat- M. Kazemi, N. Dikkala, A. Anand, P. Dević, I. Das-
ing the role of image understanding in Visual gupta, F. Liu, B. Fatemi, P. Awasthi, D. Guo,
Question Answering. In CVPR, 2017. S. Gollapudi, and A. Qureshi. Remi: A dataset
for reasoning with multiple images. ArXiv,
D. Hendrycks, C. Burns, S. Basart, A. Zou, abs/2406.09175, 2024a.
M. Mazeika, D. Song, and J. Steinhardt. Mea-
suring massive multitask language understand- M. Kazemi, Q. Yuan, D. Bhatia, N. Kim,
ing. CoRR, abs/2009.03300, 2020. X. Xu, V. Imbrasaite, and D. Ramachandran.
Boardgameqa: A dataset for natural lan-
D. Hendrycks, C. Burns, S. Kadavath, A. Arora, guage reasoning with contradictory informa-
S. Basart, E. Tang, D. Song, and J. Steinhardt. tion. NeurIPS, 36, 2024b.
Measuring mathematical problem solving with
the math dataset. NeurIPS, 2021. M. Kazemi, B. Fatemi, H. Bansal, J. Palowitch,
C. Anastasiou, S. V. Mehta, L. K. Jain, V. Aglietti,
J. Hessel, A. Marasović, J. D. Hwang, L. Lee, J. Da, D. Jindal, P. Chen, et al. Big-bench extra hard.
R. Zellers, R. Mankoff, and Y. Choi. Do an- arXiv preprint arXiv:2502.19187, 2025.
droids laugh at electric sheep? humor" under-
standing" benchmarks from the new yorker cap- A. Kembhavi, M. Salvato, E. Kolve, M. Seo, H. Ha-
tion contest. arXiv preprint arXiv:2209.06293, jishirzi, and A. Farhadi. A diagram is worth a
2022. dozen images. ArXiv, abs/1603.07396, 2016.

13
Gemma 3 Technical Report

E. Kıcıman, R. Ness, A. Sharma, and C. Tan. M. Mathew, V. Bagal, R. Tito, D. Karatzas, E. Val-
Causal reasoning and large language models: veny, and C. Jawahar. Infographicvqa. In WACV,
Opening a new frontier for causality. arXiv 2022.
preprint arXiv:2305.00050, 2023.
I. Mirzadeh, K. Alizadeh, H. Shahrokhi, O. Tuzel,
T. Kudo and J. Richardson. SentencePiece: A S. Bengio, and M. Farajtabar. Gsm-symbolic:
simple and language independent subword to- Understanding the limitations of mathemati-
kenizer and detokenizer for neural text pro- cal reasoning in large language models. arXiv
cessing. 2018. preprint arXiv:2410.05229, 2024.

T. Kwiatkowski, J. Palomaki, O. Redfield, M. Nasr, N. Carlini, J. Hayase, M. Jagielski, A. F.


M. Collins, A. Parikh, C. Alberti, D. Epstein, Cooper, D. Ippolito, C. A. Choquette-Choo,
I. Polosukhin, J. Devlin, K. Lee, K. Toutanova, E. Wallace, F. Tramèr, and K. Lee. Scal-
L. Jones, M. Kelcey, M.-W. Chang, A. M. Dai, able extraction of training data from (pro-
J. Uszkoreit, Q. Le, and S. Petrov. Natural ques- duction) language models. arXiv preprint
tions: A benchmark for question answering re- arXiv:2311.17035, 2023.
search. ACL, 2019.
A. Nie, Y. Zhang, A. S. Amdekar, C. Piech, T. B.
N. Lambert, J. Morrison, V. Pyatkin, S. Huang, Hashimoto, and T. Gerstenberg. Moca: Mea-
H. Ivison, F. Brahman, L. J. V. Miranda, A. Liu, suring human-language model alignment on
N. Dziri, S. Lyu, et al. T\" ulu 3: Pushing causal and moral judgment tasks. NeurIPS, 36,
frontiers in open language model post-training. 2024.
arXiv preprint arXiv:2411.15124, 2024.
R. Paiss, A. Ephrat, O. Tov, S. Zada, I. Mosseri,
Z. Lin, J. Cui, X. Liao, and X. Wang. Malla: De- M. Irani, and T. Dekel. Teaching clip to count
mystifying real-world large language model to ten. ICCV, 2023.
integrated malicious services, 2024.
M. Phuong, M. Aitchison, E. Catt, S. Co-
H. Liu, C. Li, Q. Wu, and Y. J. Lee. Visual instruc- gan, A. Kaskasoli, V. Krakovna, D. Lindner,
tion tuning. NeurIPS, 36, 2024. M. Rahtz, Y. Assael, S. Hodkinson, H. Howard,
T. Lieberum, R. Kumar, M. A. Raad, A. Webson,
LLaMa Team. The llama 3 herd of models. arXiv L. Ho, S. Lin, S. Farquhar, M. Hutter, G. Dele-
preprint arXiv:2407.21783, 2024. tang, A. Ruoss, S. El-Sayed, S. Brown, A. Dra-
M. Luong, H. Pham, and C. D. Manning. Effective gan, R. Shah, A. Dafoe, and T. Shevlane. Evalu-
approaches to attention-based neural machine ating frontier models for dangerous capabilities,
translation. 2015. 2024.

Macknight, Aung, and Gomes. Personal Commu- A. Radford, J. W. Kim, C. Hallacy, A. Ramesh,
nication. G. Goh, S. Agarwal, G. Sastry, A. Askell,
P. Mishkin, J. Clark, et al. Learning transferable
K. Marino, M. Rastegari, A. Farhadi, and R. Mot- visual models from natural language supervi-
taghi. Ok-vqa: A visual question answering sion. In ICML, pages 8748–8763. PMLR, 2021.
benchmark requiring external knowledge. In
CVPR, 2019. A. Ramé, J. Ferret, N. Vieillard, R. Dadashi,
L. Hussenot, P.-L. Cedoz, P. G. Sessa, S. Girgin,
A. Masry, X. L. Do, J. Q. Tan, S. Joty, and E. Hoque. A. Douillard, and O. Bachem. WARP: On the
ChartQA: A benchmark for question answering benefits of weight averaged rewarded policies,
about charts with visual and logical reasoning. 2024a.
ACL, 2022.
A. Ramé, N. Vieillard, L. Hussenot, R. Dadashi,
M. Mathew, D. Karatzas, R. Manmatha, and C. V. G. Cideron, O. Bachem, and J. Ferret. WARM:
Jawahar. Docvqa: A dataset for vqa on docu- On the benefits of weight averaged reward mod-
ment images. WACV, 2020. els. In ICML, 2024b.

14
Gemma 3 Technical Report

D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. V. Bolina, J. Clark, Y. Bengio, P. Christiano, and


Pang, J. Dirani, J. Michael, and S. R. Bow- A. Dafoe. Model evaluation for extreme risks,
man. Gpqa: A graduate-level google-proof q&a 2023.
benchmark. ArXiv, abs/2311.12022, 2023.
F. Shi, M. Suzgun, M. Freitag, X. Wang, S. Sri-
J. Ren, S. Rajbhandari, R. Y. Aminabadi, vats, S. Vosoughi, H. W. Chung, Y. Tay, S. Ruder,
O. Ruwase, S. Yang, M. Zhang, D. Li, and D. Zhou, D. Das, and J. Wei. Language models
Y. He. Zero-offload: Democratizing billion- are multilingual chain-of-thought reasoners. In
scale model training. In USENIX, 2021. ICLR, 2023.
A. Roberts, H. W. Chung, G. Mishra, A. Levskaya, A. Singh, V. Natarjan, M. Shah, Y. Jiang, X. Chen,
J. Bradbury, D. Andor, S. Narang, B. Lester, D. Parikh, and M. Rohrbach. Towards vqa mod-
C. Gaffney, A. Mohiuddin, et al. Scaling up els that can read. In CVPR, 2019.
models and data with t5x and seqio. JMLR,
2023. H. Singh, N. Gupta, S. Bharadwaj, D. Tewari,
and P. Talukdar. Indicgenbench: a multilin-
N. Sachdeva, B. Coleman, W.-C. Kang, J. Ni, gual benchmark to evaluate generation capabil-
L. Hong, E. H. Chi, J. Caverlee, J. McAuley, and ities of llms on indic languages. arXiv preprint
D. Z. Cheng. How to train data-efficient llms. arXiv:2404.16816, 2024a.
arXiv preprint arXiv:2402.09668, 2024.
S. Singh, A. Romanou, C. Fourrier, D. I. Adelani,
K. Sakaguchi, R. L. Bras, C. Bhagavatula, and J. G. Ngui, D. Vila-Suero, P. Limkonchotiwat,
Y. Choi. WINOGRANDE: an adversarial K. Marchisio, W. Q. Leong, Y. Susanto, R. Ng,
winograd schema challenge at scale. CoRR, S. Longpre, W.-Y. Ko, M. Smith, A. Bosselut,
abs/1907.10641, 2019. A. Oh, A. F. T. Martins, L. Choshen, D. Ippolito,
E. Sánchez, B. Alastruey, C. Ropers, P. Stenetorp, E. Ferrante, M. Fadaee, B. Ermis, and S. Hooker.
M. Artetxe, and M. R. Costa-jussà. Linguini: Global mmlu: Understanding and addressing
A benchmark for language-agnostic linguistic cultural and linguistic biases in multilingual
reasoning. arXiv preprint arXiv:2409.12126, evaluation, 2024b.
2024. A. Steiner, A. S. Pinto, M. Tschannen, D. Key-
M. Sap, H. Rashkin, D. Chen, R. L. Bras, sers, X. Wang, Y. Bitton, A. Gritsenko, M. Min-
and Y. Choi. Socialiqa: Commonsense derer, A. Sherbondy, S. Long, S. Qin, R. In-
reasoning about social interactions. CoRR, gle, E. Bugliarello, S. Kazemzadeh, T. Mes-
abs/1904.09728, 2019. nard, I. Alabdulmohsin, L. Beyer, and X. Zhai.
PaliGemma 2: A Family of Versatile VLMs
P. G. Sessa, R. Dadashi, L. Hussenot, J. Ferret, for Transfer. arXiv preprint arXiv:2412.03555,
N. Vieillard, A. Ramé, B. Shariari, S. Perrin, 2024.
A. Friesen, G. Cideron, S. Girgin, P. Stanczyk,
A. Michi, D. Sinopalnikov, S. Ramos, A. Héliou, M. Suzgun, N. Scales, N. Schärli, S. Gehrmann,
A. Severyn, M. Hoffman, N. Momchev, and Y. Tay, H. W. Chung, A. Chowdhery, Q. V. Le,
O. Bachem. Bond: Aligning llms with best-of-n E. H. Chi, D. Zhou, and J. Wei. Challenging
distillation, 2024. big-bench tasks and whether chain-of-thought
can solve them, 2022.
K. Shah, N. Dikkala, X. Wang, and R. Panigrahy.
Causal language modeling can elicit search and G. Tyen, H. Mansoor, P. Chen, T. Mak, and
reasoning capabilities on logic puzzles. arXiv V. Cărbune. Llms cannot find reasoning er-
preprint arXiv:2409.10502, 2024. rors, but can correct them! arXiv preprint
arXiv:2311.08516, 2023.
T. Shevlane, S. Farquhar, B. Garfinkel, M. Phuong,
J. Whittlestone, J. Leung, D. Kokotajlo, N. Mar- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit,
chal, M. Anderljung, N. Kolt, L. Ho, D. Sid- L. Jones, A. N. Gomez, L. Kaiser, and I. Polo-
darth, S. Avin, W. Hawkins, B. Kim, I. Gabriel, sukhin. Attention is all you need. 2017.

15
Gemma 3 Technical Report

K. Vodrahalli, S. Ontanon, N. Tripuraneni, K. Xu, benchmark for spatial relation recognition.


S. Jain, R. Shivanna, J. Hui, N. Dikkala, ICCV, 2019.
M. Kazemi, B. Fatemi, et al. Michelangelo:
Long context evaluations beyond haystacks X. Yue, Y. Ni, K. Zhang, T. Zheng, R. Liu, G. Zhang,
via latent structure queries. arXiv preprint S. Stevens, D. Jiang, W. Ren, Y. Sun, C. Wei,
arXiv:2409.12640, 2024. B. Yu, R. Yuan, R. Sun, M. Yin, B. Zheng,
Z. Yang, Y. Liu, W. Huang, H. Sun, Y. Su,
Y. Wang, X. Ma, G. Zhang, Y. Ni, A. Chandra, and W. Chen. Mmmu: A massive multi-
S. Guo, W. Ren, A. Arulraj, X. He, Z. Jiang, discipline multimodal understanding and rea-
et al. Mmlu-pro: A more robust and challenging soning benchmark for expert agi. CVPR, 2023.
multi-task language understanding benchmark.
In NeurIPS, 2024. R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and
Y. Choi. HellaSwag: Can a machine really finish
L. Weidinger, J. Mellor, M. Rauh, C. Griffin, your sentence? In ACL, 2019.
J. Uesato, P.-S. Huang, M. Cheng, M. Glaese,
X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer.
B. Balle, A. Kasirzadeh, Z. Kenton, S. Brown,
Sigmoid loss for language image pre-training.
W. Hawkins, T. Stepleton, C. Biles, A. Birhane,
In CVPR, 2023.
J. Haas, L. Rimell, L. A. Hendricks, W. Isaac,
S. Legassick, G. Irving, and I. Gabriel. Ethical B. Zhang and R. Sennrich. Root mean square
and social risks of harm from language models, layer normalization. 2019.
2021.
J. Zhang, L. Jain, Y. Guo, J. Chen, K. L. Zhou,
C. White, S. Dooley, M. Roberts, A. Pal, B. Feuer, S. Suresh, A. Wagenmaker, S. Sievert, T. Rogers,
S. Jain, R. Shwartz-Ziv, N. Jain, K. Saiful- K. Jamieson, et al. Humor in ai: Massive
lah, S. Naidu, et al. Livebench: A challeng- scale crowd-sourced preferences and bench-
ing, contamination-free llm benchmark. arXiv marks for cartoon captioning. arXiv preprint
preprint arXiv:2406.19314, 2024. arXiv:2406.10522, 2024.
M. Wortsman, P. J. Liu, L. Xiao, K. Everett, W. Zhong, R. Cui, Y. Guo, Y. Liang, S. Lu, Y. Wang,
A. Alemi, B. Adlam, J. D. Co-Reyes, I. Gur, A. Ku- A. Saied, W. Chen, and N. Duan. Agieval: A
mar, R. Novak, et al. Small-scale proxies for human-centric benchmark for evaluating foun-
large-scale transformer training instabilities. dation models, 2023.
arXiv preprint arXiv:2309.14322, 2023.

XLA. Xla: Optimizing compiler for tensor-


flow, 2019. URL https://www.tensorflow.
org/xla.
Y. Xu, H. Lee, D. Chen, B. A. Hechtman, Y. Huang,
R. Joshi, M. Krikun, D. Lepikhin, A. Ly, M. Mag-
gioni, R. Pang, N. Shazeer, S. Wang, T. Wang,
Y. Wu, and Z. Chen. GSPMD: general and scal-
able parallelization for ML computation graphs.
2021.

Y. Yamada, Y. Bao, A. K. Lampinen, J. Kasai,


and I. Yildirim. Evaluating spatial understand-
ing of large language models. arXiv preprint
arXiv:2310.14540, 2023.

K. Yang, O. Russakovsky, and J. Deng. Spa-


tialsense: An adversarially crowdsourced

16
Gemma 3 Technical Report

Core contributors
Aishwarya Kamath∗
Johan Ferret∗
Shreya Pathak∗
Nino Vieillard∗
Ramona Merhej∗
Sarah Perrin∗
Tatiana Matejovicova∗
Alexandre Ramé∗
Morgane Rivière∗
Louis Rouillard∗
Thomas Mesnard∗
Geoffrey Cideron∗
Jean-bastien Grill∗
Sabela Ramos∗
Edouard Yvinec∗
Michelle Casbon∗
Etienne Pot
Ivo Penchev
Gaël Liu
Francesco Visin
Kathleen Kenealy
Lucas Beyer
Xiaohai Zhai
Anton Tsitsulin
Robert Busa-Fekete
Alex Feng
Noveen Sachdeva
Benjamin Coleman
Yi Gao
Basil Mustafa
Iain Barr
Emilio Parisotto
David Tian
Matan Eyal
Colin Cherry
Jan-Thorsten Peter
Danila Sinopalnikov
Surya Bhupatiraju
Rishabh Agarwal
Mehran Kazemi
Dan Malkin
David Vilar
Idan Brusilovsky
Jiaming Luo
Andreas Steiner

∗ co-first authors.

Contributors (alphabetical order)
Abe Friesen
Abhanshu Sharma
Abheesht Sharma
Adi Mayrav Gilady
Adrian Goedeckemeyer
Alex Feng
Alexander Kolesnikov
Alexei Bendebury
Alvin Abdagic
Amit Vadi
André Susano Pinto
Anil Das
Ankur Bapna
Antoine Miech
Antoine Yang
Antonia Paterson
Ashish Shenoy
Ayan Chakrabarti
Bilal Piot
Bo Wu
Bobak Shahriari
Bryce Petrini
Charlie Chen
Charline Le Lan
Christopher A. Choquette-Choo
CJ Carey
Cormac Brick
Daniel Deutsch
Danielle Eisenbud
Dee Cattle
Derek Cheng
Dimitris Paparas
Divyashree Shivakumar Sreepathihalli
Doug Reid
Dustin Tran
Dustin Zelle
Eric Noland
Erwin Huizenga
Eugene Kharitonov
Frederick Liu
Gagik Amirkhanyan
Glenn Cameron
Hadi Hashemi
Hanna Klimczak-Plucińska
Harman Singh
Harsh Mehta
Harshal Tushar Lehri
Hussein Hazimeh


Ian Ballantyne
Idan Szpektor
Ivan Nardini
Jean Pouget-Abadie
Jetha Chan
Joe Stanton
John Wieting
Jonathan Lai
Jordi Orbay
Joseph Fernandez
Josh Newlan
Ju-yeong Ji
Jyotinder Singh
Kat Black
Kathy Yu
Kevin Hui
Kiran Vodrahalli
Klaus Greff
Linhai Qiu
Marcella Valentine
Marina Coelho
Marvin Ritter
Matt Hoffman
Matthew Watson
Mayank Chaturvedi
Michael Moynihan
Min Ma
Nabila Babar
Natasha Noy
Nathan Byrd
Nick Roy
Nikola Momchev
Nilay Chauhan
Noveen Sachdeva
Oskar Bunyan
Pankil Botarda
Paul Kishan Rubenstein
Phil Culliton
Philipp Schmid
Pier Giuseppe Sessa
Pingmei Xu
Piotr Stanczyk
Pouya Tafti
Rakesh Shivanna
Ravin Kumar
Renjie Wu
Renke Pan
Reza Rokni
Rob Willoughby
Rohith Vallu
Ryan Mullins
Sammy Jerome
Sara Smoot
Sertan Girgin
Shariq Iqbal
Shashir Reddy
Shruti Sheth
Siim Põder
Sijal Bhatnagar
Sindhu Raghuram Panyam
Sivan Eiger
Susan Zhang
Tianqi Liu
Trevor Yacovone
Tyler Liechty
Uday Kalra
Utku Evci
Vedant Misra
Vincent Roseberry
Vlad Feinberg
Vlad Kolesnikov
Woohyun Han
Woosuk Kwon
Yinlam Chow
Zichuan Wei
Zoltan Egyed

Support
Victor Cotruta
Minh Giang
Phoebe Kirk
Anand Rao
Kat Black
Nabila Babar
Jessica Lo
Erica Moreira
Luiz Gustavo Martins
Omar Sanseviero
Lucas Gonzalez
Zach Gleicher
Tris Warkentin

Sponsors
Vahab Mirrokni
Evan Senter
Eli Collins
Joelle Barral


Zoubin Ghahramani
Raia Hadsell
D. Sculley
Slav Petrov
Noah Fiedel
Noam Shazeer
Oriol Vinyals
Jeff Dean
Demis Hassabis
Koray Kavukcuoglu
Clement Farabet

Technical advisors
Elena Buchatskaya
Jean-Baptiste Alayrac
Rohan Anil
Dmitry (Dima) Lepikhin
Sebastian Borgeaud
Olivier Bachem

Lead
Armand Joulin

Technical leads
Alek Andreev
Cassidy Hardin
Robert Dadashi
Léonard Hussenot


Appendix

Details of pre-trained performances.

Factuality and common-sense. In Table 9, we report the performance of our new pre-trained models on standard benchmarks, compared to previous versions. We consider several standard benchmarks, namely HellaSwag (Zellers et al., 2019), BoolQ (Clark et al., 2019), PIQA (Bisk et al., 2019), SIQA (Sap et al., 2019), TriviaQA (Joshi et al., 2017), Natural Questions (Kwiatkowski et al., 2019), ARC-C and ARC-E (Chollet, 2019), WinoGrande (Sakaguchi et al., 2019), BBH (Suzgun et al., 2022), and DROP (Dua et al., 2019). Evaluation details are described in Table 19. Overall, our models are in the same ballpark as Gemma 2, which is encouraging since these abilities are not the focus of the improvements brought in this version.

         Gemma 2             Gemma 3
         2B    9B    27B     1B    4B    12B   27B
HellaS   72.9  81.9  86.4    62.3  77.2  84.2  85.6
BoolQ    75.6  77.5  76.2    63.2  72.3  78.8  82.4
PIQA     78.1  81.9  83.5    73.8  79.6  81.8  83.3
SIQA     51.8  53.3  53.8    48.9  51.9  53.4  54.9
TQA      60.2  76.5  83.8    39.8  65.8  78.2  85.5
NQ       17.2  29.2  34.7    9.48  20.0  31.4  36.1
ARC-C    55.8  69.1  71.4    38.4  56.2  68.9  70.6
ARC-E    80.6  88.3  88.6    73.0  82.4  88.3  89.0
WinoG    65.4  73.9  79.4    58.2  64.7  74.3  78.8
BBH      42.4  69.4  74.8    28.4  50.9  72.6  77.7
Drop     53.2  71.5  75.2    42.4  60.1  72.2  77.2

Table 9 | Factuality, common-sense performance and reasoning after pre-training phase.

STEM and code. The details of our performance on STEM and code are in Table 10. We consider several standard benchmarks, namely MMLU (Hendrycks et al., 2020), MMLU-Pro (Wang et al., 2024), AGIEval (Zhong et al., 2023), MATH (Hendrycks et al., 2021), GSM8K (Cobbe et al., 2021), GPQA (Rein et al., 2023), MBPP (Austin et al., 2021), and HumanEval (Chen et al., 2021). Evaluation details are described in Table 19. Overall, we see a consistent improvement in STEM abilities across our pre-trained models. On code, we see a similar improvement for the 4B and 12B models, but not for the 27B.

          Gemma 2             Gemma 3
          2B    9B    27B     4B    12B   27B
MMLU      52.2  71.2  75.2    59.6  74.5  78.6
MMLUpro   22.2  43.7  49.4    29.2  45.3  52.2
AGIE      31.6  53.1  55.1    42.1  57.4  66.2
MATH      16.4  36.4  42.1    24.2  43.3  50.0
GSM8K     25.0  70.2  74.6    38.4  71.0  82.6
GPQA      12.5  24.8  26.3    15.0  25.4  24.3
MBPP      31.0  51.2  60.8    46.0  60.4  65.6
HumanE    19.5  40.2  51.2    36.0  45.7  48.8

Table 10 | STEM and code performance after pre-training phase.

                   4B    12B   27B
COCO caption       102   111   116
DocVQA             72.8  82.3  85.6
InfoVQA            44.1  54.8  59.4
MMMU               39.2  50.3  56.1
TextVQA            58.9  66.5  68.6
RealWorldQA        45.5  52.2  53.9
ReMI               27.3  38.5  44.8
AI2D               63.2  75.2  79.0
ChartQA            63.6  74.7  76.3
VQAv2              63.9  71.2  72.9
BLINK              38.0  35.9  39.6
OK-VQA             51.0  58.7  60.2
TallyQA            42.5  51.8  54.3
SpatialSense VQA   50.9  60.0  59.4
CountBench VQA     26.1  17.8  68.0

Table 11 | Multimodal performance after pre-training phase. The scores are on the val split of each dataset, without P&S.


Image understanding. In Table 11, we report performance across a variety of visual question answering benchmarks for the models that were trained with a vision encoder, namely COCO Caption (Chen et al., 2015), DocVQA (Mathew et al., 2020), InfographicVQA (Mathew et al., 2022), MMMU (Yue et al., 2023), TextVQA (Singh et al., 2019), RealWorldQA (Rea), ReMI (Kazemi et al., 2024a), AI2D (Kembhavi et al., 2016), ChartQA (Masry et al., 2022), VQA v2 (Goyal et al., 2017), BLINK (Fu et al., 2024), OK-VQA (Marino et al., 2019), TallyQA (Acharya et al., 2018), SpatialSense VQA (Yang et al., 2019), and CountBench VQA (Paiss et al., 2023). Evaluation details are described in Table 20.

              PaliGemma 2         Gemma 3
              2B    9B    27B     4B    12B   27B
DocVQA        81.6  86.3  85.1    86.1  89.0  89.5
InfoVQA       41.4  53.1  50.2    55.6  61.6  64.6
TextVQA       76.3  76.3  75.1    79.1  81.6  83.2
ChartQA       70.7  79.1  71.3    79.8  83.5  83.4
AI2D          76.0  84.4  84.6    80.9  85.6  86.5
OKVQA         64.1  68.6  70.6    65.2  69.3  71.1
CountBenchQA  82.0  85.3  87.4    79.4  83.5  87.8
COCO caption  143.  145.  145.    143.  143.  144.
VQAv2         84.8  85.8  85.8    84.1  84.9  85.1
Tally QA      80.6  82.4  82.1    79.0  81.3  81.7

Table 12 | Performance of pre-trained checkpoints after fine-tuning on multi-modal benchmarks (without P&S). PaliGemma 2 was transferred at 896x896 resolution for the first four benchmarks, and at 448x448 resolution for the others.

Comparison to PaliGemma 2. We fine-tune multimodal Gemma 3 pre-trained checkpoints following the protocol from Steiner et al. (2024): only the learning rate is swept; otherwise, the same transfer settings are used. The results in Table 12 show that Gemma 3 excels at benchmarks involving document understanding, even outperforming the larger PaliGemma 2 variant. Note that, due to average pooling in the vision encoder, the Gemma 3 4B and 12B models are about 10x cheaper to transfer compared with the PaliGemma 2 9B and 27B models at the same 896x896 resolution. Gemma 3 also performs better on AI2D and OKVQA, but PaliGemma 2 performs slightly better on VQAv2 and COCO caption.

Multilinguality. In Table 13 we report the performance of the pre-trained models on multilingual tasks. We apply in-context learning with multi-shot prompting and present results on the following benchmarks: MGSM (Shi et al., 2023), Global-MMLU-Lite (Singh et al., 2024b), WMT24++ (Deutsch et al., 2025), FLoRes (Goyal et al., 2022), XQuAD (Artetxe et al., 2020), ECLeKTic (Goldman et al., 2025), IndicGenBench (Singh et al., 2024a), and XOR QA (Asai et al., 2020). Evaluation details are described in Table 19.

           Gemma 2             Gemma 3
           2B    9B    27B     1B    4B    12B   27B
MGSM       18.7  57.3  68.0    2.04  34.7  64.3  74.3
GMMLU      43.3  64.0  69.4    24.9  57.0  69.4  75.7
WMT24++    38.8  50.3  53.0    36.7  48.4  53.9  55.7
Flores     30.2  41.3  44.3    29.5  39.2  46.0  48.8
XQuAD      53.7  72.2  73.9    43.9  68.0  74.5  76.8
ECLeKTic   8.29  14.0  17.1    4.69  11.0  17.2  24.4
IndicGB    47.4  59.3  62.1    41.4  57.2  61.7  63.4

Table 13 | Multilingual performance after the pre-training phase. IndicGenBench is an average over the benchmarks reported in Table 14.

              Gemma 2             Gemma 3
              2B    9B    27B     1B    4B    12B   27B
XQuAD Indic   54.3  73.1  74.9    43.1  68.3  75.2  77.8
XORQA in-en   66.2  69.3  72.5    56.3  68.3  69.8  70.4
XORQA in-xx   31.2  40.8  44.3    27.1  39.8  43.8  46.0
Flores Indic  38.1  54.0  56.9    39.0  52.3  58.0  59.5

Table 14 | Detailed IndicGenBench performance after the pre-training phase.

Long context. In Table 15 we report the performance of pre-trained and fine-tuned models on long-context benchmarks. We include the RULER (Hsieh et al., 2024) and MRCR (Vodrahalli et al., 2024) benchmarks, evaluated at 32K and 128K sequence lengths.


             Gemma 3 PT          Gemma 3 IT
Context      4B    12B   27B     4B    12B   27B
RULER 32K    67.1  90.6  85.9    61.4  80.3  91.1
RULER 128K   51.7  80.7  72.9    46.8  57.1  66.0
MRCR 32K     44.7  59.8  63.2    49.8  53.7  63.2
MRCR 128K    40.6  56.9  60.0    44.6  49.8  59.3

Table 15 | Performance of pre-trained (PT) and instruction fine-tuned (IT) models on long-context benchmarks at different context lengths.

8.1. Performance of IT models

We report additional benchmarks on our IT models in Table 18. Note that N2C refers to Natural2Code, the Gemini 1.0 internal held-out dataset, which uses author-generated sources instead of web-based information. BBEH refers to BIG-Bench Extra Hard (Kazemi et al., 2025), a challenging LLM reasoning benchmark that aggregates several reasoning tasks (Fatemi et al., 2024; Hessel et al., 2022; Kazemi et al., 2023, 2024b; Kıcıman et al., 2023; Nie et al., 2024; Sánchez et al., 2024; Shah et al., 2024; Tyen et al., 2023; White et al., 2024; Yamada et al., 2023; Zhang et al., 2024). ECLeKTic refers to Goldman et al. (2025); we report the micro average score. More evaluation details are described in Table 21.

8.2. Performance of IT models on video understanding

Additional multimodal evaluations. Gemma 3 IT models were evaluated on common vision benchmarks following the evaluation protocol of Gemini 1.5 (Gemini Team, 2024). The results are given in Table 16, with P&S activated.

                       4B    12B   27B
MMMU (val)             48.8  59.6  64.9
DocVQA                 75.8  87.1  86.6
InfoVQA                50.0  64.9  70.6
TextVQA                57.8  67.7  65.1
AI2D                   74.8  84.2  84.5
ChartQA                68.8  75.7  78.0
VQAv2 (val)            62.4  71.6  71.0
MathVista (testmini)   50.0  62.9  67.6

Table 16 | Performance of instruction fine-tuned (IT) models on multimodal benchmarks. Unless otherwise mentioned, these results are on the final test set of each dataset, with P&S applied.

                        4B    12B   27B
Perception Test MCVQA   50.6  54.9  58.1
ActivityNet-QA          46.3  50.4  52.8

Table 17 | Performance of instruction fine-tuned (IT) models on vision understanding benchmarks, using 0-shot prompting with 16 frames sampled via linspace. Perception Test consists of real-world videos designed to show perceptually interesting situations; we report top-1 accuracy on its multiple-choice video QA benchmark. For ActivityNet-QA, we report the standard GPT-based evaluation.
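Table 17 above samples 16 frames per video via linspace. As a rough illustration only (the exact frame-selection code used for Gemma 3 is not specified here), a minimal Python sketch of such uniform index sampling could look like:

import numpy as np

def sample_frame_indices(num_video_frames: int, num_samples: int = 16) -> list[int]:
    # np.linspace(0, N - 1, num_samples) gives evenly spaced positions including
    # both endpoints; rounding to integers yields the frame indices to decode.
    return np.linspace(0, num_video_frames - 1, num_samples).round().astype(int).tolist()

# Example: a 10-second clip at 30 fps (300 frames) yields 16 roughly evenly spaced indices.
print(sample_frame_indices(300))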


Gemma 2 Gemma 3
2B 9B 27B 1B 4B 12B 27B
MMLU 56.1 71.3 76.2 38.8 58.1 71.9 76.9
MBPP 36.6 59.2 67.4 35.2 63.2 73.0 74.4
HumanEval 20.1 40.2 51.8 41.5 71.3 85.4 87.8
N2C 46.8 68.3 77.3 56.0 70.3 80.7 84.5
LiveCodeBench 7.0 20.0 29.0 5.0 23.0 32.0 39.0
GSM8K 62.6 88.1 91.1 62.8 89.2 94.4 95.9
MATH 27.2 49.4 55.6 48.0 75.6 83.8 89.0
HiddenMath 2.0 8.0 12.0 15.0 42.0 51.0 56.0
BBH 41.4 69.0 74.9 39.1 72.2 85.7 87.6
BBEH 5.9 9.8 14.8 7.2 11.0 16.3 19.3
IFEval 80.4 88.4 91.1 80.2 90.2 88.9 90.4
GMMLU-Lite 41.9 64.8 68.6 34.2 54.5 69.5 75.1
ECLeKTic 5.3 11.8 17.6 1.4 4.6 10.3 16.7
WMT24++ 37.4 48.7 51.7 35.9 46.8 51.6 53.4

Table 18 | Performance of instruction fine-tuned (IT) models of different sizes on more internal and
external benchmarks.


Evaluation Metric Type n-shot COT Norm


MBPP pass@1 sampling 3-shot
HumanEval pass@1 sampling 0-shot
HellaSwag Accuracy scoring 10-shot Char-Len
BoolQ Accuracy scoring 0-shot Char-Len
PIQA Accuracy scoring 0-shot Char-Len
SIQA Accuracy scoring 0-shot Char-Len
TriviaQA Accuracy sampling 5-shot
Natural Questions Accuracy sampling 5-shot
ARC-C Accuracy scoring 25-shot Char-Len
ARC-E Accuracy scoring 0-shot Char-Len
WinoGrande Accuracy scoring 5-shot Char-Len
BBH Accuracy sampling few-shot Yes
DROP Token F1 score sampling 1-shot
AGIEval Accuracy sampling 3-5-shot
MMLU Accuracy scoring 5-shot Char-Len
MATH Accuracy sampling 4-shot Yes
GSM8K Accuracy sampling 8-shot Yes
GPQA Accuracy sampling 5-shot Yes
MMLU-Pro Accuracy sampling 5-shot Yes
MGSM Accuracy sampling 8-shot
FLoRes CHaRacter-level F-score sampling 1-shot
Global-MMLU-Lite Accuracy scoring 5-shot Char-Len
XQuAD CHaRacter-level F-score sampling 5-shot
WMT24++ CHaRacter-level F-score sampling 5-shot
ECLeKTic ECLeKTic score sampling 2-shot First-line/strip
XQuAD Indic CHaRacter-level F-score sampling 5-shot
XOR QA IN-EN CHaRacter-level F-score sampling 5-shot
XOR QA IN-XX CHaRacter-level F-score sampling 5-shot
FLoRes Indic CHaRacter-level F-score sampling 5-shot
RULER Accuracy sampling 0-shot
MRCR MRCR score sampling few-shot

Table 19 | Details on text benchmarks. Char-Len stands for Character Length Normalization and COT
stands for Chain-Of-Thought prompting.
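Several scoring-type benchmarks in Table 19 use character-length normalization (Char-Len). As a sketch of what such normalization typically means for multiple-choice scoring (the log_likelihood helper below is hypothetical, not an actual Gemma API), one could pick the answer with the highest length-normalized log-likelihood:

def pick_answer(log_likelihood, prompt: str, choices: list[str]) -> int:
    # Score each candidate continuation under the model and divide by its character
    # length; this removes the bias toward shorter answers before taking the argmax.
    scores = [log_likelihood(prompt, choice) / max(len(choice), 1) for choice in choices]
    return max(range(len(choices)), key=lambda i: scores[i])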


Evaluation Metric Type n-shot


COCO Caption Cider score sampling 4-shot
DocVQA ANLS score sampling 4-shot
InfographicVQA ANLS score sampling 4-shot
MMMU Accuracy sampling 3-shot text only
TextVQA Accuracy sampling 4-shot
RealWorldQA Accuracy sampling 4-shot text only
ReMI Accuracy sampling 4-shot
AI2D Accuracy sampling 4-shot
ChartQA Accuracy sampling 4-shot
VQA v2 Accuracy sampling 4-shot
BLINK Accuracy sampling 0-shot
OK-VQA Accuracy sampling 4-shot
TallyQA Accuracy sampling 4-shot
SpatialSense VQA Accuracy sampling 4-shot
CountBench VQA Accuracy sampling 0-shot

Table 20 | Details on vision benchmarks. No Chain-Of-Thought prompting nor normalization.
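DocVQA and InfographicVQA in Table 20 are scored with ANLS. The snippet below is a sketch of the commonly used ANLS definition (normalized Levenshtein similarity with a 0.5 threshold, averaged over questions); it is provided for orientation and may differ in details from the exact scorer used here:

def edit_distance(a: str, b: str) -> int:
    # Standard dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def anls(prediction: str, references: list[str], threshold: float = 0.5) -> float:
    # Similarity to the best-matching reference; the benchmark score is the mean
    # of this value over all questions.
    best = 0.0
    for ref in references:
        p, r = prediction.strip().lower(), ref.strip().lower()
        nl = edit_distance(p, r) / max(len(p), len(r), 1)
        best = max(best, 1.0 - nl if nl < threshold else 0.0)
    return best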

Evaluation Metric Type n-shot COT


MMLU Accuracy sampling 0-shot
MBPP pass@1 sampling 3-shot
HumanEval pass@1 sampling 0-shot
N2C pass@1 sampling 0-shot
LiveCodeBench Average over 8 samples sampling 0-shot Yes
GSM8K Accuracy sampling 0-shot Yes
MATH Accuracy sampling 0-shot
HiddenMath Accuracy sampling 0-shot
BBH Accuracy sampling 0-shot
BBEH Accuracy sampling 0-shot
IFEval Accuracy sampling 0-shot
Global-MMLU-lite Accuracy sampling 0-shot Yes
ECLeKTic ECLeKTic score sampling 0-shot
WMT24++ CHaRacter-level F-score sampling 0-shot

Table 21 | Details on instruction fine-tuned (IT) benchmarks. No normalization.
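The code benchmarks in Tables 19 and 21 report pass@1, and LiveCodeBench is averaged over 8 samples. As a reminder of the metric (not necessarily the exact harness used here), the standard unbiased pass@k estimator of Chen et al. (2021) reduces to the fraction of correct samples when k = 1:

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # n generated samples, c of them passing the unit tests.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# pass@1 estimated from 8 samples is simply the fraction of correct samples:
assert abs(pass_at_k(8, 2, 1) - 2 / 8) < 1e-12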
