Gemma 3 Technical Report
We introduce Gemma 3, a multimodal addition to the Gemma family of lightweight open models, ranging in scale from 1 to 27 billion parameters. This version introduces vision understanding abilities, a wider coverage of languages, and longer context – at least 128K tokens. We also change the architecture of the model to reduce the KV-cache memory that tends to explode with long context. This is achieved by increasing the ratio of local to global attention layers, and keeping the span of local attention short. The Gemma 3 models are trained with distillation and achieve superior performance to Gemma 2 for both pre-trained and instruction-finetuned versions. In particular, our novel post-training recipe significantly improves the math, chat, instruction-following and multilingual abilities, making Gemma3-4B-IT competitive with Gemma2-27B-IT and Gemma3-27B-IT comparable to Gemini-1.5-Pro across benchmarks. We release all our models to the community.
1 See Contributions and Acknowledgments section for full author list. Please send correspondence to [email protected].
Table 5 | Evaluation of Gemma 3 27B IT model in the Chatbot Arena (Chiang et al., 2024). All the
models are evaluated against each other through blind side-by-side evaluations by human raters. Each
model is attributed a score, based on the Elo rating system. Gemma-3-27B-IT numbers are preliminary
results received on March 8, 2025.
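Table 5 ranks models by Chatbot Arena Elo. The Arena's exact rating computation may differ; as background only, the standard Elo expected-score and update rule can be sketched as follows (the K-factor of 32 is an illustrative choice, not the Arena's):

```python
def elo_expected(r_a: float, r_b: float) -> float:
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Updated ratings after one game; score_a is 1 (win), 0.5 (tie), 0 (loss)."""
    e_a = elo_expected(r_a, r_b)
    # Rating points gained by A equal the points lost by B, so the sum is conserved.
    return r_a + k * (score_a - e_a), r_b + k * ((1.0 - score_a) - (1.0 - e_a))

# A 1338-rated model is expected to beat a 1257-rated one roughly 61% of the time.
p_win = elo_expected(1338, 1257)
```

Under this model, the gap between Gemma-3-27B-IT (1338) and the 1257-rated models corresponds to winning a blind side-by-side comparison about 61% of the time.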
Gemma-3-27B-IT (1338) is among the top 10 best models, with a score above other non-thinking open models, such as DeepSeek-V3 (1318), LLaMA 3 405B (1257), and Qwen2.5-70B (1257), which are much larger models. Finally, the Elo of Gemma 3 is significantly higher than that of Gemma 2 (1220). Note that Elo scores do not take into account visual abilities, which none of the aforementioned models have.

4.2. Standard benchmarks

In Table 6, we show the performance of our final models across a variety of benchmarks compared to our previous model iteration, and Gemini 1.5. We do not compare directly with external models that often report their own evaluation settings, since running them in our setting does not guarantee a fair comparison. We encourage the reader to follow third-party static leaderboards for fairer comparisons across models. We include additional evaluations of our models on other benchmarks in the appendix.

5. Ablations

In this section, we focus on the impact of our architecture changes, as well as some of the vision abilities new to this model.

5.1. Pre-training ability probing

We use several standard benchmarks as probes during pre-training to ensure our models capture general abilities, and in Figure 2, we compare the quality of pre-trained models from Gemma 2 and 3 across these general abilities, namely, science, [...]
Table 6 | Performance of instruction fine-tuned (IT) models compared to Gemini 1.5, Gemini 2.0, and Gemma 2 on zero-shot benchmarks across different abilities.

Figure 2 | Summary of the performance of different pre-trained models from Gemma 2 and 3 across general abilities. These plots are meant to give a simplified summary; details are in the appendix.
5.2. Local:Global attention layers

We measure the impact of changes to local and global self-attention layers on performance and memory consumption during inference.

Local:Global ratio. In Fig. 3, we compare different ratios of local to global attention layers. 1:1 is used in Gemma 2 models, and 5:1 is used in Gemma 3. We observe minimal impact on perplexity when changing this ratio.

Sliding window size. In Fig. 4, we compare different sliding window sizes for the local attention layers [...]
[Figure: perplexity and KV-cache memory for the configurations global only, 1:1 sw=4096, 1:1 sw=1024, 1:3 sw=4096, and 1:3 sw=1024.]
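The memory effect of mixing local and global layers can be sketched with a back-of-the-envelope estimate: global layers cache keys and values for the full context, while sliding-window layers cache at most `window` tokens. The layer count, head count, and dimensions below are illustrative assumptions, not Gemma 3's actual configuration:

```python
def kv_cache_bytes(context_len, n_layers, local_to_global_ratio, window,
                   n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Rough KV-cache size for a mix of global and local (sliding-window) layers.

    local_to_global_ratio is L in an L:1 local:global layout (0 means all-global).
    Config values are illustrative, not Gemma 3's actual architecture.
    """
    n_global = n_layers // (local_to_global_ratio + 1)
    n_local = n_layers - n_global
    per_token = 2 * n_kv_heads * head_dim * bytes_per_elem  # K and V per layer
    global_part = n_global * context_len * per_token
    local_part = n_local * min(window, context_len) * per_token
    return global_part + local_part

full_global = kv_cache_bytes(131072, 48, 0, 131072)  # all-global baseline
mixed = kv_cache_bytes(131072, 48, 5, 1024)          # 5:1 ratio, sw=1024
```

With these illustrative numbers, the 5:1 layout with a 1024-token window needs several times less KV-cache memory at 128K context than the all-global baseline, while short contexts (below the window) cost the same either way.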
Instead of training with 128K sequences from scratch, we pre-train our models with 32K sequences [...]

A common finding is that, to train a small model, it is preferable to distill from a smaller teacher.
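The models are trained with distillation, i.e., by matching the teacher's soft next-token distribution. A minimal sketch of the per-position objective (forward KL from teacher to student) is shown below; this is an illustration of the general technique, not Gemma's actual training code:

```python
import math

def distill_loss(teacher_probs, student_logits):
    """Forward KL(teacher || student) for one token position.

    teacher_probs: teacher's next-token distribution (sums to 1).
    student_logits: student's raw scores over the same vocabulary.
    A minimal sketch of the distillation objective, not Gemma's training code.
    """
    # Numerically stable softmax over the student logits.
    m = max(student_logits)
    exps = [math.exp(l - m) for l in student_logits]
    z = sum(exps)
    student_probs = [e / z for e in exps]
    # KL divergence; terms with zero teacher mass contribute nothing.
    return sum(p * math.log(p / q)
               for p, q in zip(teacher_probs, student_probs) if p > 0)
```

The loss is zero exactly when the student reproduces the teacher's distribution, and grows as the two distributions diverge.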
Figure 8 | Small versus large teacher. Relative difference of perplexity when using a small and a large teacher as a function of the number of training tokens (in billions, log scale). Smaller numbers mean distilling from a larger teacher is better.

            DocVQA   InfoVQA   TextVQA
4B            72.8      44.1      58.9
4B w/ P&S     81.0      57.0      60.8
Δ            (+8.2)   (+12.9)    (+1.9)
27B           85.6      59.4      68.6
27B w/ P&S    90.4      76.4      70.2
Δ            (+4.8)   (+17.0)    (+1.6)

Table 8 | Impact of P&S. 4-shot evaluation results on the valid set, with and without P&S on a pre-trained checkpoint. Boosts are on tasks associated with images with varying aspect ratios, or involving reading text on images.
We suspect this is because these studies are often performed in settings where the regularization effect of using a worse teacher surpasses the benefit of using a better teacher. We train a student with two teachers of different sizes, one large and one small, for different training horizons. In Fig. 8, we observe that for short training horizons, the smaller teacher is better, but the trend is reversed for longer training.

Pan & Scan. P&S enables capturing images at close to their native aspect ratio and image resolution. In Table 8, we compare our 27B IT model with and without P&S. As expected, the ability to treat images with close to native resolution greatly helps with tasks that require some form of reading text on images, which is particularly important for visual language models.
Resolution   DocVQA   InfoVQA   TextVQA
256            31.9      23.1      44.1
448            45.4      31.6      53.5
896            59.8      33.7      58.0

Table 7 | Impact of image encoder input resolution. We measure performance using a short schedule 2B Gemma model on a few evaluation benchmarks to observe the effect of input image resolution on vision encoder pre-training.

Impact of image resolution. We use a vision encoder based on SigLIP (Zhai et al., 2023). The vision encoder is frozen, and only the language model is trained. Each image in this multimodal data is represented by 256 image tokens from the respective vision encoder. The higher resolution encoders thus use average pooling to reduce their output to 256 tokens. For instance, the 896 [...]

6. Memorization and Privacy

Large language models may produce near-copies of some text used in training (Biderman et al., 2023; Carlini et al., 2021, 2022; Ippolito et al., 2022; Nasr et al., 2023). Several prior reports have released audits that quantify this risk by measuring the memorization rate (Anil et al., 2023; Chowdhery et al., 2022; Gemini Team, 2023, 2024; Gemma Team, 2024a,b; LLaMa Team, 2024). This "memorization rate"1 is defined as the ratio of generations from the model that match its training data compared to all model generations using the following setup.

1 "We do not state or imply [here] that a model "contains" its training data in the sense that there is a copy of that data in the model. Rather, a model memorizes attributes of its training data such that in certain cases it is statistically able to generate such training data when following rules and using information about features of its training data that it does contain."

We follow the methodology described in Gemma Team
(2024b) to measure it. Specifically, we subsample a large portion of training data distributed uniformly across different corpora and test for discoverable extraction (Nasr et al., 2023) of this content using a prefix of length 50 and a suffix of length 50. We denote text as either "exactly memorized" if all tokens in the continuation match the source suffix or "approximately memorized" if they match up to an edit distance of 10%.

[Figure: total memorization rate (% memorized, log scale), split into exact and approximate memorization, for Gemma 3 (1B, 4B, 12B, 27B), Gemma 2 (2B, 9B, 27B), Gemma (2B, 7B), PaLM Small, and Gemini 1.5 Flash.]

Figure 9 | Total memorization rates for both exact and approximate memorization. Gemma 3 models memorize significantly less than all prior models. *No results for approximate memorization on these models.

Figure 9 compares the memorization rates across Gemma and Gemini models; these models are ordered in reverse chronological order, with the newest Gemma 3 models on the left. We find that Gemma 3 models memorize long-form text at a much lower rate than prior models (note the log y-axis). We observe only a marginal difference in the memorization rates between the 4B, 12B, and 27B models, with 1B memorizing less than these larger models. Further, we find that a larger proportion of text is characterized as approximately memorized, with a relative increase in approximate memorization compared to exact memorization of roughly 24x on average.

We also study the rate at which the generations may contain personal information. To identify potentially personal information, we use the Google Cloud Sensitive Data Protection (SDP) service.2 SDP uses broad detection rules to identify text that may contain personal information. SDP is designed to have high recall and does not consider the context in which the information may appear, which leads to many false positives. Thus, we are likely overestimating the true amount of potentially personal information contained in the outputs classified as memorized. SDP also provides broad severity levels: low, medium, and high. We classify text as personal if SDP classifies it as personal information at any severity level. We observed no personal information in the outputs characterized as memorization for all Gemma 3 models. This indicates a low rate of personal data, below our detection thresholds, in outputs classified as memorization.

2 https://cloud.google.com/sensitive-data-protection

7. Responsibility, Safety, Security

Responsibility, safety, and security are of utmost importance in the development of Gemma models. To reduce risks to Gemma 3 users, we have continued to integrate enhanced internal safety processes that span the development workflow, in line with recent Google AI models (Gemini Team, 2024). This focuses on safety mitigation at training time, and robust and transparent model evaluations for the new image-to-text capabilities we have introduced.

7.1. Governance & Assessment

Our approach to assessing the benefits and risks of Gemma is reflective of that outlined for Gemma 1 (Gemma Team, 2024a), taking into account the changes in supported modalities. We continue to believe that openness in AI can spread the benefits of these technologies across society, but it must be evaluated against the risk of malicious uses that can cause harm on both individual and institutional levels (Weidinger et al., 2021). Since the inaugural Gemma launch, we have seen these models drive a number of socially beneficial applications, such as our own ShieldGemma 2, a 4B image safety classifier built with Gemma 3, which provides a ready-made solution for image safety, outputting safety labels across dangerous content, sexually explicit, and violence categories.

Releasing Gemma 3 models required specific attention to changes in model capabilities and close monitoring of the evolving risks of existing multimodal LLMs (Lin et al., 2024), as well as an understanding of the ways in which models are being used in the wild. Although we are yet to receive any reports of malicious use for Gemma, we remain committed to investigating any such reporting, and work with the academic and developer communities, as well as conduct our own monitoring, to flag such cases.

Despite advancements in capabilities, we believe that, given the number of larger powerful open models available, this release will have a negligible effect on the overall risk landscape.

7.2. Safety policies and train-time mitigations

A key pillar of Gemma's approach to safety is to align fine-tuned models with Google's safety policies, in line with Gemini models (Gemini Team, 2023). They are designed to help prevent our models from generating harmful content, i.e.,

• Child sexual abuse and exploitation
• Revealing personally identifiable information that can lead to harm (e.g., Social Security numbers)
• Hate speech and harassment
• Dangerous or malicious content (including promoting self-harm or instructing in harmful activities)
• Sexually explicit content
• Medical advice that runs contrary to scientific or medical consensus

We undertook considerable safety filtering of our pre-training data to reduce the likelihood of our pre-trained and fine-tuned checkpoints producing harmful content. For fine-tuned models, we also use both SFT and RLHF to steer the model away from undesirable behavior.

[...] rigorous risk assessment. Our internal safety processes are designed accordingly, and for previous Gemma models we have also undertaken evaluations of capabilities relevant to extreme risks (Phuong et al., 2024; Shevlane et al., 2023). As we continue to develop and share open models, we will follow the heuristic that thoroughly evaluating a more capable model often provides sufficient assurance for less capable ones. As such, we prioritised a streamlined set of evaluations for Gemma 3, reserving in-depth dangerous capability assessments for cases where a specific model may present a potentially heightened risk (as described below on CBRN evaluations). We balance development speed with targeted safety testing, ensuring our evaluations are well-focused and efficient, while upholding the commitments laid out in our Frontier Safety Framework.

Baseline Evaluations

Baseline assurance captures the model violation rate for safety policies, using a large number of synthetic adversarial user queries, and human raters to label the answers as policy violating or not. Overall, the Gemma 3 violation rate on these safety policies is significantly low.

Chemical, Biological, Radiological and Nuclear (CBRN) knowledge

Owing to enhanced performance on STEM-related tasks, we evaluated knowledge relevant to biological, radiological, and nuclear risks using an internal dataset of closed-ended, knowledge-based multiple choice questions. For evaluations of chemical knowledge, we employed a closed-ended knowledge-based approach on chemical hazards developed by Macknight et al. Our evaluation suggests that the knowledge of Gemma 3 models in these domains is low.
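The memorization audit described in Section 6 labels a 50-token continuation "exactly memorized" when it matches the source suffix, and "approximately memorized" when it matches up to an edit distance of 10%. A minimal sketch of that classification is below; the token-level Levenshtein distance and the normalization of the 10% threshold by suffix length are our assumptions about the setup, not the report's exact implementation:

```python
def edit_distance(a, b):
    """Token-level Levenshtein distance via single-row dynamic programming."""
    dp = list(range(len(b) + 1))
    for i, ta in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, tb in enumerate(b, 1):
            # Deletion, insertion, or substitution (free when tokens match).
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ta != tb))
    return dp[len(b)]

def memorization_type(generated, source_suffix, threshold=0.10):
    """Classify a model continuation against the true training-data suffix.

    The 10% threshold is applied relative to the suffix length; that
    normalization is our assumption for illustration.
    """
    if generated == source_suffix:
        return "exact"
    if edit_distance(generated, source_suffix) <= threshold * len(source_suffix):
        return "approximate"
    return "none"
```

The memorization rate is then the fraction of prompted generations classified as "exact" or "approximate" out of all generations tested.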
[...] will only share these with the community when we are confident that the benefits significantly outweigh the foreseeable risks.

References

R. Agarwal, N. Vieillard, Y. Zhou, P. Stanczyk, S. Ramos Garea, M. Geist, and O. Bachem. On-policy distillation of language models: Learning from self-generated mistakes. In ICLR, 2024.

A. Asai, J. Kasai, J. H. Clark, K. Lee, E. Choi, and H. Hajishirzi. XOR QA: Cross-lingual open-retrieval question answering. arXiv preprint arXiv:2010.11856, 2020.

Y. Bisk, R. Zellers, R. L. Bras, J. Gao, and Y. Choi. PIQA: Reasoning about physical commonsense in natural language. CoRR, abs/1911.11641, 2019.
J. Gehring, K. Zheng, J. Copet, V. Mella, T. Cohen, and G. Synnaeve. RLEF: Grounding code LLMs in execution feedback with reinforcement learning. arXiv preprint arXiv:2410.02089, 2024.

Gemini Team. Gemini: A family of highly capable multimodal models, 2023.

Gemini Team. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, 2024.

Gemma Team. Gemma: Open models based on Gemini research and technology, 2024a.

Gemma Team. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118, 2024b.

O. Goldman, U. Shaham, D. Malkin, S. Eiger, A. Hassidim, Y. Matias, J. Maynez, A. M. Gilady, J. Riesa, S. Rijhwani, L. Rimell, I. Szpektor, R. Tsarfaty, and M. Eyal. ECLeKTic: A novel challenge set for evaluation of cross-lingual knowledge transfer, 2025.

N. Goyal, C. Gao, V. Chaudhary, P.-J. Chen, G. Wenzek, D. Ju, S. Krishnan, M. Ranzato, F. Guzmán, and A. Fan. The FLORES-101 evaluation benchmark for low-resource and multilingual machine translation. ACL, 2022.

Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In CVPR, 2017.

D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. CoRR, abs/2009.03300, 2020.

D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring mathematical problem solving with the MATH dataset. NeurIPS, 2021.

J. Hessel, A. Marasović, J. D. Hwang, L. Lee, J. Da, R. Zellers, R. Mankoff, and Y. Choi. Do androids laugh at electric sheep? Humor "understanding" benchmarks from the New Yorker caption contest. arXiv preprint arXiv:2209.06293, 2022.

G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

C.-P. Hsieh, S. Sun, S. Kriman, S. Acharya, D. Rekesh, F. Jia, Y. Zhang, and B. Ginsburg. RULER: What's the real context size of your long-context language models? arXiv preprint arXiv:2404.06654, 2024.

D. Ippolito, F. Tramèr, M. Nasr, C. Zhang, M. Jagielski, K. Lee, C. A. Choquette-Choo, and N. Carlini. Preventing verbatim memorization in language models gives a false sense of privacy. arXiv preprint arXiv:2210.17546, 2022.

B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In CVPR, 2018.

M. Joshi, E. Choi, D. S. Weld, and L. Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. CoRR, abs/1705.03551, 2017.

M. Kazemi, H. Alvari, A. Anand, J. Wu, X. Chen, and R. Soricut. GeomVerse: A systematic evaluation of large models for geometric reasoning. arXiv preprint arXiv:2312.12241, 2023.

M. Kazemi, N. Dikkala, A. Anand, P. Dević, I. Dasgupta, F. Liu, B. Fatemi, P. Awasthi, D. Guo, S. Gollapudi, and A. Qureshi. ReMI: A dataset for reasoning with multiple images. ArXiv, abs/2406.09175, 2024a.

M. Kazemi, Q. Yuan, D. Bhatia, N. Kim, X. Xu, V. Imbrasaite, and D. Ramachandran. BoardgameQA: A dataset for natural language reasoning with contradictory information. NeurIPS, 36, 2024b.

M. Kazemi, B. Fatemi, H. Bansal, J. Palowitch, C. Anastasiou, S. V. Mehta, L. K. Jain, V. Aglietti, D. Jindal, P. Chen, et al. BIG-Bench Extra Hard. arXiv preprint arXiv:2502.19187, 2025.

A. Kembhavi, M. Salvato, E. Kolve, M. Seo, H. Hajishirzi, and A. Farhadi. A diagram is worth a dozen images. ArXiv, abs/1603.07396, 2016.
E. Kıcıman, R. Ness, A. Sharma, and C. Tan. Causal reasoning and large language models: Opening a new frontier for causality. arXiv preprint arXiv:2305.00050, 2023.

T. Kudo and J. Richardson. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. 2018.

Macknight, Aung, and Gomes. Personal communication.

K. Marino, M. Rastegari, A. Farhadi, and R. Mottaghi. OK-VQA: A visual question answering benchmark requiring external knowledge. In CVPR, 2019.

A. Masry, X. L. Do, J. Q. Tan, S. Joty, and E. Hoque. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. ACL, 2022.

M. Mathew, D. Karatzas, R. Manmatha, and C. V. Jawahar. DocVQA: A dataset for VQA on document images. WACV, 2020.

M. Mathew, V. Bagal, R. Tito, D. Karatzas, E. Valveny, and C. Jawahar. InfographicVQA. In WACV, 2022.

I. Mirzadeh, K. Alizadeh, H. Shahrokhi, O. Tuzel, S. Bengio, and M. Farajtabar. GSM-Symbolic: Understanding the limitations of mathematical reasoning in large language models. arXiv preprint arXiv:2410.05229, 2024.

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763. PMLR, 2021.

A. Ramé, J. Ferret, N. Vieillard, R. Dadashi, L. Hussenot, P.-L. Cedoz, P. G. Sessa, S. Girgin, A. Douillard, and O. Bachem. WARP: On the benefits of weight averaged rewarded policies, 2024a.

A. Ramé, N. Vieillard, L. Hussenot, R. Dadashi, G. Cideron, O. Bachem, and J. Ferret. WARM: On the benefits of weight averaged reward models. In ICML, 2024b.
Zoubin Ghahramani
Raia Hadsell
D. Sculley
Slav Petrov
Noah Fiedel
Noam Shazeer
Oriol Vinyals
Jeff Dean
Demis Hassabis
Koray Kavukcuoglu
Clement Farabet
Technical advisors
Elena Buchatskaya
Jean-Baptiste Alayrac
Rohan Anil
Dmitry (Dima) Lepikhin
Sebastian Borgeaud
Olivier Bachem
Lead
Armand Joulin
Technical leads
Alek Andreev
Cassidy Hardin
Robert Dadashi
Léonard Hussenot
Gemma 3 PT Gemma 3 IT
Context 4B 12B 27B 4B 12B 27B
RULER 32K 67.1 90.6 85.9 61.4 80.3 91.1
RULER 128K 51.7 80.7 72.9 46.8 57.1 66.0
MRCR 32K 44.7 59.8 63.2 49.8 53.7 63.2
MRCR 128K 40.6 56.9 60.0 44.6 49.8 59.3
                       4B    12B    27B
MMMU (val)            48.8   59.6   64.9
DocVQA                75.8   87.1   86.6
InfoVQA               50.0   64.9   70.6
TextVQA               57.8   67.7   65.1
AI2D                  74.8   84.2   84.5
ChartQA               68.8   75.7   78.0
VQAv2 (val)           62.4   71.6   71.0
MathVista (testmini)  50.0   62.9   67.6

                         4B    12B    27B
Perception Test MCVQA   50.6   54.9   58.1
ActivityNet-QA          46.3   50.4   52.8
Gemma 2 Gemma 3
2B 9B 27B 1B 4B 12B 27B
MMLU 56.1 71.3 76.2 38.8 58.1 71.9 76.9
MBPP 36.6 59.2 67.4 35.2 63.2 73.0 74.4
HumanEval 20.1 40.2 51.8 41.5 71.3 85.4 87.8
N2C 46.8 68.3 77.3 56.0 70.3 80.7 84.5
LiveCodeBench 7.0 20.0 29.0 5.0 23.0 32.0 39.0
GSM8K 62.6 88.1 91.1 62.8 89.2 94.4 95.9
MATH 27.2 49.4 55.6 48.0 75.6 83.8 89.0
HiddenMath 2.0 8.0 12.0 15.0 42.0 51.0 56.0
BBH 41.4 69.0 74.9 39.1 72.2 85.7 87.6
BBEH 5.9 9.8 14.8 7.2 11.0 16.3 19.3
IFEval 80.4 88.4 91.1 80.2 90.2 88.9 90.4
GMMLU-Lite 41.9 64.8 68.6 34.2 54.5 69.5 75.1
ECLeKTic 5.3 11.8 17.6 1.4 4.6 10.3 16.7
WMT24++ 37.4 48.7 51.7 35.9 46.8 51.6 53.4
Table 18 | Performance of instruction fine-tuned (IT) models of different sizes on more internal and
external benchmarks.
Table 19 | Details on text benchmarks. Char-Len stands for Character Length Normalization and COT
stands for Chain-Of-Thought prompting.