Apple Foundation Language Models

Apple
1 Introduction
At the 2024 Worldwide Developers Conference, we introduced Apple Intelli-
gence, a personal intelligence system integrated deeply into iOS 18, iPadOS
18, and macOS Sequoia.
Apple Intelligence consists of multiple highly-capable generative models
that are fast, efficient, specialized for our users’ everyday tasks, and can
adapt on the fly for their current activity. The foundation models built into
Apple Intelligence have been fine-tuned for user experiences such as writing
and refining text, prioritizing and summarizing notifications, creating playful
images for conversations with family and friends, and taking in-app actions to
simplify interactions across apps.
[Figure 1: Overview of how the Apple foundation models and adapters are built: data preprocessing, pre-training, post-training, and optimization.]
In this report, we detail how two of these models—AFM-on-device, a language model designed to run efficiently on device, and AFM-server, a larger server-based language model—have been
built and adapted to perform specialized tasks efficiently, accurately, and
responsibly (Figure 1). These two foundation models are part of a larger
family of generative models created by Apple to support users and developers;
this includes a coding model (based on an AFM language model) to build
intelligence into Xcode, as well as a diffusion model to help users express
themselves visually, for example, in the Messages app.
Apple Intelligence is designed with Apple’s core values at every step and built on a foundation of industry-leading privacy protection. Additionally, we have created Responsible AI principles to guide how we develop AI tools, as well as the models that underpin them.
These principles are reflected at every stage of the architecture that enables
Apple Intelligence and connects features and tools with specialized models.
In the remainder of this report, we provide details on decisions such as:
how we develop models that are highly capable, fast, and power-efficient; how
we approach training these models; how our adapters are fine-tuned for specific
user needs; and how we evaluate model performance for both helpfulness and
unintended harm.
2 Architecture
The AFM base models are dense decoder-only models that build on the
Transformer architecture [Vaswani et al., 2017], with the following design
choices:
• Pre-Normalization [Nguyen and Salazar, 2019] with RMSNorm [Zhang
and Sennrich, 2019] for training stability.
• Query/key normalization [Wortsman et al., 2023] to improve training
stability.
• Grouped-query attention (GQA) [Ainslie et al., 2023] with 8 key-value
heads to reduce the KV-cache memory footprint.
• The SwiGLU activation [Shazeer, 2020] for higher efficiency.
• RoPE [Su et al., 2024] positional embeddings with the base frequency
set to 500k for long-context support.
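To make these choices concrete, the following Python/NumPy sketch illustrates RMSNorm, the SwiGLU feed-forward, RoPE angles with a 500k base frequency, and the head counts implied by GQA with 8 key-value heads. The specific dimensions and helper names are illustrative assumptions, not AFM's actual configuration.

    import numpy as np

    def rms_norm(x, weight, eps=1e-6):
        # Pre-normalization with RMSNorm: scale activations by their root mean square.
        rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
        return (x / rms) * weight

    def swiglu(x, w_gate, w_up, w_down):
        # SwiGLU feed-forward: SiLU(x @ W_gate) * (x @ W_up), projected back down.
        silu = lambda z: z / (1.0 + np.exp(-z))
        return (silu(x @ w_gate) * (x @ w_up)) @ w_down

    def rope_angles(positions, head_dim, base=500_000.0):
        # RoPE rotation angles with the base frequency set to 500k for long context.
        inv_freq = 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))
        return np.outer(positions, inv_freq)  # shape [len(positions), head_dim // 2]

    # Grouped-query attention with 8 key-value heads: groups of query heads share a
    # KV head, so the KV cache stores only n_kv_heads * head_dim values per token.
    n_query_heads, n_kv_heads, head_dim = 24, 8, 128  # assumed sizes, not AFM's
    queries_per_kv_head = n_query_heads // n_kv_heads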
3 Pre-training
Our AFM pre-training process plays a critical role in developing highly capable
language models to power a host of Apple Intelligence features that can help
and support users. We focus on efficiency and data quality at every step in
order to pre-train for a high-quality end-to-end user experience with efficient
and low-latency models.
3.1 Data
The AFM pre-training dataset consists of a diverse and high-quality data
mixture. This includes data we have licensed from publishers, curated publicly-
available or open-sourced datasets, and publicly available information crawled
by our web-crawler, Applebot [Apple, 2024a]. We respect the right of webpages
to opt out of being crawled by Applebot, using standard robots.txt directives.
Given our focus on protecting user privacy, we note that no private Apple
user data is included in the data mixture. Additionally, extensive efforts have
been made to exclude profanity, unsafe material, and personally identifiable
information from publicly available data (see Section 7 for more details).
Rigorous decontamination is also performed against many common evaluation
benchmarks.
We find that data quality, much more so than quantity, is the key deter-
mining factor of downstream model performance. In the following, we provide
more details about key components of the data mixture.
3.1.3 Code
Code data is obtained from license-filtered1 open source repositories on GitHub.
The bulk of the code data covers 14 common programming languages, including
Swift, Python, C, Objective-C, C++, JavaScript, Java, and Go. The data is
de-duplicated, further filtered for PII and quality, and decontaminated in the
same fashion as in Section 3.1.1.
1 Using MIT, Apache, BSD, CC0, CC-BY, Unlicensed, ISC, and Artistic Licenses.
3.1.4 Math
We integrate two categories of high-quality data sourced from the web. The
first category is a Math Q&A dataset, comprising 3 billion tokens from 20
web domains rich in math content. We extract the questions and answers by
identifying relevant tags from HTML pages. The second category is a collection
of 14 billion tokens from web pages such as math forums, blogs, tutorials,
and seminars. To filter these web pages, we used a specialized pipeline that
includes a math tag filter with a collection of 40 strings to identify mathematical
templates, a math symbol filter with a collection of 350 Unicode and LaTeX
symbols to identify math content, a quality filter powered by a language model
classifier specifically designed for math [Kong et al., 2024], and a domain
filter that processes all web pages from domains manually labeled by humans.
We applied these filters, followed by deduplication, decontamination, and PII
removal to produce the final dataset.
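As an illustration of how such a cascaded filter might be wired together, here is a minimal Python sketch; the tag strings, symbol set, classifier threshold, field names, and the exact composition of filters are placeholders rather than the actual 40-string and 350-symbol filters used for AFM.

    MATH_TAGS = ["<math", "\\begin{equation}", "mathjax"]        # stand-ins for the ~40 tag strings
    MATH_SYMBOLS = {"∑", "∫", "√", "≤", "≥", "\\frac", "\\sqrt"}  # stand-ins for the ~350 symbols

    def looks_mathematical(page_html, page_text):
        has_tag = any(tag in page_html.lower() for tag in MATH_TAGS)
        symbol_hits = sum(1 for s in MATH_SYMBOLS if s in page_text)
        return has_tag or symbol_hits >= 3  # assumed threshold

    def keep_page(page, quality_classifier, labeled_math_domains):
        # How the filters are composed is an assumption here: pages from manually
        # labeled math domains are kept, and other pages must pass both the
        # tag/symbol filters and the LM-based quality classifier.
        if page["domain"] in labeled_math_domains:
            return True
        return looks_mathematical(page["html"], page["text"]) and \
               quality_classifier(page["text"]) >= 0.5  # assumed score threshold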
3.1.6 Tokenizer
We use a byte-pair encoding (BPE) tokenizer, following the implementation
from SentencePiece. All numbers are split into individual digits and we use
byte-fallback to decompose unknown UTF-8 characters into byte tokens. We
do not enable Unicode normalization. The total vocabulary size is 100k and
49k tokens for AFM-server and AFM-on-device, respectively.
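A SentencePiece training call consistent with this description might look like the sketch below; the corpus path, output prefix, and any flags beyond those stated above are placeholders.

    import sentencepiece as spm

    spm.SentencePieceTrainer.train(
        input="pretraining_sample.txt",      # placeholder corpus file
        model_prefix="afm_style_tokenizer",  # placeholder output prefix
        model_type="bpe",
        vocab_size=100_000,                  # 100k for AFM-server; 49k for AFM-on-device
        split_digits=True,                   # numbers are split into individual digits
        byte_fallback=True,                  # unknown UTF-8 characters decompose into byte tokens
        normalization_rule_name="identity",  # no Unicode normalization
    )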
3.2 Recipe
We break AFM pre-training into three distinct stages:
1. core, which consumes most of the compute budget;
2. continued, where we down-weight the lower-quality bulk web-crawl data in favor of a higher code and math weight, combined with inclusion of the licensed data described in Section 3.1.2;
3. context-lengthening, a further continued pre-training stage conducted at longer sequence length and with synthetic long-context data included in the mixture.
Details about model quality after each of the three pre-training stages
(alongside additional metrics for AFM derived from our internal benchmark
implementations) are in Appendix C, and Appendix D examines AFM-server’s
long-context capabilities.
All three stages use decoupled weight decay [Loshchilov and Hutter, 2019]
for regularization, as well as a simplified version of µParam [Yang et al., 2022],
similar to what is described as µParam (simple) in [Wortsman et al., 2023].
Thus far we have not found more sophisticated parameter norm controls to be
necessary at these scales. All stages maintain sharded model and optimizer
states in float32, casting to bfloat16 for the forward and backward passes
for efficiency.
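The precision scheme can be sketched in a few lines of JAX-style Python: master parameters stay in float32 and are cast to bfloat16 only for the forward and backward passes. The toy loss, parameter tree, and learning rate below are placeholders, not AFM's training code.

    import jax
    import jax.numpy as jnp

    def loss_fn(params_bf16, batch):
        # Placeholder model: a single linear layer with squared error.
        preds = batch["x"] @ params_bf16["w"]
        return jnp.mean((preds - batch["y"]) ** 2)

    def train_step(params_f32, batch, lr=1e-3):
        # Cast float32 master weights to bfloat16 for the forward/backward pass.
        params_bf16 = jax.tree_util.tree_map(lambda p: p.astype(jnp.bfloat16), params_f32)
        grads = jax.grad(loss_fn)(params_bf16, batch)
        # Apply the update in float32 to avoid accumulating rounding error.
        return jax.tree_util.tree_map(
            lambda p, g: p - lr * g.astype(jnp.float32), params_f32, grads)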
final benchmark results by 0-2%, whilst adding distillation boosts MMLU and
GSM8K by about 5% and 3% respectively. More detailed ablation results can
be found in Appendix B. All training hyper-parameters except for the batch size are kept the same as for AFM-server.
3.2.4 Optimizer
We choose to use a variant of RMSProp [Hinton, 2012] with momentum for
AFM pre-training. In particular, we divide the raw gradient by the square-root
of a bias-corrected exponential moving average of the squared gradient to
produce an instantaneous update, which is clipped to a maximum norm of 1.0
per parameter block, before then further smoothing this estimate over steps
with an exponential moving average without bias-correction to produce the
net update. Unless otherwise noted, the smoothing constants for both the
squared gradient (β2 ) and the update (β1 ) are set to 0.95. A small constant
ϵ = 1e−30 is added to the instantaneous squared gradient prior to smoothing,
for numerical stability.
The smoothed updates are scaled by the learning rate, weight-decay is added,
and then scheduled decay is applied to form the final weight delta. As an
additional guard for stability, prior to the optimizer we clip the global gradient
norm to 1.0. For a recipe ablation against a more typical configuration, see
Appendix A.
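The per-parameter-block update described above can be sketched as follows in NumPy. The weight-decay value is a placeholder, the learning-rate schedule is assumed to be folded into lr, and the global gradient-norm clip applied before the optimizer is not shown; the exact ordering inside AFM's trainer may differ.

    import numpy as np

    def optimizer_step(param, grad, state, step, lr,
                       beta1=0.95, beta2=0.95, eps=1e-30,
                       weight_decay=1e-4, max_update_norm=1.0):
        v, m = state["v"], state["m"]  # EMAs of squared gradients and of updates

        # Bias-corrected EMA of the squared gradient; eps is added before smoothing.
        v = beta2 * v + (1.0 - beta2) * (grad * grad + eps)
        v_hat = v / (1.0 - beta2 ** step)

        # Instantaneous update, clipped to a maximum norm per parameter block.
        update = grad / np.sqrt(v_hat)
        norm = np.linalg.norm(update)
        if norm > max_update_norm:
            update = update * (max_update_norm / norm)

        # Smooth the clipped update over steps without bias correction.
        m = beta1 * m + (1.0 - beta1) * update

        # Scale by the (scheduled) learning rate and add decoupled weight decay.
        new_param = param - lr * (m + weight_decay * param)
        return new_param, {"v": v, "m": m}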
3.3 Training infrastructure
The AFM models are pre-trained on v4 and v5p Cloud TPU clusters with the
AXLearn framework [Apple, 2023], a JAX [Bradbury et al., 2018] based deep
learning library designed for the public cloud. Training is conducted using a
combination of tensor, fully-sharded-data-parallel, and sequence parallelism,
allowing training to scale to a large number of model parameters and sequence
lengths at high utilization. This system allows us to train the AFM models
efficiently and scalably, including AFM-on-device, AFM-server, and larger
models.
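The sketch below shows how this named-axis style of parallelism can be expressed with JAX's sharding APIs; the mesh shape, axis names, and array sizes are illustrative and do not reflect AXLearn's actual configuration.

    import numpy as np
    import jax
    from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

    devices = np.array(jax.devices())
    # One axis for fully-sharded data parallelism and one for tensor parallelism;
    # sequence parallelism can reuse one of these axes to shard activations along
    # the sequence dimension.
    mesh = Mesh(devices.reshape(-1, 1), axis_names=("fsdp", "model"))

    # Shard a weight matrix over both axes and a batch over the fsdp axis.
    weight_sharding = NamedSharding(mesh, P("fsdp", "model"))
    batch_sharding = NamedSharding(mesh, P("fsdp", None))

    w = jax.device_put(np.zeros((1024, 1024), dtype=np.float32), weight_sharding)
    x = jax.device_put(np.zeros((8, 1024), dtype=np.float32), batch_sharding)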
4 Post-Training
While Apple Intelligence features are powered through adapters on top of
the base model (see Section 5 for a deep-dive on the adapter architecture),
empirically we found that improving the general-purpose post-training lifts
the performance of all features, as the models have stronger capabilities on
instruction following, reasoning, and writing.
We conduct extensive research in post-training methods to instill general-purpose instruction following and conversation capabilities in the pre-trained AFM models. Our goal is to ensure these model capabilities are aligned with
Apple’s core values and principles, including our commitment to protecting
user privacy, and our Responsible AI principles. Our post-training efforts
include a series of data collection and generation, instruction tuning, and
alignment innovations. Our post-training process contains two stages: su-
pervised fine-tuning (SFT) and reinforcement learning from human feedback
(RLHF). We present two new post-training algorithms: (1) a rejection sampling
fine-tuning algorithm with teacher committee (iTeC), and (2) a reinforcement
learning from human feedback (RLHF) algorithm with mirror descent policy
optimization and a leave-one-out advantage estimator (MDLOO), which are used in our reinforcement learning iterations and lead to significant model quality
improvements.
4.1 Data
We use a hybrid data strategy in our post-training pipeline, which consists of
both human annotated and synthetic data. Throughout our data collection and
experiment process, we have found data quality to be the key to model success
and thus have conducted extensive data curation and filtering procedures.
4.1.1 Human annotations
Demonstration data To fuel the instruction fine-tuning of AFM, we collect
high-quality human annotated demonstration datasets from various sources.
This dialogue-style data consists of both system-level and task-level instructions
(a.k.a. prompts), as well as their corresponding responses. Similar to [Zhou
et al., 2024], we observe that quality matters more than quantity in our experiments. As a result, we focus on key data quality criteria including
helpfulness, harmlessness, presentation, and response accuracy, in addition
to targeting a diverse task distribution covering Apple Intelligence features.
To protect user privacy, we take steps to verify no personally identifiable
information is present in our data, and we do not include any personal data
stored by users with Apple.
where a seed set of prompts is transformed into a much larger set of diverse prompts.
Tool use We develop tool-use capabilities such as function call, code inter-
preter, and browsing through a mixture of synthetic and human data. The
model capabilities are first bootstrapped with synthetic data, which focuses
on single-tool use cases. We then collect human annotations to improve model
capabilities that involve multi-tool and multi-step scenarios. We further augment the human-curated function-call data by mixing the oracle tool with other similar tools to increase the difficulty of tool selection. In addition, we synthesize parallel function-call data from the human-curated function-call data to enable this new capability, and we synthesize tool-intent-detection data from the human-curated function-call data and general SFT data to mitigate tool-call over-triggering issues.
tests and a number of potential solutions. We then use an execution-based
rejection sampling method to select the best solution. This involves compiling
each potential solution with every unit test and executing them. The solution
with the highest number of successful executions is chosen. This results in a
collection of (question, test cases, solution) triplets. At the end, we validate
the quality of the dataset by filtering the triplets using the number of passed
unit tests, resulting in 12K high quality triplets used in the SFT.
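The selection step can be sketched as below; sandboxing, timeouts, and compilation details are omitted, and the helper names and the pass threshold are illustrative.

    def passes(solution_src, test_src):
        # In practice each candidate is executed in an isolated sandbox.
        env = {}
        try:
            exec(solution_src, env)
            exec(test_src, env)  # the test raises (e.g., AssertionError) on failure
            return True
        except Exception:
            return False

    def select_best_solution(question, candidate_solutions, unit_tests, min_passes=1):
        scored = [(sum(passes(c, t) for t in unit_tests), c) for c in candidate_solutions]
        best_score, best = max(scored, key=lambda item: item[0])
        if best_score < min_passes:  # filter triplets by the number of passed tests
            return None
        return (question, unit_tests, best)  # (question, test cases, solution) triplet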
Tuning the mixture ratio In order to tune the mixture weight, we treat it as
an optimization problem. Specifically, given a set of weights (w1 , w2 , ..., wn )
where wi represents the ratio of a specific component in the mixture, we train
a model with wi → wi ± ∆wi and evaluate the quality change on a set of
benchmarks. We find that extensively running such experiments can effectively
identify the best mixture and remove the least impactful data components.
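A minimal sketch of this coordinate-wise perturbation search is given below; train_and_eval stands in for a full training run plus benchmark evaluation, and the rule for acting on the measured deltas is left out.

    def probe_mixture(weights, delta, train_and_eval):
        # weights: dict mapping data component name -> mixture ratio w_i.
        # train_and_eval: trains a model with the given mixture and returns an
        # aggregate benchmark score (placeholder for the real pipeline).
        baseline = train_and_eval(weights)
        deltas = {}
        for name in weights:
            for sign in (+1.0, -1.0):
                trial = dict(weights)
                trial[name] = max(0.0, trial[name] + sign * delta)
                deltas[(name, sign)] = train_and_eval(trial) - baseline
        # Components whose perturbations barely move the score are candidates for
        # removal; consistently positive deltas suggest up-weighting a component.
        return deltas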
4.3.1 Reward modeling
We train reward models using the human preference data collected with the
method in Section 4.1.1. Each human preference data item contains one
prompt and two responses along with human labels including:
• The preferred response between the two and the preference level, i.e.,
whether the preferred response is significantly better, better, slightly
better, or negligibly better than the rejected response.
• The single-sided grading of each response, measuring its instruction-following ability, conciseness, truthfulness, and harmlessness.
Our reward model training follows the standard practice of reward modeling
in RLHF with two main innovations:
• We design a soft label loss function that takes the level of human prefer-
ence into account.
• We incorporate single-sided gradings as regularization terms in reward
modeling.
For each batch of human preference data collection, we set up a collection of the latest promising models trained from SFT, RS, DPO/IPO, and RL, as well as the best models from previous iterations, which we refer to as the “model committee”. We
collect pairwise human preference on responses sampled from the latest model
committee.
After acquiring each batch of human preference data, we refresh our reward
model, and further train a new set of models using the collection of preference
optimization algorithms. We then continue the next round of iterative RLHF
data collection with a new model committee.
max_θ E_{x∼D, y∼π_θ(·|x)} [ r_ϕ(x, y) − β D_KL(π_θ(·|x) ∥ π_ref(·|x)) ],    (1)
R(x, y) = r_ϕ(x, y) − β log ( π_θ(y|x) / π_ref(y|x) ),    (2)
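Computed per sampled response, the shaped reward of Eq. (2) amounts to the reward-model score minus a scaled log-ratio of policy and reference probabilities; the sketch below assumes log-probabilities summed over response tokens and an illustrative value of β.

    def shaped_reward(reward_model_score, policy_logprob, reference_logprob, beta=0.1):
        # beta here is illustrative; Eq. (2) does not fix its value.
        return reward_model_score - beta * (policy_logprob - reference_logprob)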
[Figure: Architecture of Apple Intelligence. Systemwide experiences (Siri, Writing Tools, Image Playground) and apps are connected through an orchestration layer (the personal intelligence system, with App Intents, a semantic index, and a toolbox) to on-device models and server models, all running on the Apple silicon ML stack (CPU, GPU, Neural Engine, Secure Enclave) on device and, with attestation, on Apple silicon servers.]
required. It is worth noting that the adapter parameters are initialized using
the accuracy-recovery adapter introduced in Section 5.2.
5.2 Optimizations
The AFM models are designed to support our users throughout their daily
activities, and both inference latency and power efficiency are important for
the overall user experience. We apply various optimization techniques to
allow AFM to be efficiently deployed on-device and in Private Cloud Compute.
These techniques significantly reduce memory, latency, and power usage while
maintaining the overall model quality.
In order to fit AFM into a constrained memory budget of edge devices and
reduce inference cost, it is critical to apply model quantization techniques
to reduce the effective bits per weight while maintaining the model quality.
Previous works have found that 4-bit quantized models suffer only marginal loss of quality (typically measured in pre-training metrics) compared to the original 32/16-bit floating-point versions. Since AFM is expected to support
a diverse set of product features, it is essential that the quantized model
retains capabilities in specific domains critical to these use cases. To achieve
an optimal trade-off between model capacity and inference performance, we
have developed state-of-the-art quantization methods and a framework that
utilizes accuracy-recovery adapters. This allows us to achieve near-lossless
quantization that is on average less than 4 bit-per-weight, and provides flexible
quantization scheme choices.
unquantized, quantized, and accuracy-recovered models and show that the
recovered models perform much closer to the unquantized version.
More discussions The use of a quantized model together with LoRA adapters looks conceptually similar to QLoRA [Dettmers et al., 2024]. While QLoRA was
designed to save computational resources during fine-tuning, our focus is on
the ability to switch between different LoRA adapters to efficiently support
high performance across various specific use cases. Before feature-specific
finetuning, we first train accuracy-recovery adapters on the same pretraining
and post-training data, which is critical to preserve the model quality. The
accuracy-recovery framework can be combined with different quantization
techniques, like GPTQ [Frantar et al., 2022] and AWQ [Lin et al., 2024], since
it does not depend directly on the quantization method itself. The feature
adapters described in Section 5 are initialized from these accuracy-recovery
adapters.
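The combination can be sketched as 4-bit group-wise quantization of a frozen backbone weight plus a small trainable low-rank adapter on top; the group size, rank, and symmetric quantization scheme below are assumptions for illustration, not the exact scheme used for AFM.

    import numpy as np

    def quantize_4bit(w, group_size=32):
        # Symmetric per-group quantization to integers in [-8, 7];
        # assumes w.size is a multiple of group_size.
        groups = w.reshape(-1, group_size)
        scales = np.max(np.abs(groups), axis=1, keepdims=True) / 7.0
        q = np.clip(np.round(groups / np.maximum(scales, 1e-12)), -8, 7)
        return q.astype(np.int8), scales

    def dequantize_4bit(q, scales, shape):
        return (q.astype(np.float32) * scales).reshape(shape)

    def adapted_linear(x, q, scales, shape, lora_a, lora_b):
        # Frozen quantized backbone plus a LoRA-style accuracy-recovery adapter:
        # y = x W_hat + (x A) B, where A and B are trained to recover lost accuracy.
        w_hat = dequantize_4bit(q, scales, shape)
        return x @ w_hat + (x @ lora_a) @ lora_b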
6 Evaluation
We evaluate the AFM models on pre-training (Section 6.1), post-training
(Section 6.2), and most importantly, feature-specific (Section 6.3) benchmarks.
feature adapters (Section 6.3) are more closely correlated to end-to-end user
experience.
                     AFM-on-device   AFM-server
MMLU (5-shot)             61.4          75.4

                     AFM-server
MMLU (5-shot)           75.3
GSM8K (5-shot)          72.4
ARC-c (25-shot)         69.7
HellaSwag (10-shot)     86.9
Winogrande (5-shot)     79.2
AFM-server
Narrative QA 77.5
Natural Questions (open) 73.8
Natural Questions (closed) 43.1
Openbook QA 89.6
MMLU 67.2
MATH-CoT 55.4
GSM8K 72.3
LegalBench 67.9
MedQA 64.4
WMT 2014 18.6
Human Evaluation
                                  AFM wins   Tie     AFM loses
AFM-on-device versus Llama-3-8B    29.7%     32.0%    38.3%
AFM-server versus GPT-4            29.3%     31.9%    38.8%
compared to Phi-3-mini despite being 25% smaller in model size, and it even outperforms the strong open-source baselines Gemma-7B and Mistral-7B, which have more than twice as many parameters. When compared to closed-source models, AFM-server achieves competitive performance, scoring a win rate of more than 50% and a tie rate of 27.4% against GPT-3.5.
measure general instruction-following capability, and results suggest that our
models are highly competitive.
IFEval (instruction-level accuracy): AFM-on-device 85.7; AFM-server 88.5.

Arena Hard
GPT-4            78.0
Llama-3-70B      46.6
Mixtral-8x22B    36.4
AFM-server       35.5
DBRX-Instruct    23.9
GPT-3.5          23.3
6.2.3 Tool use
In tool use applications, given a user request and a list of potential tools with
descriptions, the model can choose to issue tool calls by providing a structured
output specifying the name and parameter values of the tools to call. We
expect the tool descriptions to follow the OpenAPI specification.4
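For illustration, a tool description and the corresponding structured call might look like the following; the tool name, parameters, and argument values are hypothetical and not taken from the benchmark.

    # An OpenAPI-style description of one tool made available to the model.
    weather_tool = {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    }

    # The structured output the model emits when it decides to call the tool;
    # AST-based metrics compare this structure against a reference call.
    model_tool_call = {"name": "get_weather",
                       "arguments": {"city": "Cupertino", "unit": "celsius"}}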
We evaluate on the public Berkeley Function Calling Leaderboard bench-
marks [Patil et al., 2023] via native support of function calling, using the AST
metrics.
As shown in Figure 5, AFM-server achieves the best overall accuracy,
outperforming Gemini-1.5-Pro-Preview-0514 and GPT-4.
[Figure 5: Berkeley Function Calling Leaderboard benchmark scores.]
                        Relevance   Average
AFM-server                 91.3       89.5
Gemini-1.5-Pro-0514        89.6       88.3
GPT-4                      82.9       86.2
AFM-on-device              81.0       80.2
GPT-3.5                     2.1       60.3
6.2.4 Writing
Writing is one of the most critical abilities for large language models, as it empowers downstream use cases such as change of tone, rewriting, and summarization. However, assessing writing quality is a non-trivial task that is not well covered by the public benchmarks above.
We evaluate AFM’s writing ability on our internal summarization and
composition benchmarks, consisting of a variety of writing instructions. Fol-
lowing LLM-as-a-judge [Zheng et al., 2024], we design a grading instruction
4 https://github.com/OAI/OpenAPI-Specification
for each summarization and composition task, and prompt GPT-4 Turbo to
assign a score from 1 to 10 for model responses.5 We note that there are
certain limitations and biases associated with using an LLM as a grader, such
as length bias.
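An LLM-as-a-judge setup of this kind can be sketched as a prompt builder and a score parser; the prompt wording and parsing rule below are illustrative and are not the actual grading instructions used for AFM.

    import re

    def build_judge_prompt(grading_instruction, task_input, model_response):
        return (
            f"{grading_instruction}\n\n"
            f"Task input:\n{task_input}\n\n"
            f"Model response:\n{model_response}\n\n"
            "Rate the response on a scale from 1 to 10. Reply with only the number."
        )

    def parse_score(judge_reply):
        # Take the first integer between 1 and 10 in the judge's reply.
        match = re.search(r"\b(10|[1-9])\b", judge_reply)
        return int(match.group(1)) if match else None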
We compare AFM with a few of the most outstanding models, along with
smaller-scale open-source models. As shown in Figure 6, AFM-on-device can
achieve comparable or superior performance when compared to Gemma-7B
and Mistral-7B. AFM-server significantly outperforms DBRX-Instruct and GPT-3.5 and is comparable to GPT-4.
[Figure 6: Writing benchmark scores (1–10). Summarization: AFM-on-device 9.1, AFM-server 9.5. Composition: Mistral-7B 9.1, GPT-4 9.7.]
6.2.5 Math
In Figure 7, we compare post-training AFM’s performance on math benchmarks
including GSM8K [Cobbe et al., 2021] and MATH [Hendrycks et al., 2021].
We use an 8-shot chain-of-thought (CoT) [Wei et al., 2022] prompt for GSM8K and a 4-shot CoT prompt [Lewkowycz et al., 2022] for MATH. We conduct all evaluations using an internal automated evaluation pipeline. We see that AFM-on-device significantly outperforms Mistral-7B and Gemma-7B, despite being less than half their size.
5 Due to the choice of using GPT-4 as judge, the score of GPT-4 Turbo can be overestimated.
[Figure 7: Math benchmark scores. GSM8K: Phi-3-mini 78.5, GPT-4 88.6. MATH: AFM-on-device 26.1, GPT-4 43.6.]
Datasets. We carefully sampled abundant payloads for each use case. These
evaluation datasets emphasize a diverse set of inputs which our product features
are likely to face in production, and include a stratified mixture of single and
stacked documents of varying content types and lengths. We developed a
pipeline to build evaluation datasets that simulate real user inputs.
Grading guidelines. During the evaluation task, graders are presented with a specification for the summary, the original input content, and the output summary. Graders assess the summary on each of the following sub-dimensions of quality using a 3-point scale (“good”, “neutral”, or “poor”):
Composition: Evaluates the overall readability of the summary consid-
ering grammar, punctuation, spelling, and brevity.
Comprehensiveness: Evaluates how comprehensive the summary is in
capturing the essential points or calling out any actions/conclusions for
the user.
Groundedness: Evaluates how grounded the summary is with respect to
the original payload. Summaries that are not completely grounded may
contain details that are exaggerated, inferred, inaccurate, or hallucinated.
Following instructions: Evaluates whether the summary meets specific
style and formatting requirements. Requirements are tailored to each
feature and reflect specific product and design expectations.
Harmfulness: Evaluates whether the summary contains content that is
harmful or unsafe according to Apple’s safety taxonomy.
A summary is classified as “poor” if any of the sub-dimensions are “poor”
according to predefined product specifications. Likewise, a summary is classified as “good”
only if all sub-dimensions are good. These classifications are used to compute
“Good/Poor Result Ratio” metrics defined as the percentage of good/poor
summaries out of all summaries.
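The classification rule and the resulting metrics can be written directly from this definition; the sketch below assumes grades are stored as a dictionary per summary.

    def classify_summary(grades):
        # grades maps each sub-dimension (composition, comprehensiveness, groundedness,
        # following instructions, harmfulness) to "good", "neutral", or "poor".
        if any(g == "poor" for g in grades.values()):
            return "poor"
        if all(g == "good" for g in grades.values()):
            return "good"
        return "neutral"

    def result_ratios(all_grades):
        labels = [classify_summary(g) for g in all_grades]
        n = len(labels)
        return {"good_ratio": labels.count("good") / n,
                "poor_ratio": labels.count("poor") / n}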
7 Responsible AI
7.1 Overview
Apple Intelligence is developed responsibly and designed with care to empower
our users, represent them authentically, and protect their privacy. Of primary
importance to our Responsible AI approach is that we are ultimately delivering
intelligent, well-defined tools that address specific user needs. Having a clear
definition of what a feature is intended to do allows us to better identify any
potential safety gaps.
We have developed a safety taxonomy in order to be comprehensive and
consistent in the design and evaluation of our generative AI-powered features.
This taxonomy builds on and extends Apple’s extensive experience in using ar-
tificial intelligence and machine learning to deliver helpful features to users
around the world, and is updated regularly as we develop and test features.
Currently, it consists of 12 primary categories comprising 51 subcategories,
including “Hate Speech, Stereotypes, and Slurs”, “Discrimination, Marginaliza-
tion, and Exclusion”, “Illegal Activities”, “Adult Sexual Material”, and “Graphic
Violence.”
The taxonomy serves as a structured way to consider potential issues and
risks relative to each specific feature. As new or additional risks are identified,
we develop and revise the associated policies that are contextualized to each
Human Satisfaction with Summarization Feature
                              Good ratio   Poor ratio
Message
  AFM-on-device + Adapter       63.0%        15.9%
  Gemma-7B                      51.1%        18.3%
  Phi-3-mini                    32.8%        31.4%
  Llama-3-8B                    21.3%        35.3%
Notification
  AFM-on-device + Adapter       74.9%        10.0%
  Gemma-7B                      60.9%        12.9%
  Phi-3-mini                    56.6%        18.3%
  Llama-3-8B                    27.7%        48.3%
Figure 8: Ratio of “good” and “poor” responses for three summarization use
cases relative to all responses. Summaries are classified as “good”, “neutral”,
or “poor” along five dimensions. A result is classified as “good” if all of the
dimensions are good (higher is better). A result is classified as “poor” if any of
the dimensions are poor (lower is better). Overall, our AFM-on-device adapter
generates better summaries than comparable models.
individual feature, taking into account the specific needs that it serves, the
content it produces, and the appropriate mitigations. They are developed
with extensive internal and external input from academics, AI ethicists, trust
and safety, and legal experts to better identify and understand the relevant
risks, the potential severity of such risks, and the potential disparate impact
these risks may have on certain groups. These policies guide our work in
data collection, human annotation, model training, guardrails development,
evaluation, and red teaming.
In particular, the taxonomy is not itself the sole determinant of our policy.
For example, content that may fall within the safety taxonomy is not necessarily
always blocked, as doing so unilaterally may be in conflict with other aspects
of Apple’s Responsible AI development principles, such as “respecting how our
users choose to use these tools to accomplish their goals.” Thus, features that
operate as tools may be more permissive in the kinds of content they operate
over and produce in order to effectively address the user’s intent. On the other
hand, features that may generate content beyond a user’s specified intent may
need to be more constrained. Regardless, we strive for some categories of harm
to always be treated with special care (such as any content that relates to self
harm) while other categories will always be blocked (such as illegal content).
In addition, our Responsible AI principles are built into every stage of Apple
Foundation Models and Apple Intelligence as well as the safety taxonomy, which
helps us evaluate risks and formulate policies feature by feature. We include
safety-oriented data as part of our fine-tuning of specific adapters tailored by use
case. Furthermore, at the time of inference, we also run guardrail models [Inan
et al., 2023] as pre- and post-processing steps to evaluate potential harm at
both the input and output level. Finally, we have mechanisms in place to
continuously and proactively improve our AI tools with the help of ongoing
user feedback.
7.2 Pre-Training
At the pre-training stage, we take several steps to ensure that the values as
outlined above are upheld. We follow a strict data policy ensuring that no
Apple user data is included, as well as conduct rigorous legal review for each
component in the training corpus. Further, we perform safety filtering to
reduce potentially harmful content, including NSFW content, profanity, spam,
and PII or financial data.
Because pre-training is a step which is shared among various downstream
features, our safety mitigations aim to retain general capabilities that allow us
to iterate on the taxonomy and policy at a per-feature level, without hurting
the helpfulness of these downstream models. We take learnings from prior
work to avoid overly aggressive filtering at the pre-training stage, which has
potential benefits in safety alignment [Touvron et al., 2023]. Intuitively, the
pre-trained model should be aware of content that downstream features and
policies may require it to handle – in some cases with care, or in other cases
operating over such content directly.
7.3 Post-Training
In the post-training phase, we aim to instill a baseline level of alignment with
our Responsible AI principles to avoid necessitating the full complexities of
post-training (such as RLHF) in each downstream model that builds on top of
the foundation model. In doing so, there are two key considerations:
1. We must ensure our models produce output that is helpful to users, while
minimizing potential harm.
vendors. We also incorporate safety tasks and benchmarks into the automatic
and human evaluations used during model development.
In total, over 10% of the training data are adversarial or related to safety
or sensitive topics, including single and multi-turn safety category annotations,
pairwise and overall preference ratings, and annotator rewrites. This data is
either used directly or as seed data for synthetic data generation, as described
in Section 4.1.2.
We do additional work to achieve appropriate safety behavior for each fea-
ture beyond baseline alignment. A primary way that we do this is by collecting
safety-specific training data and including it when fine-tuning adapters. For instance, in fine-tuning our summarization adapter, we sought to improve robustness against malicious questions embedded within the content to be summarized, and to reduce the likelihood that summaries would inadvertently amplify harmful or sensitive content.
A basic human red teaming task schema is as follows: a red teamer is
assigned a safety taxonomy category and attack vector(s). They author an
input to the model, using that attack vector, that is intended to elicit a response
containing content from that category. If the response does not contain the
target content, the red teamer can engage in a fixed number of conversational
turns, after which they provide a final harmfulness rating of the model output and list the taxonomy category or categories present in it, if any. To ensure annotation quality,
red teamers also provide an overall confidence score for their ratings.
In addition to red teaming at the base model level, we also red team
specific features. Red teaming projects at the feature level use feature-specific
guidelines with attack vectors informed by the feature’s safety policy and
engineering concerns. These projects can provide in-depth probing of known
risks for that particular feature and also adversarially probe for unknown
vulnerabilities.
Our red teaming projects are run using internal and external crowds. To
ensure responsible data collection, due to the sensitive nature of red teaming
we: 1) make red teaming completely voluntary; 2) impose a strict time limit
on how much each red teamer spends on the tasks per week; 3) provide health
and well-being resources available around the clock; and 4) maintain an open
line of communication with internal red teamers via weekly office hours and a
Slack channel for them to communicate any concerns that arise.
7.6 Evaluation
As mentioned in previous sections, safety is one of the many axes iterated
on during foundation model development, and therefore undergoes the same
automatic and human evaluation cycles during post-training.
Safety evaluation set To reduce noise, cost, and turn-around time during
human evaluations, we must ensure that our safety evaluation sets are clean,
yet challenging and comprehensive. To that end, we filter out “easy” prompts
which consistently yield low harmfulness responses across different versions of
the model, and employ an embedding-based analysis to improve our evaluation
prompt set coverage. Overall, we curate a set of over a thousand adversarial
prompts to test AFM’s performance on harmful content, sensitive topics, and
factuality according to our safety policy.
Human Evaluation of Output Harmfulness (violation rate; lower is better)
AFM-on-device   7.5%
AFM-server      6.3%
Human Preference Evaluation on Safety Prompts
                                AFM wins   Tie     AFM loses
AFM-on-device versus Gemma-7B    42.9%     41.5%    15.6%
AFM-server versus GPT-3.5        59.4%     29.2%    11.4%
8 Conclusion
In this report we introduced the foundation language models that power
Apple Intelligence features, AFM-on-device and AFM-server. The models are
designed to be fast and run efficiently on iPhone, iPad, and Mac as well as
on Apple silicon servers via Private Cloud Compute. They are trained to
be highly capable in tasks like language understanding, instruction following,
reasoning, writing, and tool use. We have developed an innovative model
architecture to specialize these models for our users’ most common tasks.
On top of the foundation models, feature-specific adapters are fine-tuned
to provide high-quality user experiences such as summarization of emails,
messages, and notifications. Our models have been created with the purpose
of helping users do everyday activities across their Apple products, grounded
in Apple’s core values, and rooted in our Responsible AI principles at every
stage. These foundation models are at the heart of Apple Intelligence, the
personal intelligence system built by Apple to continue empowering our users
and enriching their lives.
References
Yasin Abbasi-Yadkori, Peter Bartlett, Kush Bhatia, Nevena Lazic, Csaba
Szepesvari, and Gellért Weisz. Politex: Regret bounds for policy iteration
using expert prediction. In International Conference on Machine Learning,
pages 3692–3702. PMLR, 2019.
Apple. Private cloud compute: A new frontier for AI privacy in the cloud. https://security.apple.com/blog/private-cloud-compute/, 2024b. Accessed: 2024-07-11.
Mohammad Gheshlaghi Azar, Zhaohan Daniel Guo, Bilal Piot, Remi Munos,
Mark Rowland, Michal Valko, and Daniele Calandriello. A general theoretical
paradigm to understand learning from human preferences. In International
Conference on Artificial Intelligence and Statistics, pages 4447–4455. PMLR,
2024, arXiv:2310.12036.
James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris
Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas,
Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations
of Python+NumPy programs, 2018. URL http://github.com/google/jax.
Ralph Allan Bradley and Milton E Terry. Rank analysis of incomplete block
designs: I. the method of paired comparisons. Biometrika, 39(3/4):324–345,
1952.
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav
Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton,
Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko,
Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vin-
odkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope,
James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng
Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk
Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus,
Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph,
Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark
Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie
Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov,
Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz,
Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas
Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. Palm: Scaling language
modeling with pathways. 2022, arXiv:2204.02311.
Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William
Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma,
et al. Scaling instruction-finetuned language models. Journal of Machine
Learning Research, 25(70):1–53, 2024, arXiv:2210.11416.
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun,
Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro
Nakano, et al. Training verifiers to solve math word problems. 2021,
arXiv:2110.14168.
Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora:
Efficient finetuning of quantized LLMs. Advances in Neural Information
Processing Systems, 36, 2024, arXiv:2305.14314.
Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B Hashimoto.
Length-controlled alpacaeval: A simple way to debias automatic evaluators.
2024, arXiv:2404.04475.
Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq:
Accurate post-training quantization for generative pre-trained transformers.
2022, arXiv:2210.17323.
Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai,
Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal
Ndousse, et al. Red teaming language models to reduce harms: Meth-
ods, scaling behaviors, and lessons learned. 2022, arXiv:2209.07858.
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart,
Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical
problem solving with the math dataset. 2021, arXiv:2103.03874.
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a
neural network. 2015, arXiv:1503.02531.
Fred Hohman, Chaoqun Wang, Jinmook Lee, Jochen Görtler, Dominik Moritz,
Jeffrey P Bigham, Zhile Ren, Cecile Foret, Qi Shan, and Xiaoyi Zhang.
Talaria: Interactively optimizing machine learning models for efficient infer-
ence. In Proceedings of the CHI Conference on Human Factors in Computing
Systems, pages 1–19, 2024, arXiv:2404.03085.
Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang,
Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language
models. In International Conference on Learning Representations, 2021,
arXiv:2106.09685.
Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer,
Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine,
et al. Llama guard: LLM-based input-output safeguard for human-ai
conversations. 2023, arXiv:2312.06674.
Xiang Kong, Tom Gunter, and Ruoming Pang. Large language model-guided
document selection. 2024, arXiv:2406.04638.
Wouter Kool, Herke van Hoof, and Max Welling. Buy 4 REINFORCE samples,
get a baseline for free! 2019. URL https://openreview.net/forum?id=r1lgTGL5DE.
Nevena Lazic, Dong Yin, Yasin Abbasi-Yadkori, and Csaba Szepesvari. Im-
proved regret bound and experience replay in regularized policy iteration.
In International Conference on Machine Learning, pages 6032–6042. PMLR,
2021, arXiv:2102.12611.
Tao Lei, Junwen Bai, Siddhartha Brahma, Joshua Ainslie, Kenton Lee, Yanqi
Zhou, Nan Du, Vincent Zhao, Yuexin Wu, Bo Li, et al. Conditional adapters:
Parameter-efficient transfer learning with fast inference. Advances in Neural
Information Processing Systems, 36:8152–8172, 2023, arXiv:2304.04947.
Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Gadre,
Hritik Bansal, Etash Guha, Sedrick Keh, Kushal Arora, Saurabh Garg, Rui
Xin, Niklas Muennighoff, Reinhard Heckel, Jean Mercat, Mayee Chen, Suchin
Gururangan, Mitchell Wortsman, Alon Albalak, Yonatan Bitton, Marianna
Nezhurina, Amro Abbas, Cheng-Yu Hsieh, Dhruba Ghosh, Josh Gardner,
Maciej Kilian, Hanlin Zhang, Rulin Shao, Sarah Pratt, Sunny Sanyal, Gabriel
Ilharco, Giannis Daras, Kalyani Marathe, Aaron Gokaslan, Jieyu Zhang,
Khyathi Chandu, Thao Nguyen, Igor Vasiljevic, Sham Kakade, Shuran Song,
Sujay Sanghavi, Fartash Faghri, Sewoong Oh, Luke Zettlemoyer, Kyle Lo,
Alaaeldin El-Nouby, Hadi Pouransari, Alexander Toshev, Stephanie Wang,
Dirk Groeneveld, Luca Soldaini, Pang Wei Koh, Jenia Jitsev, Thomas Kollar,
Alexandros G. Dimakis, Yair Carmon, Achal Dave, Ludwig Schmidt, and
Vaishaal Shankar. Datacomp-lm: In search of the next generation of training
sets for language models. 2024a, arXiv:2406.11794.
Tianle* Li, Wei-Lin* Chiang, Evan Frick, Lisa Dunlap, Banghua Zhu, Joseph E.
Gonzalez, and Ion Stoica. From live data to high-quality benchmarks:
The arena-hard pipeline. April 2024b. URL https://lmsys.org/blog/2024-04-19-arena-hard/.
Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu,
Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya
Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian
Cosgrove, Christopher D. Manning, Christopher Ré, Diana Acosta-Navas,
Drew A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong,
Hongyu Ren, Huaxiu Yao, Jue Wang, Keshav Santhanam, Laurel Orr,
Lucia Zheng, Mert Yuksekgonul, Mirac Suzgun, Nathan Kim, Neel Guha,
Niladri Chatterji, Omar Khattab, Peter Henderson, Qian Huang, Ryan Chi,
Sang Michael Xie, Shibani Santurkar, Surya Ganguli, Tatsunori Hashimoto,
Thomas Icard, Tianyi Zhang, Vishrav Chaudhary, William Wang, Xuechen
Li, Yifan Mai, Yuhui Zhang, and Yuta Koreeda. Holistic evaluation of
language models. 2023, arXiv:2211.09110.
Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen
Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. Awq:
Activation-aware weight quantization for on-device llm compression and
acceleration. Proceedings of Machine Learning and Systems, 6:87–100, 2024,
arXiv:2306.00978.
Xiaoran Liu, Hang Yan, Shuo Zhang, Chenxin An, Xipeng Qiu, and Dahua
Lin. Scaling laws of rope-based extrapolation. 2024, arXiv:2310.05209.
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.
2019, arXiv:1711.05101.
Christos Louizos, Max Welling, and Diederik P. Kingma. Learning sparse
neural networks through L0 regularization. In International Conference on Learning Representations, 2018, arXiv:1712.01312. URL https://openreview.net/forum?id=H1Y8hhg0b.
Toan Q. Nguyen and Julian Salazar. Transformers without tears: Improving
the normalization of self-attention. In Jan Niehues, Rolando Cattoni, Sebas-
tian Stüker, Matteo Negri, Marco Turchi, Thanh-Le Ha, Elizabeth Salesky,
Ramon Sanabria, Loic Barrault, Lucia Specia, and Marcello Federico, edi-
tors, Proceedings of the 16th International Conference on Spoken Language
Translation, Hong Kong, November 2-3 2019. Association for Computational
Linguistics. URL https://aclanthology.org/2019.iwslt-1.17.
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright,
Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex
Ray, et al. Training language models to follow instructions with human
feedback. Advances in Neural Information Processing Systems, 35:27730–
27744, 2022, arXiv:2203.02155.
Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez.
Gorilla: Large language model connected with massive APIs. 2023,
arXiv:2305.15334.
Ofir Press and Lior Wolf. Using the output embedding to improve language
models. In Conference of the European Chapter of the Association for
Computational Linguistics, 2016, arXiv:1608.05859.
Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano
Ermon, and Chelsea Finn. Direct preference optimization: Your language
model is secretly a reward model. Advances in Neural Information Processing
Systems, 36, 2024, arXiv:2305.18290.
John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp
Moritz. Trust region policy optimization. In International Conference on
Machine Learning, pages 1889–1897. PMLR, 2015, arXiv:1502.05477.
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg
Klimov. Proximal policy optimization algorithms. 2017, arXiv:1707.06347.
Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng
Liu. Roformer: Enhanced transformer with rotary position embedding.
Neurocomputing, 568:127063, 2024.
Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li,
Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca:
An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi,
Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava,
Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya
Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin
Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony
Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor
Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh
Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich,
Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra,
Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi
Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith,
Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina
Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen
Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez,
Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open
foundation and fine-tuned chat models. 2023, arXiv:2307.09288.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all
you need. In Advances in Neural Information Processing Systems, 2017,
arXiv:1706.03762.
Ziheng Wang, Jeremy Wohlwend, and Tao Lei. Structured pruning of large
language models. In Proceedings of the 2020 Conference on Empirical
Methods in Natural Language Processing (EMNLP), pages 6151–6162, 2020,
arXiv:1910.04732. doi: 10.18653/v1/2020.emnlp-main.496.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi,
Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reason-
ing in large language models. Advances in Neural Information Processing
Systems, 35:24824–24837, 2022.
Mitchell Wortsman, Peter J. Liu, Lechao Xiao, Katie Everett, Alex Alemi, Ben
Adlam, John D. Co-Reyes, Izzeddin Gur, Abhishek Kumar, Roman Novak,
Jeffrey Pennington, Jascha Sohl-dickstein, Kelvin Xu, Jaehoon Lee, Justin
Gilmer, and Simon Kornblith. Small-scale proxies for large-scale transformer
training instabilities. 2023, arXiv:2309.14322.
Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, and Danqi Chen. Sheared llama:
Accelerating language model pre-training via structured pruning. 2023,
arXiv:2310.06694.
Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng,
Chongyang Tao, and Daxin Jiang. WizardLM: Empowering large language
models to follow complex instructions. 2023, arXiv:2304.12244.
Greg Yang, Edward J. Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu,
David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao.
Tensor programs v: Tuning large neural networks via zero-shot hyperparam-
eter transfer. 2022, arXiv:2203.03466.
Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang,
James T Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. Metamath:
Bootstrap your own mathematical questions for large language models. 2023,
arXiv:2309.12284.
Yi Zeng, Hongpeng Lin, Jingwen Zhang, Diyi Yang, Ruoxi Jia, and Weiyan Shi.
How johnny can persuade llms to jailbreak them: Rethinking persuasion
to challenge ai safety by humanizing llms, 2024, arXiv:2401.06373. URL
https://arxiv.org/abs/2401.06373.
Biao Zhang and Rico Sennrich. Root mean square layer normal-
ization. In Advances in Neural Information Processing Systems,
2019. URL https://proceedings.neurips.cc/paper_files/paper/2019/file/1e8a19426224ca89e83cef47f1e7f53b-Paper.pdf.
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu,
Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging
LLM-as-a-judge with MT-bench and chatbot arena. Advances in Neural
Information Processing Systems, 36, 2024.
Chunting Zhou, Pengfei Liu, Puxin Xu, Srinivasan Iyer, Jiao Sun, Yuning Mao,
Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, et al. LIMA: Less is more for
alignment. Advances in Neural Information Processing Systems, 36, 2024.
Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu,
Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for
large language models. 2023, arXiv:2311.07911.
Contributors
Within each section, contributors are listed in alphabetical order by first name.
Foundation Models
Andy Narayanan, Aonan Zhang, Bowen Zhang, Chen Chen, Chong Wang
(inference efficiency lead), Chung-Cheng Chiu, David Qiu, Deepak Gopinath,
Dian Ang Yap, Dong Yin, Feng Nan, Floris Weers, Guoli Yin, Haoshuo Huang,
Jianyu Wang, Jiarui Lu, John Peebles, Ke Ye, Mark Lee, Nan Du, Qibin Chen,
Quentin Keunebroek, Ruoming Pang (overall lead), Sam Wiseman, Syd Evans,
Tao Lei, Tom Gunter (pre-train lead), Vivek Rathod, Xiang Kong, Xianzhi
Du, Yanghao Li, Yongqiang Wang, Yuan Gao, Zaid Ahmed, Zhaoyang Xu,
Zhiyun Lu, Zirui Wang (post-train lead)
Appendix
and distillation methods can outperform a baseline model trained from scratch. For example, pruning and distillation achieve MMLU scores of 42.9% and 44.9%, respectively, whereas a baseline using 50% more steps gets 34.6%. It is also interesting that pruning achieves a higher score on the CoreEN benchmark, while distillation is better on MMLU. Finally, when combining these two methods, we observe further improvements on MMLU and GSM8K by a large margin, achieving results better than or on par with a baseline trained using 5× more computation.
D Long-context evaluation
Although the focus for this version of AFM was not to support context lengths
longer than 8k, in Table 9 we use the RULER [Hsieh et al., 2024] benchmark to
evaluate AFM-server at 4k to 32k context lengths. We note that the model is capable of performing perfectly at sequence lengths of up to at least 32k when tested against straightforward retrieval-like tasks, e.g., needle-in-a-haystack (NIAH). It is clear, however, that model performance gradually degrades as the context length increases.
AFM-on-device Core Continued Context lengthened
ARC_C 43.17 47.53 45.39
ARC_E 74.87 78.62 78.37
HellaSwag 54.70 55.50 55.24
LAMBADA 73.51 70.13 69.90
PIQA 77.37 78.67 78.40
SciQ 94.90 95.80 95.70
WinoGrande 65.82 67.32 67.01
TriviaQA (1 shot) 42.46 39.13 38.11
WebQS (1 shot) 19.24 18.06 17.22
CoreEN average 60.67 61.20 60.59
MMLU (5 shot) 57.00 61.35 60.64
GSM8K (8 shot CoT) 27.45 42.53 40.00
MATH (4 shot CoT) 8.31 16.97 15.48
HumanEval-Py pass@1 16.48 27.38 30.84
MultiPLE-Swift pass@1 8.88 19.24 18.06
AFM-server Core Continued Context lengthened
ARC_C 58.28 58.87 57.94
ARC_E 85.61 85.44 85.06
HellaSwag 64.17 64.53 64.37
LAMBADA 78.38 77.59 77.82
PIQA 82.37 81.99 81.88
SciQ 96.60 97.10 97.00
WinoGrande 80.51 79.16 79.08
TriviaQA (1 shot) 54.33 53.57 53.42
WebQS (1 shot) 29.97 27.66 27.41
CoreEN average 70.02 69.55 69.33
MMLU (5 shot) 74.00 75.24 74.80
GSM8K (8 shot CoT) 75.44 74.83 75.51
MATH (4 shot CoT) 32.24 36.48 35.77
HumanEval-Py 33.23 40.77 39.55
MultiPLE-Swift 30.15 37.70 38.11
Table 9: RULER [Hsieh et al., 2024] average evaluation results, averaged over
13 synthetic long-context tasks using 500 examples per task.
In our reward modeling, the preference level ℓ takes 4 possible values, indi-
cating that the chosen response is negligibly better, slightly better, better,
or significantly better than the rejected response. As for the single-sided gradings, each label, e.g., z_c^if, takes 3 possible values. For instruction following, truthfulness, and harmlessness, the 3 values correspond to the cases where the response has a major issue, a minor issue, or no issue. For verbosity, the 3 values
correspond to the cases where the response is too verbose, too short, or just
right.
We use a multi-head architecture for the reward model. More specifically,
we take a decoder-only transformer and obtain the last-layer embedding of
the last non-padding token. We attach one linear and four MLP heads to the
embedding. Denote the model parameters by ϕ and the input prompt-response
pair by (x, y). The linear head outputs the preference reward rϕ (x, y) ∈ R.
The four MLP heads are classification heads representing the instruction-
following, verbosity, truthfulness, and harmlessness property of the response.
We denote the output logits of the 4 classification heads by u_ϕ^if, u_ϕ^verb, u_ϕ^truth, and u_ϕ^harm, respectively.
Soft label loss. We train the preference reward r_ϕ(x, y) based on the Bradley-Terry-Luce (BTL) model [Bradley and Terry, 1952]. Recall that in the BTL model, the probability that y_c is preferred over y_r is modeled as σ(r_ϕ(x, y_c) − r_ϕ(x, y_r)), where σ is the sigmoid function. Intuitively, this probability should be larger if the preferred response y_c is annotated as significantly better than the rejected response y_r, and smaller if y_c is only negligibly better than y_r. We incorporate this information using the preference level ℓ. More specifically, for each preference level ℓ, we design a target preference probability p_ℓ, and train with a soft-label loss against this target rather than against a hard 0/1 label.
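A sketch of such a soft-label Bradley-Terry loss is shown below; the target probabilities assigned to each preference level are placeholders, not the values used for AFM.

    import numpy as np

    # Placeholder target preference probabilities p_level for each annotated level.
    TARGET_P = {"negligibly_better": 0.55, "slightly_better": 0.70,
                "better": 0.85, "significantly_better": 0.95}

    def soft_label_preference_loss(r_chosen, r_rejected, level):
        p = TARGET_P[level]
        prob_chosen = 1.0 / (1.0 + np.exp(-(r_chosen - r_rejected)))  # sigma(r_c - r_r)
        # Cross-entropy between the target probability and the modeled probability.
        return -(p * np.log(prob_chosen) + (1.0 - p) * np.log(1.0 - prob_chosen))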
Leave-One-Out (LOO) estimator of the advantage. In each iteration of the
algorithm, we have a data collection stage and a policy updating stage. Let θk
be the model parameter at the beginning of the k-th iteration. We sample a
batch of n prompts from our prompt set, and for each prompt, we sample K
responses according to the policy πθk , and thus collecting a total of nK data
points in each iteration. Let x be a prompt and yi be one of the responses.
Since we consider the bandit setting, by definition, the advantage of (x, y_i) is A_k(x, y_i) = R(x, y_i) − E_{y∼π_θk(·|x)}[R(x, y)]. We use the leave-one-out (LOO) method [Kool et al., 2019] to estimate A_k(x, y_i).
Namely, we estimate the mean reward given the prompt x with the other K − 1
responses, i.e.,
Â_k(x, y_i) = R(x, y_i) − (1/(K − 1)) Σ_{j≠i} R(x, y_j).    (7)
Note that here the KL regularization term is different from the one in Eq. (1).
The KL regularization in Eq. (1) is between the policy model and the reference
model; whereas the KL regularization term in Eq. (8) is between the policy
model and the policy at the beginning of the k-th iteration. Then we can
obtain the gradient of Ψ(θ) as
∇Ψ(θ) = E_{x∼D, y∼π_θk(·|x)} [ (π_θ(y|x) / π_θk(y|x)) A_k(x, y) ∇ log π_θ(y|x) ] − γ E_{x∼D} [ ∇ D_KL(π_θ(·|x) ∥ π_θk(·|x)) ].    (9)
The MDLOO algorithm can be derived by replacing the expectations in Eq. (9) with the nK samples collected with π_θk, and the advantage A_k(x, y) with the LOO estimator Â_k(x, y) in Eq. (7). Empirically, we find that MDLOO works better than the popular PPO [Schulman et al., 2017] algorithm in our setting.
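For concreteness, the leave-one-out advantage of Eq. (7) can be computed for all K responses to a prompt as follows.

    import numpy as np

    def loo_advantages(rewards):
        # rewards: shaped rewards R(x, y_i) for the K responses sampled for one prompt.
        rewards = np.asarray(rewards, dtype=np.float64)
        K = rewards.shape[0]
        baselines = (rewards.sum() - rewards) / (K - 1)  # mean of the other K-1 rewards
        return rewards - baselines

    # Example with K = 4 sampled responses.
    print(loo_advantages([1.0, 0.5, 0.2, 0.9]))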