Automatically Interpreting Millions of Features in Large Language Models
ABSTRACT
While the activations of neurons in deep neural networks usually do not have a
simple human-understandable interpretation, sparse autoencoders (SAEs) can be
used to transform these activations into a higher-dimensional latent space which
may be more easily interpretable. However, these SAEs can have millions of
distinct latent features, making it infeasible for humans to manually interpret each
one. In this work, we build an open-source automated pipeline to generate and
evaluate natural language explanations for SAE features using LLMs. We test
our framework on SAEs of varying sizes, activation functions, and losses, trained
on two different open-weight LLMs. We introduce five new techniques to score
the quality of explanations that are cheaper to run than the previous state of the
art. One of these techniques, intervention scoring, evaluates the interpretability
of the effects of intervening on a feature, which we find explains features that
are not recalled by existing methods. We propose guidelines for generating better
explanations that remain valid for a broader set of activating contexts, and discuss
pitfalls with existing scoring techniques. We use our explanations to measure the
semantic similarity of independently trained SAEs, and find that SAEs trained on
nearby layers of the residual stream are highly similar. Our large-scale analysis
confirms that SAE latents are indeed much more interpretable than neurons, even
when neurons are sparsified using top-k postprocessing. Our code is available
here, and our explanations are available here.
1 INTRODUCTION
Large language models (LLMs) have reached human level performance in a broad range of domains
(OpenAI, 2023), and can even be leveraged to develop agents (Wang et al., 2023) that can strategize
(Bakhtin et al., 2022), cooperate and develop new ideas (Lu et al., 2024; Shaham et al., 2024). At the
same time, we understand little about the internal representations driving their behavior. Early mech-
anistic interpretability research focused on analyzing the activation patterns of individual neurons
(Olah et al., 2020; Gurnee et al., 2023; 2024). Due to the large number of neurons to be interpreted,
automated approaches were proposed (Bills et al., 2023), where a second LLM is used to propose an
explanation given a set of neuron activations and the text snippets it activates on, in a process similar
to that of generating a human label. But research has found that most neurons are “polysemantic,”
activating in contexts that can be very different (Arora et al., 2018; Elhage et al., 2022). The Linear
Representation Hypothesis (Park et al., 2023) posits that human-interpretable concepts are encoded
in linear combinations of neurons. A significant branch of current interpretability work focuses on
extracting these features and disentangling them (Bereska & Gavves, 2024).
Sparse autoencoders (SAEs) were proposed as a way to address polysemanticity (Cunningham et al.,
2023). SAEs consist of two parts: an encoder that transforms activation vectors into a sparse, higher-
dimensional latent space, and a decoder that projects the latents back into the original space. Both
parts are trained jointly to minimize reconstruction error. SAE latents were found to be interpretable
and potentially more monosemantic than neurons (Bricken et al., 2023; Cunningham et al., 2023).
Recently, a significant effort was made to scale SAE training to larger models, like GPT-4 (Gao et al.,
2024) and Claude (Templeton et al., 2024), and they have become an important interpretability tool
for LLMs.
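To make the encoder-decoder structure concrete, the following is a minimal sketch of an SAE in PyTorch. The class name, the dimensions, and the choice of a TopK activation are illustrative assumptions for exposition, not the exact architecture of any particular released SAE.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    # Minimal TopK SAE: encode activations into a sparse, higher-dimensional
    # latent space, then decode back into the original activation space.
    def __init__(self, d_model: int, n_latents: int, k: int = 50):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_latents)
        self.decoder = nn.Linear(n_latents, d_model)
        self.k = k  # number of latents kept active per token

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        pre = self.encoder(x)
        topk = torch.topk(pre, self.k, dim=-1)
        latents = torch.zeros_like(pre)
        # Keep only the k largest pre-activations; zero out everything else.
        latents.scatter_(-1, topk.indices, torch.relu(topk.values))
        return latents

    def forward(self, x: torch.Tensor):
        latents = self.encode(x)
        return self.decoder(latents), latents

# Both parts are trained jointly to minimize reconstruction error.
sae = SparseAutoencoder(d_model=4096, n_latents=131_072, k=50)
x = torch.randn(8, 4096)  # a batch of residual-stream activation vectors
recon, latents = sae(x)
loss = F.mse_loss(recon, x)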
Training an SAE yields a large number of sparse latent features, each of which needs a natural lan-
guage explanation. In this work, we introduce an automated framework that uses LLMs to generate
an explanation for each latent in an SAE. We use this framework to explain millions of latents across
multiple models, layers, and SAE architectures. We also propose new ways to evaluate the quality
of explanations, and discuss the problems with existing approaches.
Although other repositories of explanations for SAE features already exist (Lin & Bloom, 2023),
we believe it would be helpful for the community to provide high quality explanations, using open
source models, in a way that is reproducible. This kind of exhaustive explanatory work could enable more downstream applications of SAEs, such as steering and concept localization.
Explaining the latents of SAEs trained on models like Llama 3.1 8b or Gemma 2 9b requires the
generation of millions of explanations. As an example, the most extensive open-source set of SAEs
available, Gemmascope (Lieberum et al., 2024), includes SAEs for all layers of Gemma 2 9b and
Gemma 2 2b and would require explaining tens of millions of latents. Doing the same for bigger
models like Llama 3.1 70b or 405b could easily reach hundreds of millions, if not billions, of latents
to be explained, meaning that optimizing the process to obtain these explanations is essential.
2 RELATED WORK
One of the first approaches to automated interpretability focused on explaining neurons of GPT-2
using GPT-4 (Bills et al., 2023). GPT-4 was shown examples of contexts where a given neuron was
active and was tasked to provide a short explanation that could capture the activation patterns. To
evaluate if a given explanation captured the behavior of the neuron, GPT-4 was tasked to predict the
activations of the neuron in a given context having access to that explanation. The explanation is
then scored by how much the simulated activations correlate with the true activations.
A similar approach was used in Templeton et al. (2024) to explain SAE latents of Claude. In general,
current approaches focus on collecting contexts together with latent activations from the model to
be explained, and use a larger model to find patterns in activating contexts.
Following those works, other methods of evaluating explanations have been proposed, including
asking a model to generate activating examples and measuring how much the neuron activates (Kopf
et al., 2024; Hernandez et al., 2022). More recently, “interpretability agents” have been proposed,
iteratively doing experiments to find the best explanations of vision neurons (Shaham et al., 2024).
On the other end of the spectrum, a potentially cheaper version of automated interpretability has
been proposed, where the model that is being explained doubles as an explanation generation model
(Kharlapenko et al., 2024). A prompt querying the meaning of a single placeholder token is passed to
the model and latent activations are patched into its residual stream at the position of the placeholder
token during execution, generating continuations related to the latent. This technique is inspired by
earlier work on Patchscopes (Ghandeharioun et al., 2024) and SelfIE (Chen et al., 2024).
3 METHODS
We collected latent activations from the SAEs over a 10M token sample of RedPajama-v2 (RPJv2;
Computer 2023). This is the same dataset that we used to train our Llama 3.1 8b SAEs, and consists
of a data mix similar to the Llama 1 pretraining corpus.
We collected batches of 256 tokens starting with the beginning of sentence token (BOS).1 The contexts used for activation collection are smaller than those used to train the SAEs, and we find that, on average per layer, 30% of the latents of the 131k-latent Gemma 2 9b SAEs don't activate more than 200 times over these 10M tokens and 15% don't activate at all. When we consider the
1 We throw out the activations on the BOS token when generating explanations for Gemma, as we were told in personal communication that these SAEs were not trained on BOS activations.
Figure 1: SAE latent explanations in a random sentence. To visualize the latent explanations produced, we selected a sentence from the RPJv2 dataset, chose 4 tokens at different positions in that sentence, and filtered for latents that are active in different layers. We then randomly selected active latents and display their corresponding explanations, together with the detection and fuzzing scores of each explanation, which indicate how well it explains other examples in the dataset (see section 3 for details on these scores). The latents selected had high activation, but were not cherry-picked based on explanations or scores.
training context length of 1024, only 5% of latents don’t fire. When using a closer proxy of the
training data, the “un-copyrighted” Pile, we find that the number of latents that activate fewer than
200 times decreases to 15%, and that only 1% don’t activate at all, even when considering contexts
of size 256. Interestingly, in the case of the 16k latent Gemma 2 9b SAE only around 10% of la-
tents activate fewer than 200 times on the RPJv2 sample, suggesting that the larger 131k latent SAE
learned more dataset specific latents than its smaller cousin.
Our approach follows Bills et al. (2023) in showing the explainer model examples sampled from
different quantiles, but uses a more “natural” prompt where the activating example is shown whole,
with the activating tokens emphasized and the strength of the activation shown after the example.
See Appendix A.1 for the full prompt.
Each activating example is selected to have only 32 tokens, irrespective of the position of the activating tokens in the example. We found that using chain of thought (CoT), at least when explaining with Llama 3.1 70b, does not significantly increase the quality of the explanations, while significantly increasing the compute and time required to generate them; for this reason, we did not use it in our main experiments. Showing such short contexts to the explainer model does, however, hinder the correct identification of latents with complex activation patterns.
We find that randomly sampling from a broader set of examples leads to explanations that cover
a larger set of activating examples, sometimes to the detriment of the top activating examples, see
fig. 3. Sampling from top examples often generates more concise and specific explanations that
accurately describe the top activating examples but fail to capture the whole distribution. On the
other hand, uniformly sampling examples that activate the feature, or ensuring that chosen examples
cover all activation strength quantiles, can lead to explanations that are too broad to be meaningful. For examples of these failure modes, see fig. A1 and the discussion in appendix A.2.
Focusing on maximally activating contexts to generate and score explanations biases the types of
explanations found and the types of explanations that are highly scored.
Automatic interpretability methods, including most of those explored in this paper, typically look
for correlations between activation of a feature and some natural-language property of the input.
However, some features are more closely related to what the model will output. For example, we
find a feature2 whose activation causes the model to output words associated with reputation but
does not have a simple explanation in terms of inputs.
We define an output feature as a feature that causes a property of the model’s output that can be
easily explained in natural language. See sec 3.5.5 for the definition we use in scoring.
Output features can also be described in terms of their correlation with inputs. For example, the
“reputation” feature activates in contexts where likely next tokens relate to reputation. However,
explaining output features in terms of causal influence on output has two advantages.
1. Scalability. Output features are easier to describe in terms of effect on output because
the explainer only needs to notice a simple pattern of output. The pattern of inputs this
feature correlates with is more complex. Explaining a feature by correlating it with inputs
requires approximating the computation of the subject model leading up to that feature,
which may be challenging when subject models are highly capable and performing difficult
tasks. Some features’ influence on output, however, might remain easily explainable.
2. Causal evidence. We might like to know that a feature causes a property of the model’s
output so that we can steer the model. Previous work has shown that existing auto-
interpretability scores fail to accurately capture how well a given explanation can predict
the effect of intervening on a given neuron (Huang et al., 2023). Further, prior work has ar-
gued that causal evidence is more robust to distribution shifts (Bühlmann, 2018; Schölkopf
et al., 2012).
Being able to efficiently evaluate explanations of SAE latents is important for a few reasons. First,
practitioners would like to know how faithful each explanation is to the actual behavior of the net-
work. Some latents may not be amenable to a simple explanation, and in these cases we expect the
auto-interpretability pipeline to output an explanation with poor evaluation metrics. Secondly, we
can use evaluations as a feedback signal for the training of explainer models and tuning hyperpa-
rameters in the pipeline. Finally, some SAEs may simply be poor quality overall, and the aggregate
evaluation metrics of the SAE’s explanations can be used to detect this.
The quality of SAE explanations has so far been measured via simulation scoring, a technique intro-
duced by Bills et al. (2023) for evaluating explanations of neurons. It involves asking a scorer model to "simulate" a feature in a set of contexts, which means predicting how strongly the feature
should activate at each token in each context given an explanation. The Pearson correlation between
the simulated and true activations is the simulation score. The standard in the literature is to sample
contexts from a “top-and-random” distribution that mixes maximally activating contexts and con-
texts sampled uniformly at random from the corpus. While oversampling the top activating contexts
introduces bias, it is used as a cheap variance reduction technique, given that simulation scoring over
hundreds of examples per feature would be expensive (Cunningham et al., 2023).
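For reference, simulation scoring reduces to a Pearson correlation once the simulator's per-token predictions are available. The sketch below assumes a separate (hypothetical) routine has already produced simulated activations from the explanation.

import numpy as np

def simulation_score(true_activations, simulated_activations) -> float:
    # Simulation score (Bills et al., 2023): Pearson correlation between the
    # activations predicted from the explanation and the true activations.
    true_flat = np.asarray(true_activations, dtype=float).ravel()
    sim_flat = np.asarray(simulated_activations, dtype=float).ravel()
    if true_flat.std() == 0 or sim_flat.std() == 0:
        return 0.0  # correlation is undefined for constant sequences
    return float(np.corrcoef(true_flat, sim_flat)[0, 1])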
In this work, we take a slightly different view of what makes for a good explanation: an explanation
should serve as a binary classifier distinguishing activating from non-activating contexts. The rea-
soning behind this is simple. Given the highly sparse nature of SAE latents, most of their variance
could be captured by a binary predictor that predicts the mean nonzero activation when the latent is
2 Feature 157 of Gemma 2 9b's layer 32 autoencoder with 131k latents and an average L0 norm of 51. It has a detection score of 0.6.
expected to be active, and zero otherwise. In statistics, zero-inflated phenomena are often modeled
as a mixture distribution with two components: a Dirac delta on zero, and another distribution (e.g.
Poisson) for nonzero values.
Figure 2: The new proposed scoring methods. In detection scoring, the scorer model is tasked
with selecting the set of sentences that activate a given latent given an explanation. In this work,
we show 5 examples at the same time, and each has an identical probability of being a sentence
that activates the latent, independent of whether any other example also activates the latent. The
activating tokens are colored in green for display, but that information is not shown to the scorer
model. For fuzzing scoring, the scorer model is tasked with selecting the sentences where the
highlighted tokens are the tokens that activate a target latent given an explanation of that latent.
In surprisal scoring, activating and non-activating examples are run through the model and the
loss over those sentences is computed. Correct explanations should decrease the loss in activating
sentences compared to a generic explanation, but shouldn’t significantly decrease the loss in non-
activating sentences. For embedding scoring, activating and non-activating sentences are embedded
as “documents” that should be retrieved using the explanation as a query.
As an alternative to simulation scoring, we introduce four new evaluation methods that focus on how
well an explanation enables a scorer to discriminate between activating and non-activating contexts.
As an added benefit, all of these methods are more compute-efficient than simulation.
Currently, there is no consensus over what makes an ideal explanation. In this work, we chose to
focus on the idea that the explanation of the feature should accurately distinguish between which
contexts the feature is active and which contexts the feature is not active.
3.5.1 DETECTION
One simple approach to scoring explanations is to ask a language model to identify whether a whole sequence activates an SAE latent, given an explanation. Currently, the bottleneck of scoring is the autoregressive nature of token generation. Detection requires few output tokens per scored example, meaning examples from a wider distribution of latent activation strengths can be used at the same expense. By including non-activating contexts, this method measures both the precision and recall of the explanation.
Detection is more “forgiving” than simulation insofar as the scorer does not need to localize the
feature to a particular token.
Both this method and fuzzing (described next) can leverage token probabilities to estimate how certain the scorer model is of its classification, an effect that we believe can be used to improve the scoring methods. Details on the prompt are in Appendix A.3.1.
3.5.2 FUZZING
Fuzzing is similar to detection, but at the level of individual tokens. Here, potentially activating
tokens are delimited in each example and the language model is prompted to identify which of
the sentences are correctly marked. Evaluating an explanation on both detection and fuzzing can
identify whether a model is classifying examples for the correct reason.
While detection uses non-activating examples, fuzzing uses activating contexts in which we randomly mark non-activating tokens to create negative examples. Details on the prompt are in Appendix A.3.2.
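The sketch below shows one way the highlighted spans could be produced for the fuzzing scorer; the << >> delimiters follow the prompt format in the appendix, and the helper itself is an illustrative assumption.

def delimit_tokens(tokens: list, marked: set) -> str:
    # Wrap contiguous runs of marked token positions in << >> delimiters. For
    # positive examples, `marked` holds the genuinely activating positions;
    # for negative examples, it holds randomly chosen non-activating positions.
    out, i = [], 0
    while i < len(tokens):
        if i in marked:
            j = i
            while j in marked:
                j += 1
            out.append("<<" + "".join(tokens[i:j]) + ">>")
            i = j
        else:
            out.append(tokens[i])
            i += 1
    return "".join(out)

# delimit_tokens(["and ", "he ", "was ", "over the moon", " to find"], {3})
# -> "and he was <<over the moon>> to find"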
3.5.3 SURPRISAL
Surprisal scoring is based on the idea that a good explanation should help a base language model
M (the “scorer”) achieve lower cross-entropy loss on activating contexts than it would without a
relevant explanation. Specifically, for each context x, activating or non-activating, we measure the
information value of an explanation $z$ as $\log p_M(x \mid z) - \log p_M(x \mid \tilde{z})$, where $\tilde{z}$ is a fixed pseudo-
explanation. A good explanation should have higher information value on activating examples than
on non-activating ones. The overall surprisal score of the explanation is given by the AUROC of its
information value when viewed as a classifier distinguishing activating from non-activating contexts.
Details on the prompt and how to compute the score in Appendix A.3.3.
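A minimal sketch of surprisal scoring under these definitions; the logprob(prefix, text) helper, assumed to return the scorer model's total log-probability of text conditioned on prefix, is a hypothetical stand-in for a real LM call.

from sklearn.metrics import roc_auc_score

def surprisal_score(logprob, explanation, activating, non_activating,
                    pseudo_explanation="Various unrelated sentences."):
    def info_value(example):
        # Information value: log p(x | z) - log p(x | z_tilde), i.e. how much
        # the real explanation helps predict the example relative to a generic one.
        return logprob(explanation, example) - logprob(pseudo_explanation, example)

    scores = [info_value(x) for x in activating + non_activating]
    labels = [1] * len(activating) + [0] * len(non_activating)
    # AUROC of the information value as a classifier of activating contexts.
    return float(roc_auc_score(labels, scores))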
3.5.4 EMBEDDING
We can imagine explanations of latents as “queries” that should be able to retrieve contexts where
the latent is active. These contexts can act as “documents” that are embedded by an encoding trans-
former, and we can use the similarities between the query and the documents to distinguish between
activating and non-activating contexts. If the encoding model is small enough, this technique is the fastest of our scoring methods and opens up the possibility of evaluating a larger fraction of the activation distribution. We find that using a larger embedding model did not significantly improve the scores (see fig. A2), although we believe this approach is under-explored. Details on the prompt, the embedding model, and the way the score is computed are in Appendix A.3.4.
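A minimal sketch of embedding scoring; the sentence-transformers model name is a placeholder and not necessarily the encoder used in this work.

from sklearn.metrics import roc_auc_score
from sentence_transformers import SentenceTransformer

def embedding_score(explanation, activating, non_activating,
                    model_name="all-MiniLM-L6-v2"):
    # Treat the explanation as a query and the contexts as documents, then ask
    # whether query-document similarity separates activating from
    # non-activating contexts.
    model = SentenceTransformer(model_name)
    docs = activating + non_activating
    query_emb = model.encode([explanation], normalize_embeddings=True)
    doc_embs = model.encode(docs, normalize_embeddings=True)
    sims = (doc_embs @ query_emb.T).ravel()  # cosine similarity per document
    labels = [1] * len(activating) + [0] * len(non_activating)
    return float(roc_auc_score(labels, sims))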
3.5.5 INTERVENTION SCORING

G(x) is the distribution over subject model generations given prompt x at temperature 1. In G_I(x), the intervention is applied to the subject model as it generates. In practice, we estimate the intervention score S by sampling one clean and one intervened generation for each sampled prompt.
Sufficiently strong interventions can be trivially interpretable by causing the model to deterministi-
cally output some logit distribution. Therefore, interpretability scores of interventions should be
compared for interventions of a fixed strength. We define the strength σ of an arbitrary interven-
tion I as the average KL-divergence of the model’s intervened logit distribution with reference to
the model’s clean output.
$\sigma(I; \pi) = \mathbb{E}_{x \sim \pi}\, D_{\mathrm{KL}}\!\left( p_{\mathrm{subject}}(\cdot \mid x) \,\|\, p_{\mathrm{subject},I}(\cdot \mid x) \right)$   (2)
See Appendix A.7 for details on the explanation and scoring pipeline we use in our intervention
experiments.
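The intervention strength in eq. (2) can be estimated from clean and intervened next-token logits as sketched below; in practice the intervention magnitude is then tuned (for example by the binary search described in Appendix A.7) until this estimate reaches a target KL value. The function signature is illustrative.

import torch
import torch.nn.functional as F

def intervention_strength(clean_logits: torch.Tensor, intervened_logits: torch.Tensor) -> float:
    # Both tensors: (num_prompts, vocab_size), logits at the final prompt
    # position for prompts sampled from pi. Returns the average KL divergence
    # KL(p_subject || p_subject,I) from eq. (2).
    clean_logprobs = F.log_softmax(clean_logits, dim=-1)
    intervened_logprobs = F.log_softmax(intervened_logits, dim=-1)
    kl = F.kl_div(intervened_logprobs, clean_logprobs,
                  log_target=True, reduction="none").sum(-1)
    return float(kl.mean())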
Table 1: Spearman correlation computed over 600 different latent scores

             Fuzzing   Detection   Simulation   Embedding   Surprisal
Fuzzing        1         0.73         0.70         0.41        0.30
Detection                1            0.44         0.71        0.62
Simulation                            1            0.33        0.20
Embedding                                           1          0.79
Surprisal                                                       1
4 RESULTS
Since simulation scoring is an established method, we measure how our other context-based scoring
techniques correlate with simulation, as well as how they correlate between themselves (see tables
1 and A1).
We find that fuzzing and detection have the highest Spearman correlations with simulation scoring,
at 0.70 and 0.44 respectively. The imperfect correlations hint at either shortcomings of the scoring
metrics or the fact that these metrics can measure different qualities of explanations.
Simulation correlates more strongly with fuzzing than with detection, which is expected because
fuzzing is similar to simulation scoring. It correlates weakly with surprisal and embedding scoring.
On the other hand, surprisal and embedding scores correlate more with detection scoring than with
fuzzing, and most strongly with each other. Due to the speed of embedding scoring, we believe it to
be a potentially scalable scoring technique for quick iteration.
In figure 4, we see that the intervention score we propose is a valuable contribution to the set of
automatic interpretability metrics because it (a) distinguishes between features from a trained SAE
and random features and (b) recalls features that context-based scoring methods fail to interpret.
We use a set of 500+ latents as a testbed to measure the effects of design choices and hyperpa-
rameters on explanation quality. Each latent is scored using 100 activating and 100 non-activating
examples. The activating examples are chosen via stratified sampling such that there are always 10
examples from each of the 10 deciles of the activation distribution.
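A minimal sketch of this decile-stratified sampling; the function and its signature are illustrative.

import random
import numpy as np

def stratified_sample(examples, activations, per_decile=10, seed=0):
    # Pick `per_decile` examples from each of the 10 deciles of the activation
    # distribution, so the scoring set covers weakly and strongly activating
    # contexts alike.
    rng = random.Random(seed)
    acts = np.asarray(activations, dtype=float)
    edges = np.quantile(acts, np.linspace(0.0, 1.0, 11))
    buckets = np.digitize(acts, edges[1:-1])  # decile index 0..9 per example
    chosen = []
    for d in range(10):
        bucket = [ex for ex, b in zip(examples, buckets) if b == d]
        chosen.extend(rng.sample(bucket, min(per_decile, len(bucket))))
    return chosen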
We evaluate explanation quality using fuzzing, detection, and embedding scores, as these are quick to compute and easy to interpret. We expect this mix of scores to correlate well with simulation scoring and to reflect the extent to which a proposed explanation is valid over both activating and non-activating examples.
Table A6 shows that explanations based on more examples tend to have higher scores. Sampling the examples shown to the explainer model, instead of showing only the top activating examples, increases the scores of the explanations (see fig. 3 and table A5). This effect would not be seen if scoring were done only on the most activating examples, underscoring a problem with current auto-interpretability evaluations, which produce explanations using top activating examples and evaluate them on a small subset of the activation distribution.
Increasing the size of the explainer model increases the scores of the explanations, but we do not find the performance of Claude Sonnet 3.5 to be much higher than that of Llama 3.1 70b (see table A7), and both are similar to explanations written by a human. We expect that this may be because the prompting techniques were first optimized for Llama 3.1, and that there could be gains from optimizing the prompts for Claude. Unsurprisingly, we also see that, even for fuzzing and detection, using a smaller scorer model leads to lower scores.
Figure 3: Fuzzing and detection scores for different sampling techniques. Panels a) and b)
show the distributions of fuzzing and detection scores, respectively, as a function of different ex-
ample sampling methods for explanation generation. Sampling only from the top activations yields on average lower accuracy in both fuzzing and detection than random sampling or sampling from quantiles. The distributions from random sampling and sampling from quan-
tiles are very similar. Panels c) and d) measure how the explanations generalize across activation
quantiles, showing that explanations generated from the top quantiles are better at distinguishing
non-activating examples in detection, but have lower accuracy on other quantiles, especially on the
lower activating deciles. This also happens with explanations generated from examples sampled
randomly and from quantiles, but the accuracy does not drop as much in lower activating deciles.
In tables A2 to A4, we do not find a significant dependence on whether we use the same dataset as the training set of the SAEs (table A2), whether we use chain of thought (table A3), or the size of the contexts shown (table A4).
SAEs with more latents have higher scores, and their scores are significantly higher than those of neurons, which are only slightly better than randomly initialized SAEs (see table A8). Neurons are more interpretable if made sparser by considering only the top k most activated neurons on a given token, but they still significantly underperform SAEs in our tests (see table A8).
The location of the SAEs matters: residual stream SAEs have slightly better scores than ones trained on MLP outputs (see table A8). We also observe that earlier layers have lower overall scores, but that scores remain roughly constant across the remaining model depth (see A4).
A detailed analysis of various factors in the auto-interpretability pipeline is in Appendix A.5.
[Figure 4 plots (layer 32): log-scale densities of intervention scores at target KL 0.1 and 3.0 for SAE latents, a random TopK SAE, and random explanations; one panel shows the trend line y = -2.71x + 2.79 (r = -0.37).]
Figure 4: Intervention scores. Here we present intervention scores (Sec 3.5.5) for SAE features in
Gemma 2 9B at layer 32. Left: SAE features are more interpretable than random features, especially
when intervening more strongly. Our explainer also produces explanations that are scored higher
than random explanations. Right: Many features that would normally be uninterpreted when using
context-based automatic interpretability are interpretable in terms of their effects on output.
Because each block in a transformer performs an incremental additive update to the residual stream,
one would expect SAEs trained at nearby layers to learn similar features. Measuring the degree
of feature overlap is interesting for a few reasons. First, if adjacent SAEs learn almost identical
features, it may not be worthwhile to train and interpret SAEs at every single residual stream layer.
Second, we can use feature overlap to sanity check our auto interpretability pipeline: if the expla-
nation for latent α at layer j is very different from the explanation for “the same” latent at layer
j + 1, this would suggest that our pipeline is inconsistent and noisy. Finally, feature overlap may
allow us to estimate the degree of semantic similarity between layers, as opposed to mere statistical
similarity.
Unfortunately, we cannot simply fix some index i and compare latent i at layer j to latent i at layer j+1. This is due to permutation symmetry: given an SAE with parameters $(W_e, b_e, W_d, b_d)$ and a permutation matrix $P \neq I$, the SAE $(PW_e, Pb_e, PW_d, b_d)$ (with the rows of both weight matrices indexed by latents) has identical input-output behavior, and yet its latents are completely "unaligned" with the original. Instead, we use the Hungarian
algorithm (Crouse, 2016) to compute the permutation which maximizes the Frobenius inner product
between the decoder weight matrices of the two SAEs. A similar method was used by Ainsworth
et al. (2023) to align weights of independently trained MLPs.
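A minimal sketch of this alignment step using SciPy's implementation of the Hungarian algorithm; the decoder matrices are assumed to store one decoder direction per row (shape n_latents x d_model).

import numpy as np
from scipy.optimize import linear_sum_assignment

def align_latents(decoder_a: np.ndarray, decoder_b: np.ndarray) -> np.ndarray:
    # Find the permutation of SAE B's latents that maximizes the Frobenius
    # inner product between decoder_a and the permuted decoder_b.
    similarity = decoder_a @ decoder_b.T  # pairwise inner products of decoder rows
    row_ind, col_ind = linear_sum_assignment(similarity, maximize=True)
    # Latent i of SAE A is matched to latent col_ind[i] of SAE B; the aligned
    # Frobenius inner product is similarity[row_ind, col_ind].sum().
    return col_ind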
In this section, we focus on the 16k latent SAEs trained on Gemma 2 9b. In Appendix A.6, we report
the pairwise Frobenius inner products between SAE decoder matrices at different layers before and
after aligning them with the Hungarian algorithm. As expected, this simple metric shows that SAEs
trained on adjacent layers of the residual stream are more similar than those trained on the MLPs.
To measure the semantic similarity between the explanations, we embed the explanation of each
latent and organize the embeddings into a matrix, sorted with the indices found by the Hungarian
algorithm. We then compute the Frobenius inner product of these matrices, ignoring the latents that
don’t have an explanation on both layers. We observe that the residual stream SAEs have higher
semantic overlap for neighboring layers than the MLP SAEs.
Our results suggest that, when training SAEs given a limited compute budget, one should prioritize
training wider SAEs on a smaller subset of residual stream layers, perhaps every k th layer for some
stride k > 1. If feature diversity is a priority, it may make more sense to train SAEs on MLP outputs
exclusively, although we find MLPs to be slightly less interpretable than the residual stream, see A8.
Figure 5: Alignment between embeddings of explanations of the 16k Gemma 2 9b SAEs. After "aligning" the latents of each layer using the decoder weights, we observe that the explanations found for the latents of the residual stream SAEs are more similar across neighboring layers than those of the MLP SAEs. In this figure, alignment is measured by the Frobenius inner product between the explanation embeddings.
It’s worth noting that the Hungarian algorithm could also be used to align latents from independently
trained SAEs in other contexts, such as SAEs trained on different checkpoints of the same training
run, or even completely different architectures trained on the same data (Moschella et al., 2022). We
leave these extensions to future work.3
6 CONCLUSION
Explaining the latents of SAEs trained on cutting-edge LLMs is a computationally demanding task,
requiring scalable methods to both generate explanations and assess their quality. We addressed
issues with the conventional simulation-based scoring and introduced four new scoring techniques,
each with distinct strengths and limitations. These methods allowed us to explore the “prompt design
space” for generating effective explanations and propose practical guidelines.
While most latents are well-explained by contexts of length 32, we expect features active over longer
contexts to be important and plan to develop better methods for capturing these long-context fea-
tures. Additionally, although current scoring methods do not account for explanation length, we
believe shorter explanations are generally more useful and will incorporate this in future metrics.
Some scoring methods also require further refinement, particularly in selecting non-activating ex-
amples to improve evaluation.
By analyzing a large number of explanations, we examined overlaps across layers and the distribu-
tion of explanations in decoder direction space. Our results suggest that, when compute resources
are constrained, it may be more efficient to train wider SAEs on a small subset of residual stream
layers, rather than narrower SAEs on all layers.
Access to better, automatically generated explanations could play a crucial role in areas like model
steering, concept localization, and editing. We hope that our efficient scoring techniques will enable
feedback loops to further enhance the quality of explanations.
3 All that is needed is an adequate cost function for measuring the similarity of individual latents. SciPy's implementation supports rectangular cost matrices, allowing SAEs of different sizes to be (partially) aligned.
7 DATA AND CODE AVAILABILITY
Together with this manuscript, we release the sae-auto-interp library and accompanying scripts to
guide the reproduction of these results.
In collaboration with Neuronpedia we also release the explanations for the Gemma 2 9b 16k residual
stream SAEs, the 16k MLP SAEs, the 131k residual stream SAEs and the Llama 3.1 8b 262k residual
stream and MLP SAEs. Explanations can be directly downloaded here.
8 ACKNOWLEDGEMENTS
We would like to acknowledge Lucia Quirke and David Johnston for their work reviewing and discussing the manuscript. We would like to acknowledge Jacob Drori's role in the initial development of the new scoring techniques. We would also like to thank Joseph Bloom, Sam Marks, Can Rager, Jannik Brinkmann, Neel Nanda, Sarah Schwettmann, and Jacob Steinhardt for their discussions and suggestions on an earlier version of this work. We are thankful to Open Philanthropy for funding this work. We are grateful to CoreWeave for providing the compute resources.
REFERENCES
Samuel Ainsworth, Jonathan Hayase, and Siddhartha Srinivasa. Git re-basin: Merging models mod-
ulo permutation symmetries. In The Eleventh International Conference on Learning Representa-
tions, 2023. URL https://openreview.net/forum?id=CQsmMYmlP5T.
Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. Linear algebraic struc-
ture of word senses, with applications to polysemy. Transactions of the Association for Compu-
tational Linguistics, 6:483–495, 2018.
Anton Bakhtin, Noam Brown, Emily Dinan, Gabriele Farina, Colin Flaherty, Daniel Fried, An-
drew Goff, Jonathan Gray, Hengyuan Hu, Athul Paul Jacob, Mojtaba Komeili, Karthik Konath,
Minae Kwon, Adam Lerer, Mike Lewis, Alexander H. Miller, Sandra Mitts, Adithya Ren-
duchintala, Stephen Roller, Dirk Rowe, Weiyan Shi, Joe Spisak, Alexander Wei, David J.
Wu, Hugh Zhang, and Markus Zijlstra. Human-level play in the game of diplomacy by com-
bining language models with strategic reasoning. Science, 378:1067 – 1074, 2022. URL
https://api.semanticscholar.org/CorpusID:253759631.
Leonard Bereska and Efstratios Gavves. Mechanistic interpretability for ai safety–a review. arXiv
preprint arXiv:2404.14082, 2024.
Steven Bills, Nick Cammarata, Dan Mossing, Henk Tillman, Leo Gao, Gabriel Goh, Ilya Sutskever, Jan Leike, Jeff Wu, and William Saunders. Language models can explain neurons in language models, 2023. URL https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html (date accessed: 14.05.2023).
Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Con-
erly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu,
Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex
Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter,
Tom Henighan, and Christopher Olah. Towards monosemanticity: Decomposing language
models with dictionary learning. Transformer Circuits Thread, 2023. https://transformer-
circuits.pub/2023/monosemantic-features/index.html.
Peter Bühlmann. Invariance, causality and robustness. arXiv preprint arXiv:1812.08233, 2018.
Haozhe Chen, Carl Vondrick, and Chengzhi Mao. Selfie: Self-interpretation of large language model
embeddings, 2024. URL https://arxiv.org/abs/2403.10949.
Together Computer. Redpajama: an open dataset for training large language models, 2023. URL
https://github.com/togethercomputer/RedPajama-Data.
David F Crouse. On implementing 2d rectangular assignment algorithms. IEEE Transactions on
Aerospace and Electronic Systems, 52(4):1679–1696, 2016.
Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoen-
coders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600,
2023.
Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec,
Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, et al. Toy models of superposi-
tion. arXiv preprint arXiv:2209.10652, 2022.
Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya
Sutskever, Jan Leike, and Jeffrey Wu. Scaling and evaluating sparse autoencoders. arXiv preprint
arXiv:2406.04093, 2024.
Asma Ghandeharioun, Avi Caciularu, Adam Pearce, Lucas Dixon, and Mor Geva. Patchscopes:
A unifying framework for inspecting hidden representations of language models, 2024. URL
https://arxiv.org/abs/2401.06102.
Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, and Dimitris Bert-
simas. Finding neurons in a haystack: Case studies with sparse probing. arXiv preprint
arXiv:2305.01610, 2023.
Wes Gurnee, Theo Horsley, Zifan Carl Guo, Tara Rezaei Kheirkhah, Qinyi Sun, Will Hathaway,
Neel Nanda, and Dimitris Bertsimas. Universal neurons in gpt2 language models. arXiv preprint
arXiv:2401.12181, 2024.
Evan Hernandez, Sarah Schwettmann, David Bau, Teona Bagashvili, Antonio Torralba, and Jacob
Andreas. Natural language descriptions of deep visual features, 2022. URL https://arxiv.
org/abs/2201.11114.
Jing Huang, Atticus Geiger, Karel D’Oosterlinck, Zhengxuan Wu, and Christopher Potts. Rig-
orously assessing natural language explanations of neurons. arXiv preprint arXiv:2309.10312,
2023.
Caden Juang, Gonçalo Paulo, Jacob Drori, and Nora Belrose. Understanding and steering Llama 3, 9 2024. URL https://goodfire.ai/blog/research-preview/.
Dmitrii Kharlapenko, neverix, Neel Nanda, and Arthur Conmy. Self-explaining SAE fea-
tures, 8 2024. URL https://www.lesswrong.com/posts/8ev6coxChSWcxCDy8/
self-explaining-sae-features.
Laura Kopf, Philine Lou Bommer, Anna Hedström, Sebastian Lapuschkin, Marina M. C. Höhne,
and Kirill Bykov. Cosy: Evaluating textual explanations of neurons, 2024. URL https://
arxiv.org/abs/2405.20331.
Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant
Varma, János Kramár, Anca Dragan, Rohin Shah, and Neel Nanda. Gemma scope: Open sparse
autoencoders everywhere all at once on gemma 2, 2024. URL https://arxiv.org/abs/
2408.05147.
Johnny Lin and Joseph Bloom. Neuronpedia: Interactive reference and tooling for analyzing neural
networks with sparse autoencoders, 2023. URL https://www.neuronpedia.org. Soft-
ware available from neuronpedia.org.
Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The ai scientist:
Towards fully automated open-ended scientific discovery, 2024. URL https://arxiv.org/
abs/2408.06292.
Luca Moschella, Valentino Maiorca, Marco Fumero, Antonio Norelli, Francesco Locatello, and
Emanuele Rodolà. Relative representations enable zero-shot latent space communication. arXiv
preprint arXiv:2209.15430, 2022.
Niklas Muennighoff, Nouamane Tazi, Loïc Magne, and Nils Reimers. Mteb: Massive text embedding benchmark. arXiv preprint arXiv:2210.07316, 2022. doi: 10.48550/ARXIV.2210.07316. URL https://arxiv.org/abs/2210.07316.
Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter.
Zoom in: An introduction to circuits. Distill, 5(3), March 2020. ISSN 2476-0757. doi: 10.23915/
distill.00024.001. URL http://dx.doi.org/10.23915/distill.00024.001.
OpenAI. Gpt-4 technical report, 2023. URL https://arxiv.org/abs/2303.08774.
Kiho Park, Yo Joong Choe, and Victor Veitch. The linear representation hypothesis and the
geometry of large language models. ArXiv, abs/2311.03658, 2023. URL https://api.
semanticscholar.org/CorpusID:265042984.
Bernhard Schölkopf, Dominik Janzing, Jonas Peters, Eleni Sgouritsa, Kun Zhang, and Joris Mooij.
On causal and anticausal learning. arXiv preprint arXiv:1206.6471, 2012.
Tamar Rott Shaham, Sarah Schwettmann, Franklin Wang, Achyuta Rajaram, Evan Hernandez, Jacob
Andreas, and Antonio Torralba. A multimodal automated interpretability agent, 2024. URL
https://arxiv.org/abs/2404.14394.
Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen,
Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L
Turner, Callum McDougall, Monte MacDiarmid, C. Daniel Freeman, Theodore R. Sumers,
Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, and Tom Henighan.
Scaling monosemanticity: Extracting interpretable features from claude 3 sonnet. Trans-
former Circuits Thread, 2024. URL https://transformer-circuits.pub/2024/
scaling-monosemanticity/index.html.
Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan,
and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models,
2023. URL https://arxiv.org/abs/2305.16291.
A APPENDIX
A.1 EXPLAINER PROMPT
One of the few shot examples of how examples are displayed to the model.
Example 1: and he was <<over the moon>> to find
Activations: ("till", 5), (" the", 5),
(" cows", 8), (" come", 8),
(" home", 8)
Step 1.
- The activating tokens are all parts of common idioms.
- The surrounding tokens have nothing in common.
Step 2.
- The examples contain common idioms.
- In some examples, the activating tokens are followed
by an exclamation mark.
Step 3.
- The activation values are the highest for the more common
idioms in examples 1 and 3.
As discussed in the main text, the explanations found for a given latent can be very different depending on how the activating contexts shown to the explainer model are sampled (see fig. A1). This has advantages and disadvantages.
When sampling from only the top activations, it is possible that the explainer model gives an expla-
nation that is more narrow - “The concept of a buffer, referring to something that separates, shields,
or protects one thing from another, often used in various contexts such as physical barriers, chemi-
cal reactions, or digital data processing” - instead of one that captures the full distribution - “Words
or phrases associated with concepts of spatial or temporal separation (buffers, zones, or cushions)
or colloquialisms (buff, buffs, or Buffy, referring to a popular TV show or enthusiast)”. Here the
narrower explanations resulted in lower scores: the explanation generated from randomly sampled
examples scored 0.95 accuracy in fuzzing and 0.93 accuracy in detection, while the explanation
generated from top examples only achieves 0.8 accuracy in fuzzing and 0.74 accuracy in detection.
On the other hand, if we look at the second example, sampling from random examples may sometimes confuse the explainer model - "Tokens often precede or succeed prepositions, articles, and words that signal possession or quantity, frequently indicating a relationship between objects or actions." - which may miss the pattern that is clearer in the top activating examples - "Descriptions of food or events where food is involved, often mentioning leftovers, and sometimes
mentioning the act of eating, serving, or storing food, as well as the amount of food, or the pleasure
or satisfaction derived from it.”. Here we see very poor scores on the explanation from randomly
sampled examples, 0.54 accuracy in detection and 0.57 accuracy in fuzzing, while the explanation
generated from the top examples has 0.89 accuracy both in detection and fuzzing (over the full
distribution).
A.3.1 DETECTION DETAILS

Together with the prompt, there are several few-shot examples like the following:
<user_prompt>
Text examples:
<assistant_response>
[1,0,0,0,1]
In this example the <user prompt> and the <assistant response> tags are substituted with the
correct instruct format used by the scorer model. In detection scoring, 5 shuffled examples are
shown to the model at the same time.
Figure A1: Activating contexts of feature 209 (top) and 293 (bottom), from layers 8 and 32, respec-
tively, of the 131k latent SAE trained on the residual stream of Gemma 2 9b. The shown examples
are similar to the ones given to the explainer model to come up with explanations.
A.3.2 FUZZING DETAILS
<user_prompt>
Text examples:
<assistant_response>
[1,0,0,1,1]
In this example, the <user prompt> and the <assistant response> tags are substituted with the
correct instruct format used by the scorer model. In fuzzing scoring, 5 shuffled examples are shown
to the model at the same time.
A.3.3 SURPRISAL DETAILS
In surprisal scoring, the cross-entropy loss of the scorer model is computed over the tokens of an example. For each explanation, this loss is computed both with the explanation and with a default explanation - "Various unrelated sentences." - where the examples can be either activating or non-activating contexts. In our first approach, Llama 3.1 70b base was used. The prompt starts with few-shot examples like:
The following is a description of a certain feature of text
and a list of examples that contain the feature.
Description:
Sentences:
‘‘3 begins. And the rise of Antichrist. Get ready with "
Description:
Sentences:
Description:
Sentences:
Description:
Sentences:
‘‘The Smoking Tire hits the canyons with one of the fastest
Audi’s on the road "
With the pairs of losses - with the explanation and with the default explanation - we can compute the decrease in loss caused by having access to the explanation. This difference is expected to be greater in activating contexts than in non-activating ones, and the score of the explanation is given by the AUC computed using this loss difference as a classifier of activating versus non-activating examples.
A.3.4 EMBEDDING DETAILS

The cosine similarity between the instruction (query) embedding and each example's embedding is computed and used to rank activating versus non-activating examples when computing the AUC, which is the score of that explanation.
Throughout this work, we used different SAEs trained on Gemma. The 16k latent ones trained on
the MLP have the following average L0 norms per layer:
0:50, 1:56, 2:33, 3:55, 4:66, 5:46, 6:46, 7:47, 8:55, 9:40, 10:49, 11:34, 12:42, 13:40, 14:41,
15:45, 16:37, 17:41, 18:36, 19:38, 20:41, 21:34, 22:34, 23:73, 24:32, 25:72, 26:57, 27:52,
Figure A2: Comparison between the scores given by a small embedding model and a larger
one
Table A1: Pearson correlation computed over 600 different latent scores

             Fuzzing   Detection   Simulation   Embedding   Surprisal
Fuzzing        1         0.74         0.70         0.42        0.32
Detection                1            0.45         0.70        0.62
Simulation                            1            0.34        0.19
Embedding                                           1          0.79
Surprisal                                                       1
28:50, 29:49, 30:51, 31:43, 32:44, 33:48, 34:47, 35:46, 36:47, 37:53, 38:45, 39:43, 40:37,
41:58
The 16k latent ones trained on the residual stream have the following average L0 norms:
0:35, 1:69, 2:67, 3:37, 4:37, 5:37, 6:47, 7:46, 8:51, 9:51, 10:57, 11:32, 12:33, 13:34, 14:35, 15:34, 16:39, 17:38, 18:37, 19:35, 20:36, 21:36, 22:35, 23:35, 24:34, 25:34, 26:35, 27:36, 28:37, 29:38, 30:37, 31:35, 32:34, 33:34, 34:34, 35:34, 36:34, 37:34, 38:34, 39:34, 40:32, 41:52
The 131k latent ones trained on the residual stream have the following average L0 norms:
0:30, 1:33, 2:36, 3:46, 4:51, 5:51, 6:66, 7:38, 8:41, 9:42, 10:47, 11:49, 12:52, 13:30, 14:56, 15:55, 16:35, 17:35, 18:34, 19:32, 20:34, 21:33, 22:32, 23:32, 24:55, 25:54, 26:32, 27:33, 28:32, 29:33, 30:32, 31:52, 32:51, 33:51, 34:51, 35:51, 36:51, 37:53, 38:53, 39:54, 40:49, 41:45
We also used SAEs with 65k latents, trained on the residual stream and the MLP of Llama 8b.
Figure A3: Scatter plots with different combinations of scores
Table A2: Impact of training dataset on Fuzzing and Detection performance. Numbers shown are
the median score and the interquartile (25%-75%) range.
Experiment Fuzzing Detection Embedding
Random explanation 0.51 (0.45–0.57) 0.51 (0.45–0.58) 0.51 (0.44–0.57)
Randomly initialized Topk SAE 0.55 (0.50–0.60) 0.54 (0.50–0.59) –
RPJ-v2 0.76 (0.67–0.86) 0.74 (0.63–0.85) 0.67 (0.57–0.80)
Pile 0.76 (0.67–0.86) 0.76 (0.67–0.85) 0.69 (0.57–0.80)
Even though a significant portion of SAE latents are less active when using RPJv2 instead of the Pile, we find that the latents that are active are generally interpretable to the same degree. These evaluations were done using the 131k latent SAE trained on the residual stream of Gemma 2 9b. The scorer and explainer models were Llama 3.1 70b Instruct, quantized to 4 bits.
We find that CoT slightly increases the scores of the explanations found, but that it significantly slows down the rate at which one can produce explanations. Giving the explainer model the activations associated with each token seems to slightly increase the scores of the explanations generated.
Table A3: Impact of prompt content on Fuzzing and Detection performance. Numbers shown are
the median score and the interquartile (25%-75%) range.
Experiment Fuzzing Detection Embedding
Random explanation 0.51 (0.45–0.57) 0.51 (0.45–0.58) 0.51 (0.44–0.57)
Randomly initialized Topk SAE 0.55 (0.50–0.60) 0.54 (0.50–0.59) –
Activations in prompt 0.76 (0.67–0.86) 0.74 (0.63–0.85) 0.68 (0.57–0.80)
No activations in prompt 0.75 (0.65–0.86) 0.73 (0.60–0.84) 0.68 (0.58–0.79)
COT in prompt 0.76 (0.67–0.86) 0.73 (0.61–0.85) 0.65 (0.55–0.75)
Table A4: Impact of context length on Fuzzing and Detection performance. Numbers shown are the
median score and the interquartile (25%-75%) range.
Experiment Fuzzing Detection Embedding
Random explanation 0.51 (0.45–0.57) 0.51 (0.45–0.58) 0.51 (0.44–0.57)
Randomly initialized Topk SAE 0.55 (0.50–0.60) 0.54 (0.50–0.59) –
16 context 0.75 (0.65–0.86) 0.74 (0.62–0.85) 0.70 (0.59–0.81)
32 context 0.76 (0.67–0.86) 0.74 (0.63–0.85) 0.67 (0.57–0.80)
64 context 0.74 (0.64–0.64) 0.70 (0.57–0.81) 0.65 (0.54–0.78)
These evaluations were done using the 131k latent SAE trained on the residual stream of Gemma 2 9b. The scorer and explainer models were Llama 3.1 70b Instruct, quantized to 4 bits.
Figure A4: Accuracy on fuzzing and detection scoring. Dots correspond to the median score over ca. 300 latent explanations, and the colored region denotes the interquartile range.
Table A5: Impact of example sampling strategies on Fuzzing and Detection performance. Numbers
shown are the median score and the interquartile (25%-75%) range.
Experiment Fuzzing Detection Embedding
Random explanation 0.51 (0.45–0.57) 0.51 (0.45–0.58) 0.51 (0.44–0.57)
Randomly initialized Topk SAE 0.55 (0.50–0.60) 0.54 (0.50–0.59) –
Randomly sampled 0.76 (0.68–0.86) 0.74 (0.62–0.84) 0.66 (0.56–0.78)
Sampled from quantiles 0.77 (0.69–0.87) 0.74 (0.64–0.85) 0.68 (0.57–0.80)
Sampled from top examples 0.73 (0.64–0.83) 0.72 (0.62–0.82) 0.70 (0.58–0.80)
Table A6: Impact of number of examples on Fuzzing and Detection performance. Numbers shown
are the median score and the interquartile (25%-75%) range.
Experiment Fuzzing Detection Embedding
Random explanation 0.51 (0.45–0.57) 0.51 (0.45–0.58) 0.51 (0.44–0.57)
Randomly initialized Topk SAE 0.55 (0.50–0.60) 0.54 (0.50–0.59) –
Shown 10 examples 0.73 (0.62–0.85) 0.71 (0.58–0.82) 0.64 (0.54-0.74)
Shown 20 examples 0.74 (0.64–0.85) 0.72 (0.60–0.84) 0.66 (0.54-0.76)
Shown 40 examples 0.76 (0.67–0.86) 0.74 (0.63–0.85) 0.68 (0.57–0.80)
Shown 60 examples 0.75 (0.66–0.85) 0.73 (0.62–0.84) 0.68 (0.57–0.79)
As noted in section 4, Claude's explanations might improve if the prompts were better tuned for it. We find that having a smaller model do the scoring significantly drops the scores of the explanations - roughly as much as having the smaller model generate the explanations and the bigger model score them.
See figure A5 for a depiction of the alignment between decoder weights at various layers. Here, alignment is measured using the Frobenius inner product between the decoder weight matrices. Due to the permutation symmetry of SAEs, layers are not aligned by default, but they can be aligned using the Hungarian algorithm. We observe that the decoder weights of SAEs trained on the residual stream can be better aligned than those trained on the MLP.
Table A7: Comparison of explainer models for Fuzzing and Detection. Numbers shown are the
median score and the interquartile (25%-75%) range.
Experiment Fuzzing Detection Embedding
Random explanation 0.51 (0.45–0.57) 0.51 (0.45–0.58) 0.51 (0.44–0.57)
Randomly initialized Topk SAE 0.55 (0.50–0.60) 0.54 (0.50–0.59) –
Claude explaining 0.75 (0.68–0.84) 0.75 (0.65–0.85) 0.70 (0.58–0.81)
Llama 70b explaining (4 bit) 0.76 (0.67–0.86) 0.74 (0.63–0.85) 0.67 (0.57–0.80)
Llama 70b explaining (4 bit) 8b scoring 0.69 (0.61–0.77) 0.69 (0.59–0.79) –
Llama 8b explaining 0.70 (0.60–0.81) 0.70 (0.59–0.81) 0.64 (0.54–0.75)
Llama 8b explaining, 8b scoring 0.67 (0.59–0.75) 0.67 (0.57–0.77) –
Human explaining 0.75 (0.66–0.85) 0.74 (0.64–0.85) 0.71 (0.62–0.81)
Table A8: Comparison of SAEs for Fuzzing and Detection. Numbers shown are the median score
and the interquartile (25%-75%) range.
Experiment Fuzzing Detection
Random explanation 0.51 (0.45–0.57) 0.51 (0.45–0.58)
Randomly initialized Topk SAE 0.55 (0.50–0.60) 0.54 (0.50–0.59)
131k latents Gemma 2 9b 0.76 (0.67–0.86) 0.74 (0.63–0.85)
16k latents Gemma 2 9b 0.73 (0.63–0.83) 0.70 (0.59–0.79)
Top 32 neurons Gemma 2 9b 0.62 (0.54–0.70) 0.59 (0.53–0.64)
Top 256 neurons Gemma 2 9b 0.59 (0.53–0.65) 0.57 (0.52–0.62)
262k latents Llama 3.1 8b MLP 0.79 (0.69–0.86) 0.79 (0.64–0.85)
262k latents Llama 3.1 8b 0.81 (0.71–0.86) 0.83 (0.68–0.85)
Top 32 neurons Llama 3.1 8b 0.55 (0.51–0.62) 0.53 (0.50–0.59)
Top 256 neurons Llama 3.1 8b 0.54 (0.49–0.60) 0.53 (0.50–0.57)
We then sample generations with a maximum of 8 new tokens from the subject model. For genera-
tions with the intervention, we perform an additive intervention on the feature at all token positions
after and including the final prompt token. We tune the intervention strength of each feature to vari-
ous KL-divergence values on the scoring set with a binary search. We stop the binary search when
the KL divergence is within 10% of the desired value.4 For our zero-ablation experiment, instead of
doing an additive intervention we clamp the feature’s activation to 0. Specifically, the hidden states
are encoded by the SAE, then the SAE reconstruction error is computed using a clean decoding,
then the SAE encoding is clamped and decoded, and the error is added to the clamped decoding.
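A minimal sketch of the zero-ablation procedure just described; sae.encode and sae.decode are assumed helper methods.

import torch

def zero_ablate_feature(hidden: torch.Tensor, sae, feature_idx: int) -> torch.Tensor:
    # Encode the hidden states, measure the SAE's reconstruction error on the
    # clean decoding, clamp the chosen latent to zero, decode, and add the
    # error back so that only the ablated feature's contribution changes.
    latents = sae.encode(hidden)          # (..., n_latents)
    clean_recon = sae.decode(latents)
    error = hidden - clean_recon          # what the SAE fails to reconstruct
    clamped = latents.clone()
    clamped[..., feature_idx] = 0.0
    return sae.decode(clamped) + error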
The scorer is Llama-3.1 8B base. We use the base model for improved calibration, and prompt it as
follows.
<PASSAGE>
from west to east, the westmost of the seven wonders of the
world is the great wall of china
<PASSAGE>
Given 4x is less than 10, 4
<PASSAGE>
4 Sometimes the KL divergence is not perfectly monotonic in the intervention strength, so the 10% tolerance is sometimes exceeded. We report the average KL divergence that we observe in figure A6.
Figure A5: Alignment between decoder directions of the 16k Gemma 2 9b SAEs. Because each layer's SAE is trained independently, there is no correspondence between latents at different layers. It is possible to align them using the Hungarian algorithm, but even then the MLP SAEs are less aligned than the residual stream ones.
<PASSAGE>
My favorite food is oranges
<PASSAGE>
...
A.7.2 GENERATING EXPLANATIONS
We generate explanations using only the intervention’s effect on the subject’s next-token probabili-
ties because this leads to a concise and precise prompt. The explainer, like the scorer, is Llama-3.1
8B.
The explainer sees a distribution of 10 prompts that is sampled i.i.d. from the same population as
the scorer’s prompts.
We use the following prompt for the explainer, with 3 few-shot examples truncated for brevity.
We’re studying neurons in a transformer model. We want to know
how intervening on them affects the model’s output.
Neuron 1
<PROMPT>Given 4x is less than 10,</PROMPT>
Most increased tokens: ’ 4’ (+0.11), ’ 10’ (+0.04),
’ 40’ (+0.02), ’ 2’ (+0.01)
Explanation: numbers
Neuron 2
...
A.7.3 BASELINES
Random SAE. We experiment with a random TopK SAE with k = 50, roughly the average sparsity of the Gemma SAEs. The encoder is initialized with 131,072 spherically uniform unit-norm features, and the decoder is initialized to its transpose. We use a random SAE with TopK activations because we need sparsity for our sampling procedure to work properly.
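A minimal sketch of such a random TopK SAE baseline; the function and its defaults are illustrative.

import torch

def random_topk_sae(d_model: int, n_latents: int = 131_072, k: int = 50, seed: int = 0):
    # Encoder rows are spherically uniform unit-norm directions; the decoder is
    # their transpose, so the 'features' are sparse but carry no learned structure.
    gen = torch.Generator().manual_seed(seed)
    directions = torch.randn(n_latents, d_model, generator=gen)
    directions = directions / directions.norm(dim=1, keepdim=True)

    def encode(x: torch.Tensor) -> torch.Tensor:
        pre = x @ directions.T            # (batch, n_latents)
        topk = torch.topk(pre, k, dim=-1)
        latents = torch.zeros_like(pre)
        latents.scatter_(-1, topk.indices, torch.relu(topk.values))
        return latents

    def decode(latents: torch.Tensor) -> torch.Tensor:
        return latents @ directions

    return encode, decode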
Random explanations. For each layer and target KL value, we compute a random explanation
baseline where we shuffle the explanations across features.
[Figure A6 plots: average intervention score versus layer for SAE features (left) and a random TopK SAE (right), with curves at average KL values of approximately 0.10, 0.33, 1.0, and 3.1; y-axis: Average Score, x-axis: Layer.]
Figure A6: Average intervention score vs layer. SAE features in later layers have more explainable
effects on output. Random features, however, have uninterpretable effects on output, even at late
layers.
Figure A7: Comparison of intervention and fuzzing scores across layers at various target KL values.