INTERPRETING AND STEERING PROTEIN LANGUAGE MODELS THROUGH SPARSE AUTOENCODERS
ABSTRACT
1 INTRODUCTION
Since the introduction of the transformer architecture (Vaswani, 2017), the capabilities of neural
networks to model and generate natural language have increased dramatically. Yet, due to their
black-box nature, we still lack a clear understanding of how these models achieve such capabilities
(Rai et al., 2024). Recently, the mechanistic interpretability approach has been proposed, where
researchers try to reverse engineer neural networks in a way similar to reverse engineering computer
programs (Olah, 2022; Rai et al., 2024). This involves understanding which features the network
learns from the input data, and how it performs operations on this set of features.
It has been observed that neural networks tend to encode high-level features as linear directions in
their representation space—such as the gender direction in word embeddings (Park et al., 2023).
Additionally, these models can store more facts and features than their parameter counts would
suggest, a phenomenon known as superposition (Elhage et al., 2022). Superposition poses a core
problem for interpretability: a single neuron activation can be polysemantic, representing multiple
features simultaneously.
Recently, sparse autoencoders (SAEs) have been proposed as a method to disentangle internal rep-
resentations in language models, extracting features from superposition in an unsupervised manner
(Templeton et al., 2024; Bricken et al., 2023; Cunningham et al., 2023). Notably, these features
appear to be actionable: artificially activating them during inference can steer a model’s output
(Templeton et al., 2024; Makelov, 2024). Such methods have been successfully applied to language
(Templeton et al., 2024; Cunningham et al., 2023; Gao et al., 2024), as well as vision and mul-
timodal models (Gorton, 2024; Surkov et al., 2024), but biological and protein sequence models
remain relatively unexplored (Simon & Zou, 2024; Adams et al., 2025).
Protein language models have been shown to encode structural, functional, and evolutionary infor-
mation in their internal representations (Rives et al., 2019; Lin et al., 2023; Hayes et al., 2024).
Interpretability methods for these models could reveal biological mechanisms, and support model
debugging and editing for safety considerations. Additionally, model steering can be incorporated
into sequence design pipelines.
The main contributions of this paper are:
• A trained sparse autoencoder (SAE) for the ESM-2 8M parameter model, along with potential
interpretations of its latent components (sections 2.2, 3.1 and 3.3).
• A methodology for generating protein sequences by intervening on specific latents, demonstrating
successful steering towards non-trivial features, such as zinc finger domains (section 3.4).
• A heuristic for selecting the model layer from which to extract representations using an intrinsic
dimension estimator (section 3.2).
2 BACKGROUND
2.1 PROTEIN LANGUAGE MODELS
Many advancements in Natural Language Processing have been successfully applied to biological
sequence modeling. Transformer-based neural networks can be trained on protein sequences using
the Masked Language Modeling (MLM) task, where each amino acid is treated as a token that can
be randomly masked. The model learns to predict the masked tokens by minimizing the following
loss function (Rives et al., 2019):
$$\mathcal{L}_{\text{MLM}} = \mathbb{E}_{x \sim X}\, \mathbb{E}_{M} \sum_{i \in M} -\log p(x_i \mid x_{/M}) \qquad (1)$$
where x is a protein sequence, M is a set of masked indices, and p(xi | x/M) is the probability
assigned to the ground-truth amino acid xi given its sequence context.
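As a concrete illustration, eq. (1) reduces to a cross-entropy over the masked positions. A minimal PyTorch sketch for a single sequence (tensor names are ours, not from the original implementation):

```python
import torch
import torch.nn.functional as F

def mlm_loss(logits: torch.Tensor, targets: torch.Tensor, masked: torch.Tensor) -> torch.Tensor:
    """Masked language modeling loss of eq. (1) for one sequence.

    logits:  (seq_len, vocab_size) model outputs for the masked input
    targets: (seq_len,) ground-truth amino acid token ids
    masked:  (seq_len,) boolean tensor, True at the positions in M
    """
    # -log p(x_i | x_/M), averaged over the masked positions only
    return F.cross_entropy(logits[masked], targets[masked])
```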
Training on the Masked Language Modeling (MLM) task forces the network to learn dependencies
between masked amino acids and their sequence context while simultaneously capturing various
biological features present in the data. Embeddings extracted from these models have been shown
to encode information about secondary structure, tertiary contacts (residue-residue interactions),
function, remote evolutionary relationships, and factors relevant to predicting mutational effects
(Rives et al., 2019; Elnaggar et al., 2021; Meier et al., 2021; Lin et al., 2023; Hayes et al., 2024).
On the other hand, the attention mechanism appears to prioritize binding sites, with attention maps
capturing information about residue-residue interactions (Vig et al., 2020).
2.2 SPARSE AUTOENCODERS

The sparse autoencoders used for interpretability are simple, single-layer models trained on the
activations of a larger language model. To disentangle network features, the hidden layer is made
significantly larger than the original embeddings, creating an overcomplete basis. A sparsity
constraint is then applied to ensure that only a few latent neurons are active at a time, making the
SAE's hidden representation far more interpretable than standard language model components
(Templeton et al., 2024; Bricken et al., 2023; Cunningham et al., 2023).
2.2.1 ARCHITECTURE
The autoencoder is composed of an encoding and a decoding function, given by:

$$z = f_{\text{enc}}(x) = \mathrm{ReLU}(W_{\text{enc}}\, x + b_{\text{enc}}) \qquad (2)$$

$$\hat{x} = f_{\text{dec}}(z) = W_{\text{dec}}^{\top} z + b_{\text{dec}} \qquad (3)$$

Here fenc is the encoder, which takes an embedded amino acid token $x \in \mathbb{R}^d$ from a given layer in
the model and returns a latent $z \in \mathbb{R}^n_{\geq 0}$ whose hidden dimension n is m times that of the
original vector (the expansion factor). The decoder fdec approximately reconstructs x given z, through
the decoding matrix $W_{\text{dec}} \in \mathbb{R}^{n \times d}$ and the bias $b_{\text{dec}} \in \mathbb{R}^d$.
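As a concrete sketch, one common SAE parameterization consistent with eqs. (2)-(3), a ReLU encoder followed by an affine decoder, in PyTorch; the class and attribute names are illustrative rather than the paper's code:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Single-layer SAE with an overcomplete, non-negative latent space."""

    def __init__(self, d_model: int, expansion_factor: int = 8):
        super().__init__()
        n_latents = expansion_factor * d_model       # n = m * d
        self.enc = nn.Linear(d_model, n_latents)     # W_enc, b_enc
        self.dec = nn.Linear(n_latents, d_model)     # W_dec, b_dec

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # ReLU keeps the latent z non-negative, as in eq. (2)
        return torch.relu(self.enc(x))

    def decode(self, z: torch.Tensor) -> torch.Tensor:
        return self.dec(z)                           # eq. (3)

    def forward(self, x: torch.Tensor):
        z = self.encode(x)
        return self.decode(z), z
```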
The loss function used for training is a combination of the reconstruction error of the autoencoder
$\mathcal{L}_{\text{MSE}}$ plus a sparsity constraint $\mathcal{L}_{L_1}$:

$$\mathcal{L}(x) = \mathcal{L}_{\text{MSE}} + \mathcal{L}_{L_1} = \sum_{d} (x_d - \hat{x}_d)^2 + \lambda \sum_{n} z_n \qquad (4)$$
While training, we renormalize the Wdec matrix to unit norm after each backward pass. This is
necessary to prevent the autoencoder latents from becoming arbitrarily small to satisfy the L1
constraint without actually being sparse.
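A sketch of the corresponding training step, assuming the SparseAutoencoder module sketched above; the sparsity coefficient here is a placeholder, not the paper's hyperparameter:

```python
import torch

def sae_training_step(sae, x, optimizer, l1_coeff=1e-3):
    """One optimization step for eq. (4), followed by the unit-norm
    renormalization of the decoder weights."""
    x_hat, z = sae(x)
    mse = ((x - x_hat) ** 2).sum(dim=-1).mean()   # reconstruction term
    l1 = z.sum(dim=-1).mean()                     # sparsity term (z >= 0)
    loss = mse + l1_coeff * l1

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Rescale each decoder column (one direction per latent) to unit norm,
    # so latents cannot shrink to satisfy the L1 penalty without being sparse.
    with torch.no_grad():
        w = sae.dec.weight                        # shape (d_model, n_latents)
        w.div_(w.norm(dim=0, keepdim=True))
    return loss.item()
```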
3 METHODS
3.1 SPARSE AUTOENCODER TRAINING

We use the ESM-2 family of models (Lin et al., 2023) as our base, extracting activations from the
final output of the transformer block. We train on approximately 15k non-redundant protein sequences
from SCOPe 2.08 (Fox et al., 2014). Further details on the architecture, training procedures, and
hyperparameter selection can be found in section A.1 of the appendix.
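For illustration, activations can be extracted from the 8M-parameter ESM-2 model with the fair-esm package roughly as follows; the layer index is a placeholder, since the actual layer is chosen with the heuristic of section 3.2:

```python
import torch
import esm  # fair-esm package

# ESM-2 8M: 6 transformer layers, d_model = 320
model, alphabet = esm.pretrained.esm2_t6_8M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

data = [("example", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]
_, _, tokens = batch_converter(data)

with torch.no_grad():
    out = model(tokens, repr_layers=[4])  # layer 4 is illustrative
acts = out["representations"][4]          # (batch, seq_len, d_model)
```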
3.2 LAYER SELECTION VIA INTRINSIC DIMENSION

We adopt a principled strategy to select the layer from which we extract representations for the
sparse autoencoder. The initial intuition, in line with earlier studies (Templeton et al., 2024; Gao
et al., 2024), is to choose a mid-to-late layer, where the model is assumed to have developed ab-
stract features but is not yet focused on the output reconstruction task. However, unlike these prior
works, we move beyond mere intuition by incorporating a quantitative measure based on intrinsic
dimension.
Specifically, we compute the intrinsic dimension of each layer’s representations using the estimator
proposed by Facco et al. (2017), and then identify where this value plateaus. Previous research has
shown that layers corresponding to local minima or plateaus in intrinsic dimension are where abstract
information is most clearly encoded (Valeriani et al., 2024). Selecting a layer within this plateau
increases the likelihood of capturing meaningful representations, providing a stronger foundation
for interpretability and model steering.
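A compact sketch of the TwoNN estimator of Facco et al. (2017), which infers the intrinsic dimension from the ratio of each point's two nearest-neighbor distances (our own minimal implementation, not the authors' code):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def two_nn_intrinsic_dimension(X: np.ndarray) -> float:
    """TwoNN maximum-likelihood estimate of intrinsic dimension.

    X: (n_points, d) array, e.g. one row per token activation.
    Assumes duplicate points (zero first-neighbor distance) were removed.
    """
    dists, _ = NearestNeighbors(n_neighbors=3).fit(X).kneighbors(X)
    r1, r2 = dists[:, 1], dists[:, 2]   # column 0 is the point itself
    mu = r2 / r1
    # mu follows a Pareto(d) distribution; the MLE for d is N / sum(log mu)
    return len(mu) / np.sum(np.log(mu))
```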
3.3 LATENT-FEATURE ASSOCIATIONS

We extract protein annotations from the UniProt database (UniProt, 2025) and convert them into
binary labels for each amino acid in the sequence. We then compute the precision π and recall ρ of
each latent component k in detecting a given feature ϕ.
Let A be the set of all amino acids and Aϕ+ the set of amino acids that have been annotated with the
feature ϕ. Considering a latent k to be active (k+) for a given amino acid when its value zk exceeds
a certain threshold τz, we have:

$$\pi = P(\phi^+ \mid k^+) = \frac{\left|\{a \in A_{\phi^+} : z_k > \tau_z\}\right|}{\left|\{a \in A : z_k > \tau_z\}\right|} \qquad (5)$$

$$\rho = P(k^+ \mid \phi^+) = \frac{\left|\{a \in A_{\phi^+} : z_k > \tau_z\}\right|}{\left|A_{\phi^+}\right|} \qquad (6)$$
This gives us a value of precision and recall for each pair (k, ϕ). We consider a latent component to
be associated with a specific feature if its precision or recall exceeds a predefined threshold, which
we set to 0.80.
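In code, computing π and ρ for a single latent-feature pair could look like the following sketch (array names and the default threshold are illustrative):

```python
import numpy as np

def latent_feature_scores(z_k: np.ndarray, labels: np.ndarray, tau_z: float = 0.1):
    """Precision (eq. 5) and recall (eq. 6) of latent k for feature phi.

    z_k:    (n_amino_acids,) activation of latent k at each position
    labels: (n_amino_acids,) boolean, True where phi is annotated
    """
    active = z_k > tau_z                   # k^+
    tp = np.sum(active & labels)           # active on annotated positions
    precision = tp / max(active.sum(), 1)  # P(phi^+ | k^+)
    recall = tp / max(labels.sum(), 1)     # P(k^+ | phi^+)
    return precision, recall
```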
3.4 GENERATING STEERED SEQUENCES
Once a latent corresponding to a specific feature is identified, we can steer the model during in-
ference to increase the likelihood of generating protein sequences that contain that feature. This
approach, previously demonstrated in natural language models by Templeton et al. (2024), is outlined
in figure 1.
We begin with a randomly generated amino acid sequence of fixed length. After a forward pass
through the model and the encoder layer of the SAE, we modify the target latent zk by scaling and
shifting its value to increase its magnitude (equation 7). We then pass the modified value zk∗ through
the decoder layer fdec and add back the original reconstruction error of the embedding x before
passing it through the rest of the model (equation 8).
$$z_k^* = a \cdot z_k + b \qquad (7)$$

$$x^* = f_{\text{dec}}(z_k^*) + x_{\text{err}} \qquad (8)$$
For each position in the sequence, we randomly sample an amino acid according to the probability
distribution predicted by the model under the intervention. We repeat this process starting from the
predicted sequence and perform 100 iterations of inference-prediction to refine the sequence. We
select the sequence at the iteration where the value of the activation zk is maximum.
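One way to realize this intervention is a PyTorch forward hook on the chosen transformer block. The sketch below follows eqs. (7)-(8); the attachment point and the sampling helpers in the usage comments are hypothetical, since the exact model plumbing is not shown here:

```python
import torch

def steering_hook(sae, k: int, a: float = 2.0, b: float = 5.0):
    """Returns a forward hook that rewrites a block's output via the SAE."""
    def hook(module, inputs, output):
        x = output[0] if isinstance(output, tuple) else output
        z = sae.encode(x)
        x_err = x - sae.decode(z)      # reconstruction error, kept fixed
        z[..., k] = a * z[..., k] + b  # eq. (7): scale and shift latent k
        x_star = sae.decode(z) + x_err # eq. (8)
        return (x_star,) + output[1:] if isinstance(output, tuple) else x_star
    return hook

# Hypothetical usage: attach the hook, then run the iterative refinement loop.
# handle = model.layers[4].register_forward_hook(steering_hook(sae, k=610))
# for _ in range(100):
#     logits = model(tokens)["logits"]   # forward pass under intervention
#     tokens = sample_tokens(logits)     # hypothetical sampler
#     # ... track the activation z_k and keep the sequence that maximizes it
# handle.remove()
```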
Figure 1: Sequence generation procedure. (A) To steer the model outputs, the base Protein Language
Model is modified through the insertion of a sparse autoencoder in the residual stream, at a particular
layer. During inference, the value of one of the latents in the autoencoder is modified. (B) Starting
from a random sequence, we perform inference with the intervened model and sample a new sequence
from the output logits. We repeat this procedure iteratively a fixed number of times (e.g., 100), and
at the end retain the sequence that gives the highest value for the activation of the target latent zk.
4 RESULTS
For the interpretability analysis, we focus on the autoencoder that provides the best trade-off between
sparsity and reconstruction quality, as described in section A.3.2 of the appendix. We compute recall
and precision for all [k, ϕ] pairs, following the methodology outlined in section 3.3, using three
increasing thresholds of latent activation. This allows us to assess the robustness of the identified
features.
Table 1: Number of latent-feature annotation pairs with a minimum precision/recall of 0.8 for
different values of the activation threshold τz.

τz               # Pairs (Precision > 0.8)   # Pairs (Recall > 0.8)   Total
0.01             4                           262                      266
0.10             8                           234                      242
1.00             133                         61                       194
Total (unique)   133                         262                      395
We find 395 putative [k, ϕ] associations, detailed in table 1. Among these are latent components
associated with different binding sites, cellular regions, and motifs such as zinc fingers. The complete
set of latent-feature associations is available in the supplementary data (see section 5).
We also identify many potential associations with lower confidence (lower values of precision or
recall). To see how many putative associations are found at each value of precision/recall (as well as
a combined F1-score), we plot their cumulative distributions in figure 2.
Intuitively, a latent component that perfectly matches an annotation type should exhibit both high
precision and recall, resulting in a high F1-score. However, since the model is trained to optimize
a masked language modeling loss, the features it learns may not directly align with those in the
manually curated dataset. For instance, a latent k might encode a more specific subcategory of a
dataset label ϕ, such as identifying the starting amino acid of a helix rather than the entire helix
structure (as observed by Adams et al. (2025) for some features). In such cases, the association
between k and ϕ would likely have high precision but low recall.
Similarly, high recall but low precision may indicate that the model has learned a coarser-grained
feature than those defined in the dataset. This is evident in cases such as latent k = 610, which
activates across various types of zinc fingers, and latent k = 555, which responds to alpha-keto
acids. To avoid selecting latents whose high recall is trivial, such as those activating indiscriminately
on all amino acids, we also assess the proportion of times a latent is active on amino acids lacking a
given label, denoted P(k+|ϕ−), before confirming an association. In section A.3.3, we present
examples of the distributions of P(k+|ϕ+) and P(k+|ϕ−) for a zinc finger region.
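This quantity can be computed analogously to the precision and recall above; a small sketch with illustrative names:

```python
import numpy as np

def false_activation_rate(z_k: np.ndarray, labels: np.ndarray, tau_z: float = 0.1) -> float:
    """P(k^+ | phi^-): fraction of unannotated positions on which latent k fires."""
    active = z_k > tau_z
    negatives = ~labels
    return np.sum(active & negatives) / max(negatives.sum(), 1)
```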