INTERPRETING AND STEERING PROTEIN LANGUAGE MODELS THROUGH SPARSE AUTOENCODERS
ABSTRACT
1 INTRODUCTION
Since the introduction of the transformer architecture (Vaswani, 2017), the capabilities of neural
networks to model and generate natural language have increased dramatically. Yet, due to their
black-box nature, we still lack a clear understanding of how these models achieve such capabilities
(Rai et al., 2024). Recently, the mechanistic interpretability approach has been proposed, where
researchers try to reverse engineer neural networks in a way similar to reverse engineering computer
programs (Olah, 2022; Rai et al., 2024). This involves understanding which features the network
learns from the input data, and how it performs operations on this set of features.
It has been observed that neural networks tend to encode high-level features as linear directions in
their representation space—such as the gender direction in word embeddings (Park et al., 2023).
Additionally, these models can store more facts and features than their parameter counts would
suggest, a phenomenon known as superposition (Elhage et al., 2022). Superposition poses a core
problem for interpretability: a single neuron activation can be polysemantic, representing multiple
features simultaneously.
Recently, sparse autoencoders (SAEs) have been proposed as a method to disentangle internal rep-
resentations in language models, extracting features from superposition in an unsupervised manner
(Templeton et al., 2024; Bricken et al., 2023; Cunningham et al., 2023). Notably, these features
appear to be actionable: artificially activating them during inference can steer a model’s output
(Templeton et al., 2024; Makelov, 2024). Such methods have been successfully applied to language
(Templeton et al., 2024; Cunningham et al., 2023; Gao et al., 2024), as well as vision and mul-
timodal models (Gorton, 2024; Surkov et al., 2024), but biological and protein sequence models
remain relatively unexplored (Simon & Zou, 2024; Adams et al., 2025).
Protein language models have been shown to encode structural, functional, and evolutionary infor-
mation in their internal representations (Rives et al., 2019; Lin et al., 2023; Hayes et al., 2024).
Interpretability methods for these models could reveal biological mechanisms, and support model
debugging and editing for safety considerations. Additionally, model steering can be incorporated
into sequence design pipelines.
The main contributions of this paper are:
• A trained sparse autoencoder (SAE) for the ESM-2 8M parameter model, along with potential
interpretations of its latent components (sections 2.2, 3.1 and 3.3).
• A methodology for generating protein sequences by intervening on specific latents, demonstrating
successful steering towards non-trivial features, such as zinc finger domains (section 3.4).
• A heuristic for selecting the model layer from which to extract representations using an intrinsic
dimension estimator (section 3.2).
2 BACKGROUND
2.1 PROTEIN LANGUAGE MODELS
Many advancements in Natural Language Processing have been successfully applied to biological
sequence modeling. Transformer-based neural networks can be trained on protein sequences using
the Masked Language Modeling (MLM) task, where each amino acid is treated as a token that can
be randomly masked. The model learns to predict the masked tokens by minimizing the following
loss function (Rives et al., 2019):
$$\mathcal{L}_{\text{MLM}} = \mathbb{E}_{x \sim X}\, \mathbb{E}_{M} \sum_{i \in M} -\log p(x_i \mid x_{/M}) \qquad (1)$$
where x is a protein sequence, M is a set of masked indices, and p(xi | x/M) is the probability
assigned to the ground-truth amino acid xi given its sequence context.
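As a concrete illustration, eq. (1) reduces to a cross-entropy over the masked positions. A minimal PyTorch sketch for a single sequence (tensor names are ours, not from the original implementation):

```python
import torch
import torch.nn.functional as F

def mlm_loss(logits: torch.Tensor, targets: torch.Tensor, masked: torch.Tensor) -> torch.Tensor:
    """Masked language modeling loss of eq. (1) for one sequence.

    logits:  (seq_len, vocab_size) model outputs for the masked input
    targets: (seq_len,) ground-truth amino acid token ids
    masked:  (seq_len,) boolean tensor, True at the positions in M
    """
    # -log p(x_i | x_/M), averaged over the masked positions only
    return F.cross_entropy(logits[masked], targets[masked])
```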
Training on the Masked Language Modeling (MLM) task forces the network to learn dependencies
between masked amino acids and their sequence context while simultaneously capturing various
biological features present in the data. Embeddings extracted from these models have been shown
to encode information about secondary structure, tertiary contacts (residue-residue interactions),
function, remote evolutionary relationships, and factors relevant to predicting mutational effects
(Rives et al., 2019; Elnaggar et al., 2021; Meier et al., 2021; Lin et al., 2023; Hayes et al., 2024).
On the other hand, the attention mechanism appears to prioritize binding sites, with attention maps
capturing information about residue-residue interactions (Vig et al., 2020).
2.2 SPARSE AUTOENCODERS

The sparse autoencoders used for interpretability are simple, single-layer models trained on the
activations of a larger language model. To disentangle network features, the hidden layer is made
significantly larger than the original embeddings, creating an overcomplete basis. A sparsity
constraint is then applied to ensure that only a few latent neurons are active at a time, making the
SAE's hidden representation far more interpretable than standard language model components
(Templeton et al., 2024; Bricken et al., 2023; Cunningham et al., 2023).
2.2.1 ARCHITECTURE
The autoencoder is composed of an encoding and a decoding function, given by:

$$z = f_{\text{enc}}(x) = \mathrm{ReLU}(W_{\text{enc}}\, x + b_{\text{enc}}) \qquad (2)$$

$$\hat{x} = f_{\text{dec}}(z) = W_{\text{dec}}^{\top} z + b_{\text{dec}} \qquad (3)$$

Here fenc is the encoder, which takes an embedded amino acid token $x \in \mathbb{R}^d$ from a given layer in
the model and returns a latent $z \in \mathbb{R}^n_{\geq 0}$ whose hidden dimension n is m times that of the
original vector (the expansion factor). The decoder fdec approximately reconstructs x given z, through
the decoding matrix $W_{\text{dec}} \in \mathbb{R}^{n \times d}$ and the bias $b_{\text{dec}} \in \mathbb{R}^d$.
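As a concrete sketch, one common SAE parameterization consistent with eqs. (2)-(3), a ReLU encoder followed by an affine decoder, in PyTorch; the class and attribute names are illustrative rather than the paper's code:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Single-layer SAE with an overcomplete, non-negative latent space."""

    def __init__(self, d_model: int, expansion_factor: int = 8):
        super().__init__()
        n_latents = expansion_factor * d_model       # n = m * d
        self.enc = nn.Linear(d_model, n_latents)     # W_enc, b_enc
        self.dec = nn.Linear(n_latents, d_model)     # W_dec, b_dec

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # ReLU keeps the latent z non-negative, as in eq. (2)
        return torch.relu(self.enc(x))

    def decode(self, z: torch.Tensor) -> torch.Tensor:
        return self.dec(z)                           # eq. (3)

    def forward(self, x: torch.Tensor):
        z = self.encode(x)
        return self.decode(z), z
```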
The loss function used for training is a combination of the reconstruction error of the autoencoder
$\mathcal{L}_{\text{MSE}}$ plus a sparsity constraint $\mathcal{L}_{L_1}$:

$$\mathcal{L}(x) = \mathcal{L}_{\text{MSE}} + \mathcal{L}_{L_1} = \sum_{d} (x_d - \hat{x}_d)^2 + \lambda \sum_{n} z_n \qquad (4)$$
While training, we renormalize the Wdec matrix to unit norm after each backward pass. This is
necessary to prevent the autoencoder latents from becoming arbitrarily small to satisfy the L1
constraint without actually being sparse.
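A sketch of the corresponding training step, assuming the SparseAutoencoder module sketched above; the sparsity coefficient here is a placeholder, not the paper's hyperparameter:

```python
import torch

def sae_training_step(sae, x, optimizer, l1_coeff=1e-3):
    """One optimization step for eq. (4), followed by the unit-norm
    renormalization of the decoder weights."""
    x_hat, z = sae(x)
    mse = ((x - x_hat) ** 2).sum(dim=-1).mean()   # reconstruction term
    l1 = z.sum(dim=-1).mean()                     # sparsity term (z >= 0)
    loss = mse + l1_coeff * l1

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Rescale each decoder column (one direction per latent) to unit norm,
    # so latents cannot shrink to satisfy the L1 penalty without being sparse.
    with torch.no_grad():
        w = sae.dec.weight                        # shape (d_model, n_latents)
        w.div_(w.norm(dim=0, keepdim=True))
    return loss.item()
```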
3 METHODS
3.1 SPARSE AUTOENCODER TRAINING

We use the ESM-2 family of models (Lin et al., 2023) as our base, extracting activations from the
final output of the transformer block. We train on approximately 15k non-redundant protein sequences
from SCOPe 2.08 (Fox et al., 2014). Further details on the architecture, training procedures, and
hyperparameter selection can be found in section A.1 of the appendix.
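For illustration, activations can be extracted from the 8M-parameter ESM-2 model with the fair-esm package roughly as follows; the layer index is a placeholder, since the actual layer is chosen with the heuristic of section 3.2:

```python
import torch
import esm  # fair-esm package

# ESM-2 8M: 6 transformer layers, d_model = 320
model, alphabet = esm.pretrained.esm2_t6_8M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

data = [("example", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]
_, _, tokens = batch_converter(data)

with torch.no_grad():
    out = model(tokens, repr_layers=[4])  # layer 4 is illustrative
acts = out["representations"][4]          # (batch, seq_len, d_model)
```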
3.2 LAYER SELECTION VIA INTRINSIC DIMENSION

We adopt a principled strategy to select the layer from which we extract representations for the
sparse autoencoder. The initial intuition, in line with earlier studies (Templeton et al., 2024; Gao
et al., 2024), is to choose a mid-to-late layer, where the model is assumed to have developed ab-
stract features but is not yet focused on the output reconstruction task. However, unlike these prior
works, we move beyond mere intuition by incorporating a quantitative measure based on intrinsic
dimension.
Specifically, we compute the intrinsic dimension of each layer’s representations using the estimator
proposed by Facco et al. (2017), and then identify where this value plateaus. Previous research has
shown that layers corresponding to local minima or plateaus in intrinsic dimension are where abstract
information is most clearly encoded (Valeriani et al., 2024). Selecting a layer within this plateau
increases the likelihood of capturing meaningful representations, providing a stronger foundation
for interpretability and model steering.
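A compact sketch of the TwoNN estimator of Facco et al. (2017), which infers the intrinsic dimension from the ratio of each point's two nearest-neighbor distances (our own minimal implementation, not the authors' code):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def two_nn_intrinsic_dimension(X: np.ndarray) -> float:
    """TwoNN maximum-likelihood estimate of intrinsic dimension.

    X: (n_points, d) array, e.g. one row per token activation.
    Assumes duplicate points (zero first-neighbor distance) were removed.
    """
    dists, _ = NearestNeighbors(n_neighbors=3).fit(X).kneighbors(X)
    r1, r2 = dists[:, 1], dists[:, 2]   # column 0 is the point itself
    mu = r2 / r1
    # mu follows a Pareto(d) distribution; the MLE for d is N / sum(log mu)
    return len(mu) / np.sum(np.log(mu))
```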
3.3 LATENT-FEATURE ASSOCIATIONS

We extract protein annotations from the UniProt database (UniProt, 2025) and convert them into
binary labels for each amino acid in the sequence. We then compute the precision π and recall ρ of
each latent component k in detecting a given feature ϕ.
Let A be the set of all amino acids and Aϕ+ the set of amino acids that have been annotated with the
feature ϕ. Considering a latent k to be active (k+) for a given amino acid when its value zk exceeds
a certain threshold τz, we have:

$$\pi = P(\phi^+ \mid k^+) = \frac{\left|\{a \in A_{\phi^+} : z_k > \tau_z\}\right|}{\left|\{a \in A : z_k > \tau_z\}\right|} \qquad (5)$$

$$\rho = P(k^+ \mid \phi^+) = \frac{\left|\{a \in A_{\phi^+} : z_k > \tau_z\}\right|}{\left|A_{\phi^+}\right|} \qquad (6)$$
This gives us a value of precision and recall for each pair (k, ϕ). We consider a latent component to
be associated with a specific feature if its precision or recall exceeds a predefined threshold, which
we set to 0.80.
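In code, computing π and ρ for a single latent-feature pair could look like the following sketch (array names and the default threshold are illustrative):

```python
import numpy as np

def latent_feature_scores(z_k: np.ndarray, labels: np.ndarray, tau_z: float = 0.1):
    """Precision (eq. 5) and recall (eq. 6) of latent k for feature phi.

    z_k:    (n_amino_acids,) activation of latent k at each position
    labels: (n_amino_acids,) boolean, True where phi is annotated
    """
    active = z_k > tau_z                   # k^+
    tp = np.sum(active & labels)           # active on annotated positions
    precision = tp / max(active.sum(), 1)  # P(phi^+ | k^+)
    recall = tp / max(labels.sum(), 1)     # P(k^+ | phi^+)
    return precision, recall
```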
3.4 GENERATING STEERED SEQUENCES
Once a latent corresponding to a specific feature is identified, we can steer the model during in-
ference to increase the likelihood of generating protein sequences that contain that feature. This
approach, previously demonstrated in natural language models by Templeton et al. (2024), is outlined
in figure 1.
We begin with a randomly generated amino acid sequence of fixed length. After a forward pass
through the model and the encoder layer of the SAE, we modify the target latent zk by scaling and
shifting its value to increase its magnitude (equation 7). We then pass the modified value zk∗ through
the decoder layer fdec and add back the original reconstruction error of the embedding x before
passing it through the rest of the model (equation 8).
$$z_k^* = a \cdot z_k + b \qquad (7)$$

$$x^* = f_{\text{dec}}(z_k^*) + x_{\text{err}} \qquad (8)$$
For each position in the sequence, we randomly sample an amino acid according to the probability
distribution predicted by the model under the intervention. We repeat this process starting from the
predicted sequence and perform 100 iterations of inference-prediction to refine the sequence. We
select the sequence at the iteration where the value of the activation zk is maximum.
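One way to realize this intervention is a PyTorch forward hook on the chosen transformer block. The sketch below follows eqs. (7)-(8); the attachment point and the sampling helpers in the usage comments are hypothetical, since the exact model plumbing is not shown here:

```python
import torch

def steering_hook(sae, k: int, a: float = 2.0, b: float = 5.0):
    """Returns a forward hook that rewrites a block's output via the SAE."""
    def hook(module, inputs, output):
        x = output[0] if isinstance(output, tuple) else output
        z = sae.encode(x)
        x_err = x - sae.decode(z)      # reconstruction error, kept fixed
        z[..., k] = a * z[..., k] + b  # eq. (7): scale and shift latent k
        x_star = sae.decode(z) + x_err # eq. (8)
        return (x_star,) + output[1:] if isinstance(output, tuple) else x_star
    return hook

# Hypothetical usage: attach the hook, then run the iterative refinement loop.
# handle = model.layers[4].register_forward_hook(steering_hook(sae, k=610))
# for _ in range(100):
#     logits = model(tokens)["logits"]   # forward pass under intervention
#     tokens = sample_tokens(logits)     # hypothetical sampler
#     # ... track the activation z_k and keep the sequence that maximizes it
# handle.remove()
```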
Figure 1: Sequence generation procedure. (A) To steer the model outputs, the base Protein Language
Model is modified through the insertion of a sparse autoencoder in the residual stream, at a particular
layer. During inference, the value of one of the latents in the autoencoder is modified. (B) Starting
from a random sequence, we perform inference with the intervened model and sample a new sequence
from the output logits. We repeat this procedure iteratively a fixed number of times (e.g., 100), and
at the end retain the sequence that gives the highest value for the activation of the target latent zk.
4 RESULTS
For the interpretability analysis, we focus on the autoencoder that provides the best trade-off between
sparsity and reconstruction quality, as described in section A.3.2 of the appendix. We compute recall
and precision for all [k, ϕ] pairs, following the methodology outlined in section 3.3, using three
increasing thresholds of latent activation. This allows us to assess the robustness of the identified
features.
Table 1: Number of latent-feature annotation pairs with a minimum precision/recall of 0.8 for
different values of the activation threshold τz.

τz               # Pairs (Precision > 0.8)   # Pairs (Recall > 0.8)   Total
0.01             4                           262                      266
0.10             8                           234                      242
1.00             133                         61                       194
Total (unique)   133                         262                      395
We find 395 putative [k, ϕ] associations, detailed in table 1. Among these are latent components
associated with different binding sites, cellular regions, and motifs such as zinc fingers. The complete
set of latent-feature associations is available in the supplementary data (see section 5).
We also identify many potential associations with lower confidence (lower values of precision or
recall). To see how many putative associations are found at each value of precision/recall (as well as
a combined F1-score), we plot their cumulative distributions in figure 2.
Intuitively, a latent component that perfectly matches an annotation type should exhibit both high
precision and recall, resulting in a high F1-score. However, since the model is trained to optimize
a masked language modeling loss, the features it learns may not directly align with those in the
manually curated dataset. For instance, a latent k might encode a more specific subcategory of a
dataset label ϕ, such as identifying the starting amino acid of a helix rather than the entire helix
structure (as observed by Adams et al. (2025) for some features). In such cases, the association
between k and ϕ would likely have high precision but low recall.
Similarly, high recall but low precision may indicate that the model has learned a coarser-grained
feature than those defined in the dataset. This is evident in cases such as latent k = 610, which
activates across various types of zinc fingers, and latent k = 555, which responds to alpha-keto
acids. To avoid selecting latents whose high recall is trivial, such as those activating indiscriminately
on all amino acids, we also assess the proportion of times a latent is active on amino acids lacking a
given label, denoted P(k+|ϕ−), before confirming an association. In section A.3.3, we present
examples of the distributions of P(k+|ϕ+) and P(k+|ϕ−) for a zinc finger region.
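This quantity can be computed analogously to the precision and recall above; a small sketch with illustrative names:

```python
import numpy as np

def false_activation_rate(z_k: np.ndarray, labels: np.ndarray, tau_z: float = 0.1) -> float:
    """P(k^+ | phi^-): fraction of unannotated positions on which latent k fires."""
    active = z_k > tau_z
    negatives = ~labels
    return np.sum(active & negatives) / max(negatives.sum(), 1)
```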