0% found this document useful (0 votes)

10 views25 pages

Contrastive Learning of T Cell Receptor Representations: Few-Shot

learning of T cell receptor

Uploaded by

liviu100

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

10 views25 pages

Contrastive Learning of T Cell Receptor Representations: Few-Shot

learning of T cell receptor

Uploaded by

liviu100

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 25

Contrastive learning of T cell receptor representations

Yuta Nagano,1, 2 Andrew Pyo,3 Martina Milighetti,2, 4 James Henderson,2, 5

John Shawe-Taylor,6 Benny Chain,2, 6, ∗ and Andreas Tiffeau-Mayer2, 5, ∗
1
Division of Medicine, University College London
2
Division of Infection and Immunity, University College London
3
Center for the Physics of Biological Function, Princeton University
4
Cancer Institute, University College London
5
Institute for the Physics of Living Systems, University College London
6
Department of Computer Science, University College London
Computational prediction of the interaction of T cell receptors (TCRs) and their ligands is a grand
challenge in immunology. Despite advances in high-throughput assays, specificity-labelled TCR data
remains sparse. In other domains, the pre-training of language models on unlabelled data has been
successfully used to address data bottlenecks. However, it is unclear how to best pre-train protein
language models for TCR specificity prediction. Here we introduce a TCR language model called
SCEPTR (Simple Contrastive Embedding of the Primary sequence of T cell Receptors), capable
of data-efficient transfer learning. Through our model, we introduce a novel pre-training strategy
arXiv:2406.06397v2 [q-bio.BM] 10 Oct 2024

combining autocontrastive learning and masked-language modelling, which enables SCEPTR to

achieve its state-of-the-art performance. In contrast, existing protein language models and a variant
of SCEPTR pre-trained without autocontrastive learning are outperformed by sequence alignment-
based methods. We anticipate that contrastive learning will be a useful paradigm to decode the
rules of TCR specificity.

Antigen-specific T cells play important protective representation learning. A TCR representation model
and pathogenic roles in human disease [1]. The recog- that compactly captures important features would
nition of peptides presented on major histocompat- provide embeddings useful for data-efficient training
ibility complexes (pMHCs) by αβ T cell receptors of downstream specificity predictors.
(TCRs) determines the specificity of cellular immune In natural language processing (NLP), unsuper-
responses [2]. Hyperdiverse αβTCRs are generated vised pre-trained transformers have demonstrated ca-
during T cell development in the thymus by genetic pacity for transfer learning to diverse downstream
recombination of germline-encoded V, D (for TCRβ) tasks [29–31]. This has spurred substantial work ap-
and J gene segments with additional diversification plying transformers to protein analysis. Protein lan-
by trimming of the germline and insertions of non- guage models (PLMs) such as those of the ESM [32,
template nucleotides at gene segment junctions. 33] and ProtTrans [34] families have been successfully
A major goal of systems immunology is to uncover used in structure-prediction pipelines and for protein
the rules governing which TCRs interact with which property prediction [35–37]. PLMs have also been ap-
pMHCs [3]. Advances in high-throughput functional plied to TCR-pMHC interaction prediction [11, 22–
assays of TCR specificity [4–6] have made the use 24], and the related problem of antibody-antigen in-
of machine learning a promising prospect to discover teraction prediction [38, 39]. However, there has been
such rules. limited systematic testing of how competitive PLM
The most direct approach for applying machine embeddings are in the few-shot setting typical for
learning to TCR specificity prediction has been to most ligands – that is, where only few labelled data
train pMHC-specific models that take an arbitrary points are available for transfer learning.
TCR and predict binding [7–12]. More ambitiously, To address this question, we benchmarked existing
model architectures have been proposed that can in PLMs on a standardised few-shot specificity predic-
principle generalise predictions to arbitrary pMHCs tion task, and surprisingly found that they are inferior
as well [13–24]. Independent benchmarking studies to state-of-the-art sequence alignment-based meth-
have shown that both approaches are effective for pre- ods. This motivated us to develop SCEPTR (Simple
dicting TCR binders against pMHCs for which many Contrastive Embedding of the Primary sequence of
TCRs have been experimentally determined [25], but T cell Receptors), a novel TCR PLM which closes
generalisation to pMHCs not seen during training has this gap. Our key innovation is a pre-training strat-
largely remained elusive [26] and prediction accuracy egy involving an autocontrastive learning procedure
is limited for pMHCs with few known binders [27]. adapted for αβTCRs, which we show is the primary
This severely limits the utility of current predictive driver behind SCEPTR’s improved performance.
tools given that to date, only ∼ 103 of the > 1015 pos-
sible pMHCs are annotated with any TCRs in VDJdb
[28], and given that for > 95% of them less than 100 I. RESULTS
specific TCRs are known.
Meanwhile, there is abundant unlabelled TCR se-
A. Benchmarking PLM embeddings on TCR
quence data that may be exploited for unsupervised specificity prediction

Given the scarcity of specificity-labelled TCR data,

∗ Joint last authors it is of practical importance to evaluate model perfor-
2

a c Representation Space

Reference TCRs
TCR Distance
Calculation
d( , ) = 3windel + 1wsub
Model

Sequence
CAS ALNEQFF pMHC

CASSGALG QFF
Alignment Query TCRs

insertions / deletions substitutions

b d

TCR Distance
Calculation
d( , ) = x

Representation
Space

SCEPTR (ours)

Representation
Model

TCRs

Figure 1. Benchmarking TCR language models against sequence alignment-based approaches on few-shot
TCR specificity prediction. a) TCR similarity can be quantified using sequence-alignment by taking a (weighted)
count of how many sequence edits turn one TCR into another. b) Learned sequence representations allow alignment-free
sequence comparisons based on distances in the embedding feature space. c) Sketch of our standardized benchmarking
approach to allow side-by-side comparison of sequence-alignment and embedding methods. Using a reference set of known
TCR binders to a pMHC of interest, we propose nearest neighbour prediction as a task for unbiased comparison of the
quality of embeddings for specificity prediction. d) Performance of six different models on TCR specificity prediction
as a function of the number of reference TCRs. Specificity predictions were made by the nearest neighbour method
sketched in c against six different pMHCs and performance is reported as the AUROC averaged across the pMHCs. The
error bars represent standard deviations of model AUROCs relative to the average across all models within a data split.

mance where access to such data is limited. Therefore, els in the few-shot regime, since it remains well defined
we set up a benchmarking framework focused on few- for as few as a single reference TCR and does not re-
shot TCR specificity prediction. quire model specific fine-tuning.
To conduct our benchmark, we curated a set of We conducted multiple benchmarks for each
specificity-labelled αβTCR data from VDJdb [28]. pMHC, varying the number of its cognate TCRs used
We only included human TCRs with full α and β chain as the reference set. In each case, we combined the
information, and excluded data from an early 10x Ge- remaining TCRs for the target with the rest of the
nomics whitepaper [40], as there are known issues with filtered VDJdb dataset (including TCRs annotated to
data reliability in this study [41, 42]. This left us with pMHCs other than the six target pMHCs) to create
a total of 7168 αβTCRs annotated to 864 pMHCs. Of a test set (see methods III A). By studying how per-
these, we used the six pMHCs with greater than 300 formance depends on the size of the reference set, we
distinct binder TCRs for our benchmarking task. are effectively probing representation alignment with
We created a benchmarking task that allowed us to TCR co-specificity prediction at different scales.
directly compare sequence alignment-based distance We benchmarked six models: two alignment-
metrics such as the state-of-the-art TCRdist [4, 43] based TCR metrics (CDR3 Levenshtein distance
(Fig. 1a) to distances in PLM embedding spaces and TCRdist [4]), two general-purpose PLMs (Prot-
(Fig. 1b). For each pMHC, we tested models on their Bert [34] and ESM2 [33]), and two TCR domain-
ability to distinguish binder TCRs from non-binders specific language models (TCR-BERT [11] and our
using embedding distances between a query TCR and own model SCEPTR). We report performance using
its closest neighbour within a reference set (Fig. 1c). the area under the receiver operator characteristic
We call this nearest neighbour prediction. This frame- (AUROC) averaged over the tested pMHCs.
work is simple and attractive for benchmarking mod- To our surprise, we found that TCR-BERT, ESM2,
3

and ProtBert all fail to outperform the baseline se- needs to be introduced. The transformer is a neural
quence alignment method (CDR3 Levenshtein) and network developed in NLP that uses dot-product at-
are significantly inferior to TCRdist (Figs. 1d / S1). tention to flexibly learn long-range dependencies in se-
A repeat of the benchmarking with a broader set of quential data [29]. BERT is an encoder-only variation
epitopes obtained by including post-processed 10x Ge- of the transformer useful for text analysis and pro-
nomics whitepaper data [42] recapitulated these re- cessing [30]. BERT’s innovation was its ability to be
sults, demonstrating the robustness of our findings pre-trained in an unsupervised manner through MLM,
(Fig. S2). In contrast to existing PLMs, SCEPTR per- where snippets of text are fed to the model with a
forms on par with or better than TCRdist (Figs. 1d / certain proportion of tokens (e.g. words) masked, and
S1, Table SI). For a reference set of size 200, SCEPTR the model must use the surrounding context to recon-
performs better than TCRdist for five out of six tested struct the masked tokens. MLM allowed BERT and
peptides (p = 0.11, binomial test), and for all six pep- its derivative models to exploit large volumes of unla-
tides when compared to all other models (p = 0.015, belled data to learn grammar and syntax and achieve
binomial test). high performance on downstream textual tasks with
We additionally compared models using the aver- comparatively little supervised fine-tuning.
age distance between a query TCR and all references, While MLM-trained PLMs have been successful in
instead of only the nearest neighbour (Fig. S3). In some protein prediction tasks [32–34], they have been
this case, SCEPTR outperforms other models by an documented to struggle with others [37]. Our bench-
even wider margin. Interestingly, all models perform marking results led us to believe that MLM pre-
worse compared to their nearest neighbour counter- training may not be optimal for TCR-pMHC speci-
part. This finding might be explained mechanistically ficity prediction. Firstly, the majority of observed
by the multiplicity of viable binding solutions with TCR sequence variation is attributable to the stochas-
distinct sequence-level features, which are thought tic process of VDJ recombination. As such, MLM
to make up pMHC-specific TCR repertoires [44, 45]. may not teach models much transferable knowledge
We hypothesized that the poor performance of prior for specificity prediction. Secondly, since the low vol-
PLMs might be overcome by learning projections ume of specificity-labelled TCR data provides limited
from their high-dimensional representation space that opportunities for fine-tuning complex models, repre-
align better with the TCR co-specificity prediction sentation distances should ideally be directly predic-
task. We thus used the embeddings as input to tive of co-specificity.
linear support vector classifiers (Appendix B). Op- We were inspired to use contrastive learning to
timised linear probing does improve prediction per- overcome these problems by the success of our previ-
formance (Fig. S4), but they remain inferior to near- ous work using statistical approaches to uncover pat-
est neighbour predictions using SCEPTR further il- terns of sequence similarity characteristic of ligand-
lustrating the usefulness of SCEPTR embeddings for specific TCR repertoires [44–46]. Contrastive learning
data-efficient transfer learning. minimises distances between model representations of
In some use cases, we may want to apply SCEPTR positive sample pairs while maximising distances be-
to the analysis of single chain TCR data. In bench- tween background pairs (Fig. 2c) through a loss func-
marking on either α or β chain reference data alone tion of the following form [47, 48]:
prediction accuracy drops somewhat regardless of dis-
tance measure (Fig. S5), as expected given that both Lcontrastive (f ) :=
receptor chains provide non-redundant information
" ⊤ +
#
ef (x) f (x )
about specificity [45]. Importantly though, SCEPTR E − log f (x)⊤ f (x+ ) P f (x)⊤ f (y )
distances also provide comparable or better prediction (x,x+ )∼ppos e + ie i

iid
accuracy than TCRdist on the single chain level. {yi }N
i=1 ∼ pdata

(1)

where f : X → S m−1 is a trainable embedding map-

B. Autocontrastive learning as a pre-training ping from sample observation space X to points on
strategy the m-dimensional unit hypersphere S m−1 ⊂ Rm , ppos
is the joint distribution of positive pairs, pdata is the
We now briefly summarize SCEPTR’s architecture overall data distribution, and N ∈ Z+ is some fixed
and autocontrastive pre-training strategy (see Meth- number of background samples.
ods for full details). SCEPTR featurises an input There are several well-known variants of this learn-
TCR as the amino acid sequences of its six CDR ing approach. In supervised contrastive learning, pos-
loops. It uses a simple one-hot encoding system to itive pairs are generated by sampling observations
embed the amino acid tokens, and uses a stack of three known to belong to the same class (Fig. 2d top). In
self-attention layers to generate a 64-dimensional rep- the context of TCRs, we can define positive pairs
resentation vector of the input receptor (Fig. 2a,b). to be TCRs annotated to interact with the same
Unlike existing TCR language models, SCEPTR is pMHC, in which case contrastive learning regresses
jointly pre-trained using autocontrastive and masked- distances between TCR pairs to their probabilities of
language modelling (MLM) (Fig. 2c,d). co-specificity. Autocontrastive learning approximates
To motivate the considerations that have led us to such positive pairs through data augmentation by gen-
adopt this training paradigm, some background on erating two independent “views”’ of the same obser-
transformer architectures and their training by MLM vation (Fig. 2d bottom).
4

a b
<cls> Average-Pooled one-hot
Representation Representation (22 dims)

Embedding Vector
CASSALNEQFF relative
position
Contextualised AA Residue Embeddings Token ID: S (1 dim)
0-indexed Position: 2 (of 10) 0.2
... ... ... ... ... ... Compartment ID: CDR3B one-hot
(6 dims)

Initial Token Embeddings

Self-Attention
Stack ... ... ... ... ... ...

...
0.0 0.5 1.0 0.0 0.5 1.0 0.0 0.33 0.66 1.0
Initial Embedding ... ... ... ... ... ...
Vectors
<cls> CDR1A CDR2A CDR3A
Embedding Compartments
Amino Acid
T ... Y G ... N C ... F P ... T F ... Q C ... F
Sequence
CDR1A CDR2A CDR3A CDR1B CDR2B CDR3B
d
Supervised Contrastive Learning
Matched TCRs: same speciﬁcity
TCR

Autocontrastive Learning
Matched TCRs: two views of same TCR

Add repulsive force Add attractive force

v v v
between all TCRs between matched TCRs

Figure 2. A visual introduction to how SCEPTR works. a) SCEPTR featurises an input TCR as the amino acid
sequences of its six CDR loops. Each amino acid residue is vectorised to R64 (see panel b) and are passed along with the
special <cls> token vector through a stack of three self-attention layers. SCEPTR uses the contextualised embedding
of the <cls> token as the overall TCR representation, in contrast to the average-pooling representations used by other
models. b) SCEPTR’s initial token embedding module uses a simple one-hot system to encode a token’s amino acid
identity and CDR loop number, and allocates one dimension to encode the token’s relative position within its CDR
loop as a single real-valued scalar. c) Contrastive learning allows us to explicitly optimise SCEPTR’s representation
mapping for TCR co-specificity prediction. At a high level, contrastive learning encourages representation models to
make full use of the available representation space while keeping representations of similar input samples close together.
d) Contrastive learning generalises to both the supervised and unsupervised settings. In the supervised setting, positive
pairs can be generated by sampling pairs of TCRs that are known to bind the same pMHC. In the unsupervised setting,
positive pairs can be generated by generating two independent “views” of the same TCR. We implement this by only
showing a random subset of the input data features for every view – namely, we remove a proportion of input tokens
and sometimes drop the α or β chain entirely (see methods III C).

Given the scarcity of available labelled data, we censoring strategy inspired by masked-language mod-
opted to use the autocontrastive approach for purely eling that randomly removes a proportion of residues
unsupervised PLM pre-training (see Sec. I E for an ap- or even complete α or β chains. In contrast to the
plication of supervised contrastive learning, more sim- only other study known to us having explored the ap-
ilar to other recent applications of contrastive learn- plication of autocontrastive learning to TCRs [53], we
ing to TCRs [49, 50]). For this pre-training, we used trained SCEPTR on all six hypervariable loops of the
data on close to a million unique paired TCRs ob- full paired chain αβ TCR, as all contribute to TCR-
tained by Tanno et al. [51], which represents one the pMHC specificity [45]. That being said, our chain
largest collections of αβ TCRs from a single study dropping procedure during censoring ensures that sin-
collected to date. This data was filtered and stan- gle chain data are also in distribution for the model,
dardized as described in methods III C. We gener- giving SCEPTR flexibility for downstream applica-
ate different “views” of a TCR by dropout noise as tions with bulk sequenced TCR repertoires (Fig. S5).
is standard in NLP [52], but additionally adopted a
We define SCEPTR’s output representation vector
5

to be a contextualised embedding of a special input SCEPTR (synthetic data): This variant is trained
token called <cls> (the naming convention for <cls> on a size-matched set unlabelled αβTCRs gen-
comes from the fact that the output of this vector is erated by OLGA [55], a probabilistic model of
often used for downstream classification [30]), which VDJ recombination. This synthetic data models
is always appended to the tokenised representation of only the recombination statistics and thus esti-
an input TCR (Fig. 2a). This allows SCEPTR to mates the TCR distribution without taking into
fully exploit the attention mechanism when generat- account the imprints of thymic and peripheral
ing the overall TCR representation. Such training selection found in real repertoires [44].
of a sequence-level representation is uniquely made
possible by having an objective – the contrastive loss SCEPTR (shuffled data): This variant is trained
(Eq. 1) – that directly acts on the representation out- on the same set of αβTCRs as the original
put. In contrast, MLM-trained PLMs such as Prot- model, but the α/β chain pairing is randomised,
Bert, ESM2 and TCR-BERT generate sequence em- thus removing pairing biases [56].
beddings by average-pooling the contextualised em- We find that SCEPTR trained on synthetic or shuf-
beddings of each input token at some layer: a destruc- fled data performs worse for five out of six pMHCs
tive operation which risks diluting information [37]. (p = 0.11, binomial test), but differences in AUROCs
between model variants are small and regardless of
training data SCEPTR performs on par with TCRdist
C. Ablation studies (Fig. 3c).
Taken together, these ablation studies provide evi-
To understand which modelling choices drive the dence that autocontrastive learning is the main factor
improved performance of SCEPTR, we trained vari- enabling SCEPTR to close the gap between PLMs and
ants of SCEPTR ablating a single component of ei- alignment-based methods.
ther its architecture or training at a time, and bench- Information-theoretic analysis of the sequence de-
marked them using the framework described previ- terminants of TCR specificity demonstrate that all
ously. CDR loops and their pairing are important for deter-
To establish the contribution of the autocontrastive mining binding specificity [45]. To understand how
learning to SCEPTR’s performance, we trained the much SCEPTR’s improved performance with respect
same model only using masked-language modelling: to TCR-BERT is due to the restriction of the latter
model’s input to the CDR3 alone, we trained variants
SCEPTR (MLM only): This variant is trained of SCEPTR restricted to this hypervariable loop:
only on MLM, without jointly optimising for
autocontrastive learning. Following conven- SCEPTR (CDR3 only): This variant only accepts
tion in the transformer field [54], TCR rep- the α and β chain CDR3 sequences as input
resentation vectors are generated by average- (without knowledge of the V genes/first two
pooling the contextualised vector embeddings CDR loops of each chain). It is jointly optimised
of all constituent amino acid tokens produced for MLM and autocontrastive learning.
by the penultimate self-attention layer, and ℓ2- SCEPTR (CDR3 only, MLM only): This other-
normalising the result. wise equivalent variant is only trained using
the MLM objective, and thus uses the average-
The MLM-only variant underperforms compared to
pooling representation method.
both SCEPTR and TCRdist, demonstrating that au-
tocontrastive learning is a necessary ingredient for the The results demonstrate that taking into account all
increased performance of SCEPTR in few-shot speci- CDR loops leads to a performance gain as expected
ficity prediction (Fig. 3a). (Fig. 3d). We also see that autocontrastive learning
We next sought to determine how much our pool- even when restricted to CDR3s leads to a substantial
ing strategy and training dataset choice contributed performance gain, helping the autocontrastive CDR3
to SCEPTR’s performance gain. First, we asked variant achieve similar performance to the full-input
whether autocontrastive learning also improves em- MLM-only variant.
beddings generated via token average-pooling:

SCEPTR (average pooling): This variant re- D. Comparison of SCEPTR embeddings to

ceives both autocontrastive learning and alignment-based TCR similarity
MLM, but uses the average-pooling method to
generate TCR representations. To gain insights into what SCEPTR has learned
during pre-training, we compared its embedding dis-
While SCEPTR’s <cls> embeddings achieve the tances to alignment-based distances as calculated by
best results, the autocontrastive average-pooling vari- TCRdist. To do so, we calculated all pairwise dis-
ant still performs on par with TCRdist (Fig. 3b). tances between one thousand TCRs randomly sam-
Second, we determined how the performance of pled from the testing partition of the Tanno et al.
SCEPTR depends on the precise dataset used for pre- dataset [51] (held out during SCEPTR pre-training)
training. To answer this question we trained two vari- using both models. We find that SCEPTR and
ants of SCEPTR using size-matched datasets: TCRdist distances are clearly correlated (Fig. 4a).
6

Training Ablation Architectural Ablation

a 0.80 b

0.75
Mean AUROC

0.70

0.65

0.60 SCEPTR (MLM only) SCEPTR (average pooling)

SCEPTR
Data Ablation Feature Ablation TCRdist
c 0.80 d TCR-BERT

0.75
Mean AUROC

0.70

0.65
SCEPTR (shuffled data) SCEPTR (CDR3 only)
0.60 SCEPTR (synthetic data) SCEPTR (CDR3 only, MLM only)

1 2 5 10 20 50 100 200 1 2 5 10 20 50 100 200

Number of Reference TCRs Number of Reference TCRs

Figure 3. Autocontrastive pre-training significantly improves SCEPTR’s downstream performance. The

subplots show performance profiles of SCEPTR, TCRdist, TCR-BERT, and various ablation variants of SCEPTR on
binary specificity prediction. a) Training SCEPTR solely on MLM results in worse specificity prediction performance.
b) The baseline SCEPTR variant which uses the <cls> pooling method performs marginally better than the variant which
uses the average-pooling method. However, the average-pooling variant still performs on par with TCRdist. c) Replacing
SCEPTR’s pre-training dataset with 1) the same dataset from Tanno et al., but with α/β chain pairing shuffled, and 2)
synthetic data generated by OLGA both result in similar specificity prediction performance. d) Restricting SCEPTR’s
featurisation of input TCRs to the amino acids of the α and β CDR3 loops significantly worsen downstream performance.
Additionally restricting training to only MLM further degrades performance, and produces a model with a near-equivalent
performance profile to TCR-BERT.

a b c
400 400 10 15
TCRdist distance

TCRdist distance

250
10 18
300 300 200
pgen

10 21
200 200 10 24 150
100 r = 0.380 100 10 27
r = 0.479
100
1.0 1.5 1.0 1.5 25 20 15
SCEPTR distance SCEPTR distance log10 (pgen)

Figure 4. SCEPTR embedding distances weight sequence similarity with respect to recombination biases.
a) Scatter plot of SCEPTR and TCRdist distances between pairs of TCRs from the held-out test set of the pre-training
dataset. The points are coloured according to a Gaussian kernel density estimate. b) Colouring TCR pairs instead by the
minimal probability of generation pgen of the two TCRs as estimated by OLGA [55] suggests that SCEPTR embeddings
locally contract regions of representation space that due to recombination biases are sparsely sampled. c) For sequence
pairs judged to be similar by SCEPTR (distance ∈ [0.98, 1.02]), variations in pgen explain a substantial fraction of the
variance in TCRdist, providing statistical evidence for the hypothesized weighting of sequence similarity with respect to
the local density of sequences produced by VDJ recombination (see Fig. S6 for the generality of this dependence across
SCEPTR bins).
7

This shows that in parts the success of SCEPTR can 1.0

be understood by its embedding distances providing
good alignment-free approximations to traditional se-
0.9
quence similarity measures. Yet, there is also substan-
tial variability between both measures for many pairs,
and an inspection of such discordant pairs can provide 0.8

AUROC
insights into how metrics differ.
First, we noticed that pairs of sequences judged 0.7
to be similar by SCEPTR but not TCRdist were
less likely to be generated during VDJ recombina- 0.6 SCEPTR (finetuned)
tion (Fig. 4b). We found that among sequence pairs SCEPTR
judged to be similar by SCEPTR, TCRdist distance TCRdist
was strongly negatively correlated with pgen (Figs. 4c 0.5 TCR-BERT
/ S6). This implies that SCEPTR embeds TCRs
closer to each other if they are in regions of sequence

FLL
E

V
YY
DL

AT
VF

LG
RT
YF
space that are less densely sampled by the generative

MV
LM

SF
QP
RW
PF

VP
GIL

DP
YL
distribution. As argued in detail in the Discussion,

NL
TT
YV
this property of SCEPTR embeddings is expected

TFE
on theoretical grounds due to the loss function used Epitope
for contrastive learning. This property might enable
SCEPTR embeddings to capture the intuition that
finding close-by nearest neighbours is more surprising Figure 5. Supervised contrastive learning improves
for sequences with low pgen and thus more informa- discrimination between pMHCs. Prediction perfor-
mance as measured by AUROC on binary one-versus-rest
tive compared to similarity between highly probable
classification for each of six pMHCs for different models.
TCRs. The fine-tuned model improves performance by exploiting
Second, we noticed that a high similarity on a single the discriminative nature of the classification task.
chain tended to be sufficient for a small SCEPTR dis-
tance (Fig. S7). To quantify this effect, we analysed
how SCEPTR distances correlate with different ways (methods III D).
of averaging the α and β chain TCRdist distances into We used the framework from section I A to bench-
a paired chain measure. We found that SCEPTR dis- mark the performance of fine-tuned SCEPTR, using
tances correlate more closely with the minimum dis- the training set as the references. The results show
tance of the two chains rather than their arithmetic that fine-tuning can greatly improve the ability of
mean (Fig. S8). Further investigation might focus on the model to discriminate between pMHCs (Fig. 5).
whether this property helps prediction performance Improvements are most noticeable for the pMHCs
due to the varying contributions of the TCR α and β against which other methods achieve relatively low
chain to specific binding across pMHCs [45]. performance. When filtering all TCRs with greater
than 90% or 80% sequence similarity to any training
sequence from the test set, the fine-tuned model still
E. Supervised contrastive learning as a improves performance significantly (Fig. S9) show-
fine-tuning strategy ing that learning goes beyond memorization of public
TCRs.
Supervised contrastive learning provides an avenue Interestingly, unlike other models, fine-tuned
to further optimise pre-trained embeddings for TCR SCEPTR makes better inferences by measuring the
specificity prediction. As a proof-of-concept, we fine- average distance between a query TCR and all refer-
tuned SCEPTR to better discriminate between the six ence TCRs instead only the nearest TCR (Fig. S13).
pMHC specificities used as the benchmarking targets This suggests an ability of supervised contrastive
in section I A. fine-tuning to help the model discover the common-
For this task we took all the TCRs annotated alities between the multiple different binding solu-
against the target pMHCs from our labelled TCR tions thought to exist for each pMHC. We thus anal-
dataset, and split them into a training, a validation, ysed how fine-tuning changes the SCEPTR embed-
and a testing set. We ensured that no study used ding distances between co-specific and cross-pMHC
for training or validation contributed any data to the TCR pairs (Fig. 6). Unexpectedly, we found that a
test set, so that the fine-tuned model would not be major difference of fine-tuned SCEPTR distances con-
able to achieve good performance simply by exploit- cerns cross-pMHC pairs. We observe that fine-tuning
ing inter-dataset biases. The training set included 200 allows the model to identify a subset of “easy” nega-
binders against each target pMHC, totalling to 1200 tive pairs. These presumably involve TCRs the model
TCRs. The rest of the TCRs from the same studies is highly confident are specific to different pMHC, thus
were used to construct the validation set. TCRs from illustrating how discrimination between a fixed set of
all remaining studies were used for the testing set, potential target pMHCs is easier than binary classifi-
which comprised of 5670 TCRs. SCEPTR was fine- cation with respect to TCRs of arbitrary specificity.
tuned on the training set with supervised contrastive Conversely, the fine-tuned model’s performance de-
learning, using the validation loss for early stopping grades with respect to unseen pMHCs (Fig. S10), per-
8

(metaclonotypes) by sequence-based clustering [4, 43].

A limitation of our study is that we did not un-
dertake a complete exploration of training and archi-
tectural hyperparameters, as well as training dataset
Cospecific choice. We envisage multiple avenues that may im-
0.8 True prove SCEPTR further. Firstly, training could be
SCEPTR (finetuned) distances

False made more efficient by optimising the distribution of

masked/dropped tokens during pre-training, taking
0.6 into account the variable relevance of different parts
of the sequence in determining specificity [45]. Sec-
0.4 ondly, as certain sequence motifs appear recurrently
(e.g. CDR3 loops often begin with CAS), a more in-
telligent tokenisation scheme could offload learning
0.2 of these primary sequence statistics into the tokeni-
sation process. Finally, with the emergence of in-
creasingly large paired-chain TCR datasets [57–59],
0.0
retraining SCEPTR on data from multiple sources
0.0 0.5 1.0 1.5 could eliminate biases inherent to specific experimen-
SCEPTR (default) distances tal approaches and donor MHC restrictions.
Pre-trained PLMs have achieved high performance
Figure 6. Supervised contrastive learning reshapes
on protein stability and structural predictions [33, 34].
the embedding space. Scatter plot of distance between However, we find that existing PLMs fail to con-
pairs of TCRs as measured by the pre-trained and fine- fer similar benefits to predicting TCR-pMHC inter-
tuned versions of SCEPTR, and their marginal conditional actions. This finding adds to recent work showing
probability density functions. Points are coloured based on that current PLM pre-training is not well-aligned with
whether the corresponding TCR pair involves TCRs that certain downstream tasks [37]. Importantly, we show
are annotated against the same pMHC (orange) or against that autocontrastive pre-training can overcome mis-
different pMHCs (purple). Pairs of TCRs were obtained alignment, and thus provide a constructive path out
by sampling one receptor from the training partition and of this impasse which could also be applied outside of
the other from the test partition of the labelled TCR data the TCR domain.
used during fine-tuning.
What determines whether a certain downstream
task is aligned with MLM pre-training? MLM teaches
PLMs to predict the conditional distribution of tokens
haps unsurprisingly given the very limited number of
given sequence context. Thus it stands to reason that
pMHCs represented in the training data.
amenable downstream tasks involve predictions of
Another notable feature of Fig. 5 is that perfor-
properties that determine the distribution of observed
mance varies substantially by ligand. That is, predic-
proteins on sequence space. Observed proteins tend to
tion is easier for some pMHCs, regardless of method.
concentrate in areas of sequence space with higher pro-
This is a phenomenon that has also been observed in
tein stability since evolution on average selects for this
public benchmarks [25]. Consequently, we used coin-
property [60]. For datasets containing protein fami-
cidence (order two Rényi) entropy measures [44, 45]
lies whose members have a conserved structure despite
to discover intrinsic properties of the pMHC-specific
primary sequence variation, co-evolutionary couplings
TCR repertoires that determine model AUROCs.
driven by structural constraints influence allowed se-
We find that TCRs annotated against the “easier”
quence variability [33, 61, 62]. These data distribu-
pMHCs have lower V/J gene diversity (Fig. S11) and
tional properties might explain how MLM can teach
lower average sequence distances between pairs of
PLMs features related to both stability and structure.
TCRs (Fig. S12). This indicates that differences in
models’ ability to predict TCR-pMHC specificity are In contrast, the distribution of TCRs over sequence
linked to the diversity of epitope-specific repertoires. space is primarily shaped by the biases of VDJ re-
combination with antigen-specific selection playing an
important, but likely second-order effect [44]. While
long-term evolutionary pressures may act to align re-
II. DISCUSSION combination statistics with TCR function [63, 64], em-
pirical evidence so far suggests recombination biases
In this study, we have introduced SCEPTR, a primarily anticipate thymic selection for stability and
pre-trained TCR PLM that achieves state-of-the-art folding [65]. In contrast, studies to date have found no
few-shot TCR-pMHC specificity prediction accuracy. clear relationship between probabilities of recombina-
Through SCEPTR we demonstrate that joint au- tion and the likelihood of receptors engaging specific
tocontrastive and masked-language pre-training is a pMHCs [55]. Given these considerations, we expect
paradigm for learning PLMs better aligned with TCR MLM pre-training to align better to tasks concerning
specificity prediction tasks. Our model can be readily VDJ recombination than to TCR specificity predic-
used for alignment-free TCR analysis in downstream tion. Indeed, previous work training PLMs on adap-
applications (see code availability) including the un- tive immune receptors has demonstrated that embed-
supervised discovery of antigen-specific T cell groups dings strongly depend on V/J gene usage and can be
9

a b forms almost equivalently to the much larger but sim-

ilarly pre-trained TCR-BERT (Fig. 3d). This finding
0.8
200-shot Mean AUROC
stands in contrast to observations of performance scal-
ing with model size in general PLMs [33] and antibody
0.7 language modelling [39]. While the focus of the cur-
rent study was on training a simple model, it would be
interesting in future work to investigate performance
0.6 scaling with model complexity and training dataset
size with our novel training procedure.
0.5 Looking forward, there are many exciting avenues
104 106 108 101 102 103 to further develop contrastive learning as a paradigm
Parameter Count Representation to crack the TCR code. For example, there may be
Dimensionality ways to exploit the uniformity (Eq. 2) and alignment
(Eq. 3) decomposition to simultaneously train on un-
SCEPTR ESM2 (T6 8M) labelled and specificity-labelled data. A practical ben-
TCR-BERT ProtBert efit of our contrastive learning formulation is that it
does not require any optimisation with respect to the
true negative distribution (i.e. TCRs that are explic-
Figure 7. Model complexity does not correlate
with downstream performance. Model performance itly not co-specific) – a non-trivial distribution to es-
as measured by mean 200-shot AUROC (section I A) does timate for TCRs [66].
not scale with model complexity as measured by either a) Another interesting avenue is the use of labels other
parameter count or b) representation dimensionality. De- than pMHC specificity – such as phenotypic annota-
spite being the smallest PLM by a wide margin, SCEPTR tions from single-cell data – as additional supervised
performs better than alternative models. contrastive training signals. While supervised con-
trastive learning does not currently lead to general-
isable learning beyond training pMHCs, we expect a
used to predict primarily generation-related proper- transition towards generalisation as larger volumes of
ties such as receptor publicity [24, 38]. specificity-labelled TCR data become available, as has
Why does autocontrastive learning help to generate been the case with supervised contrastive learning in
embeddings better suited for specificity prediction? other fields [52, 67–70].
An interesting insight comes from an asymptotic de- Finally, while the focus of this work given current
composition of the contrastive loss function into the data limitations has been on learning TCR embed-
uniformity and alignment terms [47]: dings, contrastive learning may also help us learn ef-
fective joint TCR-pMHC embeddings in the future
h 2
i when the joint space (and particularly the pMHC
Unif.(f ) := log E e−∥f (x)−f (y)∥ (2) space) is better sampled, and thus ultimately enable
iid
x,y ∼ pdata the zero-shot prediction of TCR-pMHC specificity.
∥f (x) − f (x+ )∥

Align.(f ) := E (3)
(x,x+ )∼p pos

III. METHODS
Uniformity incentivises the model to make use of
the full representation space, while alignment min-
imises the expected distance between positive pairs A. Model benchmarking
(e.g. co-specific TCRs) [47]. From this view, con-
trastive learning on adaptive immune receptor data For each pMHC, we varied the number k of ref-
encourages PLMs to undo the large-scale distribu- erence TCRs where k ∈ {1, 2, 5, 10, 20, 50, 100, 200}.
tional biases created by VDJ recombination through Within each model-pMHC-k-shot combination, we
the uniformity term, while helping to identify fea- benchmarked multiple reference-test splits of the data
tures relating to TCR (co-)specificity via the align- to ensure robustness. For k = 1, we benchmarked ev-
ment term. While autocontrastive learning approxi- ery possible split. For k ∈ [2, 200], we benchmarked
mates the alignment term through the generation of 100 random splits, where we ensured that the same
pairs of views, it still provides a direct empirical es- splits were used across all models to reduce extrane-
timate for the uniformity term. Thus, a key bene- ous variance.
fit of autocontrastive learning may be that it reduces In assessing the statistical significance of differences
the confounding effects of VDJ recombination in em- in average model performance, we took a paired dif-
bedding space. SCEPTR’s ability to “adjust” its dis- ference approach. We expected certain pMHCs and
tances for pgen as demonstrated in Figs. 4 and S6 lends data splits to present a more difficult prediction prob-
support to this conjecture. lem than others. As we are interested in assessing
Comparing SCEPTR to other PLMs suggests that the relative performance of models, we calculated the
model complexity as measured by either parameter variance across splits of the difference between each
count or representation dimensionality is not currently individual model’s AUROC and the average across all
the limiting factor for TCR-pMHC prediction perfor- models. For each model, we estimated this variance
mance (Fig. 7). This is directly supported by how within each of the pMHCs, and then averaged these
the CDR3-only, MLM-only variant of SCEPTR per- variances to obtain an estimate of overall variance.
10

The CDR3 Levenshtein model computes the dis- step to remove clonotypes that shared the same nu-
tance between two TCRs as the sum of the Lev- cleotide sequence for either the α or the β chain, as
enshtein distances between the receptors’ α and β previously described [44]. After filtering for functional
CDR3s. TCRs using tidytcells, a TCR gene symbol standard-
Note that while SCEPTR’s architecture and train- iser [71], we retained 842,683 distinct clonotypes.
ing allows it to directly generate representation vec- A random sub-sample of 10% of this data was re-
tors for complete αβ TCR sequences (Fig. 2a, meth- served for use as an unseen test set, containing 84,268
ods III B), this is not the case for the other PLMs. unique clonotypes distributed across 83,979 unique
For TCR-BERT, ESM2 and ProtBert representations TCRs. Of the remaining 90% of the data, we filtered
for the α and β chains were independently gener- out any clonotypes with amino acid sequences that
ated, then concatenated together, and finally average- also appeared in the test set, resulting in a training
pooled to produce an embedding of the heterodimeric set of 753,838 unique clonotypes across 733,070 unique
receptor (see appendix A). TCRs.

B. SCEPTR architecture 2. Procedure

SCEPTR (Simple Contrastive Embedding of the SCEPTR was jointly optimised for MLM and auto-
Primary sequence of T cell Receptors) is a BERT- contrastive learning, where the total loss of a training
like transformer encoder that maps TCR sequences step was calculated as the sum of the MLM and au-
to vector embeddings. Like BERT, it is comprised of tocontrastive (Eq. 4) losses.
a tokeniser module, an embedder module and a self- We implemented MLM following established proce-
attention stack (Fig. 2a). dures [30]. Namely, 15% of input tokens were masked,
The tokeniser module represents each input TCR and masked tokens had an 80% probability of being
as the amino acid sequences of the first, second and replaced with the <mask> token, a 10% probability of
third complementarity-determining regions (CDRs) of being replaced by a randomly chosen amino acid dis-
each chain, where each amino acid is a token. A spe- tinct from the original, and a 10% probability of re-
cial <cls> token is appended to each input TCR, as maining unchanged. The MLM loss was computed as
its contextualised embedding will eventually become the cross-entropy between SCEPTR’s predicted token
SCEPTR’s output representation vector (Fig. 2a,b). probability distribution and the ground truth.
SCEPTR uses a simple, non-trainable embedder Our choice of autocontrastive loss function is in-
module, where a one-hot vector is used to encode to- spired by related work in NLP [52] and computer
ken identity (22 dimensions for 20 amino acids plus vision [48], but adapted to the TCR setting. Let
special tokens <cls> and <mask>), and token positions B = {σi }Ni=1 be a minibatch of N TCRs. We generate
are specified by first one-hot encoding the containing two independent “views” of each TCR σi by passing
CDR loop number (6 dimensions), then encoding the two censored-variants of the same receptor through
token’s relative position within the loop as a single the model. Our censoring procedure removes a ran-
scalar variable (Fig. 2b). This results in initial to- dom subset of a fixed proportion (20%) of the residues
ken embeddings in R29 , which are passed through a from the tokenised representation of the CDR loops
trainable linear projection onto R64 . SCEPTR’s self- and with a 50% chance drops either the full α or β
attention stack then operates at this fixed dimension- chain. To ensure that censoring does not fundamen-
ality (Fig. S16). SCEPTR’s self-attention stack com- tally alter the underlying TCR sequence, the posi-
prises three layers, each with eight attention heads and tional encoding for each token remains fixed relative
a feed-forward dimensionality of 256, and is thus sub- to the original TCR. In addition to the random censor-
stantially simpler than existing models. Our tests sug- ing, views also differ due to dropout noise during inde-
gest that relative position embedding helps SCEPTR pendent model passes. Taken together, this procedure
learn better calibrated TCR co-specificity rules (see maps the minibatch B to the set of 2N TCR views
appendix C). V = {vj }2Nj=1 , where v2i and v2i−1 are two indepen-
dent views of the same TCR σi (i ∈ {1...N }). Where
k ∈ I = {1...2N } is an arbitrary index of a view
C. SCEPTR Pre-training vk ∈ V , let p(k) be the index of the other view gener-
ated from the same TCR, and N (k) = {l ∈ I : l ̸= k}
1. Data be the set of all indices apart from k itself. Let rk
denote SCEPTR’s vector representation of TCR view
vk . Then the autocontrastive loss for minibatch B is
The unlabelled paired-chain αβTCR sequences used computed as follows:
to pre-train SCEPTR were taken from a study by
Tanno et al. [51], which provides 965,523 unique clono- ⊤
types sampled from the blood of 15 healthy human 1 X erk rp(k) /τ
LAC (B) = − log P (4)
subjects. As opposed to traditional single-cell se- 2N r⊤
k rn /τ
k∈I n∈N (k) e
quencing, Tanno et al. used a ligation-based sequenc-
ing method to resolve which α chains paired with Here, τ is a temperature hyper-parameter which we
which β chains. To mitigate potential noise from in- set to 0.05 during training, following previous litera-
correct chain pairing, we applied an extra processing ture [52].
11

We used ADAM (adaptive moment estimation) [72] as the model seeing 100,000 binders for each pMHC.
to perform stochastic gradient descent. We chose a Given our a batch size of 1,024 TCRs, this corre-
minibatch size of 1024 samples and trained for 200 sponded to a total of 1,172 training steps.
epochs, which equated to 143,200 training steps. The Our implementation of supervised contrastive learn-
internal dropout noise of SCEPTR’s self-attention ing closely follows the formulation suggested by
stack was set to 0.1. Khosla et al. [48]. This approach to supervised con-
Our methodology of randomly censoring residues trastive learning combines loss contributions from true
and even entire chains stands in contrast to previous positive pairs, with those from second views of each
work in NLP by Gao et al. [52], who found that rely- positive instance (as in autocontrastive learning) as
ing only on the internal random drop-out noise of the well as all views of all other sample points with the
language model was sufficient for effective autocon- same pMHC label. Let B = {σi }N i=1 be a mini-
strastive learning. However, our experiments suggest batch of N pMHC-annotated TCRs. We use the same
that in the TCR domain, residue and chain censor- procedure as in our autocontrastive framework (see
ing leads to embeddings with better downstream TCR methods III C) to generate two views of each of the
specificity prediction performance (Fig. S5). TCRs, producing a set of 2N views V = {vj }2N j=1 . Let
N
Y = {yi }i=1 be the index-matched pMHC labels for
TCRs in B, and ȳj denote the labels mapped to the
D. SCEPTR fine-tuning with supervised indices of the views in V such that ȳ2i = ȳ2i−1 = yi .
contrastive learning
Now given arbitrary sample view index k, let P (k) =
{l ∈ A(k) : ȳl = ȳk } be the set of all indices whose cor-
1. Data responding samples have the same pMHC label as vk ,
with cardinality |P (k)|. The supervised contrastive
For supervised contrastive fine-tuning we took all loss for TCR minibatch B is:
TCR binders against the six best-sampled pMHC tar-
gets from our labelled TCR dataset, and split them LSC (B) =
into a training, a validation, and a test set such that ⊤
no study used to construct the training or validation 1 X 1 X erk rp /τ
− log P r⊤
(5)
sets contributed any TCRs to the test set (Table SII). 2N |P (k)| n∈N (k) e
k rn /τ
k∈I p∈P (k)

2. Procedure
Each batch during fine-tuning has an equally balanced
number of binders to each of the six pMHCs.
The fine-tuning process involved the joint optimisa-
tion of SCEPTR on MLM and supervised contrastive
learning. As during pre-training, the overall loss for ACKNOWLEDGMENTS
each training step was computed as the unweighted
sum of the MLM and supervised contrastive (Eq. 5) The authors thank Ned Wingreen, Chris Watkins,
losses. The pre-trained state of SCEPTR was used Trevor Graham, Sergio Quezada, Machel Reid,
as the starting point for fine-tuning. With only 200 Linda Li, Rudy Yuen, Sankalan Bhattacharyya, and
TCRs for each target pMHC to train on, we limited Matthew Cowley for useful discussions. YN and MM
the number of learnable parameters by only allow- were supported by Cancer Research UK studentships
ing the weights of the final self-attention layer to be under grants BCCG1C8R and A29287, respectively.
trainable. Additionally, we monitored increases in val- The work of ATM was supported in parts by funding
idation loss for early stopping of fine-tuning, which by the Royal Free Charity.
occurred after 2 epochs, where one epoch is defined The authors declare no competing interests.

CODE AVAILABILITY

https://github.com/yutanagano/sceptr: a readily usable deployment of SCEPTR and its variants.

https://github.com/yutanagano/tcrlm: houses code used for designing and training our models.
https://github.com/yutanagano/libtcrlm: library code powering the above repositories

[1] H. Chi, M. Pepper, and P. G. Thomas, Cell 187, 2052 H. Koohy, Nature Reviews Immunology , 1 (2023).
(2024). [4] P. Dash, A. J. Fiore-Gartland, T. Hertz, G. C. Wang,
[2] M. M. Davis and P. J. Bjorkman, Nature 334, 395 S. Sharma, A. Souquette, J. C. Crawford, E. B.
(1988). Clemens, T. H. O. Nguyen, K. Kedzierska, N. L.
[3] D. Hudson, R. A. Fernandes, M. Basham, G. Ogg, and La Gruta, P. Bradley, and P. G. Thomas, Nature 547,
12

89 (2017). T. J. O’Donnell, and M. R. Min, Frontiers in im-

[5] C. S. Dobson, A. N. Reich, S. Gaglione, B. E. Smith, munology 13, 1014256 (2022).
E. J. Kim, J. Dong, L. Ronsard, V. Okonkwo, D. Ling- [27] L. Deng, C. Ly, S. Abdollahi, Y. Zhao, I. Prinz, and
wood, M. Dougan, et al., Nature methods 19, 449 S. Bonn, Frontiers in Immunology 14, 1128326 (2023).
(2022). [28] D. V. Bagaev, R. M. A. Vroomans, J. Samir, U. Ster-
[6] A. V. Joglekar, M. T. Leonard, J. D. Jeppson, vbo, C. Rius, G. Dolton, A. Greenshields-Watson,
M. Swift, G. Li, S. Wong, S. Peng, J. M. Zaretsky, M. Attaf, E. S. Egorov, I. V. Zvyagin, N. Babel, D. K.
J. R. Heath, A. Ribas, et al., Nature methods 16, 191 Cole, A. J. Godkin, A. K. Sewell, C. Kesmir, D. M.
(2019). Chudakov, F. Luciani, and M. Shugay, Nucleic Acids
[7] S. Gielis, P. Moris, W. Bittremieux, N. De Neuter, Research 48, D1057 (2020).
B. Ogunjimi, K. Laukens, and P. Meysman, Frontiers [29] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit,
in Immunology 10 (2019). L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin,
[8] D. S. Fischer, Y. Wu, B. Schubert, and F. J. Theis, arXiv preprint 1706.03762 (2017).
Molecular Systems Biology 16, e9416 (2020). [30] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova,
[9] E. Jokinen, J. Huuhtanen, S. Mustjoki, M. Heinonen, arXiv preprint (2019).
and H. Lähdesmäki, PLOS Computational Biology [31] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Ka-
17, e1008814 (2021). plan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sas-
[10] A. Montemurro, V. Schuster, H. R. Povlsen, A. K. try, A. Askell, et al., Advances in neural information
Bentzen, V. Jurtz, W. D. Chronister, A. Crinklaw, processing systems 33, 1877 (2020).
S. R. Hadrup, O. Winther, B. Peters, L. E. Jessen, [32] A. Rives, J. Meier, T. Sercu, S. Goyal, Z. Lin, J. Liu,
and M. Nielsen, Communications Biology 4, 1 (2021). D. Guo, M. Ott, C. L. Zitnick, J. Ma, et al., Pro-
[11] K. Wu, K. E. Yost, B. Daniel, J. A. Belk, Y. Xia, ceedings of the National Academy of Sciences 118,
T. Egawa, A. Satpathy, H. Y. Chang, and J. Zou, e2016239118 (2021).
bioRxiv 10.1101/2021.11.18.469186 (2021). [33] Z. Lin, H. Akin, R. Rao, B. Hie, Z. Zhu, W. Lu,
[12] G. Croce, S. Bobisse, D. L. Moreno, J. Schmidt, N. Smetanin, R. Verkuil, O. Kabeli, Y. Shmueli,
P. Guillame, A. Harari, and D. Gfeller, Nature Com- A. dos Santos Costa, M. Fazel-Zarandi, T. Sercu,
munications 15, 3211 (2024). S. Candido, and A. Rives, Science 379, 1123 (2023).
[13] A. Weber, J. Born, and M. Rodriguez Martı́nez, [34] A. Elnaggar, M. Heinzinger, C. Dallago, G. Rehawi,
Bioinformatics 37, i237 (2021). Y. Wang, L. Jones, T. Gibbs, T. Feher, C. An-
[14] Y. Jiang, M. Huo, and S. Cheng Li, Briefings in Bioin- gerer, M. Steinegger, D. Bhowmik, and B. Rost, IEEE
formatics 24, bbad086 (2023). Transactions on Pattern Analysis and Machine Intel-
[15] T. Lu, Z. Zhang, J. Zhu, Y. Wang, P. Jiang, X. Xiao, ligence 44, 7112 (2022).
C. Bernatchez, J. V. Heymach, D. L. Gibbons, [35] A. Elnaggar, H. Essam, W. Salah-Eldin,
J. Wang, L. Xu, A. Reuben, and T. Wang, Nature W. Moustafa, M. Elkerdawy, C. Rochereau, and
Machine Intelligence 3, 864 (2021). B. Rost, arXiv preprint 2301.06568 (2023).
[16] X. Lin, J. T. George, N. P. Schafer, K. Ng Chau, M. E. [36] K. E. Wu, H. Chang, and J. Zou, bioRxiv
Birnbaum, C. Clementi, J. N. Onuchic, and H. Levine, 10.1101/2024.05.14.594226 (2024).
Nature Computational Science 1, 362 (2021). [37] F.-Z. Li, A. P. Amini, Y. Yue, K. K. Yang, and
[17] I. Springer, N. Tickotsky, and Y. Louzoun, Frontiers A. X. Lu, bioRxiv preprint 10.1101/2024.02.05.578959
in Immunology 12, 664514 (2021). (2024).
[18] M. Cai, S. Bang, P. Zhang, and H. Lee, Frontiers in [38] M. Wang, J. Patsenker, H. Li, Y. Kluger, and S. H.
Immunology 13 (2022). Kleinstein, Nucleic Acids Research 52, 548 (2024).
[19] P. Moris, J. De Pauw, A. Postovskaya, S. Gielis, [39] J. Barton, A. Gaspariunas, D. A. Yadin, J. Dias,
N. De Neuter, W. Bittremieux, B. Ogunjimi, F. L. Nice, D. H. Minns, O. Snudden, C. Po-
K. Laukens, and P. Meysman, Briefings in Bioinfor- vall, S. V. Tomas, H. Dobson, J. H. R. Farmery,
matics 22, bbaa318 (2021). J. Leem, and J. D. Galson, bioRxiv preprint
[20] M.-D. N. Pham, T.-N. Nguyen, L. S. Tran, Q.-T. B. 10.1101/2024.05.22.594943 (2024).
Nguyen, T.-P. H. Nguyen, T. M. Q. Pham, H.-N. [40] 10x Genomics, A New Way of Exploring Immunity -
Nguyen, H. Giang, M.-D. Phan, and V. Nguyen, Linking Highly Multiplexed Antigen Recognition to
Bioinformatics 39, btad284 (2023). Immune Repertoire and Phenotype (2020).
[21] Y. Gao, Y. Gao, Y. Fan, C. Zhu, Z. Wei, C. Zhou, [41] W. Zhang, P. G. Hawkins, J. He, N. T. Gupta, J. Liu,
G. Chuai, Q. Chen, H. Zhang, and Q. Liu, Nature G. Choonoo, S. W. Jeong, C. R. Chen, A. Dhanik,
Machine Intelligence 5, 236 (2023). M. Dillon, R. Deering, L. E. Macdonald, G. Thurston,
[22] B. P. Y. Kwee, M. Messemaker, E. Marcus, and G. S. Atwal, Science Advances 7, eabf5835 (2021).
G. Oliveira, W. Scheper, C. J. Wu, J. Teuwen, [42] A. Montemurro, H. R. Povlsen, L. E. Jessen, and
and T. N. Schumacher, bioRxiv preprint M. Nielsen, Scientific Reports 13, 16147 (2023).
10.1101/2023.04.25.538237 (2023). [43] K. Mayer-Blackwell, S. Schattgen, L. Cohen-Lavi,
[23] B. Meynard-Piganeau, C. Feinauer, M. Weigt, A. M. J. C. Crawford, A. Souquette, J. A. Gaevert, T. Hertz,
Walczak, and T. Mora, Proceedings of the National P. G. Thomas, P. Bradley, and A. Fiore-Gartland,
Academy of Sciences 121, 10.1101/2023.07.19.549669 Elife 10, e68605 (2021).
(2024). [44] A. Mayer and C. G. Callan, Proceedings of the Na-
[24] R. Goldner Kabeli, S. Zevin, A. Abargel, A. Zilber- tional Academy of Sciences 120, e2213264120 (2023).
berg, and S. Efroni, Science Advances 10, eadk4670 [45] J. Henderson, Y. Nagano, M. Milighetti,
(2024). and A. Tiffeau-Mayer, arXiv preprint
[25] M. Nielsen, A. Eugster, M. F. Jensen, M. Goel, 10.48550/arXiv.2404.12565 (2024).
A. Tiffeau-Mayer, A. Pelissier, S. Valkiers, M. R. [46] A. Tiffeau-Mayer, Physical Review E 109, 064411
Martı́nez, B. Meynard-Piganeeau, V. Greiff, et al., (2024).
ImmunoInformatics (2024). [47] T. Wang and P. Isola, arXiv
[26] F. Grazioli, A. Mösch, P. Machart, K. Li, I. Alqassem, 10.48550/arXiv.2005.10242 (2022).
13

[48] P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, [70] J. Deng, J. Guo, J. Yang, N. Xue, I. Kotsia, and
P. Isola, A. Maschinot, C. Liu, and D. Krishnan, arXiv S. Zafeiriou, IEEE Transactions on Pattern Analysis
10.48550/arXiv.2004.11362 (2021). and Machine Intelligence , 1 (2021).
[49] F. Drost, L. Schiefelbein, and B. Schubert, bioRxiv [71] Y. Nagano and B. Chain, Frontiers in Immunology
preprint , 2022.10.24.513533 (2022). 14, 10.3389/fimmu.2023.1276106 (2023).
[50] M. Pertseva, O. Follonier, D. Scarcella, and S. T. [72] D. P. Kingma and J. Ba, arXiv preprint
Reddy, bioRxiv preprint , 2024.04.04.587695 (2024). 10.48550/arXiv.1412.6980 (2017).
[51] H. Tanno, T. M. Gould, J. R. McDaniel, W. Cao, [73] J. M. Heather, M. J. Spindler, M. H. Alonso, Y. I.
Y. Tanno, R. E. Durrett, D. Park, S. J. Cate, W. H. Shui, D. G. Millar, D. S. Johnson, M. Cobbold, and
Hildebrand, C. L. Dekker, L. Tian, C. M. Weyand, A. N. Hata, Nucleic Acids Research 50, e68 (2022).
G. Georgiou, and J. J. Goronzy, Proceedings of the
National Academy of Sciences 117, 532 (2020).
[52] T. Gao, X. Yao, and D. Chen, arXiv
10.48550/arXiv.2104.08821 (2022).
[53] Y. Fang, X. Liu, and H. Liu, Briefings in Bioinformat-
ics 23, bbac378 (2022).
[54] B. Li, H. Zhou, J. He, M. Wang, Y. Yang, and L. Li,
in Proceedings of the 2020 Conference on Empirical
Methods in Natural Language Processing (EMNLP),
edited by B. Webber, T. Cohn, Y. He, and Y. Liu
(Association for Computational Linguistics, Online,
2020) pp. 9119–9130.
[55] Z. Sethna, Y. Elhanati, C. G. Callan Jr, A. M. Wal-
czak, and T. Mora, Bioinformatics 35, 2974 (2019).
[56] M. Milighetti, Y. Nagano, J. Henderson, U. Hersh-
berg, A. Tiffeau-Mayer, A.-F. Bitbol, and B. Chain,
bioRxiv preprint 2024.05.24.595718 , 2024 (2024).
[57] M. J. Spindler, A. L. Nelson, E. K. Wagner, N. Op-
permans, J. S. Bridgeman, J. M. Heather, A. S. Adler,
M. A. Asensio, R. C. Edgar, Y. W. Lim, et al., Nature
biotechnology 38, 609 (2020).
[58] M. I. Raybould, A. Greenshields-Watson, P. Agar-
wal, B. Aguilar-Sanjuan, T. H. Olsen, O. M. Turn-
bull, N. P. Quast, and C. M. Deane, bioRxiv preprint
10.1101/2024.05.20.594960 , 2024 (2024).
[59] S. Sureshchandra, J. Henderson, E. Levendosky,
S. Bhattacharyya, J. M. Kastenschmidt, A. M. Sorn,
M. T. Mitul, A. Benchorin, K. Batucal, A. Daugherty,
et al., bioRxiv preprint 10.1101/2024.08.17.608295
(2024).
[60] J. D. Bloom, S. T. Labthavikul, C. R. Otey, and F. H.
Arnold, Proceedings of the National Academy of Sci-
ences 103, 5869 (2006).
[61] M. Weigt, R. A. White, H. Szurmant, J. A. Hoch,
and T. Hwa, Proceedings of the National Academy of
Sciences 106, 67 (2009).
[62] J. Jumper, R. Evans, A. Pritzel, T. Green,
M. Figurnov, O. Ronneberger, K. Tunyasuvunakool,
R. Bates, A. Žı́dek, A. Potapenko, et al., Nature 596,
583 (2021).
[63] A. Mayer, V. Balasubramanian, T. Mora, and A. M.
Walczak, Proceedings of the National Academy of Sci-
ences 112, 5950 (2015).
[64] P. G. Thomas and J. C. Crawford, Current Opinion
in Systems Biology 18, 36 (2019).
[65] Y. Elhanati, A. Murugan, C. G. Callan, T. Mora, and
A. M. Walczak, Proceedings of the National Academy
of Sciences 111, 9875 (2014).
[66] C. Dens, K. Laukens, W. Bittremieux, and
P. Meysman, Nature Machine Intelligence 5, 1060
(2023).
[67] F. Schroff, D. Kalenichenko, and J. Philbin, in 2015
IEEE Conference on Computer Vision and Pattern
Recognition (CVPR) (2015) pp. 815–823.
[68] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song,
in 2017 IEEE Conference on Computer Vision and
Pattern Recognition (CVPR) (2017) pp. 6738–6746.
[69] S. Chen, Y. Liu, X. Gao, and Z. Han, arXiv
10.48550/arXiv.1804.07573 (2018).
14

Appendix A: Generating TCR vector embeddings using existing protein language models

1. TCR-BERT

The TCR-BERT model was downloaded through HuggingFace at https://huggingface.co/wukevin/

tcr-bert. Since TCR-BERT is trained to read one CDR3 sequence at a time, we generated TCR representa-
tions by generating two independent representations of the α and β chain, and concatenating them together.
The TCR-BERT representation of a chain was generated by feeding the model its CDR3 sequence, then taking
the average pool of the amino acid token embeddings in the 8th self-attention layer, as recommended by the
study authors [11].

2. ESM2

The ESM2 (T6 8M) model was downloaded through HuggingFace at https://huggingface.co/facebook/
esm2_t6_8M_UR50D. ESM2 is trained on full protein sequences, but not protein multimers. Therefore, we
generated ESM2 representations for the α and β chains separately, and concatenated them to produce the
overall TCR representation. To generate the representation of a TCR chain, we first used Stitchr [73] to
reconstruct the full amino acid sequence of a TCR from its CDR3 sequence and V/J gene. Then, the resulting
sequence of each full chain was fed to ESM2. We took the average-pooled result of the amino acid token
embeddings of the final layer to generate the overall sequence representation, as recommended [33].

3. ProtBert

The ProtBert model was downloaded through HuggingFace at https://huggingface.co/Rostlab/prot_

bert. Similarly to ESM2, ProtBert is trained on full protein sequences. Therefore, we again used Stitchr to
generate full TCR chain amino acid sequences, and fed them to ProtBert to generate independent α and β
chain representations. We again as recommended average-pooled the amino acid token embeddings of the final
layer [34].

Appendix B: Learning features within embedding spaces

The focus of the current work has been to use nearest neighbour prediction using PLM embeddings as the
most direct test of data-efficient transfer learning that works with as little as a single reference sequence. If
slightly more data is available, another approach is to train supervised predictors atop PLM embeddings. To
test how much such training can improve prediction performance, we trained linear support vector classifiers
(SVC) on the PLM embeddings provided by different models. In each instance, we trained the classifier to
distinguish reference TCRs from 1000 randomly sampled background TCRs. We outline the methodology in
more detail below.
We find that the SVC predictors for ProtBert, ESM2 and TCR-BERT all perform better than their nearest
neighbour counterparts, but still worse than SCEPTR’s nearest neighbour predictions (Fig. S4). We also trained
an SVC atop SCEPTR, which did not lead to further improvement upon the nearest neighbour prediction
(Fig. S4). These findings highlight how in the low data regime typical of most pMHCs, misalignment of pre-
training to downstream tasks can only be partially remediated by training on reference TCRs.
To train the linear SVCs on top of PLM features, we sampled 1000 random background TCRs from the
training partition of the unlabelled Tanno et al. dataset. We employed a similar strategy to our benchmarking
in section I A to split our dataset of curated specificity-annotated αβTCRs into a reference set and testing
set. For each PLM-pMHC-split combination, we trained a linear SVC using the PLM embeddings of the
reference TCRs as the positives and those of the 1000 background TCRs as the negatives. The same 1000
background TCRs were used across model-pMHC-split combinations to ensure consistency. We accounted for
the imbalance between the number of positive and negative samples used during SVC fitting by weighting the
penalty contributions accordingly. Finally, we tested SVCs using the same benchmarking classification task as
previously described.

Appendix C: Effects of different position embedding methods

To better understand TCR similarity rules as learned by PLMs, we measured the average distance penalty
incurred within a model’s representation space as a result of a single amino acid edit at various points along
the length of the α/β CDR3 loops. To do this, we randomly sampled real TCRs from the testing partition
15

of the Tanno et al. dataset [51] and synthetically introduced single residue edits in one of their CDR3 loops.
Then, we measured the distance between the original TCR and the single edit variant according to a PLM.
For each model, we sampled TCRs until we had observed at least 100 cases of: 1) each type of edit (insertions,
deletions, substitutions) at each position, and 2) substitutions from each amino acid to every other. Since CDR3
sequences vary in length, we categorised the edit locations into one of five bins: C-TERM for edits within the
first one-fifth of the CDR3 sequence counting from the C-terminus, then M1, M2, M3, and N-TERM, in that order.
For this analysis, we investigated SCEPTR and TCR-BERT, since they are the two best performers out of the
PLMs tested (Fig. 1d).
Both SCEPTR and TCR-BERT generally associate insertions and deletions (indels) with a higher distance
penalty compared to substitutions (Fig. S14a). While SCEPTR uniformly penalises indels across the length of
the CDR3, TCR-BERT assigns higher penalties to those closer to the C-terminus. We hypothesised that the
variation in TCR-BERT’s indel penalties is a side-effect of its position embedding system. TCR-BERT, like
many other transformers, encodes a token’s position into its initial embedding in a left-aligned manner using a
stack of sinusoidal functions with varying periods [11, 29, 30] This results in embeddings that are more sensitive
to indels near the C-terminus, which cause a frame-shift in a larger portion of the CDR3 loop and thus lead to a
larger change in the model’s underlying TCR representation. To test this hypothesis, we trained and evaluated
a new SCEPTR variant:

SCEPTR (left-aligned): This variant uses a traditional transformer embedding system with trainable token
representations and left aligned, stacked sinusoidal position embeddings.

While we detect no significant difference in downstream performance between SCEPTR and its left-aligned
variant (Fig. S15), this may be because cases where the differences in their learned rule sets affects performance
are rarely seen in our benchmarking data. The edit penalty profile of the left-aligned variant shows a similar
falloff of indel penalties than TCR-BERT with higher penalties at the C than N-terminals (Fig. S14b). As
their is no clear biological rationale for this observation, these results suggest that SCEPTR’s relative position
encoding might result in a better-calibrated co-specificity ruleset. These preliminary findings add to the ongoing
discussion around how to best encode residue position information in the protein language modelling domain [35].
Interestingly, the penalty falloff seen with SCEPTR (left-aligned) is sharper than that of TCR-BERT, whose
indel penalties plateau past M1. As TCR-BERT is a substantially deeper model (12 self-attention layers, 12
heads each, embedding dimensionality 768), it might be partially able to internally un-learn the left-aligned-
ness of the position information. If this is true, then position embedding choices are particularly important for
training smaller, more efficient models.
16

TFEYVSQPFLMDLE GILGFVFTL YLQPRTFLL

1.0
True Positive Rate 0.8

0.6

0.4

0.2

0.0
NLVPMVATV SPRWYFYYL TTDPSFLGRY
1.0

0.8
True Positive Rate

0.6

0.4

0.2

0.0
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
False Positive Rate False Positive Rate False Positive Rate
SCEPTR TCRdist CDR3 Levenshtein TCR-BERT ESM2 (T6 8M) ProtBert

Figure S1. ROC plots for individual pMHCs on the benchmarking task. Each curve shows the nearest
neighbour prediction ROC for a specific model and pMHC for k = 200 reference TCRs. The solid lines correspond to
the mean ROC per model per pMHC, and the shaded regions correspond to the standard deviations of the true positive
rate across data splits. Corresponding per-pMHC AUROC values are provided in table SI.

Table SI. Per-epitope summary of the different models’ perfomances from the nearest neighbour prediction benchmarking
(see section I A) with the number of reference TCRs k=200. The best AUROC per epitope is shown in bold.
SCEPTR TCRdist CDR3 Levenshtein TCR-BERT ESM2 (T6 8M) ProtBert
epitope
GILGFVFTL 0.911 0.904 0.872 0.876 0.831 0.845
NLVPMVATV 0.691 0.648 0.632 0.655 0.652 0.629
SPRWYFYYL 0.728 0.695 0.610 0.637 0.604 0.575
TFEYVSQPFLMDLE 0.976 0.970 0.964 0.966 0.937 0.950
TTDPSFLGRY 0.708 0.720 0.600 0.576 0.579 0.564
YLQPRTFLL 0.775 0.762 0.743 0.698 0.697 0.669
17

0.75

0.70

Mean AUROC
0.65

SCEPTR
0.60 TCRdist
CDR3 Levenshtein
TCR-BERT
0.55 ESM2 (T6 8M)
ProtBert

1 2 5 10 20 50 100 200
Number of Reference TCRs

Figure S2. Benchmarking PLM embeddings on TCR specificity prediction with Montemurro et al.’s post-
processed 10xGenomics dataset included. This is a repeat of the benchmarking study from section I A with a larger
dataset of labelled TCRs. This includes six more sufficiently sampled pMHC specificities (with epitopes ELAGIGILTV,
GLCTLVAML, AVFDRKSDAK, IVTDFSVIK, RAKFKQLL, KLGGALQAK). The trends seen in section I A are recapitulated.

a Nearest-neighbour implementations b Average-distance implementations

0.80

0.75
Mean AUROC

0.70

0.65

0.60

1 2 5 10 20 50 100 200 1 2 5 10 20 50 100 200

Number of Reference TCRs Number of Reference TCRs
SCEPTR TCRdist CDR3 Levenshtein TCR-BERT ESM2 (T6 8M) ProtBert

Figure S3. Nearest neighbour prediction is more performant than using the average distance to all
references. We repeated the benchmarking procedure used to produce Fig. 1d, but instead of making inferences based
on the distance between a query TCR and its closest reference neighbour we averaged its distance to all references. The
results are shown in panel b), with panel a) showing the original nearest neighbour benchmarking results for comparison.
All models perform better when applied through nearest neighbour prediction. Interestingly, when using average distance
prediction all models other than SCEPTR rapidly plateau with increasing reference set size.
18

0.80

0.75

Mean AUROC
0.70

0.65
SCEPTR (NN) ESM2 (NN)
SCEPTR (SVC) ESM2 (SVC)
0.60 TCR-BERT (NN) ProtBert (NN)
TCR-BERT (SVC) ProtBert (SVC)

1 2 5 10 20 50 100 200
Number of Reference TCRs

Figure S4. Benchmarking linear support vector classifiers trained on PLM features on TCR specificity
prediction. Performance of different PLMs applied to few-shot TCR specificity prediction, either through nearest
neighbour prediction (models marked as “NN” in the legend, see section I A) or using a linear support vector classifier
trained atop their TCR featurisations (models marked as “SVC” in the legend, see appendix B). For all PLMs except
SCEPTR, training a linear SVC atop the model’s features improves performance. SCEPTR (NN) outperforms all
methods, even the SVC trained atop its own features.

a b c
0.80

0.75
Mean AUROC

0.70

0.65

0.60

1 2 5 10 20 50 100 200 1 2 5 10 20 50 100 200 1 2 5 10 20 50 100 200

Number of Reference TCRs Number of Reference TCRs Number of Reference TCRs
SCEPTR TCRdist SCEPTR (dropout noise only)

Figure S5. SCEPTR provides competitive performance on single-chain TCR data. Benchmarking results on
a) paired chain data (as in Fig. 1d), or when supplying information about only the b) α or c) β chain. In each instance,
we compared three models: SCEPTR, TCRdist, and a variant of SCEPTR which does not employ any extra noising
operations when producing the two views of the same TCR during autocontrastive learning, solely relying on SCEPTR’s
internal dropout noise (see section III C). In all scenarios, SCEPTR’s performance is on par (α) or better (αβ / β) than
TCRdist. Furthermore, comparing SCEPTR’s performance with that of its dropout noise only variant demonstrates
that residue- and chain- dropping during pre-training improves downstream performance, particularly when applied to
single-chain data.
19

0.20

0.25

0.30

Pearson r
0.35

0.40

0.45

[1.15, 1.20)

[1.25, 1.30)
[1.20, 1.25)
[1.00, 1.05)
[1.05, 1.10)
[1.10, 1.15)

[1.30, 1.35)
[1.35, 1.40)
[1.40, 1.45)
[1.45, 1.50)
[1.50, 1.55)
[1.55, 1.60)
< 1.00

1.60
SCEPTR distance

Figure S6. TCRdist and recombination probabilities are negatively correlated when conditioning on
different SCEPTR bins. Conditioned on a certain level of SCEPTR similarity, pairs of sequences with a high
probability of recombination (pgen ) tend to have lower TCRdist distance. The plot shows the Pearson correlation
coefficient r between TCRdist distances and TCR pgen for TCRs across SCEPTR distance bin. The shaded region
displays 95% confidence intervals around estimated correlation coefficients. This is a companion to Fig. 4, showing the
generality of the negative correlation across SCEPTR bins.

80+
450
70
400

350 60
| d - d | (TCRdist)
TCRdist distance

300 50

250 40

200 30

150 20

100 10

0
0.8 1.0 1.2 1.4 1.6
SCEPTR distance

Figure S7. Large discordance between α and β chain similarity explain some of the variation between
TCRdist and SCEPTR distances. This plot corresponds to Fig. 4a/b in the main text, but points are coloured
according to the absolute difference between the α chain and β chain components of the TCRdist distance. Among
TCR pairs with similar overall TCRdist similarity, those pairs which have a large difference between the α and β chain
TCRdist distances (i.e. where TCRs have one highly similar and another highly dissimilar chain) tend to be assigned
lower SCEPTR distances.
20

0.425

0.400

0.375

0.350

Pearson r
0.325

0.300
maximum (p = )
0.275 arithmetic (p = 1)
geometric (p = 0)
0.250 harmonic (p = 1)
0.225 minimum (p = )

10 5 0 5 10
Power mean exponent p

Figure S8. SCEPTR distances do not average α and β chain similarity arithmetically. Fig. S7 suggests
that similarity on at least one chain is sufficient for a low SCEPTR distance. To test this hypotheiss, we calculated
the Pearson correlation coefficient r between SCEPTR distances and different ways of taking the mean of the α and β
TCRdist components. We interpolated between taking the minimum, arithmetic average, and maximum between the α
and β chain distances, by computing the generalised power mean of the two distances. The generalised power mean of a
1
set of numbers x1 , ..., xn is defined as Mp = n1 n
P
i=1 xi , with exponent p = 1 corresponding to the arithmetic mean,
p

p = −∞ corresponding to taking the minimum and p = ∞ corresponding to taking the maximum. SCEPTR distances
best correlate to power means with exponents p < 1, suggesting that SCEPTR embeddings behave more like taking the
minimum between the chain components (p = −∞), as opposed to the arithmetic average (p = 1). In contrast, the
paired-chain TCRdist is defined as the arithmetic average of the distances of both chains.

a No filtering b 90% sequence identity c 80% sequence identity

1.0

0.9

0.8
AUROC

0.7

0.6

0.5
SPRWYFYYL

SPRWYFYYL

SPRWYFYYL
YLQPRTFLL

YLQPRTFLL

YLQPRTFLL
NLVPMVATV
TFEYVSQPFLMDLE

GILGFVFTL

TTDPSFLGRY

TFEYVSQPFLMDLE

GILGFVFTL

TTDPSFLGRY

NLVPMVATV

TFEYVSQPFLMDLE

GILGFVFTL

TTDPSFLGRY

NLVPMVATV

Epitope Epitope Epitope

SCEPTR (finetuned) SCEPTR TCRdist TCR-BERT

Figure S9. Supervised contrastive learning improves performance also when filtering highly similar TCRs
from the test set. Companion to Fig. 5 showing the effects of filtering public TCR sequences from the test data,
which have exact or near-exact matches in amino acid sequence to any training TCR. Benchmarking on a) all data,
b) excluding sequences with ≥ 90% sequence similarity, or c) ≥ 80% sequence similarity shows that fine-tuning using
supervised contrastive learning goes beyond the memorization of highly similar public TCR motifs. Similar sequences
were identified using a threshold level of α and β CDR3 amino acid sequence identity, and required matching V genes for
both chains. CDR3 sequence identity was quantified as 1 − (dα + dβ )/(ℓα + ℓβ ) where d is the Levenshtein edit distance,
ℓ is the sequence length of the test set TCR’s CDR3, and the subscripts denote the two chains.
21

0.75 SCEPTR
SCEPTR (finetuned)
TCRdist
TCR-BERT
0.70

Mean AUROC
0.65

0.60

0.55

1 2 5 10 20
Number of Reference TCRs

Figure S10. Benchmarking fine-tuned SCEPTR on TCR specificity prediction for unseen pMHCs. Results
of benchmarking fine-tuned SCEPTR (see section I E) against TCRdist, TCR-BERT and the baseline SCEPTR model
on pMHC targets unseen during SCEPTR’s fine-tuning. We again use the nearest neighbour benchmarking framework
from section I A, but now apply it to pMHCs with more than 120 binders, excluding the six pMHCs used for fine-tuning.
We restrict the number k of reference sequences to k ∈ [1, 20], to retain at least 100 positive sequences for the calculation
of the ROC curves for each data split. We find that fine-tuned SCEPTR performs significantly worse compared to the
baseline model.

Table SII. The different studies that contributed TCR data to the training/validation and test splits for the supervised
contrastive learning fine-tuning task. For datasets without a PubMed ID the table indicates the VDJdb github issue
number corresponding to the dataset inclusion.
epitope training/validation test
GILGFVFTL PMID:28636592 PMID:12796775, PMID:18275829, PMID:28250417,
PMID:28931605, PMID:7807026, PMID:28423320,
PMID:28636589, PMID:27645996, PMID:29483513,
PMID:29997621, PMID:34793243, VDJdbID:215
NLVPMVATV PMID:28636592, VDJdbID:332 PMID:19542454, PMID:26429912, PMID:19864595,
PMID:28423320, PMID:16237109, PMID:28636589,
PMID:36711524, PMID:28623251, PMID:9971792,
PMID:17709536, PMID:28934479, PMID:34793243,
VDJdbID:252
SPRWYFYYL PMID:33951417, PMID:35750048 PMID:33945786, PMID:34793243
TFEYVSQPFLMDLE PMID:35750048 PMID:37030296
TTDPSFLGRY PMID:35383307 PMID:35750048
YLQPRTFLL PMID:35383307, PMID:34793243 PMID:34685626, PMID:37030296, PMID:33664060,
PMID:33951417, PMID:35750048, VDJdbID:215
22

TRAJ TRAV CDR3A

6 6
14
5 5
12

4 4 10
H2(X| ) [bits]

8
3 3
6
2 2
4
1 1
2

0 0 0
TRBJ TRBV CDR3B
6 6
14
5 5
12

4 4 10
H2(X| ) [bits]

8
3 3
6
2 2
4
1 1
2

0 0 0
0.85 0.90 0.95 0.85 0.90 0.95 0.85 0.90 0.95
AUROC SCEPTR (finetuned) AUROC SCEPTR (finetuned) AUROC SCEPTR (finetuned)

Figure S11. Sequence diversity of pMHC-specific TCRs used in benchmarking. To investigate why pre-
diction performance of fine-tuned SCEPTR varies across pMHCs, we estimated the coincidence (second-order Rényi)
entropies [45] of a selection of TCR features (TRAV, TRBV, TRAJ, and TRBJ gene usages, and the amino acid sequences
of the α and β chain CDR3 loops) among the TCRs specific to each of the six pMHCs used during fine-tuning. Each
panel displays the feature entropy of TCRs specific to a given pMHC against the fine-tuned SCEPTR variant’s AUROC
score for the same pMHC. The error bars show the standard deviations of the empirical estimates of the coincidence
entropies, which were calculated using the unbiased variance estimator for Simpson’s diversity described in Ref. [46]. The
easiest to predict pMHCs have lower entropy in most features – particularly stark reductions in V and J gene diversity
are observed for the two pMHCs with the highest AUROCs.
23

100

10 1

Cumulative probability Pc( )

10 2

10 3
[Epitope] AUROC SCEPTR (finetuned)
[TFEYVSQPFLMDLE] 0.98
10 4 [GILGFVFTL] 0.96
[YLQPRTFLL] 0.89
[SPRWYFYYL] 0.86
10 5 [TTDPSFLGRY] 0.84
[NLVPMVATV] 0.82

0 100 200 300 400

TCRdist distance

Figure S12. Statistics of pairwise sequence similarities between pMHC-specific TCRs used in benchmark-
ing. Cumulative probabilities of coincidence are plotted against TCRdist distance threshold for each of the six pMHCs
used during SCEPTR fine-tuning. The most easy to predict pMHCs have fewer TCR pairs that are highly dissimilar.
Interestingly, the TFEYVSQPFLMDLE-specific repertoire, the only class II presented pMHC included in the test set,
has the most globally convergent TCRs.

1.0

0.9

0.8
AUROC

0.7 SCEPTR (finetuned) (NN)

SCEPTR (finetuned) (Avg Dist)
SCEPTR (NN)
0.6 SCEPTR (Avg Dist)

0.5
L

FLL
LE

TV
YY

R
VF
D

VA
LG
RT
YF
LM

M
SF
QP
RW
PF

VP
GIL

DP
YL
SQ

NL
TT
YV
TFE

Epitope

Figure S13. Comparing nearest neighbour and average distance implementations of the baseline and fine-
tuned SCEPTR models. Companion to figure 5, comparing the performance of the baseline and fine-tuned versions
of SCEPTR using nearest neighbour (NN) or average distance (Avg dist) prediction (see section I A). For the baseline
model, the nearest neighbour implementation performs better, consistent with the results seen in Fig. S3. In contrast,
the average distance implementation of the fine-tuned model greatly outperforms its nearest neighbour counterpart.
Our primary hypothesis as to why most models including the baseline SCEPTR model perform better through nearest
neighbour prediction (Fig. S3) is that each pMHC has multiple viable binding solutions comprised of TCRs with different
primary sequence features, which means that averaging distance to all reference TCRs across binding solutions dilutes
signal. The fact that the fine-tuned model no longer shows this property may hint at its ability to better resolve these
distinct binding solutions into a single convex cluster.
24

a SCEPTR TCR-BERT b SCEPTR (left-aligned)

0.6
0.5 5
0.5
0.4 4 0.4
distance

0.3 3 0.3
0.2 2 0.2
0.1 1 0.1
0.0 0 0.0
C-TERM M1 M2 M3 N-TERM C-TERM M1 M2 M3 N-TERM C-TERM M1 M2 M3 N-TERM
CDR3 region
insertion deletion substitution

Figure S14. Investigating TCR co-specificity rules as learned by different PLMs. Here we investigate TCR
co-specificity rules as learned by various representation models by measuring the expected distance penalties inucrred by
single residue edits in different regions of the α and β CDR3 loops. We investigate three models: SCEPTR, TCR-BERT,
and a SCEPTR variant which replaces its simplified initial embedding module with one that emulates the traditional
transformer architecture, including a left-aligned position embedding system (see appendix C). The x axis shows different
regions of the CDR3 divided into five bins, where C-TERM represents the first fifth of the loop counting from the C-terminal,
N-TERM represents the last fifth of the loop on the N-terminal end, and the middle regions numbered from the C-terminal
as shown. The y axis hows the expected distance penalty incurred by different types of single edits. The different lines
show the expected penalty curves with respect to insertions (purple), deletions (orange) and substitutions (green). The
error bars show the standard deviations. According to all models, substitutions on average incur a smaller distance
penalty compared to indels. While SCEPTR uniformly penalises indels, both TCR-BERT and the left-aligned SCEPTR
variant assign higher distance penalties to indels closer to the C-terminal.

0.80

0.78

0.76

0.74
Mean AUROC

0.72

0.70

0.68

0.66
SCEPTR
0.64 SCEPTR (left-aligned)

1 2 5 10 20 50 100 200
Number of Reference TCRs

Figure S15. SCEPTR with its simplified embedder module performs similarly to a variant with an
embedder module emulating the traditional transformer architecture. Here we show the results of bench-
marking SCEPTR againt a variant which replaces SCEPTR’s simplified embedder module (see methods III B) with an
implementation emulating the traditional transformer architecture (“left-aligned” variant in plot, see appendix C). The
benchmarking framework as outlined in section I A is used. The number of reference sequences varies along the x axis.
The y axis shows the models’ AUROCs averaged across pMHCs. We detect no significant difference in performance
between the two models.
25

output MHA
output
Static (non-learned) operators
MHA Internals + vector addition
N T
(single-head case)
dot product
+
attn. N layer normalisation
weights T
Feed S softmax
Forward S
T matrix/vector transpose

Nx Learned operators
N T

+ q K V
Inputs/Outputs

Multi-Head column vector (d×1)

context LinearQ LinearK LinearV
Attention
row vector (1×M)

matrix (d×M)
input context
input

Figure S16. A simplified schematic depicting the internals of the transformer self-attention stack. The
schematic is accurate for the case of a single attention head per layer. In the more general case of H attention heads,
each MHA block will have H parallel q, k and v linear projections, each from dimensionality d to dimensionality d/H.
Each parallel set of q, k and v vectors/matrices undergo the series of operations shown in the schematic. Finally, the
final output vector (of shape d/H × 1) from each parallel branch are concatenated together to produce the output of the
MHA block (of shape d × 1).

Roitt'S Essential Immunology 12Th Edition (All Mcqs With Answers)
90% (10)
Roitt'S Essential Immunology 12Th Edition (All Mcqs With Answers)
100 pages
Immunoinformatics, 1st Edition High-Quality Ebook
100% (14)
Immunoinformatics, 1st Edition High-Quality Ebook
17 pages
Harrison SLE
No ratings yet
Harrison SLE
11 pages
Recent Advances in Molecular and Translational Medicine: Updates in Precision Medicine
From Everand
Recent Advances in Molecular and Translational Medicine: Updates in Precision Medicine
Somchai Chutipongtanate
No ratings yet
Immunoinformatics (2008)
No ratings yet
Immunoinformatics (2008)
215 pages
Basic Immunology
100% (2)
Basic Immunology
57 pages
Stem Cells Between Regeneration and Tumorigenesis
From Everand
Stem Cells Between Regeneration and Tumorigenesis
PublishDrive
No ratings yet
M.sc. Microbial Biotechnology
No ratings yet
M.sc. Microbial Biotechnology
35 pages
Hashimoto's Triggers-Advanced Reader Copy-V2
100% (3)
Hashimoto's Triggers-Advanced Reader Copy-V2
332 pages
Protocols used in Molecular Biology
From Everand
Protocols used in Molecular Biology
PublishDrive
5/5 (1)
Kuby Immunology Transplantation
No ratings yet
Kuby Immunology Transplantation
21 pages
Data Representation in Machine Learning Methods With Its Applicat
No ratings yet
Data Representation in Machine Learning Methods With Its Applicat
100 pages
ITP Pathogenesis - Pathophysiology By: Angela Azalia
100% (5)
ITP Pathogenesis - Pathophysiology By: Angela Azalia
47 pages
Antibody Optimization Enabled by Artificial Intelligence Predictions of Binding Affinity and Naturalness
No ratings yet
Antibody Optimization Enabled by Artificial Intelligence Predictions of Binding Affinity and Naturalness
39 pages
Personalized Immunotherapy for Tumor Diseases and Beyond
From Everand
Personalized Immunotherapy for Tumor Diseases and Beyond
PublishDrive
No ratings yet
ScBERT - A Large-Scale Pretrained Deep Langurage Model For Cell Type Annotation of Single-Cell RNA-seq Data
No ratings yet
ScBERT - A Large-Scale Pretrained Deep Langurage Model For Cell Type Annotation of Single-Cell RNA-seq Data
35 pages
NCM112 Module 2 Day 2 Immune System
No ratings yet
NCM112 Module 2 Day 2 Immune System
12 pages
Lymph
50% (2)
Lymph
94 pages
1 2021.07.06.451273v1.full
No ratings yet
1 2021.07.06.451273v1.full
48 pages
Immunochemistry
No ratings yet
Immunochemistry
25 pages
Demyelinating Diseases
100% (1)
Demyelinating Diseases
76 pages
Immunogenetics
0% (1)
Immunogenetics
46 pages
Immuno End Sem
No ratings yet
Immuno End Sem
32 pages
Xóa 6
No ratings yet
Xóa 6
31 pages
Beyond MHC Binding: Immunogenicity Prediction Tools To Refine Neoantigen Selection in Cancer Patients
No ratings yet
Beyond MHC Binding: Immunogenicity Prediction Tools To Refine Neoantigen Selection in Cancer Patients
22 pages
2020 - Transformer Protein Language Models Are Unsupervised Structure Learners
No ratings yet
2020 - Transformer Protein Language Models Are Unsupervised Structure Learners
24 pages
2019 - Evaluating Protein Transfer Learning With TAPE
No ratings yet
2019 - Evaluating Protein Transfer Learning With TAPE
20 pages
An Introd Epitope Predic Met Softyang 2009
No ratings yet
An Introd Epitope Predic Met Softyang 2009
20 pages
High-Throughput and High-Dimensional Single-Cell Analysis of Antigen-Specific CD8+ T Cells
No ratings yet
High-Throughput and High-Dimensional Single-Cell Analysis of Antigen-Specific CD8+ T Cells
29 pages
(2024NMI) Sliding-Attention Transformer Neural Architecture For Predicting T Cell Receptor-Antigen-Human Leucocyte Antigen Binding
No ratings yet
(2024NMI) Sliding-Attention Transformer Neural Architecture For Predicting T Cell Receptor-Antigen-Human Leucocyte Antigen Binding
20 pages
Paper1 Decodingdlforbinfdingaffinit1y
No ratings yet
Paper1 Decodingdlforbinfdingaffinit1y
32 pages
Optimization of Therapeutic Antibodies by Predicting Antigen Specificity From Antibody Sequence Via Deep Learning
No ratings yet
Optimization of Therapeutic Antibodies by Predicting Antigen Specificity From Antibody Sequence Via Deep Learning
16 pages
Hudson Et Al. - 2023 - Can We Predict T Cell Specificity With Digital Biology and Machine Learning
No ratings yet
Hudson Et Al. - 2023 - Can We Predict T Cell Specificity With Digital Biology and Machine Learning
11 pages
NIHMS1845306-supplement-1845306 Sup Materials
No ratings yet
NIHMS1845306-supplement-1845306 Sup Materials
22 pages
Bbae 105
No ratings yet
Bbae 105
17 pages
Computational and Experimental Evaluation of The I
No ratings yet
Computational and Experimental Evaluation of The I
23 pages
The immuneML Ecosystem For Machine Learning Analysis of
No ratings yet
The immuneML Ecosystem For Machine Learning Analysis of
22 pages
Andreatta 2018
No ratings yet
Andreatta 2018
13 pages
Textbook of Allergy For The Clinician - 2nd Edition Entire PDF Ebook
No ratings yet
Textbook of Allergy For The Clinician - 2nd Edition Entire PDF Ebook
16 pages
DeepECA - An End-To-End Learning Framework For Protein Contact Prediction From A Multiple Sequence Alignment
No ratings yet
DeepECA - An End-To-End Learning Framework For Protein Contact Prediction From A Multiple Sequence Alignment
17 pages
Jitc 2024 May 12 5 Inline Supplementary Material 1
No ratings yet
Jitc 2024 May 12 5 Inline Supplementary Material 1
17 pages
An Overview of Machine Learning Methods For Monotherapy Drug Response Prediction
No ratings yet
An Overview of Machine Learning Methods For Monotherapy Drug Response Prediction
18 pages
Towards Foundation Models For Knowledge
No ratings yet
Towards Foundation Models For Knowledge
22 pages
2024 05 24 595730v1 Full
No ratings yet
2024 05 24 595730v1 Full
22 pages
Antiberta
No ratings yet
Antiberta
12 pages
Aranha 2020
No ratings yet
Aranha 2020
12 pages
bp2 3
No ratings yet
bp2 3
13 pages
BioinformaticsResourcesoolsConformational B CellEpitopePrediction
No ratings yet
BioinformaticsResourcesoolsConformational B CellEpitopePrediction
12 pages
Bbae 380
No ratings yet
Bbae 380
11 pages
Project Reference
No ratings yet
Project Reference
16 pages
Neural Networks To Learn Protein Sequence-Function Relationships From Deep Mutational Scanning Data
No ratings yet
Neural Networks To Learn Protein Sequence-Function Relationships From Deep Mutational Scanning Data
12 pages
DeepPFP - A Multi-task-Aware Architecture For Protein Function Prediction
No ratings yet
DeepPFP - A Multi-task-Aware Architecture For Protein Function Prediction
10 pages
1 s2.0 S1093326323000864 Main
No ratings yet
1 s2.0 S1093326323000864 Main
17 pages
Bioinformatics: Prediction of MHC Class II-binding Peptides Using An Evolutionary Algorithm and Artificial Neural Network
No ratings yet
Bioinformatics: Prediction of MHC Class II-binding Peptides Using An Evolutionary Algorithm and Artificial Neural Network
10 pages
Paper 1-IGFold
No ratings yet
Paper 1-IGFold
15 pages
Cold Spring Harb Perspect Biol-2023-Barton-cshperspect.a041462
No ratings yet
Cold Spring Harb Perspect Biol-2023-Barton-cshperspect.a041462
18 pages
Exploration of Protein Sequence Embeddings For Protein-Ligand Binding Site Detection
No ratings yet
Exploration of Protein Sequence Embeddings For Protein-Ligand Binding Site Detection
6 pages
Immuno in For Matics
No ratings yet
Immuno in For Matics
50 pages
Epitope Predictions: Abbreviations
No ratings yet
Epitope Predictions: Abbreviations
19 pages
ODLCP A Novel Methodology To Predict Immune System Strength by Using Optimized Deep Learning and Classification Principle
No ratings yet
ODLCP A Novel Methodology To Predict Immune System Strength by Using Optimized Deep Learning and Classification Principle
6 pages
Peerj Cs 275
No ratings yet
Peerj Cs 275
17 pages
Journal Pone 0296737
No ratings yet
Journal Pone 0296737
15 pages
B Cell Epitopes White Paper
No ratings yet
B Cell Epitopes White Paper
4 pages
Why Deep Models Often Cannot Beat Non-Deep Counterparts On Molecular Property Prediction?
No ratings yet
Why Deep Models Often Cannot Beat Non-Deep Counterparts On Molecular Property Prediction?
11 pages
Chapter 22 Lymphatic System Power Point
No ratings yet
Chapter 22 Lymphatic System Power Point
62 pages
COACH YZhang
No ratings yet
COACH YZhang
8 pages
Sequence-Only Prediction of Binding Affinity Changes: A Robust and Interpretable Model For Antibody Engineering
No ratings yet
Sequence-Only Prediction of Binding Affinity Changes: A Robust and Interpretable Model For Antibody Engineering
8 pages
Imm 155 1
No ratings yet
Imm 155 1
2 pages
Ansari and Raghava - in Silico Models For BCE Recognition and Signalling
No ratings yet
Ansari and Raghava - in Silico Models For BCE Recognition and Signalling
10 pages
Research Poster
No ratings yet
Research Poster
1 page
Improved Prediction of MHC Class I and Class II Epitopes - Nielsen Et Al 2004
No ratings yet
Improved Prediction of MHC Class I and Class II Epitopes - Nielsen Et Al 2004
10 pages
Cancer Textbook: 1
From Everand
Cancer Textbook: 1
Aliasghar Tabatabaei Mohammadi
No ratings yet
Authors: Ibel Carri, María Marcela Barrio, Morten Nielsen: Background
No ratings yet
Authors: Ibel Carri, María Marcela Barrio, Morten Nielsen: Background
1 page
Accurate Approximation Method For Prediction of Class I MHC Affinities For Peptides - Lundegaard, Lund & Nielsen 2008
No ratings yet
Accurate Approximation Method For Prediction of Class I MHC Affinities For Peptides - Lundegaard, Lund & Nielsen 2008
2 pages
Immunology Serology 2021
No ratings yet
Immunology Serology 2021
31 pages
Belatacept A New Toy
No ratings yet
Belatacept A New Toy
20 pages
B Cell and T Cell Development
No ratings yet
B Cell and T Cell Development
2 pages
Untitled
No ratings yet
Untitled
23 pages
Widoasti Putri Utami - 22010120410005 PDF
No ratings yet
Widoasti Putri Utami - 22010120410005 PDF
20 pages
Immunology Questions
No ratings yet
Immunology Questions
186 pages
Moravec Lymphatic System Slides
No ratings yet
Moravec Lymphatic System Slides
43 pages
Kit Insert BD Multitest
No ratings yet
Kit Insert BD Multitest
50 pages
Dr. Lamia El Wakeel, PhD. Lecturer of Clinical Pharmacy Ain Shams University
No ratings yet
Dr. Lamia El Wakeel, PhD. Lecturer of Clinical Pharmacy Ain Shams University
19 pages
Immediate Download Medical Immunology, 7th Edition Gabriel Virella Ebooks 2024
100% (1)
Immediate Download Medical Immunology, 7th Edition Gabriel Virella Ebooks 2024
55 pages
Art 34329
No ratings yet
Art 34329
9 pages
Janeway's Immunobiology Notes
No ratings yet
Janeway's Immunobiology Notes
8 pages
Rajendran 2018
No ratings yet
Rajendran 2018
38 pages
ImmunoSero Immunology Overview Notes
No ratings yet
ImmunoSero Immunology Overview Notes
4 pages
1 s2.0 S104453232100066X Main
No ratings yet
1 s2.0 S104453232100066X Main
17 pages
Tuberculosis: Nature Reviews Disease Primers October 2016
No ratings yet
Tuberculosis: Nature Reviews Disease Primers October 2016
25 pages