mutscan: a flexible R package for efficient end-to-end analysis of multiplexed assays of variant effect data
*Correspondence: [email protected]; [email protected]
1 Friedrich Miescher Institute for Biomedical Research, Basel, Switzerland
2 SIB Swiss Institute of Bioinformatics, Basel, Switzerland
3 University of Basel, Basel, Switzerland
Abstract
Multiplexed assays of variant effect (MAVE) experimentally measure the effect of large
numbers of sequence variants by selective enrichment of sequences with desirable
properties followed by quantification by sequencing. mutscan is an R package for flexible
analysis of such experiments, covering the entire workflow from raw reads up to
statistical analysis and visualization. The core components are implemented in C++ for
efficiency. Various experimental designs are supported, including single or paired
reads with optional unique molecular identifiers. To find variants with changed relative
abundance, mutscan employs established statistical models provided in the edgeR and
limma packages. mutscan is available from https://github.com/fmicompbio/mutscan.
Keywords: Deep mutational scanning, Multiplexed assays of variant effect, R package
Background
A major question in biology is how sequence and function are related. The
advances made in modern sequencing technology have resulted in an exponential
increase in whole-genome and exome sequencing data over the past few decades and
genome-wide association studies (GWAS) have found statistical associations between
certain genetic variants and phenotypes or diseases [1]. However, the phenotypic con-
sequences of a large fraction of variants identified in the human genome remain elu-
sive [2], which is why these variants have been termed variants of uncertain significance
(VUS). For example, 41.8% of variants currently listed in ClinVar are characterized as
VUS [3]. Therefore, a pressing objective has been to find ways to annotate these variants
in an efficient way.
© The Author(s) 2023. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits
use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original
author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third
party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material.
If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or
exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit
http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver
(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
Soneson et al. Genome Biology (2023) 24:132
Over the past decades, multiplexed assays of variant effect (MAVE) have revolutionized
the study of sequence-function relationships by enabling the simultaneous assessment
of the functional consequences of thousands of sequence variants on a given
phenotype. For example, a large library of variants is created by mutating a sequence of
interest (deep mutational scanning (DMS)), and this library is exposed to a pooled selective
assay which results in an enrichment of variants with high activity in the given assay
and a depletion of variants with low activity [4]. The frequency of each variant before
and after selection can be quantified using high-throughput sequencing (Fig. 1A). Vari-
ant counts can be obtained by either sequencing the variants directly or using molecu-
lar barcodes that uniquely identify each variant. The latter can reduce sequencing costs
and increase read quality [5]. Enrichment scores calculated from the variant frequen-
cies can be used to infer molecular function and thus the functional effect of a mutation
relative to the wild-type sequence [4]. The variety of experimental designs that can be
used in MAVE emphasizes the value of these assays and their flexibility in addressing
diverse biological questions. They have been used to examine activities of proteins, such
as protein–protein interaction (PPI) [6–8], E3 ubiquitin ligase activity [8], protein abun-
dance [7, 9], receptor binding [10], aggregation [11, 12], and activity within signaling
pathways [13]. The functional assays used to achieve enrichment or depletion of variants
are equally diverse and include for example fitness or cell growth [6, 7, 11, 12], different
reporter assays coupled with fluorescence-activated cell sorting (FACS) [9, 10, 13, 14],
and protein display [8, 15].
The growth of the field has been further driven by the decreasing cost of sequencing
and the simplified construction of large libraries thanks to the commercial large-scale
synthesis of DNA oligonucleotides [16]. New technical developments that allow the syn-
thesis of large libraries of entire synthetic genes will probably result in even larger librar-
ies [17]. Recently, a database was launched with the aim to collect the rich data gathered
from MAVE assays in a central place with a unified structure to make it accessible to the
scientific community [18]. This, however, also calls for streamlined and more standard-
ized analysis methods, including rigorous statistical analysis that considers the possible
sources of error in MAVE experiments and therefore allows confident statements
to be made about the true functional consequences of variants. Several tools have been
published to address this demand (Table 1); the most elaborate and widely used among these
are Enrich2 [19] and DiMSum [20].
Here, we present mutscan, a novel R package that provides a unified, flexible inter-
face to the analysis of MAVE experiments, covering the entire workflow from FASTQ
files to count tables and statistical analysis and visualization. The core read processing
module is implemented in C++, which enables the analysis of large sequencing experiments
within reasonable time and memory constraints. mutscan is easy to install and
use, has a flexible interface that encompasses a broad range of experimental designs, and
employs established statistical testing frameworks developed for count data. We apply
mutscan, as well as Enrich2 and DiMSum, to several experimental MAVE data sets and
show that while estimated counts and enrichment scores are often highly concordant
between methods, mutscan is generally able to process the data faster, with lower mem-
ory requirements, and more efficient use of multi-core processing. Given the variety of
MAVE experimental designs, the ever-increasing scale of MAVE experiments, and the
democratization of the field, we believe that its flexibility, efficiency, and ease of access
will make mutscan an important addition to the MAVE analysis tool ecosystem.
Table 1 Comparison of different tools for the analysis of MAVE data. In our evaluations, we compare
mutscan to DiMSum and Enrich2, as these are widely used in the field and align with mutscan in
terms of their scope and aim
Columns: Aspect; mutscan; DiMSum [20]; Enrich2 [19]
Results
Example data sets
The results presented below are obtained by applying mutscan and other tools (Table 1)
to four deep mutational scanning data sets (Table 2). These data sets represent a vari-
ety of typical MAVE experimental designs and have been previously used for evaluation
purposes [20].
Table 2 Overview of deep mutational scanning data sets used in this study
Columns: Data set; Number of replicates; Library type; Molecule; Activity measured; Total number of reads (Mio.)
part of the workflow processes each sequencing library independently. Hence, addi-
tional samples can easily be added to an experiment without the need to re-process
the existing samples. In this step, reads that do not adhere to the user specifications
(e.g., too low base quality, too many or forbidden mutations compared to a provided
wild-type sequence) are filtered out, and the remaining ones are used to tabulate
the number of reads (or unique UMI sequences, if applicable) corresponding to each
observed sequence variant. For increased processing speed, this step can be paral-
lelized. In the second part, the output from all samples in the experiment is com-
bined into a joint SummarizedExperiment object [26], containing the merged count
matrix, a summary of the filtering applied to each individual sample, and additional
information about the detected variants, such as the nucleotide and amino acid
sequence; the number of mutated bases, codons, and amino acids; and the type of
mutations (silent, non-synonymous, stop). Finally, the merged object can be used as
the input to functions generating diagnostic plots and reports, as well as statistical
analysis functions that estimate log-fold changes and find variants that are increas-
ing or decreasing significantly in abundance during the selection process. Since the
data is represented as a SummarizedExperiment object, it can also be directly used
as input to a wide range of analysis and visualization tools from the Bioconductor
ecosystem [27].
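The per-sample tabulation described above can be illustrated with a minimal sketch. It is not the mutscan implementation (which is written in C++ and exposed through R); the function name `tabulate_variants`, the tuple layout of the reads, and the quality threshold are assumptions made purely for illustration.

```python
from collections import defaultdict

def tabulate_variants(reads, min_avg_quality=20.0):
    """Count reads and unique UMIs per variable sequence (illustrative only).

    `reads` is an iterable of (umi, variable_seq, qualities) tuples, where
    `qualities` holds the Phred scores of the variable sequence.
    """
    n_reads = defaultdict(int)
    umis = defaultdict(set)
    n_filtered = 0
    for umi, seq, quals in reads:
        # Filter out reads whose average base quality is below the threshold.
        if sum(quals) / len(quals) < min_avg_quality:
            n_filtered += 1
            continue
        n_reads[seq] += 1
        umis[seq].add(umi)
    counts = {s: {"reads": n_reads[s], "umis": len(umis[s])} for s in n_reads}
    return counts, n_filtered

example = [
    ("AAT", "ACGTAC", [30] * 6),
    ("AAT", "ACGTAC", [32] * 6),  # same UMI as above: counted once as a UMI
    ("CCG", "ACGTAC", [31] * 6),
    ("GGA", "ACGTTC", [35] * 6),
    ("TTA", "ACGTTC", [5] * 6),   # low quality: filtered out
]
counts, n_filtered = tabulate_variants(example)
print(counts, n_filtered)
```

Because each sample is tabulated on its own, a new sequencing library can be processed in the same way and merged with the existing tables afterwards.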
with the digestFastqs() function from mutscan. This step extracts the sequences of the
variable regions corresponding to the FOS and JUN variants from the paired reads, com-
pares them to the provided wild-type sequences and identifies the differences, and tab-
ulates the number of reads and unique UMIs for each identified variant combination.
Since the variable regions of the forward and reverse reads correspond to variants of dif-
ferent proteins encoded at two different loci and do not share any common sequence, we
instruct digestFastqs() to process them separately rather than attempting to merge them.
This also allows us to submit separate wild-type sequences for the variable regions in the
forward and reverse reads. We retain only reads with at most one mutated codon in each
of the two proteins.
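As a rough illustration of the codon-level comparison described above, the following sketch compares each variable region to its own wild-type sequence and keeps a read pair only if each region has at most one mutated codon. The function names and the toy sequences are hypothetical and do not reflect the digestFastqs() interface.

```python
def mutated_codons(seq, wildtype):
    """Return (1-based codon position, observed codon) for each mutated codon."""
    if len(seq) != len(wildtype) or len(seq) % 3 != 0:
        raise ValueError("sequences must have equal length divisible by 3")
    diffs = []
    for i in range(0, len(seq), 3):
        if seq[i:i + 3] != wildtype[i:i + 3]:
            diffs.append((i // 3 + 1, seq[i:i + 3]))
    return diffs

def keep_read_pair(fwd, rev, wt_fwd, wt_rev, max_codons=1):
    """Forward and reverse regions are compared to separate wild types."""
    return (len(mutated_codons(fwd, wt_fwd)) <= max_codons
            and len(mutated_codons(rev, wt_rev)) <= max_codons)

# Toy wild-type sequences (not the real FOS/JUN sequences).
wt_fwd, wt_rev = "ACTGATGGG", "TTTCCC"
print(mutated_codons("ACTGAAGGG", wt_fwd))
print(keep_read_pair("ACTGAAGGG", "TTTCCC", wt_fwd, wt_rev))
print(keep_read_pair("AGTGAAGGG", "TTTCCC", wt_fwd, wt_rev))
```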
The output of this initial processing step is a list for each sample, containing a
count table, a filtering summary, and a record of the parameters that were used
(Fig. 1B). While these objects can be explored as they are, it is more convenient to
merge them into a joint SummarizedExperiment [26] object for downstream analysis
(Fig. 1C), which is done via the summarizeExperiment() function in mutscan. The
resulting object contains a matrix with the UMI counts for all variants in all six sam-
ples, as well as a summary of the number of reads filtered out at each step, and any
metadata provided for the samples (replicate ID, condition, optical density, etc.), and
feeds directly into the diagnostic plot and statistical analysis functions in mutscan,
including plotPairs() and plotFiltering(), the outputs of which are shown in Fig. 2. We
observe that, as expected, we find more unique variants with multiple base muta-
tions, but the observed abundance of individual variants with multiple mutations is
markedly lower than for variants with no or a single mutated base (Fig. 2A, B). Using
mutscan to visualize the filtering process further illustrates that across all samples,
the main reasons for read pairs being filtered out are that they contain an adapter or
that they contain more than the allowed number of mutated codons (Fig. 2C, D).
Next, we use mutscan to investigate the concordance among the six samples, by plot-
ting the estimated variant counts (Fig. 2E). As expected, the correlation within each
type of sample (input/output) is considerably higher than the correlation between input
and output samples, indicating that the selection step indeed influences the sample
composition.
After the initial quality assessment, we use mutscan to summarize the counts for
variants with the same amino acid sequence. From this matrix, we then estimate a
protein–protein interaction score for each variant and replicate, indicating the effect
of the variant relative to the wild-type sequence as described by [6]. Focusing only
on variants with a mutated amino acid in either FOS or JUN (but not both), we can
generate a heatmap summarizing the impact of each single amino acid mutation on
the overall interaction score (Fig. 2F). These heatmaps serve as the basis for interpret-
ing the mechanisms by which mutations impact the molecular activity studied. For
instance, as was observed in the original paper [6] for this data set, positions crucial
for heterodimerization are highly sensitive to mutations. Any substitution of the leu-
cine at positions 4, 11, 18, and 25, which form the hydrophobic core of the interaction
interface, is detrimental. Substitutions at positions involved in salt bridge formation across the
interface, for instance, between position 21 in Fos and position 26 in Jun, are also typically
detrimental for heterodimer formation, although the magnitude of the effect is lower
than at core positions. The heatmap can also reveal positions where the wild-type
residue appears sub-optimal, such as position 8 in Fos, where substitution of the hydrophilic
wild-type threonine by more hydrophobic amino acids leads to an increase in interaction
score, as one would expect for a position at the hydrophobic core of the interface.
The presence of this sub-optimal residue in the wild-type sequence might thus
result from an evolutionary trade-off with other properties of Fos, such as interaction
with other partners.
Fig. 2 Results from the mutscan re-analysis of the FOS/JUN protein interaction data set from [6]. A The
number of variants detected with different numbers of mutated bases. B The average abundance of variants
with different numbers of mutated bases. While a larger number of different variants with two or more
mutations are observed, these are generally much less abundant than variants with no or a single mutated
base. C, D Diagnostic plots of read filtering performed by digestFastqs(). In this data set, the main reasons
for filtering out read pairs are that they either contain an adapter sequence or that the number of mutated
codons exceeds the defined threshold of at most one mutated codon per protein. E Pairs plot displaying
the concordance of the observed counts across the six samples. Generally, high correlations are observed
between the three input samples as well as between the three output samples, indicating good robustness
of the measurements. The correlation between the input and output samples is considerably lower, reflecting
the impact of the selection. F Heatmap showing the estimated protein–protein interaction score for all
single-amino acid variants of each of the two proteins. Red color indicates an increased interaction, while
blue signifies decreased interaction
mutated codons that are allowed in the identified variants. This choice will further
impact the naming of the variants (in terms of the codon or nucleotide deviations
from the closest wild-type sequence).
– Collapsing of similar sequences: If no wild-type sequences are provided, the user
has the option to collapse variants (unique variable sequences) with at most a
given number of mutations between them. The collapsing is done in a greedy way,
starting from the most abundant variant, and can be limited to only collapsing
variants with a large enough abundance ratio.
– Processing only a subset of the reads: For testing purposes, it is often useful to
process only a small subset of the reads. mutscan allows the user to limit the pro-
cessing to the first N reads in the FASTQ file, where N is specified by the user.
– Various filtering criteria: mutscan implements a range of filtering criteria, includ-
ing the number of “N” bases in the variable and/or UMI sequence, the number of
mutations in the constant and/or variable sequence (if a reference sequence is pro-
vided), the base quality of the identified mutations and/or the average base quality
in the read, the presence of forbidden codons (specified using IUPAC code), or the
absence of a valid overlap between forward and reverse reads for merging. The output object
contains a table listing the number of reads filtered out by each of the criteria.
– Export of excluded reads (with a reason for exclusion) to FASTQ files for further
investigation: In cases where reads are filtered out for any of the reasons listed
above, it may be helpful to be able to process these further. mutscan can write
reads that are filtered out to a (pair of ) FASTQ file(s), including the reason for
exclusion in the read identifier.
– Estimation of sequencing error rate: If the input reads contain a constant region,
mutscan estimates the sequencing error rate by counting the number of mis-
matches compared to the expected sequence across the reads. The error rate is
further stratified by the base quality reported in the FASTQ file.
– Nucleotide- or amino acid-based analysis: The main output from digestFastqs()
is represented in base or codon space. However, the corresponding amino acid
information is recorded, and the count matrix can easily be collapsed on the
amino acid level. In addition, mutscan reports the type of mutations (silent, non-
synonymous, stop) present in each variant.
– Flexible analysis frameworks: The statistical testing module in mutscan is based
on established packages for the analysis of digital gene expression data (edgeR
[29] and limma-voom [30]). Both tools allow the user to specify an arbitrary
(fixed-effect) design and thus provide excellent flexibility for testing complex
hypotheses, not limited to paired comparisons of input and output samples.
Moreover, several different normalizations are available, allowing calculation of
both “absolute” and “relative” log-fold changes (e.g., changes relative to a wild-
type reference).
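To make the notion of a "relative" log-fold change concrete, the sketch below computes, for a single input/output sample pair, the log2 ratio of each variant and subtracts the log2 ratio of the wild type. This is a plain pseudocount-based ratio chosen for illustration; it is not the edgeR or limma model that mutscan uses, and the function name, the `"WT"` key, and the pseudocount are assumptions.

```python
import math

def relative_log2fc(input_counts, output_counts, wildtype="WT", pseudocount=1.0):
    """Log2 fold change (output vs input) of each variant, relative to wild type."""
    def log_ratio(variant):
        out = output_counts.get(variant, 0) + pseudocount
        inp = input_counts.get(variant, 0) + pseudocount
        return math.log2(out / inp)

    wt_change = log_ratio(wildtype)
    # Subtracting the wild-type change expresses each variant relative to it.
    return {v: log_ratio(v) - wt_change for v in input_counts if v != wildtype}

input_counts = {"WT": 999, "A": 99, "B": 99}
output_counts = {"WT": 1999, "A": 399, "B": 49}
scores = relative_log2fc(input_counts, output_counts)
print(scores)
```

In this toy example the wild type doubles, variant A quadruples, and variant B halves, so A scores 1 and B scores -2 relative to the wild type. An "absolute" log-fold change would omit the subtraction of the wild-type term.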
Fig. 3 mutscan p-value distributions for null comparisons. For each data set, repeated null data sets were
generated by artificially splitting the replicates into two approximately equally sized groups. For each such
artificial null data set, mutscan (with the method set to edgeR and limma, respectively) was used to fit a
model and test whether the log-fold change between input and output samples differed significantly
between the two artificial groups. The colored densities represent the individual data splits, while the dark
gray density represents the union of p-values from all data splits. Since the groupings are artificial, uniform
p-value distributions are expected. While technical differences among the samples, and the low sample size
in general, imply that not all comparisons provide exactly uniform p-value distributions, we do not observe
a systematic bias in the p-values from mutscan. Only variants with more than 50 counts in all input samples
were considered for this analysis. The number of retained features is indicated in each panel
Fig. 4 Comparison of computational performance metrics for mutscan, Enrich2, and DiMSum. Generally,
mutscan processes the included data sets faster and with a lower memory footprint than competing
methods. In addition, only small amounts of data are being read from and written to disk during the
processing. The digestFastqs() metrics for the Li_tRNA_sel30 data set are averaged across the five runs on the
single input sample, since only one run is required for mutscan. Total I/O volumes are separated in input and
output, indicated above by I and O, respectively. RSS, resident set size
as the whole sample processing is performed by a single function that traverses the input
files only once and without the need to create intermediate files on disk for transfer of
data between different software tools. The total runtime of DiMSum, which is reading
and writing much larger volumes of data to disk, also likely depends more strongly on
the performance of the storage system. Another consequence of mutscan’s design is that
a larger fraction of the time is spent on actions that are parallelizable, which can be seen
by its higher average CPU load (400–600% for mutscan when run with 10 cores, com-
pared with 100–200% for DiMSum with 10 cores). Enrich2 is not parallelizable and thus
exhibits a constant load of 100%.
While the evaluations mentioned above were all run with 10 cores, we also investi-
gated how the key performance parameters scaled with the number of cores provided to
mutscan when running the digestFastqs() function on a single pair of FASTQ files (Addi-
tional file 2: Fig. S1). The results suggest that while the average load increases linearly
with an increasing number of cores, indicating good scalability of the parallel parts of
the code, the benefit in terms of decreased total execution time is significantly reduced
when more than 10 cores are used. This is likely explained by the constant runtime con-
sumed by the serial parts in the code. In addition, for a fixed number of cores, the raw
data processing performed by digestFastqs() scales linearly with the number of input
reads in terms of execution time and close to linearly in terms of memory requirement
(Additional file 2: Fig. S2).
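The diminishing benefit of additional cores in the presence of a constant serial part is the behavior described by Amdahl's law. The sketch below evaluates that law for a hypothetical parallel fraction of 0.9; this value is an assumption chosen for illustration and was not measured for mutscan.

```python
def amdahl_speedup(parallel_fraction, n_cores):
    """Theoretical speedup when only `parallel_fraction` of the runtime scales with cores."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_cores)

# Hypothetical parallel fraction; the serial part caps the speedup at 1 / 0.1 = 10.
for n_cores in (1, 2, 5, 10, 20, 40):
    print(n_cores, round(amdahl_speedup(0.9, n_cores), 2))
```

Under this assumption, going from 10 to 20 cores raises the speedup only from about 5.3 to about 6.9, which is qualitatively consistent with the flattening of the execution time reported above.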
Fig. 5 Comparison of counts estimated by mutscan, DiMSum, and Enrich2. For each data set, a representative
sample is shown (indicated in the respective figure titles, together with the data set). In general, the counts
estimated by the three methods show a high correlation, and deviations are likely explained by differences
in aspects such as how the set of allowed or forbidden mutations is specified, how the filtering of low-quality
sequences is implemented, and whether mismatches or partial matches are allowed in the specified primer
sequence
Fig. 6 Comparison of enrichment scores estimated by mutscan, DiMSum, and Enrich2. For mutscan, the
values are logFCs estimated by either edgeR or limma. For DiMSum, they correspond to the enrichment
score derived from all replicates. For Enrich2, they are the averages of the scores for the replicates. The lower
correlations seen in the Bolognesi_TDP43_290_331 data set, and to some extent, the Diss_FOS_JUN data set
reflect the less stringent variant filtering criteria used in these data sets (up to 3 mutated bases for Bolognesi_
TDP43_290_331 and up to two mutated codons for Diss_FOS_JUN, see the “Methods” section). Considering
only variants with fewer recorded mutations leads to higher correlations, more comparable to the Diss_FOS
data set where only up to two mutated bases were allowed (Additional file 2: Figs. S11-S14)
S15). Again, most of the discrepancies are contributed by variants with more mutated
bases (Additional file 2: Fig. S13).
Discussion
We describe mutscan, a new package to process and analyze multiplexed assays of
variant effect data. mutscan is an R package and does not depend on external software
libraries beyond other R packages; thus, it is easy to install and use across all major oper-
ating systems. At the same time, it provides a high degree of interoperability with other
tools, since the summarized data is represented in a SummarizedExperiment object,
which can be directly used with a wide range of analysis and visualization functions
within the Bioconductor ecosystem.
For many of the MAVE studies to date (including the ones used for evaluation in
this study), the readout is the actual DNA sequence of the target protein(s) of inter-
est. However, the field is increasingly moving towards instead sequencing unique
barcodes associated with these variants, as this provides a better way of distinguishing
true variants from sequencing errors and also simplifies the analysis of longer
protein sequences. mutscan also supports this type of experimental setup. If no reference
sequence is provided, the variants (barcodes) will be represented by their
sequence. Moreover, observed sequences can be collapsed if they are within a given
distance, and have at least a pre-specified ratio of abundance in the sample. Current
development work for mutscan includes streamlining the analysis workflow for large
barcode sequencing experiments further, including the mapping of barcodes to true
variants.
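The collapsing of similar sequences can be sketched as a greedy procedure that visits sequences from most to least abundant and absorbs close, sufficiently rarer neighbors. The code below is one possible reading of that description, not the mutscan implementation; the use of Hamming distance, the tie-breaking rule, and the parameter names `max_dist` and `min_ratio` are assumptions.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

def collapse_greedy(counts, max_dist=1, min_ratio=2.0):
    """Greedily merge rare sequences into abundant neighbors (illustrative only)."""
    # Visit sequences from most to least abundant; ties are broken alphabetically.
    order = sorted(counts, key=lambda s: (-counts[s], s))
    collapsed = {}
    assigned = set()
    for seq in order:
        if seq in assigned:
            continue
        assigned.add(seq)
        total = counts[seq]
        for other in order:
            if other in assigned or len(other) != len(seq):
                continue
            close = hamming(seq, other) <= max_dist
            abundant_enough = counts[seq] >= min_ratio * counts[other]
            if close and abundant_enough:
                total += counts[other]
                assigned.add(other)
        collapsed[seq] = total
    return collapsed

barcodes = {"AAAA": 100, "AAAT": 5, "AATT": 3, "CCCC": 40, "CCCG": 30}
print(collapse_greedy(barcodes))
```

Here "AAAT" is merged into "AAAA", whereas "CCCG" is kept separate from "CCCC" because their abundance ratio is below the required threshold, which would be expected for two genuinely distinct barcodes.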
Conclusions
We have described mutscan, a flexible, easy-to-use R package for processing and sta-
tistical analysis of multiplexed assays of variant effect data. mutscan is designed in a
modular way and is also directly applicable to other types of data where the aim is to identify
and tabulate substitution variants compared to a provided reference sequence,
or to tabulate unique sequences directly, potentially after collapsing variants within a
certain distance. By leveraging established tools for statistical analysis of count data,
the analytical framework provides a high degree of flexibility to address a variety of
practical questions.
Methods
Figure 1 provides an overview of the functionality implemented in mutscan. A typical
workflow can often be summarized in three main steps: (1) processing of individual
samples/sequencing libraries; (2) aggregation of the output from the individual samples
into a single, combined object; and (3) analysis and visualization.
1. If applicable, search for user-specified adapter sequences and remove any read (pair)
where these are detected.
2. Split the read (pair) into components as specified by the user (see Box 1 for details);
in particular, extract the constant and variable parts of the reads. In the process, filter
out reads that are not compatible with the user-specified composition.
3. Reverse complement the forward and/or reverse constant and variable sequences if
requested.
4. If requested, merge the forward and reverse variable sequences. The user can specify
the minimum and maximum overlaps, the minimum and maximum lengths of the
merged sequence, and the maximal fraction of allowed mismatches. If no valid over-
lap is found satisfying these criteria, the read pair is filtered out.
5. Filter out reads where the average base quality in the variable sequence is below the
user-specified threshold or where the number of Ns in the variable or UMI sequence
exceeds an imposed threshold.
6. If one or more wild-type/reference sequences are provided, compare the extracted
variable region to these and find the closest match. If no wild-type sequence is found
within an imposed mismatch limit, or if more than one wild-type sequence provides
an equally good optimal match, the read (pair) is filtered out. Reads will also be fil-
tered out if the base quality of the identified mutation(s) is below an imposed thresh-
old, or if the mutated codon(s) matches a user-specified list of forbidden codons.
Separate sets of wild-type sequences can be provided for the forward and reverse
reads in a pair, if appropriate.
7. Similarly, compare the extracted constant sequences to provided reference
sequences, and filter out reads where the difference exceeds a given threshold, or
where more than one reference sequence provides equally good optimal matching.
8. If the read has not been filtered out in any of the steps above, store the observed
sequence as well as an assigned “mutant name,” consisting of the name of the closest
wild-type sequence together with the positions and sequences of the mutated nucle-
otides or codons. If no wild-type sequences are provided, the mutant name is the
observed variable sequence.
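Steps 6 and 8 above can be sketched as follows: find the closest wild-type sequence, filter the read if the best match exceeds a mismatch limit or is ambiguous, and otherwise build a mutant name. The naming format used here (wild-type name, then position and observed base for each mismatch) is hypothetical and may differ from the format mutscan actually emits; the function names are likewise illustrative.

```python
def closest_wildtype(seq, wildtypes, max_mismatches=3):
    """Return (name, n_mismatches) of the unique closest wild type, or None if filtered."""
    dists = {
        name: sum(a != b for a, b in zip(seq, wt))
        for name, wt in wildtypes.items()
        if len(wt) == len(seq)
    }
    if not dists:
        return None
    best = min(dists.values())
    winners = [name for name, d in dists.items() if d == best]
    # Filter out reads that are too distant or that match several wild types equally well.
    if best > max_mismatches or len(winners) > 1:
        return None
    return winners[0], best

def mutant_name(seq, wt_name, wt_seq):
    """Hypothetical name: wild-type name plus 1-based position and base of each mismatch."""
    diffs = [f"{i + 1}{b}" for i, (b, r) in enumerate(zip(seq, wt_seq)) if b != r]
    return wt_name + "." + ("_".join(diffs) if diffs else "WT")

# Toy reference sequences, not the real FOS/JUN sequences.
wildtypes = {"FOS": "ACGTACGT", "JUN": "TTGTACGA"}
match = closest_wildtype("ACGTACGA", wildtypes)
print(match)
print(mutant_name("ACGTACGA", "FOS", wildtypes["FOS"]))
```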
Finally, if one or more constant regions are included in the reads, mutscan tabulates the
number of mismatches for each base quality and returns the table.
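A table of mismatches per base quality, as described above, could be assembled along the following lines. This is a minimal sketch that assumes the constant region has already been extracted and has the same length as the expected sequence; the function name and the returned layout are illustrative and not part of mutscan.

```python
from collections import defaultdict

def error_rate_by_quality(reads, expected):
    """Tabulate mismatches against `expected`, stratified by Phred quality.

    `reads` is an iterable of (constant_seq, qualities) pairs.
    Returns {quality: (n_mismatches, n_bases, error_rate)}.
    """
    mismatches = defaultdict(int)
    totals = defaultdict(int)
    for seq, quals in reads:
        for base, ref, q in zip(seq, expected, quals):
            totals[q] += 1
            if base != ref:
                mismatches[q] += 1
    return {q: (mismatches[q], totals[q], mismatches[q] / totals[q])
            for q in sorted(totals)}

reads = [
    ("ACGT", [30, 30, 30, 10]),
    ("ACGA", [30, 30, 30, 10]),  # mismatch at a low-quality base
    ("ACTT", [30, 30, 10, 30]),  # mismatch at a low-quality base
]
table = error_rate_by_quality(reads, expected="ACGT")
print(table)
```

In this toy input both mismatches fall on bases with quality 10, so the estimated error rate is 2/3 for quality 10 and 0 for quality 30.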
nucleotide substitutions. For the Li_tRNA_sel30 data set, we allowed any number of muta-
tions but filtered the quantified variants to only retain those with a count exceeding 2000 in
all the input samples and exceeding 200 in all output samples. For the Diss_FOS_JUN data
set, we instructed mutscan to allow up to one mutated codon in each of the two variable
sequences, and only allowed mutated codons encoded by the “NNS” IUPAC code. For DiM-
Sum, we limited the total number of mutated amino acids to two (across the concatenated
wild-type sequence), indicated that the mutagenesis was done on the codon level, and pro-
vided the IUPAC code for the allowed nucleotide sequence. Before comparison, we further
filtered out the DiMSum variants where the two mutated amino acids occurred in the same
protein. We also removed variants with both non-synonymous and silent mutations from the
mutscan output, as the default setting in DiMSum is to exclude these. Configuration files for
all methods are available from https://github.com/fmicompbio/mutscan_manuscript_2022.
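Restricting mutated codons to those encoded by an IUPAC pattern such as "NNS" amounts to checking each base against the set of nucleotides allowed at that position. The sketch below uses the standard IUPAC nucleotide codes; the function name `matches_iupac` is hypothetical and unrelated to the mutscan or DiMSum configuration syntax.

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def matches_iupac(codon, pattern):
    """True if every base of `codon` is allowed by the IUPAC `pattern`."""
    return len(codon) == len(pattern) and all(
        base in IUPAC[code] for base, code in zip(codon, pattern)
    )

print(matches_iupac("GCG", "NNS"))  # third base G is allowed by S
print(matches_iupac("GCA", "NNS"))  # third base A is not allowed by S

# Enumerate all codons compatible with NNS.
nns_codons = ["".join(c) for c in product("ACGT", repeat=3)
              if matches_iupac("".join(c), "NNS")]
print(len(nns_codons))
```

An "NNS" pattern admits 4 x 4 x 2 = 32 of the 64 possible codons, so a mutated codon ending in A or T would not be consistent with the library design.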
Supplementary Information
The online version contains supplementary material available at https://doi.org/10.1186/s13059-023-02967-0.
Additional file 1. Available from https://doi.org/10.5281/zenodo.7896393. Reprocessing of FOS/JUN data from Diss
& Lehner. This file provides a reproducible record describing how mutscan was used to reprocess the data from a
previous publication.
Additional file 2: Figure S1. Computational performance metrics for mutscan’s digestFastqs() function run with dif-
ferent numbers of cores, processing a single input sample from the Li_tRNA_sel30 dataset. The black dots represent
the average across five independent runs, each indicated by a smaller red dot. The dashed curves connect the
average values for different numbers of cores. RSS - resident set size. Figure S2. Execution time and maximum memory
required by the digestFastqs() function when processing different numbers of reads (achieved by setting the maxN-
Reads argument of the digestFastqs() function to N, which limits the processing to the first N reads in the FASTQ file).
The black dots represent the average across five independent runs, each indicated by a smaller red dot. The dashed
line in (A) is a linear regression line, while the dashed curve in (B) connects the average values for different numbers
of cores. RSS - resident set size. Figure S3. Comparison of the variants detected by mutscan, DiMSum and Enrich2 in
the Diss_FOS dataset, stratified by the number of mutated bases in the variant. All variants with up to two mutated
bases are consistently detected by all three tools. The abundance represents the average log10(count+1) across
samples where the variant was quantified. Figure S4. Comparison of the observed counts for variants detected by
mutscan, DiMSum and Enrich2 in the Diss_FOS dataset, stratified by the number of mutated bases in the variant.
Figure S5. Comparison of the variants detected by mutscan and DiMSum in the Diss_FOS_JUN dataset, stratified by
the number of mutated amino acids in the variant. Most variants are found consistently with both tools. The ones
found by a single tool tend to have a low read count. The abundance represents the average log10(count+1) across
samples where the variant was quantified. Figure S6. Comparison of the observed counts for variants detected by
mutscan and DiMSum in the Diss_FOS_JUN dataset, stratified by the number of mutated amino acids in the variant.
Figure S7. Comparison of the variants detected by mutscan and DiMSum in the Bolognesi_TDP43_290_331 dataset,
stratified by the number of mutated bases in the variant. Almost all variants with up to two mutations are consistently
detected by both methods. The ones found by a single tool tend to have a low read count. The abundance
represents the average log10(count+1) across samples where the variant was quantified. Figure S8. Comparison of
the observed counts for variants detected by mutscan and DiMSum in the Bolognesi_TDP43_290_331 dataset, stratified
by the number of mutated bases in the variant. Figure S9. Comparison of the variants detected by mutscan and
DiMSum in the Li_tRNA_sel30 dataset, stratified by the number of mutated bases in the variant. Only variants with
up to six mutations are shown. Most variants are found with both tools, and the ones found by a single tool tend to
have a lower read count. The abundance represents the average log10(count+1) across samples where the variant
was quantified. Figure S10. Comparison of the observed counts for variants detected by mutscan and DiMSum in
the Li_tRNA_sel30 dataset, stratified by the number of mutated bases in the variant. Only variants with up to six
mutations are shown. Figure S11. Comparison of fitness scores estimated by mutscan, DiMSum and Enrich2 in the
Diss_FOS dataset, stratified by the number of mutated bases in the variant. The agreement between the fitness
scores from the different methods is very high for variants with a single mutation, and decreases as the number of
mutations increases (and alongside that, the average abundance decreases). Figure S12. Comparison of fitness
scores estimated by mutscan and DiMSum in the Diss_FOS_JUN dataset, stratified by the number of mutated amino
acids in the variant. The agreement between the fitness scores from the different methods is very high for variants
with a single mutation, and decreases as the number of mutations increases. Figure S13. Comparison of fitness
scores estimated by mutscan and DiMSum in the Bolognesi_TDP43_290_331 dataset, stratified by the number of
mutated bases in the variant. The agreement between the fitness scores from the different methods is very high for
variants with a single mutation, and decreases as the number of mutations increases. Figure S14. Comparison of
fitness scores estimated by mutscan and DiMSum in the Li_tRNA_sel30 dataset, stratified by the number of mutated
bases in the variant. Figure S15. Comparison of fitness scores estimated by DiMSum for individual replicates in the
four example datasets.
Additional file 3. Review history.
Acknowledgements
The authors would like to thank the current and former members of the Diss, Thomä and Stadler groups at FMI for the
discussions, testing, and feedback.
Peer review information
Anahita Bishop was the primary editor of this article and managed its editorial process and peer review in collaboration
with the rest of the editorial team.
Review history
The review history is available as Additional file 3.
Authors’ contributions
CS: conceptualization, methodology, software, formal analysis, writing – original draft, and visualization. AMB: investigation,
writing – original draft, and validation. GD: conceptualization, resources, validation, writing – review and editing,
and project administration. MBS: conceptualization, methodology, software, and writing – original draft. The authors
read and approved the final manuscript.
Funding
This work was supported by the Novartis Research Foundation (all authors) and SNF Project grant 197593 (GD, AMB). The
funding bodies did not have any role in the design of the study; the collection, analysis, and interpretation of the data; or writing of
the manuscript.
Declarations
Ethics approval and consent to participate
Not applicable.
Competing interests
The authors declare that they have no competing interests.
References
1. Uffelmann E, Huang QQ, Munung NS, de Vries J, Okada Y, Martin AR, et al. Genome-wide association studies. Nature
Reviews Methods Primers. 2021;1(1):1–21.
2. Burke W, Parens E, Chung WK, Berger SM, Appelbaum PS. The challenge of genetic variants of uncertain clinical
significance: a narrative review. Ann Intern Med. 2022;175(7):994–1000.
3. Pir MS, Bilgin HI, Sayici A, Coşkun F, Torun FM, Zhao P, et al. ConVarT: a search engine for matching human genetic
variants with variants from non-human species. Nucleic Acids Res. 2022;50(D1):D1172–8.
4. Fowler DM, Fields S. Deep mutational scanning: a new style of protein science. Nat Methods. 2014;11(8):801–7.
5. Fowler DM, Stephany JJ, Fields S. Measuring the activity of protein variants on a large scale using deep mutational
scanning. Nat Protoc. 2014;9(9):2267–84.
6. Diss G, Lehner B. The genetic landscape of a physical interaction. Elife. 2018;7:e32472. Available from:
https://doi.org/10.7554/eLife.32472.
7. Faure AJ, Domingo J, Schmiedel JM, Hidalgo-Carcedo C, Diss G, Lehner B. Mapping the energetic and allosteric
landscapes of protein binding domains. Nature. 2022;604(7904):175–83.
8. Starita LM, Young DL, Islam M, Kitzman JO, Gullingsrud J, Hause RJ, et al. Massively parallel functional analysis of
BRCA1 RING domain variants. Genetics. 2015;200(2):413–22.
9. Matreyek KA, Starita LM, Stephany JJ, Martin B, Chiasson MA, Gray VE, et al. Multiplex assessment of protein variant
abundance by massively parallel sequencing. Nat Genet. 2018;50(6):874–82.
10. Starr TN, Greaney AJ, Hilton SK, Ellis D, Crawford KHD, Dingens AS, et al. Deep mutational scanning of SARS-CoV-2
receptor binding domain reveals constraints on folding and ACE2 binding. Cell. 2020;182(5):1295-310.e20.
11. Bolognesi B, Faure AJ, Seuma M, Schmiedel JM, Tartaglia GG, Lehner B. The mutational landscape of a prion-like
domain. Nat Commun. 2019;10(1):1–12.
12. Seuma M, Faure AJ, Badia M, Lehner B, Bolognesi B. The genetic landscape for amyloid beta fibril nucleation
accurately discriminates familial Alzheimer’s disease mutations. Elife. 2021;10:e63364.
13. Jones EM, Lubock NB, Venkatakrishnan AJ, Wang J, Tseng AM, Paggi JM, et al. Structural and functional
characterization of G protein–coupled receptors with deep mutational scanning. Elife. 2020;9:e54895.
14. Carmody PJ, Zimmer MH, Kuntz CP, Harrington HR, Duckworth KE, Penn WD, et al. Coordination of -1 programmed
ribosomal frameshifting by transcript and nascent chain features revealed by deep mutational scanning. Nucleic
Acids Res. 2021;49(22):12943–54.
15. Fowler DM, Araya CL, Fleishman SJ, Kellogg EH, Stephany JJ, Baker D, et al. High-resolution mapping of protein
sequence-function relationships. Nat Methods. 2010;7(9):741–6.
16. Rocklin GJ, Chidyausiku TM, Goreshnik I, Ford A, Houliston S, Lemak A, et al. Global analysis of protein folding using
massively parallel design, synthesis, and testing. Science. 2017;357(6347):168–75.
17. Plesa C, Sidore AM, Lubock NB, Zhang D, Kosuri S. Multiplexed gene synthesis in emulsions for exploring protein
functional landscapes. Science. 2018;359(6373):343–7.
18. Rubin AF, Min JK, Rollins NJ, Da EY, Esposito D, Harrington M, et al. MaveDB v2: a curated community database with
over three million variant effects from multiplexed functional assays [Internet]. bioRxiv. 2021. p. 2021.11.29.470445.
Available from: https://www.biorxiv.org/content/10.1101/2021.11.29.470445v1. Cited 1 Dec 2021
19. Rubin AF, Gelman H, Lucas N, Bajjalieh SM, Papenfuss AT, Speed TP, et al. A statistical framework for analyzing deep
mutational scanning data. Genome Biol. 2017;18(1):150.
20. Faure AJ, Schmiedel JM, Baeza-Centurion P, Lehner B. DiMSum: an error model and pipeline for analyzing deep
mutational scanning data and diagnosing common experimental pathologies. Genome Biol. 2020;21(1):207.
21. Andrews S. FastQC: a quality control tool for high throughput sequence data [Online] [Internet]. 2010.
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/.
22. Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal. 2011;17(1):10–2.
23. Rognes T, Flouri T, Nichols B, Quince C, Mahé F. VSEARCH: a versatile open source tool for metagenomics. PeerJ.
2016;4:e2584.
24. Zorita E, Cuscó P, Filion GJ. Starcode: sequence clustering based on all-pairs search. Bioinformatics. 2015;31(12):1913–9.
25. Li C, Zhang J. Multi-environment fitness landscapes of a tRNA gene. Nat Ecol Evol. 2018;2(6):1025–32.
26. Morgan M, Obenchain V, Hester J, Pagès H. SummarizedExperiment: SummarizedExperiment container [Internet].
2022. https://bioconductor.org/packages/SummarizedExperiment.
27. Huber W, Carey VJ, Gentleman R, Anders S, Carlson M, Carvalho BS, et al. Orchestrating high-throughput genomic
analysis with Bioconductor. Nat Methods. 2015;12(2):115–21.
28. Cornish-Bowden A. Nomenclature for incompletely specified bases in nucleic acid sequences: recommendations
1984. Nucleic Acids Res. 1985;13(9):3021–30.
29. Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital
gene expression data. Bioinformatics. 2010;26(1):139–40.
30. Law CW, Chen Y, Shi W, Smyth GK. voom: precision weights unlock linear model analysis tools for RNA-seq read
counts. Genome Biol. 2014;15:R29.
31. Mölder F, Jablonski KP, Letcher B, Hall MB, Tomkins-Tinch CH, Sochat V, et al. Sustainable data analysis with
Snakemake. F1000Res. 2021;10:33.
32. Soo VWC, Swadling JB, Faure AJ, Warnecke T. Fitness landscape of a dynamic RNA structure. PLoS Genet. 2021;17(2):e1009353.
33. Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.
Genome Biol. 2014;15(12):550.
34. Dagum L, Menon R. OpenMP: an industry standard API for shared-memory programming. IEEE Comput Sci Eng.
1998;5(1):46–55.
35. Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, et al. limma powers differential expression analyses for RNA-
sequencing and microarray studies. Nucleic Acids Res. 2015;43(7):e47.
36. Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of RNA-seq data.
Genome Biol. 2010;11(3):R25.
37. Lun ATL, Smyth GK. csaw: a Bioconductor package for differential binding analysis of ChIP-seq data using sliding
windows. Nucleic Acids Res. 2016;44(5):e45.
38. Myint L, Avramopoulos DG, Goff LA, Hansen KD. Linear models enable powerful differential activity analysis in
massively parallel reporter assays. BMC Genomics. 2019;20(1):209.
39. Lun ATL, Chen Y, Smyth GK. It’s DE-licious: a recipe for differential expression analyses of RNA-seq experiments using
quasi-likelihood methods in edgeR. Methods Mol Biol. 2016;1418:391–416.
40. Soneson C, Bendel AM, Diss G, Stadler MB. mutscan. GitHub. 2023. https://github.com/fmicompbio/mutscan.
41. Soneson C, Bendel AM, Diss G, Stadler MB. mutscan v0.2.31. Zenodo. 2023. https://doi.org/10.5281/zenodo.7129132.
42. Soneson C, Bendel AM, Diss G, Stadler MB. mutscan v0.2.35. Zenodo. 2023. https://doi.org/10.5281/zenodo.7702318.
43. Diss G, Lehner B. The genetic landscape of a physical interaction. GSE102901. Gene Expression Omnibus. 2018.
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE102901.
44. Bolognesi B, Lehner B. The mutational landscape of a Prion-like domain. GSE128165. Gene Expression Omnibus.
2019. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE128165.
45. Li C, Zhang J. Multi-environment fitness landscapes of a tRNA gene. GSE111508. Gene Expression Omnibus. 2018.
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE111508.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.