LETTERS

https://doi.org/10.1038/s41587-019-0368-8

Accurate detection of mosaic variants in sequencing data without matched controls

Yanmei Dou1, Minseok Kwon1, Rachel E. Rodin2,3,4,5, Isidro Cortés-Ciriano1,8, Ryan Doan2,3,4, Lovelace J. Luquette1,6, Alon Galor1, Craig Bohrson1,6, Christopher A. Walsh2,3,4 and Peter J. Park1,7*

Detection of mosaic mutations that arise in normal development is challenging, as such mutations are typically present in only a minute fraction of cells and there is no clear matched control for removing germline variants and systematic artifacts. We present MosaicForecast, a machine-learning method that leverages read-based phasing and read-level features to accurately detect mosaic single-nucleotide variants and indels, achieving a multifold increase in specificity compared with existing algorithms. Using single-cell sequencing and targeted sequencing, we validated 80–90% of the mosaic single-nucleotide variants and 60–80% of indels detected in human brain whole-genome sequencing data. Our method should help elucidate the contribution of mosaic somatic mutations to the origin and development of disease.

A single individual harbors multiple populations of cells with distinct genotypes due to somatic mutations arising postzygotically1. Such diversity of genotypes in an individual is referred to as somatic mosaicism. Analysis of mosaic mutations in nondisease samples enables exploration of lineage patterns during development and characterization of mutational mechanisms operative in normal cells2–6. Recent studies have also demonstrated that somatic mutations contribute to many diseases besides cancer1,7–11.

Identification of mosaic mutations in genome sequencing data remains challenging in two key aspects. First, whereas functionally relevant cancer mutations typically confer proliferative advantage and thus have relatively high variant allele fractions (VAFs), most mosaic mutations are present in a small number of cells and have very low VAFs. In the extreme case, those occurring in postmitotic cells are present only in a single cell and are detectable only by single-cell sequencing, as we have done recently10. As standard cancer mutation callers typically have a lower VAF limit of 2–5% (refs. 12,13), detection of mutations with lower VAFs requires a more sensitive bioinformatic method and/or higher sequencing depth. Second, mosaic mutations that arise early in development generally exist in multiple tissues2,14. Thus, the conventional approach of using a paired control tissue for filtering germline variants and systematic errors would exclude such early occurring mutations. Several methods have been employed to detect mosaic single-nucleotide variants (SNVs) from nontumor tissues, such as the use of a germline variant caller15 with higher ploidy assumptions8 or a combination of somatic mutation callers3,7,9,14. Additional filtering leveraging trio data to exclude germline variants7–9,15 is also common. However, validation rates in these studies have been modest.

Incorporation of read-level features in a flexible framework is critical for distinguishing real mutations from artifacts16,17. In place of filters with hard thresholds, recent methods such as DeepVariant16 and Strelka2 (ref. 17) use machine learning to combine relevant read-level features to improve detection of germline and cancer somatic variants, respectively. Another component in accurate detection of mosaic SNVs in silico is read-based phasing3,7,8,18, in which a candidate mosaic mutation and a nearby germline variant are checked for haplotype consistency; that is, a true mosaic mutation should generate one and only one additional haplotype. A major disadvantage of phasing, however, is that only a small fraction (~10–30%) of variants are phasable using short-read sequencing18, and phasing may be ambiguous in nondiploid or low-mappability regions6.

We developed MosaicForecast, which leverages multiple read-level features over phasable sites to build a genome-wide prediction model for finding mosaic mutations in the absence of a matched control sample. It consists of three major steps (Fig. 1a): (1) generation of a training set by read-based phasing; (2) construction of a random forest (RF) model based on read-level features related to the quality and category of variants, such as VAF, read depth, mismatches per read and strand bias (Fig. 1b and Supplementary Table 1); and (3) genome-wide prediction of mosaic SNVs. The underlying idea is similar to that of DeepVariant16 and Strelka2 (ref. 17) in that a nonlinear model that combines informative read-level features is trained using a machine-learning framework and then applied to a test set. The main difference is that, to overcome the problem that high-quality training data are not available, MosaicForecast uses phasable sites in building a training set. We introduce another modeling step using multinomial logistic regression to improve the training set when some experimental validation data are available (Fig. 1c). As an illustrative example, we applied the tool to analyze whole-genome sequencing (WGS) data from brain tissues of 60 autism spectrum disorder and 15 neurotypical individuals, sequenced at ~250× (150-base pair (bp), paired-end reads).

To assemble a training set of high-confidence mosaic mutations, we first identify a lenient set of candidate mosaic variants. We used MuTect2 in its tumor-only mode for its high sensitivity, but other algorithms can be used (see Methods and Supplementary Fig. 1). To remove germline variants and recurrent artifacts, we filtered variants present in the Genome Aggregation Database (gnomAD)19. In addition, since the likelihood that somatic mutations occur at the same position in different individuals is vanishingly small, we also removed variants found in any other samples in the dataset

1Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA. 2Division of Genetics and Genomics, Manton Center for Orphan Disease, and Howard Hughes Medical Institute, Boston Children’s Hospital, Boston, MA, USA. 3Departments of Neurology and Pediatrics, Harvard Medical School, Boston, MA, USA. 4Broad Institute of MIT and Harvard, Cambridge, MA, USA. 5Harvard/MIT MD–PhD Program, Harvard Medical School, Boston, MA, USA. 6Bioinformatics and Integrative Genomics PhD program, Harvard Medical School, Boston, MA, USA. 7Ludwig Center at Harvard, Boston, MA, USA. 8Present address: European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK. *e-mail: [email protected]

NATURE BIOTECHNOLOGY | www.nature.com/naturebiotechnology


[Figure 1. Panel a: workflow schematic (sequencing reads and initial reference-free variant calls; read-based phasing of candidate sites into hap = 2, hap = 3 and hap > 3; phasing prediction model trained on 31 read-level features; genotype refinement by multinomial logistic regression using orthogonal validation from amplicon, single-cell and trio data; refined genotypes prediction model; genome-wide prediction of mosaics). Panel b: feature importance of the RF model (31 features in total), with examples of strand bias, mismatches per read, read1/read2 bias and read map positions. Panel c: validation of phasable sites by phasing category and after genotype refinement (het, mosaic, repeat/CNV/refhom). Panel d: validation of nonphasable sites in nonrepeat regions (MuTect2 8.9%, MosaicForecast-Phase 76.4%, MosaicForecast-Refine 85.0%) and repeat regions (MuTect2 1%, MosaicForecast-Phase 50.0%, MosaicForecast-Refine 77.1%).]

Fig. 1 | Framework of MosaicForecast to detect mosaic SNVs from bulk sequencing data. a, Candidate mosaics were classified as hap = 2, hap = 3 or hap > 3 by read-based phasing, and an RF model was trained to predict the phasing by using over 30 read-level features as covariates. The model was then applied to nonphasable sites to predict their genotypes. Given a list of experimentally evaluated sites, the model could be further improved by an additional genotype-refinement step. b, The relative importance of the features from the RF model for the brain WGS data, with four examples of read-level features. c, In total, 483 phasable sites were orthogonally evaluated by single-cell, trio and targeted sequencing data. After genotype refinement, the phasable sites classified as hap = 2, hap = 3 and hap > 3 were converted to het, mosaic, repeat/CNV and refhom for training. d, We applied MosaicForecast to nonphasable MuTect2 candidate mosaics and evaluated them in single-cell, trio and targeted sequencing data. In nonrepeat regions, the precision increased from 8.9% (MuTect2) to 76% for the phasing prediction model and 85% for the refined genotypes prediction model; in RepeatMasker regions, it increased from 1% (MuTect2) to 50% in the phasing prediction model and 77% in the refined genotypes prediction model. The unit k stands for 1,000.
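
The haplotype-count classification described in panel a (hap = 2, hap = 3, hap > 3) can be sketched as follows. This is a minimal illustration rather than the MosaicForecast implementation: the function name `classify_by_haplotypes`, the tuple representation of reads and the `min_reads` support threshold are all assumptions made for the example.

```python
from collections import Counter


def classify_by_haplotypes(read_alleles, min_reads=2):
    """Classify a candidate site by the number of haplotypes observed.

    read_alleles: iterable of (germline_snp_allele, candidate_allele) tuples,
        one per read or read pair covering both positions (hypothetical input
        format chosen for this sketch).
    min_reads: minimum supporting reads for a haplotype to be counted
        (hypothetical threshold, not taken from the paper).
    """
    counts = Counter(read_alleles)
    supported = [hap for hap, n in counts.items() if n >= min_reads]
    if len(supported) < 2:
        return "uninformative"  # too few haplotypes to phase the site
    if len(supported) == 2:
        return "hap=2"  # consistent with a heterozygous germline variant
    if len(supported) == 3:
        return "hap=3"  # consistent with a mosaic variant
    return "hap>3"  # suggests low mappability, a CNV or another artifact


# Toy reads: the germline SNP has alleles A/G, the candidate has ref C / alt T.
germline_like = [("A", "T")] * 12 + [("G", "C")] * 11
mosaic_like = [("A", "C")] * 10 + [("A", "T")] * 3 + [("G", "C")] * 12
artifact_like = mosaic_like + [("G", "T")] * 4

for name, reads in [("germline-like", germline_like),
                    ("mosaic-like", mosaic_like),
                    ("artifact-like", artifact_like)]:
    print(name, classify_by_haplotypes(reads))
```

In the mosaic-like example the alternate allele appears on only a subset of the reads carrying germline allele A, which adds exactly one haplotype to the two germline ones.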

(75 minus the one being analyzed). We observed that removing recurrent variants did not result in loss of sensitivity (Supplementary Fig. 2). For some experimental designs, for example, comparing multiple tissues from the same individual, recurrent variants may be true mosaics; thus, a filtering scheme with an appropriate panel of ‘normals’ should be chosen to remove germline variants as well as

[Figure 2. Panel a: number of candidate mosaics called by MosaicForecast-Refine, MosaicForecast-Phase, MosaicHunter, MuTect2, GATK-HC-P2 and GATK-HC-P5, with validation status (het, mosaic, refhom, repeat) from single-cell and IonTorrent data. Panel b: precision versus recall for each method in nonrepeat and repeat regions, for individuals 1465 (26 cells), 4638 (10 cells) and 4643 (10 cells).]
Fig. 2 | Comparison among algorithms. a, Candidate mosaics (both phasable and nonphasable) in the three individuals with single-cell data were
evaluated (see Methods). b, Precision and recall are plotted separately for the nonrepeat and repeat regions (as defined by RepeatMasker) and for each
individual. The number inside each symbol corresponds to the number of validated mosaics.
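
Precision in panel b is the fraction of called mosaics that were validated. As a small worked check, the sketch below recomputes the per-individual precision of the MosaicForecast-Refine model from the validated and called counts reported in the text for Fig. 2b (24 of 26, 25 of 31 and 24 of 33). The helper names are illustrative. Recall is defined for completeness but not evaluated, because the total number of true mosaics per individual is not stated here.

```python
def precision(validated, called):
    """Fraction of called variants that were confirmed as true mosaics."""
    if called <= 0:
        raise ValueError("called must be positive")
    return validated / called


def recall(validated, total_true):
    """Fraction of all true mosaics that were called."""
    if total_true <= 0:
        raise ValueError("total_true must be positive")
    return validated / total_true


# (validated, called) for the three individuals with single-cell data,
# MosaicForecast-Refine model, as reported in the text for Fig. 2b.
refine_counts = [(24, 26), (25, 31), (24, 33)]

for validated, called in refine_counts:
    print(f"{validated} of {called}: {precision(validated, called):.0%}")
# prints:
# 24 of 26: 92%
# 25 of 31: 81%
# 24 of 33: 73%
```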

minimizing the risk of including artifacts that arise due to misalignment or index hopping20.

We then classified the phasable variants (those for which a germline SNP is contained in the same read or its mate pair) into three categories depending on the number of observed haplotypes (hap): (1) hap = 2, consistent with heterozygous germline variants; (2) hap = 3, consistent with mosaics; and (3) hap > 3, suggestive of low-mappability regions, presence of copy number variations (CNVs) or sequencing-associated/other artifacts (Supplementary Fig. 3). For our brain data, ~25% of candidate mosaics were phasable with at least one germline SNP (Supplementary Table 2).

To determine whether the true genotypes can be inferred from the haplotype categories, we evaluated 483 phasable sites in selected samples for which three additional data types are available: single-cell WGS, amplicon-based targeted sequencing and trio WGS (see Methods and Supplementary Tables 2 and 3). The single-cell WGS dataset of three individuals we have published previously5,10 provides an excellent resource for orthogonal validation, as the lineage information as well as allele fraction across cells allow us to distinguish mosaics from heterozygous SNPs, germline repeat/CNV variants and technical artifacts (see Methods and Supplementary Fig. 4); trio data (for two individuals) are useful for distinguishing mosaic and germline variants (Supplementary Table 3). This analysis (Fig. 1c) revealed that although the ‘hap = 3’ category was enriched for true mosaic mutations (50%), the rest of the hap = 3 sites turned out to be false positives, classified as repeat/CNV regions (37%), germline heterozygous (6%) and reference-homozygous (6%). Variants labeled as ‘hap = 2’ were mostly germline heterozygous, and variants labeled as ‘hap > 3’ were mostly false positives as expected.

To identify mosaic variants, we first built an RF model using over 30 read-level features as predictors and the haplotype number (hap = 2, hap = 3, hap > 3) as the response, at all phasable sites on diploid chromosomes (Supplementary Table 4). Then we applied this ‘phasing prediction model’ genome-wide, excluding nonunique mapping regions21 (see Methods). This model resulted in modest validation rates for hap = 3 sites, with 67% (55 of 82) in nonrepeat regions and 34% (28 of 82) within repeat regions (Supplementary Fig. 5 and Supplementary Table 5; we define ‘repeat region’ to include interspersed repeats and low-complexity sequences identified by RepeatMasker22). However, we noticed that many variants are clustered, in mostly repeat or nondiploid regions, when all individuals are considered together (~46% of those predicted to be hap ≥ 3 were enriched in regions that together span only ~19 Mb; see Methods and Supplementary Table 6). After removing these clustered variants, the overall validation rate for the phasing prediction model increased to 76% (55 of 72) in nonrepeat regions and 49% (28 of 57) in repeat regions (Fig. 1d and Supplementary Table 5). This constitutes a sevenfold increase (nonrepeat regions) and a 43-fold increase (repeat regions) in precision compared with the initial MuTect2 calls, with minimal loss of sensitivity.

With experimental validation data at phasable sites, we can further improve our prediction model. Because the haplotype number was only moderately correlated with the mosaic status (Fig. 1c, top), we reasoned that an intermediate model that defines the genotype more accurately using validation data could generate a better training set for the subsequent RF model. Visual inspection in the principal component space of the read-based features revealed that some hap = 3 variants clustered with variants that were found to be repeat/CNV or reference-homozygous, suggesting that read-level data can help refine the genotype predictions (Supplementary Fig. 6a–c). With a multinomial logistic regression model incorporating the read-level features (Fig. 1a), we converted the genotyping categories from haplotype counts to ‘het’, ‘mosaic’, ‘refhom’ and ‘repeat’. The refined categories were in much better agreement with the orthogonally evaluated calls: whereas only 50% of hap = 3 variants were validated mosaics, 85% of the ‘mosaic’ predictions from the regression model were validated mosaics (Fig. 1c). The resulting model was then applied to all phasable sites to generate their four-category genotype labels (Supplementary Table 4).

Using the phasable sites and their refined genotypes as a training set, we predicted mosaics genome-wide. We built an RF classification model (‘refined genotypes prediction model’) on all phasable sites with over 30 features as covariates and the refined genotypes as the response. We then applied the RF model to the 135,250

[Figure 3. Panel a: number of nonphasable mosaics evaluated with Ion Torrent at read depths of 250×, 200×, 150×, 100× and 50×, shown for PCR-free and PCR-based samples with validation status (het, mosaic, refhom, repeat). Panel b: sensitivity versus FDR at each read depth, with the 5% FDR marked. Panels c and d: validation rates of mosaic deletions and mosaic insertions at hap = 3 and nonphasable sites.]

Fig. 3 | Impact of read depth on sensitivity and detection of mosaic indels. a, At each coverage, a different RF model was trained on the phasable sites and predictions were made on nonphasable sites. Amplicon-sequencing data were used for validation. Although fewer true mosaics were identified at lower coverages, the sensitivity did not drop substantially (for example, at 50×, MosaicForecast was able to detect ~80% of real variants identified at 250×). b, Similar to a but using simulated data. The sensitivity was ~70% at 50×. c, >70% of mosaic deletions called by MosaicForecast were validated by IonTorrent; the hap = 3 sites and nonphasable sites had similar validation rates. d, Similar to c but for mosaic insertions. FDR, false discovery rate.
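
The lower coverages in panel a amount to keeping only a fraction of the reads at each site. The sketch below illustrates that idea on the read counts of a single site by keeping each read independently with probability target/original. It is a toy under stated assumptions, not the procedure used in the study: the function name, the per-read random sampling and the example counts (245 reference and 5 alternate reads) are hypothetical.

```python
import random


def downsample_counts(ref_reads, alt_reads, original_depth, target_depth, rng):
    """Randomly thin the reads at one site to mimic a lower sequencing depth.

    Each read is kept independently with probability
    target_depth / original_depth (an assumption of this sketch).
    """
    if not 0 < target_depth <= original_depth:
        raise ValueError("target_depth must be in (0, original_depth]")
    keep = target_depth / original_depth
    ref_kept = sum(1 for _ in range(ref_reads) if rng.random() < keep)
    alt_kept = sum(1 for _ in range(alt_reads) if rng.random() < keep)
    return ref_kept, alt_kept


rng = random.Random(0)  # fixed seed so the run is reproducible
for depth in (250, 200, 150, 100, 50):
    ref_kept, alt_kept = downsample_counts(245, 5, 250, depth, rng)
    total = ref_kept + alt_kept
    vaf = alt_kept / total if total else float("nan")
    print(f"{depth}x: {alt_kept} alt of {total} reads (VAF {vaf:.3f})")
```

For a low-VAF site such as this one, the handful of alternate reads can shrink to one or none at 50×, which is one intuition for why fewer mosaics are recovered at lower coverage.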

nonphasable candidate mutations (Supplementary Table 7). Sites within nonunique mapping regions21 (mappability score = 0) as well as sites within clustered regions (Supplementary Fig. 6d) were excluded. Among the 2,220 predicted (nonphasable) mosaics, 95 randomly selected sites were evaluated using orthogonal data (same validation method as for phasable sites). As shown in Fig. 2a, 78 (82%) were confirmed as true mosaics (85% in nonrepeat regions and 77% in repeat regions). Top-ranked features of the RF model are listed in Supplementary Fig. 6e.

We compared the performance of MosaicForecast with that of GATK HaplotypeCaller (GATK-HC)23, MuTect2 (ref. 12) and MosaicHunter24 using three different approaches. First, we inspected the variants called by all methods in three 250× WGS brain samples for which single-cell WGS data are available5,10 (both phasable and nonphasable sites; leave-one-out cross-validation for three individuals). Although the lineage information in single cells provides a very useful way to benchmark algorithm performance, one limitation of this approach is that variants with low allele fraction have a proportionally low chance of being sampled if the number of cells is small; thus, we used deep sequencing on the IonTorrent platform to further examine those variants identified as refhom by single-cell data (see Methods, Supplementary Fig. 7 and Supplementary Table 8). The results show that MosaicForecast-Phase and -Refine models achieve precision that is typically several-fold higher than other tools, while maintaining high sensitivity (Fig. 2a). GATK-HC with ploidy two (GATK-HC-p2) frequently misclassified heterozygous SNPs as mosaics; MuTect2 and GATK-HC with ploidy five (GATK-HC-p5) most often misclassify repeat/CNV variants; and MosaicHunter could only detect variants within nonrepeat regions, thus losing ~50% of true mosaics. At the individual level (Fig. 2b), the precision was ~92% (24 of 26), ~81% (25 of 31) and ~73% (24 of 33) for the MosaicForecast-Refine model, suggesting a consistently high validation rate. MosaicForecast was also able to detect more low-allele fraction variants with VAF ≤ 0.05 (30 of 41) than MosaicHunter (14), GATK-HC-p5 (4) and GATK-HC-p2 (0) (Supplementary Fig. 8a).

As a second mode of validation, we evaluated candidate mosaics called by MosaicForecast-Refine in the 75 individuals using amplicon-based sequencing (~30,000× on IonTorrent). Of the 75, the IonTorrent validation rate (Supplementary Table 9) was ~94% (161 of 171) for the 64 samples that were sequenced using PCR-free libraries (~95%, 149 of 157, for diploid and ~86%, 12 of 14, for haploid chromosomes). For the remaining 11 sequenced using PCR-based libraries, the validation rate was ~61% (42 of 68; 68%, 42 of 62, on diploid and 0 of 6 for haploid chromosomes).

The lower validation rate for the PCR-based samples is likely due to the PCR-induced biases, as reflected in a significantly higher proportion of G>T mutations (odds ratio = 4.1, P < 1 × 10^−15, Fisher’s exact test; Supplementary Fig. 8b), which are associated with oxidative damage25. If we focus on nonphasable sites from diploid chromosomes (Fig. 3a), validation rates were ~95% (105 of 111) and ~67% (30 of 45) for PCR-free and PCR-based samples, respectively. In addition, the validation rates were similar in nonrepeat regions (87%, 118 of 136) and repeat regions (85%, 17 of 20). Among the 177 MosaicForecast mosaics in nonrepeat regions confirmed by IonTorrent, GATK-HC-p5 was only able to detect ~62% (109 of 177), followed by MosaicHunter (~59%, 105 of 177) and GATK-HC-p2 (~20%, 35 of 177). Among the 26 MosaicForecast mosaics in repeat regions confirmed by IonTorrent, GATK-HC-p5 and GATK-HC-p2 were only able to detect ~58% (15 of 26) and ~19% (5 of 26), respectively (MosaicHunter does not make calls in repeat regions). A large fraction of low-VAF (≤0.05) mosaics were called by MosaicForecast but missed by both MosaicHunter and GATK-HC (~52%, 48 of 92; Supplementary Fig. 8c), indicating that MosaicForecast is particularly advantageous for detecting low-VAF mutations.

Third, we tested the haplotype numbers for the extra variants identified by the other callers. Across the 75 individuals, the other methods called 1–80 times more mutations than MosaicForecast, but read-based phasing showed that a large proportion of phasable sites from these tools had two haplotypes or more than three haplotypes, inconsistent with mosaic variants (Figs. 1c and 2a). For example, the percentages of hap > 3 variants were 58%, 49% and 26% for MuTect2, GATK-HC-p5 and GATK-HC-p2, respectively; another 51% of GATK-HC-p2 were hap = 2. These numbers indicate that the false positive rates are indeed very high for other methods (Supplementary Fig. 9).

To determine the performance of our model across VAFs, we applied the model to a simulated dataset containing spike-in mutations (see Methods). We found that the model had similarly good performance over a relatively wide range of VAFs, from 0.02 to 0.3, as reflected in the receiver operating characteristic (ROC) curves (Supplementary Fig. 10a,b). It performed substantially worse when the VAFs approached 0.5, as it becomes impossible to separate somatic variants from germline variants; in that case, a case-control scheme would be a better choice.
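
One way to see why VAFs near 0.5 are hard to separate from germline heterozygous variants is to compare alt-read counts under a simple binomial sampling model. The sketch below is only an illustration under that assumption. It is not the model used by MosaicForecast, and the function names, the 95% interval and the depth of 250 are choices made for the example. It computes the probability that a variant with a given true VAF produces an alt-read count inside the central interval expected for a heterozygous site.

```python
from math import comb


def binom_pmf(k, n, p):
    """Binomial probability of k alt reads out of n at allele fraction p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def het_interval(depth, coverage=0.95):
    """Smallest symmetric range of alt-read counts around depth/2 holding at
    least `coverage` probability for a heterozygous site (p = 0.5)."""
    lo, hi = depth // 2, (depth + 1) // 2
    mass = sum(binom_pmf(k, depth, 0.5) for k in range(lo, hi + 1))
    while mass < coverage and lo > 0:
        lo -= 1
        hi += 1
        mass += binom_pmf(lo, depth, 0.5) + binom_pmf(hi, depth, 0.5)
    return lo, hi


def prob_looks_het(vaf, depth, coverage=0.95):
    """Probability that a variant with true allele fraction `vaf` yields an
    alt-read count inside the heterozygous range at this depth."""
    lo, hi = het_interval(depth, coverage)
    return sum(binom_pmf(k, depth, vaf) for k in range(lo, hi + 1))


for vaf in (0.02, 0.1, 0.3, 0.45, 0.5):
    p = prob_looks_het(vaf, 250)
    print(f"VAF {vaf:.2f}: P(count in het range) = {p:.3g}")
```

Under this toy model, counts from variants with VAF at or below 0.3 almost never fall in the heterozygous range at 250×, whereas a VAF of 0.45 overlaps it substantially, which is consistent with the loss of performance described above.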

To examine the detection power of MosaicForecast as a function of read depth, we simulated lower coverage data by down-sampling from the original 250× brain WGS data, and trained one RF model at each read depth using only the features extracted from the phasable sites in the corresponding down-sampled data. Although power decreases with coverage as expected, MosaicForecast was still able to detect ~80% (108 of 135) of the validated 250× variants at 50× (Fig. 3a and Supplementary Table 10). We also applied the brain WGS-trained models to the HapMap sample NA12878 to determine whether a model trained in one dataset could be applied to another dataset. For testing, we generated simulated mutations in the 300× WGS data for NA12878 (ref. 26) following a realistic allele fraction distribution of early embryonic mosaics and down-sampled to 50–250× (see Methods and Supplementary Fig. 10c,d). MosaicForecast was sensitive in detecting simulated mosaics and effective in removing nonmosaic sites across read depths: at a 5% false discovery rate, it detected ~95% of the spike-in mutations at 250×. When the training and simulation were performed at lower depths, ~90% of the spiked-in mutations were detected at 100× and ~70% at 50× (Fig. 3b). We also found that the models were robust across different read depths (see Methods and Supplementary Fig. 11).

Although MosaicForecast used variants derived from MuTect2 as an initial set, it could also start with variants identified by other tools. Validation using single-cell and IonTorrent data (see Methods and Supplementary Table 11) shows that MosaicForecast (trained on MuTect2 calls) substantially raises the specificity of mosaic mutations with a minor loss in sensitivity, from 39% (38 of 98) to 90% (27 of 30) for GATK-HC-p2, from 7% (54 of 742) to 84% (42 of 50) for GATK-HC-p5 and from 47% (34 of 73) to 61% (31 of 51) for MosaicHunter (Supplementary Fig. 12). For maximal sensitivity, we could generate an input set simply by using SAMtools mpileup scanning, for example, by taking all sites with ≥2% nonreference bases as potential mutations. Using single-cell data to evaluate the sites called only by SAMtools, only a tiny fraction (1.8%, 104 of 5,876) of a large mutation set was validated. With MosaicForecast, we achieved a striking improvement in the validation rate (45.9%, 78 of 170; Supplementary Fig. 12). Compared with the MuTect2 validation result (73 of 89; Fig. 2a), only a few more true variants were captured at the expense of many false variants. We note that the single-cell-based validation strategy becomes less accurate as the VAF decreases, so the specificity and sensitivity for mpileup-based variants is likely to be an underestimate.

In addition to SNVs, MosaicForecast is capable of identifying mosaic indels (see Methods). Using the MuTect2-based approach as before, we obtained 59,977 candidates (22,893 deletions, 37,084 insertions) from the nonrepeat regions of the 75 individuals. For mosaic deletions, an RF model trained using all 831 phasable sites (Supplementary Table 12 and Supplementary Fig. 13a–c) predicted 1,356 sites as hap = 3. With additional filtering criteria (see Methods), 102 sites were classified as confident mosaic deletions. All of the high-confidence mosaic deletions were from PCR-free samples. When evaluated using IonTorrent sequencing (see Methods and Supplementary Fig. 13d), ~75% (59 of 79) were validated as true deletions, with phasable (79%, 11 of 14) and nonphasable (~74%, 48 of 65) sites having similar validation rates (Fig. 3c and Supplementary Table 13). Two sites in the three individuals with single-cell data (and at least one mutant cell) were both confirmed with lineage information (Supplementary Fig. 13e,f). Following the same approach (Supplementary Fig. 14), 134 confident mosaic insertions were found, and ~59% (32 of 54) from PCR-free samples were validated (Fig. 3d and Supplementary Table 14). None of the mosaic insertions from PCR-based samples were validated (Supplementary Table 14). The excessive error rates of mosaic indels in PCR-based samples are likely to be caused by replication slippage of DNA polymerase27. In the process of detecting mosaic SNVs and indels, we also identified several mosaic multi-nucleotide variants (for example, two adjacent base substitutions) as well as clumped variants (for example, an SNV and a nearby insertion). We called three such events in the three individuals with single-cell sequencing data, and two of the events were found in at least one mutant cell, suggestive of true mosaics (Supplementary Fig. 15).

In summary, MosaicForecast substantially improves the detection of mosaic SNVs and indels from reference-free sequencing data, confirming that proper incorporation of various read-level features in a nonlinear classifier provides an effective way to distinguish real mosaic mutations from germline variants (especially from CNV/repeat regions) and other artifacts. The strong performance of MosaicForecast is made possible by training predictive models on phasable sites, a method for constructing a highly confident training set of mosaic variants in silico, without having to carry out labor-intensive experimental validations. Identification of mosaic mutations in various nontumor tissues by the proposed method will help gain insights into the origin and propagation of somatic mutations in development and disease.

Online content
Any methods, additional references, Nature Research reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/s41587-019-0368-8.

Received: 4 December 2018; Accepted: 23 November 2019;
Published: xx xx xxxx

References
1. Biesecker, L. G. & Spinner, N. B. A genomic view of mosaicism and human disease. Nat. Rev. Genet. 14, 307–320 (2013).
2. Bae, T. et al. Different mutational rates and mechanisms in human cells at pregastrulation and neurogenesis. Science 359, 550–555 (2018).
3. Ju, Y. S. et al. Somatic mutations reveal asymmetric cellular dynamics in the early human embryo. Nature 543, 714–718 (2017).
4. Ye, A. Y. et al. A model for postzygotic mosaicisms quantifies the allele fraction drift, mutation rate, and contribution to de novo mutations. Genome Res. 28, 943–951 (2018).
5. Lodato, M. A. et al. Somatic mutation in single human neurons tracks developmental and transcriptional history. Science 350, 94–98 (2015).
6. Dou, Y., Gold, H. D., Luquette, L. J. & Park, P. J. Detecting somatic mutations in normal cells. Trends Genet. 34, 545–557 (2018).
7. Dou, Y. et al. Postzygotic single-nucleotide mosaicisms contribute to the etiology of autism spectrum disorder and autistic traits and the origin of mutations. Hum. Mutat. 38, 1002–1013 (2017).
8. Freed, D. & Pevsner, J. The contribution of mosaic variants to autism spectrum disorder. PLoS Genet. 12, e1006245 (2016).
9. Krupp, D. R. et al. Exonic mosaic mutations contribute risk for autism spectrum disorder. Am. J. Hum. Genet. 101, 369–390 (2017).
10. Lodato, M. A. et al. Aging and neurodegeneration are associated with increased mutations in single human neurons. Science 359, 555–559 (2018).
11. Yang, X. et al. Genomic mosaicism in paternal sperm and multiple parental tissues in a Dravet syndrome cohort. Sci. Rep. 7, 15677 (2017).
12. Cibulskis, K. et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat. Biotechnol. 31, 213–219 (2013).
13. Alioto, T. S. et al. A comprehensive assessment of somatic mutation detection in cancer using whole-genome sequencing. Nat. Commun. 6, 10001 (2015).
14. Huang, A. Y. et al. Distinctive types of postzygotic single-nucleotide mosaicisms in healthy individuals revealed by genome-wide profiling of multiple organs. PLoS Genet. 14, e1007395 (2018).
15. Lim, E. T. et al. Rates, distribution and implications of postzygotic mosaic mutations in autism spectrum disorder. Nat. Neurosci. 20, 1217–1224 (2017).
16. Poplin, R. et al. A universal SNP and small-indel variant caller using deep neural networks. Nat. Biotechnol. 36, 983–987 (2018).
17. Kim, S. et al. Strelka2: fast and accurate calling of germline and somatic variants. Nat. Methods 15, 591–594 (2018).
18. Bohrson, C. L. et al. Linked-read analysis identifies mutations in single-cell DNA-sequencing data. Nat. Genet. 51, 749–754 (2019).
19. Karczewski, K. J. et al. Variation across 141,456 human exomes and genomes reveals the spectrum of loss-of-function intolerance across human protein-coding genes. Preprint at bioRxiv https://doi.org/10.1101/531210 (2019).

NATURE BIOTECHNOLOGY | www.nature.com/naturebiotechnology


20. Costello, M. et al. Characterization and remediation of sample index swaps by non-redundant dual indexing on massively parallel sequencing platforms. BMC Genomics 19, 332 (2018).
21. Karimzadeh, M., Ernst, C., Kundaje, A. & Hoffman, M. M. Umap and Bismap: quantifying genome and methylome mappability. Nucleic Acids Res. 46, e120 (2018).
22. Smit, A., Hubley, R. & Green, P. RepeatMasker Open-4.0 (2013–2015).
23. Poplin, R. et al. Scaling accurate genetic variant discovery to tens of thousands of samples. Preprint at bioRxiv https://doi.org/10.1101/201178 (2018).
24. Huang, A. Y. et al. MosaicHunter: accurate detection of postzygotic single-nucleotide mosaicism through next-generation sequencing of unpaired, trio, and paired samples. Nucleic Acids Res. 45, e76 (2017).
25. Chen, L., Liu, P., Evans, T. C. Jr. & Ettwiller, L. M. DNA damage is a pervasive cause of sequencing errors, directly confounding variant identification. Science 355, 752–756 (2017).
26. Zook, J. M. et al. Extensive sequencing of seven human genomes to characterize benchmark reference materials. Sci. Data 3, 160025 (2016).
27. McInerney, P., Adams, P. & Hadi, M. Z. Error rate comparison during polymerase chain reaction by DNA polymerase. Mol. Biol. Int. 2014, 287430 (2014).

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© The Author(s), under exclusive licence to Springer Nature America, Inc. 2020



Methods

Datasets. WGS data from the prefrontal cortex of 75 individuals (~250×) and from the blood of two pairs of parents (~50×) were obtained by our group. Single-cell WGS data from the neurons of three individuals were previously obtained (refs. 5,10). All data were 150-bp, paired-end reads. FASTQ files and high-confidence SNV calls (ref. 28) for NA12878 (~300×, 150 bp, paired-end) generated by the Genome in a Bottle Consortium (https://ftp-trace.ncbi.nlm.nih.gov/giab/) were down-sampled to different read depths. WGS data for 30 tumor tissues (~80×, 100 bp, paired-end) and their matched normal samples (~40×) as well as consensus variant calls were obtained from the International Cancer Genome Consortium (ref. 13). Reads were aligned to GRCh37 with decoy (hs37d5) using Burrows-Wheeler Aligner (BWA; ref. 29).

Variant calling. Five methods were used to call mosaic SNVs in the brain bulk data: MuTect2 (v.3.5 nightly-r2016-04-25-g7a7b7cd, variants tagged as ‘str_contraction’, ‘triallelic_site’ or ‘t_lod_fstar’ were excluded); GATK-HC (v.3.5-0), with ploidy set to two (GATK-HC-p2) or five (GATK-HC-p5), keeping only variants tagged as ‘PASS’; MosaicHunter (v.1.1), with the minimum VAF threshold adjusted from the default 5 to 2% to enable detection of mosaics with lower VAFs; and SAMtools (v.1.9; ref. 30) mpileup with sites with ≥2% nonreference bases (alt allele count ≥ 3, with mapping quality ≥ 20 and base quality ≥ 20) called as putative mutations. Variants within segmental duplication regions according to the UCSC Genome Browser database (ref. 31) were removed. Variants at <0.02 VAF calculated by each tool were excluded. Variants with VAF ≥ 0.4 or present in gnomAD (ref. 19) were removed as likely germline variants. Variants called in multiple individuals by each method were excluded. For indels, variants present in the relevant ‘panel of normals’ (PoN) or gnomAD as well as those present in repeat regions, including simple repeats (ref. 32), RepeatMasker regions and segmental duplication regions, were excluded. Variants with VAF < 0.02 or ≥0.4 as calculated by MuTect2 were removed. Outliers with >20 deletions per sample were excluded, and variants with >350× read depth, variants with ultra-low VAF calculated by MosaicForecast (<0.01), variants with a high proportion (≥10%) of clipped reference reads, variants within clustered regions and variants present in gnomAD were removed. Further information is available in the Nature Research Reporting Summary.

Variants generated by MuTect2 with a PoN were used as input variants to MosaicForecast, due to the high sensitivity of MuTect2. For brain data, one individual (UMB5308) likely to have contamination problems (an excess number of low-VAF mutations was obtained) was excluded. Sites with extra-high read depths (≥2 fold) and sites with ≥1.5 fold read depths and ≥20% VAF were marked as ‘low-confidence’ and excluded. We evaluated the sensitivities of different tools in two ways: first, we generated spike-in mutations at allele fractions of 0.01, 0.02, 0.03, 0.05, 0.1, 0.2, 0.3, 0.4 and 0.5 in the BAM files of NA12878 subsampled at 50–250×; called variants from the BAM files with MuTect2, MosaicHunter, GATK-HC-p2 and GATK-HC-p5; and compared their sensitivities in detecting mosaics at different allele fractions and read depths. MuTect2 achieved the highest sensitivity in all circumstances (Supplementary Fig. 1a). Second, we also called candidate mosaics from real bulk sequencing data (250×) from three individuals with multiple single cells with the four different tools, and evaluated the variants using the single-cell data. MuTect2 was able to detect >97% (98 of 101) of real mosaics called by different tools, whereas MosaicHunter, GATK-HC-p2 and GATK-HC-p5 were only able to detect 34, 38 and 54% of real mosaics, respectively (Supplementary Fig. 1b).

Orthogonal validation of variants. SNV candidates were evaluated by single-cell sequencing, trio sequencing or ultra-deep amplicon resequencing (Supplementary Table 3). To evaluate variants using single-cell data in the three individuals (1465, 4643 and 4638; refs. 5,10), we constructed lineage trees with single-cell mutations assigned to different clades (Supplementary Table 8 and Supplementary Fig. 4). Mutations that were only present in the cells assigned to the same clade were regarded as real mosaic mutations, mutations absent from all cells were regarded as refhom and mutations present in multiple cells and assigned to conflicting clades were regarded as germline variants (het or repeat). To further classify germline variants (repeat or het), we used an empirical threshold: if mutant cells on average had ≥20% VAF, we classified them as het variants; and if mutant cells on average had <20% VAF, we classified them as repeat variants (Supplementary Fig. 4e). We further checked all het variants, and, if a variant had ultra-high read depth in the bulk data (>300×) and the bulk allele fraction deviated significantly from 50% (P < 0.001, two-tailed binomial test), we re-classified it as a repeat variant. Moreover, ‘linked’ variants close to each other with all alt alleles on the same reads were also classified as repeat variants.

To address the concern that some variants judged as refhom by the limited number of single cells could be real mosaic variants, we experimentally evaluated those refhom sites by using IonTorrent deep sequencing. But, to pick an informative set for IonTorrent validation, we first categorized the sites into different groups based on the caller used and checked the read alignments using Integrative Genomics Viewer (IGV). By extensively checking IGV plots (with some examples in Supplementary Fig. 7b), we found that: (1) candidate sites called by ≥2 tools were more likely to be true mosaics than those sites called by only one tool, and (2) candidate sites called by MosaicForecast and MosaicHunter were substantially more convincing, whereas sites called by GATK-HC-p5 and MuTect2 were much less so. Almost all of the candidate mosaics from the latter set came from regions with many mismatches and were unlikely to be true mosaics. Based on this observation, instead of experimentally evaluating all sites judged as refhom by single cells, we only selectively evaluated sites called by ≥2 tools. We should note that since we did not have more of the same DNA extraction for the three individuals (1465, 4643, 4638) sent out for WGS sequencing, we could only settle for the second best by using extracted DNA from nearby tissues for targeted sequencing. As a result, the validation rate could be slightly under-estimated overall. But the situation is the same for all callers.

For trio data, we considered the variants detected in the child and (1) absent from both parents as real mosaics, (2) present in either parent with <20% VAF as CNV/repeat and (3) present in either parent with ~50% VAF as heterozygous. Two individuals (UMB5939 and UMB5771) had bulk WGS data from both parents (~50×), and their variant calls were evaluated with trio data.

As for IonTorrent, with additional testing of candidates in the 75 individuals, we evaluated a total of 242 candidate mosaic SNPs called by MosaicForecast (Supplementary Table 9). To compare the performance of different tools, 115 variants from different tools were evaluated by ultra-deep targeted sequencing (Supplementary Table 3). These included 54 with WGS VAF ≤ 0.05, 24 with WGS VAF ∈ (0.05, 0.2) and 37 with VAF > 0.2. Candidate mosaic indels were also evaluated with IonTorrent targeted sequencing. For each site, two or three different pairs of primers were designed and three sets of PCR products were generated. In addition, an experimental control was adopted as a comparison with the case samples. Given that IonTorrent sequencing has a high rate of indel errors (ref. 33), only variants present in the case and absent from control, or present in case samples with much higher allele fractions than in controls, were regarded as true mosaic indels (Supplementary Fig. 13d). For mosaic insertions specifically, since the hap = 3 and hap > 3 sites were not well distinguished in the principal component analysis (PCA) space (Supplementary Fig. 14), we expected a higher false positive rate, and checked the read alignments using IGV as an additional filter before IonTorrent evaluation (Supplementary Table 14). Sites from regions with excessive mismatches, sites with the mutant alleles completely linked with nearby low-AF variants in the reads and sites with misalignment issues were classified as repeat or het variants. These false positive sites from IGV plots were included to calculate the validation rate of mosaic insertions in Fig. 3d.

Read-based phasing. To identify germline SNPs near the SNVs detected by MuTect2, we scanned reads mapped up to 1 kbp away from each candidate site. After excluding reads with low mapping qualities (<20) or with low base quality at the mutant position (<20), a two-tailed binomial test was applied to remove variants whose VAFs deviated from 0.5 (P ≤ 0.05). Variants with relatively low read depth (<20×) were also filtered.

After obtaining a set of high-confidence SNPs, we first computed the haplotype numbers along the genome using consecutive pairs of germline SNPs to determine whether a region was nondiploid; if so, candidate mosaics in the region were excluded as false positives (Supplementary Fig. 3a). Next, each candidate mosaic was phased with as many nearby germline SNPs as possible and classified as follows: (1) those leading to three haplotypes were treated as potential mosaics (Fig. 1a and Supplementary Fig. 3b); (2) those leading to two haplotypes were treated as heterozygous mutations (Supplementary Fig. 3c); and (3) those leading to more than three haplotypes were treated as false positives, as mosaic mutations arising in a diploid organism can only define three haplotypes (Fig. 1a and Supplementary Fig. 3d).

Read-level features. The 31 features are described in Supplementary Table 1. Two of the features, ‘mapq_p’ and ‘mapq_difference’, encode mapping qualities; three account for the number of mismatches per read (‘major_mismatches_mean’, ‘minor_mismatches_mean’, ‘mismatches_p’); and six are calculated using base qualities (‘baseq_p’, ‘baseq_t’, ‘ref_baseq1b_p’, ‘ref_baseq1b_t’, ‘alt_baseq1b_p’, ‘alt_baseq1b_t’). The remaining features are read depth, VAF, genotyping likelihoods, strand bias, biases of the read pairs towards the ref/alt alleles, bias of the sequencing cycle towards the ref/alt alleles, read mapping position bias, bias of the base query position of the ref/alt alleles, local mappability score (ref. 21), proportion of clipped reads, multiallelic examination, GC content and the three-nucleotide base context of the mutation.

Although most of these features were calculated by comparing positions or qualities of reference alleles/reads with alternative alleles/reads, we also compared the qualities of alleles at the mutant position and at 1-bp downstream of the mutant position (‘ref_baseq1b_p’, ‘ref_baseq1b_t’, ‘alt_baseq1b_p’, ‘alt_baseq1b_t’), since systematic sequencing errors have been reported to have lower base quality at the mutant position (ref. 34). We also estimated genotype likelihoods of four different genotypes (refhom, het, althom, mosaic) based on Bernoulli sampling (refs. 24,35) to capture sequencing errors and ref/alt allele read-depth biases, assuming that the real mutant allele fractions are 0 (refhom), 0.5 (het), 1 (althom) and uniformly distributed between 0 and 1 (mosaic). The formulas to calculate the four genotypes are as follows:

$$L(G = \mathrm{het} \mid \mathrm{Data}) = \binom{\mathrm{depth}}{r}\, 0.5^{\mathrm{depth}}$$
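
As a rough numerical illustration of the four genotype likelihoods in this section (refhom, het, althom and mosaic), the sketch below evaluates them in log space for a list of reads. It is not the MosaicForecast implementation. The per-read error model (a symmetric error probability of 10^(−q/10)) is an assumption, since the text does not spell out how P(r_i = alt | q_i, o_i) is computed, and the mosaic term is taken from the integral form with the binomial coefficient kept inside the integral. All function names are hypothetical.

```python
import math

def read_allele_probs(base_quality, observed_is_alt):
    """Return (P(alt), P(ref)) for one read under an assumed symmetric error model."""
    # Phred-scaled error probability, clamped so that the logarithms stay finite.
    err = min(max(10 ** (-base_quality / 10), 1e-12), 0.75)
    return (1 - err, err) if observed_is_alt else (err, 1 - err)

def log_binom(n, k):
    """Log binomial coefficient, extended to non-integer k via the gamma function."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def genotype_log_likelihoods(reads):
    """reads: list of (base_quality, observed_is_alt) at the candidate position."""
    probs = [read_allele_probs(q, is_alt) for q, is_alt in reads]
    depth = len(probs)
    r = sum(p_alt for p_alt, _ in probs)  # expected number of alt reads
    log_c = log_binom(depth, r)
    # Log of the beta function B(r + 1, depth - r + 1).
    log_beta = (math.lgamma(r + 1) + math.lgamma(depth - r + 1)
                - math.lgamma(depth + 2))
    return {
        "refhom": log_c + sum(math.log(p_ref) for _, p_ref in probs),
        "het": log_c + depth * math.log(0.5),
        "althom": log_c + sum(math.log(p_alt) for p_alt, _ in probs),
        "mosaic": log_c + log_beta,
    }

def most_likely_genotype(reads):
    """Name of the genotype with the highest log-likelihood."""
    ll = genotype_log_likelihoods(reads)
    return max(ll, key=ll.get)
```

For example, `most_likely_genotype([(30, True)] * 2 + [(30, False)] * 38)` favours the mosaic genotype under these assumptions, whereas a balanced 10/10 split of ref and alt reads favours het.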



$$L(G = \mathrm{refhom} \mid \mathrm{Data}) = \binom{\mathrm{depth}}{r} \prod_{i=1}^{\mathrm{depth}} P(r_i = \mathrm{ref} \mid q_i, o_i)$$

$$L(G = \mathrm{althom} \mid \mathrm{Data}) = \binom{\mathrm{depth}}{r} \prod_{i=1}^{\mathrm{depth}} P(r_i = \mathrm{alt} \mid q_i, o_i)$$

$$L(G = \mathrm{mosaic} \mid \mathrm{Data}) = \int_0^1 \binom{\mathrm{depth}}{r}\, \theta^{r} (1 - \theta)^{\mathrm{depth} - r}\, d\theta = \beta(r + 1, \mathrm{depth} - r + 1)$$

$$r = \sum_{i=1}^{\mathrm{depth}} P(r_i = \mathrm{alt} \mid q_i, o_i)$$

where r_i denotes read i, o_i denotes the observed allele on read i at the mutant position, q_i denotes base quality on read i at the mutant position, θ denotes the real mutant allele fraction, G denotes genotype, β denotes the beta function and L denotes likelihood.

Compared with SNVs, indels tend to cause alignment uncertainty problems and a merely position-based method would no longer be adequate. We therefore modified several read-level features and designed several new features/filters to adapt MosaicForecast for calling mosaic indels. Specifically, ‘alt reads’ were re-defined as reads carrying an alt allele or reads clipped at the mutant position; candidate sites within simple repeats and homopolymer regions were filtered; candidate sites with ≥10% reference reads being soft- or hard-clipped were filtered; and when computing the difference of baseQ at the mutant position and neighboring position, the 1-bp neighboring position was re-defined as read regions excluding mutant indels. All of the read-level features were computed using custom Python scripts that relied on the PySam library (ref. 30; https://github.com/pysam-developers/pysam).

Identification of regions with clustered mutations. To identify nondiploid regions that are likely to be enriched for artifacts, we applied the phasing prediction model on all MuTect2-PoN calls from the 75 individuals and extracted variants predicted to be hap = 3 or hap > 3. We then selected regions with ≥3 consecutive hap ≥ 3 sites among the 75 individuals (distance < 5,000 bp between adjacent variants). The clustered hap = 3 variants in these mostly repeat regions had significantly higher read depths (P < 1 × 10^−15, two-tailed Wilcoxon’s rank sum test) and lower mappability scores (P < 1 × 10^−15, two-tailed Wilcoxon’s rank sum test) than nonclustered hap = 3 sites, suggesting that these regions are likely to be nondiploid or otherwise error-prone regions that should be blacklisted. Validation using single-cell, IonTorrent and trio data for clustered hap = 3 sites showed that ~99% of them were false positives (Supplementary Table 6).

Genotype refinement. Given that the genotype cannot be inferred accurately when the haplotype number is three or more (Fig. 1c), we first performed PCA of all variants called by MuTect2 (tumor-only mode) using all read features to determine whether real mosaics can be distinguished from false positive sites in the PCA space (Supplementary Fig. 5). When we projected the experimentally evaluated phasable sites onto the PCA space, we found that the variants validated as mosaic, heterozygous, reference-homozygous and repeat/CNV variants form distinct clusters (Supplementary Fig. 6a,c), suggesting that the read-level features could be used to separate real mosaic mutations from germline variants and other false positive calls with higher accuracy than haplotype information alone.

Analysis of the PCA space revealed that some of the candidate mosaic variants with hap = 3 clustered with hap > 3 variants. Validation data showed that those hap = 3 variants were repeat/CNV or reference-homozygous (Supplementary Fig. 6a). For example, genotyping likelihoods, difference of ref/alt allele query position, difference of read mapping positions and difference of ref/alt read mapping qualities were the main features contributing to PC1; difference of mismatches per ref/alt read and difference of ref/alt allele base qualities were important features contributing to PC2 (PC1 and PC2 were the first two principal components) (Supplementary Fig. 6b). Repeat/CNV sites tended to have lower base qualities and more mismatches per alt read, different base query positions for ref/alt alleles, different read mapping positions and different mapping qualities for ref/alt reads, and thus were better separated from real mosaic variants along PC1 and PC2 (Supplementary Fig. 6a,b).

We thus reasoned that genotype labels of phasable sites could be better predicted using these first five principal components, which collectively explained ~50% of the variance (Supplementary Fig. 5d). We used phasing as well as the first five principal components for experimentally evaluated phasable sites as covariates to model their true genotypes using multinomial linear regression (Supplementary Table 4). The resulting model was used to predict refined genotype labels for the remaining phasable sites, and the three-category genotype labels of all phasable sites (hap = 2, hap = 3 and hap > 3) were converted to four-category genotype labels (het, mosaic, repeat and refhom). The R package glmnet (ref. 36) was used to build the multinomial regression model, and the R package mlr (ref. 37) was used to visualize the classification as shown in Supplementary Fig. 6c.

Construction of the RF model. To construct an RF classification model applicable to both phasable and nonphasable candidate mosaics, we used all phasable sites from diploid chromosomes as the training set and used the read-level features described above for phasable sites as covariates. We used the R package caret (ref. 38) to build the RF model. The model was trained to predict the haplotype numbers for the ‘phasing prediction model’ and it was trained to predict the four refined genotypes (refhom, het, repeat and mosaic) assigned to phasable sites for the ‘refined genotype prediction model’. To train models applicable to sequencing data with different read depths, reads for all candidate sites in the 250× training set were down-sampled to 50×, 100×, 150× and 200×, respectively, and all of the read-level features of phasable sites were extracted from the sampled reads to build the corresponding RF models (Supplementary Table 10).

Evaluation of brain WGS-trained model in WGS data with different read depths. We trained models with phasable sites from the brain WGS data (down-sampled to 50–250× depth) and tested on nonphasable sites from the brain WGS data at 50–250× as well as on the simulated data constructed using NA12878 (Supplementary Fig. 11). To evaluate the validation rate in real WGS data, reads for all candidate sites called by MuTect2 in the 250× data (from the three individuals with single-cell sequencing data available) were down-sampled to 50×, 100×, 150× and 200×, respectively, and all of the read-level features for nonphasable sites were extracted from the sampled reads. We then applied the brain WGS-trained RF models trained with phasable sites at 50–250× read depths to nonphasable sites in the three individuals. When evaluated with single-cell or IonTorrent data (Supplementary Table 3), performance was only slightly better when training and testing datasets had similar coverages. For example, ~74% of variants (40 of 54) were validated as true mosaics when applying a model trained at 50× WGS data to 50× test data, whereas ~66% (40 of 61) were validated when applying a model trained at 250× WGS data to 50× test data (Supplementary Fig. 11).

Simulation of mosaic mutations and extraction of false sites. We also evaluated the performance of MosaicForecast in simulated datasets at different read depths. The 300× WGS data for the HapMap sample NA12878 (ref. 26) were down-sampled to 50×, 100×, 150×, 200× and 250× using SAMtools (ref. 30). Spike-in mosaic mutations with expected allele fractions of 0.02, 0.03, 0.05, 0.1 and 0.3 were generated for each case (Supplementary Fig. 10). These simulated mosaics were randomly selected and mixed in proportion (4:4:4:2:1) to mimic the real early embryo mosaic mutations in nontumor tissues, assuming constant mutation rate per cell division (Supplementary Fig. 10c). To simulate a set of high-quality and correctly phased mosaic variants, simulated mutations were generated in BAM files by converting alternative alleles of the high-confidence heterozygous SNPs (ref. 39) to reference alleles with Bernoulli sampling. In the 250× data, the spike-in mutations were generated at higher density at VAFs (0.01, 0.02, 0.03, 0.05, 0.1, 0.2, 0.3, 0.4) and were used to determine the performance of the models across a wider variety of VAFs.

To extract a set of false variants from real sequencing data, candidate sites were called from the down-sampled BAM files (down-sampled from the original HapMap sample NA12878, without spike-in mutations) with MuTect2 (version 3.5 nightly-r2016-04-25-g7a7b7cd). Variants at <0.02 VAF calculated by MuTect2 were excluded. Variants with VAF ≥ 0.4 calculated by MuTect2, or present in the gnomAD whole-genome database (ref. 19) with ≥0.1% MAF, were excluded. We then applied phasing on all candidate variants, and heterozygous SNPs were chosen as those with 2 haplotypes; sites with a misalignment issue within nondiploid regions were chosen as those with >3 haplotypes. The two kinds of mutations were used as simulated false sites (Supplementary Fig. 10d). We then applied the pretrained RF models at different read depths to predict mosaics in simulated datasets and evaluate performance.

Statistics. Thirteen of 31 read-level features were calculated with scipy (v1.2.1), by performing a two-tailed Wilcoxon’s rank sum test, two-tailed t-test or two-tailed Fisher’s exact test, to compare base qualities, mapping qualities, and positions of ref alleles/reads and alt alleles/reads; to compare base qualities at the mutant position and neighboring positions; and to evaluate strand bias and read1/read2 biases. Refer to Supplementary Table 1 for more details. Other statistical tests for each analysis were calculated with R (v.3.6.1) and are described in the Methods. Further information is available in the Nature Research Reporting Summary.

Reporting Summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
The WGS data are available at the National Institute of Mental Health Data Archive (https://nda.nih.gov/study.html?id=644).

Code availability
MosaicForecast is implemented in Python and R. The source code, documentation and examples are available at https://github.com/parklab/MosaicForecast/.

References
28. Rimmer, A. et al. Integrating mapping-, assembly- and haplotype-based approaches for calling variants in clinical sequencing applications. Nat. Genet. 46, 912–918 (2014).



29. Li, H. & Durbin, R. Fast and accurate long-read alignment with Burrows–Wheeler transform. Bioinformatics 26, 589–595 (2010).
30. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
31. Haeussler, M. et al. The UCSC Genome Browser database: 2019 update. Nucleic Acids Res. 47, D853–D858 (2019).
32. Benson, G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 27, 573–580 (1999).
33. Bragg, L. M., Stone, G., Butler, M. K., Hugenholtz, P. & Tyson, G. W. Shining a light on dark sequencing: characterising errors in Ion Torrent PGM data. PLoS Comput. Biol. 9, e1003031 (2013).
34. Meacham, F. et al. Identification and correction of systematic error in high-throughput sequence data. BMC Bioinformatics 12, 451 (2011).
35. Huang, A. Y. et al. Postzygotic single-nucleotide mosaicisms in whole-genome sequences of clinically unremarkable individuals. Cell Res. 24, 1311–1327 (2014).
36. Friedman, J., Hastie, T. & Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33, 1–22 (2010).
37. Bischl, B. et al. mlr: Machine Learning in R. J. Mach. Learn. Res. 17, 1–5 (2016).
38. Kuhn, M. Building predictive models in R using the caret package. J. Stat. Softw. 28, 26 (2008).
39. Zook, J. M. et al. Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls. Nat. Biotechnol. 32, 246–251 (2014).

Acknowledgements
This work was supported by National Institutes of Health grants (nos. U01MH106883, R01NS032457, T32HG002295 and T32GM007753); by the Harvard Ludwig Center; and by a Horizon 2020 grant (no. 703543). We thank C. Chen, H. Gold, C. Chu, V. Viswanadham and G. Nelson for their helpful comments.

Author contributions
Y.D. developed the algorithm and performed the analysis, under supervision by P.J.P. M.K. generated call sets from MuTect2 and GATK haplotype callers. R.E.R. and R.D. evaluated candidate sites with targeted sequencing, supervised by C.A.W. I.C.C., L.J.L., A.G., C.B. and M.K. helped to refine the algorithm and contributed to editing of the manuscript. Y.D. and P.J.P. wrote the manuscript. All authors discussed the results and contributed to the final manuscript.

Competing interests
The authors declare no competing interests.

Additional information
Supplementary information is available for this paper at https://doi.org/10.1038/s41587-019-0368-8.

Correspondence and requests for materials should be addressed to P.J.P.
Reprints and permissions information is available at www.nature.com/reprints.
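
The two steps described under "Read-based phasing" in the Methods (selecting confident germline SNPs, then classifying each candidate by the number of haplotypes it defines) can be sketched as follows. This is a minimal illustration and not the MosaicForecast code: the function names, the tuple representation of a read and the `min_reads` support threshold are all assumptions introduced here, while the P ≤ 0.05 and <20× cut-offs come from the text.

```python
from collections import Counter
from math import comb

def binomial_two_sided_p(k, n):
    """Exact two-sided binomial test of k alt reads out of n against p = 0.5."""
    lo = min(k, n - k)
    if 2 * lo == n:
        return 1.0
    # With p = 0.5 the distribution is symmetric, so double the smaller tail.
    tail = sum(comb(n, i) for i in range(lo + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def is_confident_germline_snp(alt_reads, depth, alpha=0.05, min_depth=20):
    """Keep SNPs with depth >= 20 whose VAF does not deviate from 0.5 (P > 0.05)."""
    if depth < min_depth:
        return False
    return binomial_two_sided_p(alt_reads, depth) > alpha

def count_haplotypes(read_pairs, min_reads=2):
    """read_pairs: one (germline_snp_allele, candidate_allele) tuple per spanning read.

    min_reads is a hypothetical support threshold to ignore one-off errors.
    """
    counts = Counter(read_pairs)
    return sum(1 for n in counts.values() if n >= min_reads)

def classify_candidate(read_pairs, min_reads=2):
    """Map the haplotype count to the three classes described in the text."""
    n_hap = count_haplotypes(read_pairs, min_reads)
    if n_hap == 2:
        return "het"
    if n_hap == 3:
        return "potential_mosaic"
    if n_hap > 3:
        return "false_positive"
    return "unphased"
```

A candidate whose alt allele appears on only part of the reads from one parental haplotype yields three distinct allele pairs and is therefore reported as `potential_mosaic`.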

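
The rule used in "Identification of regions with clustered mutations" (at least three consecutive hap ≥ 3 sites with less than 5,000 bp between adjacent variants) can be illustrated with a short sketch. It assumes the input is a list of positions on a single chromosome, already pooled across individuals; the function name and the returned (start, end) tuples are hypothetical choices made for this example.

```python
def clustered_regions(positions, max_gap=5000, min_sites=3):
    """Return (start, end) of runs of >= min_sites sites with adjacent gaps < max_gap."""
    regions = []
    run = []
    for pos in sorted(positions):
        # A gap of max_gap or more closes the current run.
        if run and pos - run[-1] >= max_gap:
            if len(run) >= min_sites:
                regions.append((run[0], run[-1]))
            run = []
        run.append(pos)
    # Flush the final run.
    if len(run) >= min_sites:
        regions.append((run[0], run[-1]))
    return regions
```

Candidate variants falling inside any returned interval would then be treated as lying in a clustered region.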

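
The spike-in strategy in "Simulation of mosaic mutations and extraction of false sites" converts alt alleles at heterozygous SNPs to reference alleles with Bernoulli sampling. The sketch below shows the idea on a plain list of read alleles rather than a BAM file. The conversion rule (keep each alt read with probability 2 × target VAF, which assumes the SNP starts at an alt fraction of about 0.5) is an assumption made for this example, as are the function name and the seed parameter.

```python
import random

def spike_in_mosaic(alleles, target_vaf, seed=0):
    """Down-convert alt reads at a het SNP so the expected alt fraction is target_vaf.

    alleles: list of "ref"/"alt" strings, one per read covering the SNP.
    """
    if not 0 <= target_vaf <= 0.5:
        raise ValueError("target_vaf must lie between 0 and 0.5")
    rng = random.Random(seed)
    keep_prob = 2 * target_vaf  # assumed: the het SNP has ~50% alt reads
    # Each alt read stays alt with probability keep_prob, otherwise becomes ref.
    return [a if a == "ref" or rng.random() < keep_prob else "ref" for a in alleles]
```

Applying `spike_in_mosaic` with a target of 0.05 to a site with equal numbers of ref and alt reads leaves roughly 5% alt reads on average, all on the haplotype that originally carried the alt allele, so the simulated variant stays correctly phased.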
