Accurate Detection of Mosaic Variants in Sequencing Data Without Matched Controls
https://doi.org/10.1038/s41587-019-0368-8
Detection of mosaic mutations that arise in normal development is challenging, as such mutations are typically present in only a minute fraction of cells and there is no clear matched control for removing germline variants and systematic artifacts. We present MosaicForecast, a machine-learning method that leverages read-based phasing and read-level features to accurately detect mosaic single-nucleotide variants and indels, achieving a multifold increase in specificity compared with existing algorithms. Using single-cell sequencing and targeted sequencing, we validated 80–90% of the mosaic single-nucleotide variants and 60–80% of indels detected in human brain whole-genome sequencing data. Our method should help elucidate the contribution of mosaic somatic mutations to the origin and development of disease.
A single individual harbors multiple populations of cells with distinct genotypes due to somatic mutations arising postzygotically1. Such diversity of genotypes in an individual is referred to as somatic mosaicism. Analysis of mosaic mutations in nondisease samples enables exploration of lineage patterns during development and characterization of mutational mechanisms operative in normal cells2–6. Recent studies have also demonstrated that somatic mutations contribute to many diseases besides cancer1,7–11.
Identification of mosaic mutations in genome sequencing data remains challenging in two key aspects. First, whereas functionally relevant cancer mutations typically confer proliferative advantage and thus have relatively high variant allele fractions (VAFs), most mosaic mutations are present in a small number of cells and have very low VAFs. In the extreme case, those occurring in postmitotic cells are present only in a single cell and are detectable only by single-cell sequencing, as we have done recently10. As standard cancer mutation callers typically have a lower VAF limit of 2–5% (refs. 12,13), detection of mutations with lower VAFs requires a more sensitive bioinformatic method and/or higher sequencing depth. Second, mosaic mutations that arise early in development generally exist in multiple tissues2,14. Thus, the conventional approach of using a paired control tissue for filtering germline variants and systematic errors would exclude such early occurring mutations. Several methods have been employed to detect mosaic single-nucleotide variants (SNVs) from nontumor tissues, such as the use of a germline variant caller15 with higher ploidy assumptions8 or a combination of somatic mutation callers3,7,9,14. Additional filtering leveraging trio data to exclude germline variants7–9,15 is also common. However, validation rates in these studies have been modest.
Incorporation of read-level features in a flexible framework is critical for distinguishing real mutations from artifacts16,17. In place of filters with hard thresholds, recent methods such as DeepVariant16 and Strelka2 (ref. 17) use machine learning to combine relevant read-level features to improve detection of germline and cancer somatic variants, respectively. Another component in accurate detection of mosaic SNVs in silico is read-based phasing3,7,8,18, in which a candidate mosaic mutation and a nearby germline variant are checked for haplotype consistency; that is, a true mosaic mutation should generate one and only one additional haplotype. A major disadvantage of phasing, however, is that only a small fraction (~10–30%) of variants are phasable using short-read sequencing18, and phasing may be ambiguous in nondiploid or low-mappability regions6.
We developed MosaicForecast, which leverages multiple read-level features over phasable sites to build a genome-wide prediction model for finding mosaic mutations in the absence of a matched control sample. It consists of three major steps (Fig. 1a): (1) generation of a training set by read-based phasing; (2) construction of a random forest (RF) model based on read-level features related to the quality and category of variants, such as VAF, read depth, mismatches per read and strand bias (Fig. 1b and Supplementary Table 1); and (3) genome-wide prediction of mosaic SNVs. The underlying idea is similar to that of DeepVariant16 and Strelka2 (ref. 17) in that a nonlinear model that combines informative read-level features is trained using a machine-learning framework and then applied to a test set. The main difference is that, to overcome the problem that high-quality training data are not available, MosaicForecast uses phasable sites in building a training set. We introduce another modeling step using multinomial logistic regression to improve the training set when some experimental validation data are available (Fig. 1c). As an illustrative example, we applied the tool to analyze whole-genome sequencing (WGS) data from brain tissues of 60 autism spectrum disorder and 15 neurotypical individuals, sequenced at ~250× (150-base pair (bp), paired-end reads).
To assemble a training set of high-confidence mosaic mutations, we first identify a lenient set of candidate mosaic variants. We used MuTect2 in its tumor-only mode for its high sensitivity, but other algorithms can be used (see Methods and Supplementary Fig. 1). To remove germline variants and recurrent artifacts, we filtered variants present in the Genome Aggregation Database (gnomAD)19. In addition, since the likelihood that somatic mutations occur at the same position in different individuals is vanishingly small, we also removed variants found in any other samples in the dataset
1Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA. 2Division of Genetics and Genomics, Manton Center for Orphan Disease, and Howard Hughes Medical Institute, Boston Children’s Hospital, Boston, MA, USA. 3Departments of Neurology and Pediatrics, Harvard Medical School, Boston, MA, USA. 4Broad Institute of MIT and Harvard, Cambridge, MA, USA. 5Harvard/MIT MD–PhD Program, Harvard Medical School, Boston, MA, USA. 6Bioinformatics and Integrative Genomics PhD program, Harvard Medical School, Boston, MA, USA. 7Ludwig Center at Harvard, Boston, MA, USA. 8Present address: European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK. *e-mail: [email protected]
[Figure 1 image not reproduced. Panel a labels: initial variant calls; hap = 3, hap > 3 and nonphasable sites; orthogonal validation (amplicon seq, single cell, trio); het, refhom, mosaic and repeat/CNV genotypes; genome-wide prediction of mosaics. Panel b: feature importance (scale 0–100), with strand bias, mismatches per read, read1/read2 bias and read map positions shown as examples; listed features include variant allele fraction, het, refhom and mosaic genotype likelihoods, mismatches per alt reads, mismatches per ref reads, MapQ difference of ref/alt reads, proportion of clipped ref reads, proportion of indel reads at the mutant position, difference of ref/alt reads mapping positions, strand bias and difference of ref/alt base query positions. Panel c: phasing versus validation (het, mosaic, refhom, repeat). Panel d: number of nonphasable sites for MuTect2, MosaicForecast-Phase and MosaicForecast-Refine.]
Fig. 1 | Framework of MosaicForecast to detect mosaic SNVs from bulk sequencing data. a, Candidate mosaics were classified as hap = 2, hap = 3 or hap > 3 by read-based phasing, and an RF model was trained to predict the phasing by using over 30 read-level features as covariates. The model was then applied to nonphasable sites to predict their genotypes. Given a list of experimentally evaluated sites, the model could be further improved by an additional genotype-refinement step. b, The relative importance of the features from the RF model for the brain WGS data, with four examples of read-level features. c, In total, 483 phasable sites were orthogonally evaluated by single-cell, trio and targeted sequencing data. After genotype refinement, the phasable sites classified as hap = 2, hap = 3 and hap > 3 were converted to het, mosaic, repeat/CNV and refhom for training. d, We applied MosaicForecast to nonphasable MuTect2 candidate mosaics and evaluated them in single-cell, trio and targeted sequencing data. In nonrepeat regions, the precision increased from 8.9% (MuTect2) to 76% for the phasing prediction model and 85% for the refined genotypes prediction model; in RepeatMasker regions, it increased from 1% (MuTect2) to 50% in the phasing prediction model and 77% in the refined genotypes prediction model. The unit k stands for 1000.
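The phasing-based labelling in panel a can be sketched in a few lines. This is a minimal illustration, not MosaicForecast's own code: the function names, the dictionary layout and the toy allele pairs are hypothetical, and it assumes that the reads covering both the candidate site and a nearby germline SNP have already been reduced to (SNP allele, candidate allele) pairs.

```python
from collections import Counter

def count_haplotypes(read_pairs, min_reads=1):
    """Count distinct (germline allele, candidate allele) combinations.

    read_pairs: one 2-tuple per read (or read pair) that covers both the
    germline SNP and the candidate site.
    min_reads: hypothetical support threshold; combinations seen in fewer
    reads are ignored.
    """
    support = Counter(read_pairs)
    return sum(1 for n in support.values() if n >= min_reads)

def hap_category(n_haplotypes):
    """Map a haplotype count to the categories used in the text."""
    if n_haplotypes == 2:
        return "hap=2"   # consistent with a heterozygous germline variant
    if n_haplotypes == 3:
        return "hap=3"   # consistent with a mosaic variant
    if n_haplotypes > 3:
        return "hap>3"   # suggests low mappability, CNV or artifact
    return "unclassified"  # fewer than two haplotypes: not covered in the text

def split_sites(sites):
    """Phasable sites become labelled training data; the rest await prediction."""
    train, predict = [], []
    for site in sites:
        if site["read_pairs"]:
            label = hap_category(count_haplotypes(site["read_pairs"]))
            train.append((site["name"], label))
        else:
            predict.append(site["name"])
    return train, predict

# Toy data: 'A'/'G' is the germline SNP allele, 'ref'/'alt' the candidate allele.
sites = [
    {"name": "site1",
     "read_pairs": [("A", "ref")] * 5 + [("G", "ref")] * 4 + [("G", "alt")] * 2},
    {"name": "site2",
     "read_pairs": [("A", "ref")] * 6 + [("G", "alt")] * 5},
    {"name": "site3", "read_pairs": []},  # no germline SNP on the same reads
]
train, predict = split_sites(sites)
print(train)    # labelled phasable sites
print(predict)  # nonphasable sites left for the trained model
```

In the workflow described in the caption, the labelled phasable sites would then serve as the response for a random forest trained on read-level features, and the nonphasable sites would be scored by that model.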
(75 minus the one being analyzed). We observed that removing recurrent variants did not result in loss of sensitivity (Supplementary Fig. 2). For some experimental designs, for example, comparing multiple tissues from the same individual, recurrent variants may be true mosaics; thus, a filtering scheme with an appropriate panel of ‘normals’ should be chosen to remove germline variants as well as
[Figure 2 image not reproduced. Recoverable axis names: precision and recall, each on a 0–1 scale.]
Fig. 2 | Comparison among algorithms. a, Candidate mosaics (both phasable and nonphasable) in the three individuals with single-cell data were
evaluated (see Methods). b, Precision and recall are plotted separately for the nonrepeat and repeat regions (as defined by RepeatMasker) and for each
individual. The number inside each symbol corresponds to the number of validated mosaics.
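Precision and recall in panel b follow the usual definitions. The sketch below applies them to the per-individual MosaicForecast-Refine counts quoted in the text (24 of 26, 25 of 31 and 24 of 33 validated calls). The recall example uses made-up counts, because the number of missed mosaics per individual is not given here, and the function names are illustrative.

```python
def precision(validated, called):
    """Fraction of called variants that were validated as true mosaics."""
    if called == 0:
        raise ValueError("no calls: precision is undefined")
    return validated / called

def recall(validated, total_true):
    """Fraction of known true mosaics that the caller recovered."""
    if total_true == 0:
        raise ValueError("no true mosaics: recall is undefined")
    return validated / total_true

# (validated, called) per individual for MosaicForecast-Refine, from the text.
refine_counts = [(24, 26), (25, 31), (24, 33)]
for validated, called in refine_counts:
    print(f"{validated} of {called}: precision = {precision(validated, called):.1%}")

# Hypothetical recall example: 24 recovered out of an assumed 30 true mosaics.
print(f"recall = {recall(24, 30):.1%}")
```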
minimizing the risk of including artifacts that arise due to misalignment or index hopping20.
We then classified the phasable variants (those for which a germline SNP is contained in the same read or its mate pair) into three categories depending on the number of observed haplotypes (hap): (1) hap = 2, consistent with heterozygous germline variants; (2) hap = 3, consistent with mosaics; and (3) hap > 3, suggestive of low-mappability regions, presence of copy number variations (CNVs) or sequencing-associated/other artifacts (Supplementary Fig. 3). For our brain data, ~25% of candidate mosaics were phasable with at least one germline SNP (Supplementary Table 2).
To determine whether the true genotypes can be inferred from the haplotype categories, we evaluated 483 phasable sites in selected samples for which three additional data types are available: single-cell WGS, amplicon-based targeted sequencing and trio WGS (see Methods and Supplementary Tables 2 and 3). The single-cell WGS dataset of three individuals we have published previously5,10 provides an excellent resource for orthogonal validation, as the lineage information as well as allele fraction across cells allow us to distinguish mosaics from heterozygous SNPs, germline repeat/CNV variants and technical artifacts (see Methods and Supplementary Fig. 4); trio data (for two individuals) are useful for distinguishing mosaic and germline variants (Supplementary Table 3). This analysis (Fig. 1c) revealed that although the ‘hap = 3’ category was enriched for true mosaic mutations (50%), the rest of the hap = 3 sites turned out to be false positives, classified as repeat/CNV regions (37%), germline heterozygous (6%) and reference-homozygous (6%). Variants labeled as ‘hap = 2’ were mostly germline heterozygous, and variants labeled as ‘hap > 3’ were mostly false positives as expected.
To identify mosaic variants, we first built an RF model using over 30 read-level features as predictors and the haplotype number (hap = 2, hap = 3, hap > 3) as the response, at all phasable sites on diploid chromosomes (Supplementary Table 4). Then we applied this ‘phasing prediction model’ genome-wide, excluding nonunique mapping regions21 (see Methods). This model resulted in modest validation rates for hap = 3 sites, with 67% (55 of 82) in nonrepeat regions and 34% (28 of 82) within repeat regions (Supplementary Fig. 5 and Supplementary Table 5; we define ‘repeat region’ to include interspersed repeats and low-complexity sequences identified by RepeatMasker22). However, we noticed that many variants are clustered, in mostly repeat or nondiploid regions, when all individuals are considered together (~46% of those predicted to be hap ≥ 3 were enriched in regions that together span only ~19 Mb; see Methods and Supplementary Table 6). After removing these clustered variants, the overall validation rate for the phasing prediction model increased to 76% (55 of 72) in nonrepeat regions and 49% (28 of 57) in repeat regions (Fig. 1d and Supplementary Table 5). This constitutes a sevenfold increase (nonrepeat regions) and a 43-fold increase (repeat regions) in precision compared with the initial MuTect2 calls, with minimal loss of sensitivity.
With experimental validation data at phasable sites, we can further improve our prediction model. Because the haplotype number was only moderately correlated with the mosaic status (Fig. 1c, top), we reasoned that an intermediate model that defines the genotype more accurately using validation data could generate a better training set for the subsequent RF model. Visual inspection in the principal component space of the read-based features revealed that some hap = 3 variants clustered with variants that were found to be repeat/CNV or reference-homozygous, suggesting that read-level data can help refine the genotype predictions (Supplementary Fig. 6a–c). With a multinomial logistic regression model incorporating the read-level features (Fig. 1a), we converted the genotyping categories from haplotype counts to ‘het’, ‘mosaic’, ‘refhom’ and ‘repeat’. The refined categories were in much better agreement with the orthogonally evaluated calls: whereas only 50% of hap = 3 variants were validated mosaics, 85% of the ‘mosaic’ predictions from the regression model were validated mosaics (Fig. 1c). The resulting model was then applied to all phasable sites to generate their four-category genotype labels (Supplementary Table 4).
Using the phasable sites and their refined genotypes as a training set, we predicted mosaics genome-wide. We built an RF classification model (‘refined genotypes prediction model’) on all phasable sites with over 30 features as covariates and the refined genotypes as the response. We then applied the RF model to the 135,250
[Figure 3 image not reproduced. Panel a: nonphasable mosaics evaluated with Ion Torrent at read depths 250×, 200×, 150×, 100× and 50×, for PCR-free and PCR-based samples. Panel b: sensitivity versus FDR at the same read depths, with 5% FDR marked. Panels c and d: validation rates of mosaic deletions and mosaic insertions at hap = 3 and nonphasable sites.]
Fig. 3 | Impact of read depth on sensitivity and detection of mosaic indels. a, At each coverage, a different RF model was trained on the phasable sites and predictions were made on nonphasable sites. Amplicon-sequencing data were used for validation. Although fewer true mosaics were identified at lower coverages, the sensitivity did not drop substantially (for example, at 50×, MosaicForecast was able to detect ~80% of real variants identified at 250×). b, Similar to a but using simulated data. The sensitivity was ~70% at 50×. c, >70% of mosaic deletions called by MosaicForecast were validated by IonTorrent; the hap = 3 sites and nonphasable sites had similar validation rates. d, Similar to c but for mosaic insertions. FDR, false discovery rate.
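To give a feel for why fewer mosaics are found at lower coverage, the sketch below computes the chance of sampling at least a few alt reads at a given depth and VAF under plain binomial sampling. This is a back-of-envelope model, not the evaluation behind panels a and b: the threshold of three alt reads is an assumption chosen for illustration, and sequencing error, mapping bias and the RF model itself are ignored.

```python
from math import comb

def prob_at_least(k, depth, vaf):
    """P(X >= k) for X ~ Binomial(depth, vaf): chance of k or more alt reads."""
    p_fewer = sum(
        comb(depth, i) * vaf**i * (1 - vaf) ** (depth - i) for i in range(k)
    )
    return 1.0 - p_fewer

MIN_ALT_READS = 3  # assumed detection threshold, for illustration only

for depth in (250, 200, 150, 100, 50):
    row = ", ".join(
        f"VAF {vaf:.2f}: {prob_at_least(MIN_ALT_READS, depth, vaf):.2f}"
        for vaf in (0.02, 0.05, 0.10)
    )
    print(f"{depth}x  {row}")
```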
nonphasable candidate mutations (Supplementary Table 7). Sites within nonunique mapping regions21 (mappability score = 0) as well as sites within clustered regions (Supplementary Fig. 6d) were excluded. Among the 2,220 predicted (nonphasable) mosaics, 95 randomly selected sites were evaluated using orthogonal data (same validation method as for phasable sites). As shown in Fig. 2a, 78 (82%) were confirmed as true mosaics (85% in nonrepeat regions and 77% in repeat regions). Top-ranked features of the RF model are listed in Supplementary Fig. 6e.
We compared the performance of MosaicForecast with that of GATK HaplotypeCaller (GATK-HC)23, MuTect2 (ref. 12) and MosaicHunter24 using three different approaches. First, we inspected the variants called by all methods in three 250× WGS brain samples for which single-cell WGS data are available5,10 (both phasable and nonphasable sites; leave-one-out cross-validation for three individuals). Although the lineage information in single cells provides a very useful way to benchmark algorithm performance, one limitation of this approach is that variants with low allele fraction have a proportionally low chance of being sampled if the number of cells is small; thus, we used deep sequencing on the IonTorrent platform to further examine those variants identified as refhom by single-cell data (see Methods, Supplementary Fig. 7 and Supplementary Table 8). The results show that MosaicForecast-Phase and -Refine models achieve precision that is typically several-fold higher than other tools, while maintaining high sensitivity (Fig. 2a). GATK-HC with ploidy two (GATK-HC-p2) frequently misclassified heterozygous SNPs as mosaics; MuTect2 and GATK-HC with ploidy five (GATK-HC-p5) most often misclassify repeat/CNV variants; and MosaicHunter could only detect variants within nonrepeat regions, thus losing ~50% of true mosaics. At the individual level (Fig. 2b), the precision was ~92% (24 of 26), ~81% (25 of 31) and ~73% (24 of 33) for the MosaicForecast-Refine model, suggesting a consistently high validation rate. MosaicForecast was also able to detect more low-allele fraction variants with VAF ≤ 0.05 (30 of 41) than MosaicHunter (14), GATK-HC-p5 (4) and GATK-HC-p2 (0) (Supplementary Fig. 8a).
As a second mode of validation, we evaluated candidate mosaics called by MosaicForecast-Refine in the 75 individuals using amplicon-based sequencing (~30,000× on IonTorrent). Of the 75, the IonTorrent validation rate (Supplementary Table 9) was ~94% (161 of 171) for the 64 samples that were sequenced using PCR-free libraries (~95%, 149 of 157, for diploid and ~86%, 12 of 14, for haploid chromosomes). For the remaining 11 sequenced using PCR-based libraries, the validation rate was ~61% (42 of 68; 68%, 42 of 62, on diploid and 0 of 6 for haploid chromosomes).
The lower validation rate for the PCR-based samples is likely due to the PCR-induced biases, as reflected in a significantly higher proportion of G>T mutations (odds ratio = 4.1, P < 1 × 10^−15, Fisher’s exact test; Supplementary Fig. 8b), which are associated with oxidative damage25. If we focus on nonphasable sites from diploid chromosomes (Fig. 3a), validation rates were ~95% (105 of 111) and ~67% (30 of 45) for PCR-free and PCR-based samples, respectively. In addition, the validation rates were similar in nonrepeat regions (87%, 118 of 136) and repeat regions (85%, 17 of 20). Among the 177 MosaicForecast mosaics in nonrepeat regions confirmed by IonTorrent, GATK-HC-p5 was only able to detect ~62% (109 of 177), followed by MosaicHunter (~59%, 105 of 177) and GATK-HC-p2 (~20%, 35 of 177). Among the 26 MosaicForecast mosaics in repeat regions confirmed by IonTorrent, GATK-HC-p5 and GATK-HC-p2 were only able to detect ~58% (15 of 26) and ~19% (5 of 26), respectively (MosaicHunter does not make calls in repeat regions). A large fraction of low-VAF (≤0.05) mosaics were called by MosaicForecast but missed by both MosaicHunter and GATK-HC (~52%, 48 of 92; Supplementary Fig. 8c), indicating that MosaicForecast is particularly advantageous for detecting low-VAF mutations.
Third, we tested the haplotype numbers for the extra variants identified by the other callers. Across the 75 individuals, the other methods called 1–80 times more mutations than MosaicForecast, but read-based phasing showed that a large proportion of phasable sites from these tools had two haplotypes or more than three haplotypes, inconsistent with mosaic variants (Figs. 1c and 2a). For example, the percentages of hap > 3 variants were 58%, 49% and 26% for MuTect2, GATK-HC-p5 and GATK-HC-p2, respectively; another 51% of GATK-HC-p2 were hap = 2. These numbers indicate that the false positive rates are indeed very high for other methods (Supplementary Fig. 9).
To determine the performance of our model across VAFs, we applied the model to a simulated dataset containing spike-in mutations (see Methods). We found that the model had similarly good performance over a relatively wide range of VAFs, from 0.02 to 0.3, as reflected in the receiver operating characteristic (ROC) curves (Supplementary Fig. 10a,b). It performed substantially worse when the VAFs approached 0.5, as it becomes impossible to separate somatic variants from germline variants; in that case, a case-control scheme would be a better choice.