Squigulator Sup
For each simulation, the analysis workflow and evaluation were exactly as described above.
[Figure S1: genome browser tracks, top to bottom: Experimental, Squigulator, DeepSimulator]
FigS1. Comparison of Squigulator and DeepSimulator to real experimental ONT data. Genome browser view shows
basecalled reads (Guppy SUP model) aligned to the human reference genome (hg38). The top track shows real
experimental data from ONT sequencing of NA12878 genomic DNA (R9.4.1 PromethION flow cells). The middle track
shows simulated NA12878 data from Squigulator with the -x dna-r9-prom pre-set configuration. The bottom track shows
simulated NA12878 data from DeepSimulator running in context-independent mode. Blue triangle markers show the
location of NA12878 SNVs that were incorporated into the simulation and are correctly detected by Clair3. Red triangle
markers show the presence of systematic errors in basecalled reads from DeepSimulator, which are erroneously
detected as SNVs by Clair3.
[Figure S2 panels: (a) Per-read basecalling accuracy on real experimental data; (b) median read:reference identity against mean dwell-time (x-axis 0 to 50) for the fast, hac and sup models; (c) SNV F-score (Clair3) against mean dwell-time (x-axis 0 to 50) for the fast, hac and sup models.]
FigS2. Parameter exploration regarding Guppy basecalling sequencing accuracy. (a) Guppy basecalling accuracy, as
measured by read:reference identity score distributions, on real experimental NA12878 data with Guppy’s FAST, HAC
or SUP models. (b) Guppy basecalling accuracy, as measured by read:reference identity score medians, for repeated
experiments in which the mean dwell time (--dwell-mean) is varied, while other parameters are held at default.
Experiment was repeated with FAST, HAC and SUP basecalling models. Default value --dwell-mean=9 (for R9.4.1 flow
cell). (c) Accuracy of SNV detection, as measured by F-score, by Clair3 on the same datasets and basecalling models as
above (colours are matched).
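The dwell-time sweep in panels (b) and (c) can be sketched as a short script that builds one Squigulator command line per --dwell-mean value. This is a minimal illustration, not the authors' pipeline: the -x dna-r9-prom preset and the --dwell-mean option come from the legend above, while the -o (output file) and -n (number of reads) options reflect our understanding of the Squigulator command line and should be checked against `squigulator --help`. The reference path, output file names, read count and the specific dwell-time values are hypothetical.

```python
import shlex


def squigulator_sweep(ref_fasta, dwell_means, n_reads=1000, profile="dna-r9-prom"):
    """Build one squigulator argv list per mean dwell-time value.

    Only --dwell-mean changes between runs; all other parameters are left
    at the defaults of the chosen preset.
    """
    commands = []
    for dwell in dwell_means:
        out_path = f"sim_dwell{dwell}.blow5"  # hypothetical output naming
        commands.append([
            "squigulator", ref_fasta,
            "-x", profile,               # preset named in the legend
            f"--dwell-mean={dwell}",     # the parameter being explored
            "-o", out_path,              # assumed output option
            "-n", str(n_reads),          # assumed read-count option
        ])
    return commands


if __name__ == "__main__":
    # 9 is the stated default for R9.4.1; the other values are illustrative.
    for argv in squigulator_sweep("hg38.fa", [5, 9, 20, 50]):
        print(shlex.join(argv))
```

The script only prints the commands; each resulting signal file would then be basecalled with the FAST, HAC and SUP models and evaluated as in the legend.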
Supplementary Table 1: Comparison of Minimap2 alignment statistics for experimental vs simulated
NA12878 datasets.
Basecalled data was generated using Guppy HAC model.
Columns: NA12878 experimental; NA12878 simulated
Supplementary Table 2: Comparison of Clair3 and Nanopolish SNV detection statistics for experimental vs
simulated NA12878 datasets
Basecalled data was generated using Guppy SUP model.
Data          Variant caller  Score threshold  True positives baseline  False positives  True positives call  False negatives  precision  sensitivity  f_measure
Experimental  Clair3          None             34302                    118              34304                160              0.9966     0.9954       0.996
                              2.15             34302                    118              34304                160              0.9966     0.9954       0.996
Simulated     Clair3          None             33881                    403              33883                581              0.9882     0.9831       0.9857
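As a consistency check on the table above, the precision, sensitivity and F-measure columns can be recomputed from the count columns. The sketch below assumes the common convention in which precision uses the call-side true positives, sensitivity uses the baseline-side true positives, and the F-measure is their harmonic mean. The table does not state these definitions, so they are an assumption, and the function name is ours; under this convention the reported values are reproduced to four decimal places.

```python
def snv_metrics(tp_baseline, fp, tp_call, fn):
    """Return (precision, sensitivity, f_measure) from SNV comparison counts.

    Assumed definitions (not stated in the table itself):
      precision   = TP_call / (TP_call + FP)
      sensitivity = TP_baseline / (TP_baseline + FN)
      f_measure   = harmonic mean of precision and sensitivity
    """
    precision = tp_call / (tp_call + fp)
    sensitivity = tp_baseline / (tp_baseline + fn)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    return precision, sensitivity, f_measure


# Counts copied from the Clair3 rows (score threshold "None"), in the order
# TP baseline, FP, TP call, FN.
rows = {
    "Experimental": (34302, 118, 34304, 160),
    "Simulated": (33881, 403, 33883, 581),
}

for name, counts in rows.items():
    p, s, f = snv_metrics(*counts)
    print(f"{name}: precision={p:.4f} sensitivity={s:.4f} f_measure={f:.4f}")
```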
Supplementary Table 3: Comparison of Squigulator and DeepSimulator run-time and memory usage.
Run times and peak RAM usage were measured during simulations of NA12878 data from chr22 at ~30X using
16 CPUs.