Squigulator Sup

The document details the generation of experimental and simulated NA12878 datasets for benchmarking, utilizing ONT PromethION sequencing and Squigulator software. It outlines the analysis workflow, including basecalling, alignment, and variant calling, along with parameter exploration experiments and comparisons between Squigulator and DeepSimulator. Results are presented in tables and figures, highlighting differences in accuracy, runtime, and memory usage between the two simulation methods.


SUPPLEMENTARY METHODS

Generating the experimental NA12878 dataset


The experimental dataset used in benchmarking experiments was generated by sequencing genomic DNA
from the human NA12878 reference sample on an ONT PromethION device. Unsheared DNA libraries were
prepared using the ONT LSK109 ligation library prep, and two R9.4.1 flow-cells were used to generate ~30×
genome coverage. The data are available from the NCBI Sequence Read Archive under BioProject PRJNA744329.

Generating the simulated NA12878 dataset


The simulated NA12878 dataset was generated using Squigulator, with the intention of emulating the real
experimental dataset described above. Bcftools consensus (v1.16) was used to incorporate high-confidence NA12878
variants (SNVs and indels) from Genome in a Bottle (v3.3.2) into the human reference genome sequence
(hg38; FASTA format). To minimise the computational resources required for the resulting benchmark experiments,
we restricted this to chr22. The commands used were as follows:
bcftools consensus --haplotype 1 -f chr22.fa giab_na12878.vcf > hap1.fa
bcftools consensus --haplotype 2 -f chr22.fa giab_na12878.vcf > hap2.fa
cat hap1.fa hap2.fa > na12878_chr22.fa
These commands generate two separate chr22 reference sequences with variants incorporated from
NA12878 haplotype 1 and haplotype 2, respectively (homozygous variants are incorporated into both
references). We then used Squigulator to generate simulated nanopore signal data from this custom diploid
reference. To match the data to the NA12878 experimental dataset, we used the -x dna-r9-prom pre-set
parameter configuration. We adjusted the read-length mean, read-length standard deviation and sequencing
depth so as to approximate the equivalent metrics measured from the experimental dataset. The command
used was as follows:
squigulator na12878_chr22.fa -o reads.blow5 -n 135000 -r 10800 -x dna-r9-prom -t 8 -K 4096
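As a rough sanity check on these parameter choices (a sketch only, not part of the original workflow), the sequencing depth implied by the -n and -r values can be computed from the chr22 length (taken from the rtg vcfeval --region used later in this document):

```python
# Rough check: expected coverage implied by -n (read count) and -r (mean read length).
chr22_len = 50_818_468      # hg38 chr22 length in bp
n_reads = 135_000           # squigulator -n
mean_read_len = 10_800      # squigulator -r

depth = n_reads * mean_read_len / chr22_len
print(f"approximate depth: {depth:.1f}x")  # close to the ~30x experimental dataset
```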

Details of analysis workflow and evaluation with RTG


Signal data was basecalled with ONT’s Guppy software (using the Buttery-eel wrapper for SLOW5 data access;
Buttery-eel v0.0.1 on Guppy v6.0.6). Basecalled reads were aligned to the hg38 reference genome with no
alternate contigs using Minimap2 (v2.17). Alignment statistics were derived with Samtools stats (v1.9).
Read:reference identity scores were retrieved using Paftools, a companion tool in the Minimap2
repository:
samtools view reads.bam -h chr22 | paftools.js sam2paf -p - | awk '{print $10/$11}'
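The awk one-liner above prints one identity value per read (matches divided by alignment block length); summary statistics such as the medians reported in FigS2 can then be taken over these values. A minimal sketch, using made-up example values rather than real data:

```python
# Illustrative only: summarising per-read identity values such as those
# printed by the awk one-liner above (one float per line).
import statistics

identities = [0.96, 0.94, 0.97, 0.95, 0.93]  # made-up example values
print(f"median read:reference identity: {statistics.median(identities):.2f}")
```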
Variant calling was performed separately using Nanopolish (v0.14.0) and Clair3 (v0.1-r11;
r941_prom_sup_g5014). Variant evaluation was performed using rtg vcfeval against the GIAB NA12878 high
confidence truth-set (the same callset that was used during the simulation) with QUAL field as the
--vcf-score-field. The commands used for basecalling, alignment, variant calling and evaluation were as follows:
buttery-eel -i reads.blow5 -o reads.fastq --guppy_bin ont-guppy-6.0.6/bin --port 5887 --config dna_r9.4.1_450bps_${MODEL}_prom.cfg -x cuda:all --chunk_size 1500 --max_queued_reads 1000 # MODEL is fast or hac or sup

minimap2 -x map-ont -a -t32 --secondary=no hg38noAlt.fa reads.fastq > reads.sam


samtools sort -@32 reads.sam > reads.bam
samtools index reads.bam

run_clair3.sh --threads=32 --include_all_ctgs --bam_fn=reads.bam --ref_fn=hg38noAlt.fa --platform=ont --model_path=r941_prom_sup_g5014/ --output=out/ --sample_name=reads --enable_phasing --longphase_for_phasing
nanopolish variants -o output.vcf -w ${1} -r reads.fastq -g hg38noAlt.fa -b reads.bam -p 2 -t 4 -q cpg --fix-homopolymers

rtg RTG_MEM=32G vcfeval -b highconf_PGandRTGphasetransfer.vcf.gz -c merge_output.vcf.gz -t hg38noAlt.sdf -o compare_clair --region chr22:1-50818468 -e highconf_nosomaticdel_noCENorHET7.bed --vcf-score-field QUAL

Details of parameter exploration experiment


For the parameter exploration experiments presented in Fig3 and FigS2, we repeated the simulation and
analysis workflows described above, each time varying the simulation parameters. We independently varied
the dwell-time mean (--dwell-mean), dwell-time standard deviation (--dwell-std) and amplitude noise factor
(--amp-noise), whilst holding the other parameters at the default value. Example commands are as follows:
squigulator na12878_chr22.fa -o reads.blow5 -n 135000 -r 10800 -t 8 -K 4096 -x dna-r9-prom --amp-noise <FACTOR> --dwell-mean <MEAN> --dwell-std <STD>

For each simulation, the analysis workflow and evaluation were performed exactly as described above.
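The sweep itself can be scripted. A dry-run sketch that simply constructs one squigulator command string per dwell-time mean value (the value grid shown here is illustrative; FigS2 shows the range actually explored):

```python
# Dry-run sketch: build one squigulator command per dwell-time mean value.
# The grid below is illustrative, not the exact grid used in the paper.
template = ("squigulator na12878_chr22.fa -o reads_dwell{m}.blow5 "
            "-n 135000 -r 10800 -t 8 -K 4096 -x dna-r9-prom --dwell-mean {m}")
commands = [template.format(m=m) for m in (5, 9, 20, 35, 50)]
for cmd in commands:
    print(cmd)
```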

Details for DeepSimulator comparison


DeepSimulator 1.5 main branch on Github (https://github.com/liyu95/DeepSimulator) has an install.sh script
for building a conda environment and setting up various other tools required. This script does not work with
conda v4+ and thus modifications were made to successfully install DeepSimulator. Similarly, the
deepsimulatr.sh script for running the DeepSimulator pipeline needed modifications to work with conda v4+.
Basecalling was excluded from the pipeline when running benchmarks.
To generate simulated libraries for comparison with Squigulator, the following commands were run:

## for context-independent mode:


deep_simulator.sh -i na12878_chr22_1.fa -o chr22_1_context_ind -n 67500 -l 10800 -c 16
deep_simulator.sh -i na12878_chr22_2.fa -o chr22_2_context_ind -n 67500 -l 10800 -c 16
## for context-dependent mode:
deep_simulator.sh -i na12878_chr22_1.fa -o na12878_chr22_1_context_dep -n 67500 -l 10800 -M 0
deep_simulator.sh -i na12878_chr22_2.fa -o na12878_chr22_2_context_dep -n 67500 -l 10800 -M 0

The modified deep_simulator.sh scripts can be found here: https://github.com/Psy-Fer/DeepSimulator_benchmark

[FigS1 image: genome browser view with three tracks (Experimental, Squigulator, DeepSimulator); blue markers indicate simulated NA12878 SNVs, red markers indicate false-positive SNVs in the DeepSimulator track.]
FigS1. Comparison of Squigulator and DeepSimulator to real experimental ONT data. Genome browser view shows
basecalled reads (Guppy SUP model) aligned to the human reference genome (hg38). The top track shows real
experimental data from ONT sequencing of NA12878 genomic DNA (R9.4.1 PromethION flow cells). The middle track
shows simulated NA12878 data from Squigulator with -x dna-r9-prom pre-set configuration. The bottom track shows
simulated NA12878 data from DeepSimulator running in context-independent mode. Blue triangle markers show the
location of NA12878 SNVs that were incorporated into the simulation, and are correctly detected by Clair3. Red
triangle markers show the presence of systematic errors in basecalled reads from DeepSimulator, which are
erroneously detected as SNVs by Clair3.
[FigS2 images: (a) histograms of per-read basecalling accuracy (read:reference identity, 0.80-1.00) on real experimental data for the Guppy SUP, HAC and FAST models; (b) median read:reference identity versus mean dwell-time (0-50) for the fast, hac and sup models; (c) Clair3 SNV F-score versus mean dwell-time (0-50) for the fast, hac and sup models.]

FigS2. Parameter exploration regarding Guppy basecalling sequencing accuracy. (a) Guppy basecalling accuracy, as
measured by read:reference identity score distributions, on real experimental NA12878 data with Guppy’s FAST, HAC
or SUP models. (b) Guppy basecalling accuracy, as measured by read:reference identity score medians, for repeated
experiments in which the mean dwell time (--dwell-mean) is varied, while other parameters are held at default.
Experiment was repeated with FAST, HAC and SUP basecalling models. Default value --dwell-mean=9 (for R9.4.1 flow
cell). (c) Accuracy of SNV detection, as measured by F-score, by Clair3 on the same datasets and basecalling
models as above (colours are matched).
Supplementary Table 1: Comparison of Minimap2 alignment statistics for experimental vs simulated
NA12878 datasets. Basecalled data was generated using Guppy HAC model.

Metric                NA12878 experimental   NA12878 simulated
sequences             135,083                134,999
reads mapped          134,001                134,987
reads unmapped        1,082                  12
reads MQ0             661                    153
total length          1,458,924,348          1,430,013,633
bases mapped (cigar)  1,491,898,469          1,430,001,381
mismatches            154,930,196            76,290,809
error rate            1.04E-01               5.34E-02
average length        10800                  10592
maximum length        187345                 87341
average quality       20.1                   18
insertions (1-base)   11,508,645             10,218,115
deletions (1-base)    16,023,165             16,965,569

Supplementary Table 2: Comparison of Clair3 and Nanopolish SNV detection statistics for experimental vs
simulated NA12878 datasets. Basecalled data was generated using Guppy SUP model. TP (base) and TP (call)
are true positives counted against the baseline and call representations, respectively (rtg vcfeval);
FP = false positives; FN = false negatives.

Data          Caller      Score threshold  TP (base)  FP    TP (call)  FN    Precision  Sensitivity  F-measure
Experimental  Clair3      None             34302      118   34304      160   0.9966     0.9954       0.996
Experimental  Clair3      2.15             34302      118   34304      160   0.9966     0.9954       0.996
Experimental  Nanopolish  None             32743      1532  32735      1719  0.9553     0.9501       0.9527
Experimental  Nanopolish  20.9             32713      1479  32705      1749  0.9567     0.9492       0.953
Simulated     Clair3      None             33881      403   33883      581   0.9882     0.9831       0.9857
Simulated     Clair3      6.48             33804      255   33807      658   0.9925     0.9809       0.9867
Simulated     Nanopolish  None             33418      317   33409      1044  0.9906     0.9697       0.98
Simulated     Nanopolish  21.7             33418      316   33409      1044  0.9906     0.9697       0.9801
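The precision, sensitivity and F-measure columns follow rtg vcfeval's standard definitions. As a sketch, the Experimental/Clair3 (no threshold) row can be reproduced from the raw counts:

```python
# Reproducing the Experimental / Clair3 (no threshold) row from the table above,
# using rtg vcfeval's standard metric definitions.
tp_call, tp_baseline, fp, fn = 34304, 34302, 118, 160

precision = tp_call / (tp_call + fp)
sensitivity = tp_baseline / (tp_baseline + fn)
f_measure = 2 * precision * sensitivity / (precision + sensitivity)

print(round(precision, 4), round(sensitivity, 4), round(f_measure, 4))
# 0.9966 0.9954 0.996
```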

Supplementary Table 3: Comparison of Squigulator and DeepSimulator run-time and memory usage.
Run times and peak RAM usage were measured during simulations of NA12878 data from chr22 at ~30X using
16 CPUs.

                Squigulator      DeepSimulator            DeepSimulator
                                 (context-independent)    (context-dependent)
Execution time  156.194 seconds  3939.58 seconds          130.8 hours
Peak RAM usage  0.477 GB         2.117 GB                 49.39 GB
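The relative run-times implied by the table can be computed directly (a simple arithmetic check, not a reported result):

```python
# Relative run-times implied by the execution times in the table above.
squigulator_s = 156.194
ds_context_ind_s = 3939.58
ds_context_dep_s = 130.8 * 3600  # hours to seconds

print(f"{ds_context_ind_s / squigulator_s:.0f}x")  # ~25x slower than Squigulator
print(f"{ds_context_dep_s / squigulator_s:.0f}x")  # ~3015x slower than Squigulator
```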

