Gamgee: A C++14
library for genomic data
processing and analysis
Mauricio Carneiro, PhD
Group Lead, Computational Technology Development
Broad Institute
Talk breakdown
• An overview of genetics data and how complex disease research became a big data problem
• The first C++ example that steered us away from Java
• Gamgee: the C++14 library memory model and examples
• Performance comparisons with the old Java framework
• Discussion of C++11/14 features used in the library and how they affected development
To fully understand one genome we need hundreds of thousands of genomes
Rare Variant Association Study (RVAS) vs. Common Variant Association Study (CVAS)
Improving human health in 5 easy steps
• Disease genetics: Many simple and complex human diseases are heritable. Pick one.
• Large-scale sequencing: Affected and unaffected individuals differ systematically in their genetic composition.
• Association studies: These systematic differences can be identified by comparing affected and unaffected individuals.
• Functional studies: These associated variants give insight into the biological mechanisms of disease.
• Therapeutics and drugs: These insights can be used to intervene in the disease process itself.
The Importance of Scale… Early Success Stories (at 1,000s of exomes)

Type 2 Diabetes
• 5,000 exomes
• SLC30A8 (beta-cell-specific Zn++ transporter)
• 3-fold protection against T2D!
• 1 LoF per 1,500 people

Schizophrenia
• 13,000 exomes
• Pathways: activity-regulated cytoskeletal (ARC) complex of the post-synaptic density (PSD)
• Voltage-gated Ca++ channel
• 13-21% risk in carriers
• Collection of rare disruptive mutations (~1/10,000 carrier frequency)

Coronary Heart Disease
• 3,700 exomes
• APOC3
• 2.5-fold protection from CHD
• 4 rare disruptive mutations (~1 in 200 carrier frequency)

Early Heart Attack
• 5,000 exomes
• APOA5
• 2.2-fold risk in carriers
• 0.5% rare disruptive / deleterious alleles
Broad Institute in 2013
• 50 HiSeqs • 10 MiSeqs
• 2 NextSeqs • 14 HiSeq X
• 6.5 Pb of data • 427 projects
• 180 people • 2.1 Tb/day
* we also own 1 PacBio RS and 4 Ion Torrent for experimental use
Broad Institute in 2013
• 44,130 exomes • 2,484 exome express
• 2,247 genomes • 2,247 assemblies
• 8,189 RNA • 9,788 16S
• 47,764 arrays • 228 cell lines
Terabases of Data Produced by Year
[Bar chart, terabases produced per year: 2009: 22.8; 2010: 153.8; 2011: 302.8; 2012: 362.4; 2013: 660; 2014 (projected): 2,064.]
…and these numbers will continue
to grow faster than Moore’s law
GATK is both a toolkit and a programming framework, enabling NGS analysis by scientists worldwide
Toolkit & framework packages
• Toolkit: Best Practices for variant discovery
• Framework: MuTect, XHMM, GenomeSTRiP, ... (tools developed on top of the GATK framework by other groups)
• Extensive online documentation & user support forum serving >10K users worldwide
http://www.broadinstitute.org/gatk
Workshop series educates local and worldwide audiences

Past:
• Dec 4-5 2012, Boston
• July 9-10 2013, Boston
• July 22-23 2013, Israel
• Oct 21-22 2013, Boston
• March 3-5 2014, Thailand
• June 6-9 2014, Belgium

Upcoming:
• Sep 17-18 2014, Philadelphia
• Oct 18-29 2014, San Diego

Format:
• Lecture series (general audience)
• Hands-on sessions (for beginners)

Portfolio of workshop modules:
• GATK Best Practices for Variant Calling
• Building Analysis Pipelines with Queue
• Third-party tools: GenomeSTRiP, XHMM

• High levels of satisfaction reported by users in polls
• Detailed feedback helps improve further iterations

Tutorial materials, slide decks and videos all available online through the GATK website, YouTube and iTunesU
We have defined the best practices
for sequencing data processing
Auwera, GA et al. Current Protocols in Bioinformatics (2013)
To fully understand one genome we need hundreds of thousands of genomes
Rare Variant Association Study (RVAS) vs. Common Variant Association Study (CVAS)
The motivating example
Joint genotyping is an important
step in Variant Discovery
Auwera, GA et al. Current Protocols in Bioinformatics (2013)
The ideal database for RVAS and CVAS
studies is a complete mutation matrix
All case and control samples; genotypes for ~3M variants.

Site          Variant   Sample 1         Sample 2         …   Sample N
SNP 1:1000    A/C       0/0 (0,10,100)   0/1 (20,0,200)   …   0/0 (0,100,255)
Indel 1:1050  T/TC      0/0 (0,10,100)   0/0 (0,20,200)   …   1/0 (255,0,255)
SNP 1:1100    T/G       0/0 (0,10,100)   0/1 (20,0,200)   …   0/0 (0,100,255)
…             …         …                …                …   …
SNP X:1234    G/T       0/1 (10,0,100)   0/1 (20,0,200)   …   1/1 (255,100,0)

Genotypes: 0/0 = ref, 0/1 = het, 1/1 = hom-alt
Likelihoods A/B/C: phred-scaled probability of hom-ref (A), het (B) and hom-alt (C) genotypes given the NGS data
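The PL triplets in the matrix are phred-scaled: PL = -10 * log10(P). A small helper shows how to recover normalized genotype probabilities from them; this is an illustrative sketch, not part of any library API:

```cpp
#include <array>
#include <cmath>

// Convert phred-scaled genotype likelihoods (PL) back to normalized
// probabilities: PL = -10 * log10(P), so P = 10^(-PL / 10).
std::array<double, 3> pl_to_probabilities(const std::array<int, 3>& pl) {
    auto probs = std::array<double, 3>{};
    auto sum = 0.0;
    for (auto i = 0u; i < 3; ++i) {
        probs[i] = std::pow(10.0, -pl[i] / 10.0);
        sum += probs[i];
    }
    for (auto& p : probs)  // normalize so the three genotypes sum to 1
        p /= sum;
    return probs;
}
```

For the first entry in the table, PL = 0,10,100 yields a confident hom-ref call (probability near 0.91 after normalization).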
Identifying mutations in a genome is a
simple “find the differences” problem
Unfortunately, real data
doesn’t look that simple
Variant calling is a large-scale Bayesian modeling problem
[Figure: variant calling as joint Bayesian estimation. The prior comes from the allele frequency; the likelihood comes from diploid genotype likelihoods computed from each sample's reads. A joint estimate across individuals 1..N yields SNPs and indels together with genotype frequencies.]
DePristo et al. Nature Genetics (2011)
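The joint model above can be illustrated with a toy per-sample posterior: a Hardy-Weinberg prior derived from the population allele frequency multiplying the genotype likelihoods from the reads. This is a simplified sketch, not the GATK's actual implementation:

```cpp
#include <array>

// Sketch of a per-sample genotype posterior as used in joint calling:
// Hardy-Weinberg prior from the population alt-allele frequency times
// the genotype likelihoods computed from the sample's reads.
std::array<double, 3> genotype_posterior(double alt_freq,
                                         const std::array<double, 3>& likelihoods) {
    const auto p = alt_freq;
    const std::array<double, 3> prior = {(1 - p) * (1 - p),  // hom-ref
                                         2 * p * (1 - p),    // het
                                         p * p};             // hom-alt
    auto post = std::array<double, 3>{};
    auto norm = 0.0;
    for (auto g = 0u; g < 3; ++g) {
        post[g] = prior[g] * likelihoods[g];
        norm += post[g];
    }
    for (auto& x : post)  // normalize to a proper distribution
        x /= norm;
    return post;
}
```

With a rare allele (say 1%), ambiguous read evidence is pulled strongly toward hom-ref by the prior, which is exactly why joint estimation across many samples matters.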
Understanding the Haplotype Caller
1. Active region traversal identifies the regions that need to be reassembled
2. Local de-novo assembly builds the most likely haplotypes for evaluation
3. Pair-HMM evaluation of all reads against all haplotypes (scales exponentially)
4. Genotyping using the exact model
7.6 CPU-days per genome
Pair-HMM is the biggest culprit for the
low performance of the Haplotype Caller
Stage                    Time      Runtime %
Assembly                 2,598s    13%
Pair-HMM                 14,225s   70%
Traversal + Genotyping   3,379s    17%
times are for chromosome 20 on a single core
Understanding the Pair-HMM
[Figure: data dependencies of each cell in each of the three matrices (states): M (match), I (insertion) and D (deletion). Each cell depends on its upper, left and upper-left neighbors across the three matrices.]
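The recurrence behind these dependencies can be sketched as follows. The transition and emission constants (gap_open, gap_ext, match_p) are illustrative placeholders, not the GATK's calibrated, per-base-quality values:

```cpp
#include <string>
#include <vector>

// Minimal Pair-HMM forward recurrence: cell (i, j) of the Match,
// Insertion and Deletion matrices depends only on cells (i-1, j-1),
// (i-1, j) and (i, j-1), as the dependency figure shows.
double pair_hmm_forward(const std::string& read, const std::string& hap) {
    const auto n = read.size() + 1, m = hap.size() + 1;
    const double gap_open = 0.001, gap_ext = 0.1, match_p = 0.99;
    using Matrix = std::vector<std::vector<double>>;
    auto M = Matrix(n, std::vector<double>(m, 0.0));
    auto I = M, D = M;
    for (auto j = 0u; j < m; ++j)
        M[0][j] = 1.0 / hap.size();  // free start anywhere on the haplotype
    for (auto i = 1u; i < n; ++i) {
        for (auto j = 1u; j < m; ++j) {
            const auto emit = read[i - 1] == hap[j - 1] ? match_p : 1 - match_p;
            M[i][j] = emit * ((1 - 2 * gap_open) * M[i - 1][j - 1]
                              + (1 - gap_ext) * (I[i - 1][j - 1] + D[i - 1][j - 1]));
            I[i][j] = gap_open * M[i - 1][j] + gap_ext * I[i - 1][j];
            D[i][j] = gap_open * M[i][j - 1] + gap_ext * D[i][j - 1];
        }
    }
    auto result = 0.0;  // free end: sum over the last row
    for (auto j = 0u; j < m; ++j) result += M[n - 1][j] + I[n - 1][j];
    return result;
}
```

Every cell does a handful of multiply-adds with no data reuse beyond the previous row, which is why this kernel maps so well onto AVX, FPGA and GPU hardware in the table below.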
Heterogeneous compute speeds
up variant calling significantly
Technology   Hardware                    Runtime (s)   Improvement
-            Java (GATK 2.8)             10,800        -
-            C++ (baseline)              1,267         9x
FPGA         Convey Computers HC2        834           13x
AVX          Intel Xeon 1-core           309           35x
GPU          NVidia GeForce GTX 670     288           38x
GPU          NVidia GeForce GTX 680     274           40x
GPU          NVidia GeForce GTX 480     190           56x
GPU          NVidia GeForce GTX Titan   80            135x
GPU          NVidia Tesla K40           70            154x
The rest of the pipeline is
also not scaling well
Auwera, GA et al. Current Protocols in Bioinformatics (2013)
It takes 2 days to process a
single genome!
step                 threads   time (h)
BWA                  24        7
samtools view        1         2
sort + index         1         3
MarkDuplicates       1         11
RealignTargets       24        1
IndelRealigner       24        6.5
BaseRecalibrator     24        1.3
PrintReads + index   24        12.3
Total                          44
Processing is a big cost on
whole genome sequencing
And it is never I/O bound
The GATK Java codebase has severe limitations
• More than 70% of the instructions in the current GATK pipeline are memory accesses; the processor is just waiting.
• Excessive use of strings, maps and sets to handle basic data structures that are frequently used in the codebase.
• Java makes it extremely difficult to exploit memory contiguity in its data structures.
• Java's floating point model is incompatible with modern x86 hardware.
• Java does not offer access to the hardware for optimizations even when desired. As a result, we are forced to underutilize modern hardware.
A typical GATK-Java Data Structure:
A Map-of-Maps-of-Maps
Map<String, PerReadAlleleLikelihoodMap> map;

public class PerReadAlleleLikelihoodMap {
    protected Map<GATKSAMRecord, Map<Allele, Double>> likelihoodReadMap =
        new LinkedHashMap<>();
    ...
}
No data locality – most lookups will consist of a series of cache misses
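One cache-friendly alternative is a single contiguous buffer indexed by (read, allele), so a lookup is one arithmetic index instead of a chain of hash probes. This is an illustrative sketch, not Gamgee's actual layout; the class and member names are hypothetical:

```cpp
#include <cstddef>
#include <vector>

// Cache-friendly alternative to the map-of-maps-of-maps: one contiguous,
// row-major buffer of likelihoods, indexed by (read, allele).
class ReadAlleleLikelihoods {
    std::size_t m_n_alleles;
    std::vector<double> m_data;  // reads x alleles, contiguous in memory
public:
    ReadAlleleLikelihoods(std::size_t n_reads, std::size_t n_alleles)
        : m_n_alleles{n_alleles}, m_data(n_reads * n_alleles, 0.0) {}

    // one indexed load, no hashing, no pointer chasing
    double& operator()(std::size_t read, std::size_t allele) {
        return m_data[read * m_n_alleles + allele];
    }
};
```

Iterating over all reads for one allele (or vice versa) walks memory with a constant stride, which the hardware prefetcher handles well, unlike the LinkedHashMap chains above.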
To fully understand one genome we need hundreds of thousands of genomes
Rare Variant Association Study (RVAS) vs. Common Variant Association Study (CVAS)
How we are using C++ to
address these issues
Gamgee memory model
[Diagram: a Sam object holds a shared_ptr to the shared raw data buffer; its Bases, Cigar and Quals objects each hold their own shared_ptr into that same buffer, so no field is ever copied out.]
in-memory representation is the same as the on-disk binary representation
Gamgee memory model
[Diagram: a Variant object holds shared_ptrs to two raw data buffers: a site-level buffer backing SharedFields and Filters (chrom, pos, id, ref, alleles, qual, filters, info fields 1..N) and a sample-level buffer backing IndividualFields, Genotypes and Alleles (format fields f1..fN for each sample).]
in-memory representation is the same as the on-disk binary representation
VariantBuilder is optimized to preserve data locality and avoid dynamic allocation as much as possible when building records
• Small, inline, fixed-size buffers accommodate typical field values, avoiding per-field dynamic allocations and promoting data locality
• Same idea as the Short String Optimization (SSO) in std::string; almost impossible to achieve in Java
• The rare field values that don't fit are separately allocated in a std::vector<VariantBuilderDataField>
[Chart: time to create 3,000,000 variant records in VariantBuilder, with and without the data locality optimizations: the optimized version is about 2x faster.]
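The small-buffer idea can be sketched as follows; the class and field names are hypothetical, not the Gamgee API:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// SSO-style field storage: values that fit in a fixed inline array stay
// local to the object; only oversized values fall back to the heap.
class SmallField {
    static constexpr auto INLINE_CAPACITY = 16u;
    int32_t m_inline[INLINE_CAPACITY];   // typical case: no allocation
    std::vector<int32_t> m_overflow;     // rare case: one heap allocation
    uint32_t m_size = 0;
public:
    void set(const std::vector<int32_t>& values) {
        m_size = static_cast<uint32_t>(values.size());
        if (m_size <= INLINE_CAPACITY)
            std::copy(values.begin(), values.end(), m_inline);
        else
            m_overflow = values;         // the uncommon overflow path
    }
    const int32_t* data() const {
        return m_size <= INLINE_CAPACITY ? m_inline : m_overflow.data();
    }
    uint32_t size() const { return m_size; }
};
```

When a builder holds an array of such fields, the common-case values all sit in one contiguous block of the builder object itself, which is exactly the locality the chart above measures.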
Reading BAM files is 17x
faster in gamgee
[Charts: runtime in seconds for reading BAM files of three sizes (2 MB, 2 GB and 56 GB whole exome), comparing gamgee (C++) against the GATK (Java); the Java version is roughly 17x slower.]
(java)
Reading variant files is
much faster in gamgee
2 GB (1KG)                 GATK C++   GATK Java
Text Variant File (VCF)    32.71s     137.57s
Binary Variant File (BCF)  4.61s      242.33s

the new memory model makes the binary version of the file extremely fast to read and write
MarkDuplicates is 5x faster
          GATK C++   new Picard (Java)   old Picard (Java)
Exome     4m         20m                 2h23m
Genome    1h15m      4h47m               11h06m

new Picard: the exact same implementation, ported to Java after our C++ version was presented
To fully understand one genome we need hundreds of thousands of genomes
Rare Variant Association Study (RVAS) vs. Common Variant Association Study (CVAS)
C++11/14
AAA makes it easy to change interfaces

Gamgee library public API code:

// first implementation: quick and dirty
vector<vector<int32_t>> integer_individual_field(const string& tag) const;
vector<Genotype> genotypes() const;

// after refactor: avoid unnecessary copies of shared data
IndividualField<IndividualFieldValue<int32_t>> integer_individual_field(const string& tag) const;
IndividualField<Genotype> genotypes() const;

Client code written before the API change never had to change:

// count variants, skip low-quality genotypes
for (const auto& record : svr) {
  const auto quals = record.integer_individual_field("GQ");
  const auto genotypes = record.genotypes();
  for (auto i = 0u; i != record.n_samples(); ++i) {
    if (!missing(quals[i][0]) && quals[i][0] >= m_min_qual &&
        (genotypes[i].het() || genotypes[i].hom_var())) {
      nvar[i]++;
    }
  }
}

Diligent use of auto has already saved us from modifying client code as the library changes underneath it. (Thanks, Herb!)
Smart pointers make interfacing
with C libraries manageable
class Sam {
 private:
  std::shared_ptr<bam1_t> m_body;

 public:
  Cigar cigar() const { return Cigar{m_body}; }
  ReadBases bases() const { return ReadBases{m_body}; }
  BaseQuals base_quals() const { return BaseQuals{m_body}; }
};

Sharing the pointers allocated in the C library across different objects is taken care of by the shared_ptr.
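The pattern behind this is a shared_ptr constructed with the C library's destructor as a custom deleter. bam_init1/bam_destroy1 are htslib's allocator and destructor for bam1_t; stand-in definitions are provided here so the sketch compiles on its own:

```cpp
#include <memory>

// Stand-ins for htslib's bam1_t, bam_init1() and bam_destroy1(),
// so this sketch is self-contained.
struct bam1_t { int core; };
bam1_t* bam_init1() { return new bam1_t{}; }
void bam_destroy1(bam1_t* b) { delete b; }

// Adopt the C-allocated record into shared ownership: every Sam, Cigar,
// ReadBases, etc. copies the shared_ptr, and bam_destroy1 runs exactly
// once, when the last owner lets go.
std::shared_ptr<bam1_t> make_shared_bam() {
    return std::shared_ptr<bam1_t>{bam_init1(), bam_destroy1};
}
```

The deleter travels with the pointer, so no class in the hierarchy needs to know how the C library frees its records.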
Writing tools to perform operations
on variants is very simple
percent_missing.cpp

#include "gamgee/gamgee.h"
#include <algorithm>
#include <iostream>

int main() {
  const auto min_qual = 20;  // threshold; was an undefined member variable in the original slide
  for (const auto& record : SingleVariantReader{"file.bcf"}) {
    const auto g_quals = record.integer_individual_field("GQ");
    const auto n_bad_gs = count_if(g_quals.begin(), g_quals.end(),
        [&](const auto& x) { return missing(x[0]) || x[0] < min_qual; });
    const auto percent_miss = double(n_bad_gs) / g_quals.size() * 100;
    cout << percent_miss << endl;
  }
}
see http://broadinstitute.github.io/gamgee/doxygen/ for the full VARIANT API
Writing tools to perform operations
on read data is very simple
insert_size_distribution.cpp
#include "gamgee/gamgee.h"
#include <iostream>
#include <numeric>

int main() {
  for (const auto& record : SingleSamReader{"input.bam"}) {
    const auto bqs = record.base_quals();
    const auto total = accumulate(bqs.begin(), bqs.end(), 0.0);
    cout << total / bqs.size() << endl;  // mean base quality per read
  }
}
see http://broadinstitute.github.io/gamgee/doxygen/ for the full SAM API
select_if enables functional
style programming across samples
variant.h
template <class VALUE, template<class> class ITER>
static boost::dynamic_bitset<> select_if(
    const ITER<VALUE>& first,
    const ITER<VALUE>& last,
    const std::function<bool (const decltype(*first)& value)> pred)
{
  const auto n_samples = last - first;
  auto selected_samples = boost::dynamic_bitset<>(n_samples);
  auto it = first;
  for (auto i = 0; i != n_samples; ++i)
    selected_samples[i] = pred(*it++);
  return selected_samples;
}
applies a predicate over a Container and selects those that pass in a dynamic bitset
select_if statements make it trivial to
parallelize batch operations over samples
indel_length.cpp
auto select_high_quality_variants(const Variant& var, const int32_t q) {
  const auto quals = var.integer_individual_field("GQ");
  const auto genotypes = var.genotypes();

  const auto pass_qual = select_if(quals.begin(), quals.end(),
      [&q](const auto& gq) { return gq[0] > q; });

  const auto is_var = select_if(genotypes.begin(), genotypes.end(),
      [](const auto& g) { return !g.missing() && !g.hom_ref(); });

  return pass_qual & is_var;
}
multiple select_if operations can be easily parallelized with std::async
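A sketch of that parallelization, using std::vector<bool> in place of boost::dynamic_bitset so the example is dependency-free; the data and predicates are illustrative:

```cpp
#include <functional>
#include <future>
#include <vector>

using Bits = std::vector<bool>;

// Simplified select_if over a plain container of per-sample values.
Bits select_if(const std::vector<int>& values,
               const std::function<bool(int)>& pred) {
    auto out = Bits(values.size());
    for (auto i = 0u; i != values.size(); ++i) out[i] = pred(values[i]);
    return out;
}

// Run two independent per-sample selections concurrently, then AND the
// resulting bitsets, as in pass_qual & is_var above.
Bits run_both(const std::vector<int>& quals, const std::vector<int>& depths) {
    auto f1 = std::async(std::launch::async,
        [&] { return select_if(quals,  [](int q) { return q > 20; }); });
    auto f2 = std::async(std::launch::async,
        [&] { return select_if(depths, [](int d) { return d > 10; }); });
    const auto a = f1.get();
    const auto b = f2.get();
    auto both = Bits(a.size());
    for (auto i = 0u; i != a.size(); ++i) both[i] = a[i] && b[i];
    return both;
}
```

Because each select_if reads its own container and writes its own bitset, the tasks share no mutable state and need no locking.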
A lambda configurable class
for locus level operations
locus_coverage.h
class LocusCoverage {
 public:
  LocusCoverage(
      // (1) called once per window with the accumulated per-locus coverage
      const std::function<uint32_t (
          const std::vector<uint32_t>& locus_coverage,
          const uint32_t chr,
          const uint32_t start,
          const uint32_t stop)>& window_op,

      // (2) called once per covered locus; defaults to counting each read once
      const std::function<uint32_t (const uint32_t)>& locus_op =
          [](const auto) { return 1; }
  );

  void add_read(const Sam& read);
  void flush() const;
  ...
};
Coverage distribution tool: functional style
coverage_distribution.cpp
using Histogram = std::vector<uint32_t>;
constexpr auto MAX_COV = 50'000u;

int main() {
  auto hist = Histogram(MAX_COV, 0u);

  auto window_op = [&hist](const auto& lcov, const auto,
                           const auto start, const auto stop)
  {
    std::for_each(lcov.begin() + start,
                  lcov.begin() + stop + 1,
                  [&hist](const auto& coverage)
                  {
                    ++hist[std::min(coverage, MAX_COV - 1)];
                  });
    return stop;
  };

  auto reader = SingleSamReader{"file.bam"};
  auto state = LocusCoverage{window_op};

  for_each(reader.begin(), reader.end(),
      [&state](const auto& read) { if (!read.unmapped()) state.add_read(read); });

  output_coverage_histogram(hist);
}
The future of the GATK
[Diagram: the planned licensing split. Libraries and frameworks, under the MIT License: gamgee plus the GATK tool-developer frameworks in C++ and Java. Toolkits built on top of them, in C++ and Java, under the GATK License.]
Research tools need this scalability for
the next wave of scientific advances
Data Processing from DNA to Variants: ready for ~1 million genomes (will need more work to reach tens to hundreds of millions)
Variant analysis and association studies: fails today at just a few thousand genomes
Post-calling pipeline standardization and scaling is the next big challenge
• Tools are not generalized and performance does not scale (typically written in MATLAB, R, Perl and Python…).
• Most code is written by one grad student/postdoc and is no longer maintained.
• Not standardized.
• Analyses are very often unrepeatable.
• Complementary data types are not standardized (e.g. phenotypic data).
This is the work of many…

The team: Eric Banks, Ryan Poplin, Khalid Shakir, David Roazen, Joel Thibault, Geraldine VanDerAuwera, Ami Levy-Moonshine, Valentin Rubio, Bertrand Haas, Laura Gauthier, Christopher Wheelan, Sheila Chandran

Broad colleagues: Heng Li, Daniel MacArthur, Timothy Fennel, Steven McCarrol, Mark Daly, Sheila Fisher, Stacey Gabriel, David Altshuler, Menachem Fromer

Collaborators: Paolo Narvaez, Diego Nehab