Gamgee - A C++14 Library For Genomics Data Processing and Analysis - Mauricio Carneiro - CppCon 2014

Gamgee: A C++14 library for genomic data processing and analysis

Mauricio Carneiro, PhD
Group Lead, Computational Technology Development
Broad Institute

Talk breakdown
• An overview of genetics data and how complex disease research became a big data problem
• The first C++ example that steered us away from Java
• Gamgee: the C++14 library memory model and examples
• Performance comparisons with the old Java framework
• Discussion of C++11/14 features used in the library and how they affected development
To fully understand one genome we need hundreds of thousands of genomes

Rare Variant Association Study (RVAS) vs. Common Variant Association Study (CVAS)
Improving human health in 5 easy steps

Disease genetics: Many simple and complex human diseases are heritable. Pick one.
Large scale sequencing: Affected and unaffected individuals differ systematically in their genetic composition.
Association studies: These systematic differences can be identified by comparing affected and unaffected individuals.
Functional studies: These associated variants give insight into the biological mechanisms of disease.
Therapeutics and drugs: These insights can be used to intervene in the disease process itself.
The Importance of Scale… Early Success Stories (at 1,000s of exomes)

Type 2 Diabetes
• 5,000 exomes
• SLC30A8 (beta-cell-specific Zn++ transporter)
• 3-fold protection against T2D!
• 1 LoF per 1,500 people

Schizophrenia
• 13,000 exomes
• Pathways: activity-regulated cytoskeletal (ARC) of post-synaptic density complex (PSD)
• Voltage-gated Ca++ channel
• 13-21% risk in carriers
• Collection of rare disruptive mutations (~1/10,000 carrier frequency)

Coronary Heart Disease
• 3,700 exomes
• APOC3
• 2.5-fold protection from CHD
• 4 rare disruptive mutations (~1 in 200 carrier frequency)

Early Heart Attack
• 5,000 exomes
• APOA5
• 22% risk in carriers
• 0.5% rare disruptive / deleterious alleles
Broad Institute in 2013
• 50 HiSeqs, 10 MiSeqs, 2 NextSeqs, 14 HiSeq X
• 6.5 Pb of data, 427 projects
• 180 people, 2.1 Tb/day

* we also own 1 PacBio RS and 4 Ion Torrent for experimental use


Broad Institute in 2013
• 44,130 exomes, 2,484 exome express
• 2,247 genomes, 2,247 assemblies
• 8,189 RNA, 9,788 16S
• 47,764 arrays, 228 cell lines
Terabases of Data Produced by Year

2009: 22.8 | 2010: 153.8 | 2011: 302.8 | 2012: 362.4 | 2013: 660 | 2014 (projected): 2,064
(chart annotation: ~300 petabytes)
…and these numbers will continue
to grow faster than Moore’s law
GATK is both a toolkit and a programming framework, enabling NGS analysis by scientists worldwide

Toolkit & framework packages:
• Toolkit: best practices for variant discovery
• Framework: MuTect, XHMM, GenomeSTRiP, ... (tools developed on top of the GATK framework by other groups)

Extensive online documentation & user support forum serving >10K users worldwide

http://www.broadinstitute.org/gatk
Workshop series educates local and worldwide audiences

Past:
• Dec 4-5 2012, Boston
• July 9-10 2013, Boston
• July 22-23 2013, Israel
• Oct 21-22 2013, Boston
• March 3-5 2014, Thailand
• June 6-9 2014, Belgium

Upcoming:
• Sep 17-18 2014, Philadelphia
• Oct 18-29 2014, San Diego

Format:
• Lecture series (general audience)
• Hands-on sessions (for beginners)

Portfolio of workshop modules:
• GATK Best Practices for Variant Calling
• Building Analysis Pipelines with Queue
• Third-party tools: GenomeSTRiP, XHMM

• High levels of satisfaction reported by users in polls
• Detailed feedback helps improve further iterations

Tutorial materials, slide decks and videos all available online through the GATK website, YouTube and iTunesU
We have defined the best practices
for sequencing data processing

Auwera, GA et al. Current Protocols in Bioinformatics (2013)


To fully understand one genome we need hundreds of thousands of genomes

Rare Variant Association Study (RVAS) vs. Common Variant Association Study (CVAS)

The motivating example


Joint genotyping is an important
step in Variant Discovery

Auwera, GA et al. Current Protocols in Bioinformatics (2013)


The ideal database for RVAS and CVAS studies is a complete mutation matrix

All case and control samples (~3M variants):

Site          Variant   Sample 1       Sample 2       …   Sample N
SNP   1:1000  A/C       0/0 0,10,100   0/1 20,0,200   …   0/0 0,100,255
Indel 1:1050  T/TC      0/0 0,10,100   0/0 0,20,200   …   1/0 255,0,255
SNP   1:1100  T/G       0/0 0,10,100   0/1 20,0,200   …   0/0 0,100,255
…     …       …         …              …              …   …
SNP   X:1234  G/T       0/1 10,0,100   0/1 20,0,200   …   1/1 255,100,0

Genotypes: 0/0 ref, 0/1 het, 1/1 hom-alt
Likelihoods: A/B/C phred-scaled probability of hom (A), het (B), hom-alt (C) genotypes given NGS data
Identifying mutations in a genome is a
simple “find the differences” problem
Unfortunately, real data
doesn’t look that simple
Variant calling is a large-scale Bayesian modeling problem
[Figure: sample-associated reads from individuals 1..N yield genotype likelihoods (diploid model); an allele frequency prior is combined with the likelihoods in a joint estimate of SNPs, indels, and genotype frequencies.]

DePristo et al. Nature Genetics (2011)


Understanding the Haplotype Caller

1. Active region traversal identifies the regions that need to be reassembled
2. Local de-novo assembly builds the most likely haplotypes for evaluation
3. Pair-HMM evaluation of all reads against all haplotypes (scales exponentially)
4. Genotyping using the exact model

7.6 CPU-days per genome

Pair-HMM is the biggest culprit for the low performance of the Haplotype Caller

Stage                   Time      Runtime %
Assembly                2,598s    13%
Pair-HMM                14,225s   70%
Traversal + Genotyping  3,379s    17%

times are for chromosome 20 on a single core


Understanding the Pair-HMM

Data dependencies of each cell in each of the three matrices (states): M (match), I (insertion), D (deletion)
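That dependency pattern (each M cell depends on the diagonal neighbors of all three matrices, I on the cell above, D on the cell to the left) can be sketched as a minimal pair-HMM forward recurrence. This is an illustrative toy with made-up transition and emission parameters, not the GATK implementation:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy pair-HMM forward pass: probability of a read given a haplotype.
// Transition/emission parameters are illustrative, not GATK's.
double pair_hmm_forward(const std::string& read, const std::string& hap) {
    const double tMM = 0.9, tMI = 0.05, tMD = 0.05;   // from Match
    const double tII = 0.1, tIM = 0.9;                 // from Insertion
    const double tDD = 0.1, tDM = 0.9;                 // from Deletion
    const double p_match = 0.99, p_mismatch = 0.01;    // emission in M
    const double p_gap = 0.25;                         // emission in I

    const auto n = read.size(), m = hap.size();
    using Matrix = std::vector<std::vector<double>>;
    Matrix M(n + 1, std::vector<double>(m + 1, 0.0)), I = M, D = M;

    // Uniform initialization: the read may start anywhere on the haplotype.
    for (size_t j = 0; j <= m; ++j) M[0][j] = 1.0 / m;

    for (size_t i = 1; i <= n; ++i) {
        for (size_t j = 1; j <= m; ++j) {
            const double emit = (read[i-1] == hap[j-1]) ? p_match : p_mismatch;
            // M[i][j] depends on the *diagonal* cell of all three matrices...
            M[i][j] = emit * (tMM * M[i-1][j-1] + tIM * I[i-1][j-1] + tDM * D[i-1][j-1]);
            // ...I[i][j] on the cell *above*, D[i][j] on the cell to the *left*.
            I[i][j] = p_gap * (tMI * M[i-1][j] + tII * I[i-1][j]);
            D[i][j] =          tMD * M[i][j-1] + tDD * D[i][j-1];
        }
    }
    double total = 0.0;  // read fully consumed, ending anywhere on the haplotype
    for (size_t j = 1; j <= m; ++j) total += M[n][j] + I[n][j];
    return total;
}
```

Because all cells on an anti-diagonal are independent of one another, implementations vectorize across diagonals; that is the structure the AVX, FPGA, and GPU ports below exploit.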
Heterogeneous compute speeds up variant calling significantly

Technology   Hardware                    Runtime (s)   Improvement
-            Java (GATK 2.8)             10,800        -
-            C++ (baseline)              1,267         9x
FPGA         Convey Computers HC-2       834           13x
AVX          Intel Xeon 1-core           309           35x
GPU          NVidia GeForce GTX 670      288           38x
GPU          NVidia GeForce GTX 680      274           40x
GPU          NVidia GeForce GTX 480      190           56x
GPU          NVidia GeForce GTX Titan    80            135x
GPU          NVidia Tesla K40            70            154x
The rest of the pipeline is
also not scaling well

Auwera, GA et al. Current Protocols in Bioinformatics (2013)


It takes 2 days to process a single genome!

Step                 Threads   Time (hours)
BWA                  24        7
samtools view        1         2
sort + index         1         3
MarkDuplicates       1         11
RealignTargets       24        1
IndelRealigner       24        6.5
BaseRecalibrator     24        1.3
PrintReads + index   24        12.3
Total                          44
Processing is a big cost on whole genome sequencing

And it is never I/O bound
The GATK Java codebase has severe limitations

• More than 70% of the instructions in the current GATK pipeline are memory accesses; the processor is just waiting.
• Excessive use of strings, maps and sets to handle basic data structures that are frequently used in the codebase.
• Java makes it extremely difficult to exploit memory contiguity in its data structures.
• The Java floating point model is incompatible with modern x86 hardware.
• Java does not offer access to the hardware for optimizations even when desired. As a result, we are forced to underutilize modern hardware.
A typical GATK-Java Data Structure: A Map-of-Maps-of-Maps

Map<String, PerReadAlleleLikelihoodMap> map;

public class PerReadAlleleLikelihoodMap {
    protected Map<GATKSAMRecord, Map<Allele, Double>> likelihoodReadMap
        = new LinkedHashMap<>();
    ...
}

No data locality: most lookups will consist of a series of cache misses
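For contrast, a contiguous layout for the same read-by-allele likelihoods is straightforward in C++: one flat buffer indexed arithmetically, so scans touch memory sequentially. A minimal sketch (the names are illustrative, not Gamgee's actual types):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Flat read-by-allele likelihood matrix: one contiguous allocation,
// indexed as read * n_alleles + allele. Scanning it is cache-friendly,
// unlike chasing pointers through a Map<Read, Map<Allele, Double>>.
class LikelihoodMatrix {
public:
    LikelihoodMatrix(std::size_t n_reads, std::size_t n_alleles)
        : m_n_alleles{n_alleles}, m_data(n_reads * n_alleles, 0.0) {}

    double& at(std::size_t read, std::size_t allele) {
        return m_data[read * m_n_alleles + allele];
    }

    // Best allele index for a read: a sequential scan over one row.
    std::size_t best_allele(std::size_t read) const {
        std::size_t best = 0;
        for (std::size_t a = 1; a != m_n_alleles; ++a)
            if (m_data[read * m_n_alleles + a] > m_data[read * m_n_alleles + best])
                best = a;
        return best;
    }

private:
    std::size_t m_n_alleles;
    std::vector<double> m_data;
};
```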


To fully understand one genome we need hundreds of thousands of genomes

Rare Variant Association Study (RVAS) vs. Common Variant Association Study (CVAS)

How we are using C++ to address these issues
Gamgee memory model

[Diagram: a Sam object holds a shared_ptr to a shared raw data block; Bases, Cigar and Quals are lightweight views that each hold a shared_ptr into the same block, where the record's fields (name, flags, position, mapq, cigar, bases, quals, tags) live contiguously.]

in-memory representation is the same as on-disk binary representation
Gamgee memory model

[Diagram: a Variant object holds shared raw data split into a site block and a samples block. SharedFields, Filters and Alleles are views over the site block (site-level fields, alleles, filters, info fields); IndividualFields and Genotypes are views over the per-sample block (format fields f1..fN per sample). All views hold shared_ptrs into the same raw buffers.]

in-memory representation is the same as on-disk binary representation

VariantBuilder is optimized to preserve data locality and avoid dynamic allocation as much as possible when building records

std::vector<VariantBuilderDataField>

Small, inline, fixed-size buffers accommodate typical field values, avoiding per-field dynamic allocations and promoting data locality. The rare field values that don't fit are separately allocated.

• Same idea as Short String Optimization (SSO) in std::string
• Almost impossible to achieve in Java
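The inline-buffer idea can be sketched as a generic small-buffer field: values up to a fixed capacity live in an array inside the object itself (and therefore inside the enclosing std::vector's contiguous storage), and only oversized values fall back to the heap. This illustrates the technique only; it is not Gamgee's VariantBuilderDataField:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Small-buffer field: short values are stored inline (no heap allocation),
// rare long values spill to a separately allocated vector.
template <std::size_t InlineCapacity = 16>
class SmallBufferField {
public:
    void set(const int32_t* values, std::size_t n) {
        m_size = n;
        if (n <= InlineCapacity) {               // common case: inline, local
            std::memcpy(m_inline, values, n * sizeof(int32_t));
        } else {                                 // rare case: heap fallback
            m_overflow.assign(values, values + n);
        }
    }

    const int32_t* data() const {
        return m_size <= InlineCapacity ? m_inline : m_overflow.data();
    }
    std::size_t size() const { return m_size; }
    bool is_inline() const { return m_size <= InlineCapacity; }

private:
    int32_t m_inline[InlineCapacity];
    std::vector<int32_t> m_overflow;             // empty unless needed
    std::size_t m_size = 0;
};
```

A std::vector of such fields keeps every typical value in one contiguous run of memory, which is exactly what the builder benchmark below measures.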
Time to create 3,000,000 variant records in VariantBuilder, with and without data locality optimizations: ~2x speedup
Reading BAM files is 17x faster in gamgee

[Chart: runtime in seconds on 2mb, 2gb, and 56gb (whole-exome) BAM files, comparing foghorn and gatk (C++) against gatk (java)]
Reading variant files is much faster in gamgee

2GB (1KG)                  GATK C++   GATK Java
Text Variant File (VCF)    32.71s     137.57s
Binary Variant File (BCF)  4.61s      242.33s

the new memory model makes the binary version of the file extremely fast to read and write
MarkDuplicates is 5x faster

          GATK C++   new Picard (Java)   old Picard (Java)
Exome     4m         20m                 2h23m
Genome    1h15m      4h47m               11h06m

exact same implementation in Java after our C++ version was presented
To fully understand one genome we need hundreds of thousands of genomes

Rare Variant Association Study (RVAS) vs. Common Variant Association Study (CVAS)

C++11/14

AAA makes it easy to change interfaces

Gamgee library public API code:

// first implementation: quick and dirty
vector<vector<int32_t>> integer_individual_field(const string& tag) const;
vector<Genotype> genotypes() const;

// after refactor -- avoid unnecessary copies of shared data
IndividualField<IndividualFieldValue<int32_t>> integer_individual_field(const string& tag) const;
IndividualField<Genotype> genotypes() const;

Client code written before the API change never had to change:

// count variants, skip low quality genotypes
for (const auto& record : svr) {
  const auto quals = record.integer_individual_field("GQ");
  const auto genotypes = record.genotypes();
  for (auto i = 0u; i != record.n_samples(); ++i) {
    if (!missing(quals[i][0]) && quals[i][0] >= m_min_qual &&
        (genotypes[i].het() || genotypes[i].hom_var())) {
      nvar[i]++;
    }
  }
}

Diligent use of auto has already saved us from modifying client code as the library changes underneath them. Thanks, Herb!
Smart pointers make interfacing with C libraries manageable

class Sam {
 private:
  std::shared_ptr<bam1_t> m_body;

 public:
  Cigar cigar() const { return Cigar{m_body}; }
  ReadBases bases() const { return ReadBases{m_body}; }
  BaseQuals base_quals() const { return BaseQuals{m_body}; }
};

Sharing the pointers allocated in the C library across different objects is taken care of by the shared_ptr
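The key trick is constructing the shared_ptr with the C library's free function as a custom deleter, so whichever view drops the last reference releases the C allocation. A sketch with a stand-in C type (htslib's real pair would be bam_init1/bam_destroy1; the names below are invented for illustration):

```cpp
#include <cassert>
#include <memory>

// Stand-in for a C-library record type (e.g. htslib's bam1_t) with
// C-style create/destroy functions and a live-object counter for testing.
struct c_record { int data; };
static int g_live_records = 0;
c_record* c_record_create() { ++g_live_records; return new c_record{}; }
void c_record_destroy(c_record* r) { --g_live_records; delete r; }

// The owning class passes the C destroy function as the shared_ptr's
// custom deleter; any number of views can then share the same allocation.
class Record {
public:
    Record() : m_body{c_record_create(), c_record_destroy} {}
    std::shared_ptr<c_record> body() const { return m_body; }
private:
    std::shared_ptr<c_record> m_body;
};
```

A Cigar or BaseQuals view built from body() keeps the underlying C record alive even after the Sam object itself is destroyed, with no manual free anywhere.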
Writing tools to perform operations on variants is very simple

percent_missing.cpp

#include "gamgee/gamgee.h"
#include <algorithm>
#include <iostream>

int main() {
  const auto min_qual = 20;  // minimum genotype quality
  for (const auto& record : SingleVariantReader{"file.bcf"}) {
    const auto g_quals = record.integer_individual_field("GQ");
    const auto n_bad_gs = count_if(g_quals.begin(), g_quals.end(),
        [&](const auto& x) { return missing(x[0]) ? true : x[0] < min_qual; });
    const auto percent_miss = double(n_bad_gs) / g_quals.size() * 100;
    cout << percent_miss << endl;
  }
}

see http://broadinstitute.github.io/gamgee/doxygen/ for the full VARIANT API

Writing tools to perform operations on read data is very simple

insert_size_distribution.cpp

#include "gamgee/gamgee.h"
#include <iostream>
#include <numeric>

constexpr auto EXPECTED_MAX_INSERT_SIZE = 5'000u;

int main() {
  for (const auto& record : SingleSamReader{"input.bam"}) {
    const auto bqs = record.base_quals();
    const auto abq = accumulate(bqs.begin(), bqs.end(), 0.0);
    cout << abq / bqs.size() << endl;
  }
}

see http://broadinstitute.github.io/gamgee/doxygen/ for the full SAM API

select_if enables functional style programming across samples

variant.h

template <class VALUE, template<class> class ITER>
static boost::dynamic_bitset<> select_if(
    const ITER<VALUE>& first,
    const ITER<VALUE>& last,
    const std::function<bool (const decltype(*first)& value)> pred)
{
  const auto n_samples = last - first;
  auto selected_samples = boost::dynamic_bitset<>(n_samples);
  auto it = first;
  for (auto i = 0; i != n_samples; ++i)
    selected_samples[i] = pred(*it++);
  return selected_samples;
}

applies a predicate over a Container and selects those that pass in a dynamic bitset
select_if statements make it trivial to parallelize batch operations over samples

indel_length.cpp

auto select_high_quality_variants(const Variant& var, const int32_t q) {
  const auto quals = var.integer_individual_field("GQ");
  const auto genotypes = var.genotypes();

  const auto pass_qual = select_if(quals.begin(), quals.end(),
      [&q](const auto& gq) { return gq[0] > q; });

  const auto is_var = select_if(genotypes.begin(), genotypes.end(),
      [](const auto& g) { return !g.missing() && !g.hom_ref(); });

  return pass_qual & is_var;
}

multiple select_if operations can be easily parallelized with std::async
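A minimal sketch of that std::async parallelization, using std::vector<bool> in place of boost::dynamic_bitset and plain ints in place of Gamgee's field types so the example is dependency-free (all names here are illustrative):

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

// select_if as in the slide, but returning std::vector<bool>.
template <class Iter, class Pred>
std::vector<bool> select_if(Iter first, Iter last, Pred pred) {
    auto result = std::vector<bool>{};
    for (; first != last; ++first) result.push_back(pred(*first));
    return result;
}

// Run two independent per-sample selections concurrently, then AND them.
std::vector<bool> high_quality_variants(const std::vector<int>& quals,
                                        const std::vector<int>& genotypes,
                                        int min_qual) {
    auto pass_qual = std::async(std::launch::async, [&] {
        return select_if(quals.begin(), quals.end(),
                         [&](int q) { return q > min_qual; });
    });
    auto is_var = std::async(std::launch::async, [&] {
        return select_if(genotypes.begin(), genotypes.end(),
                         [](int g) { return g != 0; });  // 0 = hom-ref here
    });
    auto a = pass_qual.get();
    const auto b = is_var.get();
    for (std::size_t i = 0; i != a.size(); ++i) a[i] = a[i] && b[i];
    return a;
}
```

Each select_if scans its container independently, so the two futures never touch shared mutable state and the combination is a cheap bitwise AND at the end.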


A lambda configurable class for locus level operations

locus_coverage.h

class LocusCoverage {
 public:
  LocusCoverage(
      // (1) window operation: applied to the coverage vector of each window
      const std::function<uint32_t (
          const std::vector<uint32_t>& locus_coverage,
          const uint32_t chr,
          const uint32_t start,
          const uint32_t stop)>& window_op,
      // (2) locus operation: each read's contribution to each locus
      const std::function<uint32_t (const uint32_t)>& locus_op =
          [](const auto) { return 1; });

  void add_read(const Sam& read);
  void flush() const;
  ...
};
Coverage distribution tool: functional style

coverage_distribution.cpp

using Histogram = std::vector<uint32_t>;
constexpr auto MAX_COV = 50'000u;

int main() {
  auto hist = Histogram(MAX_COV, 0u);

  auto window_op = [&hist](const auto& lcov, const auto,
                           const auto start, const auto stop) {
    std::for_each(lcov.begin() + start,
                  lcov.begin() + stop + 1,
                  [&hist](const auto& coverage) {
                    ++hist[min(coverage, MAX_COV - 1)];
                  });
    return stop;
  };

  auto reader = SingleSamReader{"file.bam"};
  auto state = LocusCoverage{window_op};

  for_each(reader.begin(), reader.end(),
      [&state](const auto& read) { if (!read.unmapped()) state.add_read(read); });

  output_coverage_histogram(hist);
}
The future of the GATK

• Libraries (MIT License): gamgee + frameworks
• Toolkits (GATK License): GATK tool developer frameworks, in C++ and in Java
Research tools need this scalability for the next wave of scientific advances

• Data processing from DNA to variants: ready for ~1 million genomes (will need more work to reach tens-hundreds of millions)
• Variant analysis and association studies: fails today at just a few thousand genomes
Post-calling pipeline standardization and scaling is the next big challenge

• Tools are not generalized and performance does not scale (typically written in Matlab, R, Perl and Python…)
• Most code is written by one grad student/postdoc and is no longer maintained.
• Not standardized.
• Analyses are very often unrepeatable.
• Complementary data types are not standardized (e.g. phenotypic data).
This is the work of many…

The team: Eric Banks, Ryan Poplin, Khalid Shakir, David Roazen, Joel Thibault, Geraldine VanDerAuwera, Ami Levy-Moonshine, Valentin Rubio, Bertrand Haas, Laura Gauthier, Christopher Wheelan, Sheila Chandran

Broad colleagues: Heng Li, Daniel MacArthur, Timothy Fennel, Steven McCarrol, Mark Daly, Sheila Fisher, Stacey Gabriel, David Altshuler, Menachem Fromer

Collaborators: Paolo Narvaez, Diego Nehab
