GATKwr17-01-Intro To Variant Discovery
http://software.broadinstitute.org/gatk/
Discover “variants” relative to a reference genome
• Other organisms
• Many have a fully assembled reference available
• Many still do not -> SOL
Different types of variants
Germline
Somatic
Whole Genome
Variant site
Exome (+ gene panels)
Part 1: BEST PRACTICES
Data generation 1: Library prep
Varies depending on experimental design
DNA
[…]
Fragments
RT-PCR
RNA
Library prep
Data generation 2: Sequence the library
Library preps
Bitesizebio.com
Flowcell
Enormous pile of short reads
Lanes
Illumina
Example record
• @EAS54_6_R1_2_1_413_324
• CCCTTCTTGTCTTCAGCGTTTCTCC Error
• +
• ;;3;;;;;;;;;;;;7;;;;;;;88
Phred value = −10 * log10(ε)
-> ASCII code translates to Phred-scale Q scores
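A minimal sketch of the decoding described above, applied to the example record. The ASCII offset of 33 (the "Phred+33" encoding) is an assumption, since the slide does not say which encoding the record uses, and the function names are illustrative only.

```python
import math

def phred_to_error(q):
    """Error probability for a Phred score: eps = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_to_phred(eps):
    """Phred score for an error probability: Q = -10 * log10(eps)."""
    return -10 * math.log10(eps)

def decode_qualities(qual_string, offset=33):
    """Translate each ASCII character into a Phred score.

    offset=33 is an assumption (Phred+33); older pipelines used 64.
    """
    return [ord(ch) - offset for ch in qual_string]

# The four lines of the example FASTQ record from the slide.
record = [
    "@EAS54_6_R1_2_1_413_324",      # read name
    "CCCTTCTTGTCTTCAGCGTTTCTCC",    # base calls
    "+",                            # separator
    ";;3;;;;;;;;;;;;7;;;;;;;88",    # per-base qualities, ASCII-encoded
]

quals = decode_qualities(record[3])
for base, q in zip(record[1], quals):
    print(base, q, f"{phred_to_error(q):.4f}")
```

Under this assumption `;` decodes to Q26, so most bases in the record carry an estimated error probability of about 1 in 400.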
ANALYSIS WORKFLOWS
Data pre-processing
Mapping and alignment algorithms
Enormous pile of short reads from HTS
Reference genome
Reads mapped to reference
Output format: Sequence/Binary Alignment Map (SAM/BAM)
• Added mapping info summarizes position, quality, and structure for each read
http://samtools.github.io/hts-specs/SAMv1.pdf
CIGAR summarizes alignment structure
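To make "alignment structure" concrete, here is a small sketch that splits a CIGAR string into operations and counts how many reference and read bases it covers. The example string `3S10M2I5M1D4M` is made up for illustration; the operation letters and which of them consume reference or read bases follow the SAM specification.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

# Per the SAM spec: which operations advance along the reference / the read.
CONSUMES_REF = set("MDN=X")
CONSUMES_QUERY = set("MIS=X")

def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Reject strings with characters the pattern silently skipped.
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    return ops

def reference_span(cigar):
    """Number of reference bases the alignment covers."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REF)

def query_length(cigar):
    """Number of read bases accounted for by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_QUERY)

example = "3S10M2I5M1D4M"  # hypothetical read: soft clip, insertion, deletion
print(parse_cigar(example))
print("reference span:", reference_span(example))
print("read length:", query_length(example))
```

Here the 24-base read covers 20 reference bases: the soft clip and the insertion use read bases only, and the deletion uses a reference base only.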
Reference genome
BWA MEM
Raw mapped BAM
Mapped, cleaned, sorted BAM
Reference
Mapped reads
Picard MarkDuplicates
AFTER
[Figure: Empirical − Reported Quality by dinucleotide (AA, AG, CA, CG, GA, GG, TA, TG); panels: original, recalibrated]
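The quantity on the plot's y-axis can be sketched as follows: for each group of bases (here, each dinucleotide context), count the mismatches against the reference, turn that into an empirical Phred score, and subtract the score the sequencer reported. This is a simplified illustration, not GATK's actual recalibration code. The +1/+2 smoothing and all counts below are assumptions made up for the example.

```python
import math

def empirical_quality(n_mismatches, n_observations):
    """Phred-scaled empirical error rate for one group of bases.

    The +1 / +2 smoothing (an assumption here) avoids log10(0) when a
    group has no observed mismatches.
    """
    error_rate = (n_mismatches + 1) / (n_observations + 2)
    return -10 * math.log10(error_rate)

def quality_shift(reported_q, n_mismatches, n_observations):
    """Empirical minus reported quality: the y-axis of the plot."""
    return empirical_quality(n_mismatches, n_observations) - reported_q

# Hypothetical counts per dinucleotide context: (reported Q, mismatches, bases).
counts = {
    "AA": (30, 99, 99998),   # behaves as reported
    "CG": (30, 999, 99998),  # ten times more errors than reported
}

for dinuc, (q, mism, obs) in counts.items():
    print(dinuc, round(quality_shift(q, mism, obs), 2))
```

A shift near 0 means the reported qualities match what is observed. In this made-up data the "CG" context is over-reported by about 10 Phred units, the kind of bias the "original" panel shows and recalibration is meant to remove.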
Special handling for RNAseq splice junctions
Variant Discovery
Germline SNPs & Indels
Somatic SNVs & Indels
Somatic CNVs
Discovery of germline short variants is done on cohorts
Header
##fileformat=VCFv4.1
##reference=1000GenomesPilot-NCBI36
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
Records
20 14370 rs6054257 G A 29 PASS DP=14;AF=0.5 GT:GQ:DP 0/0:48:1 1/0:48:8 1/1:43:5
20 1230237 . T . 47 PASS DP=13 GT:GQ:DP 0/0:54:7 0/0:48:4 0/0:61:2
20 1234567 . GT G 50 PASS DP=9 GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
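As a rough illustration of how the record columns fit together, the sketch below parses the first record and pairs each sample column with the FORMAT keys. It splits on any whitespace because the slide text shows spaces; real VCF files are tab-delimited. The function names are illustrative and this is not a full VCF parser.

```python
def parse_vcf_record(line, sample_names):
    """Parse one VCF data line into a dict (minimal sketch)."""
    fields = line.split()  # real VCF is tab-delimited; slide text uses spaces
    chrom, pos, vid, ref, alt, qual, flt, info, fmt = fields[:9]

    # INFO is a ';'-separated list of KEY=VALUE pairs or bare flags.
    info_dict = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        info_dict[key] = value if value else True

    # FORMAT names the ':'-separated fields inside every sample column.
    keys = fmt.split(":")
    samples = {
        name: dict(zip(keys, column.split(":")))
        for name, column in zip(sample_names, fields[9:])
    }
    return {"chrom": chrom, "pos": int(pos), "id": vid, "ref": ref,
            "alt": alt.split(","), "qual": qual, "filter": flt,
            "info": info_dict, "samples": samples}

def describe_genotype(gt):
    """Classify a GT string such as 0/0, 1/0, 1/1 or ./."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return "no-call"
    if all(a == "0" for a in alleles):
        return "hom-ref"
    if len(set(alleles)) == 1:
        return "hom-var"
    return "het"

header = "#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003"
line = "20 14370 rs6054257 G A 29 PASS DP=14;AF=0.5 GT:GQ:DP 0/0:48:1 1/0:48:8 1/1:43:5"

sample_names = header.split()[9:]
rec = parse_vcf_record(line, sample_names)
for name, call in rec["samples"].items():
    print(name, call["GT"], describe_genotype(call["GT"]), "GQ=" + call["GQ"])
```

For this site the three samples come out as hom-ref, het and hom-var, one genotype per sample column.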
• Analyzed individually:
– No call for either sample
– Very different reasons!
SAMPLE A
• In joint analysis with other samples:
– Hom-ref call and no-call genotypes emitted
SAMPLE B
Traditional multi-sample calling approach: very inefficient
Compute requirements scale very badly with number of samples!!!
Want to add new samples? Got to re-run pipeline from scratch!
The N+1 problem!
GVCF workflow solves both problems, yields same results
Compute requirements scale linearly with number of samples
Want to add a new sample? Just call it by itself then re-genotype the cohort at will!
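A toy cost model for the N+1 problem, assuming a cohort that grows one sample at a time. It only counts how often a sample's reads go through the expensive per-sample calling step. The unit cost and function names are assumptions for illustration. The joint genotyping step that the GVCF workflow re-runs on each addition is left out, on the assumption that it is much cheaper than calling from reads.

```python
def traditional_calling_units(n_samples):
    """Per-sample calling passes when every addition re-runs the whole cohort.

    Adding sample k means re-processing all k samples from scratch,
    so the total is 1 + 2 + ... + N.
    """
    return sum(range(1, n_samples + 1))

def gvcf_calling_units(n_samples):
    """Per-sample calling passes when each sample is called once, by itself."""
    return n_samples

for n in (10, 100, 1000):
    print(n, traditional_calling_units(n), gvcf_calling_units(n))
```

In this model a cohort grown to 100 samples costs 5050 calling passes the traditional way against 100 with per-sample GVCFs, which is the quadratic versus linear difference the slides describe.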
Best Practices workflows by variant type
Germline SNPs & Indels
Somatic SNVs & Indels
Somatic CNVs
Key challenges: tumor heterogeneity and contamination
[Figure: normal, dysplastic, and cancer cells]
Tumor Sample
Tissue-adjacent Normal
Blood Normal
[Figure labels: signal vs. noise; heterogeneity, contamination]
+ AF expected to follow ploidy
+ no reliance on ploidy for AF
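The contrast between the two statements above can be sketched numerically. For a germline variant in a diploid sample, the allele fraction (AF) should sit near 0, 0.5 or 1. For a somatic variant it also depends on how much of the sample is tumor and how many tumor cells carry the mutation. The formula below is a common simplification, and the parameter names (`purity`, `clonal_fraction`) and example values are assumptions for illustration.

```python
def expected_germline_af(alt_copies, ploidy=2):
    """Expected AF when every cell shares the same genotype."""
    return alt_copies / ploidy

def expected_somatic_af(purity, clonal_fraction, alt_copies=1,
                        tumor_cn=2, normal_cn=2):
    """Expected AF for a somatic variant in a mixed tumor/normal sample.

    purity          : fraction of cells in the sample that are tumor
    clonal_fraction : fraction of tumor cells carrying the variant
    alt_copies      : mutated copies per carrying cell
    tumor_cn / normal_cn : total copies of the locus per tumor / normal cell
    """
    mutant_copies = purity * clonal_fraction * alt_copies
    total_copies = purity * tumor_cn + (1 - purity) * normal_cn
    return mutant_copies / total_copies

# Germline, diploid: hom-ref, het, hom-var.
print([expected_germline_af(k) for k in (0, 1, 2)])

# Somatic: 60% pure tumor, variant present in half of the tumor cells.
print(round(expected_somatic_af(purity=0.6, clonal_fraction=0.5), 3))
```

In the second case the expected AF is 0.15, far from the 0.5 a ploidy-based genotyper would expect for a heterozygous site, which is why somatic calling cannot rely on ploidy.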
Copy number: it’s all about coverage
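A bare-bones sketch of the idea that copy number is read off coverage: normalise tumor and normal read counts per bin, take their ratio, and scale by the normal ploidy. This is not the GATK CNV method. It ignores tumor purity, GC bias and denoising, and the bin counts are made up for illustration.

```python
import math

def copy_ratios(tumor_bins, normal_bins):
    """Per-bin tumor/normal coverage ratio after depth normalisation."""
    tumor_total = sum(tumor_bins)
    normal_total = sum(normal_bins)
    return [
        (t / tumor_total) / (n / normal_total)
        for t, n in zip(tumor_bins, normal_bins)
    ]

def estimated_copy_number(ratio, normal_ploidy=2):
    """Naive copy-number estimate: ratio 1.0 corresponds to normal ploidy."""
    return normal_ploidy * ratio

# Hypothetical read counts in three genomic bins.
tumor = [100, 150, 50]
normal = [100, 100, 100]

for ratio in copy_ratios(tumor, normal):
    print(round(ratio, 2), round(math.log2(ratio), 2),
          round(estimated_copy_number(ratio), 1))
```

With these counts the second bin looks like a single-copy gain (about 3 copies) and the third like a single-copy loss (about 1 copy).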
PROJECT DEPENDENT
https://software.broadinstitute.org/gatk/best-practices/
Part 3: Implementation
BAMs
Germline SNPs & Indels
“FireHose”
Pipelines implemented on Google Cloud
BAMs
WGS
Germline SNPs & Indels
https://software.broadinstitute.org/firecloud/
GOTC
WDL script shared at https://github.com/broadinstitute/wdl/tree/develop/scripts/broad_pipelines
Write your own pipelines in WDL!
Tired of these options for writing pipelines? Meet the Workflow Description Language