
GATKwr17-01-Intro To Variant Discovery

The document outlines best practices for variant discovery, focusing on genetic changes relative to a reference genome, including germline and somatic variants. It details the steps for data generation, including library preparation and sequencing, followed by analysis workflows for variant discovery and refinement. The document emphasizes the importance of data pre-processing and specific workflows tailored to different types of variants.

Uploaded by

danyalhamzah

GATK Best Practices for Variant Discovery

Introduction to Variant Discovery

All you need to know to get started

http://software.broadinstitute.org/gatk/
Discover “variants” relative to a reference genome

• Genetic changes in individuals relative to a reference genome
  • Germline (inherited)
  • Somatic (cancer)

• Reference genome = a standardized genomic sequence

• Human genome reference sequence
  • Current standard: hg19 / b37
  • New standard (on the rise): hg38

• Other organisms
  • Many have a fully assembled reference available
  • Many still do not -> SOL
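Different types of variants

[Figure: germline and somatic variant classes: SNP/SNV, Indel, CNV/CNA, SV]

Different types of experimental design

[Figure: a gene region (intergenic, exon I, intron I, exon II, intergenic) with variant sites, comparing whole genome coverage with exome (+ gene panels) coverage]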
Part 1

BEST PRACTICES

Data generation 1: Library prep

Varies depending on experimental design

[Figure: library prep. DNA -> fragments; RNA -> RT-PCR -> fragments […]]

1) Extract nucleic acids from blood, tissue, saliva
2) RNA: Make cDNA
3) Shear dsDNA into fragments
4) Attach adapters to fragments (-> library)
Data generation 2: Sequence the library

[Figure: library preps are loaded onto the lanes of an Illumina flowcell; sequencing yields an enormous pile of short reads. Image source: Bitesizebio.com]

HTS machine processes a flowcell containing lanes; each lane constitutes a read group (RG) (unless multiplexed)
Raw sequence: typically in FASTQ format

• Sequence name (read name, group, etc.)
• Sequence
• + (optional: sequence name again)
• Associated quality score

Example record

@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+
;;3;;;;;;;;;;;;7;;;;;;;88

-> ASCII code translates to Phred-scale Q scores

Phred value = −10 * log10(ε), where ε is the error probability (accuracy = 1 − ε)

90% confidence (10% error rate) = Q10
99% confidence (1% error rate) = Q20
99.9% confidence (0.1% error rate) = Q30

20 and higher is generally a good score

Format specification: http://maq.sourceforge.net/fastq.shtml
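The Phred relationship above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function names are made up for the example, and the ASCII offset of 33 (Sanger / Phred+33 encoding) is an assumption, since older Illumina pipelines used an offset of 64.

```python
import math


def phred_to_error(q):
    """Error probability for a Phred score: epsilon = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def error_to_phred(epsilon):
    """Phred score for an error probability: Q = -10 * log10(epsilon)."""
    return -10 * math.log10(epsilon)


def decode_qualities(quality_string, offset=33):
    """Translate ASCII quality characters to Q scores.

    offset=33 (Phred+33) is an assumption, not stated on the slide.
    """
    return [ord(char) - offset for char in quality_string]


def parse_fastq_record(lines):
    """Split the four lines of one FASTQ record into named fields."""
    name, sequence, separator, qualities = (line.strip() for line in lines)
    if not name.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    return {
        "name": name[1:],
        "sequence": sequence,
        "qualities": decode_qualities(qualities),
    }


# The example record from the slide.
record = parse_fastq_record([
    "@EAS54_6_R1_2_1_413_324",
    "CCCTTCTTGTCTTCAGCGTTTCTCC",
    "+",
    ";;3;;;;;;;;;;;;7;;;;;;;88",
])
print(record["qualities"][:3])  # [26, 26, 18]
```

Under the Phred+33 assumption, `;` decodes to Q26 and `3` to Q18, so the third base of this read falls just under the Q20 rule of thumb.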
Part 2

ANALYSIS WORKFLOWS

Data pre-processing

Data Pre-Processing (FASTQ -> BAM) -> Variant Discovery (BAM -> VCF) -> Callset Refinement

Step 1: Map the reads produced by the sequencer to the reference

Mapping and alignment algorithms

• BWA for DNA
• STAR for RNAseq

[Figure: enormous pile of short reads from HTS + reference genome -> reads mapped to reference]
Output format: Sequence Alignment/Map (SAM) or its binary form (BAM)

HEADER containing metadata (sequence dictionary, read group definitions, etc.)

RECORDS containing structured read information (1 line per read record)

SLX1:1:127:63:4 99 1 10052169 60 23M6N10M = 14 10 GAAGATACTGGTT 768832'48:::: SM:Z:JPTGBMN01 …

Fields, left to right: read name, flags, position (contig and coordinate), MAPQ, CIGAR, mate information, read sequence, quality scores, metadata

• Added mapping info summarizes position, quality, and structure for each read

http://samtools.github.io/hts-specs/SAMv1.pdf
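To make the record layout concrete, here is a small Python sketch that splits the example record into the eleven mandatory SAM columns and decodes the bitwise flag. The function and dictionary names are illustrative, not from any library. Real SAM files are tab-delimited; splitting on any whitespace is an assumption made only because the slide shows spaces.

```python
# Flag bits as defined in the SAM specification.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}

FIELDS = ["qname", "flag", "rname", "pos", "mapq", "cigar",
          "rnext", "pnext", "tlen", "seq", "qual"]


def decode_flag(flag):
    """List the names of all bits set in a SAM flag."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


def parse_sam_record(line):
    """Parse one alignment line into a dict of mandatory fields plus tags."""
    columns = line.split()  # assumption: whitespace-separated, as on the slide
    record = dict(zip(FIELDS, columns[:11]))
    for key in ("flag", "pos", "mapq", "pnext", "tlen"):
        record[key] = int(record[key])
    record["tags"] = columns[11:]
    return record


example = ("SLX1:1:127:63:4 99 1 10052169 60 23M6N10M = 14 10 "
           "GAAGATACTGGTT 768832'48:::: SM:Z:JPTGBMN01")
record = parse_sam_record(example)
print(record["rname"], record["pos"], record["mapq"])
print(decode_flag(record["flag"]))
```

Flag 99 is 1 + 2 + 32 + 64, which decodes to a paired read in a proper pair, first in its pair, whose mate maps to the reverse strand.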
CIGAR summarizes alignment structure

CIGAR = Concise Idiosyncratic Gapped Alignment Report

read1 99 ref 2 30 3M1D2M1I1M = 14 20 CATCTAG *
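A CIGAR string is a run-length encoding of alignment operations. The sketch below (illustrative helper names, standard library only) parses one and reports how many read bases and reference bases it covers, using the operation semantics from the SAM specification: M, I, S, = and X consume the read; M, D, N, = and X consume the reference.

```python
import re

CIGAR_PATTERN = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_QUERY = set("MIS=X")
CONSUMES_REFERENCE = set("MDN=X")


def parse_cigar(cigar):
    """Return the CIGAR as a list of (length, operation) pairs."""
    return [(int(length), op) for length, op in CIGAR_PATTERN.findall(cigar)]


def query_length(cigar):
    """Number of read bases described by the CIGAR."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_QUERY)


def reference_span(cigar):
    """Number of reference bases the alignment covers."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)


# The record above: 3 matches, 1 deletion, 2 matches, 1 insertion, 1 match.
print(parse_cigar("3M1D2M1I1M"))
print(query_length("3M1D2M1I1M"), reference_span("3M1D2M1I1M"))
```

For `3M1D2M1I1M` the read length is 7, matching the 7-base sequence `CATCTAG`, and the alignment also spans 7 reference bases because the deletion and the insertion cancel out.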


At Broad: Unmapped BAM instead of FASTQ

Special workflow using Picard tools for improved data management

[Figure: unmapped BAM + reference genome -> BWA MEM -> raw mapped BAM; raw mapped BAM + unmapped BAM -> MergeBamAlignment -> mapped, cleaned, sorted BAM]


Step 2: Mark duplicates to mitigate duplication artifacts

Duplicates = non-independent measurements of a sequence fragment

-> Must be marked so they can be excluded when assessing support for alleles

Picard MarkDuplicates

[Figure: mapped reads stacked against the reference; the highlighted base is a sequencing error propagated in duplicates]
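The idea of duplicate marking can be illustrated with a toy Python sketch. It is deliberately simplified and is not how Picard MarkDuplicates is implemented: here reads are plain dicts with hypothetical keys (`chrom`, `start`, `strand`, `base_quality_sum`), and duplicates are defined only by identical start and strand, whereas the real tool considers more information, such as the mate's position.

```python
from collections import defaultdict


def mark_duplicates(reads):
    """Flag all but the best-quality read in each group of identical placements.

    Returns new dicts with a boolean 'duplicate' key; the input is not modified.
    """
    groups = defaultdict(list)
    for index, read in enumerate(reads):
        key = (read["chrom"], read["start"], read["strand"])
        groups[key].append(index)

    marked = [dict(read, duplicate=False) for read in reads]
    for indices in groups.values():
        # Keep the read with the highest summed base quality as representative.
        best = max(indices, key=lambda i: reads[i]["base_quality_sum"])
        for i in indices:
            if i != best:
                marked[i]["duplicate"] = True
    return marked


reads = [
    {"name": "r1", "chrom": "1", "start": 100, "strand": "+", "base_quality_sum": 700},
    {"name": "r2", "chrom": "1", "start": 100, "strand": "+", "base_quality_sum": 900},
    {"name": "r3", "chrom": "1", "start": 150, "strand": "+", "base_quality_sum": 800},
]
for read in mark_duplicates(reads):
    print(read["name"], read["duplicate"])
```

Marking rather than deleting keeps the data intact: downstream tools can skip flagged reads when counting allele support, and the decision can be revisited later.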


Step 3: Local realignment around indels corrects mapping errors

BEFORE (around a 7bp homopolymer run)

• Several consecutive SNPs only found on reads ending on the right of the homopolymer
• Several consecutive SNPs only found on reads ending on the left of the homopolymer

AFTER

• Adding a 1-bp insertion brings sanity to the entire alignment

Optional with assembly-based variant callers
Step 4: Base Recalibration (BQSR) corrects for machine errors

• Sequencers make systematic errors in base quality scores
• BQSR corrects the quality scores (not the bases)

Example of bias: qualities reported depending on nucleotide context

[Figure: empirical − reported quality by dinucleotide context (AA, AG, CA, CG, GA, GG, TA, TG). Original: RMSE = 4.188; recalibrated: RMSE = 0.281]
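The principle behind recalibration can be sketched as a lookup table of empirical qualities. The Python below is a toy model under stated assumptions, not GATK's BaseRecalibrator: it bins bases only by reported quality and dinucleotide context, takes mismatch observations as given (in practice, mismatches at known variant sites are excluded), and uses a simple +1/+2 smoothing chosen for the example.

```python
import math
from collections import defaultdict


def build_recalibration_table(observations):
    """Map (reported_q, context) to an empirical Phred quality.

    observations: iterable of (reported_q, context, is_mismatch) tuples.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [n_bases, n_mismatches]
    for reported_q, context, is_mismatch in observations:
        bin_counts = counts[(reported_q, context)]
        bin_counts[0] += 1
        bin_counts[1] += int(is_mismatch)

    table = {}
    for key, (n_bases, n_mismatches) in counts.items():
        # Smoothed error rate so that bins with zero mismatches stay finite.
        error_rate = (n_mismatches + 1) / (n_bases + 2)
        table[key] = -10 * math.log10(error_rate)
    return table


def recalibrate(reported_q, context, table):
    """Replace a reported quality with the empirical one when the bin is known."""
    return round(table.get((reported_q, context), reported_q))


# Hypothetical data: bases reported as Q30 in an AG context that mismatch
# the reference about 1% of the time, i.e. behave like Q20.
observations = [(30, "AG", True)] * 9 + [(30, "AG", False)] * 989
table = build_recalibration_table(observations)
print(recalibrate(30, "AG", table))
```

Only the quality scores change; the base calls themselves are left untouched, as the slide states.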
Special handling for RNAseq splice junctions
Variant Discovery

Data Pre-Processing (FASTQ -> BAM) -> Variant Discovery (BAM -> VCF) -> Callset Refinement

Best Practices workflows by variant type

Germline SNPs & Indels | Somatic SNVs & Indels | Somatic CNVs
Discovery of germline short variants is done on cohorts

• Single genome in isolation: almost never useful
• Family or population data add valuable information
  – rarity of variants
  – de novo mutations
  – ethnic background
Visualization of reads at a probable SNP site in IGV

[Figure: IGV screenshot showing depth of coverage, a probable C/T SNP, individual reads aligned to the reference genome, and the first and second read from the same fragment. Non-reference bases are colored; reference bases are grey.]
Short variants are reported in VCF: Variant Call Format

Header

##fileformat=VCFv4.1
##reference=1000GenomesPilot-NCBI36
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003

Records

20 14370 rs6054257 G A 29 PASS DP=14;AF=0.5 GT:GQ:DP 0/0:48:1 1/0:48:8 1/1:43:5
20 1230237 . T . 47 PASS DP=13 GT:GQ:DP 0/0:54:7 0/0:48:4 0/0:61:2
20 1234567 . GT G 50 PASS DP=9 GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3

Format specification: https://samtools.github.io/hts-specs/VCFv4.2.pdf
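As an illustration of how a VCF record is organized, the following Python sketch parses the first record above into its fixed columns, INFO key/value pairs, and per-sample fields keyed by the FORMAT column. The helper names are hypothetical, and splitting on whitespace is an assumption for this slide text; real VCF files are tab-delimited and are usually read with a dedicated library.

```python
def parse_vcf_record(line, sample_names):
    """Parse one VCF data line into fixed fields, INFO, and per-sample data."""
    columns = line.split()  # assumption: whitespace-separated, as on the slide
    chrom, pos, variant_id, ref, alt, qual, filter_status, info, fmt = columns[:9]

    info_fields = {}
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        info_fields[key] = value if value else True  # flags carry no value

    format_keys = fmt.split(":")
    samples = {
        name: dict(zip(format_keys, column.split(":")))
        for name, column in zip(sample_names, columns[9:])
    }
    return {
        "chrom": chrom, "pos": int(pos), "id": variant_id,
        "ref": ref, "alt": alt.split(","), "qual": float(qual),
        "filter": filter_status, "info": info_fields, "samples": samples,
    }


def alt_allele_count(genotype):
    """Count non-reference alleles in a GT string such as '0/1' or '1|1'."""
    alleles = genotype.replace("|", "/").split("/")
    return sum(1 for allele in alleles if allele not in ("0", "."))


sample_names = ["NA00001", "NA00002", "NA00003"]
line = "20 14370 rs6054257 G A 29 PASS DP=14;AF=0.5 GT:GQ:DP 0/0:48:1 1/0:48:8 1/1:43:5"
record = parse_vcf_record(line, sample_names)
for name in sample_names:
    genotype = record["samples"][name]["GT"]
    print(name, genotype, alt_allele_count(genotype))
```

For this site the three samples are hom-ref, het, and hom-var, carrying 0, 1, and 2 copies of the A allele respectively.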
Joint analysis from early stages empowers discovery

Individual callsets: underpowered analysis
Joint callset: empowered analysis

Discovery is empowered at difficult sites

• Sample #1 or Sample #N alone:
  • weak evidence for variant
  • may miss calling the variant

• Both samples seen together:
  • unlikely to be artifact
  • call the variant more confidently
 
And we get full information at all sites of interest

• Analyzed individually:
  – No call for either sample
  – Very different reasons!

• In joint analysis with other samples:
  – Hom-ref call and no-call genotypes emitted

[Figure: the same site shown in SAMPLE A and SAMPLE B]
Traditional multi-sample calling approach: very inefficient

It gives us the right answers, but…

• Compute requirements scale very badly with number of samples!
• Want to add new samples? Got to re-run the pipeline from scratch! The N+1 problem!

GVCF workflow solves both problems, yields same results

• Compute requirements scale linearly with number of samples
• Want to add a new sample? Just call it by itself, then re-genotype the cohort at will!
Best Practices workflows by variant type

Germline SNPs & Indels | Somatic SNVs & Indels | Somatic CNVs

Key challenges: tumor heterogeneity and contamination

[Figure: tissue progressing from normal to dysplastic to cancer, with cells accumulating mutations (*); samples shown are the tumor sample, a tissue-adjacent normal, and a blood normal. Adapted from https://science.education.nih.gov/supplements/nih1/cancer/guide/understanding1.html]

Amount of signal may be comparable to noise

• Expectation for germline variants: signal vs. noise, + AF expected to follow ploidy
• Expectation for somatic variants: signal reduced by heterogeneity and contamination vs. noise, + no reliance on ploidy for AF
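To see why somatic allele fractions (AF) do not follow ploidy, a commonly used simplified model can be sketched in Python. The formula and parameter names below are assumptions added for illustration, not taken from the slides: `purity` is the fraction of tumor cells in the sample, `clonal_fraction` the fraction of tumor cells carrying the mutation, and copy numbers default to diploid.

```python
def expected_germline_af(alt_copies, ploidy=2):
    """Germline AF follows ploidy: 0.5 for a het, 1.0 for a hom-var in a diploid."""
    return alt_copies / ploidy


def expected_somatic_af(purity, clonal_fraction, mutated_copies=1,
                        tumor_copy_number=2, normal_copy_number=2):
    """Expected fraction of reads carrying a somatic allele.

    Mutated DNA copies divided by all DNA copies at the locus, weighting
    tumor and normal cells by their share of the sample.
    """
    mutated = purity * clonal_fraction * mutated_copies
    total = purity * tumor_copy_number + (1 - purity) * normal_copy_number
    return mutated / total


# A heterozygous germline variant in a diploid genome.
print(expected_germline_af(1))
# A subclonal mutation (half the tumor cells) in a sample with 40% tumor content.
print(expected_somatic_af(purity=0.4, clonal_fraction=0.5))
```

In the second case the expected AF is about 10%, far below the 50% of a germline het, so the signal is much closer to sequencing and mapping noise.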
Best Practices workflows by variant type

Germline SNPs & Indels | Somatic SNVs & Indels | Somatic CNVs

Copy number: it’s all about coverage

1) Collect proportional coverage
2) Normalize to remove noise
3) Identify segment boundaries
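The three steps above can be sketched as a toy Python pipeline. This is only an illustration of the idea under simplifying assumptions and not the GATK implementation: normalization here uses a single matched normal with nonzero counts, and segmentation uses a fixed threshold (a value picked for the example) instead of a statistical segmentation method.

```python
import math
from statistics import median


def proportional_coverage(counts):
    """Step 1: each target's share of the sample's total read count."""
    total = sum(counts)
    return [count / total for count in counts]


def log2_copy_ratio(case_counts, normal_counts):
    """Step 2: normalize against a normal sample and take log2 ratios.

    Assumes every target has a nonzero count in both samples.
    """
    case = proportional_coverage(case_counts)
    normal = proportional_coverage(normal_counts)
    return [math.log2(c / n) for c, n in zip(case, normal)]


def segment_boundaries(log_ratios, threshold=0.3):
    """Step 3: indices where the ratio jumps away from the current segment."""
    boundaries = []
    segment = [log_ratios[0]]
    for index in range(1, len(log_ratios)):
        if abs(log_ratios[index] - median(segment)) > threshold:
            boundaries.append(index)
            segment = [log_ratios[index]]
        else:
            segment.append(log_ratios[index])
    return boundaries


# Hypothetical read counts over six targets; the last three are amplified.
case_counts = [100, 100, 100, 200, 200, 200]
normal_counts = [100, 100, 100, 100, 100, 100]
ratios = log2_copy_ratio(case_counts, normal_counts)
print([round(r, 2) for r in ratios])
print(segment_boundaries(ratios))
```

Because coverage is proportional, the amplified targets pull the ratios of the unamplified ones below zero; what matters for segmentation is the one-unit jump in log2 ratio between the two groups.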


Callset Refinement

Data Pre-Processing (FASTQ -> BAM) -> Variant Discovery (BAM -> VCF) -> Callset Refinement (PROJECT DEPENDENT)

https://software.broadinstitute.org/gatk/best-practices/
Part 3

PIPELINES AT BROAD

Pipelines implemented on local systems at Broad

[Figure: “The Picard Pipeline” and “FireHose” shown as implementations of the GATK Best Practices (Reads -> BAMs; Pre-processing; Germline SNPs & Indels)]
Pipelines implemented on Google Cloud

[Figure: “Genomes On The Cloud” (Reads -> BAMs; WGS Germline SNPs & Indels) and “FireCloud / Workbench”]

https://software.broadinstitute.org/firecloud/

GOTC WDL script shared at https://github.com/broadinstitute/wdl/tree/develop/scripts/broad_pipelines
Write your own pipelines in WDL!

Tired of these options for writing pipelines?

• (low-tech) Plain Old Scripts
• (high-tech) Domain-Specific Languages

Meet the Workflow Description Language

Finally a workflow language meant to be read and written by humans

WDL makes it easy to:

- Describe analysis tasks
- Chain tasks into workflows
- Specify advanced behaviors like parallelization

https://software.broadinstitute.org/wdl/
 
